
I want to split large, compressed CSV files into multiple smaller gzip files, splitting on line boundaries.

I'm piping gunzip into a bash script that uses a while read LINE loop. The script writes to a named pipe that a background gzip process reads and recompresses. Every X characters read, I close the file descriptor and start a new gzip process for the next split.

But in this setup the script's while read LINE loop is consuming 90% of the CPU, because read is so inefficient here (as I understand it, it makes a system call to read one character at a time).

Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.
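
Roughly, the current script looks like this (a minimal sketch of the setup described above; file names and the split size are illustrative):

#!/bin/bash
# Read stdin line by line, write into a named pipe drained by a
# background gzip, and rotate to a new gzip every MAX_CHARS characters.
MAX_CHARS=100000000
part=0 chars=0
start_part() {
    mkfifo "pipe_$part"
    gzip < "pipe_$part" > "part_$part.csv.gz" &
    exec 3> "pipe_$part"
}
end_part() {
    exec 3>&-    # closing the FD sends EOF to gzip
    wait
    rm -f "pipe_$part"
}
start_part
while IFS= read -r LINE; do
    printf '%s\n' "$LINE" >&3
    chars=$(( chars + ${#LINE} + 1 ))
    if (( chars >= MAX_CHARS )); then
        end_part
        part=$(( part + 1 )); chars=0
        start_part
    fi
done
end_part

Invoked as: gunzip -c large.csv.gz | ./split_gzip.sh. It works, but the per-line read loop dominates the CPU instead of gzip.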

2 Answers


Use split with the -l option to specify how many lines you want, and the --filter option to post-process each chunk. In the filter command, $FILE is the name split would have used for the output file (and it has to be quoted with single quotes to prevent the shell from expanding it too early):

zcat doc.gz | split -l 1000 --filter='gzip > $FILE.gz'

If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that instead of plain gzip.
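
For instance, a sketch of such a filter script (the upload endpoint is hypothetical; split runs the command once per chunk, with the chunk's data on standard input and the would-be output name in $FILE):

#!/bin/bash
# process_chunk.sh -- used as: split --filter='./process_chunk.sh $FILE'
# $1 is the name split would have used for this chunk.
# Compress the chunk and upload it without writing it to disk first;
# the URL is a placeholder for a real upload target.
gzip | curl -sf -T - "https://example.com/upload/$1.gz"

zcat doc.gz | split -l 1000 --filter='./process_chunk.sh $FILE'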

  • I would prefer to do it without the numerous writes to disk that split requires. These are huge files. Commented Dec 7, 2012 at 16:40
  • I also have a curl upload step following the recompress, so it will be much more efficient to do that inline with the split procedure. Commented Dec 7, 2012 at 16:42
  • --filter is fantastic. But for future people to note, split does weird things with the filter command: it doesn't stream continuously, so if you try to include an upload step in the --filter command you are likely to get timeout errors when the splits are larger than ~50MB, in my experience. split dumps data to the filter process (on stdin) in bulk, but doesn't deliver EOF until the next split has been fully read in and is ready to be processed. Commented Dec 11, 2012 at 9:12
  • @DavidParks any chance -u (--unbuffered) could help?
    – peterph
    Commented Dec 11, 2012 at 12:03

How about using the split command with the -l option?

gzcat large.csv.gz | split -l 1000 - xxx    # 1000-line chunks named xxxaa, xxxab, ...
gzip xxx*                                   # then compress each chunk
  • Trying to avoid going to disk, so I can upload inline after the recompress. Commented Dec 7, 2012 at 16:46
