
I want to split large, compressed CSV files into multiple smaller gzip files, splitting on line boundaries.

I'm piping gunzip into a bash script that uses a while read LINE loop. The script writes to a named pipe that a background gzip process reads and recompresses. Every X characters read, I close the file descriptor and start a new gzip process for the next split.

But in this setup the script's while read LINE loop is consuming 90% of the CPU, because read is so inefficient here (as I understand it, it makes a system call to read one character at a time).

Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.
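
Roughly, the current script looks like this (a minimal sketch of the setup described above; file names and the split size are illustrative):

#!/bin/bash
# Read stdin line by line, write into a named pipe drained by a
# background gzip, and rotate to a new gzip every MAX_CHARS characters.
MAX_CHARS=100000000
part=0 chars=0
start_part() {
    mkfifo "pipe_$part"
    gzip < "pipe_$part" > "part_$part.csv.gz" &
    exec 3> "pipe_$part"
}
end_part() {
    exec 3>&-    # closing the FD sends EOF to gzip
    wait
    rm -f "pipe_$part"
}
start_part
while IFS= read -r LINE; do
    printf '%s\n' "$LINE" >&3
    chars=$(( chars + ${#LINE} + 1 ))
    if (( chars >= MAX_CHARS )); then
        end_part
        part=$(( part + 1 )); chars=0
        start_part
    fi
done
end_part

Invoked as: gunzip -c large.csv.gz | ./split_gzip.sh. It works, but the per-line read loop dominates the CPU instead of gzip.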

2 Answers


Use split with the -l option to specify how many lines you want, and the --filter option to post-process each chunk. In the filter command, $FILE is the name split would have used for the output file (and it has to be quoted with single quotes to prevent the shell from expanding it too early):

zcat doc.gz | split -l 1000 --filter='gzip > $FILE.gz'

If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that instead of plain gzip.
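
For instance, a sketch of such a filter script (the upload endpoint is hypothetical; split runs the command once per chunk, with the chunk's data on standard input and the would-be output name in $FILE):

#!/bin/bash
# process_chunk.sh -- used as: split --filter='./process_chunk.sh $FILE'
# $1 is the name split would have used for this chunk.
# Compress the chunk and upload it without writing it to disk first;
# the URL is a placeholder for a real upload target.
gzip | curl -sf -T - "https://example.com/upload/$1.gz"

zcat doc.gz | split -l 1000 --filter='./process_chunk.sh $FILE'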

  • I would prefer to do it without the numerous writes to disk that split requires. These are huge files. Commented Dec 7, 2012 at 16:40
  • I also have a curl upload step following the recompress, so it will be much more efficient to do that inline with the split procedure. Commented Dec 7, 2012 at 16:42
  • --filter is fantastic. But for future people to note, split does weird things with the filter command: it doesn't stream continuously, so if you try to include an upload step in the --filter command you are likely to get timeout errors when the splits are larger than ~50MB, in my experience. split dumps data to the filter process (on stdin) in bulk, but doesn't deliver EOF until the next split has been fully read in and is ready to be processed. Commented Dec 11, 2012 at 9:12
  • @DavidParks any chance -u (--unbuffered) could help?
    – peterph
    Commented Dec 11, 2012 at 12:03

How about using the split command with the -l option?

gzcat large.csv.gz | split -l 1000 - xxx    # 1000-line chunks named xxxaa, xxxab, ...
gzip xxx*                                   # then compress each chunk
  • Trying to avoid going to disk, so I can upload inline after the recompress. Commented Dec 7, 2012 at 16:46
