
I am going to be combining 15 different gzip files, ranging in size from 2 GB to 15 GB each, so the files are relatively large. I have done research on the best way to do this, but I still have some questions.

Summary:

Starting with 15 different gzip files, I want to finish with one sorted, duplicate-free file in gzip format.

For the sake of conversation, I will label the files as follows: file1, file2 ... file15

I am planning to use the sort command with the -u option. According to the man page for sort this means:

-u, --unique
       with -c, check for strict ordering; without -c, output only the first of an equal run

So what I am thinking of doing is this:

sort -u file* > sortedFile

From my understanding, this would give me one file that is sorted and has no duplicates. With the test files I created this seems to be the case, but I just want to verify that this is correct.
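For example, on a toy input, the duplicate line collapses to a single occurrence:

$ printf 'b\na\nb\n' | sort -u
a
b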

Now, another wrinkle in my dilemma:

Because all of the files are in gzip format, is there a way to use zcat or another method to pipe the output to sort, without first having to decompress each file to text, combine them, and then compress everything back into gzip? That would save a huge amount of time. Any input is appreciated. I'm not against doing more research, nor am I married to my method; I would just like some insight before I start running these commands against 120 GB of data.
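For illustration, this untested sketch is the kind of pipeline I have in mind (zcat can read several gzip files in sequence, and sortedFile.gz is just an assumed output name):

zcat file1 file2 ... file15 | sort -u | gzip > sortedFile.gz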

Thanks peoples!

  • You mention sorting several times. Perhaps you could mention something about the format of your files? Commented Mar 11, 2015 at 6:45
The files are in gzip format, with the internal data composed of alphanumeric strings, one string per line, terminated with the Unix-style newline character.
    – Dylan
    Commented Mar 11, 2015 at 7:03

1 Answer


The problem is that the individual files are unsorted: if you used something like sort -u file* > sortedFile, sort would have to load the contents of all the files and then sort them. I assume this would be inefficient, given that you probably do not have more than 120 GB of RAM.

I would suggest that you first sort all the files individually and then merge them using sort -m, along these lines (this code is untested!):

# sort each file individually, decompressing on the fly
for f in file*; do
  gzip -dc "$f" | sort > "sorted.$f.bak"
done
# merge the pre-sorted files, keeping only the first of each run of duplicates
sort -m -u sorted.file*.bak > sortedFile
# clean up the intermediate files
rm -f sorted.file*.bak
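Since you want the end result in gzip format, the merge step can be piped through gzip instead of writing plain text. This is an untested variation of the command above, with sortedFile.gz as an assumed output name:

sort -m -u sorted.file*.bak | gzip > sortedFile.gz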

Relevant part of the sort man page (e.g. http://unixhelp.ed.ac.uk/CGI/man-cgi?sort):

-m, --merge
       merge already sorted files; do not sort

Update: After reading https://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file, I think that your original command might be just as fast, since sort splits its input into manageable chunks (using temporary files) anyway. Your command line would then look like this:

sort -u <(zcat file1) <(zcat file2) ... <(zcat file15) > sortedFile

This would also make use of more than one core of your machine, since each zcat runs as a separate process alongside sort.
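If you go this route, GNU sort's --parallel and buffer-size (-S) options may help as well. An untested sketch; the thread count and buffer size here are arbitrary placeholders to tune for your machine, and the output is piped through gzip since you want a compressed result:

sort -u --parallel=4 -S 8G <(zcat file1) <(zcat file2) ... <(zcat file15) | gzip > sortedFile.gz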

  • You are correct, I am just shy of the amount of RAM needed, by about 116 gigs... ;) So you think it would still be faster to sort them independently and then combine them? Would I be better off using merge once they are sorted, instead of sort -m? Also, how do I handle the gzip compression: should I use something like sort <(gunzip -c file1.gz) > sorted, or zcat file1.gz | sort > sorted?
    – Dylan
    Commented Mar 11, 2015 at 6:56
  • The merge command does something completely different; sort -m is the way to go. As to how to unzip: I don't think it matters much, as long as you don't start more than 2 processes per file: one to decompress, one to sort. Commented Mar 11, 2015 at 7:03
  • Last question and I'll mark your answer as accepted. The sort man page lists the option --parallel=N, "change the number of sorts run concurrently to N". Should I use the number of cores I have, or just use the default? Grrr, never mind, I am getting lazy; I will research that before I execute it. OK, sounds good, I am gonna write up a little script to time execution and let it go. I'll update tomorrow after it has finished running. Thanks @daniel kullmann
    – Dylan
    Commented Mar 11, 2015 at 7:24
  • The man page says that the default is the number of processors. But a quick test sorting a few log files suggests that giving the parameter explicitly does make things faster. So: yes, use that parameter! Commented Mar 11, 2015 at 7:39
