
I am going to be combining 15 different gzip files, ranging in size from 2 GB to 15 GB each, so the files are relatively large. I have done research on the best way to do this, but I still have some questions.

Summary:

Starting with 15 different gzip files, I want to finish with one sorted, duplicate-free file in gzip format.

For the sake of conversation, I will label the files as follows: file1, file2 ... file15

I am planning to use the sort command with the -u option. According to the man page for sort this means:

-u, --unique
       with -c, check for strict ordering; without -c, output only the first of an equal run

So what I am thinking of doing is this:

sort -u file* > sortedFile

From my understanding, this would give me one file that is sorted and has no duplicates. With the test files I created this seems to be the case, but I just want to verify that this is correct.
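For example, on a toy input, the duplicate line collapses to a single occurrence:

$ printf 'b\na\nb\n' | sort -u
a
b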

Now, another wrinkle in my dilemma:

Because all of the files are in gzip format, is there a way to use zcat or another method to pipe the output to sort, without first having to decompress each file to text, combine them, and then compress everything back into gzip? That would save a huge amount of time. Any input is appreciated. I'm not against doing more research, nor am I married to my method; I would just like some insight before I start running these commands against 120 GB of data.
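For illustration, this untested sketch is the kind of pipeline I have in mind (zcat can read several gzip files in sequence, and sortedFile.gz is just an assumed output name):

zcat file1 file2 ... file15 | sort -u | gzip > sortedFile.gz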

Thanks peoples!

  • You mention sorting several times. Perhaps you could mention something about the format of your files? Commented Mar 11, 2015 at 6:45
The files are in gzip format, with the internal data composed of alphanumeric strings, one string per line, terminated with the Unix-style newline character.
    – Dylan
    Commented Mar 11, 2015 at 7:03

1 Answer


The problem is that the individual files are unsorted: if you used something like sort -u file* > sortedFile, sort would have to load the contents of all the files and then sort them. I assume this would be inefficient, given that you probably do not have more than 120 GB of RAM.

I would suggest that you first sort all the files individually and then merge them using sort -m, along these lines (this code is untested!):

# sort each file individually, decompressing on the fly
for f in file*; do
  gzip -dc "$f" | sort > "sorted.$f.bak"
done
# merge the pre-sorted files, keeping only the first of each run of duplicates
sort -m -u sorted.file*.bak > sortedFile
# clean up the intermediate files
rm -f sorted.file*.bak
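Since you want the end result in gzip format, the merge step can be piped through gzip instead of writing plain text. This is an untested variation of the command above, with sortedFile.gz as an assumed output name:

sort -m -u sorted.file*.bak | gzip > sortedFile.gz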

Relevant part of the sort man page (e.g. http://unixhelp.ed.ac.uk/CGI/man-cgi?sort):

-m, --merge
       merge already sorted files; do not sort

Update: After reading https://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file, I think that your original command might be just as fast, since sort splits its input into manageable chunks (using temporary files) anyway. Your command line would then look like this:

sort -u <(zcat file1) <(zcat file2) ... <(zcat file15) > sortedFile

This would also make use of more than one core of your machine, since each zcat runs as a separate process alongside sort.
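If you go this route, GNU sort's --parallel and buffer-size (-S) options may help as well. An untested sketch; the thread count and buffer size here are arbitrary placeholders to tune for your machine, and the output is piped through gzip since you want a compressed result:

sort -u --parallel=4 -S 8G <(zcat file1) <(zcat file2) ... <(zcat file15) | gzip > sortedFile.gz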

  • You are correct, I am just shy of the amount of RAM needed, by about 116 gigs... ;) So you think it would still be faster to sort them independently and then combine them? Would I be better off using merge once they are sorted, instead of sort -m? Also, how do I handle the gzip compression: should I use something like sort <(gunzip -c file1.gz) > sorted, or zcat file1.gz | sort > sorted?
    – Dylan
    Commented Mar 11, 2015 at 6:56
  • The merge command does something completely different; sort -m is the way to go. As to how to unzip: I don't think it matters much, as long as you don't start more than 2 processes per file: one to decompress, one to sort. Commented Mar 11, 2015 at 7:03
  • Last question and I'll mark your answer as accepted. The sort man page lists the option --parallel=N, "change the number of sorts run concurrently to N". Should I use the number of cores I have, or just use the default? Grrr, never mind, I am getting lazy; I will research that before I execute it. OK, sounds good, I am gonna write up a little script to time execution and let it go. I'll update tomorrow after it has finished running. Thanks @daniel kullmann
    – Dylan
    Commented Mar 11, 2015 at 7:24
  • The man page says that the default is the number of processors. But a quick test sorting a few log files suggests that giving the parameter explicitly does make things faster. So: yes, use that parameter! Commented Mar 11, 2015 at 7:39
