I am going to be combining 15 different gzip files, ranging in size from 2 GB to 15 GB each, so the files are relatively large. I have done research on the best way to do it, but I still have some questions.
Summary:
Starting with 15 different gzip files I want to finish with one sorted, duplicate free file in the gzip format.
For the sake of conversation I will label the files as follows: file1, file2 ... file15
I am planning to use the sort command with the -u option. According to the man page for sort, this means:
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
So what I am thinking of doing is this:
sort -u file* > sortedFile
From my understanding, I would end up with one file that is sorted and has no duplicates. This seems to be the case with the test files I created, but I just want to verify that this is correct.
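As a small sanity check of the behavior I am relying on, here is what I tried with two made-up test files (the names a.txt and b.txt are just for illustration):
printf 'banana\napple\n' > a.txt
printf 'apple\ncherry\n' > b.txt
sort -u a.txt b.txt
# apple
# banana
# cherry
So sort -u appears to merge the inputs, sort them, and drop duplicate lines in a single pass.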
Now another wrinkle to my dilemma:
Because all of the files are in the gzip format, is there a way to use zcat (or another method) to pipe the output to sort, without first having to decompress each file to plain text, combine them, and then compress the result back into gzip? That would save a huge amount of time. Any input is appreciated. I'm not against doing more research, nor am I married to my method; I would just like some insight before I start running these commands against 120 GB of data.
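To make the idea concrete, this is roughly the pipeline I was imagining, assuming GNU sort and that the files are actually named file1.gz through file15.gz (both assumptions on my part):
zcat file*.gz | sort -u | gzip > sortedFile.gz
With this much data I believe GNU sort will spill to temporary files on disk, so pointing it at a scratch directory with enough free space, and optionally compressing its temp files, might also be worth doing, something like:
zcat file*.gz | sort -u -T /path/to/big/tmp --compress-program=gzip | gzip > sortedFile.gz
(The /path/to/big/tmp directory is hypothetical; it would just need enough room for sort's intermediate data.)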
Thanks peoples!