
Consider a hypothetical archive format that does the following under the covers, given a list of files to pack:

  1. gzip each file individually
  2. Tar the gzips together

Contrast this with traditional tar followed by gzip.
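
A minimal sketch of the two alternatives, assuming a directory of input files called data/ (the path and output names here are only placeholders):

    # Former method: gzip each file individually, then tar the gzips together
    for f in data/*; do gzip -c "$f" > "$f.gz"; done   # -c keeps the originals
    tar cf packed.tar data/*.gz

    # Traditional method: tar the originals, then gzip the whole stream
    tar cf - data/ | gzip > packed.tar.gz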

Under what circumstances, if any, would the former method result in better compression than the latter?

A friendly implementation of the former method would allow quicker access to individual packed files, as well as immediate access to the index. So I am wondering about the conditions under which these advantages are offset by a potential reduction in compression due to not considering the stream of content as a whole.
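
For instance, continuing with the placeholder names above and assuming data/somefile.log is one of the packed files, the per-file-gzip layout lets tar list its index and pull out a single member without touching the rest of the compressed data, whereas the single gzip stream has to be decompressed from the start even just to list its contents:

    # Per-file gzip: list the index, then extract and decompress one member
    tar tf packed.tar
    tar xf packed.tar data/somefile.log.gz && gunzip data/somefile.log.gz

    # Single gzip stream: listing or extracting decompresses the whole stream
    tar tzf packed.tar.gz
    tar xzf packed.tar.gz data/somefile.log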

2 Answers

  1. Individually compressed files each have their own header, which hurts compression by multiplying header overhead. Each header is small, on the order of tens of bytes for gzip, but they add up.

  2. Compression techniques use dictionaries; building a fresh dictionary inside each file adds further overhead. When multiple files are compressed as one stream, they can share parts of the same dictionary, reducing the total size.

  3. Data that can't be compressed, or that compresses at only a very small ratio, will see a negligible difference whichever way it is packed.

  4. Compression will take a bit longer, because the compressor has to stop, flush everything out, and start a new file (new header, dictionary, etc.) for each input instead of just appending data to one stream.

  5. A large number of similar files, such as weeks of log files, will share dictionary data when compressed as a single stream and save space (a rough comparison is sketched below).

  6. File systems allocate storage in fixed-size units, in many cases 4 KB, and part of the last unit is wasted for each file stored.

Until you are dealing with thousands or tens of thousands of files, the amount saved or lost will not be large with either technique.
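
A rough way to see how much points 1, 2, and 5 cost in practice is to pack the same set of small, similar files both ways and compare sizes; the logs/ path here is just an assumption for illustration:

    # Per-file compression: every member carries its own gzip header
    # and starts its DEFLATE dictionary from scratch
    for f in logs/*.log; do gzip -9 -c "$f" > "$f.gz"; done
    tar cf parts.tar logs/*.log.gz

    # Whole-stream compression: one header, and similar files can reuse
    # matches within gzip's 32 KB sliding window
    tar cf - logs/*.log | gzip -9 > whole.tar.gz

    ls -l parts.tar whole.tar.gz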


A much more important reason to do what you suggest, despite the loss in compression, is recovery of a corrupt archive. If you compress the entire archive together (i.e., tar cf - * | gzip > foo.tar.gz) and then let the file sit on disk for a while (or transmit it somewhere far away), a single bit of corruption can cause the loss of every file in the archive beyond the corrupted point.

Compressing the files individually first, and then tarring the results together, is far more robust to bit corruption, since tar was designed from the get-go, decades ago, to recover from such errors. It can't do that when gzip wraps the entire stream, because a single bit error leaves gzip unable to recover its dictionary state, and the rest of the stream is lost.
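
As a sketch of the recovery story (reusing the hypothetical parts.tar from the earlier example): once the members are unpacked, each .gz can be integrity-checked on its own, so a bit error costs only the file it landed in:

    # Unpack the members, then test each one separately; only the
    # members whose CRC check fails are actually lost
    tar xf parts.tar
    for f in logs/*.log.gz; do
        gzip -t "$f" 2>/dev/null || echo "corrupt member: $f"
    done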

In fact, I'm thinking of implementing your suggestion precisely because I've lost entire archives in the past to single-bit corruption. Right now I have millions of files occupying several TB of space that I'd prefer took up only a few hundred GB. Although I'll lose maybe a factor of 2 by gzipping files individually, I'd rather have a 600 GB archive that loses a few individual files than a 300 GB archive where I lose everything.
