3

I have an ASCII-encoded text file where each line has the following structure:

XYplorer nn.nn.nnnn [yyyy-mm-dd hh.mm.ss] [S256 S256].zip
         ↑↑ ↑↑ ↑↑↑↑  ↑↑↑↑ ↑↑ ↑↑ ↑↑ ↑↑ ↑↑   ↑64× ↑64×

So each line is 177 characters long: 27 characters never change and the other 150 do, with the two hashes accounting for 128 of those. I also assume that the hashes are essentially random text, and thus difficult to compress, so

27/177 = 15.3% fixed text

22/177 = 12.4% changing text

128/177 = 72.3% random text

Yet when I zip such a file (1854 lines) the standard (right-click) way on Windows, I get a 49% compression ratio, which baffles me because it seems too high/efficient.

Can you explain to me how the random part could be compressed so much?

2
  • 3
    You assume that zip compresses text. It doesn't; it compresses bytes. See this link for the deflate method.
    – Mixxiphoid
    Commented Sep 1, 2015 at 16:38
  • 3
    Another point to make is that a SHA hash has a very limited alphabet (only 16 different characters are ever used) so that helps with compression. Just because the characters are in a random order doesn't mean they can't be compressed at all.
    – heavyd
    Commented Sep 1, 2015 at 16:44

1 Answer

5

The key element here is that this is an ASCII-encoded file.

Thus each character is encoded using 8 bits, so 177 × 8 = 1416 bits per line. However, 177 characters doesn't count line endings. On Windows a line ending is encoded as "\r\n" (carriage return, line feed), so each line actually uses 179 characters, giving 1432 bits per line.

Your SHA-256 hashes are 64 hex digits each. A hex digit can take only 16 values, so it could be trivially packed into 4 bits (2^4 = 16), which is half the 8 bits it occupies as an ASCII character.
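As a rough illustration of that packing, here is a minimal Python sketch. The digest is a placeholder (the SHA-256 of an arbitrary string), not a value taken from the file in the question.

```python
import hashlib

# Hypothetical input: any SHA-256 digest will do for the illustration.
hex_digest = hashlib.sha256(b"example").hexdigest()

# As ASCII text, every hex digit occupies one full byte.
as_text = hex_digest.encode("ascii")

# Packed, two hex digits share one byte (4 bits each).
packed = bytes.fromhex(hex_digest)

print(len(as_text), len(packed))  # 64 32
```

Nothing is lost by the packing: `packed.hex()` gives back the original 64-character string.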

Let's break it down:

  • (27+2)/179 = 16.2% fixed text (assuming infinitely compressible)
  • 22/179 = 12.3% changing text
  • 128/179 = 71.5% text that could be encoded at 50% of its size.

Using that mapping alone I get 128/2 + 22 = 86 bytes or 688 bits.

  • 688/1432 = 48% of the original size.
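The same arithmetic can be written out as a short Python sketch. The variable names are only illustrative, and the fixed text is counted as costing nothing, which is the optimistic assumption made above.

```python
FIXED = 27 + 2      # constant characters plus the "\r\n" line ending
CHANGING = 22       # version and timestamp digits
HASH_HEX = 128      # two SHA-256 digests, 64 hex digits each

line_bytes = FIXED + CHANGING + HASH_HEX      # 179 bytes per line as ASCII
packed_bytes = HASH_HEX // 2 + CHANGING       # hex halved, fixed text free
ratio = packed_bytes / line_bytes

print(line_bytes * 8, packed_bytes * 8, f"{ratio:.1%}")  # 1432 688 48.0%
```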

This doesn't take into account any additional compression that could be performed on the changing text. It looks like it is mostly ASCII decimal digits, which suffer the same packing losses as ASCII hex digits.
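To put a rough number on that remark: a decimal digit carries at most log2(10) ≈ 3.32 bits. The sketch below assumes the 22 changing characters are uniformly random digits, which is pessimistic, since real version numbers and timestamps are far more predictable. Under that assumption, and still counting the fixed text as free, the floor per line would be:

```python
import math

hash_bits = 128 * 4                 # hex digits, 4 bits each
digit_bits = 22 * math.log2(10)     # decimal digits, about 3.32 bits each
line_bits = 179 * 8                 # one ASCII line including "\r\n"

floor_ratio = (hash_bits + digit_bits) / line_bits
print(f"{floor_ratio:.1%}")  # 40.9%
```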

To be honest, I'm surprised Windows zipping doesn't do a better job.
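One way to check this is to build a synthetic file with the same layout and run it through Deflate. The following is only a sketch under stated assumptions: the digits and hashes are pseudo-random stand-ins (lowercase hex is a guess), `make_line` is a hypothetical helper, and Python's `zlib` is used in place of the Windows zip tool, so the exact figure may differ from what Explorer produces.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

def digits(n):
    """Return n pseudo-random decimal digits."""
    return "".join(rng.choice("0123456789") for _ in range(n))

def hex_digest():
    """Return 64 pseudo-random hex digits standing in for a SHA-256."""
    return "".join(rng.choice("0123456789abcdef") for _ in range(64))

def make_line():
    """Build one line in the layout from the question, ending in CRLF."""
    version = f"{digits(2)}.{digits(2)}.{digits(4)}"
    stamp = (f"{digits(4)}-{digits(2)}-{digits(2)} "
             f"{digits(2)}.{digits(2)}.{digits(2)}")
    return f"XYplorer {version} [{stamp}] [{hex_digest()} {hex_digest()}].zip\r\n"

data = "".join(make_line() for _ in range(1854)).encode("ascii")
compressed = zlib.compress(data, 9)   # Deflate at maximum effort
ratio = len(compressed) / len(data)

print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.1%})")
```

The printed ratio can be compared against the 48% estimate. Because the stand-in digits are random rather than real versions and dates, this is closer to a worst case than to the actual file.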
