I am trying to understand the deflate algorithm, and I have read up on Huffman codes as well as LZ77 compression. I was toying around with compression sizes of different strings, and I stumbled across something I could not explain. The string aaa
when compressed, both through zlib and gzip, turns out to be the same size as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
(36 a
s).
Before reading about this I would have assumed the compressor does something like storing 36*a instead of each character individually, but I could not find anywhere in the specifications where that is mentioned.
Using fixed Huffman code yielded the same result, so I assume the space-saving lies in LZ77, but that only uses distance-length pairs. How would that allow a 3 length string to expand by 12 times without increasing in size?
Interrupting the string of a
s with one or several b
s in the middle drastically increases the size. If distance-length pairs is what's doing the job, why could it not just skip over the b
s when searching backwards? Or is Huffman codes being utilized and I misunderstood what fixed Huffman codes implies?