
I want to compress my data using zlib's compress() function, so my code looks like the following:

ifs.read(srcBuf, srcLen);                  // std::ifstream, srcLen = 256 KB
compress(dstBuf, &dstLen, srcBuf, srcLen); // casts are omitted
ofs.write(dstBuf, dstLen);                 // std::ofstream
dstLen = dstBufSize;                       // compress() overwrote dstLen, so reset it for the next iteration

The resulting file is ~4% smaller than the original (360 MB vs 380 MB), which is, frankly, awful. Meanwhile, WinRAR compresses the same file down to 70 MB. I've tried bzip2 and zlib, and both give similar results. I guess the problem is that the 256 KB buffer is too small, but I'd like to understand how this works and how I can use zlib to achieve better compression. Overall, I want to build a low-level facility that compresses several files into one big file for internal use, and compress() looked well suited for it, but...

Deep explanations are very welcome. Thanks in advance.

  • Try something like: pastebin.com/nWJedn82 and see if it works well. Use Z_BEST_COMPRESSION as the compression level.
    – Brandon
    Commented Nov 24, 2014 at 14:57

2 Answers


I believe your problem is that by using the compress() function (rather than deflateInit()/deflate()/deflateEnd()), you are underutilizing zlib's compression abilities.

The key insight here is that zlib's deflate algorithm compresses by combining two techniques: LZ77 matching, which replaces a sequence of input bytes with a short back-reference whenever the same sequence has already occurred within the last 32 KB of input (the "sliding window"), and Huffman coding, which assigns shorter bit codes to the more frequent symbols. That way, whenever longer sequences are repeated later in the input stream, they can be replaced by compact references in the output stream, greatly reducing the total size of the compressed data.

However, the efficiency of that process depends a lot on the persistence of that built-up state (the sliding window of recent input plus the statistics from which the Huffman codes are built), which in turn depends on your program keeping the deflate algorithm's state for the entire duration of the compression process. But your code is calling compress(), which is meant to be a single-shot convenience function for small amounts of data, and as such compress() does not provide any way for your program to retain state across multiple calls to it. With each call to compress(), a brand-new compression state is allocated, used only for the data passed to that call, and then thrown away -- it is inaccessible to any subsequent compress() calls. That is likely the source of the poor efficiency you are seeing.

The fix is not to use compress() in cases where you need to compress the data in more than one step. Instead, call deflateInit() (to allocate the state for the algorithm), then call deflate() multiple times (to compress data using, and updating, that state), and finally call deflateEnd() to clean up.


Use deflateInit(), deflate(), and deflateEnd() instead of compress(). I don't know whether that will improve the compression, since you provided no information on the data and only the slightest clue as to what your program does (are those lines inside a loop?). However, if you are compressing something large that you are not loading into memory all at once, then you should not use compress().
