6

We have some really big repositories in git, and for these we have observed that remote/server-side compression is a bottleneck when cloning and pulling. Given how pervasive git has become and that it uses zlib, has this zlib compression been optimized?

An Intel paper details how DEFLATE compression can be sped up by a factor of roughly 4×, although at a lower compression ratio:

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-deflate-compression-paper.pdf

Another paper indicates a speedup of roughly 1.8× while preserving compression ratios for most compression levels (1-9):

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/zlib-compression-whitepaper-copy.pdf

This latter optimization appears to be available on GitHub: https://github.com/jtkukunas/zlib

zlib seems to be quite old (in this fast-paced industry); the latest release is from April 2013. Have there been any attempts to SIMD-optimize zlib for newer processor generations? Or are there alternatives to using zlib in git?

I do understand that you can specify a compression level in git, which affects speed and compression ratio. However, the papers above indicate that quite large performance improvements can be made to zlib without hurting compression ratios.
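
The level trade-off is easy to see directly against the zlib API. Below is a minimal sketch (my own, not from the question) that compresses the same synthetic buffer at levels 1-9 with compress2() and prints the output size and wall time; the buffer contents and size are illustrative only (compile with -lz):

    /* Sketch: compare zlib compression levels 1-9 on one buffer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <zlib.h>

    int main(void)
    {
        const size_t srcLen = 16 * 1024 * 1024;          /* 16 MiB of filler */
        unsigned char *src = malloc(srcLen);
        if (!src) return 1;
        for (size_t i = 0; i < srcLen; i++)
            src[i] = (unsigned char)(i % 251);           /* mildly compressible */

        uLong bound = compressBound(srcLen);
        unsigned char *dst = malloc(bound);
        if (!dst) return 1;

        for (int level = 1; level <= 9; level++) {
            uLongf dstLen = bound;
            clock_t t0 = clock();
            int rc = compress2(dst, &dstLen, src, srcLen, level);
            clock_t t1 = clock();
            if (rc != Z_OK) {
                fprintf(stderr, "level %d failed: %d\n", level, rc);
                continue;
            }
            printf("level %d: %lu -> %lu bytes, %.3f s\n",
                   level, (unsigned long)srcLen, (unsigned long)dstLen,
                   (double)(t1 - t0) / CLOCKS_PER_SEC);
        }
        free(src);
        free(dst);
        return 0;
    }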

So to recap: are there any existing git implementations that use a highly optimized zlib or a zlib alternative?

PS: It seems a lot of devs/servers would benefit from this (even greenhouse gas emissions ;)).

2 Answers

5

There are in fact contributions to zlib's deflate from Intel that have yet to be integrated. You can look at this fork of zlib, which has some experimental integrations of the Intel and Cloudflare improvements to compression. You could try compiling git against it to see how it does.
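
If you do try that, it is worth a quick sanity check that your build actually picked up the forked library rather than the system zlib. Assuming the fork keeps the standard zlib API (it is a fork of stock zlib, so effectively a drop-in replacement), a minimal check of my own looks like this (compile once against each libz and compare):

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        /* ZLIB_VERSION is what the header you compiled against declares;
           zlibVersion() is what the library you linked reports at run time. */
        printf("header:  %s\n", ZLIB_VERSION);
        printf("library: %s\n", zlibVersion());

        /* tiny smoke test of the linked deflate implementation */
        const char msg[] = "smoke test of the linked deflate implementation";
        unsigned char out[256];
        uLongf outLen = sizeof(out);
        if (compress2(out, &outLen, (const unsigned char *)msg, sizeof(msg),
                      Z_DEFAULT_COMPRESSION) != Z_OK) {
            fprintf(stderr, "compress2 failed\n");
            return 1;
        }
        printf("compressed %lu -> %lu bytes\n",
               (unsigned long)sizeof(msg), (unsigned long)outLen);
        return 0;
    }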

zlib is older than you think. Most of the compression code is relatively unchanged from 20 years ago. The decompression was rewritten about 12 years ago.

  • Thanks! This does seem interesting. Is there any reason why this does not receive more interest, or why no effort is made to integrate it into the main zlib library?
    – nietras
    Commented Aug 8, 2015 at 9:23
1

I don't know of any git implementations using an optimized zlib or an alternative. I have done a bit of investigation into compression and the trade-offs between compression levels and speed, however. If you are aiming to improve performance significantly, you will generally get better results by adopting a new algorithm designed with speed in mind than by trying to optimize an existing one. LZ4 is a good example of a compression algorithm designed with speed as a priority over compression ratio.
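
As a rough illustration of what a speed-first design looks like at the API level, here is a minimal round-trip through liblz4's one-shot functions (my own sketch, not part of the original answer; link with -llz4, buffer contents are illustrative only):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lz4.h>

    int main(void)
    {
        const char src[] = "some repetitive data some repetitive data some repetitive data";
        const int srcSize = (int)sizeof(src);

        int maxDst = LZ4_compressBound(srcSize);
        char *dst = malloc(maxDst);
        char *back = malloc(srcSize);
        if (!dst || !back) return 1;

        int cSize = LZ4_compress_default(src, dst, srcSize, maxDst);
        if (cSize <= 0) { fprintf(stderr, "compression failed\n"); return 1; }

        int dSize = LZ4_decompress_safe(dst, back, cSize, srcSize);
        if (dSize != srcSize || memcmp(src, back, srcSize) != 0) {
            fprintf(stderr, "round-trip mismatch\n");
            return 1;
        }
        printf("%d bytes -> %d bytes and back\n", srcSize, cSize);
        free(dst);
        free(back);
        return 0;
    }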

The nature of compression algorithms means that they don't tend to parallelize or SIMDify (which is really a type of parallelism) very effectively, particularly if they were not designed with that as a goal. Compression by its very nature involves serial data dependencies on a stream.
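
To make the serial-dependency point concrete, here is a toy greedy LZ77-style matcher (purely illustrative, not how zlib is implemented): the candidate matches for position i live in the bytes before i, i.e. in what earlier iterations of the loop have already consumed, which is exactly the dependency that frustrates naive parallel or SIMD splitting of a stream:

    #include <stdio.h>

    int main(void)
    {
        const char data[] = "abcabcabcabx";
        const int n = (int)(sizeof(data) - 1);

        for (int i = 0; i < n; ) {
            int bestLen = 0, bestOff = 0;
            /* matches may only start in the already-processed prefix [0, i) */
            for (int j = 0; j < i; j++) {
                int len = 0;
                while (i + len < n && data[j + len] == data[i + len])
                    len++;
                if (len > bestLen) { bestLen = len; bestOff = i - j; }
            }
            if (bestLen >= 3) {
                printf("(back %d, len %d) ", bestOff, bestLen);
                i += bestLen;
            } else {
                printf("'%c' ", data[i]);
                i++;
            }
        }
        printf("\n");
        return 0;
    }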

Another thing to consider with compression algorithms is whether to prioritize compression or decompression speed. If your bottleneck is the time it takes the server to compress data, then you want to focus on fast compression; but in situations where you compress once and decompress often (loading game assets or fetching a static web page, for example), you likely want to prioritize decompression speed.

  • Thanks, I understand. Related to this, I could ask why the "deltas" are not pre-compressed in git? As far as I understand, the git repository stores only compressed objects, so why does it then have to decompress and re-compress when sending to a client?
    – nietras
    Commented Aug 8, 2015 at 9:16
  • However, as the links show, with some modifications compression/decompression can be made faster with SIMD instructions, which is what Intel has shown.
    – nietras
    Commented Aug 8, 2015 at 9:19
