I have a series of gzip files which I wish to store more efficiently using xz, without losing traceability to a set of checksums of the gzip files.

I believe this amounts to being able to recreate the gzip files from the xz files, though I'm open to other suggestions.

To elaborate... If I have a gzip file named target.txt.gz, and I decompress it to target.txt and discard the compressed file, I want to exactly recreate the original compressed file target.txt.gz. By exactly, I mean a cryptographic checksum of the file should indicate that it is exactly the same as the original.

I initially thought this must be impossible, because a gzip file contains metadata such as the original file name and timestamp, which might not be preserved upon decompression, and metadata such as a comment, the source operating system, and compression flags, which are almost certainly not preserved upon decompression.
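
The effect of header metadata on the checksum can be seen in a small experiment. This is only a sketch: it assumes Python 3.8 or later (for the `mtime` argument of `gzip.compress`), and the payload and variable names are made up for illustration. It compresses the same bytes twice, changing nothing but the timestamp stored in the header.

```python
import gzip
import hashlib

# Illustrative payload; any bytes would do.
payload = b"the same uncompressed contents\n" * 100

# Identical input and compression level, different header timestamp.
first = gzip.compress(payload, mtime=1)
second = gzip.compress(payload, mtime=2)

digest_first = hashlib.sha256(first).hexdigest()
digest_second = hashlib.sha256(second).hexdigest()

# The two compressed files hash differently...
files_match = digest_first == digest_second
# ...even though both decompress to exactly the same data.
contents_match = gzip.decompress(first) == gzip.decompress(second) == payload
```

Here `files_match` ends up false while `contents_match` is true, so a checksum of the uncompressed data alone cannot stand in for a checksum of the gzip file.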

But then I thought to modify my question: is there a minimal amount of header information that I could extract from the gzip file that, in combination with the uncompressed data, would allow me to recreate the original gzip file?
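
As a rough idea of what "extracting the header" could look like, here is a sketch that reads the fields laid out in RFC 1952 from the start of a single-member gzip file. The function name `parse_gzip_header` and the dictionary keys are my own invention, not part of any library. Keeping the first `header_len` bytes verbatim would preserve all of this metadata without having to interpret it.

```python
import struct

# Flag bits in the FLG byte, as defined by RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def parse_gzip_header(data):
    """Return the header fields of a gzip member and the header length."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip file")
    # Fixed 10-byte part: magic, CM, FLG, MTIME (4 bytes LE), XFL, OS.
    method, flags, mtime, xfl, os_id = struct.unpack_from("<BBIBB", data, 2)
    pos = 10
    extra = name = comment = None
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        extra = data[pos + 2:pos + 2 + xlen]
        pos += 2 + xlen
    if flags & FNAME:
        end = data.index(b"\x00", pos)  # zero-terminated file name
        name = data[pos:end]
        pos = end + 1
    if flags & FCOMMENT:
        end = data.index(b"\x00", pos)  # zero-terminated comment
        comment = data[pos:end]
        pos = end + 1
    if flags & FHCRC:
        pos += 2  # skip the optional CRC16 of the header
    return {
        "method": method, "flags": flags, "mtime": mtime, "xfl": xfl,
        "os": os_id, "extra": extra, "name": name, "comment": comment,
        "header_len": pos,
    }
```

Everything after `header_len` is the DEFLATE stream followed by the 8-byte trailer (CRC32 and uncompressed size), both of which depend only on the uncompressed data and the compressor's choices.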

And then I thought that the answer might still be no, due to the existence of tools such as Zopfli and 7-zip, which can create gzip-compatible streams that are better than (and therefore different from) those produced by the standard gzip program. As far as I am aware, the gzip file format does not record which of these compressors created it.
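
One partial workaround I can imagine is to guess the compressor settings by brute force. The sketch below is an assumption-laden illustration, not a general solution: it only tries zlib (via Python's `zlib` module) with the default strategy and a 32 KiB window, it handles single-member files only, and the helper names `rebuild_gzip` and `find_zlib_settings` are hypothetical. A file made by Zopfli, 7-zip, or a different zlib version may match none of the candidates, in which case the search returns `None`.

```python
import struct
import zlib


def rebuild_gzip(header, raw, level, mem_level):
    """Reassemble a gzip file from saved header bytes and raw data."""
    # wbits=-15 asks zlib for a bare DEFLATE stream with no wrapper.
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15, mem_level)
    body = deflater.compress(raw) + deflater.flush()
    # gzip trailer: CRC32 and length (mod 2**32) of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(raw), len(raw) & 0xFFFFFFFF)
    return header + body + trailer


def find_zlib_settings(original, header_len):
    """Search for zlib settings that reproduce `original` byte for byte."""
    header = original[:header_len]
    raw = zlib.decompress(original, wbits=31)  # 31 = gzip wrapper expected
    for level in range(1, 10):
        for mem_level in range(1, 10):
            if rebuild_gzip(header, raw, level, mem_level) == original:
                return level, mem_level
    return None  # not reproducible with these zlib settings
```

If a match is found, the saved header bytes plus the two integers would be enough to recreate the file from the uncompressed data, provided the same zlib version is available later. `header_len` is 10 when the header has no optional fields.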

So my question becomes: are there other options I haven't thought of that might mean I can achieve my goal as set out in the first paragraph after all?

  • It would be so much easier if you just cared about the MD5 checksum of the uncompressed contents of the file...
    – Kusalananda
    Commented Apr 17, 2017 at 17:13
  • Hopefully not MD5!
    – ilkkachu
    Commented Apr 17, 2017 at 17:21
  • @Kusalananda: Indeed, but that's not the case unfortunately.
    – jl6
    Commented Apr 17, 2017 at 17:30
  • @ilkkachu Well, whatever type of checksum.
    – Kusalananda
    Commented Apr 17, 2017 at 17:55
  • Debian's advice on reproducible builds might help.
    Commented Apr 17, 2017 at 22:12

1 Answer

This may be helpful: https://github.com/google/grittibanzli

Grittibanzli is a tool to compress a deflate stream to a smaller file, which can be decoded to the original deflate stream again. That is, it compresses not only the data inside the deflate stream, but also the deflate-related information such as LZ77 symbols and Huffman trees, to reproduce a gzip, png, ... file exactly.

  • According to the rest of the README, what it actually does is split a DEFLATE-based file into multiple files, which contain the uncompressed data plus any metadata needed to re-create the compressed file exactly. So you get to choose which compression tool to use to make it smaller; it doesn't do that part for you. (Originally I skimmed too quickly and was put off because I thought it could only shrink single DEFLATE files.)
    – Will Chen
    Commented Feb 26 at 0:11
