36

I have a C / C++ program which needs to read in a file that may or may not be gzip compressed. I know we can use gzread() from zlib to read in both compressed and uncompressed files - however, I want to use the zlib functions ONLY if the file is gzip compressed (for performance reasons).

So is there any way to programmatically detect or check whether a certain file is gzipped from C / C++?

1
  • 2
    @Rob Kennedy: There is a huge difference - 1min (fread) vs 20mins (gzread) for uncompressed files. It might have to do with us using an older version of zlib, but right now I'm not in a position to use the latest version - so I have to do the conditional read. Commented May 19, 2011 at 13:48

4 Answers

71

There is a magic number at the beginning of the file. Just read the first two bytes and check whether they are 0x1f and 0x8b, in that order.
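A minimal sketch of that check in C, assuming the file is identified by a path and that a two-byte test is good enough for your purposes. The function name has_gzip_magic is hypothetical, not part of zlib or any other library.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the file starts with the gzip magic
   bytes, 0 if it does not (or is shorter than two bytes), and -1 if the
   file cannot be opened. */
int has_gzip_magic(const char *path)
{
    unsigned char magic[2];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode matters on Windows */

    if (fp == NULL)
        return -1;

    n = fread(magic, 1, sizeof magic, fp);
    fclose(fp);

    /* Compare byte by byte so the result does not depend on endianness. */
    return n == 2 && magic[0] == 0x1f && magic[1] == 0x8b;
}
```

A caller could then use gzread() only when this returns 1 and plain fread() otherwise.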

1
  • 49
    Beware endianness and byte width. Compare individual values rather than a composite: (byte1 == 0x1f) && (byte2 == 0x8b) versus first2bytes == 0x1f8b.
    – pmg
    Commented May 19, 2011 at 13:42
14

Do you prefer false positives, false negatives, or no false results at all (there goes performance down the drain...)?

RFC 1952 (GZIP file format specification version 4.3) states that the first 2 bytes of each member, and therefore of the file, are '\x1F' and '\x8B'. Use that for a first check, keeping in mind that it can produce false positives.

3

What is the difference in performance between reading compressed and uncompressed files using gzread()?

Anyway, in order to detect if a file is gzipped, you can read the magic number at the beginning of the file, which is 1f 8b according to the link.

1
  • Regarding performance: There is a huge difference - 1min (fread) vs 20mins (gzread) for uncompressed files. It might have to do with us using an older version of zlib, but right now I'm not in a position to use the latest version - so I have to do the conditional read to work around this. Commented May 19, 2011 at 13:52
1

You can test for the signatures described in RFCs 1950 and 1952 to get an idea. For GZIP files the second one is the relevant and definitive reference. There are some false positives on other formats, so you should check as much of the header as possible for plausible values.
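As a rough sketch of such a plausibility check, the following C function looks at the fixed 10-byte part of the gzip header as laid out in RFC 1952: the two magic bytes, the compression method (8 means deflate), and the reserved flag bits, which must be zero. The name gzip_header_plausible is made up for illustration, and passing this test still does not prove the file is valid gzip.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf looks like the start of a gzip
   member according to the fixed header fields in RFC 1952, else 0. */
int gzip_header_plausible(const unsigned char *buf, size_t len)
{
    if (len < 10)                 /* fixed part of the header is 10 bytes */
        return 0;
    if (buf[0] != 0x1f || buf[1] != 0x8b)   /* ID1, ID2 */
        return 0;
    if (buf[2] != 8)              /* CM: 8 = deflate, 0-7 are reserved */
        return 0;
    if (buf[3] & 0xE0)            /* FLG: bits 5-7 are reserved, must be 0 */
        return 0;
    return 1;
}
```

Read the first 10 bytes of the file into a buffer and pass them in; a short read simply yields 0.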

For just zlib streams it's somewhat harder, because they are even more prone to false positives. But you would rarely encounter those in the wild on their own.
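To illustrate why zlib streams are harder, here is a sketch of the only header test available for them, based on the two-byte header defined in RFC 1950: the low nibble of the first byte must be 8 (deflate), the high nibble must be at most 7, and the two bytes read as a big-endian 16-bit number must be a multiple of 31. The name zlib_header_plausible is hypothetical. Since there is no fixed magic value, many arbitrary byte pairs pass this test.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first two bytes of buf form a
   plausible zlib (RFC 1950) stream header, else 0. */
int zlib_header_plausible(const unsigned char *buf, size_t len)
{
    unsigned cmf, flg;

    if (len < 2)
        return 0;
    cmf = buf[0];
    flg = buf[1];

    if ((cmf & 0x0F) != 8)        /* CM: compression method, 8 = deflate */
        return 0;
    if ((cmf >> 4) > 7)           /* CINFO: window size, values above 7 are not allowed */
        return 0;
    if ((cmf * 256 + flg) % 31 != 0)   /* FCHECK makes the pair a multiple of 31 */
        return 0;
    return 1;
}
```

The common header bytes 78 9c (default compression) pass this check, while the gzip magic 1f 8b does not.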
