13
$\begingroup$

I'm writing an importer for the GIFTI file format. The details of the format are not particularly important, but the basic idea is that it is a relatively simple XML file which includes binary arrays of 32-bit floating-point numbers that are represented as "consecutive ASCII characters that are a Base64 text representation of the gzipped binary data". Accordingly, I thought that, given a string (variable name: data) containing the Base64 ASCII characters, the proper way to extract these would be:

(* 1 *) extractedData = ImportString[data, {"Base64", "GZIP", "Real32"}];

I know that the data string is correctly encoded because I can read it in correctly using other programs, including Matlab and other widely-used GIFTI readers. The code in (1), however, produces incorrect numbers, including Indeterminate values.

On closer inspection, I discovered that the base64 import works fine:

(* 2 *) decodedData64 = ImportString[data, {"Base64", "Binary"}];

The code in (2) produces the same sequence of bytes as many other programs (including the Mac OS X command 'base64 --decode ...'), but the GZIP import seems to do nothing:

(* 3 *) decodedData64 == ImportString[FromCharacterCode[decodedData64], {"GZIP","Binary"}]
        (* Out[]= True *)

What really has me confused is that the following code produces the correct results (albeit very slowly):

(* 4 *) <<JLink`
        InstallJava[];
        JavaDecode[data_] := With[
          {iis = JavaNew[
             "java.util.zip.InflaterInputStream",
             JavaNew[
               "java.io.ByteArrayInputStream",
               data]]},
          Most @ NestWhileList[(iis@read[]) &, iis@read[], # != -1 &]];
        decodedData = JavaDecode[decodedData64];
        reals = ImportString[FromCharacterCode[decodedData], "Real32"];

Execution of the code block (4) puts the correct list of real numbers in the reals variable. However, writing out the decodedData64 variable's contents as a binary file and attempting to gunzip it on the terminal fails (not gzip format). Note, also, that I've tried many combinations of writing the data to a file and importing it directly (rather than importing from a string), so I do not believe this is an ImportString issue.

It seems likely to me that either (A) the GIFTI file spec incorrectly names the gzip compression algorithm as that used in the format, or (B) Mathematica is not correctly unzipping the data. B seems pretty unlikely considering that gunzip itself fails.

According to the documentation for java.util.zip.InflaterInputStream, the compression algorithm used is the "deflate" algorithm, and the class is the basis for the GZIPInputStream class. My questions are these:

Does anyone know what the InflaterInputStream is doing, since it is apparently not gunzipping the data?

Does anyone know the (most elegant) correct way to unpack this data in Mathematica from a string?

As a test case, the following data has been gzipped and base64 encoded using an external GIFTI-compatible program; it should correctly decode into the list Range[0.0, 1.0, 0.05]:

data = "eJwNylEVgDAMQ9EIwAAGMNDvNQiYAQzUAAZmY7MxG60NdJC/vHsCAJW9VWZb83RtB4avOd1sq9MjPhlYeVAfRlw0MwK3rMseWche2eAPY5cekQ==";
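(* JavaDecode expects the Base64-decoded bytes, as in (4), not the Base64 string itself *)
decoded = JavaDecode[ImportString[data, {"Base64", "Binary"}]];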
ImportString[FromCharacterCode[decoded], "Real32"]
(* Out[]= {0., 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.} *)

Edit:

Please see Mark Adler's answer below for an explanation of why the decoding of these data fails. Given the nature of the failure, I've decided to fix up the JLink code that performs the ZLib decoding; this function runs reasonably quickly and relies on only JLink and the Java core classes:

Base64ZLibDecode[string_String] := With[
  {x = Apply[
     Join,
     First @ Last @ Reap @ JavaBlock[
       With[
         {inflater = JavaNew[
            "java.util.zip.InflaterInputStream",
            JavaNew[
              "java.io.ByteArrayInputStream",
              ImportString[string, {"Base64","Binary"}]]],
          ar = JavaNew["[B", 1024]},
         While[
           inflater@available[] != 0,
           With[
             {k = inflater@read[ar, 0, 1024]},
             If[k > 0, Sow[JavaObjectToExpression[ar][[1 ;; k]]]]]]]]]},
  (* Java bytes are signed (-128..127); Mod maps them to 0..255 for FromCharacterCode *)
  Mod[x, 256]];
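
Java's byte type is signed, so values read back through JLink fall in the range -128 to 127, while FromCharacterCode needs non-negative byte values (0 to 255). A compact way to express that mapping, shown here as a small worked example on hand-picked values, is Mod with modulus 256:

```mathematica
(* signed Java bytes -> unsigned byte values *)
Mod[{-128, -1, 0, 1, 127}, 256]
(* {128, 255, 0, 1, 127} *)
```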
$\endgroup$

2 Answers

12
$\begingroup$

The Base64 string you provided as an example is not an encoding of a gzip stream (RFC 1952). It is an encoding of a zlib stream (RFC 1950). For background, those are different wrappers around the raw "deflate" compressed data format (RFC 1951), where the wrappers are headers and trailers providing information on the compressed data and integrity check values.
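
The containers can usually be told apart from their first two bytes. The sketch below assumes the decoded bytes are a list of integers from 0 to 255, as returned by ImportString[data, {"Base64", "Binary"}]; the helper name streamKind is hypothetical. The gzip magic bytes (31, 139) come from RFC 1952, and the zlib test (low nibble of the first byte equal to 8, and the 16-bit header divisible by 31) comes from RFC 1950. A raw deflate stream has no header at all, so this is only a heuristic: raw deflate can be guessed by elimination, and could in rare cases happen to pass the zlib test.

```mathematica
(* streamKind is a hypothetical helper: it classifies a compressed stream by its header bytes *)
streamKind[bytes : {__Integer}] := Which[
  Length[bytes] >= 2 && Take[bytes, 2] === {31, 139},
    "gzip",                                     (* RFC 1952 magic number 0x1F 0x8B *)
  Length[bytes] >= 2 &&
    BitAnd[bytes[[1]], 15] == 8 &&              (* compression method 8 = deflate *)
    Mod[256 bytes[[1]] + bytes[[2]], 31] == 0,  (* RFC 1950 header check *)
    "zlib",
  True,
    "unknown (possibly raw deflate)"]

streamKind[{120, 156, 13}]   (* first bytes of the example string: "zlib" *)
streamKind[{31, 139, 8}]     (* typical start of a gzip file: "gzip" *)
```

The first call uses the three bytes that the leading Base64 characters "eJwN" of the example decode to.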

The Mathematica GZIP importer does not see a gzip stream, so it is rejected. The Mathematica import list in 10.0 does not include zlib, nor raw deflate.

The documentation for the format calls the format "GZipBase64Binary". If the example you gave is a valid element in that format, then the documentation is very misleading.

The documentation also says: "The third encoding,Base64GzipBinary, compresses the binary data using ZLIB and then converts the data to a Base64 representation." Here they misname their own format (swapping the Gzip and the Base64), and then say it is compressed using ZLIB. The zlib compression library can compress to any of: the zlib format, the gzip format, or raw deflate. The specification should, but does not, specify how the zlib library is to be used.

The specification is poorly written.

The Java Inflater class (misspelled, should be "Inflator") in fact decodes the zlib format, which is why that works. The Java documentation is also not clear, in some places saying it operates on deflate data, and in others that it operates on zlib data. In fact, it operates on zlib data, unless the nowrap parameter is true in the Inflater constructor, in which case it will operate on raw deflate data.

Update:

I thought I would be able to trick Mathematica into decompressing zlib streams by embedding them as comments in PNG files. The PNG format compresses images and comments using the zlib format, and Mathematica will decompress them. Alas, Mathematica refuses to decompress comments with arbitrary binary data that contains, for example, zeros. Your decompressed data starts with zeros. It looks like other byte values are dropped as well.

You will need to use external code to decompress zlib streams, until such time as Wolfram Language includes "ZLIB" as an Import format.

$\endgroup$
8
  • $\begingroup$ Thanks, this helps a lot, at least in reassuring me that I'm not crazy :-). I suspect that the confusion in the spec is a consequence of confusion in the Matlab library most commonly used to read these files (the authors of the spec, and most neuroscientists, are strict Matlab users, and the library for decoding GIFTI files does not have much documentation for its compression methods). I believe I can use JLink to fix up the slow code I have in the original post for reading the data currently, but if you know of a more elegant way to decode it, I'd love to see it! $\endgroup$
    – nben
    Commented Feb 20, 2015 at 18:20
  • 1
    $\begingroup$ By the way, I am a Rocket Scientist. So I can easily see how Brain Surgeons would mess up the specification. :-) $\endgroup$
    – Mark Adler
    Commented Feb 20, 2015 at 21:42
  • 1
    $\begingroup$ @MarkAdler there are actually two undocumented built-in functions to work with zlib streams in Mathematica: Developer`RawCompress[] and Developer`RawUncompress[]. Their syntax is described in the answer below. $\endgroup$
    – Ray Shadow
    Commented Mar 21, 2017 at 0:22
  • 1
    $\begingroup$ I have just realized that you are the author of zlib and your code can be found in basically every app! (On my PC there are 210 executables containing the string Mark Adler :) It is a pleasure to meet you here! $\endgroup$
    – Ray Shadow
    Commented Mar 21, 2017 at 23:38
  • 1
    $\begingroup$ Those functions were discovered by accident. The first version of the auto-complete feature in Mathematica suggested symbols not only from the Global`* context, but also from other contexts, including Developer`*. I was curious to find out what these compress functions do. After some experiments it became clear that they are just wrappers around zlib. $\endgroup$
    – Ray Shadow
    Commented Mar 21, 2017 at 23:38
10
$\begingroup$

A simple, efficient, but undocumented way to use the zlib deflate algorithm in Mathematica is to utilize the functions Developer`RawCompress[] and Developer`RawUncompress[].

They have the following syntax:

zlibStreamBytes = Developer`RawCompress[uncompressedDataBytes]
uncompressedDataBytes = Developer`RawUncompress[zlibStreamBytes]

The input and output of both functions are lists of bytes, where each byte is an integer from 0 to 255.

uncompressedDataBytes represents a list of bytes one wants to compress.

zlibStreamBytes is a list of bytes representing a zlib stream.

Simple example:

uncompressedDataBytes = ConstantArray[42, 30];
zlibStreamBytes = Developer`RawCompress[uncompressedDataBytes]

{120, 156, 211, 210, 194, 7, 0, 76, 104, 4, 237}

uncompressedDataBytes = Developer`RawUncompress[zlibStreamBytes]

{42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42}
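
The 11 bytes above can be read against the zlib layout from RFC 1950: a two-byte header (120, 156), the deflate payload, and a four-byte big-endian Adler-32 checksum of the uncompressed data. As a sanity check, here is a minimal sketch that recomputes that checksum; adler32 is a hypothetical helper written for illustration, not a built-in function.

```mathematica
(* adler32 is a hypothetical helper implementing the RFC 1950 checksum *)
adler32[bytes_List] := Module[{a = 1, b = 0},
  Do[
    a = Mod[a + x, 65521];   (* 1 plus the running sum of the bytes *)
    b = Mod[b + a, 65521],   (* running sum of the successive a values *)
    {x, bytes}];
  65536 b + a]

stream = {120, 156, 211, 210, 194, 7, 0, 76, 104, 4, 237};
trailer = FromDigits[Take[stream, -4], 256];   (* last four bytes, big-endian *)
adler32[ConstantArray[42, 30]] == trailer      (* True *)
```

For 30 bytes of value 42, a ends at 1261 and b at 19560, which are exactly the byte pairs (4, 237) and (76, 104) at the end of the stream.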

Example from your question:

data = "eJwNylEVgDAMQ9EIwAAGMNDvNQiYAQzUAAZmY7MxG60NdJC/vHsCAJW9VWZb83RtB4avOd1sq9MjPhlYeVAfRlw0MwK3rMseWche2eAPY5cekQ==";
decoded = Developer`RawUncompress[ImportString[data, {"Base64", "Binary"}]];
ImportString[FromCharacterCode[decoded], "Real32"]

{0., 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.}
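
For repeated use, the steps above could be wrapped in one function. This is only a sketch: zlibBase64ToReal32 is a hypothetical name, Developer`RawUncompress is undocumented and may change between versions, and the header test is a cheap guard based on RFC 1950 rather than a full validation. It also assumes the floats are stored in the same byte order as in the example.

```mathematica
(* zlibBase64ToReal32 is a hypothetical wrapper around the steps shown above *)
zlibBase64ToReal32[b64_String] := Module[{bytes},
  bytes = ImportString[b64, {"Base64", "Binary"}];
  If[
    ListQ[bytes] && Length[bytes] >= 6 &&          (* 2-byte header + 4-byte trailer *)
      BitAnd[bytes[[1]], 15] == 8 &&               (* compression method 8 = deflate *)
      Mod[256 bytes[[1]] + bytes[[2]], 31] == 0,   (* RFC 1950 header check *)
    ImportString[
      FromCharacterCode[Developer`RawUncompress[bytes]], "Real32"],
    $Failed]]                                      (* input does not look like a zlib stream *)

zlibBase64ToReal32[data]   (* data is the Base64 string defined above *)
```

Returning $Failed for input that does not start with a zlib header makes a mislabeled payload (for example a true gzip stream) show up as an explicit failure instead of as garbage numbers.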

Remark:

If you want to use only documented features, you can write your own LibraryLink wrapper around zlib. Such a method will give the same high performance as the Developer`* functions, but will require compiling a dynamic library for each operating system you use.

$\endgroup$
1
  • 4
    $\begingroup$ Users should note that the Raw in the names is a bit of a misnomer, as the functions are producing and consuming the zlib format, not the raw deflate format. The zlib format has a two-byte header and four-byte trailer around the deflate format. $\endgroup$
    – Mark Adler
    Commented Mar 21, 2017 at 1:00
