
Some background on my problem:

Recently, I had a heavy data-analysis job to do. Each day I get thousands of CSV files (columns with different data types) totaling several gigabytes, and I frequently have to analyze this CSV data over arbitrary periods (like a month or half a year), which means dozens of CSV files to load and analyze at a time. So the problems are:

  1. Mathematica sucks at loading CSV (try loading a 10 MB CSV), even though it claims to have improved a lot; ReadList does not help much. Being dozens of times slower than Python's pandas at loading CSV files suggests Mathematica does not take the data-loading user experience seriously. (A timing sketch follows this list.)
  2. Too much data eats up my HDD space. I need good compression, but at the same time decompression must be fast so that loading the data stays quick.
  3. The data analysis itself should be done within Mathematica.
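To make point 1 concrete, here is a minimal timing sketch (the R:\ path and the array size are just for illustration; the absolute numbers will of course depend on your machine and on the CSV content):

sample = RandomInteger[{1, 1000}, {200000, 10}];
Export["R:\\sample.csv", sample]; (* produces a CSV of roughly 10 MB *)

Import["R:\\sample.csv", "CSV"]; // AbsoluteTiming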

So the problem comes down to transforming my huge collection of CSV files into a collection of well-compressed data files that can be loaded into Mathematica very fast.

My attempts:

First, I tried using HDF5 as a data bridge: I use pandas to load the CSV files and export HDF5 with internal compression. Pandas provides many compression algorithms, including Zstandard (zstd), but unfortunately Mathematica only seems to support loading zlib-compressed HDF5. Anyway, loading HDF5 is much better than loading CSV. But I found that Import["x.h5", "Data"] is still two times slower than pandas! Why does Mathematica lag behind at loading every format?! What is more, after loading the HDF5 file I still have to transform the data into Association form (sketched below), which makes loading take even longer.
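The conversion step I mean looks roughly like this (a sketch only; "x.h5" is a placeholder, and the exact layout depends on how pandas wrote the file):

names = Import["x.h5", "Datasets"];
(* assumes the "Data" element returns the datasets in the same order as the "Datasets" names *)
assoc = AssociationThread[names -> Import["x.h5", "Data"]];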

Then I turned to pure Mathematica approaches. I would like to save the association data with compression. It seems that currently the best way is to use BinarySerialize and BinaryDeserialize. Unfortunately, according to https://mathematica.stackexchange.com/a/141930/4742, BinarySerialize only uses zlib for compression.

Let us prepare some sample data and do a benchmark:

data=RandomInteger[{1,1000},10000000];

I compare two cases.

First case,

BinarySerialize with compression, then DumpSave:

dataBS=BinarySerialize[data,"PerformanceGoal"->"Size"];
DumpSave["R:\\dataBS.mx",dataBS];

Loading the data back:

Get["R:\\dataBS.mx"];//AbsoluteTiming
BinaryDeserialize[dataBS];//AbsoluteTiming

takes

{0.0408615, Null}
{0.553527, Null}

Second case,

BinarySerialize without compression, then DumpSave:

dataBS=BinarySerialize[data,"PerformanceGoal"->"Speed"];
DumpSave["R:\\dataBS.mx",dataBS];

Loading the data back:

Get["R:\\dataBS.mx"];//AbsoluteTiming
BinaryDeserialize[dataBS];//AbsoluteTiming

takes

{0.189236, Null}
{0.14339, Null}

By comparing 0.553527 s and 0.14339 s, I think we can infer that the decompression takes about 0.4 s. That is quite a lot of time for a single file, considering I have to load dozens of files at a time.

So I googled whether any better compression scheme than zlib exists, and it turns out there are really amazing algorithms out there. As pointed out in the article better-compression-with-zstandard, the DEFLATE algorithm is really too old; Zstandard (zstd) and some other modern algorithms beat zlib in every way.

I tried using Zstandard as an external program to compress and decompress the data, but RunProcess has too much overhead and involves additional file writing; roughly what I tried is sketched below.
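A sketch of the external-program route (it assumes zstd(.exe) is on the system path; the temp-file round trips and the process startup are exactly where the time goes):

zstdSave[expr_, file_] :=
  Module[{tmp = FileNameJoin[{$TemporaryDirectory, "data.mxb"}]},
    BinaryWrite[tmp, BinarySerialize[expr]]; Close[tmp];
    RunProcess[{"zstd", "-f", "-q", tmp, "-o", file}];
    DeleteFile[tmp]]

zstdLoad[file_] :=
  Module[{tmp = FileNameJoin[{$TemporaryDirectory, "data.mxb"}], expr},
    RunProcess[{"zstd", "-d", "-f", "-q", file, "-o", tmp}];
    expr = BinaryDeserialize[ReadByteArray[tmp]];
    DeleteFile[tmp]; expr]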

Is there an undocumented feature that lets us use Zstandard or some other modern algorithm? If not, is it possible to use the zstd DLL files from https://github.com/facebook/zstd/releases via LibraryLink to speed up data compression and decompression on the fly? What I imagine on the Wolfram side is sketched below.
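This is purely hypothetical: it assumes someone compiles a small LibraryLink wrapper (here called "zstdlink") that exposes ZSTD_compress/ZSTD_decompress over byte arrays; none of these names exist yet, and LibraryDataType[ByteArray] needs a recent Mathematica version:

zstdCompress = LibraryFunctionLoad["zstdlink", "zstd_compress",
   {LibraryDataType[ByteArray], Integer}, LibraryDataType[ByteArray]];
zstdDecompress = LibraryFunctionLoad["zstdlink", "zstd_decompress",
   {LibraryDataType[ByteArray]}, LibraryDataType[ByteArray]];

(* compress the BinarySerialize output at zstd level 3, then restore it *)
bytes = zstdCompress[BinarySerialize[data], 3];
restored = BinaryDeserialize[zstdDecompress[bytes]];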

$\endgroup$
  • Could these CSV files be put into a database? I'm thinking of something like Mongo, which supports CSV import and where you can control the data structure and nesting via the headers in the CSV. Mathematica has good Mongo support/linking that would allow you to then use the data within Mathematica. – Mike Honeychurch, Nov 24, 2019 at 2:04
  • @MikeHoneychurch Thank you so much for introducing Mongo. I definitely need to learn more about databases later; I currently know little about them, and that is a lot to learn :) According to docs.mongodb.com/manual/core/wiredtiger, Mongo 4.2 supports zstd, which is nice. So zstd is really popular right now, while Mathematica lags behind... – matheorem, Nov 25, 2019 at 11:14
