
Can anybody suggest an alternative to importing a couple of GB of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MB?

The research problem (too large to post here) involves simple statistical operations on roughly twice as much data (around 34 GB) as there is RAM available (16 GB). To handle the data size I just split things up and used a Get / Clear strategy to do the math.
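Roughly, the loop looks like the sketch below (the file names, the symbol restored by Get, and the statistics step are placeholders for the real thing):

results = Table[
   Get["part" <> ToString[k] <> ".mx"];    (* restores the symbol data saved with DumpSave *)
   stat = someStatistics[data];            (* placeholder for the actual analysis *)
   Clear[data];                            (* free the memory before the next chunk *)
   stat,
   {k, 60}];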

It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering whether it would be quicker to use BLOBs with PostgreSQL or MySQL, or whatever database people use for gigabytes of numeric data.

So my question really is: What is the most efficient way to handle truly large data set imports in Mathematica?

I have not tried it yet, but I think that SQLImport from DatabaseLink will be slower than Get["bigfile.mx"].

Does anyone have some experience to share?

(Sorry if this is not a very specific programming question, but it would really help me to move on with that time-consuming finding-out-what-is-the-best-of-the-137-possibilities-to-tackle-a-problem-in-Mathematica).

  • Does this help? Another related question.
    – abcd
    Commented Dec 20, 2011 at 22:38
  • @yoda Note that Rolf is using MX, which is the native binary format of Mathematica, and in my experience faster than anything else, including Import/ReadList. I don't know about BinaryReadList ...
    – Szabolcs
    Commented Dec 21, 2011 at 8:29
  • @Rolf +1, very relevant question. It doesn't answer you, but you'll surely be interested in this presentation. It seems Mathematica 9 is bringing significant improvements in this area.
    – Szabolcs
    Commented Dec 21, 2011 at 8:30
  • @Rolf I just tested with a 370 MB mx file, and it imports in less than a second here. I did rr = RandomReal[1, {100,100,100,50}]; DumpSave["rr.mx",rr]; Timing[Get["rr.mx"];]. I wonder why our experiences differ. What kind of data are you reading? Is my version only fast because it only has a single large packed array?
    – Szabolcs
    Commented Dec 21, 2011 at 8:38
  • @Szabolcs Yes, I have a list of packed arrays of different lengths, and that list itself cannot be packed. Commented Dec 21, 2011 at 9:16

2 Answers


Here's an idea:

You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.

You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
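For instance, something along these lines (just a sketch; raggedData stands in for your actual list of packed arrays):

raggedData = RandomReal[1, #] & /@ {3, 5, 2};    (* stand-in for the real data *)

lengths = Length /@ raggedData;                  (* sublist lengths, stored alongside the data *)
flat = Developer`ToPackedArray@Flatten[raggedData];
DumpSave["flat.mx", {flat, lengths}];            (* a single packed array -> fast Get *)

(* after Get["flat.mx"], rebuild the ragged matrix *)
ends = Accumulate[lengths];
starts = ends - lengths + 1;
rebuilt = MapThread[Take[flat, {#1, #2}] &, {starts, ends}];

rebuilt === raggedData    (* True *)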


Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.

data = RandomReal[1, 10000000];

indexes = Union@RandomInteger[{1, 10000000}, 10000];    
ranges = #1 ;; (#2 - 1) & @@@ Partition[indexes, 2, 1];

data[[#]] & /@ ranges; // Timing

{0.093, Null}

Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel adds negligible overhead.


Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This avoids the problem that importing unpacked data can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).


If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import relevant parts of the file like this:

First make a test file:

In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]

In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}

In[5]:= Close[f]

Open it:

In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]    

In[7]:= StreamPosition[f]

Out[7]= 0

Skip the first 5 million entries:

In[8]:= SetStreamPosition[f, 5000000*8]

Out[8]= 40000000

Read 5 million entries:

In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing    
Out[9]= {0.609, 5000000}

Read all the remaining entries:

In[10]:= BinaryReadList[f, "Real64"] // Length // Timing    
Out[10]= {7.782, 70000000}

In[11]:= Close[f]

(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)


EDIT If you are willing to spend time on this and write some C code, another idea is to create a library function (using LibraryLink) that memory-maps the file (link for Windows) and copies it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of LibraryLink).
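On the Mathematica side, such a function would be loaded and called roughly like this (the library name, function name and argument list are made up for illustration; the C side is not shown):

readDoubles = LibraryFunctionLoad["mmapReader", "readDoublesMMap",
   {"UTF8String", Integer, Integer},    (* file name, element offset, element count *)
   {Real, 1}];

chunk = readDoubles["test.bin", 5000000, 5000000];   (* comes back as a packed array *)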

  • Have you tried my dynamicPartition (or just the core dynP) function in the toolbag post? I believe it should be a little faster than what you proposed. If it is, will you include a link?
    – Mr.Wizard
    Commented Dec 21, 2011 at 17:08
  • @Mr.Wizard My point here was merely to show that storing the data in a flat format and partitioning it in-kernel is not going to add noticeable overhead (not to find the best way to partition). Link added, of course.
    – Szabolcs
    Commented Dec 22, 2011 at 10:44
  • @Mr.Wizard Why not? Your function does exactly the same thing as what I was showing here! I just pointed out that it was not the main point of the answer (which is really just some ideas and not a complete answer).
    – Szabolcs
    Commented Dec 22, 2011 at 11:07

I think the two best approaches are either:

1) use Get on the *.mx file,

2) or read in that data and save it in some binary format for which you write LibraryLink code, and then read the stuff via that. That, of course, has the disadvantage that you'd need to convert your MX stuff. But perhaps this is an option.

Generally speaking Get with MX files is pretty fast.

Are you sure this is not a swapping problem?

Edit 1: You could also write an import converter: tutorial/DevelopingAnImportConverter
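To sketch what such a converter could look like (the format name, the assumed file layout, and the reader function are invented for illustration, following the pattern in that tutorial):

readRagged[strm_, opts___] :=
  Module[{n, lengths, flat},
    n = BinaryRead[strm, "Integer32"];                (* number of sublists *)
    lengths = BinaryReadList[strm, "Integer32", n];   (* their lengths *)
    flat = BinaryReadList[strm, "Real64", Total[lengths]];
    {"Data" -> MapThread[Take[flat, {#1, #2}] &,
        {Accumulate[lengths] - lengths + 1, Accumulate[lengths]}]}]

ImportExport`RegisterImport["RaggedBinary", readRagged,
  "AvailableElements" -> {"Data"},
  "DefaultElement" -> "Data",
  "BinaryFormat" -> True]

(* then: Import["bigfile.dat", {"RaggedBinary", "Data"}] *)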

  • It is not a swapping problem. The problem is that I have more data than fits in RAM, so I have to read in parts of the data sequentially, and this is done multiple times, so if it takes half a minute to read in such an MX file it is noticeable. Things do work, it just takes more than a day of CPU time for everything (there is an outer optimization loop), so I was thinking about how to speed things up. Commented Dec 21, 2011 at 9:19
  • Could I read in a chunk of data with LibraryLink code and swap it out to disk by command? Right now I need to Get / Clear the same MX file multiple times and basically I want to speed that up. Commented Dec 21, 2011 at 9:23
  • I have never done this, so I am a little cautious but I think this should be possible.
    – user1054186
    Commented Dec 21, 2011 at 9:54
  • Before coding in C, though, I'd make sure that the file reading is the bottleneck; perhaps the optimization can be tweaked as well.
    – user1054186
    Commented Dec 21, 2011 at 9:55
  • @ruebenko Thanks for sharing that this is possible at all! I didn't know we could write custom importers.
    – Szabolcs
    Commented Dec 21, 2011 at 9:59
