
I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively quickly. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file into a "memory-map". That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:

#!/usr/bin/python
import numpy as np
import time
from tempfile import TemporaryFile

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

outfile = TemporaryFile()
np.savez(outfile, a=a, b=b, c=c)

outfile.seek(0)
t = time.time()
z = np.load(outfile)
print("loading time = ", time.time() - t)

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print("assigning time = ", time.time() - t)

More precisely, the first line will be really fast, but the remaining lines, which assign the arrays to variables, are ridiculously slow:

loading time =  0.000220775604248
assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.

  • 4
    By default, np.load should not mmap the file.
    – Fred Foo
    Commented Mar 8, 2012 at 14:32
  • 6
    What about pytables?
    – dsign
    Commented Mar 8, 2012 at 14:34
  • 1
    It would be nice if there were a little more information in your question, like the kind of array that is stored in ifile and its size, or whether there are several arrays in different files, or how exactly you save them. From your question I get the impression that the first line does nothing and that the actual loading happens afterwards, but those are only guesses.
    – dsign
    Commented Mar 8, 2012 at 15:07
  • 23
    @larsmans - For what it's worth, for an "npz" file (i.e. multiple arrays saved with numpy.savez), the default is to "lazily load" the arrays. It isn't memmapping them, but it doesn't load them until the NpzFile object is indexed. (Thus the delay the OP is referring to.) The documentation for load skips this, and is therefore a touch misleading... Commented Mar 8, 2012 at 16:08
  • 1
    @JoeKington Thanks Joe. But how do I "not lazily load" an npz file? Commented Mar 8, 2012 at 16:11

7 Answers


I've compared performance (space and time) for a number of ways to store numpy arrays. Only a few of them support multiple arrays per file, but perhaps it's useful anyway.

[benchmark plot: file size and save/load time for the compared numpy array storage methods]

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which'll save a lot of space but cost some load time.
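A minimal sketch of the compressed route (the file name and array are just illustrations):

import numpy as np

a = np.zeros((1000, 1000))                 # very structured data compresses well
np.savez_compressed('arrays.npz', a=a)     # writes a zlib-compressed .npz archive

with np.load('arrays.npz') as z:           # the decompression cost is paid here,
    a_loaded = z['a']                      # when the array is accessed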

If portability is an issue, binary is better than npy. If human readability is important, then you'll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).
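To illustrate the plain-binary route, a sketch of what "knowing the layout yourself" looks like (file name and dtype are assumptions; tofile stores neither shape nor dtype):

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

a.tofile('a.f64')                            # raw values only, no header
b = np.fromfile('a.f64', dtype=np.float64)   # the dtype must be supplied ...
b = b.reshape(3, 4)                          # ... and the shape restored by hand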

More details and the code are available at the github repo.

  • 3
    Could you explain why binary is better than npy for portability? Does this also apply for npz?
    – daniel451
    Commented Jun 1, 2017 at 12:47
  • 4
    @daniel451 Because any language can read binary files if they just know the shape, data type and whether it's row or column based. If you're just using Python then npy is fine, probably a little easier than binary.
    – Mark
    Commented Jun 2, 2017 at 19:36
  • 1
    Thank you! One more question: am I overlooking something, or did you leave out HDF5? Since it is pretty common, I would be interested in how it compares to the other methods.
    – daniel451
    Commented Jun 2, 2017 at 21:15
  • 1
    I tried to use png and npy to save the same image. The png only takes 2K of space while the npy takes 307K. This result is really different from your work. Am I doing something wrong? The image is greyscale and contains only 0 and 255, so I think it counts as sparse data, correct? I also tried npz, but the size is exactly the same.
    – York Yang
    Commented Aug 6, 2017 at 1:51
  • 6
    Why is h5py missing? Or am I missing something?
    – daniel451
    Commented Feb 26, 2018 at 0:23

I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
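For instance, a minimal h5py sketch (file and dataset names are illustrative):

import numpy as np
import h5py

a = np.arange(10000000)
b = np.arange(10000000) * -0.5

# One HDF5 file can hold many named datasets
with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a)
    f.create_dataset('b', data=b, compression='gzip')  # optional per-dataset compression

with h5py.File('arrays.h5', 'r') as f:
    aa = f['a'][:]   # [:] reads the whole dataset into a plain ndarray
    bb = f['b'][:]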

  • 51
    would you be willing to provide some example code using these packages to save an array?
    – abcd
    Commented Apr 13, 2015 at 23:36
  • 18
    h5py example and pytables example Commented Sep 23, 2016 at 13:15
  • 2
    In my experience, hdf5 reads and writes very slowly with chunked storage and compression enabled. For example, I have two 2-D arrays with shape (2,500,000 × 2,000) and chunk size (10,000 × 2,000). A single write of an array with shape (2,000 × 2,000) takes about 1-2 s to complete. Do you have any suggestions for improving the performance? Thx.
    – Simon. Li
    Commented Mar 28, 2017 at 9:48
  • 1
    1 to 2 s doesn't look so long for such a big array. What is the performance compared to the .npy format? Commented Sep 26, 2020 at 10:10
  • Does hdf5 have problems with CPU memory consumption? I encountered some problems with multi-worker training when the hdf5 file is large, whereas npz can use a memory map to avoid this.
    – ToughMind
    Commented Mar 29, 2022 at 7:47

There is now an HDF5-based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = {'name': 'test', 'data_arr': [1, 2, 3, 4]}

# Dump data to file
hkl.dump(data, 'new_data_file.hkl')

# Load data from file
data2 = hkl.load('new_data_file.hkl')

print(data == data2)

EDIT:

There is also the possibility to "pickle" directly into a compressed archive:

import pickle, gzip, lzma, bz2

pickle.dump(data, gzip.open('data.pkl.gz', 'wb'))
pickle.dump(data, lzma.open('data.pkl.lzma', 'wb'))
pickle.dump(data, bz2.open('data.pkl.bz2', 'wb'))
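Reading them back is symmetric (a sketch using the file names from above):

import pickle, gzip, lzma, bz2

data_gz = pickle.load(gzip.open('data.pkl.gz', 'rb'))
data_xz = pickle.load(lzma.open('data.pkl.lzma', 'rb'))
data_bz = pickle.load(bz2.open('data.pkl.bz2', 'rb'))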

[plot: Pickle Compression Comparison — sizes and saving times, produced by the Appendix code below]


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = ['pickle', 'h5py', 'gzip', 'lzma', 'bz2']
modules = dict(
    pickle=pickle, h5py=h5py, gzip=gzip, lzma=lzma, bz2=bz2
)

labels = ['pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2']
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i, j] = np.sum(
            data['random'][i, :]) + np.sum(data['random'][:, j]
        )

# Not random data
data['not-random'] = np.arange(
    size * size, dtype=np.float64
).reshape((size, size))

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:
        path = 'data.pkl.{}'.format(compression)

        if compression == 'pickle':
            time_start = time.time()
            pickle.dump(data[key], open(path, 'wb'))
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = (
                os.path.getsize(path) * 10**-6, 
                time_tot,
            )
            os.remove(path)

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File(path, 'w') as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = (os.path.getsize(path) * 10**-6, time_tot)
            os.remove(path)

        else:
            time_start = time.time()
            with modules[compression].open(path, 'wb') as fout:
                pickle.dump(data[key], fout)
            time_tot = time.time() - time_start
            sizes[key][labels[compressions.index(compression)]] = (
                os.path.getsize(path) * 10**-6, 
                time_tot,
            )
            os.remove(path)


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange(len(x_ticks))

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [sizes[key][x_ticks[i]][0] for i in x]
    y_time[key] = [sizes[key][x_ticks[i]][1] for i in x]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar(x - width, y_size['random'], width, color = viridis(0))
p2 = ax_size.bar(x, y_size['semi-random'], width, color = viridis(.45))
p3 = ax_size.bar(x + width, y_size['not-random'], width, color = viridis(.9))
p4 = ax_time.bar(x - width, y_time['random'], .02, color='red')

ax_time.bar(x, y_time['semi-random'], .02, color='red')
ax_time.bar(x + width, y_time['not-random'], .02, color='red')

ax_size.legend(
    (p1, p2, p3, p4), 
    ('random', 'semi-random', 'not-random', 'saving time'),
    loc='upper center', 
    bbox_to_anchor=(.5, -.1), 
    ncol=4,
)
ax_size.set_xticks(x)
ax_size.set_xticklabels(x_ticks)

f.suptitle('Pickle Compression Comparison')
ax_size.set_ylabel('Size [MB]')
ax_time.set_ylabel('Time [s]')

f.savefig('sizes.pdf', bbox_inches='tight')
  • One warning that some people might care about: pickle can execute arbitrary code, which makes it less secure than other protocols for saving data. Commented Jul 13, 2020 at 19:05
  • This is great! Can you also provide the code for reading the files pickled directly into compression using lzma or bz2? Commented Jul 30, 2020 at 12:54
  • 3
    @ErnestSKirubakaran It's basically the same: If you saved it using pickle.dump( obj, gzip.open( 'filename.pkl.gz', 'wb' ) ), you can load it using pickle.load( gzip.open( 'filename.pkl.gz', 'r' ) )
    – Suuuehgi
    Commented Jul 23, 2021 at 18:50

savez() saves the data in a zip file, and it may take some time to zip and unzip it. You can use the save() and load() functions instead:

f = open("tmp.bin", "wb")
np.save(f, a)
np.save(f, b)
np.save(f, c)
f.close()

f = open("tmp.bin", "rb")
aa = np.load(f)
bb = np.load(f)
cc = np.load(f)
f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.


Another possibility to store numpy arrays efficiently is Bloscpack:

#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.

blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)

That means it can store data really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratio. Here are the sizes for these 76 MB arrays:

$ ll -h *.blp
-rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
-rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
-rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using 'clevel' = 0 (i.e. disabling compression):

$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.

  • 2
    To whom it may concern: although Bloscpack and PyTables are different projects (the former focuses only on disk dumps, not on slicing of stored arrays), I tested both, and for pure "file dump" use Bloscpack is almost 6x faster than PyTables. Commented Mar 23, 2015 at 0:48

The lookup time is slow because mmap does not load the contents of the array into memory when you invoke the load method. Data is lazily loaded when particular elements are needed, and that is what happens during lookup in your case. A second lookup, however, won't be so slow.

This is a nice feature of mmap: when you have a big array, you do not have to load the whole thing into memory.
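If you want that behaviour explicitly for a single .npy file, np.load accepts an mmap_mode argument (a sketch; the file name is illustrative):

import numpy as np

np.save('big.npy', np.arange(10000000))

big = np.load('big.npy', mmap_mode='r')  # maps the file instead of reading it
print(big[:5])                           # only the touched parts are read from disk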

To solve your problem you can use joblib: you can dump any object you want using joblib.dump, even two or more numpy arrays. See the example:

import numpy as np
import joblib

firstArray = np.arange(100)
secondArray = np.arange(50)
# Put both arrays in a dictionary and save them to one file
my_dict = {'first': firstArray, 'second': secondArray}
joblib.dump(my_dict, 'file_name.dat')
my_dict_loaded = joblib.load('file_name.dat')  # load them back
  • The library is no longer available. Commented Mar 25, 2020 at 9:31

'Best' depends on what your goal is. As others have said, a flat binary file is maximally portable, but the problem is that you need to know how the data is stored.

Darr saves your numpy array in a self-documented way based on flat binary and text files. This maximizes wide readability. It also automatically includes code on how to read your array in a variety of data science languages, such as numpy itself, but also R, Matlab, Julia etc.

Disclosure: I wrote the library.
