What is the advantage of saving `.npz` files instead of `.npy` in python, regarding speed, memory and look-up?

Question

The NumPy documentation for numpy.savez, which saves an .npz file, says:

The .npz file format is a zipped archive of files named after the variables they contain. The archive is not compressed and each file in the archive contains one variable in .npy format. [...]

When opening the saved .npz file with load a NpzFile object is returned. This is a dictionary-like object which can be queried for its list of arrays (with the .files attribute), and for the arrays themselves.

My question is: what is the point of numpy.savez?

Is it just a more elegant version (shorter command) to save multiple arrays, or is there a speed-up in the saving/reading process? Does it occupy less memory?

What's the point of file archives?
– hpaulj
Commented Jan 17, 2019 at 16:55

Majid Hajibaba · Accepted Answer · 2021-11-02 13:31:33Z

The answer to your question has two parts.

I. NPY vs. NPZ

As we already read from the doc, the .npy format is:

the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk. ... The format is designed to be as simple as possible while achieving its limited goals. (sources)

And .npz is only a

simple way to combine multiple arrays into a single file, one can use ZipFile to contain multiple “.npy” files. We recommend using the file extension “.npz” for these archives. (sources)

So, .npz is just a ZipFile containing multiple “.npy” files. And this ZipFile can be either compressed (by using np.savez_compressed) or uncompressed (by using np.savez).

It's similar to a tarball on Unix-like systems: a tarball can be just an uncompressed archive containing other files, or a compressed archive when combined with a compression program (gzip, bzip2, etc.).
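One way to check this is to open an .npz file with Python's standard zipfile module. The following is a minimal sketch; the array names (a, b) and the temporary file path are made up for illustration.

```python
import os
import tempfile
import zipfile

import numpy as np

# Hypothetical example arrays; the names "a" and "b" are illustrative only.
a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)

path = os.path.join(tempfile.mkdtemp(), "arrays.npz")
np.savez(path, a=a, b=b)

# An .npz file is an ordinary zip archive with one .npy member per array.
with zipfile.ZipFile(path) as zf:
    member_names = sorted(zf.namelist())
    compress_types = {info.compress_type for info in zf.infolist()}

print(member_names)  # ['a.npy', 'b.npy']
# np.savez stores the members without compression (ZIP_STORED).
print(compress_types == {zipfile.ZIP_STORED})
```

Each member is named after the keyword used in np.savez, with ".npy" appended.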

II. Different APIs for binary serialization

NumPy provides several APIs to produce these binary files:

np.save ---> Save an array to a binary file in NumPy .npy format
np.savez --> Save several arrays into a single file in uncompressed .npz format
np.savez_compressed --> Save several arrays into a single file in compressed .npz format
np.load --> Load arrays or pickled objects from .npy, .npz or pickled files
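As a rough illustration of how these four functions fit together, here is a small round trip. The array names (x, y) and file names are assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
x = np.arange(6).reshape(2, 3)
y = np.ones(4)

# np.save: one array -> one .npy file.
npy_path = os.path.join(tmp, "x.npy")
np.save(npy_path, x)
x_loaded = np.load(npy_path)  # returns a plain ndarray

# np.savez: several arrays -> one uncompressed .npz archive.
npz_path = os.path.join(tmp, "xy.npz")
np.savez(npz_path, x=x, y=y)

# np.savez_compressed: same call signature, but the members are compressed.
npz_c_path = os.path.join(tmp, "xy_compressed.npz")
np.savez_compressed(npz_c_path, x=x, y=y)

# np.load on an .npz returns a dict-like NpzFile; use it as a context
# manager so the underlying file handle is closed afterwards.
with np.load(npz_path) as data:
    names = sorted(data.files)
    y_loaded = data["y"]

with np.load(npz_c_path) as data_c:
    x_from_compressed = data_c["x"]

print(names)
```

The loading side is identical for compressed and uncompressed archives; np.load handles both.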

If we skim NumPy's source code, this is what happens under the hood:

def _savez(file, args, kwds, compress, allow_pickle=True, pickle_kwargs=None):
    ...
    if compress:
        compression = zipfile.ZIP_DEFLATED
    else:
        compression = zipfile.ZIP_STORED
    ...


def savez(file, *args, **kwds):
    _savez(file, args, kwds, False)


def savez_compressed(file, *args, **kwds):
    _savez(file, args, kwds, True)

Then back to the question:

If you only use np.savez, there is no compression on top of the .npy format; you just get a single archive file for the convenience of managing multiple related arrays.
If you use np.savez_compressed, the file typically takes less space on disk, at the cost of more CPU time for the compression work (i.e. it is a bit slower).
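The size difference can be checked directly. This sketch assumes a highly compressible array (all zeros); real data may compress far less, so treat the result as an illustration rather than a benchmark.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()

# Assumption for the demo: an all-zeros array, which compresses very well.
zeros = np.zeros(1_000_000, dtype=np.float64)

plain_path = os.path.join(tmp, "plain.npz")
packed_path = os.path.join(tmp, "packed.npz")

np.savez(plain_path, zeros=zeros)              # ZIP_STORED, no compression
np.savez_compressed(packed_path, zeros=zeros)  # ZIP_DEFLATED

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)

print("raw bytes:       ", zeros.nbytes)
print("savez:           ", plain_size)
print("savez_compressed:", packed_size)
```

The uncompressed archive is slightly larger than the raw array data because of the .npy header and the zip bookkeeping.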

why do I need torch when load npz sometimes???
– Nicholas Jela
Commented Apr 5, 2022 at 9:20

user2699 · Accepted Answer · 2019-01-17 16:15:13Z

The main advantage is that the arrays are lazy loaded. That is, if you have an npz file with 100 arrays you can load the file without actually loading any of the data. If you request a single array, only the data for that array is loaded.
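A small sketch of that behaviour, using 100 made-up arrays (the block_N names are hypothetical):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "many.npz")

# Save 100 small arrays under illustrative names block_0 ... block_99.
np.savez(path, **{f"block_{i}": np.full(1000, i) for i in range(100)})

# Opening the archive reads the zip index, not the array data.
npz = np.load(path)
names = npz.files
print(len(names))

# Only when a key is requested is that member read and decoded.
one = npz["block_42"]
print(one.shape, one[0])

npz.close()  # release the file handle when done
```

The other 99 arrays are never read from disk in this example.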

A downside to npz files is they can't be memory mapped (using load(<file>, mmap_mode='r')), so for large arrays they may not be the best choice. For data where the arrays have a common shape I'd suggest taking a look at structured arrays. These can be memory mapped, allow accessing data with dict-like syntax (i.e., arr['field']), and are very efficient memory wise.
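For comparison, here is a rough sketch of the structured-array alternative, saved as a single .npy file and opened with mmap_mode='r'. The field names (x, mass) and the record count are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: two named fields per record.
record_dtype = np.dtype([("x", np.float64), ("mass", np.float32)])
records = np.zeros(1000, dtype=record_dtype)
records["mass"] = np.arange(1000, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "records.npy")
np.save(path, records)

# Memory-map the .npy file instead of reading it into memory.
mapped = np.load(path, mmap_mode="r")
print(type(mapped).__name__)

# Dict-like field access works on the mapped array as well.
masses = mapped["mass"]

# Copy only a small slice into a regular in-memory array.
mass_slice = np.array(masses[10:13])
print(mass_slice)
```

Only the pages backing the requested slice need to be read from disk, which is the property an .npz archive cannot offer.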

Are they compressed or not though? do they occupy less memory when you save them?
– SuperCiocia
Commented Jan 18, 2019 at 14:13

@SuperCiocia, No, from the docs you've included: "The archive is not compressed".
– user2699
Commented Jan 18, 2019 at 14:31

Yeah exactly, hence why i don't see the point. The memory of npz is the same as npy then no?
– SuperCiocia
Commented Jan 18, 2019 at 14:39

But what's the difference between saving my 100 arrays as 100 .npy arrays and only loading the one I want, and saving them as 1 .npz file and requesting a single of the those arrays?
– SuperCiocia
Commented Jan 18, 2019 at 19:40

@SuperCiocia adding to the comment from @Eureka, I wanted to add that the .npz-format has basically 2 advantages over single files. First: All stored sub-arrays are indexed in the header, which makes random read on multiple arrays faster than having to query the filesystem index for each file. Second as @Eureka said, if you have a lot of small files, moving them or batch updating will be a lot faster. However, if you only always need a single array from certain context or want to do a high amount of individual updates which cannot be batched, you're better off using single files.
– Corsair
Commented Dec 4, 2021 at 1:31
