
Requirements:

  • I need to grow an array arbitrarily large from data.
  • I can guess the size (roughly 100-200 rows), but there is no guarantee that guess will fit every time
  • Once it is grown to its final size, I need to perform numeric computations on it, so I'd prefer to eventually get to a 2-D numpy array.
  • Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (taking 150 s or so) and the finalize() method is called 500k times (taking 106 s in total) ... about 250 s altogether.

Here is my code:

import numpy as np

class A:
    def __init__(self):
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        return np.array(self.data)

Other things I tried include the following code ... but this is waaaaay slower.

class A:
    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        self.data = np.append(self.data, row)  # np.append returns a new, larger copy

    def finalize(self):
        return np.reshape(self.data, newshape=(self.data.shape[0] // 5, 5))

Here is a schematic of how this is called:

for i in range(500000):
    ax = A()
    for j in range(200):
        ax.update([1,2,3,4,5])
    ax.finalize()
    # some processing on ax

6 Answers


I tried a few different things, with timing. All snippets assume import numpy as np at the top.

  1. The method you mention as slow: (32.094 seconds)

    class A:
    
        def __init__(self):
            self.data = np.array([])
    
        def update(self, row):
            self.data = np.append(self.data, row)
    
        def finalize(self):
            return np.reshape(self.data, newshape=(self.data.shape[0] // 5, 5))
    
  2. Regular ol Python list: (0.308 seconds)

    class B:
    
        def __init__(self):
            self.data = []
    
        def update(self, row):
            for r in row:
                self.data.append(r)
    
        def finalize(self):
            return np.reshape(self.data, newshape=(len(self.data) // 5, 5))
    
  3. Trying to implement an arraylist in numpy: (0.362 seconds)

    class C:
    
        def __init__(self):
            self.data = np.zeros((100,))
            self.capacity = 100
            self.size = 0
    
        def update(self, row):
            for r in row:
                self.add(r)
    
        def add(self, x):
            if self.size == self.capacity:
                self.capacity *= 4
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data
                self.data = newdata
    
            self.data[self.size] = x
            self.size += 1
    
        def finalize(self):
            data = self.data[:self.size]
            return np.reshape(data, newshape=(len(data) // 5, 5))
    

And this is how I timed it:

x = C()
for i in xrange(100000):
    x.update([i])

So it looks like regular old Python lists are pretty good ;)

  • I think the comparison is clearer with 60M update and 500K finalize calls. It looks like you've not called finalize in this example.
    – fodon, Aug 20, 2011 at 23:29
  • @fodon I actually did call finalize -- once per run (so I guess not really much of an impact). But this makes me think maybe I misunderstood how your data is growing: if you get 60M in on an update, I was thinking this would provide at least 60M of data for the next finalize?
    – Owen, Aug 20, 2011 at 23:32
  • @Owen 60M and 500K mean 60 million and 500 thousand calls to update and finalize respectively. See my revised timing, which tests a 100:1 ratio of update to finalize.
    – Aug 21, 2011 at 3:37
  • Note that the third option is superior when you're running out of memory (see the sketch after this list). The second option requires lots of memory. The reason is that Python's lists are arrays of references to values, whereas NumPy's arrays are actual arrays of values.
    – Fabianius, Jul 18, 2016 at 12:50
  • You could simplify update in the second version by replacing the for loop with self.data.extend(row); I don't think there would be a performance difference, but it looks nicer too.
    – Oct 6, 2020 at 10:25
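
As a rough, hypothetical illustration of Fabianius's point about memory (exact sizes vary by Python version and platform):

    import sys
    import numpy as np

    n = 1000000
    lst = [float(i) for i in range(n)]    # n pointers + n separate float objects
    arr = np.arange(n, dtype=np.float64)  # n raw 8-byte doubles in one block

    print(sys.getsizeof(lst))  # counts only the pointer array (~8 MB);
                               # the float objects (~24 bytes each) come on top
    print(arr.nbytes)          # 8 MB covers all of the ndarray's data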

np.append() copies all the data in the array every time it is called, whereas a list grows its capacity by a factor (about 1.125) and so only reallocates occasionally. Lists are fast, but their memory usage is larger than an array's. You can use the array module of the Python standard library if you care about memory (a short sketch follows below).

Here is a discussion about this topic:

How to create a dynamic array
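
For illustration, here is a minimal sketch (mine, not from the answer above) of accumulating values with the stdlib array module and converting to a 2-D numpy array at the end; the row width of 5 matches the question:

    from array import array
    import numpy as np

    buf = array('d')  # contiguous C doubles; grows like a list but stores raw values
    for i in range(1000):
        buf.extend([1.0, 2.0, 3.0, 4.0, 5.0])

    # np.frombuffer creates a view over the same memory (no copy); then reshape
    dx = np.frombuffer(buf, dtype=np.float64).reshape(-1, 5)

Note the view shares memory with buf, so only create it once accumulation is finished; a later reallocation of buf would leave it pointing at stale memory.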

  • Is there a way to change the factor by which the list grows?
    – fodon, Aug 21, 2011 at 0:53
  • The time np.append() consumes increases exponentially with the number of elements.
    – May 4, 2017 at 9:38
  • ^ Linear per call (i.e. the total accumulated time is quadratic), not exponential.
    – Mar 18, 2019 at 23:57

Using the class declarations from Owen's post, here is a revised timing that includes the effect of finalize.

In short, I find class C to provide an implementation that is over 60x faster than the method in the original post. (Apologies for the wall of text.)

The file I used:

#!/usr/bin/python
import cProfile
import numpy as np

# ... class declarations here ...

def test_class(f):
    x = f()
    for i in xrange(100000):
        x.update([i])
    for i in xrange(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

A:

     903005 function calls in 16.049 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000   16.049   16.049 <string>:1(<module>)
100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
  1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
     1    0.146    0.146   16.049   16.049 test.py:50(test_class)
     1    0.000    0.000    0.000    0.000 test.py:6(__init__)
100000    1.475    0.000   15.899    0.000 test.py:9(update)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

B:

     208004 function calls in 16.885 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.001    0.001   16.885   16.885 <string>:1(<module>)
  1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
  1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
  1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
     1    0.000    0.000    0.000    0.000 test.py:16(__init__)
100000    0.068    0.000    0.080    0.000 test.py:19(update)
  1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
     1    0.284    0.284   16.883   16.883 test.py:50(test_class)
  1000    0.005    0.000    0.005    0.000 {getattr}
  1000    0.001    0.000    0.001    0.000 {len}
100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
  1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

C:

     204010 function calls in 0.244 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.244    0.244 <string>:1(<module>)
  1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
     1    0.000    0.000    0.000    0.000 test.py:27(__init__)
100000    0.082    0.000    0.170    0.000 test.py:32(update)
100000    0.087    0.000    0.088    0.000 test.py:36(add)
  1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
     1    0.068    0.068    0.243    0.243 test.py:50(test_class)
  1000    0.000    0.000    0.000    0.000 {len}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
     6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A is destroyed by the updates, class B is destroyed by the finalizes. Class C is robust in the face of both of them.

  • update is done n times, then finalize is called once; this whole process is repeated m times (otherwise there is no data to finalize). Also, when comparing with the original post, do you mean the first version (array.append + numpy conversion) or the second (numpy.append + reshape)?
    – fodon, Aug 21, 2011 at 14:27
  • cProfile. It's the first import and the last line invoked in my code snippet.
    – Jul 12, 2012 at 5:47

There is a big performance difference in the function that you use for finalization. Consider the following code:

import numpy as np
from time import time

N=100000
nruns=5

a=[]
for i in range(N):
    a.append(np.zeros(1000))

print "start"

b=[]
for i in range(nruns):
    s=time()
    c=np.vstack(a)
    b.append((time()-s))
print "Timing version vstack ",np.mean(b)

b=[]
for i in range(nruns):
    s=time()
    c1=np.reshape(a,(N,1000))
    b.append((time()-s))

print "Timing version reshape ",np.mean(b)

b=[]
for i in range(nruns):
    s=time()
    c2=np.concatenate(a,axis=0).reshape(-1,1000)
    b.append((time()-s))

print "Timing version concatenate ",np.mean(b)

print c.shape,c2.shape
assert (c==c2).all()
assert (c==c1).all()

Using concatenate seems to be twice as fast as the first version and more than 10 times faster than the second version.

Timing version vstack  1.5774928093
Timing version reshape  9.67419199944
Timing version concatenate  0.669512557983

Multidimensional NumPy Arrays

Adding to Owen's and Prashant Kumar's answers, here is a version using multidimensional numpy arrays (i.e. shape) that speeds up the numpy-based solutions. This is especially helpful if you need to access (finalize()) the data often.

Version                 Prashant Kumar    row_length=1    row_length=5
Class A - np.append     2.873 s           2.776 s         0.682 s
Class B - python list   6.693 s           80.868 s        22.012 s
Class C - arraylist     0.095 s           0.180 s         0.043 s

The column Prashant Kumar is his example executed on my machine, for comparison. With row_length=5 it matches the example from the original question. The dramatic increase for the python list comes from {built-in method numpy.array}: numpy needs much more time to convert a multidimensional list of lists to an array than to convert a 1-D list and reshape it, even when both have the same number of entries, e.g. np.array([[1,2,3]]*5) vs. np.array([1]*15).reshape((-1,3)).

And this is the code:

import cProfile
import numpy as np

class A:
    def __init__(self,shape=(0,), dtype=float):
        """First item of shape is ingnored, the rest defines the shape"""
        self.data = np.array([], dtype=dtype).reshape((0,*shape[1:]))

    def update(self, row):
        self.data = np.append(self.data, row)

    def finalize(self):
        return self.data
    
    
class B:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ingnored, the rest defines the shape"""
        self.shape = shape
        self.dtype = dtype 
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        return np.array(self.data, dtype=self.dtype).reshape((-1, *self.shape[1:]))
    
    
class C:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ingnored, the rest defines the shape"""
        self.shape = shape
        self.data = np.zeros((100,*shape[1:]),dtype=dtype)
        self.capacity = 100
        self.size = 0

    def update(self, x):
        if self.size == self.capacity:
            self.capacity *= 4
            newdata = np.zeros((self.capacity, *self.data.shape[1:]), dtype=self.data.dtype)
            newdata[:self.size] = self.data
            self.data = newdata

        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        return self.data[:self.size]
    

def test_class(f):
    row_length = 5
    x = f(shape=(0,row_length))
    for i in range(int(100000/row_length)):
        x.update([i]*row_length)
    for i in range(1000):
        x.finalize()

for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

And here is another option to add to Luca Fiaschi's post above.

# assumes a, N, and nruns as defined in Luca Fiaschi's snippet, plus "import time"
b=[]
for i in range(nruns):
    s=time.time()
    c1=np.array(a, dtype=int).reshape((N,1000))
    b.append((time.time()-s))
    
print("Timing version array.reshape ",np.mean(b))

The timing result for me is:

Timing version vstack         0.6863266944885253
Timing version reshape        0.505419111251831
Timing version array.reshape  0.5052066326141358
Timing version concatenate    0.5339600563049316
  • A 4th method: grow an io.BytesIO() via .write(np.float32(1).data.obj); that way you don't have to care about capacity at all. Finalization is just assigning bytes_io.getbuffer() to a read-only numpy ndarray (sketched below).
    – iperov, Feb 25 at 10:40
  • @iperov: thanks for the 4th solution. Happy to compare it with the others. Can you make an example with ndarrays?
    – kho, Feb 27 at 9:01
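
For concreteness, here is a rough, untimed sketch of the BytesIO approach iperov describes; the float32 dtype and 5-wide rows are illustrative assumptions, not part of the original comment:

    import io
    import numpy as np

    bio = io.BytesIO()
    for i in range(1000):
        # append raw bytes; BytesIO manages its own capacity growth
        bio.write(np.float32([i, i, i, i, i]).tobytes())

    # finalize: view the accumulated buffer as an ndarray (no copy), rows of 5
    dx = np.frombuffer(bio.getbuffer(), dtype=np.float32).reshape(-1, 5)

While the memoryview from getbuffer() is alive, the BytesIO cannot be resized, so this view should only be taken once accumulation is finished.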

If you want to improve performance with list operations, have a look at the blist library. It is an optimized implementation of the Python list and other structures.

I haven't benchmarked it yet, but the results on their page seem promising.
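
A minimal sketch of how it might slot in (assuming blist is installed via pip install blist; it aims to be a drop-in replacement for the built-in list):

    from blist import blist
    import numpy as np

    data = blist()
    for i in range(1000):
        data.extend([1, 2, 3, 4, 5])  # same API as the built-in list

    dx = np.array(data).reshape(-1, 5)  # one conversion at finalize time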
