165

I expected array.array to be faster than lists, as arrays seem to be unboxed.

However, I get the following result:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

What could be the cause of such a difference?

4
  • 7
    numpy tools can exploit your array efficiently: %timeit np.sum(A): 100 loops, best of 3: 8.87 ms per loop
    – B. M.
    Commented Apr 22, 2016 at 11:38
  • 8
    I've never come across a situation where I've needed to use the array package. If you want to do significant amounts of math, Numpy operates at light speed (i.e. at C speed), and usually does better than naive implementations of things like sum().
    – Nick T
    Commented Apr 22, 2016 at 15:01
  • 42
    Close voters: Why exactly is this opinion-based? OP appears to be asking a specific, technical question about a measurable and repeatable phenomenon.
    – Kevin
    Commented Apr 22, 2016 at 20:05
  • 5
    @NickT Read An optimization anecdote. It turns out array is pretty fast at converting a sequence of integers (representing ASCII bytes) to a str object (a quick sketch of that trick, updated for Python 3, follows these comments). Guido himself only came up with this after lots of other solutions and was quite surprised by the performance. Anyway, this is the only place where I remember seeing it being useful. numpy is much better for dealing with arrays, but it's a 3rd-party dependency.
    – Bakuriu
    Commented Apr 23, 2016 at 7:01
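
A minimal Python 3 sketch of the trick mentioned in that comment; this is an illustration rather than the anecdote's original Python 2 code, and bytes(list_of_ints) is the modern built-in shortcut:

# Turn a list of ASCII code points into a str without a Python-level loop.
from array import array

codes = [72, 101, 108, 108, 111]                      # ASCII for "Hello"
print(array('B', codes).tobytes().decode('ascii'))    # 'Hello' via array
print(bytes(codes).decode('ascii'))                   # 'Hello' via built-in bytes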

5 Answers

240
+50

The storage is "unboxed", but every time you access an element Python has to "box" it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.

So, in the end, an array is generally slower, but requires substantially less memory.


Here's the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.

Here's the code to access a list item:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

There's very little to it: somelist[i] just returns the i'th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).

And here's the __getitem__ implementation for an array with type code l:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

The raw memory is treated as a vector of platform-native C long integers; the i'th C long is read out; and then PyLong_FromLong() is called to wrap ("box") that native C long in a Python long object (which, since Python 3 eliminates Python 2's distinction between int and long, is displayed as type int).

This boxing has to allocate new memory for a Python int object, and spray the native C long's bits into it. In the context of the original example, this object's lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.
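
A small sketch that makes this boxing visible from pure Python (a rough illustration; timings and exact figures depend on the machine and CPython version):

import array
import timeit

L = list(range(10**6))
A = array.array('l', range(10**6))

# The list hands back the very same int object it already stores;
# the array has to build a fresh int object on every access.
print(L[999999] is L[999999])   # True:  same boxed object both times
print(A[999999] is A[999999])   # False: a new int is created per access

# The extra allocation shows up even for a single indexed access.
print(timeit.timeit('L[999999]', globals=globals(), number=10**6))
print(timeit.timeit('A[999999]', globals=globals(), number=10**6))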

This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.

6
  • Sorry, naive question here: I don't get how you're using that %timeit operator. Does Python have something like %timeit?
    – OmGanesh
    Commented Oct 11, 2021 at 13:15
  • 2
    @OmGanesh %timeit is an IPython magic command: ipython.readthedocs.io/en/stable/interactive/…
    – Arne
    Commented Oct 16, 2021 at 21:18
  • Your last paragraph means there's no other reason, but that doesn't seem true.
    – no comment
    Commented May 2 at 18:34
  • I'm unclear on what "the reason" is you think I gave, but in fact I gave more than one reason. For example, "boxing" a raw machine int also requires digging into its guts to convert it to Python's internal integer format. In -lst[420], the lst[420] part is already in Python's internal integer format. Negating it just flips the internal sign bit (Python ints do not use 2's complement internally - they store the absolute value with a separate sign bit).
    – Tim Peters
    Commented May 2 at 18:47
  • The reason being the boxing. If you gave another, I'm not seeing it. And negating doesn't just flip the sign inside an existing int object but creates a new int object with the modified data, doesn't it? Sounds like a comparable amount of work as the boxing that the array access has to do. Also, even lst[420]+123456789 is similarly much faster (37 ns versus 51 ns for just arr[420]).
    – no comment
    Commented May 2 at 19:12
94

To add to Tim Peters' excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:

  1. You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
  2. PyPy's C API works to some extent, but isn't very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
  3. C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.

Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you're trying to do. It will also be much faster than native Python, if used correctly.
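
To see the buffer-protocol difference without leaving Python, here is a minimal sketch; it assumes a platform where the 'l' typecode is a 64-bit C long, so the NumPy dtype below may need adjusting:

import array
import numpy as np

A = array.array('l', range(10**6))

m = memoryview(A)             # works: the array exposes its raw buffer
print(m.format, m.nbytes)     # 'l' and the raw byte count, no boxing involved
try:
    memoryview([1, 2, 3])
except TypeError as err:
    print(err)                # lists cannot be viewed this way

# A zero-copy NumPy view over the same memory; the sum never boxes an element.
N = np.frombuffer(A, dtype=np.int64)
print(N.sum() == sum(A))      # True, but N.sum() stays entirely in C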

13

Tim Peters answered why this is slow, but let's see how to improve it.

Sticking to your example of sum(range(...)) (factor 10 smaller than your example to fit into memory here):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

Used this way, NumPy also needs to box/unbox each element, which adds overhead. To make it fast, one has to stay within the NumPy C code:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

So going from the list solution to the NumPy version is a factor of 16 in runtime.

Let's also check how long creating those data structures takes

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

Clear winner: Numpy

Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.

Memory usage of those:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

So these take 8 bytes per number with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory.

N=numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

But it turns out that adding 64-bit ints is faster than 32-bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.

1
  • sys.getsizeof(L) is only counting the memory of the list object, not the memory of all the int objects, which is much larger. You should add that.
    – no comment
    Commented May 2 at 18:31
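
As the comment above points out, sys.getsizeof(L) only measures the list's array of pointers. A rough sketch of counting the referenced int objects as well (numbers are approximate and CPython-specific):

import array
import sys

L = list(range(10**7))
A = array.array('l', L)

shallow = sys.getsizeof(L)
deep = shallow + sum(sys.getsizeof(i) for i in L)
print(shallow)              # the pointer array only, about 8 bytes per entry on a 64-bit build
print(deep)                 # plus roughly 28 bytes per boxed int object
print(sys.getsizeof(A))     # the raw 8-byte C longs, no per-element objects
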
-1

I noticed that typecode 'L' is faster than 'l', and the same holds for 'I' versus 'i' and 'Q' versus 'q'.

Python 3.8.5

Here is the code for the test. Check it out d_d.

#!/usr/bin/python3
import inspect
import time  # used by performtest() below

from tqdm import tqdm
from array import array


def get_var_name(var):
    """
    Gets the name of var. Does it from the out most frame inner-wards.
    :param var: variable to get name from.
    :return: string
    """
    for fi in reversed(inspect.stack()):
        names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
        if len(names) > 0:
            return names[0]

def performtest(func, n, *args, **kwargs):

    times = array('f')
    times_append = times.append
    for i in tqdm(range(n)):
        st = time.time()
        func(*args, **kwargs)
        times_append(time.time() - st)
    print(
        f"Func {func.__name__} with {[get_var_name(i) for i in args]} run {n} rounds consuming |"
        f" Mean: {sum(times)/len(times)}s | Max: {max(times)}s | Min: {min(times)}s"
    )

def list_int(start, end, step=1):
    return [i for i in range(start, end, step)]

def list_float(start, end, step=1):
    return [i + 1e-1 for i in range(start, end, step)]

def array_int(start, end, step=1):
    return array("I", range(start, end, step)) # speed I > i, H > h, Q > q, I~=H~=Q

def array_float(start, end, step=1):
    return array("f", [i + 1e-1 for i in range(start, end, step)]) # speed f > d


if __name__ == "__main__":

    performtest(list_int, 1000, 0, 10000)
    performtest(array_int, 1000, 0, 10000)

    performtest(list_float, 1000, 0, 10000)
    performtest(array_float, 1000, 0, 10000)

Results

(The timing results were posted as an image: "Result of the test".)
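
For a quicker check of the same claim, a minimal sketch that times construction for each typecode directly (results depend on platform, value range, and CPython version):

import timeit
from array import array

# Compare construction speed of signed vs. unsigned typecodes.
for code in ('l', 'L', 'i', 'I', 'q', 'Q'):
    t = timeit.timeit(lambda: array(code, range(10**5)), number=100)
    print(code, f'{t:.3f}s')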

2
  • 1
    Welcome to Stack Overflow. Please read How to Answer. In particular, please make sure your answers actually address the question being asked. OP is asking about the cause of some slowness. You comment about some code being faster than other code, but say nothing about why it might be slow.
    – Chris
    Commented Jul 11, 2022 at 22:46
  • Thanks for your friendly reminder. I will read the guidance and keep that in mind.
    – Ian Lee
    Commented Aug 31, 2022 at 1:27
-2

Please note that 100000000 equals 10^8, not 10^7, and my results are as follows:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153
