I have this code; it generates 2,000,000 points uniformly distributed in a bounding box and partitions them based on some criteria.

import numpy as np

from draw import draw
import time

X = 0
Y = 1
N = 2000000

max_x = -100000
max_y = -100000
min_x = 100000
min_y = 100000

points = np.random.uniform(-10, 10, (N, 2))

start = time.time()
max_x_index = np.argmax(points[:, X])
max_y_index = np.argmax(points[:, Y])
min_x_index = np.argmin(points[:, X])
min_y_index = np.argmin(points[:, Y])

p_right = points[max_x_index]
p_top = points[max_y_index]
p_left = points[min_x_index]
p_bottom = points[min_y_index]

top_right = points[
    points[:, X] > ((points[:, Y] - p_top[Y]) / (p_right[Y] - p_top[Y])) * (p_right[X] - p_top[X]) + p_top[X]]
top_left = points[
    points[:, X] < ((points[:, Y] - p_top[Y]) / (p_left[Y] - p_top[Y])) * (p_left[X] - p_top[X]) + p_top[X]]
bottom_right = points[
    points[:, X] > ((points[:, Y] - p_bottom[Y]) / (p_right[Y] - p_bottom[Y])) * (p_right[X] - p_bottom[X]) + p_bottom[X]]
bottom_left = points[
    points[:, X] < ((points[:, Y] - p_bottom[Y]) / (p_left[Y] - p_bottom[Y])) * (p_left[X] - p_bottom[X]) + p_bottom[X]]

end = time.time()
print(end - start)

The output is usually about 0.09 seconds. The most time-consuming part is the last four computations, which produce top_right, top_left, bottom_right, and bottom_left. I rewrote the code as follows:

import numpy as np
from draw import draw
import time
import multiprocessing

N = 2000000
X = 0
Y = 1
points = np.random.uniform(-10, 10, (N, 2))

max_x = -100000
max_y = -100000
min_x = 100000
min_y = 100000

manager = multiprocessing.Manager()
top_right = manager.list()
top_left = manager.list()
bottom_right = manager.list()
bottom_left = manager.list()


def set_top_right():
    global X, Y, points, p_top, p_right, top_right
    top_right.extend(points[
        points[:, X] > ((points[:, Y] - p_top[Y]) / (p_right[Y] - p_top[Y])) * (p_right[X] - p_top[X]) + p_top[X]])


def set_top_left():
    global X, Y, points, p_top, p_left, top_left
    top_left.extend(points[
        points[:, X] < ((points[:, Y] - p_top[Y]) / (p_left[Y] - p_top[Y])) * (p_left[X] - p_top[X]) + p_top[X]])


def set_bottom_right():
    global X, Y, points, p_bottom, p_right, bottom_right
    bottom_right.extend(points[
        points[:, X] > ((points[:, Y] - p_bottom[Y]) / (p_right[Y] - p_bottom[Y])) * (p_right[X] - p_bottom[X]) +
        p_bottom[X]])


def set_bottom_left():
    global X, Y, points, p_bottom, p_left, bottom_left
    bottom_left.extend(points[
        points[:, X] < ((points[:, Y] - p_bottom[Y]) / (p_left[Y] - p_bottom[Y])) * (p_left[X] - p_bottom[X]) +
        p_bottom[X]])


start = time.time()
max_x_index = np.argmax(points[:, X])
max_y_index = np.argmax(points[:, Y])
min_x_index = np.argmin(points[:, X])
min_y_index = np.argmin(points[:, Y])

p_right = points[max_x_index]
p_top = points[max_y_index]
p_left = points[min_x_index]
p_bottom = points[min_y_index]

p1 = multiprocessing.Process(target=set_top_right)
p2 = multiprocessing.Process(target=set_top_left)
p3 = multiprocessing.Process(target=set_bottom_right)
p4 = multiprocessing.Process(target=set_bottom_left)

p1.start()
p2.start()
p3.start()
p4.start()

p1.join()
p2.join()
p3.join()
p4.join()

end = time.time()

print(end - start)

and surprisingly it became worse, about 0.15 seconds. I'm almost new to Python; however, I guess both approaches are single-threaded and there are no I/O operations. My laptop CPU is an 11th-generation Core i5, and I expected each core to take one of the processes and make it faster. So why is this slower?

  • Have you done some profiling? Where is the bottleneck?
    – jabaa
    Commented Jul 7 at 20:34
  • @jabaa Yes. p1.start() p2.start() p3.start() p4.start() takes 0.14 seconds.
    – M a m a D
    Commented Jul 7 at 20:46
  • It could be; that's the overhead of multiprocessing. Do you have profiling data for set_top_right, set_top_left, set_bottom_right and set_bottom_left? If each takes ~0.01-0.02 seconds, it's probably the overhead.
    – jabaa
    Commented Jul 7 at 21:01
  • You compare a shared-memory list to a numpy ndarray. That has a ton of serialization overhead. Honestly, kudos to the shared-memory implementation for only being slightly slower.
    – Homer512
    Commented Jul 7 at 21:15
  • Working with numpy is often orders of magnitude faster than working with a list, especially a Manager list, which requires serialization and deserialization. Note that you are still copying points and all the other state in each subprocess. Using a Manager list is unlikely to beat serial numpy. Commented Jul 7 at 23:07

1 Answer

Multiprocessing with a multiprocessing.Manager list is not going to be faster: you incur IPC overhead from copying state between processes. You may be able to use multiprocessing.shared_memory with numpy arrays efficiently, but I would also suggest trying numexpr, which is supposed to help with exactly this use case: many element-wise operations that create needless intermediate arrays. Here is a simple test:

import numpy as np
import numexpr as ne

import time


X = 0
Y = 1
N = 2000000

max_x = -100000
max_y = -100000
min_x = 100000
min_y = 100000

points = np.random.uniform(-10, 10, (N, 2))

max_x_index = np.argmax(points[:, X])
max_y_index = np.argmax(points[:, Y])
min_x_index = np.argmin(points[:, X])
min_y_index = np.argmin(points[:, Y])

p_right = points[max_x_index]
p_top = points[max_y_index]
p_left = points[min_x_index]
p_bottom = points[min_y_index]

start = time.perf_counter()

def pure_numpy():
    top_right = points[
        points[:, X] > ((points[:, Y] - p_top[Y]) / (p_right[Y] - p_top[Y])) * (p_right[X] - p_top[X]) + p_top[X]
    ]
    top_left = points[
        points[:, X] < ((points[:, Y] - p_top[Y]) / (p_left[Y] - p_top[Y])) * (p_left[X] - p_top[X]) + p_top[X]
    ]
    bottom_right = points[
        points[:, X] > ((points[:, Y] - p_bottom[Y]) / (p_right[Y] - p_bottom[Y])) * (p_right[X] - p_bottom[X]) + p_bottom[X]
    ]
    bottom_left = points[
        points[:, X] < ((points[:, Y] - p_bottom[Y]) / (p_left[Y] - p_bottom[Y])) * (p_left[X] - p_bottom[X]) + p_bottom[X]
    ]

for _ in range(100):
    pure_numpy()

print(time.perf_counter() - start)

start = time.perf_counter()

def using_numexpr():
    top_right = points[
        ne.evaluate(
            # points[:, X] > ((points[:, Y] - p_top[Y]) / (p_right[Y] - p_top[Y])) * (p_right[X] - p_top[X]) + p_top[X]
            "pointsX > ((pointsY - p_topY) / (p_rightY - p_topY)) * (p_rightX - p_topX) + p_topX",
            dict(pointsX=points[:, X], pointsY=points[:, Y], p_topY=p_top[Y], p_rightY=p_right[Y], p_rightX=p_right[X], p_topX=p_top[X])
        )
    ]
    top_left = points[
        # points[:, X] < ((points[:, Y] - p_top[Y]) / (p_left[Y] - p_top[Y])) * (p_left[X] - p_top[X]) + p_top[X]
        ne.evaluate(
            "pointsX < ((pointsY - p_topY) / (p_leftY - p_topY)) * (p_leftX - p_topX) + p_topX",
            dict(pointsX=points[:, X], pointsY=points[:, Y], p_topY=p_top[Y], p_leftY=p_left[Y], p_leftX=p_left[X], p_topX=p_top[X])
        )
    ]
    bottom_right = points[
        # points[:, X] > ((points[:, Y] - p_bottom[Y]) / (p_right[Y] - p_bottom[Y])) * (p_right[X] - p_bottom[X]) + p_bottom[X]
        ne.evaluate(
            "pointsX > ((pointsY - p_bottomY) / (p_rightY - p_bottomY)) * (p_rightX - p_bottomX) + p_bottomX",
            dict(pointsX=points[:, X], pointsY=points[:, Y], p_bottomY=p_bottom[Y], p_rightY=p_right[Y], p_rightX=p_right[X], p_bottomX=p_bottom[X])
        )
    ]
    bottom_left = points[
        # points[:, X] < ((points[:, Y] - p_bottom[Y]) / (p_left[Y] - p_bottom[Y])) * (p_left[X] - p_bottom[X]) + p_bottom[X]
        ne.evaluate(
            "pointsX < ((pointsY - p_bottomY) / (p_leftY - p_bottomY)) * (p_leftX - p_bottomX) + p_bottomX",
            dict(pointsX=points[:, X], pointsY=points[:, Y], p_bottomY=p_bottom[Y], p_leftY=p_left[Y], p_leftX=p_left[X], p_bottomX=p_bottom[X])
        )
    ]

for _ in range(100):
    using_numexpr()

print(time.perf_counter() - start)

I get results like:

(py312) Juans-MBP:foo juan$ python testing.py
10.898050098097883
7.235401042038575

So consistently faster.

I'm on a 10-year-old laptop with only 4 logical cores; numexpr leverages multiple cores, so it could do even better on a newer machine. You can also look into other ways to improve performance, such as using Intel's Vector Math Library.
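
For what it's worth, numexpr also exposes its thread pool, so you can check how many cores it detected and cap the number of worker threads yourself. A minimal sketch (detect_number_of_cores and set_num_threads are numexpr's public API; the right thread count for your machine is something you'd have to measure):

import numexpr as ne

# How many logical cores numexpr detected on this machine.
print(ne.detect_number_of_cores())

# Cap the threads used by ne.evaluate(); the call returns the previous setting.
previous = ne.set_num_threads(4)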

If you do want to use multiprocessing, you can use multiprocessing.shared_memory, but it requires some care to reconstruct a numpy array object from the shared memory (keeping track of dtype and shape). See this answer to another question specifically about how to do that.
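
Very roughly, the pattern looks like this. This is a minimal sketch, not tuned or tested against your workload; the worker function and its arguments are placeholder names, while the shared_memory and np.ndarray calls are the real API:

import numpy as np
from multiprocessing import Process, shared_memory


def worker(shm_name, shape, dtype):
    # Re-attach to the block by name and rebuild an array view; nothing is copied.
    existing = shared_memory.SharedMemory(name=shm_name)
    view = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    # ... compute one of the four boolean masks on `view` here ...
    existing.close()


if __name__ == "__main__":
    points = np.random.uniform(-10, 10, (2000000, 2))

    # Allocate one shared block and copy the points into it once.
    shm = shared_memory.SharedMemory(create=True, size=points.nbytes)
    shared_points = np.ndarray(points.shape, dtype=points.dtype, buffer=shm.buf)
    shared_points[:] = points

    # One process per region in the real code; a single worker shown here.
    p = Process(target=worker, args=(shm.name, points.shape, points.dtype))
    p.start()
    p.join()

    shm.close()
    shm.unlink()  # release the block once all processes are done with it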

  • It took only 0.03 seconds. Great, thank you! Could you please recommend some good references for the method you used? This is just the first part of an algorithm in computational geometry, and there are still many remaining parts that I need to implement using multiprocessing, so I need to learn.
    – M a m a D
    Commented Jul 8 at 8:08
  • @MamaD I didn't use multiprocessing; I used numexpr, a numpy-specific library. I linked to the numexpr user guide above, but here it is again. I just translated your code directly. Read more about the library at the link; it's basically designed for this exact use case. Commented Jul 8 at 8:34
  • The algorithm must recursively repeat a similar process for the points in the four regions top_right, top_left, bottom_right, and bottom_left. Do you recommend using multiprocessing for such a recursive process?
    – M a m a D
    Commented Jul 8 at 13:25
  • @MamaD this answer recommends numexpr, not multiprocessing. Multiprocessing would be another approach altogether, which you definitely shouldn't combine with numexpr, because numexpr already leverages all the cores. Multiprocessing with shared memory is a bit complex, but feasible; you'd have to test it to see how it compares to this numexpr approach. Commented Jul 8 at 20:04
