
3.13.0b1 Performance Issue with Free Threading build #120040

Open
xbit18 opened this issue Jun 4, 2024 · 19 comments
Labels
performance (Performance or resource usage), topic-free-threading, type-bug (An unexpected behavior, bug, or error)

Comments

@xbit18

xbit18 commented Jun 4, 2024

Bug report

Bug description:

Hello, I'm writing a thesis on free-threading Python, so I'm testing 3.13.0b1 built with --disable-gil.
I installed it with pyenv using this command

env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1

I didn't specify --enable-optimizations or --with-lto because the build fails with those options.
Now, I'm writing a benchmark to compare free-threading Python with past versions of standard Python, and even with the nogil Python 3.9.10.
Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the matrix into rows and distributes the rows across a specified number of threads. This is the complete code:

import threading
import time
import random

def multiply_row(A, B, row_index, result):
    # Compute the row result
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        acc = 0  # avoid shadowing the built-in sum()
        for k in range(num_columns_A):
            acc += A[row_index][k] * B[k][j]
        result[row_index][j] = acc

def parallel_matrix_multiplication(a, b, result, row_indices):
    for row_index in row_indices:
        multiply_row(a, b, row_index, result)

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    row_chunk = num_rows // num_threads

    threads = []
    for i in range(num_threads):
        start_row = i * row_chunk
        end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
        thread = threading.Thread(target=parallel_matrix_multiplication, args=(a, b, result, range(start_row, end_row)))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads

    start = time.perf_counter()
    
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()

When I ran this code under 3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, and 3.12.2, the maximum running time was ~13 seconds (standard 3.9.10) and the minimum was ~5 seconds (nogil 3.9.10).
When I run it with 3.13.0b1, the time skyrockets to ~48 seconds.
I tried using cProfile to profile the code, but under 3.13 it freezes and never outputs anything (it works with other versions); the CPU sits at 100% usage and the process never stops unless I kill it. That 100% figure makes me think it isn't using multiple cores, since nogil 3.9 goes above 600%.

The basic Fibonacci test works like a charm, so I know the --disable-gil build succeeded.

All of this is done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.

CPython versions tested on:

3.9, 3.10, 3.11, 3.12, 3.13

Operating systems tested on:

macOS

@xbit18 xbit18 added the type-bug label Jun 4, 2024
@Eclips4
Member

Eclips4 commented Jun 4, 2024

Duplicate of #118749

@Eclips4 Eclips4 marked this as a duplicate of #118749 Jun 4, 2024
@Eclips4 Eclips4 closed this as not planned Jun 4, 2024
@xbit18
Author

xbit18 commented Jun 4, 2024

Doesn't seem like a duplicate to me. The version is different (he was using 3.13.0a6, mine is beta 1), and he had problems with the Fibonacci script, which works fine for me. @Eclips4

@Eclips4 Eclips4 reopened this Jun 4, 2024
@Eclips4 Eclips4 marked this as not a duplicate of #118749 Jun 4, 2024
@colesbury
Contributor

Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.
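A minimal sketch of one mitigation for the contention colesbury describes: hand each worker its own copies of the lists it reads, so threads never touch shared list objects. All names here are illustrative, not from the thread:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(rows, b):
    # Each worker receives its own list copies, so no list object is
    # shared across threads (avoiding per-list lock and refcount contention).
    n = len(b[0])
    out = []
    for row in rows:
        out.append([sum(row[k] * b[k][j] for k in range(len(b)))
                    for j in range(n)])
    return out

def matmul_threaded(a, b, num_threads=4):
    chunk = (len(a) + num_threads - 1) // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        futures = [
            # copy this chunk's rows of a, and all rows of b, per worker
            ex.submit(row_times_matrix,
                      [row[:] for row in a[i:i + chunk]],
                      [row[:] for row in b])
            for i in range(0, len(a), chunk)
        ]
    result = []
    for f in futures:
        result.extend(f.result())
    return result
```

Note that `row[:]` copies the inner lists but the float objects inside are still shared; colesbury's later comment and gist in this thread address that further.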

@xbit18
Author

xbit18 commented Jun 4, 2024

Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.

Ok, just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same even when fully implemented?

@colesbury
Contributor

This is the expected behavior -- it is not changing.

@mdboom mdboom added the performance label Jun 5, 2024
@xbit18
Author

xbit18 commented Jun 6, 2024

Ok thank you.
Knowing this, I changed the code so that it doesn't use a shared "result" list but thread-local results which are then combined. It doesn't really seem to have any effect. Am I missing something?

[Screenshot: timings for the same code across different Python versions (3.13 is free threading)]

Also, I don't know how much it helps, but I've noticed that increasing the number of threads seems to make things worse. For example, using 2 threads I got ~20 seconds, using 8 I got ~40 seconds, and using 16 I got ~50 seconds.
[Screenshot: timings for different numbers of threads (all with 3.13.0b1)]

This is the updated code. As you can see, it doesn't use shared lists anymore; instead, every thread creates a local list, returns it, and all the lists are then combined:

import time
import random
from concurrent.futures import ThreadPoolExecutor

def multiply_row(A, B, row_index, local_result):
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        acc = 0  # avoid shadowing the built-in sum()
        for k in range(num_columns_A):
            acc += A[row_index][k] * B[k][j]
        local_result[row_index][j] = acc

def parallel_matrix_multiplication(a, b, start_row, end_row):
    local_result = [[0] * len(b[0]) for _ in range(len(a))]
    
    for row_index in range(start_row, end_row):
        multiply_row(a, b, row_index, local_result)
    
    return local_result

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = []
    for _ in range(num_rows):
        result.append([0] * len(b[0]))
    row_chunk = num_rows // num_threads

    futures = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for i in range(num_threads):
            start_row = i * row_chunk
            end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
            future = executor.submit(parallel_matrix_multiplication, a, b, start_row, end_row)
            futures.append(future)
    
    results = [future.result() for future in futures]

    # Combine local results into the final result matrix
    for local_result in results:
        for i in range(num_rows):
            for j in range(len(b[0])):
                result[i][j] += local_result[i][j]

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)

    num_threads = 8 # Define number of threads
    
    start = time.perf_counter()
    
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()
@iperov

iperov commented Jun 6, 2024

Sorry guys,

where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compiling it.

@xbit18
Author

xbit18 commented Jun 6, 2024

Sorry guys,

where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compiling it.

I believe you must download it from here: https://www.python.org/downloads/release/python-3130b2/ and select the experimental features you want from the customization options during installation.

@iperov

iperov commented Jun 6, 2024

@xbit18 any embeddable builds?

@xbit18
Author

xbit18 commented Jun 6, 2024

@xbit18 any embeddable builds?

I believe there are; just scroll through the options.

@mdboom
Contributor

mdboom commented Jun 6, 2024

This is the updated code, as you can see it doesn't use shared lists anymore but every thread creates a local list which it returns and then all the lists are combined

It looks like the two random matrices (a and b) are still shared across threads. Even though it's read-only, that would still create lock contention, IIUC.

@xbit18
Author

xbit18 commented Jun 6, 2024

It looks like the two random matrices (a and b) are still shared across threads. Even though it's read-only, that would still create lock contention, IIUC.

You are completely right, and I realised this half a second before reading your reply. I feel so stupid.
I just had to do "a.copy()" in the executor.submit() call and the problem disappeared.
The performance of the free-threading build is now on par with, though not better than, the non-free-threading ones, but at least it's not worse by miles.
I don't know if it matters, but the GIL version of 3.13.0b1 performs worse than the other versions:

3.9.10
Matrix multiplication completed. 1.731310041 seconds.
nogil-3.9.10-1
Matrix multiplication completed. 0.590554667 seconds.
3.9.18
Matrix multiplication completed. 1.6376172500000001 seconds.
3.10.13
Matrix multiplication completed. 1.6486839579993102 seconds.
3.11.8
Matrix multiplication completed. 1.0865809580009227 seconds.
3.12.2
Matrix multiplication completed. 1.072277084000234 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 1.3272077920009906 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 2.8077587080006197 seconds.
@xbit18
Author

xbit18 commented Jun 6, 2024

Oh, something to mention: all the Python versions I'm comparing against 3.13 were installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 wasn't, because it wouldn't build with them.
Could that be the cause of this?

@mdboom
Copy link
Contributor

mdboom commented Jun 6, 2024

I feel so stupid.

This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.

I don't know if it matters but the GIL version of 3.13.0b1 performs worse than the other versions.

This is an interesting finding, and something my team (the Faster CPython team) tracks pretty closely... but since we are mostly interested in single-threaded performance, we don't have any benchmarks that are CPU-bound, multithreaded-with-a-GIL like this. It's interesting to know there might be a regression there.

Oh, something to mention: all the python versions which I'm comparing to 3.13 have been installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 hasn't because it wouldn't build with them. Perhaps it could be the cause of this?

It could be, but the difference in this case seems much larger than I would expect from PGO/LTO. I'm definitely intrigued enough to look into it further, and I'll link back here if anything comes of it.

@colesbury
Contributor

I just had to do "a.copy()" in the executor.submit()

a is a list of lists of numbers. a.copy() makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.

https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203
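To illustrate the shallow-vs-deep distinction colesbury points out, a minimal sketch (variable names are illustrative):

```python
import copy

a = [[1.0, 2.0], [3.0, 4.0]]

shallow = a.copy()           # new outer list, same inner list objects
assert shallow is not a
assert shallow[0] is a[0]    # inner rows are still shared across threads

deep = copy.deepcopy(a)      # new outer and inner lists
assert deep[0] is not a[0]

# A cheaper per-row copy also gives each thread its own inner lists:
per_row = [row[:] for row in a]
assert per_row[0] is not a[0]

# Note: even deepcopy does not duplicate the float objects themselves
# (CPython treats immutable atoms as not worth copying), which is why
# the linked gist goes further and copies the numbers as well.
assert deep[0][0] is a[0][0]
```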

@xbit18
Author

xbit18 commented Jun 6, 2024

This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.

No, absolutely you didn't, don't worry! I was joking because I've been struggling with this for like two days, and the solution struck me like a brick because it was so obvious!

@xbit18
Author

xbit18 commented Jun 6, 2024

a is a list of lists of numbers. a.copy() makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.

https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203

Oh, this is really helpful, thank you! I'm running some other tests with the new code and it actually does seem to make a difference! For example, these are some partial results with only 3 versions (size=1000 and threads=8):

3.11.8
Matrix multiplication completed. 72.75900166700012 seconds.
3.12.2
Matrix multiplication completed. 70.79678879200219 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 52.69710216600288 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 151.24306100000103 seconds.

It seems clear that the larger the matrix gets, the larger the impact of free threading is.
Interestingly, as I said, the free-threading build WITH the GIL (-Xgil=1) does a lot worse.

@mdboom
Contributor

mdboom commented Jun 6, 2024

Interestingly, as I said, the free-threading build WITH the GIL (-Xgil=1) does a lot worse.

This is not surprising. When you have a free-threaded build and then run it with the GIL on (-Xgil=1), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the -Xgil=1 flag.
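When comparing builds like this, it can help to confirm at runtime what is actually in effect. A small sketch, guarded so it also runs on pre-3.13 interpreters (`sys._is_gil_enabled` only exists on 3.13+):

```python
import sys
import sysconfig

# Was the interpreter compiled with --disable-gil (a free-threading build)?
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Is the GIL actually enabled right now (e.g. forced back on via -Xgil=1)?
check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = check() if check is not None else True  # older builds always have a GIL

print("free-threading build:", free_threaded_build)
print("GIL enabled at runtime:", gil_enabled)
```

This distinguishes the three configurations discussed above: a standard build, a free-threading build running GIL-free, and a free-threading build with the GIL re-enabled.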

@xbit18
Author

xbit18 commented Jun 6, 2024

This is not surprising. When you have a free-threaded build and then run it with the GIL on (-Xgil=1), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the -Xgil=1 flag.

Thank you very much for this, I'll gladly check the non free threading version as well!
