3.13.0b1 Performance Issue with Free Threading build #120040
Comments
Duplicate of #118749 |
This doesn't seem like a duplicate to me. The version is different: he was using 3.13.0a6 and I'm on beta 1. He also had problems with the Fibonacci script, which works fine for me. @Eclips4 |
Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields. |
OK, so just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same if it were fully implemented? |
This is the expected behavior -- it is not changing. |
Sorry guys, where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with the compilation. |
I believe you must download it from here https://www.python.org/downloads/release/python-3130b2/ and select the experimental features you want from the customization options during installation |
@xbit18 Are there any embeddable builds? |
I believe there are, just scroll through the options |
It looks like the two random matrices (a and b) are still shared across threads. Even though it's read-only, that would still create lock contention, IIUC. |
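One way to picture the suggestion above is a benchmark in which every thread builds its own inputs instead of reading shared lists. The following is only a minimal sketch under stated assumptions, not the code from this issue or from the linked gist: the helper names (`make_row`, `make_matrix`, `worker`, `threaded_matmul`) and the seed constant `B_SEED` are hypothetical, and regenerating the matrices from fixed seeds inside each thread is just one possible way to get thread-private list and float objects.

```python
import random
import threading

B_SEED = 10_000  # hypothetical seed offset for the rows of B


def make_row(seed, size):
    """Build one row of floats from a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(size)]


def make_matrix(size, base_seed):
    """Build a size x size matrix whose row i is seeded with base_seed + i."""
    return [make_row(base_seed + i, size) for i in range(size)]


def worker(slot, row_start, row_end, size, results):
    # Every list and float used in the hot loop is created in this thread,
    # so no other thread touches their locks or reference counts.
    b = make_matrix(size, B_SEED)
    block = []
    for i in range(row_start, row_end):
        a_row = make_row(i, size)  # row i of A, rebuilt locally
        block.append(
            [sum(a_row[k] * b[k][j] for k in range(size)) for j in range(size)]
        )
    # The only write to a shared object: one slot per thread.
    results[slot] = block


def threaded_matmul(size, num_threads):
    """Multiply A (row seeds 0..size-1) by B (row seeds B_SEED + i)."""
    bounds = [
        (t * size // num_threads, (t + 1) * size // num_threads)
        for t in range(num_threads)
    ]
    results = [None] * num_threads
    threads = [
        threading.Thread(target=worker, args=(t, lo, hi, size, results))
        for t, (lo, hi) in enumerate(bounds)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [row for block in results for row in block]
```

Rebuilding B in every thread costs extra memory and setup time, so a real benchmark would probably time only the multiplication loop. The point of the sketch is that the inner loop never reads an object owned by another thread.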
You are completely right, and I realised this half a second before reading your reply. I feel so stupid.
3.9.10
Matrix multiplication completed. 1.731310041 seconds.
nogil-3.9.10-1
Matrix multiplication completed. 0.590554667 seconds.
3.9.18
Matrix multiplication completed. 1.6376172500000001 seconds.
3.10.13
Matrix multiplication completed. 1.6486839579993102 seconds.
3.11.8
Matrix multiplication completed. 1.0865809580009227 seconds.
3.12.2
Matrix multiplication completed. 1.072277084000234 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 1.3272077920009906 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 2.8077587080006197 seconds. |
Oh, something to mention: all the python versions which I'm comparing to 3.13 have been installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 hasn't because it wouldn't build with them. |
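Since the build options differ between the interpreters being compared, it can help to record them next to each timing. The sketch below is an assumption-laden illustration: `build_flags` is a hypothetical helper, and it relies on `sysconfig.get_config_var("CONFIG_ARGS")`, which reports the configure arguments on typical POSIX builds but is usually unavailable (returns `None`) on Windows.

```python
import sysconfig


def build_flags():
    """Report whether common configure options appear in this build."""
    # CONFIG_ARGS may be None (for example on Windows), so fall back to "".
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return {
        "enable_optimizations": "--enable-optimizations" in config_args,
        "with_lto": "--with-lto" in config_args,
        "disable_gil": "--disable-gil" in config_args,
    }


if __name__ == "__main__":
    for name, present in build_flags().items():
        print(f"{name}: {present}")
```

Running this under each pyenv-installed interpreter shows at a glance which results came from PGO/LTO builds and which did not.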
This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.
This is an interesting finding, and something my team (the Faster CPython team) tracks pretty closely... but since we are mostly interested in single-threaded performance, we don't have any benchmarks that are CPU-bound, multithreaded-with-a-GIL like this. It's interesting to know there might be a regression there.
It could be, but the difference in this case seems much larger than I would expect from PGO/LTO. I'm definitely intrigued enough to look into it further, and I'll link back here if anything comes of it. |
https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203 |
No, absolutely not, don't worry! I was joking because I've been struggling with this for like two days and the solution struck me like a brick because it was really obvious! |
Oh this is really helpful thank you! I'm running some other tests with the new code and it does actually seem to make a difference! For example these are some partial results with only 3 versions (size=1000 and threads=8):
3.11.8
Matrix multiplication completed. 72.75900166700012 seconds.
3.12.2
Matrix multiplication completed. 70.79678879200219 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 52.69710216600288 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 151.24306100000103 seconds.
It seems to be clear that the larger the matrix gets, the larger the impact of free threading is. |
This is not surprising. When you have a free-threaded build and then run it with the GIL on (-Xgil=1), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the |
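To label benchmark runs along the lines described above, a script can report both whether the interpreter is a free-threaded build and whether the GIL is actually active. This is a minimal sketch: `gil_status` is a hypothetical helper, `sys._is_gil_enabled()` is an underscore-prefixed function available from Python 3.13, and treating its absence as "GIL enabled" is an assumption that fits older versions.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for this interpreter."""
    # Py_GIL_DISABLED is set to 1 in free-threaded builds, else 0 or None.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Assumption: interpreters without sys._is_gil_enabled always use the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    build, gil = gil_status()
    print(f"free-threaded build: {build}, GIL enabled: {gil}")
```

A free-threaded build that reports the GIL as enabled is the configuration discussed in the comment above, so timings from it are best kept separate from timings on a regular build.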
Thank you very much for this, I'll gladly check the non free threading version as well! |
Bug report
Bug description:
Hello, I'm writing a thesis on free-threading Python and thus I'm testing 3.13.0b1 built with --disable-gil.
I installed it with pyenv using this command:
env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1
I didn't specify --enable-optimizations and --with-lto because with those the build would fail.
Now, I'm writing a benchmark to compare free-threading Python with past versions of normal Python and even with the 3.9.10 nogil Python.
Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the matrix into rows and distributes the rows to a specified number of threads. This is the complete code:
When I ran this code with these versions of Python (3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, 3.12.2), the maximum running time was ~13 seconds with normal 3.9.10 and the minimum was ~5 seconds with nogil 3.9.10.
When I run it with 3.13.0b1, the time skyrockets to ~48 seconds.
I tried using cProfile to profile the code, but with 3.13 it freezes and never outputs anything (with other versions it works). Instead, the CPU goes to 100% usage and the process never stops unless I kill it. That makes me think it doesn't use multiple cores, since nogil 3.9 goes to >600% usage.
The basic Fibonacci test works like a charm, so I know the --disable-gil build succeeded.
All of this is done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.
CPython versions tested on:
3.9, 3.10, 3.11, 3.12, 3.13
Operating systems tested on:
macOS