As part of my data evaluation routine, I am performing 1'000'000 Monte-Carlo simulations on my MacBook Pro M1 Pro (10 cores, 32 GB RAM) using multiple processes in Python 3.10.5 (concurrent.futures.ProcessPoolExecutor). The execution times are:
Apple MacBook Pro, M1 Pro 10 Core CPU
macOS Monterey 12.6.2
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:07:06) [Clang 13.0.1 ] on darwin
without setting CPU affinity, single run
Cores Simulations Execution Time
10 1000 00:00:03.114602
10 10000 00:00:16.658438
10 100000 00:02:39.969048
10 1000000 00:26:23.064365
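The setup above can be sketched as follows. The actual simulation kernel is not shown in the post, so a simple pi-estimation loop stands in for it; the chunking and timing structure is the part that matters:

```python
import random
import time
from concurrent.futures import ProcessPoolExecutor


def run_chunk(n_sims: int) -> int:
    """Stand-in Monte-Carlo kernel (hypothetical): count random points
    in the unit square that fall inside the unit circle."""
    hits = 0
    for _ in range(n_sims):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def run_parallel(total_sims: int, workers: int) -> tuple[float, float]:
    """Split total_sims evenly across worker processes and time the run."""
    chunk = total_sims // workers
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(run_chunk, [chunk] * workers))
    elapsed = time.perf_counter() - start
    pi_estimate = 4.0 * hits / (chunk * workers)
    return pi_estimate, elapsed


if __name__ == "__main__":
    pi_est, elapsed = run_parallel(100_000, 4)
    print(f"pi ~ {pi_est:.3f} in {elapsed:.2f} s")
```

One chunk per worker keeps inter-process communication minimal, so the measured time is dominated by the kernel itself rather than by pickling overhead.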
To decrease calculation times and offload work from my main machine, I decided to perform the calculation on an older dual Xeon E5-2687W v4 workstation using 20 cores (hyper-threading disabled):
DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
without setting CPU affinity, single run
Cores Simulations Execution Time
20 1000 00:00:03.913254
20 10000 00:00:16.684702
20 100000 00:02:31.481626
20 1000000 00:27:44.841615
As the numbers above show, there was no noticeable increase in performance. However, using only 20 of the 24 available cores might introduce some overhead, as the scheduler tends to switch processes between cores. To investigate this potential effect, I manually set the CPU affinity of each process and got the following results:
DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run
Cores Simulations Execution Time
20 1000 00:00:03.855061
20 10000 00:00:17.721105
20 100000 00:02:39.870485
20 1000000 00:26:22.462597
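The affinity pinning described above can be sketched like this (Linux-only: os.sched_setaffinity is not available on macOS). Here each task carries the core index it should run on and pins its worker process before executing; the trivial accumulation loop is a placeholder for the real simulation kernel:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def run_chunk_on_core(args: tuple[int, int]) -> int:
    """Pin the current worker process to one core, then run the
    (placeholder) simulation chunk."""
    core, n_sims = args
    os.sched_setaffinity(0, {core})  # 0 = the calling process itself
    total = 0
    for i in range(n_sims):          # placeholder for the real kernel
        total += i % 7
    return total


def run_pinned(total_sims: int, cores: list[int]) -> int:
    """One evenly sized chunk per core, each pinned to its own core."""
    chunk = total_sims // len(cores)
    work = [(core, chunk) for core in cores]
    with ProcessPoolExecutor(max_workers=len(cores)) as pool:
        return sum(pool.map(run_chunk_on_core, work))


if __name__ == "__main__":
    print(run_pinned(40_000, [0, 1]))
```

Pinning per task rather than per worker is slightly redundant (the affinity is re-set on every call), but it avoids having to pass non-picklable state through the pool initializer.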
Again, no difference in performance was noticeable. To verify that the code scales at all, I tested execution with 10, 16 and 20 cores on the workstation:
DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run
Cores Simulations Execution Time
10 1000 00:00:04.274913
10 10000 00:00:30.311358
10 100000 00:04:57.086862
10 1000000 00:50:58.328345
Cores Simulations Execution Time
16 1000 00:00:03.605890
16 10000 00:00:21.139773
16 100000 00:03:25.156981
16 1000000 00:35:11.151080
Cores Simulations Execution Time
20 1000 00:00:03.855061
20 10000 00:00:17.721105
20 100000 00:02:39.870485
20 1000000 00:26:22.462597
The execution times scale roughly inversely with the number of cores, i.e. the speedup is close to linear (apart from some overhead from spawning the processes at lower simulation counts).
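One way to quantify this: compute speedup and parallel efficiency relative to the 10-core run, using the 1'000'000-simulation wall-clock times from the tables above:

```python
# Wall-clock times (seconds) of the 1'000'000-simulation runs on the
# Xeon workstation, taken from the tables above.
times = {10: 50 * 60 + 58.33, 16: 35 * 60 + 11.15, 20: 26 * 60 + 22.46}

base_cores = 10
for cores in sorted(times):
    speedup = times[base_cores] / times[cores]
    efficiency = speedup / (cores / base_cores)  # 1.0 = ideal scaling
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

This yields an efficiency above 90 % for both the 16- and 20-core runs, confirming that the workload itself parallelizes well and the bottleneck is per-core throughput, not scaling.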
Based on common benchmark numbers between Apple's M1 Pro and Dual Xeon E5-2687Wv4 like
- PassMark for M1 Pro 10 Core
- PassMark for Dual Xeon E5-2687Wv4
- OpenFOAM benchmark with M1 Pro and Dual Xeon E5-2687W v4
I expected a performance increase of around 25-30 % (or at least 15 % if we treat these benchmarks as uncertain). However, my Monte-Carlo simulations perform roughly the same on both systems.
Based on the above findings, my questions are:
- Is this solely due to the more modern architecture of Apple's M1 Pro?
- What am I missing here (apart from the fact that Python itself is rather slow)?
- How could I investigate this scaling issue in more detail?