No performance increase when porting Python multiprocessing calculations from Apple M1 Pro to Dual CPU Xeon E5-2687W v4

As part of my data evaluation routine, I run 1,000,000 Monte Carlo simulations on my MacBook Pro M1 Pro (10 cores, 32 GB RAM) using multiple processes in Python 3.10.5 (concurrent.futures.ProcessPoolExecutor). The execution times are:

Apple MacBook Pro, M1 Pro 10 Core CPU
macOS Monterey 12.6.2
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:07:06) [Clang 13.0.1 ] on darwin
without setting CPU affinity, single run
Cores   Simulations     Execution Time
10             1000     00:00:03.114602
10            10000     00:00:16.658438
10           100000     00:02:39.969048
10          1000000     00:26:23.064365
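For context, a minimal sketch of how a run like this can be distributed over worker processes with concurrent.futures.ProcessPoolExecutor. The names `simulate_batch`, `split_evenly` and `run_simulations` are hypothetical, and the batch body (summing uniform random draws) is only a stand-in for the real simulation.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def simulate_batch(seed: int, n_sims: int) -> float:
    """Hypothetical stand-in for one batch: the sum of n_sims uniform draws."""
    rng = random.Random(seed)  # separate, reproducible stream per batch
    return sum(rng.random() for _ in range(n_sims))


def split_evenly(total: int, parts: int) -> list[int]:
    """Split `total` simulations into `parts` batch sizes differing by at most one."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def run_simulations(total: int, workers: int, executor_cls=ProcessPoolExecutor) -> float:
    """Run `total` simulations on `workers` workers and return the mean draw."""
    sizes = split_evenly(total, workers)
    with executor_cls(max_workers=workers) as pool:
        # One batch per worker; batch i uses seed i.
        partial_sums = pool.map(simulate_batch, range(workers), sizes)
        return sum(partial_sums) / total
```

A call such as `run_simulations(1_000_000, 10)` should sit under an `if __name__ == "__main__":` guard, because macOS starts worker processes with the spawn method by default and re-imports the main module in each worker.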

To reduce calculation times and offload work from my main machine, I decided to run the calculation on an older dual Xeon E5-2687W v4 workstation, using 20 cores (hyper-threading disabled):

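DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
without setting CPU affinity, single run

Cores   Simulations     Execution Time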
20             1000     00:00:03.913254
20            10000     00:00:16.684702
20           100000     00:02:31.481626
20          1000000     00:27:44.841615

As the numbers above show, there was no noticeable increase in performance. However, using only 20 of the 24 available cores might add some overhead, as the scheduler tends to move processes between cores. To investigate this potential effect, I manually set the CPU affinity of each process and got the following results:

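DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run

Cores   Simulations     Execution Time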
20             1000     00:00:03.855061
20            10000     00:00:17.721105
20           100000     00:02:39.870485
20          1000000     00:26:22.462597
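One way to set the affinity of each worker on Linux is sketched below. It is an assumption about how the pinning could be done, not necessarily the exact method used for the numbers above. `pin_to_cpu` and `init_worker` are hypothetical helper names. `os.sched_setaffinity` is not available on macOS, so the helper does nothing there.

```python
import os


def pin_to_cpu(cpu: int) -> set[int]:
    """Pin the calling process to one logical CPU and return its new affinity set.

    On platforms without os.sched_setaffinity (e.g. macOS) nothing is
    changed and an empty set is returned.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return set(os.sched_getaffinity(0))


def init_worker(cpu_queue) -> None:
    """Pool initializer: take one CPU index from the queue and pin this worker to it."""
    pin_to_cpu(cpu_queue.get())
```

The initializer can be passed to the pool as `ProcessPoolExecutor(max_workers=20, initializer=init_worker, initargs=(cpu_queue,))`, where `cpu_queue` is a `multiprocessing.Queue` filled beforehand with 20 distinct logical CPU indices. Which indices belong to which socket depends on the machine, so the mapping has to be checked on the system itself.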

Again, no difference in performance was noticeable. To make sure the code scales in general, I tested execution with 10, 16 and 20 cores on the workstation:

DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run

Cores   Simulations     Execution Time
10             1000     00:00:04.274913
10            10000     00:00:30.311358
10           100000     00:04:57.086862
10          1000000     00:50:58.328345

Cores   Simulations     Execution Time
16             1000     00:00:03.605890
16            10000     00:00:21.139773
16           100000     00:03:25.156981
16          1000000     00:35:11.151080

Cores   Simulations     Execution Time
20             1000     00:00:03.855061
20            10000     00:00:17.721105
20           100000     00:02:39.870485
20          1000000     00:26:22.462597

Execution time decreases roughly in proportion to the number of cores, so the code scales almost linearly (except for some overhead from spawning the processes at lower numbers of simulations).
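To put a number on this, the sketch below computes the speedup and parallel efficiency of the 1,000,000-simulation runs relative to the 10-core run. The timings are copied from the workstation tables above. The helper `to_seconds` and the dictionary names are only for illustration.

```python
from datetime import timedelta


def to_seconds(hms: str) -> float:
    """Convert 'HH:MM:SS.ffffff' to seconds."""
    h, m, s = hms.split(":")
    return timedelta(hours=int(h), minutes=int(m), seconds=float(s)).total_seconds()


# Workstation timings for 1,000,000 simulations (with CPU affinity).
timings = {10: "00:50:58.328345", 16: "00:35:11.151080", 20: "00:26:22.462597"}

base_cores = 10
base_time = to_seconds(timings[base_cores])

scaling = {}
for cores, hms in timings.items():
    speedup = base_time / to_seconds(hms)  # measured speedup vs. 10 cores
    ideal = cores / base_cores             # speedup under perfect scaling
    scaling[cores] = (speedup, speedup / ideal)
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {speedup / ideal:.0%}")
```

This gives a speedup of about 1.45x at 16 cores and 1.93x at 20 cores, i.e. a parallel efficiency of roughly 91 % and 97 % relative to the 10-core run.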

Based on common benchmark numbers between Apple's M1 Pro and Dual Xeon E5-2687Wv4 like

I expected a performance increase of around 25 to 30 % (or at least 15 % if we consider these benchmarks uncertain). However, my Monte Carlo simulations perform roughly equally on both systems.
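Expressed per core, the same measurements look like this. The sketch uses only the 1,000,000-simulation timings from the tables above. `per_core_rate` is a hypothetical helper name, and the result is an average over all cores in use.

```python
def per_core_rate(simulations: int, seconds: float, cores: int) -> float:
    """Simulations completed per second per core, averaged over all cores."""
    return simulations / seconds / cores


# 1,000,000-simulation runs from the tables above, converted to seconds.
m1_seconds = 26 * 60 + 23.064365    # M1 Pro, 10 cores
xeon_seconds = 26 * 60 + 22.462597  # dual Xeon, 20 cores, with CPU affinity

m1_rate = per_core_rate(1_000_000, m1_seconds, 10)
xeon_rate = per_core_rate(1_000_000, xeon_seconds, 20)
ratio = m1_rate / xeon_rate

print(f"M1 Pro: {m1_rate:.1f} simulations/s per core")
print(f"Xeon:   {xeon_rate:.1f} simulations/s per core")
print(f"ratio:  {ratio:.2f}")
```

On average, each M1 Pro core completes about twice as many simulations per second as each Xeon core for this workload. The 10-core M1 Pro combines 8 performance cores with 2 efficiency cores, so the per-core average hides a mix of two core types.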

Based on the above findings, my questions are:

  • Is this solely due to the more modern architecture of Apple's M1 Pro?
  • What am I missing here (apart from the fact that Python itself is rather slow)?
  • How could I investigate this scaling issue in more detail?