
As part of my data evaluation routine, I am performing 1'000'000 Monte-Carlo simulations on my MacBook Pro M1 Pro (10 cores, 32 GB RAM) using multiple processes in Python 3.10.5 (concurrent.futures.ProcessPoolExecutor). The execution times are:

Apple MacBook Pro, M1 Pro 10 Core CPU
macOS Monterey 12.6.2
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:07:06) [Clang 13.0.1 ] on darwin
without setting CPU affinity, single run

Cores   Simulations     Execution Time
10             1000     00:00:03.114602
10            10000     00:00:16.658438
10           100000     00:02:39.969048
10          1000000     00:26:23.064365
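For reference, a minimal sketch of how such a run can be structured with ProcessPoolExecutor. The simulation body here is a toy π estimate standing in for the real model, and the function names are illustrative, not the actual code:

```python
import os
import random
from concurrent.futures import ProcessPoolExecutor

def simulate(n):
    """Run n Monte-Carlo trials; a toy pi estimate stands in for the real model."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def run(total, workers):
    # Split the total number of simulations evenly across worker processes.
    chunk = total // workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(simulate, [chunk] * workers))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(run(100_000, os.cpu_count()))
```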

Trying to decrease calculation times and offload work from my main machine, I decided to run the calculation on an older dual Xeon E5-2687W v4 workstation using 20 cores (hyper-threading disabled):

DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
without setting CPU affinity, single run

Cores   Simulations     Execution Time
20             1000     00:00:03.913254
20            10000     00:00:16.684702
20           100000     00:02:31.481626
20          1000000     00:27:44.841615

As the numbers above show, there was no noticeable increase in performance. However, using only 20 of the 24 available cores might introduce some overhead, as the scheduler tends to switch processes between cores. To investigate this potential effect, I manually set the CPU affinity of each process and got the following results:
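For reference, this is roughly how per-process pinning can be done on Linux via os.sched_setaffinity (the function names are illustrative; macOS does not expose this API):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def pin_to_core(core_id):
    """Restrict the calling process to a single core (Linux only)."""
    os.sched_setaffinity(0, {core_id})

def pinned_worker(args):
    core_id, n = args
    pin_to_core(core_id)
    # ... run n simulations here ...
    return os.sched_getaffinity(0)  # report the mask actually in effect

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for mask in pool.map(pinned_worker, [(c, 1000) for c in range(4)]):
            print(mask)
```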

DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run

Cores   Simulations     Execution Time
20             1000     00:00:03.855061
20            10000     00:00:17.721105
20           100000     00:02:39.870485
20          1000000     00:26:22.462597

Again, no difference in performance was noticeable. To make sure the code scales in general, I tested execution with 10, 16 and 20 cores on the workstation:

DELL Precision T7810, 2x Xeon E5-2687W v4
Ubuntu 22.04.3 LTS
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0] on linux
with setting CPU affinity, single run

Cores   Simulations     Execution Time
10             1000     00:00:04.274913
10            10000     00:00:30.311358
10           100000     00:04:57.086862
10          1000000     00:50:58.328345

Cores   Simulations     Execution Time
16             1000     00:00:03.605890
16            10000     00:00:21.139773
16           100000     00:03:25.156981
16          1000000     00:35:11.151080

Cores   Simulations     Execution Time
20             1000     00:00:03.855061
20            10000     00:00:17.721105
20           100000     00:02:39.870485
20          1000000     00:26:22.462597

The execution times scale roughly inversely with the number of cores, i.e. throughput scales roughly linearly, apart from some overhead from spawning the processes at the lower simulation counts.
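As a sanity check, the parallel efficiency implied by the 1'000'000-simulation rows can be computed directly (the timings below are hard-coded from the tables above, converted to seconds):

```python
# Execution times in seconds for the 1'000'000-simulation runs above.
timings = {10: 50 * 60 + 58.33, 16: 35 * 60 + 11.15, 20: 26 * 60 + 22.46}

base_cores = min(timings)
base_time = timings[base_cores]
for cores in sorted(timings):
    speedup = base_time / timings[cores]
    # Efficiency = achieved speedup relative to the ideal (linear) speedup.
    efficiency = speedup / (cores / base_cores)
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

This confirms the scaling claim: going from 10 to 20 cores yields a speedup close to 2x, i.e. parallel efficiency above 90 %.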

Based on common benchmark numbers comparing Apple's M1 Pro to the dual Xeon E5-2687W v4, I expected a performance increase of around 25...30 % (or at least 15 % if we consider these benchmarks uncertain). However, my Monte-Carlo simulations perform roughly the same on both systems.

Based on the above findings, my questions are:

  • Is this solely due to the more modern architecture of Apple's M1 Pro?
  • What am I missing here (apart from the fact that Python itself is rather slow)?
  • How could I investigate this scaling issue in more detail?
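Regarding the last question, one way to investigate is to time the work inside each worker and compare it against the wall-clock time of the whole pool; a large gap points at spawning/IPC overhead rather than compute. A minimal sketch, with a squaring loop as a placeholder for the real simulation:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def timed_worker(n):
    """Return (seconds spent inside the worker, dummy result)."""
    start = time.perf_counter()
    acc = sum(i * i for i in range(n))  # stand-in for n simulations
    return time.perf_counter() - start, acc

if __name__ == "__main__":
    tasks, n = 4, 100_000
    wall_start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=tasks) as pool:
        results = list(pool.map(timed_worker, [n] * tasks))
    wall = time.perf_counter() - wall_start
    slowest = max(t for t, _ in results)
    print(f"wall {wall:.3f}s, slowest worker {slowest:.3f}s, "
          f"overhead {wall - slowest:.3f}s")
```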
  • It is highly likely you made a programming mistake which renders multithreading (or multi-processing) ineffective. However, Super User isn’t the place to ask about that; Stack Overflow is. If you want to go that way, remember to include the gist of the code and whatnot, according to the rules and customs over there.
    – Daniel B
    Commented Sep 1, 2023 at 11:40
  • @DanielB: Thanks for that advice. I was thinking about asking on SO, but was not sure it fits there, as it seemed more hardware- or architecture-related. I need to do some further investigation to produce a small reproducible code snippet for SO.
    – albert
    Commented Sep 1, 2023 at 12:04
  • The M1 has an odd mix of performance. It has a significantly higher single-core rating but lower multi-core performance cpubenchmark.net/compare/4580vs2765.2/… which suggests that there is something weird going on. Likely the M1 is hitting a power or thermal budget. Given that it is 6 years newer, the better single-core performance is expected, but the multi-core performance seems rather poor by comparison. The M1 may have better performance per watt, but seems to perform identically otherwise.
    – Mokubai
    Commented Sep 1, 2023 at 12:51
  • @Mokubai: What do you mean by "[...] but seems to perform identically otherwise." Based on my findings, the M1 Pro seems to perform identically to my Dual Xeon machine, but I would assume the latter should be (at least slightly) better in performance since I am using multiprocessing to run multicore calculations.
    – albert
    Commented Sep 1, 2023 at 13:06
  • 1
    Please note that Xeons have never really been about raw performance. They are used in servers where reliability trumps everything else. The Xeon is also on a 14 nanometer process while the M1 is on 5nm process so the 160W rating on the Xeon is deceptive, it may sound like it should be a more powerful processor but how many watts it takes doesn't really give a good indication of how much work it can do. That apple can give you an M1 that performs nearly identically at a fraction of the power is actually kindof impressive.
    – Mokubai
    Commented Sep 1, 2023 at 13:31

1 Answer


I expected a performance increase of around 25...30 % (or at least 15 % if we consider these benchmarks as uncertain).

Along with

older Dual Xeon E5-2687Wv4 workstation utilizing 20 Cores (deactivated hyper-threading)

do not necessarily work well together, especially given another problem I also highlight below.

A single Xeon should outperform the M1 Pro when hyper-threading is enabled.

Many multi-core benchmarks show hyper-threading can increase performance by anywhere from 15 to 30 %. There are some edge cases where it brings no benefit, but more often than not hyper-threading keeps the various execution units of a CPU core better loaded by letting independent threads use under-utilised parts of the core.

By disabling hyper-threading you have essentially reduced the performance of the Xeon by anywhere from 0 to 30 %.

From the CPU Benchmark comparing Apple M1 Pro 10 Core 3200 MHz vs Intel Xeon E5-2687W v4 @ 3.00GHz

[Benchmark chart comparing single-thread and CPU Mark scores of the two processors]

On a single thread the Apple M1 Pro potentially has a lead on the Xeon, but on the multi-core (CPU Mark) benchmark the Xeon should win by 25 %; by disabling hyper-threading you have robbed the Xeon of its advantage.

That, on its own, shouldn't put a dual-Xeon system at a disadvantage, but a system with two Xeon processors also has other potential problems.

The dual-Xeon architecture introduces a synchronisation issue. For every task that runs on the second CPU, the input data must be copied over the QPI link between the two processors into the second processor's memory (see the diagram below for the architecture). The results must then be copied back across the QPI link to the master process. That presents a bottleneck, especially if your tasks are small and the per-task transfer overhead rivals the compute time.

Combining that bottleneck with the disabled hyper-threading means your workload has to be tuned for, and aware of, the system. It may well be that an old dual-CPU system is no faster than a modern processor that already has a significant single-core advantage (note the M1 Pro's single-core benchmark) and does not suffer the drawbacks of a NUMA memory architecture. Note also that modern Xeons offer different clustering modes that can alter performance, but those require system tuning.

[Diagram: dual-socket Xeon topology, with the QPI link between the two processors and each processor's local memory]
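On the workload side, one mitigation that is largely independent of the hardware is to make each task coarse, so that only small summaries cross process (and potentially QPI) boundaries. A hedged sketch, with simulate_batch standing in for the asker's actual model:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_batch(n):
    # The heavy work happens here; only a small summary value crosses
    # process (and potentially QPI) boundaries when it is returned.
    total = 0.0
    for _ in range(n):
        total += 1.0  # placeholder for one simulation's contribution
    return total

def run_coarse(total_sims, workers):
    # One large task per worker instead of total_sims tiny tasks: the
    # pickled data exchanged is proportional to the number of tasks,
    # not the number of simulations.
    per_worker = total_sims // workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(simulate_batch, [per_worker] * workers))

if __name__ == "__main__":
    print(run_coarse(1_000_000, 20))
```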

A single-CPU system with significantly higher single-core performance can potentially outperform a "more powerful" system, depending on the task and the architecture of the system.

Both Intel and AMD have been packing more and more cores onto a single package for some time; the resulting system is more predictable and has a more consistent memory interface.

