
I am running a program that, among other things, does some matrix multiplications and singular value decompositions and accesses subsets of matrices on a very large data set (these are the lines of code that take the most time).

I am running computations on a CPU with 40 cores and 1TB RAM (yes, 1TB). The dataset itself is ~450GB and another matrix in the code is equally big, meaning I use ~95% of the 1TB memory. I've learned that my computations scale quite badly when I increase the dataset size.

My question is: does using several cores instead of just one slow my computations down when memory access takes up most of the time?

  • The specific architecture & topology of your CPUs, interconnects, caches and memory does affect performance. Unfortunately no general analysis of this is likely to fit in an answer here, so you really need to describe all of those things (plus your program's access patterns, how performance varies with dataset size, and what profiling you did) to get anything resembling an answerable question.
    – Useless
    Commented Nov 8, 2022 at 16:56
  • To start with, you can look at things like numastat (on Linux) and whatever perf counters your OS/CPU offer to get an idea of what changes as your dataset grows.
    – Useless
    Commented Nov 8, 2022 at 17:00

3 Answers


Does using several cores instead of just one slow my computations down when memory access takes up most of the time?

Maybe. It depends on whether your system is NUMA¹. That can cause a slowdown when one core has to reach memory that is attached to another node.


1: Non-uniform memory access
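
A crude way to probe whether NUMA placement matters (a sketch only, assuming Linux; the sysfs path and the choice of node 0 are just for illustration): pin the process to the cores of a single node and compare a run against an unpinned one. Pinning also reduces the number of cores in use, so treat any difference only as a hint and use tools such as numastat (mentioned in the comments above) for a proper picture.

    import os
    import time
    import numpy as np

    def cpus_of_node(node):
        """Parse the CPU list of one NUMA node from sysfs (Linux only)."""
        cpus = set()
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    # Small stand-in matrices; the point is only to compare the two runs.
    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)

    t0 = time.perf_counter()
    _ = a @ b                                  # scheduler free to use every node
    print("unpinned:   ", time.perf_counter() - t0)

    os.sched_setaffinity(0, cpus_of_node(0))   # restrict to node 0's cores only
    t0 = time.perf_counter()
    _ = a @ b
    print("node 0 only:", time.perf_counter() - t0)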

  • A system with 40 cores and 1TB is definitely NUMA. However, the memory is already split up (probably) 5 or 10 or 20 ways. Using more cores means each core is closer to some memory and farther away from some. Commented Nov 8, 2022 at 14:03
  • Regardless of NUMA, there is also the possibility of being bottlenecked by memory bandwidth. Commented Nov 8, 2022 at 14:05

With matrix multiplication on a terabyte of data, memory bandwidth can easily be your bottleneck. You have 40 CPU cores and probably nicely vectorised code; that will put a lot of pressure on your memory subsystem. And the better your code, the higher the pressure.

I would start by making sure that you are partitioning the work into chunks that can be handled inside the cache on each CPU, for example splitting each matrix into 256 chunks of under 2 MB each. You might also check how much work per dollar you get out of an M1 Mac: not that much RAM, but tons of memory bandwidth (800 GB per second), very fast SSD / virtual memory, and a lot cheaper than any 1 TB machine.
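
As a rough illustration of that kind of chunking, here is a sketch of tiled multiplication in NumPy (the 1024-element block size is an assumption and would need tuning; a good BLAS already blocks for cache internally, so explicit tiling like this mainly pays off at a coarser level, e.g. for out-of-core or per-NUMA-node partitioning):

    import numpy as np

    def blocked_matmul(a, b, block=1024):
        """Compute a @ b tile by tile so each step touches a small working set."""
        n, k = a.shape
        k2, m = b.shape
        assert k == k2
        c = np.zeros((n, m), dtype=a.dtype)
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    # Each partial product only touches three tiles,
                    # not the full matrices.
                    c[i:i+block, j:j+block] += (
                        a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                    )
        return c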

PS: your caches may have problems if the distance between rows in memory is a power of two. A 4096 × 4096 matrix could be a very bad idea; if that's what you have, try changing it to 4100 × 4096, for example.
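
In NumPy's row-major storage the distance between rows is set by the number of columns, so one hedged way to get that padding is to allocate a few extra columns and work on a view (the 4 extra columns here are arbitrary):

    import numpy as np

    n = 4096
    backing = np.empty((n, n + 4))   # 4 spare columns pad each row
    a = backing[:, :n]               # logical 4096 x 4096 view
    # Rows now start 4100 * 8 = 32800 bytes apart instead of 32768,
    # so they no longer all map to the same cache sets.
    print(a.shape, a.strides)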

  • I bet you the 1TB machine also has 800GB per second memory bandwidth or more. Commented Nov 10, 2022 at 9:12
  • But it also has an enormous price tag. Dell or Apple machines with 1.5 TB are between 20,000 and 30,000 dollars. And 800 GB/s is an awful lot; I wouldn't bet on it until I have seen the spec.
    – gnasher729
    Commented Nov 10, 2022 at 23:02

It really depends on the implementation.

A very important part of optimization is using the caches efficiently. While L1 and L2 are typically per core, L3 is often shared. In the absolute worst case, the increased number of memory requests from multiple cores could cause L3 cache lines to be evicted just before they are needed again.

A well-optimized implementation should load a section of data that fits into the L1 cache and do as much computation on it as possible before loading the next chunk, and ideally do the same with the L2 cache. This typically results in almost unreadable code, but it can give huge performance gains.

So the first thing I would do is check the library. I would expect common and popular libraries like NumPy to perform well, as would libraries whose main selling point is performance, like the Intel Performance Primitives. I would expect in-house or smaller libraries to perform worse, since this kind of optimization requires a huge time investment.
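
For NumPy specifically, it is easy to check which BLAS/LAPACK build it is linked against, which largely decides how well these hot loops are optimised:

    import numpy as np

    # Prints the BLAS/LAPACK libraries this NumPy build uses
    # (OpenBLAS, MKL, ...); an unoptimised reference BLAS would be a red flag.
    np.show_config()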

Workstation-class processors typically also scale the amount of cache and the number of memory channels with the number of cores. If you have a 40-core processor, you likely have far more bandwidth available than a single core will use. Four to eight cores per memory channel seems fairly common today.

A well-optimized library should make good use of the resources available. In some cases manually tuning parameters might help, but this will likely be a trial-and-error process. Multiplying huge matrices is just fundamentally slow, though, so you will likely not solve your problem of poor scaling entirely.
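
One cheap tuning knob to try, if NumPy is involved, is the size of the BLAS thread pool. A sketch using the threadpoolctl package (an extra dependency, not something named above; setting OMP_NUM_THREADS before the process starts often has the same effect):

    import numpy as np
    from threadpoolctl import threadpool_limits   # pip install threadpoolctl

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)

    # Cap the BLAS thread pool for this block only; time a few different
    # limits to find the sweet spot for your memory bandwidth.
    with threadpool_limits(limits=8):
        c = a @ b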

  • Just an idea: If your code is multithreaded (I bet it is), you may be able to control how many threads are used. And determine the optimal number.
    – gnasher729
    Commented Nov 9, 2022 at 16:50
  • My Mac at home: L1 is per core. L2 is per group of 4 cores. L3 is shared between all cores and GPUs. And fast memory with 16 byte lines.
    – gnasher729
    Commented Mar 13, 2023 at 14:09
