
I had an interesting call with a client today. We have an application which schedules other apps, and normally we have no trouble at all with NUMA servers with anywhere from 2 to 4 nodes.

On the call, we started up two very CPU-hungry applications; both were allocated to node 0, so across the whole machine there was only 50% usage. Once we moved the second app instance to the other node, we were using all of the cores (half on one app, half on the other). It seemed impossible to allocate one app to all cores.

Now, the only difference between this machine and the ones I'm used to is that Windows' Task Manager lists the NUMA nodes in a drop-down instead of one long list of individual cores, so Microsoft clearly knows about this restriction, but it's a hard problem to research online.

It's pretty clear we're going to have to develop NUMA node affinity, but for now I'm trying to understand the problem. What can cause one style of NUMA machine to allow applications to use both nodes transparently, and what's causing this behaviour now?

I can see this architecture working great for many small applications, but we typically run monolithic ones with many threads.

The server I'm fighting with is an HP ProLiant DL388 Gen9 with two Intel Xeon E5-2690 v3 CPUs.

Thoughts on what's causing this?

2 Answers


A process can only be assigned to a single NUMA node. That is the short answer. You can't force a single instance to run on more than one NUMA node. And this makes sense, given the purpose of NUMA, and also the secondary purpose of allowing >64 CPU cores on an OS that uses 64-bit CPU affinity bitmasks.
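To make the bitmask point concrete, here's a minimal sketch (in Python, purely illustrative; `affinity_mask` is a hypothetical helper, not a Windows API) of why a 64-bit affinity mask can only describe one group of up to 64 logical processors:

```python
def affinity_mask(cpus):
    """Build a 64-bit CPU affinity bitmask from a list of logical CPU indices."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            raise ValueError(f"CPU {cpu} does not fit in a 64-bit mask")
        mask |= 1 << cpu
    return mask

# On a 2x12-core box like the E5-2690 v3 pair, node 0's cores are bits 0-11:
node0 = affinity_mask(range(12))      # 0x0000000000000FFF
# Node 1's cores need a different set of bits in the same 64-bit word:
node1 = affinity_mask(range(12, 24))  # 0x0000000000FFF000
# A machine with more than 64 logical processors simply can't be
# described by one mask:
# affinity_mask(range(96))  -> ValueError
```

Each bit selects one logical processor, so once you exceed 64 of them the single-mask model breaks down, which is exactly what Processor Groups (discussed in the other answer) were bolted on to work around.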


I'm not an expert on this matter, but I'll chime in with my opinion.

It's pretty clear we're going to have to develop NUMA node affinity, but for now I'm trying to understand the problem. What can cause one style of NUMA machine to allow applications to use both nodes transparently, and what's causing this behaviour now?

I know Windows calculates a "Node Distance", estimating the time it takes for various NUMA nodes to communicate with each other. I don't know if it's latency-based or bandwidth-based (or perhaps both), but it's important to know.

Modern machines, such as Skylake-Server, can have "Sub-NUMA Clustering", where different parts of the same chip are reported as different NUMA nodes. However, the difference between nodes within the same chip is ~10 nanoseconds, while a different socket may be ~200 nanoseconds away.

Ex: two Xeon Golds (20 cores per CPU) with Sub-NUMA Clustering on would look like 4 NUMA nodes to Windows: 2 NUMA nodes per chip, representing the "left" 10 cores and the "right" 10 cores on each half of the chip, with 3 memory controllers on the left and 3 on the right. But all 20 cores can talk to either memory controller in ~80 nanoseconds or so; they can just talk to the "closer" memory controller in ~70 nanoseconds. That's a nearly imperceptible difference, so Windows probably prefers to float threads across these two NUMA nodes.

My assumption is that, under the default settings of your setup, Windows has decided that one machine's "Node Distance" was short enough to float threads across nodes, while on the other setup the memory distances were long enough that Windows' defaults keep threads within a single NUMA node.
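That node-distance theory can be sketched as a simple threshold policy. The numbers below are hypothetical (the ACPI SLIT convention does normalize a node's distance to itself as 10, but the cutoff value here is invented, not a documented Windows setting):

```python
LOCAL_DISTANCE = 10    # ACPI SLIT convention: a node's distance to itself is 10
FLOAT_THRESHOLD = 17   # hypothetical cutoff, not a documented Windows value

def should_float(node_distance):
    """Float threads across nodes only when the extra memory distance is small."""
    return node_distance <= FLOAT_THRESHOLD

# Sub-NUMA clusters on one die (~70 ns vs ~80 ns) report a small distance,
# so a scheduler following this policy would let threads roam both nodes:
should_float(11)   # True
# Separate sockets (~200 ns remote) report a large distance,
# so threads would stay on their home node:
should_float(21)   # False
```

If something like this is what Windows is doing internally, it would explain why the two-socket DL388 pins each process to one node while a sub-NUMA-clustered machine appears to ignore node boundaries.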


That's not my only theory. My 2nd theory is that something weird is going on with "Processor Groups". Processor Groups are a dirty compatibility hack in the Win32 API, because CPU affinity masks have been limited to 64 bits for performance reasons. Therefore, 64 logical cores is the default maximum on Windows.

You can access more than 64 logical cores through the Processor Group API: https://msdn.microsoft.com/en-us/library/windows/desktop/dd405503(v=vs.85).aspx

In short: if your processes are landing in separate "Processor Groups", you'll need a programmer to change the program to support Processor Groups.
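To give a feel for how Processor Groups carve up a big machine, here's a portable sketch (not the actual Win32 API; the function name is made up) of mapping a flat logical CPU index into a (group, 64-bit mask) pair, the shape the group-aware APIs expect:

```python
GROUP_SIZE = 64  # Windows caps each processor group at 64 logical processors

def to_group_affinity(logical_cpu):
    """Map a flat logical CPU index to a (group, mask) pair, mimicking how
    a machine with more than 64 logical processors is split into groups."""
    group, index = divmod(logical_cpu, GROUP_SIZE)
    return group, 1 << index

# On a 72-thread machine, logical CPU 70 lands in group 1, bit 6:
group, mask = to_group_affinity(70)   # (1, 0b1000000)
```

Note this is a simplification: real Windows assigns groups along NUMA boundaries at boot rather than just chopping the CPU list into 64-wide chunks, but the key point stands, a thread's affinity is always expressed as one group plus one 64-bit mask, so code that doesn't use the group-aware calls is stuck in whichever group it started in.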

I haven't done much testing with this stuff personally. But hopefully this is useful information for you.

