
Dual socket systems: NUMA vs. shared memory address space


magellan

Member
Joined Jul 20, 2002
In a NUMA system each socket has a separate memory address space, so I'm guessing each socket would need its own memory address/data bus lines to its respective memory pool, and each socket's CPU's IMM unit would control its own pool. But how do dual- or triple-socket shared-memory-address-space systems work? Is the IMM unit separate and on the motherboard? Or does one socket's CPU's IMM somehow handle memory access for all the other sockets' CPUs? Please note I'm referring to CPUs located in physically separate sockets.
 
NUMA systems do have their own separate bus, yes.

IMM unit? Do you mean IMC (Integrated Memory Controller)?

At least on enthusiast motherboards, for example the EVGA SR-2, each CPU can access each 'bank' of DIMMs. It is one pool AFAIK.
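One way to check what the OS actually sees there is libnuma. Here's a minimal sketch, assuming Linux with the libnuma development package installed; the file name is just for illustration:

/* Minimal sketch: list the NUMA nodes the kernel reports.
 * Assumes Linux + libnuma. Build: gcc topo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        /* Kernel sees one flat memory space: no NUMA nodes. */
        printf("No NUMA support: one shared pool\n");
        return 0;
    }
    int nodes = numa_num_configured_nodes();
    printf("%d NUMA node(s)\n", nodes);
    for (int n = 0; n < nodes; n++) {
        long long free_b;
        long long size_b = numa_node_size64(n, &free_b);
        printf("node %d: %lld MB total, %lld MB free\n",
               n, size_b >> 20, free_b >> 20);
    }
    return 0;
}

On a dual-socket board in NUMA mode this reports two nodes, each backed by one socket's DIMMs; with node interleaving enabled in the BIOS the same board typically reports a single node, which is the one-pool behavior described above.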
 
In NUMA mode, non-local memory is accessed via QPI, which is very, very slow. It can slow the OpenMPI Linpack benchmark by orders of magnitude.

My question was aimed at dual-socket systems that don't have NUMA. What manages the memory when it's all local (as in a common memory data bus) to both sockets? I'm guessing it must be managed by the motherboard chipset.
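The local-versus-remote penalty mentioned above can be measured directly. Here's a minimal sketch, assuming a Linux machine with at least two NUMA nodes and libnuma; the 64 MB buffer size is an arbitrary illustrative choice, big enough to defeat the caches:

/* Minimal sketch: time a cache-line walk over a buffer on the local
 * node vs. one on the remote node. Assumes Linux + libnuma and a
 * 2-node system. Build: gcc -O2 remote.c -lnuma */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <numa.h>

#define BUF (64UL * 1024 * 1024)   /* 64 MB, larger than the caches */

static double walk(volatile char *p, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 64)   /* one read per cache line */
        (void)p[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_num_configured_nodes() < 2) {
        fprintf(stderr, "need a NUMA system with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       /* pin to socket 0  */
    char *local  = numa_alloc_onnode(BUF, 0);  /* socket 0's DIMMs */
    char *remote = numa_alloc_onnode(BUF, 1);  /* socket 1's DIMMs */
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(local, 1, BUF);                     /* fault the pages in */
    memset(remote, 1, BUF);
    printf("local:  %.3f s\n", walk(local, BUF));
    printf("remote: %.3f s\n", walk(remote, BUF));
    numa_free(local, BUF);
    numa_free(remote, BUF);
    return 0;
}

With the process pinned to socket 0, the second walk pays the cost of crossing QPI/HyperTransport on every cache-line fill; how large the gap is depends on the platform.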
 
Correct. The memory controller used to be the most important part of the motherboard chipset, the "north bridge". Anything without an on-die IMC (which covers everything from AMD's 32-bit K7 back in the day up to as recently as the Core 2 Quad-based X3300 and the E/L/X 5400-series Xeons from Intel) had a north bridge with a memory controller. These days more of the north bridge than just the IMC has moved onto the CPU.

HyperTransport/QPI is not really slow... just perhaps not as fast as direct local access.
 
Non-local memory access isn't recommended for OpenMPI, but I think those statements might only apply to the older QPI architecture; the newer QPI 1.1 architecture might have resolved those issues. The older Xeons had to go through an IOH to get to another CPU's local memory, which has to add latency not only to fetching the data but to requesting it in the first place. I'm guessing non-local memory access must be uncached as well -- unless there is some way of indicating whether a cache line is or isn't local.

How exactly would you request data from another CPU's address space? Are there special assembly instructions for NUMA operations? Or is it done with locking software semaphores controlled by the OS (which would add even more latency and complexity)?
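On cache-coherent x86 systems there are no special NUMA instructions: remote memory sits in the same physical address space, plain loads and stores reach it, and the hardware routes the request over QPI/HyperTransport while keeping the caches coherent. What software can do is ask the kernel where a page physically lives. Here's a minimal sketch using move_pages() in query mode, assuming Linux and libnuma; note this reports page placement, not a per-cache-line flag:

/* Minimal sketch: ask the kernel which NUMA node a page lives on.
 * move_pages() with a NULL node list queries instead of migrating.
 * Assumes Linux + libnuma. Build: gcc where.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>

int main(void)
{
    /* One page-aligned, touched page so it has physical backing. */
    char *buf = aligned_alloc(4096, 4096);
    buf[0] = 1;

    void *pages[1] = { buf };
    int status[1];

    /* nodes == NULL means "report current node", don't move anything. */
    if (move_pages(0 /* self */, 1, pages, NULL, status, 0) < 0) {
        perror("move_pages");
        return 1;
    }
    printf("page at %p is on node %d\n", (void *)buf, status[0]);
    free(buf);
    return 0;
}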
 
Here's the viewpoint of one of the developers of OpenMPI (an implementation of MPI, the message-passing API used in HPC clusters) on utilizing non-local NUMA memory for a process:

"well the Linux kernel does all its best to allocate memory on the local
NUMA node if it's available, so it is difficult to convince it to do
something harmful in this sense. I think that a way to test such a
situation would be to start mpi processes on a node in an usual way
-reasonably the processes will be bound to a socket or a core-, wait for
the processes to allocate their working memory, then either migrate the
processes on the other NUMA node (usually ==socket) or migrate its
memory pages, the command-line tools distributed with the numactl
package (numactl or migratepages) can probably allow to perform such a
vandalism; this would put your system into a worst-case scenario from
the NUMA point of view.

In our system, I noticed in the past some strong slowdowns related to
NUMA in parallel processes when a single MPI process doing much more I/O
than the others tended to occupy all the local memory as disk cache,
then the processes on that NUMA node were forced to allocate memory on
the other NUMA node rather than reclaiming cache memory on the local
node. I solved this in a brutal way by cleaning the disk cache regularly
on the computing nodes. In my view this is the only case where (recent)
Linux kernel does not have a NUMA-aware behavior, I wonder whether there
are HPC-optimized patches or something has changed in this direction in
recent kernel development."
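The migratepages "vandalism" described above can also be done from code. Here's a minimal sketch using libnuma's numa_migrate_pages(), roughly equivalent to running migratepages <pid> 0 1 from the numactl package; it assumes Linux, a two-node system, libnuma, and enough privileges to move another process's pages:

/* Minimal sketch: move every page of a running process from node 0
 * to node 1, so all of its memory becomes remote. Assumes Linux +
 * libnuma and suitable privileges. Build: gcc mig.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    int pid = atoi(argv[1]);

    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    numa_bitmask_setbit(from, 0);   /* pages currently on node 0 ... */
    numa_bitmask_setbit(to, 1);     /* ... are moved to node 1       */

    /* Afterwards every access the process makes crosses the
     * interconnect: the worst-case NUMA scenario the quote describes. */
    if (numa_migrate_pages(pid, from, to) < 0)
        perror("numa_migrate_pages");

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return 0;
}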
 