Here's the viewpoint of one of the developers of OpenMPI (an implementation of the MPI message-passing API used in HPC clusters) on using non-local NUMA memory for a process:
"well the Linux kernel does all its best to allocate memory on the local
NUMA node if it's available, so it is difficult to convince it to do
something harmful in this sense. I think that a way to test such a
situation would be to start mpi processes on a node in an usual way
-reasonably the processes will be bound to a socket or a core-, wait for
the processes to allocate their working memory, then either migrate the
processes on the other NUMA node (usually ==socket) or migrate its
memory pages, the command-line tools distributed with the numactl
package (numactl or migratepages) can probably allow to perform such a
vandalism; this would put your system into a worst-case scenario from
the NUMA point of view.
On our system, I have noticed strong NUMA-related slowdowns in parallel
jobs when a single MPI process doing much more I/O than the others
filled all the local memory with disk cache: the processes on that NUMA
node were then forced to allocate memory on the other NUMA node rather
than reclaiming cache memory on the local node. I solved this in a
brutal way by clearing the disk cache regularly on the compute nodes.
In my view this is the only case where the (recent) Linux kernel does
not behave in a NUMA-aware way; I wonder whether there are
HPC-optimized patches, or whether something has changed in this
direction in recent kernel development."