NUMA (Non-Uniform Memory Access, sometimes expanded as Non-Uniform Memory Architecture) is not a new technology. I would even say it is quite old. In musical-instrument terms, it is not even a bayan (a Russian accordion, and Russian slang for a stale joke) but a harp.
Despite that, there are few sensible articles explaining what it is and, most importantly, how to work with it effectively. This post aims to correct that. It is intended primarily for those who know nothing about NUMA, but it also contains something interesting for NUMA experts. Most importantly, it makes my own life easier: I am an Intel engineer, and from now on every Russian-speaking developer who asks about NUMA will simply be pointed here.
Three heroes
We start with the negation of the negation, that is, with a look at Uniform Memory Access, better known as
SMP (Symmetric Multiprocessing).
SMP is an architecture in which the processors are connected to shared system memory through a bus (or a similar interconnect) symmetrically and have equal access to it. This is how all Intel multiprocessor machines were built (the diagram below shows the two-CPU case) back when the memory controller (MCH/GMCH), better known as the "North Bridge" ("NorthBridge"), lived in the chipset.
The disadvantage of SMP is obvious: as the number of CPUs grows, the bus becomes a bottleneck that severely limits the performance of memory-intensive applications. That is why SMP systems scale poorly; two or three dozen processors is already their theoretical limit.
An alternative to SMP for high-performance computing is
MPP (Massively Parallel Processing).
MPP is an architecture that divides the system into multiple nodes whose processors have access only to their local resources. MPP scales well, but it is not so well suited to programming: it provides no built-in mechanism for exchanging data between nodes. MPP software must therefore implement the communication, distribution, and scheduling of work across nodes itself, which does not suit every task or every programmer.
And finally,
NUMA (Non-Uniform Memory Access). This architecture combines the strengths of SMP and MPP. A NUMA system is divided into multiple nodes, each of which has access both to its own local memory and to the memory of the other nodes (logically called "remote").
Naturally, access to remote memory is much slower than to local memory, hence the name "non-uniform memory access". This is not only the name but also the main drawback of the NUMA architecture, one that can be mitigated by the special software optimizations discussed below.
This is how a two-socket Intel Xeon NUMA system looks (it was in Xeon that Intel's NUMA debuted), with the memory controllers integrated into the CPUs.
The processors are connected by QPI (Intel QuickPath Interconnect), a point-to-point link with high bandwidth and low latency.
The figure does not show the processor caches, but all three levels are, of course, there, and they bring up a NUMA feature that must be mentioned: the NUMA used in Intel systems maintains coherence of the caches and shared memory (that is, consistency of data between the caches of different CPUs), so it is sometimes called
ccNUMA, cache-coherent NUMA. This means there is dedicated hardware for keeping the contents of the caches consistent with each other, and with memory, whenever more than one cache stores the same part of it. Of course, this coherence traffic reduces overall system performance, but without it, programming a system with an unpredictable state of current data would be extremely "interesting". To reduce this effect, you should avoid situations where several processors work with the same block of memory at once (not necessarily the same variable!). This is exactly what NUMA-aware products try to do.
Thus, from hardware we move smoothly to the software and performance of NUMA systems.
So, NUMA is supported by the following operating systems: Windows Server 2003, Windows XP 64-bit and Windows Vista (up to 64 logical processors); Windows 7 and Windows Server 2008 R2 (full support); Linux with kernel 2.6 and later; and the UNIX systems Solaris and HP-UX.
If we talk about databases, NUMA is supported by Oracle8i, Oracle9i, Oracle10g and Oracle11g, as well as by SQL Server 2005 and SQL Server 2008.
NUMA support is also available in Java SE 6u2 (JVM 1.6), and in the .NET runtime on the aforementioned versions of Windows.
The Intel Math Kernel Library (MKL) fully supports NUMA.
"
NUMA support " means the following - the product is aware of the NUMA topology of the machine on which it is executed, and tries to use it as efficiently as possible, that is, to organize the work of the streams so that they fully use the memory of their node (the one on which this thread is running ) and minimally - aliens. The key word here is “trying”, since in general it is not always possible to do this.
Therefore, it may happen that a product that does not support NUMA, that is, simply does not know about it, which does not prevent it from being launched and executed on NUMA-systems, will show not worse performance than officially supporting NUMA. An example of such a product is the famous
Intel Threading Building Blocks library.
That is why the BIOS of multi-socket NUMA servers has a special "enable/disable NUMA" option. Of course, disabling NUMA in the BIOS does not change the system topology in any way; remote memory does not move any closer. Only one thing happens: the system stops reporting to the OS and software that it is NUMA, so memory allocation and thread placement proceed "as usual", as on a symmetric multiprocessor system.
If NUMA is enabled in the BIOS, the operating system learns the configuration of the NUMA nodes from the System Resource Affinity Table (SRAT) in the
Advanced Configuration and Power Interface (ACPI) tables. Applications can obtain this information through the
libnuma library on Linux and, on you-know-which systems, through the
Windows NUMA interface.
Obtaining this information is only the beginning of NUMA support in your application. It must be followed by an actual attempt to use NUMA as efficiently as possible. The general words on this subject have already been said; for further explanation I will turn to a specific example.
Suppose you allocate memory with
malloc. On Linux, malloc only reserves memory; the physical allocation happens when the memory is first accessed. In that case the memory is automatically allocated on the node that uses it, which is very good for NUMA. On Windows, malloc works differently: it allocates physical pages right at allocation time, on the node of the thread doing the allocating. The memory may therefore turn out to be remote for the other threads that use it. But Windows does have NUMA-friendly allocation. It is
VirtualAlloc, which can work just like malloc on Linux. An even more advanced option is
VirtualAllocExNuma from the Windows NUMA API.
Take the following simple example using OpenMP:

    main() {
        ...
        #pragma omp parallel
        {
            // Parallelized TRIAD loop
            #pragma omp for private(j)
            for (j = 0; j < N; j++)
                a[j] = b[j] + scalar * c[j];
        } // end omp parallel
        ...
    } // end main
You can make it NUMA-friendly by having each thread initialize its own data, which binds the corresponding physical memory to the node that uses it:

    KMP_AFFINITY=compact,0,verbose

    main() {
        ...
        a = (char *) VirtualAlloc(NULL,          // same for b and c
                N * sizeof(double) + 1024,
                MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        ...
        #pragma omp parallel
        {
            // each thread first-touches its share of a, b and c
            #pragma omp for private(i)
            for (i = 0; i < N; i++) {
                a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
            }
            ...
            // OpenMP on TRIAD loop
            #pragma omp for private(j)
            for (j = 0; j < N; j++)
                a[j] = b[j] + scalar * c[j];
        } // end omp parallel
        ...
    } // end main
Affinity deserves a separate mention here: the forced binding of threads to specific processors, which prevents the operating system from migrating threads between processors, since such migration could "tear" a thread away from the local memory it uses.
Both Linux and Windows have APIs for setting affinity (the standard Windows API and the
NUMA WinAPI). The functionality is also present in many parallel libraries; in the OpenMP example above, for instance, the environment variable KMP_AFFINITY is responsible for it.
But you have to understand two things. First, affinity does not always work (for the system it is a hint rather than an order). Second, setting affinity pays off only when you fully control the system, that is, when your application is the only one running on it and the OS itself is not loading it heavily. If, as is usually the case, several applications are running, each intensively using CPU and memory and trying to pin itself to the same processors while knowing nothing about the others, and the OS is competing for the same resources, then affinity can do more harm than good.
Performance
And now the fun part. Let us find out how non-uniform memory access in NUMA really is, and how much the performance of real applications actually depends on that non-uniformity.
First, the theoretical data. According to Intel presentations, "the latency of access to remote memory is ~1.7x that of local memory, and local memory bandwidth can be up to twice that of remote memory".
Real-server data for the Xeon 5500 is given in a Dell datasheet: "local memory access latency is 70 nanoseconds versus 100 nanoseconds for remote (that is, ~1.4x), and local memory bandwidth exceeds remote by 40%".
On your own system, similar approximate data can be obtained with the free Microsoft Sysinternals utility
CoreInfo, which estimates the relative "cost" of memory access between NUMA nodes. The result is, of course, very rough, but some conclusions can be drawn from it.
An example of CoreInfo output:

    Calculating Cross-NUMA Node Access Cost...
    Approximate Cross-NUMA Node Access Cost (relative to fastest):
         00   01
    00: 1.0  1.3
    01: 1.2  1.0
But the main question is how much the difference in the "cost" of access to NUMA memory affects the performance of a real application as a whole. While preparing this article, I came across a very interesting
post by the SQL specialist Linchi Shea assessing the impact of NUMA on SQL Server performance.
The measurements were carried out on an HP ProLiant DL360 G7 with two Intel Xeon X5690s, giving 12 cores (24 logical CPUs) in total, and compared two scenarios for Microsoft SQL Server 2008 R2 Enterprise x64:
- local memory only: all queries are processed on the first NUMA node, in whose memory the test table resides;
- remote memory only: all queries are processed on the second NUMA node, using the same table in the first node's memory.
The test was performed in a technically competent way, so there is no reason to doubt its validity. For details, see
Linchi's original post (in English).
Here I will only give the results: a comparison of query-processing times for the two scenarios.
As you can see, the difference is just over 5%! The result is pleasantly surprising. And this is the maximum difference, reached with 32 threads running queries simultaneously (with other thread counts the difference is even smaller).
So is NUMA optimization necessary? Let me come at this from afar. Although I never have time to clean my apartment, I do find time to read cleaning tips :). And one of the useful tips I have seen is this: to have less to clean up, avoid potential mess, and for that, keep every thing as close as possible to the place where it is used.
Now replace "things" with "data" and "apartment" with "program", and you will see one of the ways to keep order in your programs. And that is exactly the NUMA optimization you have just read about.