
NUMA and what vSphere knows about it

I think many of you have already had time to glance at and read this article in English on my blog, but for those who are still more comfortable reading in their native language than in a foreign one (as they would put it on dirty.ru, in anti-Mongolian), here is a translation of my latest article.



You probably already know that NUMA stands for Non-Uniform Memory Access. At the moment this technology is found in Intel Nehalem and AMD Opteron processors. To be honest, as someone who spends most of his working time on networking, I had always assumed that all processors compete equally with each other for access to memory, but with NUMA processors that picture is thoroughly outdated.



image



This is roughly how things looked before the new generation of processors arrived.


In the new architecture, each processor socket has direct access only to certain memory slots and forms a NUMA node. That is, with 4 processors and 64 GB of memory, you will have 4 NUMA nodes, each with 16 GB of memory.
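As a small aside from me (not part of the original article): on a Linux machine that exposes its NUMA topology, you can see exactly this socket/memory split for yourself by reading /sys. The sketch below is a minimal example and assumes a Linux guest or host with the /sys/devices/system/node interface available.

# Minimal sketch: list NUMA nodes, their CPUs and their local memory, by reading /sys.
# Assumes Linux with /sys/devices/system/node exposed (my illustration, not from the article).
import glob
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = node_dir.rsplit("node", 1)[-1]
    with open(f"{node_dir}/cpulist") as f:
        cpus = f.read().strip()
    with open(f"{node_dir}/meminfo") as f:
        meminfo = f.read()
    # meminfo lines look like: "Node 0 MemTotal:  67108864 kB"
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", meminfo).group(1))
    print(f"NUMA node {node}: CPUs {cpus}, {total_kb / 1024 / 1024:.1f} GB local memory")

On the 4-socket, 64 GB server from the example above, this would print four nodes with roughly 16 GB each.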



image



As I understand it, this new approach to memory access was driven by the fact that modern servers are so packed with processors and memory that it has become technologically and economically impractical to provide access to all memory over a single shared bus. That, in turn, leads to contention between processors for bus bandwidth and to poorer scalability of server performance.

The new architecture introduces two concepts: local memory and remote memory. While a processor accesses its local memory directly, it has to reach remote memory the old-fashioned way, over the shared bus, which means higher latency. This also means that to use the new architecture effectively, the OS must understand that it is running on NUMA hardware and place its applications/processes accordingly; otherwise the OS risks ending up in a situation where an application runs on the processor of one node while that application's memory address space sits on another node. A quick search shows that NUMA has been supported by Microsoft since Windows 2003 and by VMware at least since ESX Server 2.
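To give a feel for what "NUMA-aware" means from an application's point of view, here is a small sketch of my own (assuming a Linux system; this is not how ESX itself does it): it pins the current process to the CPUs of one NUMA node, so that under Linux's default local-allocation policy the memory it touches afterwards tends to come from that same node.

# Illustrative sketch (assumes Linux): pin this process to the CPUs of one NUMA node
# so that, under the default "local" allocation policy, its memory stays on that node.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse /sys/devices/system/node/nodeN/cpulist, e.g. '0-5,12-17'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        ranges = f.read().strip().split(",")
    cpus = set()
    for r in ranges:
        lo, _, hi = r.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

node = 0
os.sched_setaffinity(0, cpus_of_node(node))   # 0 means the current process
print(f"Now running only on node {node} CPUs: {sorted(os.sched_getaffinity(0))}")
# Memory allocated from here on is normally served from node 0's local memory.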



I am not sure whether you can see the NUMA node data in the GUI, but you can definitely see it in esxtop.



image



So here we can see that our server has 2 NUMA nodes and that each of them has 48 GB of memory. This document says that the first value shows the amount of local memory in the NUMA node and the second, in parentheses, the amount of free memory. However, a couple of times on my production servers I have seen the second value higher than the first, and I could not find any explanation for that.

As soon as an ESX server detects that it is running on hardware with a NUMA architecture, it immediately enables the NUMA scheduler, which in turn takes care of the virtual machines and makes sure that all vCPUs of each VM stay within a single NUMA node. In earlier versions of ESX (up to 4.1), for a VM to be handled efficiently on NUMA systems its number of vCPUs was always limited by the number of cores in a single processor; otherwise the NUMA scheduler simply ignored that VM and its vCPUs were spread evenly across all available cores. ESX 4.1, however, introduced a new feature called Wide VM. It allows us to assign more vCPUs to a VM than there are cores in a processor. According to the VMware document, the scheduler splits such a "wide virtual machine" into several NUMA clients, and each NUMA client is then handled by the standard scheme, within a single NUMA node. The memory, however, is still scattered across the NUMA nodes on which the vCPUs of this Wide VM run, because it is almost impossible to predict which part of the memory the vCPUs of a particular NUMA client will access. Despite this, Wide VMs still provide a significantly better memory access pattern than the standard "smearing" of a virtual machine across all NUMA nodes.
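To make the Wide VM idea more concrete, here is a simplified sketch of the splitting logic as I understand it from the VMware document (my own illustration, not VMware's actual scheduler code): a wide VM is broken into NUMA clients that each fit within the cores of one node.

# Simplified illustration of splitting a "wide" VM into NUMA clients.
# My own sketch of the idea, not VMware's actual algorithm.
import math

def split_wide_vm(vcpus: int, cores_per_node: int) -> list[int]:
    """Return the vCPU count of each NUMA client for a VM with `vcpus` vCPUs."""
    if vcpus <= cores_per_node:
        return [vcpus]                      # fits in one node: not a wide VM
    clients = math.ceil(vcpus / cores_per_node)
    base, extra = divmod(vcpus, clients)    # spread vCPUs as evenly as possible
    return [base + 1] * extra + [base] * (clients - extra)

# An 8-vCPU VM on a host with 4 cores per NUMA node -> two 4-vCPU NUMA clients.
print(split_wide_vm(8, 4))    # [4, 4]
# A 12-vCPU VM with 8 cores per node -> two 6-vCPU clients.
print(split_wide_vm(12, 8))   # [6, 6]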



Another great feature of the NUMA scheduler is that it not only decides where to place a virtual machine when it starts, but also constantly monitors that VM's ratio of local to remote memory. If this value drops below a threshold (unconfirmed: 80%), the scheduler starts migrating the VM to another NUMA node. Moreover, ESX controls the migration rate to avoid overloading the shared bus through which all NUMA nodes communicate.
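The idea behind that locality check can be sketched in a few lines (again my own illustration; the 80% threshold is the unconfirmed figure mentioned above, not an official value):

# Sketch of the locality check the NUMA scheduler is described as performing.
# The 80% threshold is the unconfirmed figure from the article, used here only as an example.
LOCALITY_THRESHOLD = 0.80

def memory_locality(local_mb: float, remote_mb: float) -> float:
    """Fraction of the VM's memory that sits on its home NUMA node."""
    total = local_mb + remote_mb
    return local_mb / total if total else 1.0

def should_consider_migration(local_mb: float, remote_mb: float) -> bool:
    return memory_locality(local_mb, remote_mb) < LOCALITY_THRESHOLD

# Example: 30 GB local, 18 GB remote -> 62.5% locality, migration worth considering.
print(f"{memory_locality(30720, 18432):.1%}", should_consider_migration(30720, 18432))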



It is also worth noting that when installing memory in a server you must put it into the correct slots, since it is the physical architecture of the server, not the NUMA scheduler, that determines how memory is distributed between NUMA nodes.

And finally, some useful information that you can learn from esxtop.



image



A brief description of the values:

NHN - NUMA node number

NMIG - number of migrations of the virtual machine between NUMA nodes

NRMEM - amount of remote memory used by the VM

NLMEM - amount of local memory used by the VM

N%L - percentage of the VM's memory that is local

GST_ND(X) - amount of memory allocated to the VM on node X

OVD_ND(X) - amount of memory spent on overhead on node X



I would like to note that, as usual, this entire article is just a compilation of what seemed interesting to me in recently read blogs of such comrades as Frank Denneman and Duncan Epping, as well as in official VMware documents.

Source: https://habr.com/ru/post/122535/


