
Practice shows that almost any process can be optimized to some extent, and virtualization is no exception. There are many optimization opportunities here, and the task is multifaceted.
In this article I want to introduce you to methods of sizing virtual machines and to ways of optimizing their operation. The material is technical and is recommended reading for all vSphere specialists.
First I would like to talk about two technologies that have the greatest impact on vSphere performance: NUMA and the ESXi hypervisor scheduler.
There are many detailed articles about NUMA; I see no point in retelling them, so I will limit myself to a basic description for the sake of completeness. NUMA stands for Non-Uniform Memory Access.
Modern multi-socket servers are, in effect, several isolated single-socket computers combined on one motherboard. Each processor exclusively owns its own RAM slots, and only it has direct access to them. Likewise, each processor has its own PCI-E buses, to which various motherboard devices and PCI-E expansion slots are connected. The processors are linked by a high-speed interconnect, through which they gain access to "foreign" devices by making a request to the corresponding host processor. For obvious reasons, a processor accesses "its own" memory with much lower overhead than "foreign" memory. For now, this is all you need to know about the technology.
vSphere is well aware of NUMA and tries to place a machine's virtual cores on the physical processors whose memory currently holds that virtual machine's memory. But there are pitfalls. Server vendors like to enable memory interleaving across NUMA nodes in the BIOS by default, so the server appears to the operating system as a uniform (non-NUMA) device and vSphere cannot apply its NUMA optimizations. The vSphere documentation recommends disabling this option in the BIOS, which lets vSphere handle the issue itself.
Let us consider the potential problems that can arise with NUMA, assuming vSphere sees it and works with it correctly. As the simplest NUMA example, take a two-processor system with 64 GB of RAM on the physical server, 32 GB per socket.
1. We create a virtual machine (VM) with one virtual core (vCPU). Naturally, only one physical processor can execute such a VM. Now suppose the VM must be given 48 GB of RAM. The VM will take 32 GB from "its" physical processor and the remaining 16 GB from the "foreign" one. Every access to those "foreign" gigabytes comes with guaranteed high latency, which noticeably slows the virtual machine down and increases the load on the interconnect between the physical processors. The situation can be corrected by giving the virtual machine 2 virtual sockets with 1 core each. vSphere will then distribute the 2 vCPUs of the VM across different physical processors, taking 24 GB of RAM from each. By giving the virtual machine only 1 virtual socket, we deprive it of the ability to make correct use of 2 physical processors.
2. In addition to the VM from the first example, we now create a second VM with 10 GB of RAM and one vCPU. If this VM were alone, there would be no problem: it fits into one NUMA node with 32 GB of RAM. But the first machine has already consumed 24 GB of memory from each NUMA node, leaving only 8 GB per node. In this situation either the first or the second VM will start using "foreign" memory, even though each of them appears to be configured correctly from the NUMA point of view. This is a very typical mistake: you need to calculate your virtual infrastructure very carefully during design and take a comprehensive approach to configuring virtual machines during operation.
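A minimal sketch of the arithmetic behind these two examples, assuming the hypothetical 2-socket host with 32 GB per NUMA node described above (this is an illustration, not a vSphere API call):

```python
# Check whether the per-NUMA-node memory demand of a set of VMs fits into
# the 32 GB available on each node of the hypothetical 2-socket, 64 GB host.

NODE_MEM_GB = 32          # memory local to each NUMA node
NODES = 2

def per_node_usage(vm_mem_per_node_gb):
    """vm_mem_per_node_gb: list of tuples, per-node memory demand per VM (GB)."""
    used = [0] * NODES
    for demand in vm_mem_per_node_gb:
        for node in range(NODES):
            used[node] += demand[node]
    return [(node, used[node], used[node] <= NODE_MEM_GB) for node in range(NODES)]

# Example 1+2: the 48 GB VM split as 2 x 24 GB across two virtual sockets,
# plus a 10 GB single-vCPU VM placed on node 0.
big_vm   = (24, 24)
small_vm = (10, 0)

for node, used, ok in per_node_usage([big_vm, small_vm]):
    print(f"node {node}: {used} GB used of {NODE_MEM_GB} GB, fits locally: {ok}")
# node 0 ends up needing 34 GB of 32 GB -> one of the VMs will get remote memory.
```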
I would like to note that vSphere has its own clear logic when working with NUMA and Hyperthreading.
If a VM has only 1 virtual socket, then as the number of vCPUs grows, the machine will be executed on one physical processor, exclusively on its physical cores, without using Hyperthreading. If the number of vCPUs exceeds the number of the processor's physical cores, the VM will still run within that physical processor, but will now use Hyperthreading. If the number of vCPUs exceeds the number of processor cores including Hyperthreading, then cores of neighboring NUMA nodes (other physical processors) will also be used, which leads to a loss of performance (if an incorrect number of virtual sockets is specified). When a physical processor is heavily loaded and there are no free physical cores, Hyperthreading will be used in any case (unless specified otherwise in the virtual machine configuration). In terms of numbers, a VM running purely on Hyperthreading threads loses on average about 30-40% of performance compared to running on dedicated physical cores. On the other hand, the physical processor itself delivers roughly 30% more total throughput with Hyperthreading enabled than without it (using only physical cores). This figure depends heavily on the type of load and on how well the VM's applications are optimized for multi-threaded work.
If a VM has more than one virtual socket, vSphere will optimize its performance by distributing the executing cores and RAM across different physical processors of the server.
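A minimal sketch of the single-virtual-socket placement regimes described above; the thresholds mirror the article's description, not an actual vSphere algorithm:

```python
# For a VM with one virtual socket, classify where its vCPUs land relative
# to one socket's physical cores and Hyperthreading threads.

def placement_regime(vcpus, phys_cores_per_socket, ht_enabled=True):
    logical = phys_cores_per_socket * (2 if ht_enabled else 1)
    if vcpus <= phys_cores_per_socket:
        return "fits on the physical cores of one socket"
    if vcpus <= logical:
        return "fits on one socket, but only by using Hyperthreading"
    return "spills onto neighboring NUMA nodes -> expect a performance penalty"

# Hypothetical 6-core socket with Hyperthreading (12 logical threads):
for vcpus in (4, 6, 8, 12, 16):
    print(vcpus, "vCPU ->", placement_regime(vcpus, phys_cores_per_socket=6))
```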
For obvious reasons, the load on a physical server changes constantly, and it is often distributed unevenly across the physical processors. vSphere keeps an eye on this. The scheduler analyzes the situation, calculates the overhead of moving a VM or part of it to another NUMA node (moving memory and cores), compares this cost with the potential benefit of the move, and decides whether to keep everything as it is or to migrate. In other words, in any situation vSphere tries to optimize the operation of virtual machines. Our task is to make this job easier for vSphere and not put it in a stalemate.
What kind of losses are we talking about if virtual machines are configured incorrectly with respect to NUMA nodes? In my experience, the losses can reach 30% of the overall performance of a physical server. Whether that is a lot or a little is for you to decide.
Now that we have sorted out NUMA and Hyperthreading a bit, I would like to talk in more detail about how the scheduler works with virtual machine cores, the vCPUs.
I will not go into great depth, since few people find that interesting, but I will try to explain the principles by which this mechanism operates. The main mechanism of the hypervisor works as follows: a constantly running process in RAM services the virtual machines. This process can be thought of as a READY-state pipeline plus a WAIT-state holding area. We will omit the other, less significant virtual machine states; they are not important right now.
For ease of perception, I suggest treating all the vCPUs of a virtual machine as a chain. Each link in the chain is a vCPU core (a "world", in vSphere terms). A machine has as many of these worlds as it has vCPUs. There are also two additional, invisible service worlds associated with each virtual machine: one is responsible for servicing the machine as a whole, the other for its I/O. The actual compute consumption of these service worlds is negligible, and their memory consumption can be estimated from each virtual machine's overhead figure. It is worth noting that when you create a VM the size of the entire physical host, with the number of virtual cores equal to the number of physical cores, there may be some performance loss. I will touch on this point a little further below.
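A small sketch of this "worlds" bookkeeping, using the article's simplified model of one world per vCPU plus two service worlds per VM (the VM inventory is hypothetical):

```python
# Count schedulable worlds per VM and per host under the simplified model above.
vms = {"db01": 8, "app01": 4, "web01": 2, "web02": 2}   # name -> vCPU count

worlds_per_vm = {name: vcpus + 2 for name, vcpus in vms.items()}
print(worlds_per_vm)                                    # e.g. {'db01': 10, ...}
print("total schedulable worlds:", sum(worlds_per_vm.values()))
```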
The READY pipeline is a "pipe" into which all these chains are dropped. A timeslot is the interval during which virtual machines (their vCPUs) are executed by a physical processor. During this time the virtual machine uses the physical processors (pCPUs) with almost no losses, much like a physical server would. The maximum length of a timeslot is artificially limited to about 1 millisecond. When the timeslot ends, the scheduler forcibly places the VM's vCPUs back into the READY queue. The scheduler can change the order of virtual machines in the READY pipeline; each machine's priority is calculated from its current actual load, its entitlement to resources (Shares), how long it has already spent on the physical processor, and several other, less significant parameters. It is logical to assume that if there is only one VM on the host, it will always pass through the READY queue without delay, since it has no other VMs competing for resources.
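A purely illustrative model of this READY-queue ordering; the real ESXi scheduler is far more complex, but here priority simply grows with configured Shares and shrinks with recently received CPU time:

```python
from dataclasses import dataclass

@dataclass
class Vm:
    name: str
    shares: int            # configured resource entitlement
    recent_cpu_ms: float   # CPU time received recently, in ms

def priority(vm: Vm) -> float:
    # More shares -> higher priority; more recent CPU time -> lower priority.
    return vm.shares / (1.0 + vm.recent_cpu_ms)

ready_queue = [
    Vm("db01",  shares=2000, recent_cpu_ms=5.0),
    Vm("web01", shares=1000, recent_cpu_ms=0.5),
    Vm("app01", shares=1000, recent_cpu_ms=8.0),
]

for vm in sorted(ready_queue, key=priority, reverse=True):
    print(vm.name, round(priority(vm), 1))
```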
How does the scheduler place virtual machine worlds on physical cores? Imagine a table with as many columns as our physical server has cores (including Hyperthreading). Each row corresponds to the next slice of the server's processor time. Let's take a 2-processor server with 6 physical cores per processor plus Hyperthreading, for a total of 24 logical cores. Suppose the server is under high load and vSphere is forced to use Hyperthreading; this also makes the example easier to follow. Suppose we have the following virtual machines:
• 4 machines with 1 vCPU each
• 4 machines with 2 vCPUs each
• 2 machines with 8 vCPUs each
• 1 machine with 16 vCPUs
The first timeslot arrives, and the scheduler selects candidates. Let the 16-vCPU machine go first. The first 16 physical cores (pCPUs) are occupied, and 8 remain, into which we can place, for example, the 4 machines with 2 vCPUs each (note that the 16-vCPU machine must have 2 virtual sockets to work correctly with NUMA). That's it, the timeslot is full: an ideal situation. We lose no performance and no pCPU sits idle. The remaining virtual machines do not run during this time and sit in the READY queue. Suppose the "lucky" virtual machines all received the same, maximum timeslot (although in reality all VM timeslots differ and depend on many factors). The second timeslot then arrives, and the scheduler needs to fill it.
The machines still waiting:
• 4 machines with 1 vCPU each
• 2 machines with 8 vCPUs each
We start filling: 2 machines with 8 vCPUs each plus the 4 machines with 1 vCPU each, which leaves 4 free pCPUs, where we can place two of the 2-vCPU machines that already ran in the first timeslot. Nothing else will fit. The timeslot is again full, and we lose no capacity.

In the same way, the scheduler will continue to fill the timeslots with virtual machine worlds, trying to do so as efficiently as possible, reducing the "holes" in the filling and increasing the efficiency of the virtual environment.
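A minimal sketch of this timeslot-filling idea: greedily pack the example VMs' vCPU counts into 24 logical cores per timeslot. This only illustrates the principle; it is not how the ESXi scheduler actually works, and for simplicity each VM is packed once rather than competing again in later slots:

```python
HOST_LOGICAL_CORES = 24   # 2 sockets x 6 cores x 2 (Hyperthreading)

# vCPU counts of the VMs from the example: 1x16, 2x8, 4x2, 4x1
vms = [16, 8, 8, 2, 2, 2, 2, 1, 1, 1, 1]

def fill_timeslots(vm_sizes, capacity):
    pending = sorted(vm_sizes, reverse=True)   # biggest first
    timeslots = []
    while pending:
        free, placed, rest = capacity, [], []
        for size in pending:
            if size <= free:
                placed.append(size)
                free -= size
            else:
                rest.append(size)
        timeslots.append(placed)
        pending = rest
    return timeslots

for i, slot in enumerate(fill_timeslots(vms, HOST_LOGICAL_CORES), 1):
    print(f"timeslot {i}: {slot} -> {sum(slot)}/{HOST_LOGICAL_CORES} cores busy")
```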
Below is a negative example of placing virtual machines on a host: one large machine the size of the host and one small machine with 1 vCPU. With equal rights to resources and equal demand for performance, these machines will receive the same number of timeslots, i.e. they will split the processor time between them. Since the two machines cannot run at the same time (they will not fit into one timeslot), they will run in turn, and the small machine will run in an otherwise empty timeslot, with no one else in it (the timeslot is largely wasted). The big machine will not be able to get processor time even when it needs it. For example, on a central processor clocked at 3 GHz, each of these machines will be able to get at most 1.5 GHz.
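A quick back-of-the-envelope calculation of that example, assuming two VMs with equal Shares splitting the host's processor time 50/50:

```python
# Effective per-core clock each VM can count on when the two VMs run in turn.
cpu_clock_ghz = 3.0
timeslot_share = {"huge_vm": 0.5, "tiny_vm": 0.5}   # equal rights -> equal slots

for name, share in timeslot_share.items():
    print(f"{name}: at most {cpu_clock_ghz * share:.1f} GHz of effective clock per core")
```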

Virtualization allows you to create many virtual machines on a host, and a situation may arise where the total number of vCPUs across all machines is greater than the number of pCPUs on the physical host. This is quite normal, but you need to understand clearly that in this case not every vCPU will be able to get 100% of its nominal capacity at the same time. In other words, if you have several loaded machines on one virtualization host and their total vCPU count exceeds the host's pCPU count, then with a high degree of probability these machines will interfere with each other, which will reduce their overall performance.
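A minimal sketch for checking the vCPU:pCPU overcommitment ratio on a host; the VM inventory is hypothetical:

```python
host_pcpus = 24                          # logical processors on the host
vm_vcpus = [16, 8, 8, 4, 4, 2, 2, 1, 1]  # vCPU count of each VM on the host

total_vcpus = sum(vm_vcpus)
ratio = total_vcpus / host_pcpus
print(f"vCPU:pCPU = {total_vcpus}:{host_pcpus} ({ratio:.2f}:1)")

# Overcommitment itself is normal; the point is that busy VMs will start
# to compete for pCPUs as the ratio (and their actual load) grows.
if ratio > 1:
    print("vCPUs are overcommitted: loaded VMs may interfere with each other")
```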
Now I would like to return to the idea of creating a virtual machine the size of the entire host. Suppose such a VM occupies an entire timeslot and runs through it. Now recall the system worlds that accompany this machine. They also need to be executed, as do the hypervisor itself and its own system processes. That is, they have to be given timeslots and take up a certain number of pCPUs in them. And what does our big VM do during that time? Right, it waits for ALL pCPUs to become free so it can fit. In other words, we are guaranteed to lose performance to the hypervisor's service tasks (timeslots), and that is bad (recall the example above with the large and small machines). For instance, a software iSCSI initiator under heavy load can consume up to 6 GHz of processor capacity. With small VMs this would not be as noticeable: they could run alongside the service processes (in the same timeslot). A large VM cannot do this: it needs the entire timeslot, all of its pCPUs, and it cannot fit into a timeslot if even one of its pCPUs is already occupied by someone, even a system process.
What kind of losses are we talking about if the virtual infrastructure is configured incorrectly and the machines are placed poorly across the nodes? From zero to infinity (in theory). It all depends on the specific situation.
Separately, I would like to state the main rule when sizing virtual machines: give a virtual machine the MINIMUM amount of resources with which it can still perform its tasks. Do not give a VM 2 cores if one is enough; do not give it 4 if 2 are enough (extra cores take up space in the timeslot). The same applies to memory: do not give a machine too much. Another machine may end up short of it, not to mention the problems this causes for live migration (which essentially copies the VM's memory) and NUMA.
Now that we have dealt with the mechanism that places virtual machine vCPUs onto pCPU timeslots, let's recall NUMA and its placement rules. All those rules matter to the hypervisor scheduler when it fills a timeslot, because the pCPUs in a timeslot can belong to different NUMA nodes. So, in addition to having to account for NUMA when configuring virtual machines on a host, we also have the constraints imposed by the way the hypervisor scheduler fills timeslots. If we want good VM performance, we need to keep all these pitfalls in mind and follow these rules:
• Try not to create giant machines (relative to the size of the host)
• For large machines, consider the limitations imposed by NUMA technology.
• Do not over-provision vCPUs relative to the host's pCPUs
• Several small or medium-sized machines will always have an advantage in flexibility and overall performance over one huge machine
Finally, I would like to say a few words about how the hypervisor scheduler handles I/O. When a virtual machine accesses its virtual hardware, the hypervisor suspends the VM, removes it from the pCPU, and places it into WAIT storage. In this state the machine does not run; it simply waits. During this time the hypervisor translates the guest machine's virtual device commands into the corresponding real hypervisor commands, after which it returns the virtual machine to the READY pipeline. A similar "freeze" of the virtual machine also occurs when a virtual device responds to the machine (the hypervisor must translate the response again, in the opposite direction). The more I/O commands a virtual machine issues, the more often it is "frozen" in WAIT, and the lower its performance. And the more "legacy" virtual I/O devices a virtual machine uses, the harder it is for the hypervisor to translate the commands, and the longer the VM stays in the WAIT state.
VMware does not directly and officially recommend virtualizing applications with hyperactive I/O. You can reduce the negative impact of the WAIT state by using paravirtual devices in the virtual machine: the 10-Gigabit VMXNET3 network adapter and the PVSCSI paravirtual SCSI disk controller. Likewise, using hardware designed to accelerate virtual machines in physical servers helps reduce the impact of WAIT and improves overall performance: various network and HBA adapters with hardware iSCSI offload, direct memory access, network cards with virtualization support, and so on.
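A minimal sketch, using the pyVmomi library, that lists which network adapter and SCSI controller types each VM uses, so non-paravirtual devices are easy to spot. The vCenter host name and credentials are placeholders, and error handling is omitted:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab use only
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualEthernetCard):
            kind = ("VMXNET3" if isinstance(dev, vim.vm.device.VirtualVmxnet3)
                    else type(dev).__name__)
            print(vm.name, "NIC:", kind)
        elif isinstance(dev, vim.vm.device.VirtualSCSIController):
            kind = ("PVSCSI" if isinstance(dev, vim.vm.device.ParaVirtualSCSIController)
                    else type(dev).__name__)
            print(vm.name, "SCSI controller:", kind)

Disconnect(si)
```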
I would like to stop here. I hope the information in this article was interesting to you and will help you approach the design and operation of your virtual infrastructure more effectively.