Logically, this article should have appeared first, before the article about
Transparent Page Sharing , since it lays the foundation with which a dive into memory management in vSphere 4.1 should begin.
In my English-language
blog , when I first started studying this topic, I split it into two parts - it was easier for me to absorb information that was completely new to me. But since the audience on Habr is serious and experienced, I decided to combine the material into one article.
We begin with the most basic element, called the
Memory Page . It is defined as a contiguous block of data of a fixed size used for memory allocation. Typically, a page is 4 KB (Small Page) or 2 MB (Large Page) in size. Each application in the OS is given 2 GB of virtual memory that belongs to that application alone. To know which physical memory page (Physical Address - PA) a given virtual memory page (Virtual Address - VA) corresponds to, the OS keeps track of all memory pages using the
Page Table . All correspondences between VA and PA are stored there.
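To make the VA - PA idea concrete, here is a minimal sketch in Python (not how a real OS stores anything; the table contents and names are invented for illustration): the Page Table is modeled as a mapping from virtual page number to physical page number, and translating an address means splitting it into a page number and an offset.

```python
PAGE_SIZE = 4 * 1024  # a Small Page, 4 KB

# Hypothetical Page Table: virtual page number -> physical page number
page_table = {0: 42, 1: 7, 2: 105}

def translate(virtual_address):
    """Translate a virtual address (VA) into a physical address (PA)."""
    vpn = virtual_address // PAGE_SIZE    # virtual page number
    offset = virtual_address % PAGE_SIZE  # position inside the page
    ppn = page_table[vpn]                 # look up the physical page number
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # VA in virtual page 1 -> PA 0x7234 in physical page 7
```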
Next, we need a tool that, on a memory request from an application, can find the necessary VA - PA pair in the Page Table. That tool is the Memory Management Unit (MMU). Searching for a VA - PA pair is not always fast, considering that a 2 GB virtual address space can hold up to 524,288 pages of 4 KB each. To speed up this search, the MMU actively uses the Translation Lookaside Buffer (TLB), which stores recently used VA - PA pairs. Each time an application makes a memory request, the MMU first checks the TLB for the VA - PA pair. If it is there - great, the PA is handed to the processor - this is called a TLB hit. If nothing is found in the TLB (a TLB miss), the MMU has to walk the entire Page Table and, once the desired pair is found, put it into the TLB and signal that the memory request can be retried.
If the needed page has been swapped out, it is first brought back from swap into memory, then the VA - PA pair is written to the TLB, and only then does the application's memory access complete. Visually, it looks like this

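The lookup order described above can be summarized in a small Python sketch. All the names (tlb, page_table, swap, swap_in) are invented for illustration; the only point is the order of the checks: TLB first, then a Page Table walk, then a swap-in if the page is not in memory.

```python
tlb = {}         # recently used VPN -> PPN pairs (tiny in real hardware)
page_table = {}  # all VPN -> PPN pairs the OS knows about
swap = {}        # pages pushed out of RAM to disk

def swap_in(vpn):
    # Hypothetical helper: bring the page back from swap into physical memory
    return swap.pop(vpn)

def lookup(vpn):
    if vpn in tlb:                      # TLB hit: the fast path
        return tlb[vpn]
    if vpn not in page_table:           # TLB miss and the page is swapped out
        page_table[vpn] = swap_in(vpn)  # first return it to memory
    tlb[vpn] = page_table[vpn]          # cache the pair for the next access
    return tlb[vpn]

swap[3] = 99       # pretend virtual page 3 sits in swap, backed by physical page 99
print(lookup(3))   # TLB miss + swap-in, then the pair lands in the TLB
print(lookup(3))   # TLB hit this time
```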
The TLB is quite limited in size. In Nehalem processors, the first-level TLB holds 64 entries for 4 KB pages or 32 entries for 2 MB pages, while the second-level TLB works only with small pages and holds 512 entries. From this it is easy to see why large pages lead to noticeably fewer TLB misses: with small pages the TLB covers (64 x 4) + (512 x 4) = 2304 KB of memory, while with large pages it covers 32 x 2 = 64 MB.
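The same back-of-the-envelope arithmetic, written out as a tiny script (the entry counts are simply the Nehalem figures quoted above):

```python
KB, MB = 1024, 1024 * 1024

# How much memory the TLB can cover ("TLB reach")
small_pages_reach = 64 * 4 * KB + 512 * 4 * KB  # L1 (64 entries) + L2 (512 entries), 4 KB pages
large_pages_reach = 32 * 2 * MB                 # L1 only (32 entries), 2 MB pages

print(small_pages_reach // KB, "KB")  # 2304 KB
print(large_pages_reach // MB, "MB")  # 64 MB
```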
I cannot resist quoting Wikipedia's figures for the cost of a TLB miss on a certain "average" TLB:
Size: 8 - 4,096 entries
Hit time: 0.5 - 1 clock cycle
Miss penalty: 10 - 100 clock cycles
Miss rate: 0.01 - 1%
If a TLB hit takes one CPU cycle, a TLB miss costs 30 cycles, and the average TLB miss rate is 1%, then the average number of cycles per memory request is 1 + 0.01 x 30 = 1.30.
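In code, the same estimate looks like this (the numbers are just the Wikipedia figures above):

```python
hit_time = 1        # cycles for a TLB hit
miss_penalty = 30   # extra cycles spent walking the Page Table on a TLB miss
miss_rate = 0.01    # 1% of memory requests miss the TLB

average_cycles = hit_time + miss_rate * miss_penalty
print(average_cycles)  # 1.3
```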
All these considerations and calculations hold when the OS runs on a physical server. But when the OS runs in a virtual machine, there is one more level of memory translation. When an application in a virtual machine makes a memory request, it uses a virtual address (VA), which must be translated into the physical address of the virtual machine (PA), which in turn must be translated into the physical address of the ESXi host's memory (HA). That is, we need two Page Tables: one for VA - PA pairs, the other for PA - HA translations.
If we try to depict this graphically, we get roughly the following picture.

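A minimal sketch of the two-level translation, with invented table contents: the guest OS owns the VA - PA table, while the VMM/ESXi host owns the PA - HA table.

```python
guest_page_table = {0: 5, 1: 6}      # VA page -> PA page, maintained by the guest OS
host_page_table = {5: 200, 6: 201}   # PA page -> HA page, maintained on the ESXi host

def translate_va_to_ha(va_page):
    pa_page = guest_page_table[va_page]  # first level: inside the virtual machine
    ha_page = host_page_table[pa_page]   # second level: on the host
    return ha_page

print(translate_va_to_ha(1))  # 201
```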
I don't want to dive into the deep historical jungle of memory virtualization technologies from the dawn of ESX, so we will look at the last two: Software Memory Virtualization and Hardware Assisted Memory Virtualization. Here I need to introduce another participant into our story, the Virtual Machine Monitor (VMM). The VMM takes part in executing all instructions that the virtual machine issues to the processor and memory. There is a separate VMM process for each virtual machine.
Now that we have refreshed the basics, we can move on to the details.
Software Memory Virtualization

As the name implies, this is a software implementation of memory virtualization. The VMM creates a Shadow Page Table, into which the following translations are copied:
1. VA - PA; these address pairs are copied directly from the virtual machine's Page Table, whose contents are the responsibility of the guest OS.
2. PA - HA; the VMM itself is responsible for these address pairs.
That is, it is a kind of combined table of double translation. Every time an application in a virtual machine accesses memory at a VA, the MMU looks into the Shadow Page Table and finds the corresponding VA - HA pair, so that the physical processor can work with the real physical address of the host's memory (HA). At the same time, the virtual machine remains isolated from the real MMU, so that one virtual machine cannot gain access to the memory of another virtual machine.
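A minimal sketch of the Shadow Page Table idea (the tables here are hypothetical): the VMM pre-composes VA - HA pairs out of the guest's VA - PA table and its own PA - HA table, so a memory access needs only a single lookup - but the composed table has to be kept in sync whenever the guest changes its own Page Table.

```python
guest_page_table = {0: 5, 1: 6}     # VA -> PA, owned by the guest OS
vmm_page_table = {5: 200, 6: 201}   # PA -> HA, owned by the VMM

# Built and kept up to date by the VMM; it must be refreshed every time
# the guest modifies its own Page Table (this is where the overhead lives).
shadow_page_table = {va: vmm_page_table[pa] for va, pa in guest_page_table.items()}

print(shadow_page_table[1])  # 201 - one lookup straight from VA to HA
```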
According to VMware documentation, this type of address translation is quite comparable in speed to address translation on an ordinary physical server.
The question is, why then make a fuss and invent new technologies? It turns out there is a reason: not all types of memory requests from a virtual machine can be executed at the same speed as on a physical server. For example, every time something changes in the Page Table of the virtual machine's OS (that is, a VA - PA pair changes), the VMM has to intercept that request and update the corresponding Shadow Page Table entry (that is, the resulting VA - HA pair). Another good example is when an application makes its very first request to a particular memory location. Since the VMM has not heard anything about that VA yet, a new entry has to be created in the Shadow Page Table, again adding latency to the memory access. And finally, although it is not critical, the Shadow Page Table itself also consumes memory. This technology is used by vSphere on processors released before the Nehalem / Barcelona families appeared on the market.
Hardware Assisted Memory Virtualization

At the moment, there are two main MMU virtualization technologies on the market. The first was introduced by Intel in the Nehalem processor family under the name Extended Page Tables (EPT). The second was introduced by AMD in the Barcelona processor family under the name Rapid Virtualization Indexing (RVI). In essence, both technologies do the same thing and differ only in very deep technical details, which I chose not to study because they are insignificant for my work.
So, the main advantage of both technologies is that the new MMU can now run two address lookups in the Page Tables at once. The first looks for a VA - PA pair in the Page Table of the virtual machine, while the other looks for a PA - HA pair in the Page Table managed by the VMM. This second Page Table is called the Extended (sometimes Nested) Page Table. Once both pairs are found, the MMU writes the resulting VA - HA pair to the TLB. Since both tables are now available to the MMU, the Shadow Page Table is no longer needed. Another important point is that, with the two tables separated from each other, the virtual machine can safely manage its own Page Table without any intervention from the VMM.
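A rough sketch of what an EPT/RVI-capable MMU does on a TLB miss, again with invented tables: it walks both the guest Page Table and the Extended Page Table in hardware and caches the combined VA - HA pair, so no Shadow Page Table has to be maintained in software.

```python
guest_page_table = {0: 5, 1: 6}          # VA -> PA, fully owned by the guest OS
extended_page_table = {5: 200, 6: 201}   # PA -> HA, owned by the VMM (the EPT / Nested table)
tlb = {}                                 # caches the final VA -> HA pairs

def mmu_lookup(va_page):
    if va_page in tlb:                       # TLB hit
        return tlb[va_page]
    pa_page = guest_page_table[va_page]      # walk #1: the guest Page Table
    ha_page = extended_page_table[pa_page]   # walk #2: the Extended (Nested) Page Table
    tlb[va_page] = ha_page                   # cache the combined result
    return ha_page

print(mmu_lookup(0))  # miss: two walks, result cached
print(mmu_lookup(0))  # hit: served straight from the TLB
```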
Another significant change in the Nehalem architecture is that the TLB gained a new field, the Virtual Processor ID (VPID). In older processors this field did not exist, and when the processor switched from the context of one virtual machine to the context of another, the entire TLB contents were flushed for security reasons. Now, with the help of the VPID, this can be avoided, which again reduces the number of TLB misses.
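The VPID idea can be sketched as a TLB keyed by (virtual processor ID, virtual page), so entries belonging to different virtual machines can coexist and nothing has to be flushed on a context switch. The names are, again, only for illustration.

```python
tlb = {}  # (vpid, va_page) -> ha_page

def tlb_insert(vpid, va_page, ha_page):
    tlb[(vpid, va_page)] = ha_page

def tlb_lookup(vpid, va_page):
    # An entry cached for one VM can never be returned to another,
    # because the VPID is part of the key.
    return tlb.get((vpid, va_page))

tlb_insert(1, 0, 200)      # an entry for the VM with VPID 1
print(tlb_lookup(1, 0))    # 200 - a hit for the same VM
print(tlb_lookup(2, 0))    # None - VPID 2 cannot see VPID 1's entry
```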
The only noted drawback of this solution is the higher cost of a TLB miss, for one simple reason: every time the MMU does not find the required address pair in the TLB, it has to walk two page tables instead of one. That is why, when vSphere detects that it is running on a Nehalem processor, it always uses large pages. Next week I will try to finish disabling large page support on all our ESXi hosts and post the performance results - in my first
article I mentioned the results of disabling Large Pages on one of the production servers.
For a number of reasons, ranging from laziness and an attempt to make the material more accessible to my own ignorance, I have omitted quite a few nuances and details - for example, the various types of TLBs and processors, the additional level of translation from Linear Address to Virtual Address, differences in how different operating systems work with memory, and so on. Criticism of the quality of the material, technical inaccuracies, tricky questions - in short, any lively interest is welcome, since it will help all of us fill the gaps in our knowledge.
My main sources of inspiration and information were Wikipedia and this
document , for which I would like to say a big thank you to their authors.