
We continue talking about memory: Page Sharing

This article is the second in my series on memory management in virtualized environments in general, and on Hyper-V Dynamic Memory, the technology coming in Windows Server 2008 R2 SP1, in particular.
The first part: habrahabr.ru/blogs/virtualization/93241
This time we will talk about one of the "memory overcommitment" technologies, called Page Sharing.

How does it work?

Page Sharing is one of the dynamic memory technologies in hypervisors that allows the host to allocate more memory to virtual machines than is physically available, which is what English sources call "memory overcommitment".

The principle behind this technology resembles some data compression algorithms. It all starts with the hypervisor scanning every memory page available in the system and computing a checksum (hash) of each page. The resulting values are recorded in a special table. The hypervisor then searches the table for matching checksums, and when a match is found it performs a bit-by-bit comparison of the corresponding pages. If the pages turn out to be fully identical, only one copy is kept, and accesses from the virtual machines are transparently redirected to it. This lasts exactly until one of the virtual machines wants to modify the shared page: at that moment a private copy of the page is created for that machine, and the page is no longer shared. Unfortunately, this scanning of memory, with its hash calculations, table lookups, and bitwise comparisons, is quite resource-intensive and slow; with large amounts of memory it can take up to several hours.
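To make the scan-hash-compare-share cycle more concrete, here is a minimal sketch in Python. It is purely illustrative: a real hypervisor does this on physical page frames using copy-on-write tricks in the page tables, and all names here (PAGE_SIZE, SharedMemory, and so on) are invented for this example.

```python
import hashlib

PAGE_SIZE = 4096  # 4 KB pages, as in the classic x86 layout

class SharedMemory:
    """Toy model of content-based page sharing (not a real hypervisor)."""

    def __init__(self, pages):
        # pages: list of bytes objects, one per guest page
        self.pages = list(pages)
        self.seen = {}  # hash -> index of the canonical copy

    def deduplicate(self):
        for i, page in enumerate(self.pages):
            h = hashlib.sha1(page).digest()
            if h in self.seen:
                j = self.seen[h]
                # A hash match is not proof: compare bit for bit.
                if self.pages[j] == page:
                    self.pages[i] = self.pages[j]  # point both at one copy
                    continue
            self.seen[h] = i

    def write(self, i, offset, data):
        # Copy-on-write: give the writer a private copy first.
        page = bytearray(self.pages[i])
        page[offset:offset + len(data)] = data
        self.pages[i] = bytes(page)

# Two identical zero pages and one distinct page:
mem = SharedMemory([bytes(PAGE_SIZE), bytes(PAGE_SIZE), b"\x01" * PAGE_SIZE])
mem.deduplicate()
print(mem.pages[0] is mem.pages[1])  # True: one physical copy
mem.write(1, 0, b"\xff")             # a write splits the shared page
print(mem.pages[0] is mem.pages[1])  # False again
```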

Page Sharing, the TLB, Large Memory Pages, and all the rest

To begin with, let's look at the general principles of working with memory: what a large memory page is (Large Memory Pages, hereafter LMP), why we need the translation lookaside buffer (hereafter TLB), and what any of this has to do with Page Sharing. Here is what Alan Zeichick, an engineer at AMD, writes on the subject (his article is about memory management in the Java virtual machine, but the reasoning applies just as well to machine virtualization; I will give a partial translation, the original, in English, is here):
All x86-compatible processors and all modern 32-bit and 64-bit operating systems use a paged organization of physical and virtual memory. For each application, the virtual address of a page is mapped to a physical address via the page table. To speed up this mapping, modern processors use a translation lookaside buffer (TLB), which caches the most recently used virtual-to-physical address mappings.
As a rule, the memory allocated to an application is not contiguous, and its pages are often fragmented. But because the page table hides physical addresses from applications, each application "thinks" that its memory is contiguous. By analogy, applications that do not work with the file system directly have no idea how fragmented individual files are.
When a running application accesses memory, the processor uses the page table to translate the virtual address used by the application into a physical one. As mentioned above, a cache, the TLB, is used to speed this up. If the requested address is present in the TLB, the processor can serve the request much faster, since it does not have to search the entire page table. If the address is not in the TLB, a regular page table lookup is performed first, and only then can the request be served.
Given the sheer number of pages, TLB performance is of paramount importance. On a typical 32-bit server running any OS, be it Windows, Linux, or some other Unix, with 4 GB of RAM, the page table will contain a million entries, one for each 4-kilobyte page. Now imagine a 64-bit OS with, say, 32 GB of memory: that is a full 8 million 4-kilobyte pages.
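As a toy illustration of the lookup path just described, here is a sketch in Python. The structures and numbers are invented for the example; a real TLB is a hardware cache with set-associative lookup, not a dictionary.

```python
# Toy model of address translation with a TLB in front of the page table.
PAGE_SIZE = 4096

# A million entries, as in the 4 GB example above (virtual -> physical page number).
page_table = {vpn: vpn + 0x100 for vpn in range(1_000_000)}
tlb = {}           # small cache of recent translations
TLB_CAPACITY = 32  # arbitrary toy capacity

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                    # TLB hit: fast path
        ppn = tlb[vpn]
    else:                             # TLB miss: walk the page table
        ppn = page_table[vpn]
        if len(tlb) >= TLB_CAPACITY:  # crude FIFO eviction
            tlb.pop(next(iter(tlb)))
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x1000)))  # first access: miss, page table walk
print(hex(translate(0x1008)))  # same page: TLB hit
```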

And further:
Why are large pages more convenient? Suppose our application tries to read 1 MB (1024 KB) of contiguous data that was last accessed relatively long ago, so the translation is no longer in the TLB. With 4 KB memory pages, that means touching 256 pages, so we have to perform 256 lookups in a page table that may contain millions of entries. That takes a lot of time.
Now imagine that the page size is 2 MB (2048 KB). In this case the page table lookup has to be performed once, if the 1 MB block we need fits entirely in one page, or twice otherwise. And if the TLB is involved, everything goes faster still.
For small pages, the TLB holds 32 entries at the L1 level and 512 entries at the L2 level. Since each entry corresponds to a 4-kilobyte page, the entire TLB covers only 2 MB of virtual memory.
For large pages, the TLB holds 8 entries. Since each of them addresses 2 MB of memory, the TLB covers 16 MB of virtual memory. This mechanism becomes much more efficient for memory-intensive applications. Imagine that your application tries to read, say, 2 GB of data. Which will be faster: reading a thousand cached 2-megabyte pages, or plowing through half a million small 4-kilobyte pages?
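The arithmetic from the quoted example is easy to verify; a minimal sketch, using the sizes from the quote:

```python
import math

KB, MB, GB = 2**10, 2**20, 2**30

# Reading 1 MB of contiguous data:
print((1 * MB) // (4 * KB))          # 256 lookups with 4 KB pages
print(math.ceil(1 * MB / (2 * MB)))  # 1 lookup with 2 MB pages (2 if it straddles a boundary)

# TLB reach from the figures in the quote:
print(512 * 4 * KB // MB)  # 2  -> 512 small-page entries cover 2 MB
print(8 * 2 * MB // MB)    # 16 -> 8 large-page entries cover 16 MB

# Reading 2 GB of data:
print((2 * GB) // (2 * MB))  # 1024 large pages ("a thousand")
print((2 * GB) // (4 * KB))  # 524288 small pages ("half a million")
```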

Simple arithmetic gives the number of page table entries for different amounts of memory. With 4-kilobyte pages, 4 GB of memory means 1 million entries, 32 GB means 8 million, 64 GB means 16 million, and 1 TB means no less than 256 million entries. Now recall that servers have long supported not just 32 or 64, but as much as 192 GB of memory (for example, the HP DL385 G6), and the recently released Intel Nehalem EX processor, according to its specification, supports up to 256 GB of memory per processor socket. A terabyte of memory is no longer science fiction: that is just one four-processor server. If you stick to the old memory organization with 4-kilobyte pages, you get 256 million pages, and working with that much memory starts to resemble bailing out a swimming pool with a beer mug. So large memory pages are not the distant future, but the present.
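The same scaling in a few lines, as a sketch:

```python
KB, GB, TB = 2**10, 2**30, 2**40

for size in (4 * GB, 32 * GB, 64 * GB, 192 * GB, 1 * TB):
    entries = size // (4 * KB)  # one page table entry per 4 KB page
    print(f"{size // GB:>5} GB of RAM -> {entries:,} page table entries")
```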
To sum up: the translation lookaside buffer is an important system resource, and how efficiently it is used strongly affects overall system performance. In 32-bit systems supporting at most 4 GB of memory, the de facto standard was to organize memory in 4 KB pages. Today, when 64-bit systems are increasingly common and memory sizes run to tens and hundreds of gigabytes, such an organization can seriously reduce TLB efficiency and, with it, the performance of the system as a whole.
It is impossible not to mention a new technology called Second Level Address Translation (SLAT). Different vendors call it differently: AMD has Rapid Virtualization Indexing (RVI), also known as Nested Page Tables (NPT), and Intel has Extended Page Tables (EPT). It allows guest addresses (that is, addresses inside a virtual machine) to be translated directly into physical addresses. This can significantly improve performance compared to systems without it: an officially confirmed gain of 20%, with some reporting up to a 100% increase. So SLAT is useful for virtualization, and it is one of the reasons to move to hardware and software that support it.
However, many people forget, or simply do not know, that SLAT was designed and optimized for large memory pages. If large page support is not enabled, the TLB works less efficiently, and using SLAT with small pages can have the opposite effect: a performance drop of around 20%. On top of that, we forgo the 10-20% gain that large pages provide on their own, so in total we can lose up to 40% of performance.
Summing up, Large Memory Pages is a very important factor that can yield performance gains of up to 40%, and the question of whether to use it is rhetorical. Large memory pages are a product of the evolution of computer systems, just like the 64-bit processor architecture or jumbo frames in networking.
But what, you may ask, does Yu. Luzhkov have to do with it?

Many readers will by now have a perfectly reasonable question: why did I ramble on about large pages and the translation lookaside buffer when the article was supposed to be about Page Sharing? The answer is very simple: Page Sharing practically does not work when large pages are in use. Why? Elementary, Watson: with a much larger page (2048 KB versus 4 KB), the probability of finding two fully identical pages in memory drops to almost zero. This is one of the few points on which Microsoft and VMware fully agree:
communities.vmware.com/message/1262016#1262016
The only problem with large pages is that for Page Sharing to work, the hypervisor has to find fully identical 2 MB memory pages (as opposed to the much smaller 4-kilobyte ones). The likelihood of that is far lower (except for empty guest OS pages filled entirely with zeros), and ESX does not try to share large pages, which is why the memory savings from TPS shrink when the hypervisor backs all guest pages with physical large pages.

In short, Page Sharing works efficiently with 4-kilobyte pages, but with large, 2-megabyte pages it offers practically no benefit.
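A crude back-of-the-envelope model shows why: a 2 MB page consists of 512 constituent 4 KB chunks, and for two large pages to be shareable, every one of them must match at once. Under the simplifying (and admittedly rough) assumption that each 4 KB chunk matches independently with probability p, the chance of a full match collapses:

```python
# Rough model: probability that two 2 MB pages are bit-identical,
# assuming each of their 512 constituent 4 KB chunks matches
# independently with probability p. Purely illustrative.
for p in (0.99, 0.9, 0.5):
    print(f"p = {p}: P(2 MB match) ~ {p ** 512:.3e}")
```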

Blank pages

Oddly enough, Page Sharing works best with large amounts of unused memory, that is, when there are many pages filled with zeros. They are all identical and are the easiest to share; think of it as compressing Malevich's "Black Square" saved in BMP format. The problem is that some operating systems (Windows 7 in particular) strive to use all available memory, and that is not a bug but a feature: specifically SuperFetch in Windows Vista / 7, which I have written about before. This, too, reduces the efficiency of Page Sharing.
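Spotting such pages is trivial; here is a sketch (PAGE_SIZE and the check are illustrative, not how any particular hypervisor does it):

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)  # 4 KB of zeros

def is_zero_page(page: bytes) -> bool:
    # All zero pages are identical, so one comparison suffices.
    return page == ZERO_PAGE

print(is_zero_page(bytes(PAGE_SIZE)))          # True: trivially shareable
print(is_zero_page(b"\x00" * 4095 + b"\x01"))  # False: one byte spoils it
```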

Summary

It follows from all this that on modern hardware and operating systems Page Sharing becomes useless at best, and harmful at worst: it reduces performance and prevents the use of new hardware and OS features.

In the next article we will continue the conversation about memory overcommitment technologies, this time about Second Level Paging.

Source: https://habr.com/ru/post/93274/

