One of the first problems you run into when you decide to offer “unlimited memory” in the cloud is that modern operating systems are not ready for “unlimited memory”. The culprit is the disk cache.
The kernel takes all free memory for the page cache: as long as there is disk I/O and free memory available, the cache keeps growing. On a server whose memory is entirely its own this is a good thing, but if every megabyte is billed, nobody feels like paying for 10-20 GB of disk cache.
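To see the scale of the problem, it is enough to look at /proc/meminfo on any long-running Linux machine. A minimal Python sketch (the field names are the standard /proc/meminfo keys):

```python
# Report how much memory the kernel is holding as disk cache.
# /proc/meminfo values are in kB.
def meminfo():
    """Parse /proc/meminfo into a dict of field name -> kB."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # drop the "kB" suffix
    return fields

m = meminfo()
print("Truly free: %d kB" % m["MemFree"])
print("Disk cache: %d kB" % (m["Buffers"] + m["Cached"]))
```

On a machine that has been doing disk I/O for a while, the second number dwarfs the first.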
I tried to find a way to limit the size of the Linux disk cache, but all I found was a strange patch from 2003 (which, of course, was never merged into the mainline kernel).
It turns out that the longer a guest runs, the more memory its kernel claims from the (unlimited) pool. The result: if you do not limit the system at all, in a very short time it grows either to the total size of its disk devices (thankfully, a disk cache larger than the disk itself cannot happen) or to the maximum memory available on the host. Note that under these conditions every space-saving technology fails, both memory compression and page deduplication, since each VM fills its memory with its own cache. Every virtualization system faces this problem; below I describe solutions with an eye on the Xen Cloud Platform.
Solutions
I see three solutions to the problem:
- The xenballoon mechanism used in Xen. A special driver in the guest kernel, when prodded, reserves memory from the guest operating system and hands it to the hypervisor (“frees” it). When necessary it is asked to take the memory back, and it does (i.e., returns it to the guest). Ballooned memory is thus busy from the guest's point of view and free from the hypervisor's; when memory is taken back from the hypervisor, it becomes free for the guest and busy for the hypervisor.
- An external mechanism for regulating available memory. It differs from ballooning only in that the decision to add or remove memory is made not by an application poking xenballoon from inside the guest, but by an external control system (a minimal sketch of this appears right after the list). Xen has long been able to change the amount of memory available to a guest on the fly (whatever you set is what it gets), up to the comical situation where you give the guest less memory than it is currently using. In that case oom_killer comes to the rescue inside the guest, and it is better not to let it come to that.
- Writing a kernel patch that would cap the disk cache at some fixed value.
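For concreteness, here is a minimal sketch of the second approach, run from dom0. It assumes the standard xenstore-write utility from the Xen tools and the conventional Xen mechanism: the guest's balloon driver watches the memory/target key of its xenstore directory (the value is in KiB). The domain ID and sizes are illustrative.

```python
# Nudge a guest's balloon driver by rewriting its memory/target key.
import subprocess

def set_memory_target(domid: int, mebibytes: int) -> None:
    """Ask the balloon driver in domain `domid` to grow or shrink to `mebibytes`."""
    path = "/local/domain/%d/memory/target" % domid
    subprocess.check_call(["xenstore-write", path, str(mebibytes * 1024)])

# Shrink domain 5 to 512 MiB: its balloon driver "inflates",
# returning the difference to the hypervisor.
set_memory_target(5, 512)
```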
In more detail
The third option is probably the most interesting, but it requires significant changes to the kernel code and, perhaps, a rethink of the entire memory-management machinery. For me, this path is... somewhat thorny.
The first and second actually work today. They have one major drawback, though: memory is added and removed asynchronously with respect to allocation requests. Roughly speaking, if the guest has 200 MB free and an application asks for 500 MB, the allocation will be refused, because the kernel simply does not have the extra 300 MB. The daemon (or external monitor) only learns about the problem after the allocation request has already been processed.
The obvious solution would seem to be: let the kernel call out over some interface whenever it is asked for memory. But then that same kernel would endlessly hoard all available memory for disk caches, since no memory request would ever go unanswered.
Still, the situation is not hopeless. First, applications usually do not eat memory in chunks that big. More typically, memory is allocated gradually (or rather, quite fast, but in smaller pieces) across different applications as load grows. In that case, every time the guest starts running low on free memory, a daemon inside the guest or an external monitor can toss in some more.
Swapping
Second, there is swap. If an application asks for 500 MB when only 200 MB is free, it will receive the 200 MB of free memory, 300 MB of little-used data will be pushed out to swap, and the request will be satisfied. The monitor, seeing the memory shortage, will then throw another 500 MB at the guest so that it can keep working undisturbed. Since it is the least relevant data that gets swapped out, there is a high probability that it will be accessed very, very rarely (or never at all). Thanks to this swap-file “damper”, memory can be provided without out-of-memory (OOM) situations, yet without a voracious cache of tens of gigabytes.
The final scheme looks like this: the guest system is given a certain amount of free memory on top of what it actually occupies. That memory serves as a disk cache of reasonable size (the cache is still needed, even in guest virtual machines, since the guests themselves know best which disk pages matter most). No matter how much memory the applications use, the free area stays roughly the same (minus minor fluctuations due to the asynchrony between application allocations and memory top-ups to the guest OS).
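A sketch of the monitor loop that implements this scheme, run from the control host. All names here are illustrative; it assumes the guest agent publishes its free memory in KiB under data/meminfo_free in xenstore, as the XCP guest tools do, and that the guest's balloon driver obeys memory/target:

```python
# Keep the guest's free memory near a fixed reserve by moving its balloon target.
import subprocess
import time

RESERVE_KIB = 256 * 1024   # free memory we try to maintain inside the guest
STEP_KIB = 64 * 1024       # hysteresis: ignore smaller deviations

def xenstore_read(path: str) -> str:
    return subprocess.check_output(["xenstore-read", path]).decode().strip()

def monitor(domid: int) -> None:
    base = "/local/domain/%d" % domid
    while True:
        free = int(xenstore_read(base + "/data/meminfo_free"))
        target = int(xenstore_read(base + "/memory/target"))
        delta = RESERVE_KIB - free
        if abs(delta) > STEP_KIB:
            # Too little free memory: raise the target; too much: lower it.
            subprocess.check_call(
                ["xenstore-write", base + "/memory/target", str(target + delta)])
        time.sleep(1)

# monitor(5)  # e.g. regulate domain 5
```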
Free memory
A separate task is determining how much free memory the guest has. Can we do it “from outside”? I will say right away: we cannot (unless you count dirty options like rummaging through the guest kernel's brains with the hypervisor's rough fingers). From outside there is no way to tell that page “a” is clean (free) while page “b” is dirty (in use).
In XCP (Xen Cloud Platform), the only way to determine free memory is an agent (a daemon) inside the guest that tells the hypervisor how much memory is free. Based on this information, the hypervisor decides how much memory to add to or remove from the guest.
What does this look like from the guest side? The guest runs a simple script that periodically writes its free-memory figure into xenstore (the communication channel between the hypervisor and the guest), along with things like its current IP address and OS version. The cloud management system watches this information and adds or removes guest memory.
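A guest-side sketch of such a script (the key name data/meminfo_free mirrors the one used by the XCP guest agent; a real agent also reports the IP address, OS version and so on). It assumes the xenstore-write utility from the Xen guest tools is installed; a path without a leading slash is relative to the guest's own xenstore directory:

```python
# Periodically report the guest's free memory into xenstore.
import subprocess
import time

def meminfo_free_kib() -> int:
    """Read MemFree (in kB) from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    raise RuntimeError("MemFree not found in /proc/meminfo")

while True:
    subprocess.check_call(
        ["xenstore-write", "data/meminfo_free", str(meminfo_free_kib())])
    time.sleep(5)
```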
From the guest's point of view, the amount of free memory stays almost constant no matter how much memory the applications occupy, and this free memory is used as the disk cache. Under sudden spikes (requesting 1 GB in one go is not very typical behavior) there will be brief bouts of swapping; for now (testing has not yet begun) I believe such cases will be isolated and uncharacteristic. From the hypervisor's point of view, the guest machine's memory footprint changes over time, tracking how much memory is actually in use inside the guest.
What happens if a guest decides to cheat and modifies the code of the daemon reporting free memory, or simply switches it off? For the hoster, nothing bad: it will simply stop receiving free-memory information (and stop regulating it). If the guest reports bogus figures, it will either have a lot of memory taken away (which the guest could have achieved on its own by killing off spare applications) or be handed a lot of extra memory, but the guest pays for consumption... So I do not think anyone in production will seriously try to game this service.
I am still undecided which memory-management mechanism is better: a daemon inside the guest (xenballooning) or an external monitor.