I am going to continue the series of official posts in the company's blog about how to work with the cloud, but in parallel I want to talk about the problems we ran into while adapting the Xen Cloud Platform to our cloud model. These posts will be somewhat more technical and assume that the reader has at least a general idea of how Xen works.
When the concept of “pay for what you consume” was just taking shape and I was frantically trying to figure out “how to count”, it seemed to me that the processor and memory were the two simplest resources.
Indeed, we have xencontrol (the Xen hypervisor management library), which can tell us precisely, for each domain (a running virtual machine), how much memory it has and how many nanoseconds of CPU time it has consumed. The library requests this information directly from the hypervisor (via xenbus) and subjects it to only minimal processing.
This information looks something like this (output of the Python bindings for xencontrol):
{
'paused': 0,
'cpu_time': 1038829778010L,
'ssidref': 0,
'hvm': 0,
'shutdown_reason': 0,
'dying': 0,
'mem_kb': 262144L,
'domid': 3,
'max_vcpu_id': 7,
'crashed': 0,
'running': 0,
'maxmem_kb': 943684L,
'shutdown': 0,
'online_vcpus': 8,
'handle': [148, 37, 12, 110, 141, 24, 149, 226, 8, 104, 198, 5, 239, 16, 20, 25],
'blocked': 1
}
As we can see, there is a mem_kb field corresponding to the memory allocated to the virtual machine, and a cpu_time field containing some mind-boggling number (which in reality is only about 17 minutes). cpu_time is kept in nanoseconds (more precisely, the value stored here is nominally in nanoseconds; the real accuracy is about a microsecond). Memory, obviously, is in kilobytes (although the internal unit of accounting, both for the hypervisor and for the Linux kernel, is the page, 4 kilobytes by default).
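For illustration, a minimal sketch of how such a structure can be read from Python (assuming the xen.lowlevel.xc bindings that ship with xend/XCP are available and the script runs in dom0; the unit conversions follow from what was said above):

# Read per-domain counters via the Python bindings to xencontrol.
# Assumes xen.lowlevel.xc (shipped with xend/XCP) and dom0 privileges.
from xen.lowlevel.xc import xc

hypervisor = xc()
# domain_getinfo() returns a list of dicts like the one shown above
for dom in hypervisor.domain_getinfo():
    cpu_seconds = dom['cpu_time'] / 1e9      # counter is kept in nanoseconds
    mem_mb = dom['mem_kb'] / 1024.0          # counter is kept in kilobytes
    print("domid=%d cpu=%.1f s mem=%.0f MB paused=%d" % (
        dom['domid'], cpu_seconds, mem_mb, dom['paused']))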
It would seem: just take it and count. However, the devil is in the details...
Sorry for the tabloid-style heading below, but that is exactly how the question was put in the debate while we were discussing one of the problems:
Hyper-threading + Xen = stealing money from customers
The question: should we enable hyper-threading or not? Enabling it lets us offer the client more cores. Two Xeons with 4 cores each give 16 logical cores, and if we reserve a couple for our own needs, that leaves 14. Is 14 cooler than 8? Cooler. So it should be 14.
Besides, a multithreaded application will finish about one and a half times faster on 14 “fake” cores than on 7 real ones. This is true; I verified it with benchmarks on bare Linux on one of the test servers.
...Wait. One and a half? That is, with twice the number of cores the work goes only one and a half times faster?
Yes, that is exactly what happened. The situation became especially dramatic when, in one virtual machine, I ran a number cruncher at maximum load with a task that obviously would not finish within a few hours, and in another I launched an application solving a computational problem of fixed size, and then compared how long the calculation took with hyper-threading turned off and turned on.
Load level | CPU time without HT, s | CPU time with HT, s |
idle neighbors, 1 core | 313.758 | 313.149 |
idle neighbors, 4 cores | 79.992 * 4 | 80.286 * 4 |
idle neighbors, 8 cores | 40.330 * 8 | 40.240 * 8 |
idle neighbors, 16 cores | - | 29.165 * 16 |
fully loaded neighbors, 1 core | 313.958 | 469.510 |
fully loaded neighbors, 4 cores | 79.812 * 4 | 119.33 * 4 |
fully loaded neighbors, 8 cores | 40.119 * 8 | 59.376 * 8 |
fully loaded neighbors, 16 cores | - | 29.634 * 16 |
It is easy to see that 29.634 * 16 is about 474, as is 59.376 * 8 (about 475). And that is roughly one and a half times more than 40.119 * 8 (about 321).
In other words, with hyper-threading the same job consumes about one and a half times more CPU time, even though the number of “cores” has doubled.
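A quick back-of-the-envelope check of the “fully loaded neighbors” rows of the table above (per-core CPU time in seconds multiplied by the number of cores):

# Arithmetic check of the fully loaded rows of the table above.
no_ht_8 = 40.119 * 8      # ~321 s of total CPU time without HT
ht_8    = 59.376 * 8      # ~475 s with HT, 8 of 16 logical cores busy
ht_16   = 29.634 * 16     # ~474 s with HT, all 16 logical cores busy

print(ht_8 / no_ht_8)     # ~1.48: the same work costs ~1.5 times more CPU time
print(ht_16 / no_ht_8)    # ~1.48 as well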
That is tolerable if the whole processor is yours. But what if CPU time is billed, and the neighbors are not even “colleagues” but complete strangers?
After that we had a big discussion (it lasted about three days, with interruptions): should we enable HT in a cloud that bills clients for CPU time? Besides the obvious “honestly, it's dishonest” and “no one will ever know”, there were far more serious arguments:
- A single server yields more CPU resources (which is exactly what Intel created the technology for).
- We can account for this “performance loss” by lowering the price of machine time.
- We can monitor host load, keep it below 50%, and still offer the client more cores.
After the discussion we arrived at the following set of arguments: we cannot control, or genuinely (rather than statistically) predict, CPU time consumption. Migration, although a way out, is only a partial one, because reacting with a migration takes at least 30-40 seconds, while load spikes can be instantaneous (under a second). Because of this, we would not know what kind of machine time (full-speed or not) we actually delivered to the client, and in any case the client would face an unjustified loss of performance for no apparent reason, simply because a neighbor decided to compute something heavy.
Because we could not guarantee consistent virtual machine performance with hyper-threading, the view prevailed that HT must be turned off in a cloud that bills for machine time (hence the limit of 8 cores per virtual machine).
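Purely for reference (my own illustration, not part of XCP): whether HT is enabled on a Linux host can be seen from the sysfs CPU topology, for example like this:

# Count logical vs physical cores from the sysfs CPU topology.
# If the numbers differ, hyper-threading (SMT) is enabled on the host.
import glob

logical = 0
cores = set()
for path in glob.glob('/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list'):
    logical += 1
    with open(path) as f:
        # threads sharing one physical core report the same list, e.g. "0,8"
        cores.add(f.read().strip())

print("logical CPUs:   %d" % logical)
print("physical cores: %d" % len(cores))
print("HT enabled:     %s" % (logical > len(cores)))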
Migration: copy and delete
The second awkward issue was resource accounting during migration. By default it is assumed that the handle (a.k.a. uuid) proves the uniqueness of an object, and there cannot be two virtual machines with the same uuid. That is indeed so. However, it applies to virtual machines, not to domains. During migration the contents of the domain (its RAM) are copied to the new node, the domain is started there, and only after that is it deleted on the old node. All of this is accompanied by repeated re-copying of domain fragments (since the virtual machine keeps running in the meantime). Either way, at some point we have TWO domains belonging to ONE virtual machine. If we count naively, head-on (simply summing the counters), the totals will be completely wrong. (Incidentally, this problem was discovered rather late and was one of the reasons the launch was delayed.)
The solution turned out to be elegant and architecturally clean. Only one copy of a domain can run at any given time; all others are paused. This is very important, because if two copies ran at once they could wreak real havoc. So the solution is: a paused domain is simply not counted. This has a few minor negative side effects:
- We could not offer our clients a flashy “pause” button (in that mode the domain exists but does not execute). Since such a domain still consumes memory, we cannot afford to ignore it if a client pauses a virtual machine and leaves on vacation, and we cannot distinguish a “pause during migration” from an ordinary “pause” (at least not without very large and non-trivial gymnastics with states and databases).
- When a client reboots a machine, there is a brief moment when the domain already exists but is still paused, and we do not count it (so a client who keeps rebooting the machine could consume a small amount of memory unrecorded). Perhaps that is even fairer: the setup is our problem, and until the machine starts working the client has no reason to pay for it.
Otherwise this solution has no side effects. A few alternatives were discussed: locks in the accounting database, tracking domain lifetimes, and so on. Against the elegance of simply not counting paused domains (sketched below), they all look cumbersome and ugly (if you don't praise yourself, no one will, alas).
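A sketch of that accounting rule, using the same xc counters as above (the field names match the dict shown earlier; record_usage() is a hypothetical stand-in for the real billing backend):

# Accounting pass that skips paused domains, so the migrating copy
# (which is always paused) is never counted twice. Illustration only.
from xen.lowlevel.xc import xc

def collect_usage(record_usage):
    for dom in xc().domain_getinfo():
        if dom['domid'] == 0:    # dom0 is not billed to anyone
            continue
        if dom['paused']:        # paused copy: migration, reboot, etc.
            continue
        # 'handle' is the uuid of the virtual machine owning this domain
        uuid = ('%02x' * 16) % tuple(dom['handle'])
        record_usage(uuid, cpu_ns=dom['cpu_time'], mem_kb=dom['mem_kb'])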
Who will pay for dom0?
Another giant problem was the load on dom0. When a user sends a network request or performs disk operations, the request is handed from domU (the domain that is the running virtual machine) to the second half of the split driver in dom0. That driver ponders the request and passes it on to the real drivers of the real hardware. And I must say that under intensive disk I/O it ponders a lot: we have seen dom0 load of 50-80% without any effort. Most of that is OVS, blktap, xenstore, etc. (the rest is xapi, squeezed, stunnel, which are parts of the cloud management system). Whom should we bill for this machine time? And more importantly, how do we separate one user from another? At the driver level it might just about be possible, but further down...
The same OVS (Open vSwitch, the program that provides the virtual network) switches frames without caring which domain they belong to.
The Xen folks (the developers of the hypervisor and its surrounding tooling) have racked their brains over this question. I started to as well, but came to my senses in time. If disk operations are billed, then their price in fact already includes the cost of processing the request. That is not only the disks' IOPS, the load on RAID controllers and the SAN, the depreciation of 10G cards, and so on; it is also (quite insignificant compared with the above) the dom0 machine time. Everything is logical, and the problem solves itself.
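If one really wanted to write this down, the idea amounts to nothing more than folding the dom0 overhead into the per-operation price (all numbers below are made-up placeholders, not real tariffs):

# Illustration only: the price of a disk operation already absorbs dom0 overhead.
IOP_BASE_COST   = 1.0     # hypothetical cost unit per disk operation
DOM0_NS_PER_IOP = 20000   # hypothetical dom0 CPU time spent per request, ns
COST_PER_CPU_NS = 1e-8    # hypothetical cost of a nanosecond of CPU time

price_per_iop = IOP_BASE_COST + DOM0_NS_PER_IOP * COST_PER_CPU_NS
print(price_per_iop)      # the client sees a single per-operation price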