Time quantization errors in virtual machines

Introduction

While studying how various tasks perform in virtual machines, I relied on the old, proven way of getting quantitative results: benchmarks. Many of them had already served me well, and in a burst of enthusiasm I even wrote a couple of my own. I had already collected a decent set of statistics and had little doubt about the results when I happened to notice, in one of the tests, that the task had actually taken several times longer than the time the program happily reported. Repeated measurements with a stopwatch confirmed the observation: time in a virtual machine may not pass the way we expect. The details, and what this can lead to, are below.

The essence

Everything described below was done on Windows Server 2008 R2 with all the latest updates and integration services installed on both the host and the virtual machine. The hypervisor was Hyper-V.

The essence of the problem: if time synchronization is disabled in the virtual machine's integration services settings (and there are cases where disabling it is strongly recommended), time quantization errors can appear, depending on the load on the hardware.
How can this be demonstrated? Give the virtual machine 5% of one core, with time synchronization disabled in its properties, and run any CPU-bound benchmark in it; I used SuperPi. The difference in results is immediately noticeable: the program claimed it finished in about 18 seconds, while I actually waited more than 10 minutes. Moreover, if you open the clock in the notification area, you can see the second hand moving several times slower than in the reality we are all used to.
What determines the size of the error? The resources allocated to the virtual machine and the load on the host. The fewer resources are allocated and the more heavily the virtual machine and/or the host is loaded, the larger the error.
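
To make the comparison concrete, here is a minimal Python sketch. It is not the tool used in the experiment: `cpu_bound_work` is a hypothetical stand-in for a benchmark such as SuperPi, and `slowdown_factor` is an illustrative helper name. The self-reported time comes from the clock of the machine the script runs on, which is presumably the value that gets distorted inside the guest; the stopwatch time has to be measured outside the virtual machine.

```python
import time


def cpu_bound_work(iterations: int = 2_000_000) -> int:
    """Hypothetical stand-in for a CPU-bound benchmark such as SuperPi."""
    total = 0
    for i in range(iterations):
        total += i * i % 7
    return total


def self_reported_seconds(iterations: int = 2_000_000) -> float:
    """Time the workload using the clock of the machine it runs on."""
    start = time.perf_counter()
    cpu_bound_work(iterations)
    return time.perf_counter() - start


def slowdown_factor(reported_seconds: float, stopwatch_seconds: float) -> float:
    """Ratio of externally measured time to the time the program reported."""
    if reported_seconds <= 0:
        raise ValueError("reported_seconds must be positive")
    return stopwatch_seconds / reported_seconds


if __name__ == "__main__":
    print(f"self-reported: {self_reported_seconds():.2f} s")
    # Figures from the experiment above: about 18 s reported,
    # more than 10 minutes actually waited, so this is a lower bound.
    print(f"slowdown: at least {slowdown_factor(18, 10 * 60):.1f}x")
```

With the figures from the test above, the guest clock ran at least about 33 times slower than the stopwatch.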

What are the risks?

When measuring the performance of such a virtual machine, you may get last year's temperature at the south pole of Mars rather than objective data.
The clock in the virtual machine may start to lag significantly, which can be critical for some services.
The estimated time for tasks and services to complete may differ from what actually happens.
I am not sure about this one (I did not test other hypervisors), but in theory, on cloud hosting, your own in-guest measurements of resource usage could diverge from the figures the hosting provider reports.
And so on.

How else can you use this knowledge?

For example, on a hosting service you can use it to work out what is causing your services to slow down. First take a baseline with a benchmark, recording both the time the program reports and the time measured with a stopwatch (for accuracy, all services should be stopped for the duration of the test). When you suspect a slowdown, run the same test again. The results can be interpreted as follows:

Neither the benchmark time nor the stopwatch time has changed (within the margin of error): your own services, or the client load on them, are to blame.
The benchmark time has not changed but the stopwatch time has grown: the hoster has cut your virtual machine's resources or the load on the host has increased, and you simply do not have enough resources.
The benchmark time has grown: the stopwatch reading no longer matters, because it is already clear that your virtual machine's resources have been cut or the load on the host has increased.
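
The three cases can be written down as a small decision function. This is only a sketch of the reasoning above: the names `Cause`, `grew` and `interpret` are made up for illustration, and the 10% tolerance standing in for the "margin of error" is an assumption, not a value from the measurements.

```python
from enum import Enum


class Cause(Enum):
    OWN_SERVICES = "your services or the client load on them"
    HOST_OR_QUOTA = "VM resources were cut or host load increased"


def grew(current: float, baseline: float, tolerance: float) -> bool:
    """True if the value grew beyond the allowed margin of error."""
    return current > baseline * (1.0 + tolerance)


def interpret(baseline_bench: float, baseline_watch: float,
              bench: float, watch: float,
              tolerance: float = 0.10) -> Cause:
    """Classify a repeated test against the saved baseline (times in seconds).

    tolerance is an assumed margin of error, not a measured one.
    """
    # Case 3: the benchmark's own time grew; the stopwatch no longer matters.
    if grew(bench, baseline_bench, tolerance):
        return Cause.HOST_OR_QUOTA
    # Case 2: the benchmark time is unchanged, but the stopwatch time grew.
    if grew(watch, baseline_watch, tolerance):
        return Cause.HOST_OR_QUOTA
    # Case 1: nothing changed within the margin of error.
    return Cause.OWN_SERVICES


if __name__ == "__main__":
    print(interpret(18, 20, 18, 600).value)
```

Cases 2 and 3 lead to the same conclusion; they are kept as separate branches only to mirror the list above.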

Instead of a conclusion

Such is the tale of lost time. Finally, for those interested in the topic, I would like to share a link that Microsoft specialists provided while discussing it.

Source: https://habr.com/ru/post/117908/
