After recent discussions about which hypervisor is better, the idea arose to write out the functionality of modern virtualization systems without reference to specific names. This is not a “who's better” comparison; it is an answer to the question “what can be done with virtualization?”, a general overview of the capabilities of industrial virtualization.
Code execution
Since the hypervisor fully controls the virtual machines, it can precisely control how each machine executes.
Different virtualization systems offer several methods of code execution (full emulation is not included in the list, as it is not used in industrial virtualization):
- Binary rewriting. This approach is used by VMware and Connectix Virtual PC (acquired by Microsoft) for virtualization on hosts without hardware virtualization. The hypervisor (virtualizer) scans the executable code, marks instructions that require “virtualization” with breakpoints, and emulates (virtualizes) only those instructions.
- Hardware virtualization. An old technology for Alpha and System/360, a relatively new one for amd64/i386. It introduces a semblance of ring -1, on which the hypervisor runs, controlling the machines through a set of virtualization instructions. The Intel and AMD technologies differ slightly: AMD offers the ability to program the memory controller (on the processor) to reduce the computational cost of virtualization (nested pages), while Intel implemented this as a separate technology, EPT. Actively used in production to run “foreign” systems in VMware, Xen HVM, KVM, Hyper-V.
- Paravirtualization. In this case the kernel of the guest system is “virtualized” at compilation time; userspace is practically unchanged. The guest system cooperates with the hypervisor and makes all privileged calls through it. The guest kernel itself runs not in ring 0 but in ring 1, so that a misbehaving guest cannot interfere with the hypervisor. It is used for virtualization of open-source systems in Xen (PV mode) and OpenVZ (OpenVZ is in some sense unique, because there is simply no separate “guest” kernel; it is closer to jail than to virtualization, although it still provides fairly strong isolation).
- Container virtualization. Allows you to isolate sets of processes into “containers”, each of which sees only the processes of its own container. It sits on the thin line between virtualization, BSD jail, and simply a well-written system for isolating OS processes from each other. It uses a common memory manager and the same kernel.
In reality, paravirtualized drivers (often called guest tools) are used in both HVM and binary rewriting, because it is precisely in I/O operations that paravirtualization significantly outperforms all other methods.
Without exception, all hypervisors can perform a pause (suspend/pause) operation on a virtual machine. In this mode the machine's operation is suspended, possibly with the memory contents saved to disk, and work continues after “recovery” (resume).
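To make this concrete, here is a minimal sketch of suspend/resume and suspend-to-disk using the libvirt Python bindings (libvirt is just one possible management layer; the connection URI, VM name, and save path below are assumptions):

```python
import libvirt

# Connect to the local hypervisor (URI is an assumption: could be xen:///, qemu:///system, etc.)
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('guest01')   # hypothetical VM name

dom.suspend()                        # pause: vCPUs stop, memory stays in RAM
dom.resume()                         # continue execution

# Variant with saving memory to disk and restoring it later ("suspend to disk")
dom.save('/var/lib/libvirt/save/guest01.sav')
conn.restore('/var/lib/libvirt/save/guest01.sav')
```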
A common feature is the concept of migration: transferring a virtual machine from one computer to another. It can be offline (turned off on one computer, turned on on the second) or online (usually called live migration), without a shutdown. In reality it is implemented via suspend on one machine and resume on the other, with some optimization of the data transfer: first the memory is copied while the machine keeps running, then the machine is paused, the data changed since the start of the migration is transferred, and the machine is started on the new host.
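A live migration of this kind can be triggered through libvirt as well; a minimal sketch, assuming two hosts reachable over SSH and shared storage for the disks (host names and VM name are made up):

```python
import libvirt

src = libvirt.open('qemu:///system')             # source host; URI is an assumption
dst = libvirt.open('qemu+ssh://host2/system')    # destination host; hypothetical address
dom = src.lookupByName('guest01')                # hypothetical VM name

# VIR_MIGRATE_LIVE implements exactly the scheme described above: memory is copied
# while the guest runs, then it is briefly paused, dirty pages are re-sent, and the
# guest resumes on the destination host.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
```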
Also, Xen has promised (and, it seems, almost brought to production) a technology for parallel execution of one machine on two or more hosts (Remus), which allows the virtual machine to continue operating in the event of a server failure without interruptions or reboots.
Memory management
The classical virtualization model implies allocating a fixed amount of memory to the guest; changing it is possible only after the guest is shut down.
Modern systems can change the amount of RAM of a guest system manually or automatically.
The following memory management methods exist:
- Ballooning. The generally accepted mechanism (at least Xen and Hyper-V; it also appears to be in VMware). The idea is simple: a special module in the guest system requests memory from the guest OS and hands it to the hypervisor; at the right moment it takes memory back from the hypervisor and returns it to the guest. The main feature is the ability to give a virtual machine's idle memory back (a sketch is given after this list).
- Memory hot-plug. Adding memory on the go. Supported in Hyper-V in a future SP for Windows Server (not yet released) and in Xen 4.1 (in Linux 2.6.32). It adds memory similarly to a hardware hot-plug, allowing memory to be added “on the go” to a live server without rebooting. An alternative to this method is a pre-inflated balloon, when the hypervisor starts the machine with an already “non-zero” balloon that can be “deflated” as needed. Memory hot-unplug so far exists only in Linux (whether it actually works, I cannot yet say). Most likely, MS will soon finish Windows support for unplug.
- Common memory. Specific to OpenVZ: memory is taken from a common pool, and virtual machines are limited only by an artificial limit in the form of a number that can be changed on the go. The most flexible mechanism, but it is specific to OpenVZ and has some unpleasant memory side effects.
- Memory compression. Guest memory is “compressed” (with compression algorithms), which in some cases yields some additional space. The penalty: delays in reading and writing, and CPU load on the hypervisor.
- Page deduplication. If memory pages are identical, they are not stored twice; one of them becomes a reference to the other. It works well when running several virtual machines with the same software set (and the same versions) simultaneously: the code sections match and are deduplicated. For data it is ineffective, and the picture is spoiled by disk caches, which differ for each machine (and which strive to occupy all free memory). Of course, checking for duplicates (calculating a hash for a memory page) is not free.
- NUMA: the ability to expand the memory of a virtual machine to a volume larger than the server has. The technology is raw and not quite mainstream (I didn't dig deep, so I won't say more).
- Memory overcommitment / memory oversell: a technique of promising virtual machines more memory than is actually available (for example, promising 10 virtual machines 2 GB each with only 16 GB in total). The technique is based on the idea that normally no virtual machine uses 100% of its memory.
- Shared swap, which allows virtual machines to be partially swapped out to disk.
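As an illustration of ballooning in practice, a sketch using the libvirt Python bindings (URI and VM name are assumptions; the guest needs a balloon driver for the hypervisor to reclaim memory):

```python
import libvirt

conn = libvirt.open('qemu:///system')          # URI is an assumption
dom = conn.lookupByName('guest01')             # hypothetical VM name

# Ballooning: the current allocation can be changed at runtime up to the configured
# maximum; the balloon driver inside the guest gives up or reclaims the pages.
print(dom.maxMemory())                         # static ceiling, in KiB
dom.setMemory(1024 * 1024)                     # inflate/deflate the balloon to 1 GiB (KiB units)

# Rough view of what the guest actually uses (requires a balloon driver in the guest)
print(dom.memoryStats())
```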
Peripherals
Some hypervisors allow a virtual machine to access real hardware (moreover, different virtual machines can be given different hardware).
They can also provide device emulation, including devices that are absent on the computer. The most important devices, the network adapter and the disk, are considered separately; among the others: video adapters (even with 3D), USB, serial/parallel ports, timers, watchdogs.
One of the following technologies is used for this:
- emulation of real devices (slow)
- Direct access to a device from the guest machine (so-called “forwarding”, or passthrough); a sketch is given after this list.
- IOMMU (hardware translation of the page addresses used for DMA, which allows the RAM used by devices to be shared between virtual machines). Intel calls it VT-d (not to be confused with VT, aka Vanderpool, which is the ring -1 technology in the processor).
- Creation of paravirtual devices that implement a minimum of functionality (in fact the two main classes, block devices and network, are discussed below).
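A sketch of the passthrough case from the list above, through libvirt (the PCI address, URI, and VM name are made-up examples; an IOMMU such as VT-d / AMD-Vi is required for safe DMA isolation):

```python
import libvirt

# PCI passthrough: the guest gets direct access to a host PCI device.
hostdev_xml = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
  </source>
</hostdev>
"""

conn = libvirt.open('qemu:///system')          # URI is an assumption
dom = conn.lookupByName('guest01')             # hypothetical VM name
dom.attachDevice(hostdev_xml)                  # hot-plug the device into the running guest
```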
Network devices
Network devices are usually implemented at either layer 3 or layer 2. The created virtual network interface has two ends: one in the virtual machine and one in the hypervisor / control domain / virtualization program. Traffic from the guest is passed unchanged to the host (without any games with retransmission, speed matching, etc.). And then the quite significant difficulties begin.
Currently, apart from systems that emulate the network interface at layer 3 (the level of IP addresses, for example OpenVZ), all other systems provide the following set of features:
- Bridging (at layer 2) the interface with one or several physical interfaces (a sketch is given after this list).
- Creating a local network between the host and one or several guests without access to the real network. Notably, in this case the network exists in a purely virtual sense and is not tied to any “live” networks.
- Routing / NAT of guest traffic. A special case of the previous method, with routing enabled for the virtual interface (plus NAT/PAT).
- Encapsulating traffic in GRE / VLAN and sending it to hardware switches/routers.
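The bridging case from the list above, expressed as a libvirt sketch (bridge name, URI, and VM name are assumptions): a paravirtual NIC is created in the guest and its host end is plugged into an existing Linux bridge.

```python
import libvirt

# A paravirtual (virtio) NIC whose host end is attached to the existing bridge 'br0'.
iface_xml = """
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
</interface>
"""

conn = libvirt.open('qemu:///system')   # URI is an assumption
dom = conn.lookupByName('guest01')      # hypothetical VM name
dom.attachDevice(iface_xml)             # the guest sees a new NIC; its traffic goes straight to br0
```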
In some virtualization systems, the case of bridging a virtual machine's network interface with a physical network interface and the case of a virtual switch are treated separately.
In general, the virtual machines' network causes a particular headache during migration. All existing production systems with interface bridging allow transparent live migration of machines only within one network segment, and it takes special tricks (unsolicited “fake” ARP announcements) to notify upstream switches that the port carrying the traffic has changed.
A rather interesting system has recently appeared: Open vSwitch, which allows the task of determining a packet's path to be handed off to an OpenFlow controller; it may significantly expand the functionality of virtual networks. However, OpenFlow and vSwitch are a bit aside from the topic (and I will try to talk about them a little later).
Disk (block) devices
This is the second extremely important component of virtual machine operation. The hard disk (more precisely, the block device for storing information) is the second, and maybe even the first, most important component of virtualization. The performance of the disk subsystem is critical when assessing the performance of a virtualization system: a large overhead on the processor and memory is tolerated more easily than an overhead on disk operations.
Modern virtualization systems offer several approaches. The first is to provide the virtual machine with a ready-made file system; overhead tends to zero (this is specific to OpenVZ). The second is emulation of a block device (without any frills like SMART or SCSI commands). A block device of a virtual machine is bound either to a physical device (a disk, a partition, an LVM logical volume) or to a file (via a loopback device or by direct emulation of block operations “inside” the file).
An additional possibility is the hypervisor's use of network storage; in this case the migration process is very simple: the machine is paused on one host and resumed on the second, without transferring any data between hosts.
However, most systems, provided the block device of the underlying level supports it (LVM, a file), allow resizing the virtual block device on the go, which on the one hand is very convenient and on the other is something the guest OS is often not at all ready for. Naturally, all systems support adding/removing block devices on the go.
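A sketch of both operations through libvirt (paths, device names, and sizes are assumptions; the guest still has to notice the new or resized device and adjust its partitions and filesystems itself):

```python
import libvirt

conn = libvirt.open('qemu:///system')          # URI is an assumption
dom = conn.lookupByName('guest01')             # hypothetical VM name

# Hot-add a file-backed virtual disk
disk_xml = """
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/var/lib/libvirt/images/guest01-data.img'/>
  <target dev='vdb' bus='virtio'/>
</disk>
"""
dom.attachDevice(disk_xml)

# Grow an already attached virtual disk on the fly (size in KiB here: 20 GiB)
dom.blockResize('vdb', 20 * 1024 * 1024)
```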
Deduplication functions are usually delegated to the underlying block device provider, although, for example, OpenVZ allows copy-on-write mode using a “container template”, and XCP allows building a chain of block devices with copy-on-write dependencies on each other. On the one hand this reduces performance, on the other it saves space. Of course, many systems allow disk space to be allocated on demand (for example VMware, XCP): the file corresponding to a block device is created sparse (or has a specific format with support for “skipping” empty areas).
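On-demand allocation via a sparse file is easy to see at the filesystem level; a tiny sketch (file name and size are made up, behaviour as on typical Linux filesystems):

```python
import os

# Create a 10 GiB "disk" that occupies almost no real space until blocks are written
with open('guest01-disk.img', 'wb') as f:
    f.truncate(10 * 2**30)

st = os.stat('guest01-disk.img')
print('apparent size:', st.st_size)               # 10 GiB
print('actually allocated:', st.st_blocks * 512)  # close to zero for a fresh sparse file
```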
Disk access can be limited by speed or by the priority of one device (or virtual machine) relative to another. VMware has announced a great feature for controlling the number of I/O operations, providing low service latency to all guests by slowing down the hungriest ones.
Dedicated disk devices can be shared between several guests (using file systems that are ready for this, for example GFS), which makes it easy to build clusters with shared storage.
Since the hypervisor completely controls the guest's access to the media, it is possible to create snapshots of disks (and of the virtual machines themselves) and to build a tree of snapshots (who descends from whom) with the ability to switch between them (usually snapshots also include the state of the virtual machine's memory).
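A snapshot sketch through libvirt (snapshot and VM names are assumptions; whether RAM state is captured depends on the hypervisor and the flags used):

```python
import libvirt

conn = libvirt.open('qemu:///system')          # URI is an assumption
dom = conn.lookupByName('guest01')             # hypothetical VM name

# Create a named snapshot
snap_xml = "<domainsnapshot><name>before-upgrade</name></domainsnapshot>"
dom.snapshotCreateXML(snap_xml, 0)

# Snapshots form a tree; list them and revert to one
print(dom.snapshotListNames(0))
dom.revertToSnapshot(dom.snapshotLookupByName('before-upgrade', 0), 0)
```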
Backups are implemented similarly. The easiest way to make a backup is to copy the disk of the system being backed up: it is an ordinary volume, file, or LV partition that is easy to copy, including on the go. For Windows it is usually possible to notify shadow copy (VSS) of the need to prepare for a backup.
The interaction between the hypervisor and the guest
In some systems there is a messaging mechanism between the guest system and the hypervisor (more precisely, the managing OS), which allows information to be transferred regardless of whether the network is working.
There are experimental developments (not production-ready) on “self-migration” of the guest system.
Cross-compatibility
Work is underway to standardize interaction between hypervisors. For example, XVA has been proposed as a platform-independent format for exporting/importing virtual machines. The VHD format could claim to be universal, if not for the several incompatible formats hiding under the same extension.
Most virtualization systems provide the ability to “convert” competitors' virtual machines. (However, I have not seen a single live migration system that would allow a machine to migrate between different systems on the fly, nor even any sketches on this topic.)
Accounting
Most hypervisors provide one or another mechanism for estimating host load (showing current values and their history). Some provide the ability to accurately account for consumed resources in the form of absolute numbers of ticks, IOPS, megabytes, network packets, etc. (as far as I know, this exists only in Xen, and only as an undocumented feature).
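A sketch of pulling such counters through libvirt (URI, VM name, and device names are assumptions; what exactly is returned depends on the hypervisor driver):

```python
import libvirt

conn = libvirt.open('qemu:///system')          # URI is an assumption
dom = conn.lookupByName('guest01')             # hypothetical VM name; device names below too

print(dom.getCPUStats(True))        # cumulative CPU time consumed by the guest
print(dom.memoryStats())            # balloon size, RSS, etc. (needs a guest balloon driver)
print(dom.blockStats('vda'))        # read/write requests and bytes for one virtual disk
print(dom.interfaceStats('vnet0'))  # packets/bytes/errors for one virtual NIC
```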
Pooling and management
Most latest-generation systems allow combining several virtualization hosts into a single structure (cloud, pool, etc.), either by providing an infrastructure for managing the load or by providing a ready-made service that manages the load on each server in the infrastructure. This is done, first, by automatically choosing “where to start the next machine” and, second, by automatic migration of guests to load the hosts evenly. At the same time, the simplest fault tolerance (high availability) is also supported when shared network storage is used: if one host with a stack of virtual machines dies, its virtual machines will be started on other hosts that are part of the infrastructure.
If I have missed some significant features of some systems, say so and I will add them.