📜 ⬆️ ⬇️

Xen Cloud Platform in Enterprise [4]

Fourth part. Previous parts: First , second , third .

In this chapter: block devices: physical and virtual; disk images, storage.

From the point of view of controlling cars, this is the most informative part. Even network management (the following parts) is simpler than disks. The reason is simple - disk devices in virtualization are the most complex abstractions.
Before proceeding to the actual description, we list the properties of disk devices:

Each line from this list is a separate property that requires a separate implementation (for comparison, the network interface has only two properties: they can receive and send information).

For virtualization, these properties are divided into two groups: the first is related to the storage of information. The second is connected with access to the place of information storage.

Obviously, access (access process) is closely related to the virtual machine. And the storage process is not. We can store without the one who writes, but we cannot write without the one who stores. However, you cannot write information in the same way without the person who writes, so the access process is closely related to both the virtual machine and the information store.

Thus, we have a functional separation of storage and access. In XCP, they are associated with separate entities: storage - VDI (virtual disk image), access - VBD (virtual block device). VDI may exist outside the context of the virtual machine, VBD may not. VDI can be connected to different virtual machines at different times, VBD - no, it cannot exist without a virtual machine. In fact, VBD is a block device in a guest virtual machine.

The difference between VDI and VBD is about the same as between a hard disk and / dev / sda - both drives, but you can hold the first one in your hands, and the second is the abstraction of the operating system. (And we can write to disk only with the help of / dev / sda, and each computer has / dev / sda, although it can point to the same disk that we cross between computers).

VBD itself consists of two parts: one - inside the virtual machine, the second in the dom0 (administrative virtual machine). The driver inside the domU (guest machine), accepting the disk request almost unchanged, transmits it “out” and accepts the result, giving it to the kernel of the guest machine. The half of the driver in dom0 is much more complicated, it accepts requests from the first half of the driver and does all the substantive work of the request (read, send a request to the operating system or the storage), returning the result. The separation in this form is done so that under Xen you can very quickly and simply write drivers for guest systems. Fortunately, during administration, this division is not visible and VBD can be perceived as a single entity.

VDI in the most primitive implementation is a byte copy of the disk. However, in real life it is too wasteful (the place is lost, there are a lot of repeated pieces of data), so VDI can often have significant differences from the “byte-by-byte” correspondence. First, the underlying storage can do thin providing when duplicate blocks are written to the physical disk once, and the rest are just links. When writing to such blocks, they are copied, and the link is deleted (copy on write technology). In addition, some storages (meaning specialized hardware) can provide an independent search for duplicate blocks and convert them into links.

Secondly, it is obvious that areas in which no one has written anything (empty space) are easy to “compress”, that is, to replace the actual thousands of zeros with the notation that “there are only zeroes”. The simplest example of this is sparsed files, in which unrecorded areas are simply not stored, forming “holes” in the middle of the file.

VDI storage locations are called SR - storage repository. SR is an abstraction, it has no physical embodiment. To specify how exactly to implement the physical embodiment used SM (storage manager'y). Each SR indicates through which SM it should be accessed and which parameters should be passed to the SM.

A key feature of the Xen Cloud Platform (compared to bare Xen on multiple computers), which makes the XCP a ready-made platform, is the concept of shared storage (shared SR). These are storages to which all hosts are simultaneously connected with the same settings. It is shared SR that makes it possible to migrate machines between hosts in a short time.

SM is, roughly speaking, a driver. Since several hosts can be connected to the same storage, another abstraction is used for the physical connection of the host - PBD (Physical Block Device). When a host connects to a repository (SR) it uses a driver (SM) with a set of parameters. The driver establishes the connection (exactly how - the SM headache) and gives out to the PBD host. Examples of PBD - mounted nfs-ball, connected iscsi-device, local hard disk, separate partition on this disk, etc.

Why do we need such a complex scheme? It allows you to abstract as much as possible from the specific features of the storage - the “storage driver” (SM) provides a standard interface for XCP, which allows, for example, to replace NFS with FiberChannel without serious consequences without changing any code in XCP.

Thus, the SR connection to the host looks like this: at the time of creating the new storage, all the hosts in the pool are sent the command “create PBD with SM with this and such parameters, such as those belonging to SR to such and such”. Each host executes this command. Similarly, when a host is pooled (it creates all PBDs for all shared SRs). When loading each host connects all new PBDs. If some PBDs fail to connect, they are marked as unplugged and affect the order in which the host is selected to start the virtual machine.

In addition to the shared repositories, there are also “personal” host repositories, but they are not recommended for product, because they automatically deprive the virtual machine connected to the VDI stored on the SR, the right to migrate. One of the uses of local SRs is access to physical screws (usually during server migration) and / or physical CDs.

By the way, it becomes clear how transparent work with thin software, sparced files, copy-on-write for storage is ensured. This is a SM headache. He was told “make a thin copy of such and such blocks”, SM looked-looked if he could, made a thin copy (link), if he could not, ran dd and made a thick copy.

From here, by the way, it is easy to understand that copying VDI between SRs is always an expensive operation - you need to copy each VDI block.

Let's repeat the virtual machine disk connection pattern:

The virtual machine has VBD, VBD indicates the associated VDI, VDI (as an administrative unit) indicates the SR it is located on, SR indicates which SM is used. The virtual machine is located on a certain host, the host has an associated PBD through which it is accessed.

The most interesting thing happens when the machine is migrated: on one host, the VDI is closed, on the second, through another PBD in the same SR, access is established. If not, the migration is canceled and the machine remains on the same host (this is important, in case of an erroneous migration, you can get a lot of brakes on the hosts, but this will not affect the work of the virtual machine).

Storage Types

Actually, these are the SMs that come with XCP, the type field for SR:

* iscsi is a block device for a single VDI (this means that the entire SR is dedicated to a single VDI)
* lvmohba - LVM over the connected Fiber Channel. The main thing is the ability to store multiple VDI on one device through the use of logical volumes. The volumes are managed by the XCP itself.
* file - local storage, just a directory with disk image files
* lvm - local storage, LVM on a local hard disk. Each VDI is a separate volume. Managed by XCP.
* udev is a cute toy for plugging flash drives into virtual machines, SR is served by a single VDI, local storage.
* nfs - VDI images in the form of VHD files are stored on the NFS ball. The difference with file is that it is a global repository, and in it the XCP manages the connection of the ball to the hosts.
* dummy - dummy, cannot store VDI.
* hba - allocation of the specified block device FC for storing a single VDI.
* ext - a rather perverted construction - VDI files are stored in VHD format on an ext3 partition created by XCP on the specified block device.
* ISO storage ISO files.
* lvmoiscsi - LVM over an iSCSI block device.

As we can see, there are three important types in this list: lvmoiscsi, lvmohba, nfs - only they can be used for “virtual machines by bundles” (the rest can also be used, but in specific cases).

Note that when installing XCP it is asked whether to do local storage. If you answer yes, then you get SR with type lvm.

Unfortunately, I have little experience with FC, so I won't say anything significant about him. The choice between NFS and iSCSI is determined by the following differences:

NFS provides the greatest density of space utilization (sparsed files are used), provides very good read caching.
iscsi gives a higher write speed, smaller brakes when you first write to the sparsed-area.

Another advantage of NFS is the ease of manipulating VDIs on storage — they look like regular files. In the case of iscsi, we have lvm on top of the block device (which can be a partition or even a file on the network storage), and VDIs and there inside are logical volumes. It is much more difficult to get to them "piece" access from storage. In addition, NFS is used in the third version, in which there is no cryptographic protection and only validation by IP addresses is in effect. By comparison, iscsi uses mutual authentication with a chap, so my personal advice is to use iSCSI. Especially since iscsi supports multipath [?] , And XCP supports it too.

It is clear that the configuration of the SR goes beyond the XCP - it also requires the configuration of the storage itself. The most common mistake is the limit on the number of simultaneous connections to the iscsi-storage (the regular iscsi-target mode is one connection, XCP requires permission to connect several) will be able to connect.

Operations with SR: create, destroy (the data is deleted!), Forget (only the record about the existence of SR is deleted, data is saved), introduce (“detect” sr and VDI'and on it), scan (allowing to detect XCP vdi'i, this is especially true for NFS). In addition, there is the xe-sr-probe command, which allows you to construct a request to create a storage (more precisely, select the correct parameters). However, this issue is discussed in detail in the XCP manual.

The important field for SR is sm-config (SM configuration). In particular, it indicates the type of space allocation thick or thin (for iscsi it is thick, for nfs it is thin).


As already mentioned, the actual meaning of a VBD is a block device inside a virtual machine that allows it to access VDI.

Accordingly, VBD has the following important attributes: virtual machine uuid, uuid VDI and number . The number is just a number and roughly corresponds to the SATA port number on the motherboard. The VBDs themselves in Linux are called / dev / xvda, / dev / xvdb, / dev / xvdc ... (for the curious: after / dev / xvdz comes / dev / xvdaa and then up to the limit of 256 devices).

By convention, each time a VDD is connected to a VM, a new VBD is created (however, VBD can change the status of plugged / unplugged as many times as necessary) and is destroyed when a VDI or VM is destroyed automatically.

An important attribute of VBD is the presence of the 'bootable' flag. The analogy with real iron is rather complicated; let's say, this is the analogy of the disk specified for loading in BIOS. Important: to boot a virtual machine, VBD must be connected to it with the flag bootable = true, and one and only one. After downloading to the machine, you can cling to anything in any desired quantity - but to start downloading the bootable should be and should be the only one.

In addition, vbd has two more important attributes: unplugable and empty. The meaning of unplugable is clear from the name (the device cannot be turned off while on the move, it is usually set for the system disk). Empty is funnier: vbd allows you to “hook” not only images of hard disks, but also ISO images, and VBD itself can represent CD-ROM . The empty attribute is intended to simulate the extraction of a disk from a drive.

In addition, there is the ro attribute, which allows you to connect the drive in read-only mode (and you cannot get around it from the guest machine).

For the same VDI, there may be several VBDs in the same machine, or even in several. Unfortunately, in the existing system, there can only be one connected at a time (although it is quite possible to imagine a situation when VDI is connected to two virtual machines that are working in turn).

Typical operations for VBD: create, destroy (destroys the connection between the disk and the machine without destroying data on the disk), plug, unplug. There is also an imitation of inserting / removing cd - load / eject, but this is fanservice.

Interesting attributes: virtual-allocation, physical-utilization, physical-size. The first two can vary greatly for a thin provision. Another content-type field was specifically made to indicate that ISO's are stored on SR. The type field indicates whether it is a disk or cdrom (a different subset of the available commands).


VDI also has attributes (in fact, there are a lot of them, I cite only the most important and used in the work). This is the r / o flag (yes, ro can be set at the level of vbd or even at the level of vdi) and ancestors-descendants. About the relations ancestor-descendant and in more detail about thin copying for VDI - in the following chapters when it comes to snapshots.

Interesting attributes: virtual_size (set at creation) and physical_utilisation, showing the amount of occupied space (in terms of storage).

Typical operations: create, destroy (destroys not only data together with VDI, but also VBD - connection between VDI and VM), copy (always thick copy), clone (if possible, thin copy with copy-on-write mode) and ... our favorite, resize . Yes, xcp allows you to change the size of the disks, moreover, it allows you to do it on the go without interruption in service. The only problem is that the guest systems are not really ready for this at the moment, and after the elegant xe vdi-resize, you will have to do an absolutely inelegant resizing of the partition, resize the file system.

For ISO, mass crutches were invented — a content-type in the SR properties, and even a whole set of commands — xe vm-cd-add, xe vm-cd-eject, etc., although in fact ISO is the same VDI that connects with VBD with CDROM type. It is important that it is exclusively R / O, no CD / DVD-RW is emulated.

Auto-create VDI

The theme of the guest machine templates will be dealt with later, but for now, I note that when creating a machine from a template (and this is the only method to create a virtual machine in XCP), a new disk is created for the new machine. It is created thanks to the provision value template written in the other-config field (just a description of what the drive should be). Here is an example from one of the templates: <provision><disk device="0" size="8589934592" sr="" bootable="true" type="system"/></provision> . As we can see, the syntax is clear, although it is too XML - the zero (first) device, the boot disk, the size is 8GB.

Another small note about the type (type = system). In XCP, it was thought that the system disk is tied to a VM (and does not make sense without it), and type = User stores data and should not suffer because of virtual machine problems. Unfortunately, there is a bug in this place, and the system in some cases is treated as user (thanks for not being the other way around). For disks created when creating a machine with the System type, the following rule applies: when you remove a machine (vm-uninstall), these disks are destroyed. User-drives, no.

The bug is that if you create vdi with system type, it will not be deleted when you remove the machine. In principle, this is not a very big problem, but if you often make cars, cling system discs to them and remove cars, then the place may end and you have to rub unnecessary discs with your hands.

In the following releases: migration, templates, snapshots, parent-child relationships, machine cloning, installation of virtual machines, HVM.

Source: https://habr.com/ru/post/105568/

All Articles