Snapshots for virtual machines in the cloud

Summary: A post about snapshots in the cloud: how to use them and how they are organized.

One of the most notable new features to appear in the cloud this year is snapshots. Everything we do falls into three categories: things that are useful to us (billing, service utilities, etc.); things that are useful to customers but not visibly noticeable (for example, storage systems, hypervisor version changes, previously launched servers); and things that are useful to customers and visibly noticeable. Snapshots belong to this third category.

I should warn you that this article will be rather complicated. First I will cover the simple things: how to work with snapshots and what they are good for. Then I will explain how they work inside. If we managed, I hope, to make snapshots convenient and clear at the “user” level, the description of the internals is another matter... So take heart, or skip that part.

How to use snapshots?

The most common use of snapshots is keeping a copy of the disk in case something goes wrong while configuring the machine. One important warning right away: snapshots are stored in the same place as the disks. This means that if a meteorite hits us or some other natural disaster of federal significance strikes, the snapshots will be lost together with the disks. For full backups, you should therefore use another storage location, geographically separate from us. We have absolutely no plans to lose customer disks or to let natural disasters anywhere near the servers, but I still have to warn you.
Creating a snapshot in the Selectel cloud
A snapshot can be taken at any time, whether the machine is running or stopped. While the snapshot is being taken, the machine's disk activity is briefly suspended (for about a second), after which it continues “as if nothing had happened”. There are two ways to take a snapshot: in the disk properties on the virtual machines page (which also has a “roll back to the previous snapshot” button) and in the snapshot list on the disks page, which also lists all of the disk's snapshots. Note that we normally do not allow creating snapshots while a virtual machine is being installed. Particularly attentive customers may notice that the “create snapshot” button on the disks page is still active (and works). Such a snapshot will contain nothing interesting (except a half-installed Linux), but we decided not to take away people's freedom to shoot themselves in the foot (sorry, to do whatever they want with their own machines).

So, a snapshot contains a copy of the disk as it was at the moment of creation. It is often much smaller than the disk. If you are curious how a snapshot's size is calculated, see the second part. Snapshots form a chain (if taken one after another) or a tree (how that happens is explained in the section on rolling back). If you delete a snapshot, it begins to “dissolve”, merging with its neighbors (and the total size of the snapshots shrinks). The process is fairly quick: a few minutes, and the snapshot is gone.

Personally, I consider the “tastiest” snapshot feature to be the ability to attach a snapshot as a disk. It is attached read-only and lets you look at the “previous” state of the disk. Nothing stops you from taking 10 snapshots of a disk and attaching all 10 to the same machine; the disks then form a chronology of the “main” disk.

Moreover, a snapshot can be attached to any number of machines at the same time. (To answer the obvious question of whether you can boot from a snapshot: formally, yes, but in practice the file system gets very nervous about a read-only root. We are working on this.)

The second most important feature is rolling back to a snapshot, that is, restoring the disk's earlier state. Any changes made since the snapshot are lost, so it is better to take a new snapshot before rolling back to an old one. That way the disk can be “switched” between snapshots (rolled back and forth). The rollback process has a couple of minor inconveniences: disk operations become unavailable, and the machine's past consumption is displayed incorrectly. The account's total consumption is calculated correctly, but because a new VBD (virtual block device) is created, the VM's data is shown for the new VBD. (We know about this not very obvious quirk of our billing and plan to make it more convenient in the foreseeable future.)
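A toy model makes the “take a fresh snapshot before rolling back” advice concrete. This is a minimal Python sketch, assuming a disk is just a dict of block number to bytes and each snapshot is a full frozen copy of that dict (the real system stores only differences, as described later); the class and method names are hypothetical.

```python
class ToyDisk:
    """Hypothetical model: a disk is a dict of block -> bytes; a snapshot is a frozen copy."""

    def __init__(self):
        self.blocks = {}
        self.snapshots = {}

    def snapshot(self, name):
        # bytes are immutable, so a shallow copy of the dict freezes the state
        self.snapshots[name] = dict(self.blocks)

    def rollback(self, name):
        # Everything written since `name` was taken is discarded
        self.blocks = dict(self.snapshots[name])


disk = ToyDisk()
disk.blocks[0] = b"config v1"
disk.snapshot("before-upgrade")
disk.blocks[0] = b"config v2"
disk.snapshot("after-upgrade")    # safety snapshot: keeps v2 reachable
disk.rollback("before-upgrade")   # back...
print(disk.blocks[0])             # b'config v1'
disk.rollback("after-upgrade")    # ...and forth
print(disk.blocks[0])             # b'config v2'
```

Without the “after-upgrade” snapshot, the second rollback would have nowhere to go: v2 would simply be lost.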

For convenience, in the last few days before the announcement we added a “finishing touch”: if a disk has been rolled back to a snapshot, it gets a reverted_at field (meaning “restored from a snapshot”). A trifle, but useful. This field stays with the disk until its death (and afterwards too, heh, since we never delete data about objects).

An important point: every time a snapshot is taken or rolled back, a copy-on-write (COW) effect kicks in: the first write to each block is slower than subsequent ones. So on very busy servers with a heavy write load, take snapshots with care.
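The first-write penalty can be illustrated with a simple cost model. This sketch assumes, purely for illustration, that the first write to a block after a snapshot costs two backend I/Os (read the old block, write the new one) and every later write to that block costs one; real costs depend on the storage system, and the function name is hypothetical.

```python
def io_cost(writes):
    """Count backend I/Os for block writes issued right after a snapshot.

    Illustrative cost model (an assumption, not measured figures):
    first write to a block -> 2 I/Os (copy-on-write), later writes -> 1 I/O.
    """
    copied = set()   # blocks already copied into the current layer
    ios = 0
    for block in writes:
        if block in copied:
            ios += 1
        else:
            ios += 2
            copied.add(block)
    return ios


print(io_cost([7, 7, 7, 8]))  # 6: two first writes at 2 I/Os, two repeats at 1
```

A write-heavy workload that touches many distinct blocks right after a snapshot pays the penalty on almost every write, which is why busy servers deserve care.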

If you take a few snapshots of a disk, roll back to a snapshot in the “middle”, take a few more snapshots, roll back to another snapshot, and then roll back again, a tree of snapshots forms. We store the relationships (which snapshot descends from which) in our database. Unfortunately, visualization is still in the works (the programmers protested loudly when given the task “draw a tree in JS”, and let them be ashamed when they read this post).
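The parent links that turn a chain into a tree can be sketched as follows. This is a hypothetical model of the relationships kept in the database, not the actual schema: each snapshot records the snapshot the disk descended from when it was taken, and a rollback simply moves the disk's “head”.

```python
import itertools


class SnapshotTree:
    """Hypothetical model of snapshot parent links for one disk."""

    def __init__(self):
        self.parent = {}                 # snapshot id -> parent id (None for the first)
        self.head = None                 # snapshot the live disk currently descends from
        self._ids = itertools.count(1)

    def take(self):
        sid = next(self._ids)
        self.parent[sid] = self.head
        self.head = sid
        return sid

    def rollback(self, sid):
        # The next snapshot will hang off `sid`, starting a new branch
        self.head = sid

    def children(self, sid):
        return [c for c, p in self.parent.items() if p == sid]


t = SnapshotTree()
for _ in range(3):
    t.take()           # chain 1 <- 2 <- 3
t.rollback(2)          # roll back to the "middle"
for _ in range(2):
    t.take()           # 4 branches off snapshot 2, then 5 follows 4
t.rollback(1)
t.take()               # 6 branches off snapshot 1
print(t.children(2))   # [3, 4]
print(t.children(1))   # [2, 6]
```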

Limits

Unfortunately, all this luxury is not unlimited. Our limits: a snapshot chain may be no longer than 20 disks, and a tree (counting branches) may contain no more than 60 snapshots. By our estimates, this is more than enough for normal use.
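Checking a new snapshot against these limits is simple bookkeeping. A sketch, assuming that “chain length” means the number of snapshots from the root down to the new one and that snapshots are stored as a map from snapshot id to parent id; the function names are hypothetical.

```python
MAX_CHAIN = 20  # longest allowed chain, per the limits above
MAX_TREE = 60   # most snapshots allowed in one tree, branches included


def chain_length(parent, sid):
    """Number of snapshots from `sid` back to the root (parent: id -> parent id)."""
    n = 0
    while sid is not None:
        n += 1
        sid = parent[sid]
    return n


def can_take_snapshot(parent, head):
    """Would one more snapshot on top of `head` stay within both limits?"""
    return (chain_length(parent, head) + 1 <= MAX_CHAIN
            and len(parent) + 1 <= MAX_TREE)


linear = {i: (i - 1 if i > 1 else None) for i in range(1, 21)}  # a 20-long chain
print(can_take_snapshot(linear, 19))  # True: a branch off 19 is only 20 long
print(can_take_snapshot(linear, 20))  # False: the chain would reach 21
```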

On the “disks” page, each disk has a “snapshots” tab listing all of the disk's snapshots. Snapshots can be given a name and a multi-line description (but everyone is lazy; yes, I too love it when these fields are filled in, but filling them in myself is usually a chore). In any case, a snapshot can be uniquely identified by an absolutely useless number (the UUID, or “universally useless ID”) and, partially, by its creation date.



A few words about the “total” field. Because of how the system works, snapshot information is updated unevenly: the snapshot list is updated immediately after a snapshot is created, but the “total” field may lag behind by up to two minutes. Unlike other resources, which we compute in real time, disks and snapshots are counted at (approximately) two-minute intervals. The “total” field is computed when consumption is calculated, so right after a snapshot is created the “total” will be wrong (but it will definitely catch up at the next billing tick).
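The lag between the list and the total can be reproduced with a tiny simulation. This is a hypothetical sketch, assuming accounting ticks land exactly on multiples of 120 seconds and that a snapshot's size is fixed at creation; neither is a claim about the real billing code.

```python
TICK = 120  # seconds between accounting runs (approximately two minutes)


def snapshot_page(created, now):
    """Return (number of listed snapshots, displayed total) at time `now`.

    created: list of (timestamp_s, size_gb) snapshot creations.
    The list is live, but the total only reflects the last accounting tick.
    """
    last_tick = now // TICK * TICK
    listed = sum(1 for t, _ in created if t <= now)
    total = sum(size for t, size in created if t <= last_tick)
    return listed, total


events = [(10, 5), (130, 3)]
print(snapshot_page(events, 150))  # (2, 5): new snapshot listed, total not yet updated
print(snapshot_page(events, 240))  # (2, 8): caught up at the next tick
```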

How does it work?


(Please remove minors and people of a nervous disposition from the screen: hardcore content follows.)

Our snapshots (like our disks) are based on the VHD format, which was invented by Microsoft, made public, and is used by Citrix. It supports very efficient snapshots (much more efficient than LVM snapshots, where the number of writes grows in proportion to the number of snapshots). When a snapshot chain is built, there is an implicit “zero” snapshot against which the changes in all the others are recorded (without this “zero” snapshot it would be unclear what “changes” the first snapshot stores). The zero snapshot, of course, is not billed, since it physically takes up no disk space.
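To see why the implicit zero snapshot costs nothing, consider how a read is resolved in a chain of differencing layers. A simplified Python sketch: real VHD differencing disks track allocation with a block allocation table and per-block sector bitmaps, which are collapsed here into plain dicts; the names and the 4-byte block size are hypothetical.

```python
BLOCK = 4  # toy block size in bytes; real VHD blocks are far larger


def read_block(chain, block):
    """Resolve a read through a differencing chain.

    chain: layers ordered newest first; each maps block number -> bytes.
    The newest layer holding the block wins. If no layer holds it, the read
    falls through to the implicit "zero" snapshot and returns zeros, which
    is why that snapshot needs no real space.
    """
    for layer in chain:
        if block in layer:
            return layer[block]
    return bytes(BLOCK)


oldest = {0: b"AAAA", 1: b"BBBB"}
newer = {1: b"bbbb"}
current = {}
chain = [current, newer, oldest]
print(read_block(chain, 0))  # b'AAAA'
print(read_block(chain, 1))  # b'bbbb'
print(read_block(chain, 9))  # b'\x00\x00\x00\x00'
```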

When a write lands on a “hole” block, that block is copied from the “old” snapshot to the current disk (the part being written is replaced, the rest is taken from the previous copy). After the write, the current disk has one hole fewer, and future reads of that location come from the “current” disk. Disk operations on disks with snapshots cost the same as ordinary disk operations (personally, I'm not sure how much harder snapshot operations are for our storage systems, so we decided not to touch this area).
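Filling a hole on the first write, including a write that covers only part of a block, can be sketched like this. The sketch assumes a tiny 4-byte block and a callback that reads the block from the older layers; both, and the function name, are hypothetical simplifications.

```python
BLOCK = 4  # toy block size in bytes


def write(current, read_parent, block, offset, data):
    """Write `data` at `offset` within `block` of the current (top) layer.

    If the current layer has no copy of the block yet (a "hole"), the old
    contents are first pulled from the older layers; only the written bytes
    are replaced, the rest keeps the previous copy's contents.
    """
    if offset < 0 or offset + len(data) > BLOCK:
        raise ValueError("write must fit inside one block")
    if block not in current:
        current[block] = bytearray(read_parent(block))  # fill the hole (copy-on-write)
    current[block][offset:offset + len(data)] = data


older = {0: b"AAAA"}
current = {}
write(current, lambda b: older.get(b, bytes(BLOCK)), 0, 1, b"xy")
print(bytes(current[0]))  # b'AxyA': bytes 1-2 replaced, the rest copied from `older`
print(older[0])           # b'AAAA': the older layer is untouched
```

Once the block exists in `current`, later writes and reads never touch the older layers again.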

What happens when a snapshot is created? (Technical part)

Snapshot structure in the Selectel cloud

The current disk is declared a so-called 'base copy', that is, a read-only copy of the machine's state. Since the disk may have had predecessors in the snapshot chain, the base copy refers to another base copy (note: a base copy only ever refers to another base copy). In addition, another “snapshot” is created: a read/write copy of the current state (that is, the differences between the snapshot and the base copy). In principle, snapshots are writable, but we forbid this, because that would effectively give us thin provisioning, which we cannot allow because space must be guaranteed to be reserved (see the section below). But even an “unwritten” snapshot contains 8 MB of metadata. Thus each snapshot consists of two halves: metadata (8 MB) and the contents of the base copy. The disk holds two kinds of links: one to the base copy of the previous snapshot, the other to the snapshot. When a disk is rolled back, the snapshot is cloned (not copied; hence the COW nuances), and the clone refers to the same base copy that the original snapshot referred to.

If someone takes a snapshot three times in a row (without changing the data), the result is one base copy and three snapshots, each with its own metadata.
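The structure above, including the “three snapshots in a row” case, can be modelled with a few dataclasses. A simplified sketch: the class names are hypothetical, each snapshot's read/write half is reduced to its 8 MB of metadata (since writing to snapshots is forbidden), and the rule “reuse the base copy if nothing changed” is an assumption chosen to reproduce the outcome described in the text.

```python
from dataclasses import dataclass, field
from typing import Optional

META_MB = 8  # metadata carried by every snapshot, per the description above


@dataclass
class BaseCopy:
    parent: Optional["BaseCopy"]  # a base copy only ever refers to another base copy
    blocks: dict                  # read-only contents frozen at snapshot time


@dataclass
class Snapshot:
    base: BaseCopy
    meta_mb: int = META_MB


@dataclass
class Disk:
    base: Optional[BaseCopy] = None
    blocks: dict = field(default_factory=dict)  # writable differences on top of `base`

    def take_snapshot(self):
        if self.blocks or self.base is None:
            # Freeze the current differences into a new read-only base copy
            self.base = BaseCopy(parent=self.base, blocks=self.blocks)
            self.blocks = {}
        # With no changes since the last snapshot, the same base copy is reused
        return Snapshot(base=self.base)

    def rollback(self, snap):
        # Clone, don't copy: start an empty writable layer over the snapshot's
        # base copy, so the first writes afterwards pay the COW cost
        self.base = snap.base
        self.blocks = {}


disk = Disk(blocks={0: b"data"})
snaps = [disk.take_snapshot() for _ in range(3)]
print(len({id(s.base) for s in snaps}))  # 1 base copy
print(sum(s.meta_mb for s in snaps))      # 24 (MB of metadata for three snapshots)
```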

When a snapshot in the middle of a chain is deleted, the following happens: the snapshot itself (its metadata) is deleted immediately, but its base copy starts to dissolve. Its data is moved either to the “previous” state or to the “future” one, or discarded altogether (if there is an alternative both in the past and in the future). This is the “melting” of snapshots, and it does not happen instantly. It should be said that the data is not actually copied, but merely “re-mapped” within LVM (logical extents, LE, are moved between logical volumes, LV) or deleted (if another version of the block exists in the previous copy).
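One direction of this “melting”, merging a deleted middle layer into the layer above it, can be sketched as follows. This is a simplified illustration, not the actual coalescing code: it assumes exactly one child layer, ignores merging into the previous layer, and uses dict references to stand in for moving LVM extents; all names are hypothetical.

```python
def dissolve(layers, parent, victim):
    """Remove one base copy from the middle of a chain without changing what
    the remaining layers see.

    layers: layer id -> {block: bytes}; parent: layer id -> parent id.
    Simplification: exactly one child layer sits directly on top of `victim`.
    """
    (child,) = [c for c, p in parent.items() if p == victim]
    for block, data in layers[victim].items():
        if block not in layers[child]:
            # The child still "sees" this version, so hand it over (a reference
            # move, much like re-mapping extents rather than copying data)
            layers[child][block] = data
        # Otherwise the child has its own newer version and this one is dropped
    parent[child] = parent[victim]
    del layers[victim], parent[victim]


def view(layers, parent, lid):
    """Resolve every block visible from layer `lid` (newest version wins)."""
    seen = {}
    while lid is not None:
        for block, data in layers[lid].items():
            seen.setdefault(block, data)
        lid = parent[lid]
    return seen


layers = {0: {0: b"A", 1: b"B"}, 1: {1: b"b", 2: b"c"}, 2: {2: b"C"}}
parent = {0: None, 1: 0, 2: 1}
before = view(layers, parent, 2)
dissolve(layers, parent, 1)
print(view(layers, parent, 2) == before)  # True: the newest layer sees the same data
print(parent)                              # {0: None, 2: 0}
```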

A bit about thin provisioning


One of the questions we are often asked about the storage system concerns thin provisioning. What is thin provisioning? It is when the consumer is promised a certain amount of space, while the space actually occupied is smaller and grows as data is actually written. This fits our snapshot model perfectly (copy-on-write from “empty space”), and XCP implements it excellently. In fact, thin provisioning is “writing to a snapshot”, that is, writing to “empty space”, which only then starts to occupy real space.
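The difference between declared and actually occupied space can be shown with a toy volume. A minimal sketch with a hypothetical class: the declared size is fixed up front, and a block consumes space only when it is first written.

```python
class ThinVolume:
    """Toy thin-provisioned volume: the declared size is fixed up front,
    but a block only consumes space once something is written to it."""

    def __init__(self, declared_blocks):
        self.declared_blocks = declared_blocks
        self.allocated = {}  # block -> bytes, only for blocks actually written

    def write(self, block, data):
        if not 0 <= block < self.declared_blocks:
            raise IndexError("block outside the declared size")
        self.allocated[block] = data  # "empty space" becomes real space here

    def read(self, block):
        return self.allocated.get(block, b"\x00")  # unwritten blocks read as zeros

    def used_blocks(self):
        return len(self.allocated)


vol = ThinVolume(declared_blocks=1_000_000)
vol.write(42, b"x")
print(vol.declared_blocks, vol.used_blocks())  # 1000000 1
```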

However, thin provisioning is dangerous. Its flip side is overselling (also known as oversubscription). Roughly speaking, suppose we have 100 TB of space, and we allow 200 disks of 1 TB each to be created on that storage. Initially the disks actually occupy 30-50 GB each, so there is plenty of free space. But then, suddenly, customers start writing to their disks. The space has already been promised to them. A little time passes, and... yes, the average disk fill creeps up to 500 GB. And then... someone wants to write one more gigabyte and gets an error, because the space has run out.
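The numbers in this example can be checked directly. A small sketch using the figures from the text, assuming decimal units (1 TB = 1000 GB) for simplicity; the constant and function names are hypothetical.

```python
CAPACITY_GB = 100 * 1000   # 100 TB of physical space (decimal units for simplicity)
N_DISKS = 200
DECLARED_GB = 1000         # each disk is promised 1 TB

oversubscription = N_DISKS * DECLARED_GB / CAPACITY_GB  # promised / physical


def free_gb(avg_fill_gb):
    """Physical space left when the average disk holds `avg_fill_gb` of data."""
    return CAPACITY_GB - N_DISKS * avg_fill_gb


print(oversubscription)  # 2.0: twice the real capacity has been promised
print(free_gb(50))       # 90000: at 30-50 GB per disk there is plenty of room
print(free_gb(500))      # 0: the very next gigabyte written anywhere fails
```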

We have no control over customers' disks: if we have given them resources, those resources are theirs, and it is not our place to say “now you may, and now you may not.” With other resources a compromise is possible (someone did not get 3% of a CPU, someone was migrated to another host to preserve a performance margin), and that is a slight, unnoticeable “short delivery”. With disk space this does not work: refuse to write even a single sector, and an error is recorded for the whole block device.

So, as a matter of common sense, we decided not to do this.

Because snapshots are created read-only and only shrink after creation, we can refuse to create a new snapshot (anything can happen; space may suddenly run out), but we will never refuse service to disks and snapshots that already exist.
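The “refuse the new, never break the existing” policy amounts to admission control over reserved space. A hypothetical sketch, not the actual allocator: it assumes that space is reserved up front and that a new snapshot must reserve up to the full disk size, since in the worst case the whole disk is rewritten afterwards.

```python
class Pool:
    """Hypothetical storage pool that never overcommits: space is reserved up front."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.reserved_gb = 0

    def reserve(self, gb):
        if self.reserved_gb + gb > self.capacity_gb:
            return False  # refuse the *new* object; existing reservations are untouched
        self.reserved_gb += gb
        return True

    def release(self, gb):
        # Snapshots only shrink after creation (e.g. while dissolving), freeing space
        self.reserved_gb = max(0, self.reserved_gb - gb)


def take_snapshot(pool, disk_size_gb):
    # Assumption: after a snapshot the whole disk may be rewritten, so the new
    # writable layer must be able to grow to the full disk size
    return pool.reserve(disk_size_gb)


pool = Pool(capacity_gb=100)
pool.reserve(30)                                     # a fully backed 30 GB disk
print([take_snapshot(pool, 30) for _ in range(3)])   # [True, True, False]
print(pool.reserved_gb)                              # 90: the refusal changed nothing
```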

Source: https://habr.com/ru/post/138347/
