
A great many articles have been written about virtual machine snapshots, most of them dwelling at length on the theory. In this article I will focus on the practical side of the issue, and exclusively on the VMware vSphere platform.
So why do we need “quiesced”* snapshots, how are they used, and what typical problems arise with them? I will look at snapshots primarily from the point of view of backup, but I will try to touch on other uses as well.
* If anyone can suggest a suitable Russian-language term, you are welcome to do so in the comments; if it is a good one, I will replace the anglicism in the text.
Using snapshots for backup
In the VMware vSphere environment, the snapshot creation process is controlled by two options:
- Snapshot that includes the virtual machine's memory state
- Snapshot preceded by so-called quiescing of the guest file system
When virtual machines are backed up through the VMware vStorage API for Data Protection, the first option is simply not used. The main reason is this: if a virtual machine has a large amount of RAM (and 8-16 GB has not been unusual for a long time), the creation time and the size of each incremental backup become significant, because every incremental backup additionally includes the contents of RAM. There are also a number of technical difficulties, but they are of little interest today because we are considering the alternative scenario.
That alternative is our second option, quiescing. It is of much greater interest, and its essence is to prepare the guest operating system (the file system first of all) for the backup to be taken.
What is quiescing?
If we translate the official article, we get something like this:
"This is the process of bringing the data on a virtual disk into a state suitable for backup. This process may include flushing dirty buffers from the operating system's memory to disk, or other application-specific high-level operations."
This description does not really make it any clearer what happens to the virtual machine. Let's figure it out ourselves.
First, VMware Tools, through its VMware Snapshot Provider service, initiates the creation of a VSS snapshot inside the guest OS. Then all registered VSS writers in the guest OS (you can list them with the "vssadmin list writers" command) receive the request and prepare their applications for backup, which means that all transactions are flushed from memory to disk. When all VSS writers are done, they report this to VMware Tools (again, via the VMware Snapshot Provider service), which in turn tells VMware that the snapshot can be taken.
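The writer list mentioned above is easy to check by script. Below is a minimal Python sketch, assuming the output of "vssadmin list writers" has been captured as text and follows the usual "Writer name / State / Last error" layout. The sample text and the function names are illustrative only and are not taken from a real system.

```python
import re

# Illustrative excerpt in the usual `vssadmin list writers` layout;
# the writers and states below are sample data, not real output.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [8] Failed
   Last error: Non-retryable error
"""


def parse_writers(text):
    """Return a list of (writer_name, state, last_error) tuples."""
    writers = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"Writer name: '(.+)'$", line)
        if match:
            # A new writer section starts here.
            current = {"name": match.group(1), "state": "", "error": ""}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["error"] = line.split(":", 1)[1].strip()
    return [(w["name"], w["state"], w["error"]) for w in writers]


def unhealthy_writers(text):
    """Names of writers whose last error is anything other than 'No error'."""
    return [name for name, _state, error in parse_writers(text) if error != "No error"]
```

Calling unhealthy_writers(SAMPLE_OUTPUT) on the sample returns only 'SqlServerWriter'. A writer that reports an error before the snapshot is a reasonable first suspect when quiescing misbehaves.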
As a result, backup applications for VMware vSphere use one of the following combinations when issuing the command to create a VMware snapshot (note that the snapshot creation itself is controlled entirely by VMware):
- Quiesced = ON, Memory = OFF
- Quiesced = OFF, Memory = OFF
We will not consider the second combination in this article and will focus on the quiescing process.
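For illustration, here is roughly what such a request looks like from code. This is a sketch, not any particular backup product's implementation. It assumes a VM object that exposes CreateSnapshot_Task(name, description, memory, quiesce), as the pyVmomi vim.VirtualMachine object does. The helper name, the snapshot name and the FakeVM stand-in are hypothetical and exist only so the example runs without a vCenter.

```python
def create_backup_snapshot(vm, name="backup-temp", quiesce=True):
    """Request a snapshot the way backup tools do: never with memory.

    `vm` is expected to expose CreateSnapshot_Task(name, description,
    memory, quiesce), as a pyVmomi vim.VirtualMachine object does.
    """
    return vm.CreateSnapshot_Task(
        name=name,
        description="Temporary snapshot for backup",
        memory=False,      # the memory option is not used for backups
        quiesce=quiesce,   # True asks VMware Tools to quiesce the guest
    )


class FakeVM:
    """Hypothetical stand-in for a real VM object, for demonstration only."""

    def __init__(self):
        self.calls = []

    def CreateSnapshot_Task(self, name, description, memory, quiesce):
        # Record the arguments instead of talking to vCenter.
        self.calls.append({"name": name, "memory": memory, "quiesce": quiesce})
        return "task-1"
```

With a real connection, the returned task would still have to be awaited and checked. As the rest of the article shows, a successful task does not by itself prove that VSS was used inside the guest.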
Why do you need quiescing?
The most obvious example is the USN rollback problem when restoring a domain controller from a backup. It occurs if the virtualized domain controller was backed up without VSS (that is, without the quiescing option or any other means of making sure that transactions are written to disk).
No extra steps or workarounds are needed if you restore a backup made with the quiescing option. The InvocationID will be reset correctly, and after the restored controller boots you will see the following entry in its Event Log:
Event ID 1109: Active Directory has been configured to host an application partition. The invocationID attribute for this domain controller has been changed.
The same correct behavior can be observed with Acronis vmProtect 9. We specifically tested it by backing up and restoring virtual machines with a domain controller inside.
USN rollback is obviously not the only possible problem with non-quiesced snapshots: other applications that explicitly support VSS (for example, Exchange or SQL Server) can also fail when recovered from such snapshots.
How to check that a snapshot was created correctly using VSS?
There are several ways to verify that an application-consistent snapshot was created correctly:
The easiest way: log in to the guest operating system and open Event Viewer. After a snapshot is created with the options quiesced = ON, snapshot memory = OFF (see the screenshot at the beginning of the article), events from the corresponding VSS writers should be present in the Application log:

Note: the VSS error with Event ID 12289 that can be seen in the screenshot is not really a problem. It refers to the 3.5" floppy drive, and to get rid of it, it is enough to remove the floppy drive from the virtual machine configuration.
The more involved way: use the Datastore Browser in the vSphere Client. After a quiesced snapshot is created, a vss_manifests*.zip file should appear in the virtual machine's folder on the datastore.
The file contains backup.xml with a description of all the VSS writers found in the guest system, plus the metadata for each writer in writerX.xml.

IMPORTANT: if vss_manifests.zip contains only backup.xml, this usually means that the snapshot was actually made without VSS.
This brings us to the most interesting part: snapshot problems. Below I will list the main causes of broken snapshots. Note that the main danger is not snapshots that fail outright (those are easy to detect) but snapshots that VMware reports as successful although they are not actually consistent.
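The vss_manifests check described above can also be scripted. Here is a minimal Python sketch, assuming the archive has already been downloaded from the datastore and that the per-writer files are named writer0.xml, writer1.xml and so on (the numeric suffix is my assumption about the "writerX.xml" pattern). The function names are hypothetical, and make_archive exists only to build demonstration data.

```python
import io
import re
import zipfile


def inspect_vss_manifest(archive):
    """Classify a vss_manifests archive given a path or file-like object."""
    with zipfile.ZipFile(archive) as zf:
        # Compare on bare, lowercased file names in case of subfolders.
        names = [n.rsplit("/", 1)[-1].lower() for n in zf.namelist()]
    has_backup = "backup.xml" in names
    writer_files = sorted(n for n in names if re.fullmatch(r"writer\d+\.xml", n))
    return {
        "has_backup_xml": has_backup,
        "writer_files": writer_files,
        # Only backup.xml and no writer metadata: VSS was probably not used.
        "vss_probably_used": has_backup and bool(writer_files),
    }


def make_archive(member_names):
    """Build an in-memory zip with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in member_names:
            zf.writestr(name, "")
    buffer.seek(0)
    return buffer
```

An archive with backup.xml plus writer files is reported as "vss_probably_used": True, while one holding backup.xml alone is reported as False. This is only a heuristic based on file names; it does not look inside backup.xml.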
Environment requirements
While the usefulness of the quiescing option is more or less clear, problems often arise in practice, usually because the environment was not configured correctly in the first place. The official description of some of the requirements is here, and I will try to lay them out more clearly so that you know where to look when you run into problems:
First, make sure that your combination of vSphere and guest OS supports application-consistent snapshots, according to this table (taken from here).

The data applies to vSphere 5.0 and higher. As you can see, the most popular operating systems at the moment (Windows 2008 and above) are marked with asterisks. That is where the real catch is hidden, and we will now dig into it.
Second, for quiescing to really work, you need to make sure that the VSS components of VMware Tools are actually installed (and, of course, that VMware Tools is up to date).

On older versions of vSphere (3.5 and earlier), quiescing relied on the Legato Sync Driver, which guaranteed consistency at the file system level but not at the application level (which is what the VSS components are for). Today this driver is hardly used and has been almost universally replaced by the VMware Snapshot Provider. You can verify the installation inside the guest operating system by checking for the VMware Snapshot Provider service and the corresponding COM+ component.

What can go wrong at this stage?
If the VMware Snapshot Provider service is disabled or not installed at all, then VMware, when taking a snapshot with the options quiescing = ON, snapshot memory = OFF, reports success, but in fact the snapshot is made without VSS inside the guest, that is, using the Legato Sync Driver.

Note that on Windows 2008 and above the behavior is different: there are no such events in the log, and the Volume Shadow Copy service simply goes into the running state and then into the stopped state.
Third, one of the typical problems in setting up quiescing is the disk.EnableUUID = true parameter in the virtual machine's .vmx configuration.
This setting only matters for guests running Windows 2008 and higher (for Windows 2003 it is ignored). Note also that the parameter is added automatically only when a new virtual machine is created on vSphere 4.1 or later. In other words, if the virtual machine was migrated from an older version of vSphere, the setting may be missing.

If the parameter is missing or set to "false", the behavior during snapshot creation is similar to the previous case: the snapshot is created successfully, but VSS is not actually used, and as a result we can end up with an inconsistent backup. A second symptom of the disabled parameter is an empty backup.xml (with no description of VSS writers) in vss_manifests.zip.
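A quick way to audit this setting is to read the .vmx file directly. Here is a minimal Python sketch, assuming the file's text is already in hand and uses the usual key = "value" line format. The function names are hypothetical.

```python
def read_vmx_options(vmx_text):
    """Parse `key = "value"` lines of a .vmx file into a dict (keys lowercased)."""
    options = {}
    for raw in vmx_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip().lower()] = value.strip().strip('"')
    return options


def enable_uuid_is_set(vmx_text):
    """True only if disk.EnableUUID is present and set to true."""
    value = read_vmx_options(vmx_text).get("disk.enableuuid", "")
    return value.lower() == "true"
```

Both a missing parameter and an explicit "false" make enable_uuid_is_set return False, which matches the two failure cases described above. Virtual machines migrated from older vSphere versions are the first candidates for this check.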
Fourth, check for dynamic disks inside the guest. If there is at least one dynamic disk in the guest system, whether it is the system disk or not, VSS will not be used. The snapshot will be created successfully, but vss_manifests.zip will be empty, and so will the event logs inside the guest OS. This rule applies to guests running Windows 2008 and above.
The same applies to IDE disks: there should be none in the virtual machine configuration (IDE CD-ROM devices are fine and do not affect snapshots). Also keep in mind that a SCSI controller must have at least as many free slots as it has disks. For example, if 8 SCSI disks are already attached to SCSI1, there will not be enough slots.
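The slot arithmetic behind that example can be written out explicitly. This small Python sketch assumes 15 usable device slots per virtual SCSI controller (16 IDs, one of which is taken by the controller itself). That figure is my assumption rather than something stated in this article, and the function name is hypothetical.

```python
USABLE_SLOTS_PER_CONTROLLER = 15  # assumption: 16 IDs, one taken by the controller


def has_enough_free_slots(disks_on_controller, usable_slots=USABLE_SLOTS_PER_CONTROLLER):
    """True if the controller has at least as many free slots as attached disks."""
    if not 0 <= disks_on_controller <= usable_slots:
        raise ValueError("disk count does not fit on one controller")
    free_slots = usable_slots - disks_on_controller
    return free_slots >= disks_on_controller
```

Under this assumption, 7 disks leave 8 free slots and pass the check, while 8 disks leave only 7 and fail it, which matches the example above.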
Fifth: broken VSS inside the guest. This is the main cause of endless frustration and calls to VMware technical support. People who see failed snapshots often blame VMware, although the real culprit is a quite different industry giant, Microsoft. I got the following picture when trying to create a quiesced snapshot of a machine after a failed installation of a new SQL database (the virtual .iso drive was unmounted during installation, which the installer did not like :-\).

This problem was solved by simply rebooting the virtual machine. Although that helps very often, there are neglected cases in which VSS inside the guest is almost completely broken. In those cases, the easiest way to find out whether Microsoft is really to blame is to run Windows Backup and back up the system state. If Windows Backup (or NTBackup) works, the problem is on the VMware side; if it does not, the fault is Microsoft's.
VMware has several official articles on this topic: for example, here and here. There is an interesting detail, though. To make life simpler (or perhaps for some other reason), the second article explicitly recommends setting disk.EnableUUID to "false", which means not using VSS when creating quiesced snapshots (a "quiesced" snapshot that is not really quiesced). In general this is not a solution but only a temporary workaround, because the consequences can show up at restore time, exactly when application consistency matters most (remember the USN rollback).
Summing up
In my experience, the most frequent causes of snapshot problems (that is, of inconsistent snapshots) are points 2, 3 and 5, while IDE or dynamic disks are much less common.
Of course, truly mystical cases cannot be ruled out. In one, a snapshot could not be created (VMware reported a vague error) because the iSCSI LUN (datastore) holding the problematic virtual machine was physically connected through two network cards in teaming mode, one of which ran at 100 Mbit/s and the other at 1 Gbit/s.
The topic of quiesced snapshots can be explored almost endlessly. Take the fact that Windows 2008, when creating a quiesced snapshot, creates not one but two deltas on the datastore and in effect writes to an already created snapshot (this, by the way, is one of the root causes of the asterisks next to these operating systems in the table above), or the ability to disable certain VSS writers through the vmbackup.conf configuration file in the guest system. The world is wonderful and amazing, but there are enough pitfalls for everyone. If there is interest, I will gladly write more on this topic. As usual, comments and clarifications are welcome; please send reports of errors and typos by private message. I will try to answer questions as soon as I can.
Do not forget to subscribe to our Hub: we have planned a large number of articles on backup and data recovery, and perhaps they will help you solve certain problems (or, better, avoid them). Thank you for your attention. :)