Virtual machine backup and InterSystems Caché freeze/thaw scripts

In this article, I look at Caché backup strategies that use external backup systems and give examples of integration with solutions based on virtual machine snapshots. Most of the solutions I come across today are deployed on Linux and VMware, so the examples use VMware snapshots.

A list of my articles in the series 'InterSystems Data Platforms and Performance' is here.

For a better understanding of this article, you should also read the Caché online backup and restore guide in the Caché online documentation.

Caché backup: batteries included?

The built-in hot backup (Caché Online Backup) comes out of the box with Caché and is designed to back up Caché databases without stopping the system. However, there are more efficient backup solutions that you should be aware of when you plan to scale up a large system. External backup combined with snapshot technologies is the approach I recommend for backing up systems, including those with Caché databases.

What should be considered for external backup?

The InterSystems online documentation on external backup has all the details. The key point is this:

“To ensure the integrity of the snapshot, Caché provides methods to freeze writes to the databases while the snapshot is created. Only physical writes to the database files are frozen during snapshot creation, so user processes can continue to perform updates in memory without interruption.”

It is also important to note that part of the snapshot process in virtualized systems causes a brief pause of the virtual machine, commonly called stun time. A stun usually lasts less than a second, so users do not notice it and it does not affect system operation; in some cases, however, it can last longer. If the stun lasts longer than the quality of service (QoS) timeout for Caché database mirroring, the backup mirror member will decide that the primary has failed and will initiate failover. Later in this article I show how to measure stun times in case you need to change the mirroring QoS timeout.

Backup options

Minimal backup solution: Caché Online Backup

If you cannot use other tools, this tried and tested method that ships with InterSystems platforms is still available. Note that Caché Online Backup backs up only Caché database files: it captures all blocks in the databases that are in use and writes them sequentially to a single file. Caché Online Backup supports cumulative and incremental backups.

In the context of VMware, Caché Online Backup runs inside the guest. As with other in-guest solutions, Caché Online Backup works in the same way whether the application is virtualized or runs directly on a physical server. The file produced by Caché Online Backup must therefore be copied to backup media along with all the other files used by your application. When backing up the system, remember the application directory, the primary and alternate journal directories, and any other directories containing files used by the application.

Caché Online Backup should be considered either an entry-level approach for smaller sites that want a low-cost way to take hot backups of databases, or a tool for ad hoc backups. For example, such backups are very useful when setting up mirroring for the first time. However, because databases grow in size and because Caché databases are usually only part of a customer's data, external backup combined with snapshot technology and third-party utilities is recommended as best practice. Its advantages include the ability to back up files other than database files, faster restore times, an enterprise-wide view of data, and better cataloguing and management tools.

VMware snapshot

VMware vSphere Data Protection (VDP) and other third-party backup solutions for virtual machines, such as Veeam or Commvault, use the snapshot functions of VMware virtual machines to create backup copies. Below is a brief explanation of how VMware snapshots work. For more information, see the documentation.

It is important to remember that snapshots are taken of the entire virtual machine and that the guest operating system and any applications or database engines are not aware that a snapshot is being taken. Remember also the following:

VMware snapshots themselves are not backups!

Snapshots allow you to use backup software and make backups, but they are not backup copies by themselves.

VDP and other third-party solutions use the VMware snapshot process in conjunction with the backup application to manage the creation and, very importantly, the deletion of snapshots. Briefly, the sequence of events for an external backup using VMware snapshots is as follows:

The third-party backup software asks the ESXi host to take a VMware snapshot.
The virtual machine's .vmdk files are put into read-only mode and a child .vmdk delta file is created for each of the virtual machine's .vmdk files.
All writes to disk go to the delta files. Reads are served from the delta file first.
The backup software copies the read-only parent .vmdk files.
When the backup is complete, the snapshot is consolidated (the virtual machine disks become writable again and the updated blocks in the delta files are written to the parent files).
The VMware snapshot is deleted.

Backup solutions also use other features, such as Changed Block Tracking (CBT), to make incremental or cumulative backups as fast and efficient as possible (which is especially important for saving space). They usually also add other useful and important functions, such as data compression, scheduling, restoring virtual machines with a changed IP address for integrity checks, restoring both entire virtual machines and individual files, and managing the backup catalog.

VMware snapshots that are not properly managed or are left to run for a long time can consume a great deal of storage (as more and more data changes, the delta files keep growing) and can also slow down the virtual machine.

You should think very carefully before manually taking a snapshot of a production database server. Why are you doing it? What happens if you revert to the point in time at which the snapshot was taken? What happens to all the transactions between snapshot creation and rollback?

If your backup software creates and deletes snapshots, this is absolutely normal. The snapshot should only be created for a short time, and the key part of your backup strategy will be choosing the copy time when the system is at minimum load, which will further reduce the impact on users and overall performance.

Caché features for system snapshots

Before a snapshot is taken, the database must be quiesced: all pending writes must be committed and the database files must be in a consistent state. Caché provides methods and an API to commit and then freeze writes to the databases for the short period while the snapshot is created. Only physical writes to the database files are frozen during snapshot creation, so user processes can continue to perform updates in memory without interruption. Once the snapshot has been taken, database writes are thawed and the backup continues copying data to the backup media. The time between freeze and thaw should be short (no more than a few seconds).

In addition to pausing writes, the Caché freeze also switches journal files and writes a backup marker to the journal. The journal file continues to be written normally while physical database writes are frozen. If the system crashes while physical database writes are frozen, data is recovered from the journal as normal during startup.

The following diagram shows freezing and thawing with taking snapshots to create a backup with the correct database file.

Note the short time between freezing and thawing — this is only the time to take a snapshot, not the time it takes to copy the entire parent object to the backup.

Freezing and thawing Caché

vSphere allows scripts to be called automatically before and after a snapshot is created: these are the points at which Caché is frozen and thawed. Note: for this functionality to work, the ESXi host requests guest quiescing through VMware Tools in the guest operating system.

VMware Tools must be installed in the guest operating system. Scripts must adhere to strict name and location requirements. You must also assign the correct file permissions. Script names for VMware on Linux:

 # /usr/sbin/pre-freeze-script
 # /usr/sbin/post-thaw-script

The following are examples of freeze and thaw scripts that our team uses to back up with Veeam in our internal test labs. These scripts should also be suitable for working with other products. The examples were tested and used on vSphere 6 and Red Hat 7.

Although these scripts can be used as examples and are an illustration of the method being described, you must ensure that they are correct for your own environment!

Sample freeze script:

 #!/bin/sh
 #
 # Script called by VMWare immediately prior to snapshot for backup.
 # Tested on Red Hat 7.2
 #
 LOGDIR=/var/log
 SNAPLOG=$LOGDIR/snapshot.log
 echo >> $SNAPLOG
 echo "`date`: Pre freeze script started" >> $SNAPLOG
 exit_code=0
 # Only instances reported as "up" by ccontrol are processed
 for INST in `ccontrol qall 2>/dev/null | tail -n +3 | grep '^up' | cut -c5- | awk '{print $1}'`; do
     echo "`date`: Attempting to freeze $INST" >> $SNAPLOG
     # Detailed instances specific log
     LOGFILE=$LOGDIR/$INST-pre_post.log
     # Freeze
     csession $INST -U '%SYS' "##Class(Backup.General).ExternalFreeze(\"$LOGFILE\",,,,,,1800)" >> $SNAPLOG 2>&1
     status=$?
     case $status in
         5) echo "`date`: $INST IS FROZEN" >> $SNAPLOG
            ;;
         3) echo "`date`: $INST FREEZE FAILED" >> $SNAPLOG
            logger -p user.err "freeze of $INST failed"
            exit_code=1
            ;;
         *) echo "`date`: ERROR: Unknown status code: $status" >> $SNAPLOG
            logger -p user.err "ERROR when freezing $INST"
            exit_code=1
            ;;
     esac
     echo "`date`: Completed freeze of $INST" >> $SNAPLOG
 done
 echo "`date`: Pre freeze script finished" >> $SNAPLOG
 exit $exit_code

Sample thaw script:

 #!/bin/sh
 #
 # Script called by VMWare immediately after backup snapshot has been created
 # Tested on Red Hat 7.2
 #
 LOGDIR=/var/log
 SNAPLOG=$LOGDIR/snapshot.log
 echo >> $SNAPLOG
 echo "`date`: Post thaw script started" >> $SNAPLOG
 exit_code=0
 if [ -d "$LOGDIR" ]; then
     # Only instances reported as "up" by ccontrol are processed
     for INST in `ccontrol qall 2>/dev/null | tail -n +3 | grep '^up' | cut -c5- | awk '{print $1}'`; do
         echo "`date`: Attempting to thaw $INST" >> $SNAPLOG
         # Detailed instances specific log
         LOGFILE=$LOGDIR/$INST-pre_post.log
         # Thaw
         csession $INST -U%SYS "##Class(Backup.General).ExternalThaw(\"$LOGFILE\")" >> $SNAPLOG 2>&1
         status=$?
         case $status in
             5) echo "`date`: $INST IS THAWED" >> $SNAPLOG
                # Record the backup in the instance's backup history
                csession $INST -U%SYS "##Class(Backup.General).ExternalSetHistory(\"$LOGFILE\")" >> $SNAPLOG 2>&1
                ;;
             3) echo "`date`: $INST THAW FAILED" >> $SNAPLOG
                logger -p user.err "thaw of $INST failed"
                exit_code=1
                ;;
             *) echo "`date`: ERROR: Unknown status code: $status" >> $SNAPLOG
                logger -p user.err "ERROR when thawing $INST"
                exit_code=1
                ;;
         esac
         echo "`date`: Completed thaw of $INST" >> $SNAPLOG
     done
 fi
 echo "`date`: Post thaw script finished" >> $SNAPLOG
 exit $exit_code

Do not forget to set file permissions:

 # sudo chown root:root /usr/sbin/pre-freeze-script /usr/sbin/post-thaw-script
 # sudo chmod 0700 /usr/sbin/pre-freeze-script /usr/sbin/post-thaw-script
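A quick check after installing the scripts can catch a wrong mode before the first backup runs. The following is only a sketch, assuming GNU stat (as found on Red Hat); the helper names hook_mode and hook_ok are illustrative and are not part of VMware Tools or Caché.

```shell
#!/bin/sh
# Print the octal mode and owner of a file, e.g. "700 root".
# Assumes GNU stat (coreutils); hook_mode is an illustrative helper name.
hook_mode() {
    stat -c '%a %U' "$1"
}

# Return 0 only if the file is executable by the caller and has mode 700.
hook_ok() {
    [ -x "$1" ] && [ "$(stat -c '%a' "$1")" = "700" ]
}

# Check both hook scripts; run this as root, since the scripts are root-only.
for f in /usr/sbin/pre-freeze-script /usr/sbin/post-thaw-script; do
    if hook_ok "$f"; then
        echo "$f: OK ($(hook_mode "$f"))"
    else
        echo "$f: missing or wrong permissions"
    fi
done
```

The check deliberately tests the mode as well as executability, because a script that is executable but writable by other users would still pass a plain -x test.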

Freeze and thaw testing

To test the operation of the above scripts, you can manually start the snapshot execution on the virtual machine and check what the script displays. The following screenshot shows the Take VM Snapshot dialog and its options.

Uncheck the box "Snapshot the virtual machine's memory".
Tick the checkbox "Quiesce guest file system (Needs VMware Tools installed)". This pauses running processes in the guest operating system and flushes buffers so that the file system contents are in a known consistent state when the snapshot is taken.

Important! After the test, do not forget to delete the snapshot!

If the quiesce checkbox is ticked and the virtual machine is powered on while the snapshot is being taken, VMware Tools is used to quiesce the virtual machine's file system. Quiescing a file system is the process of bringing the data on disk into a state suitable for backup. This may include operations such as flushing dirty buffers from the operating system's in-memory cache to disk.

The following output shows the contents of the $SNAPLOG log file listed in the above examples of freeze / thaw scripts after starting the backup procedure, which also takes a snapshot.

 Wed Jan 4 16:30:35 EST 2017: Pre freeze script started
 Wed Jan 4 16:30:35 EST 2017: Attempting to freeze H20152
 Wed Jan 4 16:30:36 EST 2017: H20152 IS FROZEN
 Wed Jan 4 16:30:36 EST 2017: Completed freeze of H20152
 Wed Jan 4 16:30:36 EST 2017: Pre freeze script finished
 Wed Jan 4 16:30:41 EST 2017: Post thaw script started
 Wed Jan 4 16:30:41 EST 2017: Attempting to thaw H20152
 Wed Jan 4 16:30:42 EST 2017: H20152 IS THAWED
 Wed Jan 4 16:30:42 EST 2017: Completed thaw of H20152
 Wed Jan 4 16:30:42 EST 2017: Post thaw script finished

This example shows that the time between freeze and thaw is 6 seconds (16:30:36 to 16:30:42). User operations are not interrupted during this period. You will need to gather metrics from your own systems, but for context, this example was run during application benchmarking on a system with no I/O bottlenecks, averaging more than 2 million global references per second (Glorefs/sec), 170,000 global updates per second (Gloupds/sec), 1,100 physical disk reads per second and 3,000 writes per write daemon cycle.
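The freeze window can also be computed from the snapshot log rather than read off by eye. This is a minimal sketch, assuming the log line format produced by the sample scripts above (time of day in the fourth field, instance name in the seventh) and that freeze and thaw fall on the same calendar day; freeze_window is an illustrative name.

```shell
#!/bin/sh
# Print "<instance> <seconds>" for each freeze/thaw pair in a snapshot log.
# Assumes `date`-style prefixes as written by the sample scripts and that
# both events happen on the same day (only the time of day is compared).
freeze_window() {
    awk '
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /IS FROZEN/ { frozen[$7] = secs($4) }
        /IS THAWED/ { if ($7 in frozen) print $7, secs($4) - frozen[$7] }
    ' "$@"
}

# Demo using the two relevant lines from the log above.
freeze_window <<'EOF'
Wed Jan 4 16:30:36 EST 2017: H20152 IS FROZEN
Wed Jan 4 16:30:42 EST 2017: H20152 IS THAWED
EOF
# prints: H20152 6
```

On a real system the function would be pointed at the log file instead, for example freeze_window /var/log/snapshot.log.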

Remember that memory is not part of the snapshot, so on restore the virtual machine reboots and performs recovery. The database files will be consistent. You are not looking to “resume” from a backup; you simply want files that are consistent as of a known point in time. You can then roll forward journals and carry out any other recovery steps needed to restore application and transactional consistency once the files are restored.

For additional data protection, a journal switch can also be performed on its own, followed by a backup or replication of the journals, for example hourly.
The following is the contents of the $LOGFILE from the freeze / thaw example above, which shows the journal details for the snapshot.

 01/04/2017 16:30:35: Backup.General.ExternalFreeze: Suspending system
 Journal file switched to: /trak/jnl/jrnpri/h20152/H20152_20170104.011
 01/04/2017 16:30:35: Backup.General.ExternalFreeze: Start a journal restore for this backup with journal file: /trak/jnl/jrnpri/h20152/H20152_20170104.011
 Journal marker set at offset 197192 of /trak/jnl/jrnpri/h20152/H20152_20170104.011
 01/04/2017 16:30:36: Backup.General.ExternalFreeze: System suspended
 01/04/2017 16:30:41: Backup.General.ExternalThaw: Resuming system
 01/04/2017 16:30:42: Backup.General.ExternalThaw: System resumed
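When planning a restore, the journal file named in this log is the starting point for a journal restore. As a sketch only, the path can be pulled out of the log with sed; this assumes the path appears on the same line as the message, as in the output above, and restore_start_journal is an illustrative name.

```shell
#!/bin/sh
# Print the journal file that ExternalFreeze recorded as the starting point
# for a journal restore. If the log holds several freezes, the last one wins.
# Assumes the path follows the message on the same line.
restore_start_journal() {
    sed -n 's/.*Start a journal restore for this backup with journal file: *\([^ ]*\).*/\1/p' "$@" | tail -n 1
}

# Demo using the line from the log above.
restore_start_journal <<'EOF'
01/04/2017 16:30:35: Backup.General.ExternalFreeze: Start a journal restore for this backup with journal file: /trak/jnl/jrnpri/h20152/H20152_20170104.011
EOF
# prints: /trak/jnl/jrnpri/h20152/H20152_20170104.011
```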

Virtual machine stun

When a virtual machine snapshot is created, and again when the backup has finished and the snapshot is deleted, the virtual machine has to be frozen for a short period. This short freeze is usually referred to as stun. A good article on virtual machine stun times is here. Below I summarize the details that matter for Caché databases.

Excerpt from the article: “To create a snapshot of a virtual machine, the virtual machine is ‘stunned’ in order to (i) serialize device state to disk and (ii) close the current running disk and create a snapshot point ... When the delta files are consolidated, the virtual machine is ‘stunned’ in order to close the disks and put them into a state suitable for consolidation.”

Stun time is typically around 100 milliseconds; however, with very high disk write activity, a stun during the consolidation of the delta files can last up to several seconds.

If the virtual machine is a primary or backup member of a Caché mirror and the stun time exceeds the mirror QoS timeout, the backup member may mistakenly conclude that the primary has failed and initiate failover.

For more information about the mirror QoS setting, refer to the documentation. Strategies for keeping stun time to a minimum include running backups when database activity is at its lowest and having well-tuned storage.

As noted above, several options can be specified when a snapshot is created. One of them includes the memory state in the snapshot. Remember that the memory state is not needed for Caché database backups. If the memory flag is set, the internal state of the virtual machine is dumped as part of the snapshot. A memory snapshot takes much longer to create. A memory snapshot is used to return a virtual machine to its running state as it was when the snapshot was taken. This is NOT required to back up database files.

When a memory snapshot is taken, the entire state of the virtual machine is stunned, and the stun time is variable.

As noted earlier, for backups the quiesce check box should be ticked to ensure a consistent and usable backup.

Finding stun times in the VMware logs

Starting with ESXi 5.0, stun times are recorded in each virtual machine's log file (vmware.log) with messages similar to the following:

 2017-01-04T22:15:58.846Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 38123 us

Stun times are given in microseconds, so in the example above 38123 us is 38123 / 1,000,000 seconds, or 0.038 seconds.

To make sure that stun times are within acceptable limits, or to troubleshoot if you suspect that long stun times are causing problems, you can download and review the vmware.log files from the folder of the virtual machine in question. Once downloaded, you can extract and sort the log entries using standard Linux commands, as shown in the next section.
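The same conversion can be applied to every Checkpoint_Unstun line at once. This is a small sketch using awk; stun_seconds is an illustrative name, and the only assumption about the log format is that the number of microseconds follows the word "for", as in the message above.

```shell
#!/bin/sh
# Print "<seconds> <timestamp>" for each Checkpoint_Unstun line.
# The microsecond value is taken from the field that follows "for".
stun_seconds() {
    awk '/Checkpoint_Unstun/ {
        ts = $1; sub(/[|]$/, "", ts)          # drop the trailing "|" from the timestamp
        for (i = 1; i < NF; i++)
            if ($i == "for") printf "%.3f %s\n", $(i + 1) / 1000000, ts
    }' "$@"
}

# Demo using the log line shown above.
stun_seconds <<'EOF'
2017-01-04T22:15:58.846Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 38123 us
EOF
# prints: 0.038 2017-01-04T22:15:58.846Z
```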

Example of downloading vmware.log files

There are several ways to download the logs, including creating a VMware support bundle through the vSphere management console or from the ESXi host command line. Refer to the VMware documentation for all the details; what follows is a simple way to create and collect a much smaller support bundle that includes the vmware.log file, from which you can find the stun times.

You will need the long name of the directory where the virtual machine files are located. Log in via ssh to the ESXi host where the database virtual machine is running and run the command vim-cmd vmsvc/getallvms to list the vmx files and the unique long names associated with them.

For example, the long name for the database virtual machine used in this article appears as follows:

 26 vsan-tc2016-db1 [vsanDatastore] e2fe4e58-dbd1-5e79-e3e2-246e9613a6f0/vsan-tc2016-db1.vmx rhel7_64Guest vmx-11
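If the host runs many virtual machines, the long directory name can be picked out of the listing by VM name. The following is a sketch only: vm_directory is an illustrative helper, and it assumes that neither the VM name nor the datastore name contains spaces, since it relies on whitespace-separated columns.

```shell
#!/bin/sh
# Print the datastore directory (long name) of a VM, given its name and
# the output of `vim-cmd vmsvc/getallvms` on stdin or in a file.
# Assumes no spaces in the VM name or the datastore name.
vm_directory() {
    name=$1; shift
    awk -v name="$name" '$2 == name { split($4, p, "/"); print p[1] }' "$@"
}

# Demo using the line shown above.
vm_directory vsan-tc2016-db1 <<'EOF'
26 vsan-tc2016-db1 [vsanDatastore] e2fe4e58-dbd1-5e79-e3e2-246e9613a6f0/vsan-tc2016-db1.vmx rhel7_64Guest vmx-11
EOF
# prints: e2fe4e58-dbd1-5e79-e3e2-246e9613a6f0
```

On the host itself the listing would be piped in directly, for example vim-cmd vmsvc/getallvms | vm_directory vsan-tc2016-db1.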

Next, run the command to collect log files:

 vm-support -a VirtualMachines:logs

The command will display the location of the created support package, for example:

 To see the files collected, check '/vmfs/volumes/datastore1 (3)/esx-esxvsan4.iscinternal.com-2016-12-30--07.19-9235879.tgz'

You can now use sftp to copy the bundle off the host for further processing and analysis.
In this example, after unpacking the support bundle, navigate to the path that corresponds to the long name of the database virtual machine. For example, in this case:

 <bundle name>/vmfs/volumes/<host long name>/e2fe4e58-dbd1-5e79-e3e2-246e9613a6f0.

There you will see several numbered log files. The most recent log file has no number: it is vmware.log. The log may be no more than 100 KB, but it contains a lot of information. Since we are only interested in the stun and unstun times, they are easy to find with the grep utility, for example:

 $ grep Unstun vmware.log
 2017-01-04T21:30:19.662Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1091706 us
 ---
 2017-01-04T22:15:58.846Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 38123 us
 2017-01-04T22:15:59.573Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 298346 us
 2017-01-04T22:16:03.672Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 301099 us
 2017-01-04T22:16:06.471Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 341616 us
 2017-01-04T22:16:24.813Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 264392 us
 2017-01-04T22:16:30.921Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 221633 us

In the example, we see two groups of stun times. The first is from snapshot creation; the second, 45 minutes later, has one entry for each disk as the snapshot is consolidated and deleted (that is, after the backup software has finished copying the read-only parent .vmdk files). Most of the stun times are well under a second, although the initial stun is just over one second.

Short stun times are not noticeable to an end user. However, system processes such as Caché mirroring continuously monitor whether an instance is “alive”. If the stun time exceeds the mirror QoS timeout, the node may be considered unreachable and “dead”, and failover will be triggered.
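To see at a glance whether any stun came close to causing a mirror failover, the log lines can be filtered against a threshold. This is a sketch under stated assumptions: stuns_over is an illustrative name, and the threshold passed in is whatever QoS timeout your own mirror is configured with; the 1000 ms and 8000 ms values in the demo are placeholders, not recommendations.

```shell
#!/bin/sh
# Print stun events longer than a threshold given in milliseconds.
# The threshold should be your mirror's configured QoS timeout.
stuns_over() {
    limit_ms=$1; shift
    awk -v limit="$limit_ms" '/Checkpoint_Unstun/ {
        for (i = 1; i < NF; i++)
            if ($i == "for" && $(i + 1) / 1000 > limit)
                printf "%d ms: %s\n", $(i + 1) / 1000, $0
    }' "$@"
}

# Demo with two of the lines from the grep output above.
sample='2017-01-04T21:30:19.662Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1091706 us
2017-01-04T22:15:58.846Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 38123 us'

echo "$sample" | stuns_over 1000   # reports only the 1091 ms stun
echo "$sample" | stuns_over 8000   # reports nothing
```

An empty result means that no recorded stun exceeded the threshold; any line printed is worth comparing against the mirror's failover history.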

Tip: to review all the logs, or when troubleshooting, it is convenient to grep all the stun times, format them with awk and sort them, as in the following example:

 grep Unstun vmware* | awk '{ printf ("%'"'"'d", $8)} {print " ---" $0}' | sort -nr

Summary

You should monitor your system regularly during normal operation so that you know and understand your stun times and how they may affect high availability features such as mirroring. As noted earlier, strategies for keeping stun time to a minimum include running backups when database and storage activity is low and having well-tuned storage. For continuous monitoring, the logs can be processed with VMware Log Insight or other tools.

I will return to backup operations for InterSystems data platforms in future articles. And now if you have comments or suggestions based on the processes occurring in your systems, share them in the comments.

Translator’s note: since the author and I work in the same office, I can pass your questions on to him and post his answers here. There is also a discussion in English under the original article on the InterSystems Developer Community.

Source: https://habr.com/ru/post/334144/
