Best Practices for Backup Policy

Today I want to touch on some important principles of backup and disaster recovery. In particular, I will cover the following topics:

How update procedures for the production system relate to its backup process
Testing recovery from backups
How the backup process interacts with the network infrastructure of the production network
Documenting disaster recovery procedures

Backup

Back Up BEFORE Upgrading Your System

Before installing application or operating system updates, moving to new software versions, upgrading hardware, or making other significant changes to the system, it is advisable to take a backup. If the update fails and the system ends up in an incorrect state, the administrator will have to roll back the changes and return to a stable state with minimal losses for the business (from the point of view of the RPO).

Suppose the following scenario:

The scheduled backup runs during the night from Tuesday to Wednesday.
Throughout Wednesday, the system is used “as usual”.
On Wednesday evening, the system is updated (a new version of the system software is installed). Because the release notes were not read carefully before installation, the system ends up in an unstable state.
The system is reinstalled and restored to its Tuesday state, so everything created after the backup is lost.
On Thursday, users are forced to re-create all the data they created on Wednesday.

This example may seem somewhat exaggerated, but Murphy's laws apply fully to information technology. Therefore, you should not neglect taking an “extra” incremental backup before updating the system. If a company cannot afford to “waste time” on a “preventive” backup, it should consider a failover cluster, automatic hardware snapshots on the storage systems, or other ways to keep the system constantly available without compromising the reliability of its data.
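To make the scenario above more tangible, here is a minimal Python sketch that measures how much work is lost when the system is rolled back to the last backup, and compares that with the RPO. The function names, the timestamps, and the 4-hour RPO are illustrative assumptions, not values from any real policy.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure):
    """Time span of work that is lost when restoring from the last backup."""
    return failure - last_backup


def violates_rpo(last_backup, failure, rpo):
    """True when the lost time span is longer than the agreed RPO."""
    return data_loss_window(last_backup, failure) > rpo


# Hypothetical timeline: nightly backup finished Wednesday at 02:00,
# the failed update happened on Wednesday at 20:00.
nightly_backup = datetime(2013, 4, 17, 2, 0)
failed_update = datetime(2013, 4, 17, 20, 0)
rpo = timedelta(hours=4)  # assumed RPO, for illustration only

print(violates_rpo(nightly_backup, failed_update, rpo))

# With an extra incremental backup taken at 19:00, just before the update:
pre_update_backup = datetime(2013, 4, 17, 19, 0)
print(violates_rpo(pre_update_backup, failed_update, rpo))
```

With these assumed numbers the nightly backup alone leaves an 18-hour gap, while the pre-update backup shrinks it to one hour.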

Perform a FULL backup immediately after a significant system update.

Example:

A full backup is created on Fridays, and incremental backups are made on the other days.
On Wednesday, the database server was upgraded to a new version, which also updated some shared system files belonging to the operating system.
During the night from Wednesday to Thursday, as usual, an incremental backup was taken.
On Thursday morning, a critical failure occurred that required a system recovery.
After examining the nature of the failure, the administrator realizes that only the system files are damaged, and that restoring a set of system files from the latest full backup would be enough to get the system working again. However, if only this is done, the updated database server will stop working, because it requires the newer versions of the system files (for example, the latest version of the .NET Framework) that were installed with it. The recovery process therefore has to continue by going through the incremental backups that contain the relevant changes to the system state. Ultimately, this makes the recovery more complex and increases the recovery time, possibly to the point of exceeding the RTO and violating the SLA.

The general recommendation can be formulated as follows: if the cumulative volume of incremental backups exceeds the volume of the full backup, it is rational to make an unscheduled full backup.
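This rule of thumb is easy to automate. A minimal sketch in Python, assuming the backup sizes are already known (how they are obtained depends on the backup product; the function name is hypothetical):

```python
def needs_unscheduled_full(full_size_bytes, incremental_sizes_bytes):
    """Return True when the incrementals taken since the last full backup
    are together larger than that full backup."""
    return sum(incremental_sizes_bytes) > full_size_bytes


GIB = 1024 ** 3

# Hypothetical sizes: a 200 GiB full backup followed by four incrementals,
# the third of which is large because of a major upgrade.
full = 200 * GIB
incrementals = [15 * GIB, 20 * GIB, 150 * GIB, 30 * GIB]

if needs_unscheduled_full(full, incrementals):
    print("Schedule an out-of-band full backup")
```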

Perform a full backup IMMEDIATELY after system recovery

In fact, this is a variation of the previous point. Restoring a system after a crash often takes considerable time (because of the need to apply incremental backups, install patches that were not included in the backups, and so on). In some cases, users start working in the system as soon as its basic functionality is restored, and thereby begin to change its state. In addition, a repeated failure shortly after the first one cannot be ruled out. It is therefore reasonable to capture the new current state of the system in a full backup immediately after the restoration.

“Do not chase after two hares” or “Modernization is evil”

You should not combine recovering the system after a failure with modernizing it. This may seem obvious, but a lack of time for maintenance while the system is in continuous production use, together with a reluctance to make (even necessary) changes to “what is already working normally”, can create the temptation to use the forced downtime caused by a failure to perform upgrades “while we are at it”.

Consider a few examples of what can happen:

The system is successfully restored after a failure, but the administrator decides to install the latest updates right away, so that users will not have to be interrupted for this later. However, one of the patches installs incorrectly, and the system crashes. As a result, the system has to be restored from scratch again.
A critical failure occurred in application server version X.1. The operating system was not affected by the failure. The administrator decided to install the new version X.2 from the installation disk and to restore the application data and the additional program modules implementing specific business logic from the backup. However, after the data and modules were recovered, they turned out to be incompatible with version X.2 because of small changes in the logic of some functions and in the specifications of some program interfaces. As a result, application server version X.1 had to be restored after all.
A critical failure occurred in an operating system of version X. The installation disk for version X could not be found (this version had been installed 3 years earlier), so version X + 1 was installed instead. This caused licensing problems, and in addition one business application used for work turned out to be incompatible with the new version of the operating system. The system had to be reinstalled.

Consequently, it is advisable to establish a policy for disaster recovery procedures under which no modernization activities are considered acceptable during recovery. The sole purpose of the recovery procedure is to bring the original system back to a working state in the minimum time (within the RTO). System upgrades should be carried out separately, at a specially scheduled time, and be preceded by testing of the updates planned for installation.

General recommendations

Read the documentation BEFORE setting up your backup

Backup and recovery procedures for applications can involve various nuances that must be taken into account and that are not always obvious. These nuances can usually be learned only from the application's documentation, yet, as often happens, the documentation is not read in time (or not at all).

Examples of such “nuances”:

The backup process does not save various passwords and license information. This means that recovery will not produce a working application or operating system, and the disaster recovery plan will miss its declared RTO.
The system configuration of the application has to be saved by a special procedure, separate from the backup procedure for the data itself.
The procedure for backing up files that are open for writing is described separately in the documentation (certain settings must be enabled in the application configuration in advance).

The recovery process should be done by administrators.

In general, system recovery in a company can be performed by network administrators, HelpDesk specialists, or the users themselves. I will not consider the last option, since it is typical only of small companies or of the IT industry. HelpDesk specialists can usually solve common basic-level problems, including the recovery of individual files. But when it comes to restoring the system as a whole, it is better to entrust the task to a person with administrative rights:

Recovery may involve joining the workstation to the domain, and this requires domain administrator rights.
Restoring a system from scratch may require solving problems (for example, with RAID drivers) that are beyond the scope of HelpDesk personnel. In addition, the higher qualifications of administrators allow them to prioritize correctly the errors the system reports during recovery, postponing problems that do not directly interfere with getting the production network back as soon as possible.
Other problems may arise that require administrative authority (passwords for network resources, inaccessibility of the network as a whole, activation problems due to corporate firewall settings, etc.)

Recovery Testing should be done regularly.

Having backups does not mean that recovery will go smoothly. First, the data may have been saved incorrectly. Second, the data may become corrupted during storage, or it may turn out during recovery that the backup does not contain all the necessary data. This was already covered in more detail in the post “Testing of recovery from backup copies”.
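One of these risks, corruption during storage, can be checked cheaply between full recovery tests. Below is a minimal Python sketch that records SHA-256 checksums of the backup files right after the backup is written and later reports files that are missing or have changed. The function names are hypothetical, and the check says nothing about whether the backup is complete or actually restorable; it only detects changes to files that were already saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir):
    """Map each file path (relative to backup_dir) to its SHA-256 digest."""
    root = Path(backup_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(backup_dir, manifest):
    """Return relative paths that are missing or whose content has changed."""
    root = Path(backup_dir)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

The manifest itself should be stored separately from the backup files, otherwise it can be damaged together with them.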

Consider dependencies on network infrastructure components.

Suppose that the DNS server fails. After the recovery procedure is started, it turns out that the backup product itself relies on DNS within its own infrastructure (for example, it connects to the backup repository using the server's FQDN). The result is a “circular dependency” that makes automatic recovery after the failure impossible. A similar situation can arise with the domain controller (although here the fact that a company usually has several domain controllers helps).

Such circular dependencies should be avoided. They can be revealed by test restores of backups into a sandbox isolated from the production network, for example, using Veeam SureBackup technology.
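If the dependencies between components are written down, a circular one can also be found on paper before any failure happens. The following Python sketch runs a depth-first search over a hand-maintained dependency map; the component names and the map itself are hypothetical, and the sketch is not part of any backup product.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none.

    depends_on maps a component to the components it needs
    in order to be restored.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so the path loops back
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in depends_on:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


infrastructure = {
    "dns-server": ["backup-repository"],  # DNS is restored from the repository
    "backup-repository": ["dns-server"],  # the repository is reached by FQDN
    "file-server": ["backup-repository"],
}
print(find_cycle(infrastructure))
# ['dns-server', 'backup-repository', 'dns-server']
```

A map like this is only as good as the knowledge behind it, so it complements sandbox testing rather than replacing it.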

Documenting recovery stage procedures

Carefully document the disaster recovery procedure.

Two things need to be avoided:

Do not rely entirely on the backup product (in the spirit of “install it and forget it; if anything happens, it will restore everything by itself”). There may always be system configuration parameters (for example, the operating system activation status, license information, or saved accounts with passwords for network resources) that will not be saved in the backup, or that the application will treat as invalid after restoration as part of its copy protection (for example, the activation status). In practice, this means you always need to have on hand a document describing how to complete the recovery of applications in case the backup product cannot restore everything fully automatically.
Do not rely on the assumption that any backup product will be fully compatible with any infrastructure in the world. You need to verify that the product is compatible with the existing hardware and works normally under specific operating conditions (for example, during LUN rebinding on a SAN or while a volume is being rebuilt on a RAID-5 array). Special cases of product behavior and the ways to achieve compatibility must be documented.

Examples of recommended documentation for disaster recovery are:

Regular automated dumps of configuration settings, made, for example, with software products that provide version control of settings and/or control of their integrity. Versioning of application configurations is, in principle, a useful addition to the backup system, since in some cases it reduces the recovery time. Unlike backup products, configuration version control products allow you not only to save every change in settings, but also (1) to coordinate such changes and (2) to mark a certain stable state of the configuration with special tags.
License keys that may need to be re-entered after the application is restored.
Passwords that will be needed during a system restore. They should be recorded on a recovery sheet, in paper or electronic form, and stored in a restricted-access location in accordance with security policies. This is necessary at least for the case when the system is restored by another administrator (for example, if the first administrator is on vacation).
The recovery information listed above should, in turn, be backed up, including, if necessary, to another office. This information should be protected as securely as possible from viruses and spyware at every point of storage.
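The idea of versioned configuration with tagged stable states can be illustrated with a few lines of Python. This is a toy in-memory sketch with hypothetical names, not a model of any particular product; a real setup would keep the history in a proper version control system and would never store passwords in it in clear text.

```python
import copy
import hashlib
import json


class ConfigHistory:
    """Keeps every saved version of a settings dict and lets stable ones be tagged."""

    def __init__(self):
        self.versions = []  # list of (digest, settings) in commit order
        self.tags = {}      # tag name -> index into self.versions

    def commit(self, settings):
        """Store a new version if the settings changed; return its index."""
        payload = json.dumps(settings, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if self.versions and self.versions[-1][0] == digest:
            return len(self.versions) - 1  # nothing changed since last commit
        self.versions.append((digest, copy.deepcopy(settings)))
        return len(self.versions) - 1

    def tag(self, name, index=None):
        """Mark a version (the latest by default) as a known stable state."""
        self.tags[name] = len(self.versions) - 1 if index is None else index

    def restore(self, name):
        """Return a copy of the settings recorded under the given tag."""
        return copy.deepcopy(self.versions[self.tags[name]][1])


history = ConfigHistory()
history.commit({"port": 5432, "tls": True})
history.tag("stable-before-upgrade")
history.commit({"port": 5433, "tls": True})  # a later, untested change
print(history.restore("stable-before-upgrade"))
```

During recovery, the tagged version gives the administrator a known-good configuration to return to without searching through backups.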

What to test as part of testing a disaster recovery procedure?

To make sure that the instructions used during disaster recovery are as correct as possible, you need to carry out a “training recovery”. When testing the recovery process, pay attention to at least the following questions:

In case of complete destruction of the site, will the backup copies contain all the information necessary to restore it?
How will the recovery process go if the chief system administrator is not available (say, is on vacation)?
What happens if the storage medium (tape / disk) containing the backup copy that is most suitable for use in the recovery process is damaged?
What will happen in the case of a so-called “secondary failure”, that is, when it turns out after the recovery process has completed successfully that the restored system does not work?
Does the time spent on the recovery process fit into the agreed SLA? Do the people involved in the process know about the existence of the SLA and its parameters, and do they take it into account in their work?
How will the failure of the infrastructure components of the production network (mail server, instant messaging server, DNS, domain controller, etc.) affect the recovery process?
Document quality: can an untrained administrator restore a production system by following the written instructions?
How quickly does the company react if the information was destroyed by a virus (of any type)? The specific danger of this threat is that a virus can corrupt data and applications without disrupting their availability. As a result, the fault tolerance systems will not register a failure, and, moreover, the configured data replication mechanisms will automatically spread the corruption to the other replication partners (or cluster / geocluster nodes).
Does the recovery process depend on the specific hardware (which may fail at the time of recovery)?
Is it possible to carry out the recovery process completely remotely?
What happens if the channel linking the backup sites of the company is broken?

Of course, when compiling a list of threats to the operation of a production system, you should also take into account the likelihood of each threat and weigh it against the potential damage from system downtime in each case.
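One simple way to make this comparison is to rank threats by expected loss, that is, probability multiplied by damage. This weighting is my assumption, not the only possible one, and all names and figures in the Python sketch below are invented for illustration.

```python
def expected_loss(threat):
    """Expected yearly loss: probability of the event times the downtime damage."""
    return threat["probability"] * threat["damage"]


def rank_threats(threats):
    """Sort threats so that the largest expected loss comes first."""
    return sorted(threats, key=expected_loss, reverse=True)


# Hypothetical figures: yearly probability and damage in arbitrary currency units.
threats = [
    {"name": "failed system update", "probability": 0.5, "damage": 1_000},
    {"name": "complete site destruction", "probability": 0.01, "damage": 200_000},
    {"name": "damaged backup tape", "probability": 0.2, "damage": 1_000},
]

for threat in rank_threats(threats):
    print(threat["name"])
```

With these invented numbers the rare but very costly event still ends up at the top of the list, which is exactly why both factors need to be considered together.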

General conclusion

Planning and testing backup and recovery is a critical factor in minimizing recovery time and meeting SLA conditions. It is extremely important that the backup product provides functionality for automating recovery from backups.

You can read more about testing backups and the proprietary technology Veeam SureBackup here:

Description on the Veeam website: Veeam vPower technology for VMware
Webinar: Modern Data Protection for Virtualized Environment
Post “SureBackup - automatic check of the possibility of data recovery from backup”
Post “Testing data recovery from backup”

Source: https://habr.com/ru/post/176927/
