To prevent the carriage from turning into a pumpkin, or why we need test recoveries from backups

In an earlier post I promised to say more about testing backups. That is today's topic. To avoid unpleasant surprises in the already stressful moments of data loss, backups need to be tested. What follows is not about integrity checks of backup files (verifying the checksums of data blocks in the backup file) but about a full test restore, where we check that what we restored actually works.

What could be wrong with the backup file

Apart from cases where the backup file itself is damaged, there are many technical and organizational reasons why a restore from a backup may fail. I will focus on the ones I have run into myself.

Already-damaged data or files get backed up. A disk in the server started failing. Monitoring did not catch it. Some of the files were corrupted, yet they were dutifully backed up. Such a problem can go unnoticed for weeks, until you need to open one of those files. During the restore it turns out that the copies in the backup are unusable too.
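One way to catch this during a test restore is to actually open the restored files instead of only counting them. Below is a minimal sketch under stated assumptions: the restored data sits in an ordinary directory, and only two file types (JSON and ZIP) are checked as examples. The names find_broken_files and CHECKERS are made up for illustration; real data would need its own per-format checks.

```python
import json
import zipfile
import zlib
from pathlib import Path


def check_json(path: Path) -> bool:
    """Return True if the file parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (ValueError, OSError):
        # ValueError covers both JSON syntax errors and bad UTF-8.
        return False


def check_zip(path: Path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        return False


# Hypothetical mapping from file extension to a "does it open?" check.
CHECKERS = {".json": check_json, ".zip": check_zip}


def find_broken_files(restored_dir: Path) -> list[Path]:
    """List restored files that fail the check for their type."""
    broken = []
    for path in sorted(restored_dir.rglob("*")):
        checker = CHECKERS.get(path.suffix.lower())
        if checker and path.is_file() and not checker(path):
            broken.append(path)
    return broken
```

Files with extensions that have no checker are skipped silently, so an empty result only means "nothing we know how to check is broken".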

Inconsistent backup. This can happen when you pick the wrong backup tool. For example, a database runs on a virtual machine, and the administrator decides to back up the VM as a whole without application-aware backup, that is, without support for application consistency.

The point is that a running database actively uses a cache in RAM, and part of the data lives there. The DBMS writes data to disk in such a way that it is consistent at any moment, so that a sudden power-off does not turn the database into a useless set of bytes. The backup system, however, does not copy the data instantaneously and knows nothing about how the cache is synchronized with the file system, so during the backup some of the data may be captured in the wrong order. After such a VM is restored, we get a damaged database whose parts do not match each other.

Application-aware backup agents avoid this problem.
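The difference can be shown at small scale with SQLite standing in for the DBMS. This is only an illustration, not how a VM backup agent works: the sketch assumes Python 3.7+ and uses the standard sqlite3 module, whose Connection.backup method asks the engine itself to produce a consistent copy instead of copying the file underneath a running database. The function names are hypothetical.

```python
import sqlite3
from pathlib import Path


def consistent_backup(source_db: Path, backup_db: Path) -> None:
    """Copy a live SQLite database through the engine's own backup API."""
    source = sqlite3.connect(source_db)
    target = sqlite3.connect(backup_db)
    try:
        # The engine copies pages as one consistent snapshot,
        # even if other connections are writing at the same time.
        source.backup(target)
    finally:
        target.close()
        source.close()


def restored_db_is_healthy(db_path: Path) -> bool:
    """Run SQLite's built-in structural check on a restored database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a database at all.
        return False
    finally:
        conn.close()
    return rows == [("ok",)]
```

A structural check passing does not prove the data is the data you wanted, so a test restore would normally also run a few application-level queries against the restored copy.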

There is a working backup, but of the wrong thing. This is quite common, because a system's life cycle goes roughly like this: the system was built and put on backup. Then, sooner or later, the architecture changed: servers were added or removed, disks were renamed, something was restored from a backup alongside the original, and nobody remembered to update the backup policy. As a result, the backup does not contain what you actually need.
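This kind of drift can be detected by regularly comparing what exists with what the backup policy covers. A minimal sketch, assuming you can export both lists as plain names from your inventory and your backup tool; the function name and the host names in the example are hypothetical.

```python
def coverage_gaps(inventory: set[str], backed_up: set[str]) -> dict[str, list[str]]:
    """Compare the current inventory with the set of systems in the backup policy."""
    return {
        # Exists in production, but nobody is backing it up.
        "unprotected": sorted(inventory - backed_up),
        # Still in the backup policy, but no longer exists (renamed or removed).
        "stale": sorted(backed_up - inventory),
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    current_servers = {"app-01", "app-02", "db-01"}
    servers_in_policy = {"app-01", "db-old"}
    print(coverage_gaps(current_servers, servers_in_policy))
```

The same comparison works one level down, for disks, volumes, or database names, wherever a rename can silently drop something out of the policy.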

Why test

It would seem the answer is simple: to make sure that you can recover from the backup. But there are a couple of important organizational questions that are worth clarifying for yourself.

Understanding the real RTO (recovery time objective). A theoretical estimate will differ from reality, especially if the recovery process involves more than deploying data or applications from the backup. Before you restore, you need to work out what to restore and from where. After the restore, the system is not always ready for use right away; sometimes manual configuration is required. Then you need to check that the restored systems actually work. If backups are stored on tapes outside the office, you need to know how quickly the tapes can be brought to you. All of this adds minutes or hours to the recovery time.

So if we look at the whole recovery path door to door, the real recovery time will most likely be longer than the "pure" data restore time.

Who does what. A test restore checks not only the equipment, but also the people, the processes, and the regulations, if there are any ;) This is a chance to find weak points and to think about what you will do if the right person is not available.

The more people are involved in a restore, the more such drills are needed.
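The arithmetic is trivial, but writing it down per step makes the gap visible. A minimal sketch: the step names follow the stages described above, and every duration is a made-up placeholder to be replaced with timings measured during a real drill.

```python
from datetime import timedelta


def end_to_end_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Add up every stage of a recovery, not only the data copy itself."""
    return sum(steps.values(), timedelta())


# Placeholder numbers for illustration; measure your own during a test restore.
drill = {
    "find the right backup": timedelta(minutes=20),
    "deliver tapes to the office": timedelta(minutes=90),
    "restore the data": timedelta(minutes=60),
    "manual configuration": timedelta(minutes=30),
    "verify the restored system": timedelta(minutes=25),
}

print("pure restore:", drill["restore the data"])          # pure restore: 1:00:00
print("door to door:", end_to_end_recovery_time(drill))    # door to door: 3:45:00
```

With these placeholder values the data copy is barely more than a quarter of the total, which is the kind of result a paper estimate tends to miss.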

How to test

Testing frequency. After you have set up the backup system, check at least once what actually ended up in the backup, and try to restore it.

After that, the schedule of checks is set by the owner of the service (for example, the developer), based on how often the applications or data change, how important the data is, and what resources are available for testing.

Different failure and recovery scenarios. Use your imagination and think through the various reasons why you might need to restore from a backup. That way you test the equipment, processes, and people under realistic conditions instead of running an idealized restore in a vacuum. It is convenient to sketch out threat models. For example:

hardware failure: a disk or the server holding the source data has died;
software failure: a failed update, a virus;
human factor: the administrator deleted a file that was needed.

Each of these cases calls for a different scope of recovery: individual files in some, a full redeployment in others.

Be sure to try restoring remotely, from your home computer. After all, failures do not happen only during working hours.

Also think a couple of steps ahead: what will you do if the backup turns out to be worthless during recovery, or the restore fails altogether? If a test shows that the latest backup is unusable, make an unscheduled backup if you can, or warn your colleagues to handle the data as carefully as possible until the next backup cycle.

Restore from different points in time. You cannot know in advance which backup you will need, so when testing, try restoring from different recovery points. That way you confirm that not only Friday's backup is fine, but also, say, the one taken on Wednesday. The larger the sample, the less reason to worry about whether the backups work.

Document the recovery procedures. I once read that one company tests its restores like this: a person who knows nothing about the system is asked to perform the entire recovery using only the documentation, with no hints from colleagues. Afterwards they look at whether the restore succeeded and draw conclusions about how up to date the instructions are. You do not have to go to such extremes in your drills, but it is worth recording in the regulations and other documentation every action needed to restore a given system.

The point is to be able to start the recovery even if the person responsible for the system is temporarily unavailable.

You also need to make sure that all the information required to restore the system (configuration settings, license keys, passwords) is not only in the head of the absent administrator, but is also duplicated in electronic form and stored securely, away from prying eyes.

Just in case: run test restores in a separate sandbox, without risking production.
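Choosing which recovery points to test can be left to chance so that the same "convenient" backup is not tested every time. A minimal sketch, assuming recovery points are identified by date: it always includes the newest point and adds a random selection of older ones. The function name and parameters are hypothetical.

```python
import random
from datetime import date
from typing import Optional


def pick_restore_points(
    points: list[date],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> list[date]:
    """Pick the newest recovery point plus a random sample of older ones."""
    if not points:
        return []
    rng = rng or random.Random()
    ordered = sorted(set(points))
    newest, older = ordered[-1], ordered[:-1]
    # Never ask for more points than exist, and never for a negative number.
    extra_count = max(0, min(sample_size - 1, len(older)))
    return sorted(rng.sample(older, extra_count) + [newest])


if __name__ == "__main__":
    # Hypothetical daily recovery points for two weeks.
    available = [date(2024, 1, day) for day in range(1, 15)]
    for point in pick_restore_points(available, sample_size=4):
        print(point.isoformat(), point.strftime("%A"))
```

Printing the weekday next to each date makes it easy to see whether the sample happens to cover both the full and the incremental backups in your schedule.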

Source: https://habr.com/ru/post/329046/
