
The case of Pixar, or once again about the importance of testing backups

1998, Pixar Studios. Production of Toy Story 2 is in full swing, with more than 150 people involved. The animation source materials take up 10 GB (a lot for those times). Every day a full backup is written to tape. The tape holds ... 4 GB (data on tape is compressed, but of course not to that degree). Every run ends with an error, but nobody notices, because the log file is written to the same tape at the very end of the backup job, and since there is no space left on the tape, it ends up 0 bytes in size. Every week a test restore is performed, during which the first 2,000 frames of animation are checked. And, of course, the test passes successfully every time.

... And then one day an employee (by mistake or intentionally) ran "/bin/rm -r -f *" (or something similar) on the server, which deleted 90% of the 100,000 animation source files. One of the company's employees, Larry Cutler, happened to be browsing the folder with the animation sources, intending to fix something in the model of Woody's hat, when he suddenly noticed that only 40 files were left in the folder... then 4... and a second later none at all. Larry called the IT service and reported that "there has been a massive data loss" and that "recovery will require the full backup..." Which, as it turned out a little later, they did not have, despite the daily backups.


What happened next? The "full backup", for obvious reasons, turned out to be not quite full: up to 30,000 files had not made it onto the tape for some time. The task was complicated by the fact that files in the folder were constantly being created, deleted, and modified, so backups made at different times contained different sets of files, and they had to be compared manually to tell which files had been deleted on purpose and which had disappeared as a result of the failure. It took many days of painstaking manual work to account for all the missing files, whose surviving versions were scattered across incremental and full backups from the previous two months.
Let us analyze what needs to be done to minimize the risk of problems like the one described above.

First of all, we should distinguish between testing the integrity of a backup (media testing) and testing restoration from a backup (data testing). In the first case, you only verify that the copy is intact by checking the checksums of its data blocks. In the second, you test recovery against a particular simulated scenario, from a single failure to a full-blown catastrophe of your production network.
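To make the distinction concrete, here is a minimal sketch of the two kinds of checks (in Python, with a hypothetical tar-format backup file and its checksum). A passing media test only proves the file on the media is intact; only the data test actually unpacks it and looks at what is inside.

    import hashlib
    import tarfile
    import tempfile
    from pathlib import Path

    def media_test(backup_file: Path, expected_sha256: str) -> bool:
        """Media test: is the backup file itself intact (checksums match)?"""
        h = hashlib.sha256()
        with open(backup_file, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest() == expected_sha256

    def data_test(backup_file: Path) -> bool:
        """Data test: can the backup actually be unpacked and its contents read?"""
        with tempfile.TemporaryDirectory() as sandbox:
            with tarfile.open(backup_file) as tar:
                tar.extractall(sandbox)
            restored_files = [p for p in Path(sandbox).rglob("*") if p.is_file()]
            # An intact archive can still be missing half of the source files;
            # that is exactly what the media test cannot tell you.
            return len(restored_files) > 0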

Recovery testing should be performed regularly.

As the production system grows, the results of a recovery test may change. For example, a week ago everything restored correctly, and then "suddenly" the backups stopped fitting into the repository and the error message went unnoticed. Or a new network application was installed, and the location of its files was not added to the backup job. Or the system was being successfully replicated by direct block mapping to a disk with a factory capacity of 2 TB. "Direct block mapping" means that a block of data on the target disk corresponds directly to a data block on the source disk (without an intermediate redirection table). But if at some point the system is upgraded and the capacity of the source disk exceeds 2 TB while the backup disk is not upgraded (which happens often, because the production infrastructure and the backup infrastructure have different budgets and, in some cases, different administrators), this replication method will stop working.
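Failures of this kind can be caught by simple pre-flight checks before each backup run. The sketch below is illustrative only: the paths, the 2 TB limit, and the error texts are assumptions, not the behavior of any particular backup product.

    import shutil

    def check_repository_space(source_path: str, repository_path: str) -> None:
        """Fail loudly if the data to back up no longer fits into the repository."""
        used_on_source = shutil.disk_usage(source_path).used
        free_in_repository = shutil.disk_usage(repository_path).free
        if used_on_source > free_in_repository:
            raise RuntimeError(
                f"Repository too small: {used_on_source} bytes to copy, "
                f"only {free_in_repository} bytes free"
            )

    def check_block_mapping_target(source_capacity: int,
                                   target_capacity: int = 2 * 1024**4) -> None:
        """Direct block mapping breaks once the source outgrows the target disk."""
        if source_capacity > target_capacity:
            raise RuntimeError("Source disk no longer fits the replication target")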

Recovery testing should be carried out in sessions that can be suspended, or even have their connection dropped entirely and then re-established. Ideally, recovery can be driven from the administrator's laptop or from any computer on the network, including the administrator's home computer. Terminal connections make this fully achievable, and the backup product must handle this scenario correctly. A recovery can take many hours, so flexibility in managing and monitoring the process is an important parameter.
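One way to get that flexibility is to make the restore job itself resumable. The sketch below is a hypothetical file-level restore, with restore_one_file standing in for whatever the backup product really does; progress is checkpointed to disk, so a dropped terminal session can be reattached and the job continued rather than restarted.

    import json
    import shutil
    from pathlib import Path

    CHECKPOINT = Path("restore_checkpoint.json")

    def restore_one_file(relative_path: str, backup_root: Path, target_root: Path) -> None:
        """Placeholder for the actual restore of a single file."""
        source = backup_root / relative_path
        target = target_root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)

    def resumable_restore(file_list, backup_root: Path, target_root: Path) -> None:
        done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
        for relative_path in file_list:
            if relative_path in done:
                continue                    # already restored in a previous session
            restore_one_file(relative_path, backup_root, target_root)
            done.add(relative_path)
            CHECKPOINT.write_text(json.dumps(sorted(done)))   # survives a disconnect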

The recovery process may require additional steps that the backup product itself cannot perform for one reason or another: configuring DNS, running a database script, and so on. When creating backup jobs, the administrator can easily forget about these steps because of the "human factor" (after all, these additional steps are needed during the recovery phase, not during the backup phase).
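A simple safeguard is to keep these steps as an executable runbook stored next to the backup jobs, so they cannot be forgotten during a real recovery. The commands below are placeholders; the point is only that the list is written down and executed in order.

    import subprocess

    # Documented post-restore steps that the backup product does not perform itself.
    POST_RESTORE_STEPS = [
        ["echo", "update the DNS record of the restored server"],   # placeholder
        ["echo", "run the database consistency script"],            # placeholder
    ]

    def run_post_restore_steps() -> None:
        for step in POST_RESTORE_STEPS:
            print("running:", " ".join(step))
            subprocess.run(step, check=True)   # stop at the first failing step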

If the backup scheme uses offsite storage, then recovery testing should include the scenario in which you have to request a copy from the external site and physically move it to the main office (for example, physically bring back a tape). Remember that there are traffic jams on the roads when you check whether you still fit within the SLA for your RTO.
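The arithmetic is trivial, but it is worth writing out once with realistic numbers, because the transport leg is easy to forget. The durations below are illustrative assumptions.

    from datetime import timedelta

    def total_recovery_time(request_handling, transport, restore):
        """Offsite recovery = request processing + physical transport + actual restore."""
        return request_handling + transport + restore

    rto_sla = timedelta(hours=8)
    estimate = total_recovery_time(request_handling=timedelta(hours=1),
                                   transport=timedelta(hours=3),   # traffic jams included
                                   restore=timedelta(hours=5))
    print("within RTO:", estimate <= rto_sla)   # False: 9 hours against an 8-hour SLA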

It must be said that the final SLA for the RTO is determined not only by you, but also by the SLAs of the hardware and software vendors involved in the backup and recovery process. For this reason, do not forget about your backup server. If it runs on less reliable or less capable hardware or software than the servers of the production network, the backup server may let you down at the moment of recovery, and you will not be able to meet the recovery SLA. For example, if the production network contains disks that the vendor replaces within 1 day under a premium warranty, while the backup server has a disk under a standard warranty (repair within 45 days), then your effective SLA for the production network is 45 days.
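In other words, the effective SLA is bounded by the slowest link in the recovery chain. A one-line sanity check, using the warranty figures from the example above:

    # Repair/replacement times for every component involved in recovery, in days.
    component_repair_days = {
        "production disk (premium warranty)": 1,
        "backup server disk (standard warranty)": 45,
    }
    effective_rto_days = max(component_repair_days.values())
    print(effective_rto_days)   # 45: the backup server drags the whole SLA down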

Modeling threats of failures in the production network

It is extremely useful to model the possible ways in which the production network can fail. The model should cover all scenarios of recovering information after failures. For example, you cannot limit yourself to the scenario of a completely dead disk, in which every test restores from a full backup plus the incremental chain. What if only one file on the disk is deleted, and its latest version sits in the last incremental copy? In that case the full copy is not needed at all: you only take the last incremental copy and extract the required file from it. Such scenarios also need to be checked regularly.
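The choice of restore path can be made explicit in the test plan. The sketch below uses a hypothetical in-memory catalog of restore points; a real backup product keeps this catalog for you, but the decision logic is the same.

    def plan_restore(scenario, missing_file, restore_points):
        """Return the list of restore points to use for a given failure scenario."""
        if scenario == "disk_failure":
            # a dead disk needs the full copy plus the whole incremental chain
            return [rp["name"] for rp in restore_points]
        if scenario == "single_file_deleted":
            # take only the newest restore point that still contains the file
            for rp in reversed(restore_points):
                if missing_file in rp["files"]:
                    return [rp["name"]]
            raise LookupError(f"{missing_file} is not in any restore point")
        raise ValueError(f"unknown scenario: {scenario}")

    restore_points = [
        {"name": "full_sunday",        "files": {"woody_hat.ma", "buzz.ma"}},
        {"name": "incremental_monday", "files": {"woody_hat.ma"}},
    ]
    print(plan_restore("single_file_deleted", "woody_hat.ma", restore_points))
    # ['incremental_monday']: the full copy is not touched at all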

For more information on threat modeling, you can read the post "Model of threats in data protection from failures".

How NOT to test recovery of the production network

You should not perform recovery testing by deliberately destroying production network data. For example, the head of one company erased the contents of his computer's hard disk on a Friday evening and asked the IT service to restore everything by Monday morning. There are several reasons not to do this, above all that you are putting real production data at risk if the backup turns out to be unusable.


Summarizing

It is necessary to regularly test data recovery, not just backup integrity. That is, check that the restored systems actually work and that the data in them is available and meets the RPO, rather than merely verifying the checksums of the backup file's data blocks. When the production network runs in a virtual environment, backup products built specifically for virtual environments make this verification automated and transparent for the administrator, because they allow you to create virtual test sandbox labs isolated from the production network. An example of such a technology is Veeam SureBackup.
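A data test of this kind should end with an explicit RPO check: the newest piece of restored data must not be older than the agreed RPO at the moment of the (simulated) failure. A minimal sketch with illustrative timestamps:

    from datetime import datetime, timedelta

    def meets_rpo(newest_restored_record, failure_time, rpo):
        """True if the data lost between the last restore point and the failure fits the RPO."""
        return failure_time - newest_restored_record <= rpo

    print(meets_rpo(newest_restored_record=datetime(2013, 9, 20, 2, 0),
                    failure_time=datetime(2013, 9, 20, 9, 30),
                    rpo=timedelta(hours=24)))   # True: 7.5 hours of data lost, RPO allows 24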

The objects to test, the failure scenarios, and the points in time should be chosen at random: different computers on the production network, different restore points, and different simulated failures requiring different types of recovery (from a full copy, from an incremental copy, recovery of a single file). In the latter case, pick the files for the test restore at random rather than restoring the same file every time.
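Randomizing the selection is easy to script; the inventory lists below are, of course, hypothetical.

    import random

    machines = ["file-server", "db-server", "web-server"]
    restore_points = ["full_sunday", "incr_monday", "incr_tuesday"]
    scenarios = ["full_restore", "incremental_restore", "single_file_restore"]
    files = ["report.xlsx", "config.xml", "orders.db"]

    def pick_test_case():
        """Pick a random machine, restore point and failure scenario for the next test."""
        case = {
            "machine": random.choice(machines),
            "restore_point": random.choice(restore_points),
            "scenario": random.choice(scenarios),
        }
        if case["scenario"] == "single_file_restore":
            case["file"] = random.choice(files)   # not the same file every time
        return case

    print(pick_test_case())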

For the sake of recovery testing, you should not put production network data at undue risk.

Additional materials

Source: https://habr.com/ru/post/193568/

