
How to know that a backup was successful

Greetings

Everyone knows the old joke that admins are divided into those who do not yet make backups and those who already do. It is believed, however, that there is a third kind: those who are piously convinced that backups are being made, while in reality they are not. In this post I would like to tell a couple of true stories and, if possible, take stock and draw some conclusions.


Disclaimer: all the stories are true, but in places I have smoothed the edges; the company and the admin are composite images, all names have been changed and faces distorted beyond recognition; this is my first post, blah blah blah, and so on...

Setting the scene: imagine a classic software development company. It actively uses a version control system (Subversion, which matters in our case), a build system, and on top of that a task tracker and a wiki. The volumes are large, data loss costs serious money, everything must run like clockwork, and the "what if there's a fire" scenario worries no one: the data simply has to be kept safe. We will assume that backups, once made, automatically land on magnetic tape / DVD in the CEO's safe / a vault in a Swiss bank, so the availability of the latest backup is not our problem here.

Story number one



Prologue
The admin writes a script that dumps the database and writes a note about it to the log.

Drama

- Boss, it's all gone, boss!
- Not a problem, we have backups! Where is that bearded guy of ours?

The admin goes to the backups, neatly arranged in date_time folders, and sees that the dump files, starting from about half a year ago, are zero bytes in size.

After the storm

The mistake was amusing, to say the least. Instead of

mysqldump db > db.sql 2>> log.txt


the script contained

mysqldump db > db.sql &>> log.txt


In fact, the worst was avoided only because logging used the appending >>: the dump output ended up appended to the log instead of being lost, which is, of course, a huge stroke of luck. Once the error was found, the rest was a matter of technique: locate the needed lines near the end of the roughly ten-gigabyte log.txt and restore the dump from there.
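
For the curious, here is a minimal sketch of why the typo was so destructive (bash assumed; the file names are made up for the demo). Bash processes redirections left to right, so the later &>> wins the fight for stdout and the dump file ends up truncated to zero:

( echo DATA; echo oops >&2 ) > out.txt 2>> log.txt    # intended: DATA goes to out.txt, errors append to log.txt
( echo DATA; echo oops >&2 ) > out.txt &>> log.txt    # the typo: out.txt is truncated and left empty, both streams append to log.txt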

Story number two



Prologue
The admin writes a script that dumps the entire repository with svnadmin and ships a copy to the backup server. "And what if something goes wrong?" Drawing the right conclusions from story number one, the admin adds logging: on such and such a day a repository of so many bytes was saved.
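
The post does not show the script itself, so here is a minimal sketch of what it might have looked like (the paths and the log format are my assumptions):

#!/bin/sh
# hypothetical paths - adjust to taste
REPO=/var/svn/repo
DUMP=/backup/svn/repo_$(date +%F).dump

# svnadmin dump writes the dump to stdout and progress notes to stderr
svnadmin dump "$REPO" > "$DUMP" 2>> /var/log/svn-backup.log
echo "$(date): saved $DUMP, $(wc -c < "$DUMP") bytes" >> /var/log/svn-backup.log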

Drama
In fact, drama was avoided, but again only by luck; everything could have been much worse. The plan was to set up a second svn server, a kind of sandbox, onto which the freshest dump would be loaded once a day. While working on this task, the admin discovered that the repository dump file had been broken since some particular day. The size check, meanwhile, had been passing happily: everything up to the critical revision was being dumped faithfully, so the file looked plausibly large.
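
For reference, "rolling the freshest dump" onto a sandbox boils down to something like this (names assumed); conveniently, svnadmin load replays every revision and chokes on a broken dump, which is exactly how the damage surfaced:

svnadmin create /var/svn/sandbox
svnadmin load /var/svn/sandbox < /backup/svn/repo_latest.dump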

After the storm

This time the culprit was svnadmin, which makes a full backup iteratively, starting from the very first revision. Some revision in the middle turned out to be corrupt: svnadmin reached it, broke, honestly reported the fact, and exited. Unfortunately, I don't know all the details from this point on, but they are not very important to us. Fixing the revision proved impossible, and so did removing it (again, I don't know how things stand with this in the latest versions of Subversion). So a command decision was made: copy the giant repository to the sandbox daily with rsync.
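
A minimal sketch of that rsync approach, with an svnadmin verify pass added as one possible sanity check (the paths and the hotcopy step are my assumptions; copying a live repository with bare rsync can race with commits, hence the snapshot first):

#!/bin/sh
# hypothetical paths
REPO=/var/svn/repo
SNAP=/var/svn/repo-snapshot

rm -rf "$SNAP"                     # hotcopy wants a fresh destination
svnadmin hotcopy "$REPO" "$SNAP"   # consistent snapshot of a live repository
rsync -a --delete "$SNAP"/ sandbox:/var/svn/repo/

# verify walks every revision, so a corrupt one is noticed the same day
svnadmin verify "$SNAP" || echo "$(date): repository verify FAILED" >> /var/log/svn-backup.log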

Time to sum up



What did I want to say with all this? That, in my personal opinion, it is very hard to automate the decision that a backup was successful. Say the backup job completed without errors: how, without actually restoring it, do you verify that the data inside is complete, correct and up to date? And that new data will not, some time later, start breaking the backup process itself? The errors that lead there are as old as the world:

  1. The human factor
  2. Unreliable tools
  3. Errors in the validation logic
  4. And so on


Personally, I do not know the answers to the questions raised above. If dear Habr does, please share your experience.
Meanwhile, I have long believed that the principle of "set it and forget it" does not work, at least where backups are concerned. And I advise restoring the backup in full onto a standalone test server, if you have the opportunity (this kills two birds with one stone, at the mere cost of the time it takes to script the restore). Or write a separate script to check the integrity of backups. Check (a sketch follows this list):
  1. the backup date
  2. the file size
  3. the change in file size relative to the previous backup
  4. the list of files in the backup (or at least the count of unique ones)
  5. ... and once a day send all this information to your mail as one accumulated letter.
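
A minimal sketch of such a checker, assuming the date_time folder layout from story number one (GNU userland; all paths, names and the mail command are my assumptions):

#!/bin/sh
# hypothetical layout: /backup/<date_time>/db.sql
BACKUP_ROOT=/backup
REPORT=$(mktemp)

LAST=$(ls -1d "$BACKUP_ROOT"/*/ | sort | tail -n 1)
PREV=$(ls -1d "$BACKUP_ROOT"/*/ | sort | tail -n 2 | head -n 1)

{
  echo "latest backup: $LAST"
  echo "created:       $(date -r "$LAST")"
  # zero-size dumps are exactly what bit us in story number one
  find "$LAST" -name '*.sql' -size 0 -printf 'EMPTY DUMP: %p\n'
  echo "size now:      $(du -sb "$LAST" | cut -f1) bytes"
  echo "size before:   $(du -sb "$PREV" | cut -f1) bytes"
  echo "files:         $(find "$LAST" -type f | wc -l)"
} > "$REPORT"

# one accumulated letter a day, e.g. from cron
mail -s "backup report $(date +%F)" admin@example.com < "$REPORT"
rm -f "$REPORT"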


Thank you for your attention.

UPD Thanks for the karma => moved to the "system administration" blog.

Source: https://habr.com/ru/post/94837/

