Bad advice on setting up backups, plus a few war stories

If you have set up backups at all, without blowing it off or skimping on it, well done already. But backups also have to be done properly.

Personally, I get the feeling that some administrators have been handed a special book of bad advice on setting up backups, written for optimists who do not believe in Murphy’s law.

Today I decided to share my short hit parade, with real stories I ran into while working on both the service provider side and the customer side. After each such incident, the hero of the story keenly feels that he is no longer welcome at the company and realizes that it is time to update his résumé on XX.

Read on and take note.

Tip 1. Use RAID instead of backup.

Why back up when you have RAID? Do not bother deploying and configuring a backup system, a separate infrastructure for it, and a dedicated person to administer it. You have RAID, after all, which means there are two copies of the data, or at least redundant data is stored.

A real-life case. A client's data lived on a storage system. For speed and reliability, a RAID 10 array of more than 24 drives had been created on it. The storage system also had a number of spare disks. When one disk in a mirror pair failed, the storage system swapped in a spare and began rebuilding the contents onto it. The surviving disk of the pair was doing double duty, serving production I/O while feeding data to the new disk, but not for long: literally 5 seconds. Then it could not withstand the load, and the entire RAID failed. As a result, the data on all 24 disks was lost.

Seriously, though. Mirroring and data redundancy, which RAID is built on, do not protect the data. RAID exists so that the system does not stop every time a hard disk fails. RAID is about availability, not data integrity. It will not help if a careless administrator or a virus corrupts the data. It does not provide versioning either.
In short, RAID is not a backup. Do not mix apples and oranges.

Tip 2. Put backups on the same storage as the source data.

If you have decided to back up after all, do not spend money on separate backup storage. Modern storage systems are very reliable and smart. Just allocate the fast volumes to production data and the slower ones to backups. Forget about the 3-2-1 rule. Those who preach it simply do not know that you can do it this way.
Squeeze everything you can out of your storage system!

Real-life cases. There are many stories here, because anything can happen to a storage system. They all end the same way, though: storage failure and downtime, and sometimes complete data loss. My favorite stories from the “human factor” series:

An administrator mixed things up and deleted the production LUN.
A novice administrator stuck a large sticker over the air intake, with a detailed description of the storage system's purpose, its IP addresses, brief power-on and power-off instructions, contacts of the person in charge, and other very “necessary” information. The storage system overheated and failed.

Or, just recently, colleagues told me this one: an old-model storage system had worked for a long time and then died. When the engineers saw the corpse, they realized they had only ever seen a storage system like that in a museum. Nobody had any idea how to restore it, let alone how to do it quickly.

Seriously, though. Do not do this. If you do, be mentally prepared to be left without both the source data and the backups for an indefinite time.
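The 3-2-1 rule is easy to state as a check: at least three copies of the data, on at least two different kinds of media, with at least one copy off-site. Below is a minimal Python sketch of such a check. The `Copy` record, its fields, and the example inventories are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of the data (hypothetical inventory record)."""
    name: str
    media: str      # e.g. "primary-array", "backup-array", "tape"
    offsite: bool   # True if the copy is stored outside the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule.

    3 copies in total (counting the original), 2 distinct media,
    1 copy kept off-site.
    """
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# The setup from Tip 2: production data and backups on the same array.
same_array = [
    Copy("production volume", "primary-array", offsite=False),
    Copy("backup volume", "primary-array", offsite=False),
]

# A layout that follows the rule.
separated = [
    Copy("production volume", "primary-array", offsite=False),
    Copy("local backup", "backup-array", offsite=False),
    Copy("remote backup", "tape", offsite=True),
]

print(satisfies_3_2_1(same_array))   # False
print(satisfies_3_2_1(separated))    # True
```

The first layout fails on all three counts at once, which is exactly why a single storage failure takes out both the data and the backups.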

Tip 3. Do not bother backing up the backup server and the backup system's database.

Back up the backup server and its database? What for? You already have backups; you will recover somehow. If the backup server gives up the ghost, just redeploy everything and import the backup files. Then wait until the system figures out what to do with it all and rebuilds the database with information about jobs, schedules, backup objects, and where the backups are stored. Never mind that you will first spend time restoring the backup system and only then start reviving the fallen infrastructure.

A real-life case. The source data and the backup server were placed on the same LUN of a storage system. When the LUN became unavailable and a recovery was needed, it turned out that there was no backup of the backup server or its database. Without the database, all the backups turned into a suitcase without a handle. The team had to deploy the backup system from scratch. Luckily, the solution they used could recreate the database by importing the backups. But there was one catch: there were 100 TB of data to import. To build a new database, the new system had to go through all of it and catalog everything. As a result, only 20% was processed in 1.5 days. Then someone advised them to import only the full backups (the largest files), and things went faster, but by then the victims had already lost a lot of time and nerves.
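The numbers in this story allow a rough estimate of what a complete import would have cost. Here is a back-of-the-envelope sketch in Python; the constant import rate is my simplifying assumption, not something the story states.

```python
def estimate_total_days(fraction_done, days_elapsed):
    """Extrapolate the total duration from progress so far, assuming a constant rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return days_elapsed / fraction_done


total_tb = 100          # size of the backups to import, from the story
fraction_done = 0.20    # 20% processed...
days_elapsed = 1.5      # ...in 1.5 days

total_days = estimate_total_days(fraction_done, days_elapsed)
remaining_days = total_days - days_elapsed
tb_per_day = total_tb * fraction_done / days_elapsed

print(f"full import: ~{total_days:.1f} days, remaining: ~{remaining_days:.1f} days")
print(f"cataloging rate: ~{tb_per_day:.1f} TB/day")
```

At that pace the full import would have taken about 7.5 days before the first restore could even begin, which is why cutting the import down to full backups only made such a difference.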

Seriously, though. To be fully prepared for losing the backup server, think through two things.

First, a backup of the backup server and its database (yes, a backup of the backup). How critical it is depends on the product:

With some solutions, it is enough to redeploy the system and import the backup files into it. In that case a backup of the database with jobs, backup objects, and so on is not that critical. Veeam, for example, works this way.

With others, you cannot do without the database (Symantec). If it was not backed up, you will have to spend time rebuilding it.

With a third group, once the database is lost, all that is left is to contact the vendor’s technical support (Commvault).

Second, a Disaster Recovery plan for the backup server, thought out in advance, in case the primary server or site fails. Many backup vendors describe options for setting up Disaster Recovery for the backup server in their documentation, for example Veeam and Commvault.

Tip 4. Do not test backup recovery.

Backups are not for recovery. They exist simply to be there and to warm the administrator's soul by their very existence. The main thing is to configure the backup, and then everything will surely be backed up in such a way that you can easily restore virtual machines, databases, and applications. In every case, the recovered data will be consistent and intact.

Until zero hour, the RTO (recovery time objective) will remain a mystery to both the administrator and management. And then everyone is in for a surprise.

Even better if no one except the administrator knows where the backups are, and the recovery procedure exists only in his head. Let his colleagues try their hand at improvisation.

Real-life cases. There are many such stories too; here is one of the most typical. Once upon a time there was a database called prod. The backup engineer dutifully added it to the backup. One day there was a crash, and the administrators restored the database alongside the old one under the name prod1. Production accordingly moved to prod1. The backup engineer knew nothing about this, so he never added prod1 to the backup. The old prod database kept being backed up. When the time came to restore from a backup, it turned out there was none. Data for the last 3 months was lost.

A similar story happens when a database moves to a new disk or server. The problem is largely organizational, but it can also be avoided by periodically running a test restore from a backup with the system owner involved.
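Besides test restores, the prod/prod1 gap can be caught by regularly comparing what is actually running with what is actually being backed up. A minimal Python sketch follows. Both lists are hypothetical inputs; how to obtain them from a database server and a backup system depends on the products in use and is outside this sketch.

```python
def find_unprotected(live_databases, backed_up_databases):
    """Return live databases that no backup job covers, sorted by name."""
    return sorted(set(live_databases) - set(backed_up_databases))


# Hypothetical inventory matching the story above.
live = ["prod", "prod1"]   # databases that exist on the server
backed_up = ["prod"]       # databases included in backup jobs

print(find_unprotected(live, backed_up))   # ['prod1']
```

A non-empty result is a reason to talk to the system owner before the next crash, not after it.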

Seriously, though. After setting up the backup system, perform a test restore at least once. The schedule for further tests depends on the backup object: how often it changes, its size, and so on. The owner of the system or application should take part in the verification.

If you do not want to check everything by hand, most backup software has features to automate the checks (for example, Veeam has SureBackup for verifying backups).

In addition, many products let you use scripts to check backups, running certain actions after the restore (starting applications, connecting to specific ports, creating objects in the database, and so on).
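As an illustration of such a post-restore script, here is a minimal Python sketch that checks whether the restored machine answers on the TCP ports its services are supposed to listen on. The function names and the address in the usage note are hypothetical. How the script is hooked into a particular backup product is not shown, because that differs from vendor to vendor.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def verify_restored_host(host, required_ports, timeout=3.0):
    """Return the required ports that did NOT answer on the restored host."""
    return [p for p in required_ports if not port_is_open(host, p, timeout)]
```

For a restored database server you might call `verify_restored_host("10.0.0.15", [22, 5432])` (a made-up address) and treat any non-empty result as a failed test. An open port only shows that the service started; checking that the data inside is usable still needs an application-level test.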

A test restore helps you get a realistic estimate of the RTO (the calculated RTO can differ greatly from real life). It also helps reveal gaps in the recovery procedures, if there are any.
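One simple way to turn a test restore into an RTO figure is to time it and compare the result with the target. This is a sketch under stated assumptions: `restore` stands for whatever actually performs the restore (a placeholder callable, not the API of any backup product), and the `clock` parameter is there only so the function can be tested without waiting.

```python
import time


def timed_restore(restore, rto_target_seconds, clock=time.monotonic):
    """Run restore() and report how long it took relative to the RTO target."""
    started = clock()
    restore()
    elapsed = clock() - started
    return {
        "elapsed_seconds": elapsed,
        "rto_target_seconds": rto_target_seconds,
        "within_rto": elapsed <= rto_target_seconds,
    }
```

Recording these results over several test restores shows how the real recovery time drifts as the data grows.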

There are many reasons to test backups, as well as recommendations on how to organize the testing. I will cover this in detail in one of the following articles.

Tip 5. Do not build a separate infrastructure for backups.

Build the backup system and its infrastructure from whatever is left over. Once production is up and running, back it up over the same infrastructure and the same network. Let everything go through the same channel. Do not put bandwidth limits on backup jobs; let the backup eat as many resources as it wants.

A real-life case. At one client, production and backup shared a single network link.

Once a quarter, the company generated a quarterly report. The whole process took more than 15 hours, so one group (let's call it “business”) would start the report generation in the morning and get the finished report the next morning. One fine morning it turned out that the quarterly report had not been generated. The investigation showed that the error was caused by a dropped connection to the database. There were no visible technical reasons for it. A week later, the team reran the report, and everything worked. But six months later, the problem came back.

It turned out that at about the same time another group (let's call it “IT”) was launching a full backup of the database. The job took up most of the network bandwidth and loaded the database, and as a result the connection between the reporting system and the database was dropped. The problem was solved by no longer running the backup on the day of the quarterly report.

Seriously, though. Backup is resource-intensive. Ideally, it needs its own infrastructure and network on dedicated equipment. Then backups will not interfere with production, and backup and restore jobs will finish in a reasonable time.

A separate backup infrastructure is easiest to set up at the planning stage of the production environment.

If you cannot keep the two apart, you can try setting limits with the backup system's own tools (Network Traffic Throttling) or limiting bandwidth with QoS on the network equipment. The key is to strike a balance and not throttle the backup so hard that jobs take 24 hours to run.
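To find that balance, it helps to estimate how long a backup will take under a given limit before applying it. Here is a rough Python sketch; the sizes, limits, and backup window in the example are hypothetical, and the calculation ignores protocol overhead, compression, and deduplication.

```python
def transfer_hours(size_tb, limit_mbit_per_s):
    """Hours needed to move size_tb terabytes at limit_mbit_per_s megabits per second.

    Uses decimal units: 1 TB = 10**12 bytes, 1 Mbit = 10**6 bits.
    """
    if size_tb < 0 or limit_mbit_per_s <= 0:
        raise ValueError("size must be >= 0 and the limit must be > 0")
    bits = size_tb * 10**12 * 8
    seconds = bits / (limit_mbit_per_s * 10**6)
    return seconds / 3600


def fits_window(size_tb, limit_mbit_per_s, window_hours):
    """True if the backup finishes within the backup window at this limit."""
    return transfer_hours(size_tb, limit_mbit_per_s) <= window_hours


# Hypothetical example: a 5 TB full backup and an 8-hour night window.
for limit in (500, 1000, 2000):
    hours = transfer_hours(5, limit)
    verdict = "fits" if fits_window(5, limit, 8) else "does not fit"
    print(f"{limit} Mbit/s: {hours:.1f} h, {verdict} in an 8 h window")
```

In this example a 500 Mbit/s limit stretches the job to roughly 22 hours, which is the situation the paragraph above warns about.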

Tip 6. Do not monitor

Why set up email notifications telling you that the latest backup job succeeded or failed? There is enough spam in the mailbox already. And you certainly do not need to monitor the state of the server the backup system runs on.

The less you know, the better you sleep.

A real-life case. Veeam has one peculiarity: after installing an update, you must go to the main server and launch the updated Veeam. It will check the infrastructure and list the servers that need updating. If some components of the backup infrastructure are left un-updated, the main server cannot communicate with them, and jobs fail with an error. The administrator either forgot about this or simply did not know. He updated the main server and left for the weekend with peace of mind. Notifications were not configured, so he learned of his mistake only on Monday, when he needed backups that did not exist.

Seriously, though. Set up alerts for completed jobs and errors. That way you will know whether you have a backup at all, or whether you have run out of disk space and have nowhere to put it. You can also set up weekly reports on jobs and errors and on the space remaining for backups.

Monitor the availability of the backup server itself: at the very least, ping it.
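If the backup software's own alerts are not enough, or as an independent second opinion, the two questions above (is there a recent backup, and is there space for the next one) can be checked with a small script. Here is a minimal Python sketch, assuming backups land as files in a single directory. The function name, thresholds, and directory layout are hypothetical, and delivering the result by email or to a monitoring system is left out.

```python
import os
import shutil
import time


def backup_problems(backup_dir, max_age_hours=24, min_free_fraction=0.10, now=None):
    """Return a list of human-readable problems found in backup_dir.

    Checks that the newest file is recent enough and that the volume
    holding backup_dir still has free space.
    """
    now = time.time() if now is None else now
    problems = []

    files = [entry.path for entry in os.scandir(backup_dir) if entry.is_file()]
    if not files:
        problems.append("no backup files found")
    else:
        newest = max(os.path.getmtime(path) for path in files)
        age_hours = (now - newest) / 3600
        if age_hours > max_age_hours:
            problems.append(f"newest backup is {age_hours:.0f} h old")

    usage = shutil.disk_usage(backup_dir)
    if usage.total and usage.free / usage.total < min_free_fraction:
        problems.append("backup volume is low on free space")

    return problems
```

Run from a scheduler on a machine other than the backup server, a check like this would have flagged the weekend in the Veeam story by Saturday morning.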

This list could go on for a long time, but I will stop here. Share your own bad tips and backup stories of the “it would be funny if it were not so sad” kind in the comments.

Source: https://habr.com/ru/post/326948/