
Big Data: backup cannot be done work without it

During my work as a database administrator I settled on a rule that many DBAs follow, the "golden rule" of database administration: never do anything serious to a database without a backup. If you are about to change important database parameters, run maintenance operations, and so on, you should always take a backup first. This principle served me well for a long time and more than once helped restore a database to a specific point in time.

Recently we were given the task of developing a backup procedure for a data warehouse of 20 terabytes. Drawing on established backup practices, I tried to design a procedure that would also stay within the agreed RPO (recovery point objective) and RTO (recovery time objective). Both are measured in time: RPO is the maximum acceptable amount of data loss, and RTO is the maximum acceptable downtime, that is, how long the database may take to recover. This is where things got interesting: however I approached it, the procedure simply would not fit within those limits, because there was too much data to back up. In the best case, with numerous reservations and preconditions, the database could be restored in a few hours, and the business could not afford even that. Under normal conditions, with no special constraints in place, recovery would take several days. Worse, it was also impossible to take a backup in a reasonable time: that too required several days and put a heavy load on the database. I should note right away that this database does not support incremental backup in its current version. Had incrementality been available, the game might have been worth the candle and a traditional backup procedure might have remained viable in this case.
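To make the RPO/RTO arithmetic concrete, here is a minimal back-of-envelope sketch in Python (my own illustration, not part of the original procedure). Only the 20 TB size comes from the article; the throughput figures and targets are assumptions, and the point is simply that a backup or restore measured in days cannot meet targets measured in hours.

```python
# Back-of-envelope estimate of full backup/restore times for a large warehouse.
# Throughput figures and targets below are hypothetical; only the 20 TB size
# comes from the article.

def hours_to_move(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to stream `size_tb` terabytes at `throughput_mb_s` MB/s."""
    size_mb = size_tb * 1024 * 1024          # TB -> MB
    return size_mb / throughput_mb_s / 3600  # seconds -> hours

WAREHOUSE_TB = 20.0    # warehouse size from the article
BACKUP_MB_S = 100.0    # assumed sustained backup throughput
RESTORE_MB_S = 80.0    # restore is usually slower (write path, index rebuilds)

RPO_HOURS = 4.0        # assumed allowable data-loss window
RTO_HOURS = 8.0        # assumed allowable downtime

backup_h = hours_to_move(WAREHOUSE_TB, BACKUP_MB_S)
restore_h = hours_to_move(WAREHOUSE_TB, RESTORE_MB_S)

# The length of a full backup cycle bounds how fresh the newest backup can be,
# so it bounds the achievable RPO; the restore time bounds the achievable RTO.
print(f"full backup:  ~{backup_h:.0f} h  (RPO target {RPO_HOURS} h, fits: {backup_h <= RPO_HOURS})")
print(f"full restore: ~{restore_h:.0f} h  (RTO target {RTO_HOURS} h, fits: {restore_h <= RTO_HOURS})")
```

With these assumed numbers the full backup comes out at roughly two and a half days and the restore at roughly three, which matches the situation described above.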

Realizing that a conventional backup procedure was not viable here, I started looking for existing solutions to this problem. It quickly became clear that nobody backs up such volumes of data head-on. Instead, there are several approaches that let you keep a more or less up-to-date copy of a database of this size.

Incrementality


If the database supports incremental backup and the volume of ongoing changes is relatively small, you can try running an incremental backup at regular intervals. However, this method does not suit everyone and is rather inconvenient in that the increments must be continuously rolled forward onto a second instance of the database. Here the incremental backup serves as the likely last line of defense, while incrementality removes the extra load on the database by backing up only the changed data. Under a number of conditions this solution is viable, although in my opinion it is not the best one.
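As a rough illustration of what a hand-rolled incremental copy can look like, here is a minimal Python sketch based on a change watermark (an `updated_at` column). The table and column names, and the use of sqlite3 as a stand-in for the real warehouse driver, are my own assumptions; a real warehouse would need this per table plus separate handling of deletes, which a watermark alone does not capture.

```python
# A minimal sketch of a watermark-based incremental backup. The `facts` table
# and its columns are hypothetical; sqlite3 stands in for the real driver.
import datetime
import sqlite3

def incremental_backup(src: sqlite3.Connection,
                       dst: sqlite3.Connection,
                       last_watermark: str) -> str:
    """Copy rows changed since `last_watermark` and return the new watermark."""
    new_watermark = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    rows = src.execute(
        "SELECT id, payload, updated_at FROM facts "
        "WHERE updated_at > ? AND updated_at <= ?",
        (last_watermark, new_watermark),
    ).fetchall()
    dst.executemany(
        "INSERT OR REPLACE INTO facts (id, payload, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    dst.commit()
    return new_watermark

if __name__ == "__main__":
    src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for db in (src, dst):
        db.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)")
    src.execute("INSERT INTO facts VALUES (1, 'row one', '2024-01-01 00:00:00')")
    src.commit()
    wm = incremental_backup(src, dst, "1970-01-01 00:00:00")
    print(dst.execute("SELECT * FROM facts").fetchall(), "new watermark:", wm)
```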

Replication


One of the most common solutions is to replicate new and changed data to one or more copies of the database. There are many technologies for this, operating at the transaction level or at the file-system level, and replication can be either synchronous or asynchronous. The advantage is that you get an almost exact copy of the database, and the error-handling mechanisms built into replication make it possible to find the cause of a failure quickly and painlessly and, as a result, to fix it quickly. The biggest drawbacks are the heavy load it creates and the high cost of these technologies. Still, when there is no other way to keep a standby copy of the database up to date, replication has been and remains one of the most widely used solutions for extremely large data.
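Whichever replication technology is used, the copy is only useful if its lag stays within the RPO. Below is a minimal monitoring sketch along those lines; the `change_log` table is hypothetical and sqlite3 again stands in for the real driver, since real systems expose lag through their own views (for example, pg_stat_replication in PostgreSQL).

```python
# A minimal sketch of an RPO check for an asynchronous replica: compare the
# newest commit timestamp on the primary and on the replica and report whether
# the lag fits the allowed data-loss window. Table name is hypothetical.
import datetime
import sqlite3

RPO = datetime.timedelta(hours=4)   # assumed allowable data loss

def last_commit(conn: sqlite3.Connection) -> datetime.datetime:
    """Return the newest change timestamp recorded in the hypothetical change_log."""
    (ts,) = conn.execute("SELECT MAX(committed_at) FROM change_log").fetchone()
    return datetime.datetime.fromisoformat(ts)

def check_rpo(primary: sqlite3.Connection, replica: sqlite3.Connection) -> bool:
    lag = last_commit(primary) - last_commit(replica)
    print(f"replication lag: {lag}, within RPO: {lag <= RPO}")
    return lag <= RPO

if __name__ == "__main__":
    primary, replica = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for db, ts in ((primary, "2024-01-01 12:00:00"), (replica, "2024-01-01 09:30:00")):
        db.execute("CREATE TABLE change_log (committed_at TEXT)")
        db.execute("INSERT INTO change_log VALUES (?)", (ts,))
    check_rpo(primary, replica)
```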

"Double" ETL


As a rule, data passes through an ETL or ELT procedure before it enters the data warehouse. The abbreviation ETL itself tells us that the data is transformed appropriately before loading and that unneeded data is discarded. This process can be parallelized: instead of loading the data into one warehouse, you load it into two or more, so that you end up with as many copies of the warehouse as you need. The significant drawback is that the copies are often not identical, because errors and inconsistencies creep in during loading, and it is not always clear which copy is the more correct one. Some businesses may tolerate such discrepancies, but for a financial company that assumption is unacceptable. You can build an elaborate verification and correction procedure, but as a rule this only complicates and slows down the whole process. In summary, this approach is applicable in a limited number of cases.
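A sketch of what the dual load and a basic reconciliation step might look like is given below. The toy Warehouse class, the batch format, and the checksum-based comparison are my own illustration of the idea rather than any particular product's API.

```python
# A minimal sketch of a "double" ETL load with reconciliation: the same batch
# is written to two warehouse copies, then row counts and a content checksum
# are compared to detect the drift described above. Everything here is a toy
# in-memory stand-in for real warehouse loaders.
import hashlib
from typing import Iterable, List, Tuple

Row = Tuple[int, str]

class Warehouse:
    """Toy in-memory stand-in for one warehouse copy."""
    def __init__(self, name: str):
        self.name = name
        self.rows: List[Row] = []

    def load(self, batch: Iterable[Row]) -> None:
        self.rows.extend(batch)

    def checksum(self) -> str:
        h = hashlib.sha256()
        for row in sorted(self.rows):
            h.update(repr(row).encode())
        return h.hexdigest()

def double_load(batch: List[Row], a: Warehouse, b: Warehouse) -> None:
    # In a real pipeline these two loads would run in parallel.
    a.load(batch)
    b.load(batch)

def reconcile(a: Warehouse, b: Warehouse) -> bool:
    same = len(a.rows) == len(b.rows) and a.checksum() == b.checksum()
    print(f"{a.name} vs {b.name}: rows {len(a.rows)}/{len(b.rows)}, identical: {same}")
    return same

if __name__ == "__main__":
    main, backup = Warehouse("main"), Warehouse("backup")
    double_load([(1, "fact one"), (2, "fact two")], main, backup)
    backup.rows.append((3, "stray row"))   # simulate drift between the copies
    reconcile(main, backup)
```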

As should be clear by now, restoring such volumes from backups is not practiced anywhere: it takes days, if not weeks. The main way to restore service after the failure of the primary database is to switch over to a working copy, and a number of methods, some of which I listed above, are used to keep that copy current. The traditional approach to backup, keeping a copy of the database and restoring it on failure, does not work for very large databases; you do not have to look far for examples. Summing up all of the above, I want to put the comma in the title in the right place: backup cannot be done, work without it.

Source: https://habr.com/ru/post/147944/

