When designing and operating our data warehouse, the question of how to do backups or replication came up several times. I always gave the same answer: no way. Let me explain why.
Backups of large databases (hundreds of gigabytes and up) are fairly useless for one simple reason: restoring from a backup can take days. If the database is in continuous business use and data is loaded into it in a continuous stream, this is unacceptable. The situation is somewhat better with an incremental backup to a standby system that can be brought up directly on top of the backed-up files. However, this method does not suit every database, only those that never modify files once they are written to disk. For example, it is a poor fit for MySQL, where all tables live either in a single tablespace (InnoDB) or in separate per-table files (MyISAM) that keep changing. For Vertica it is a viable option, since data is written to immutable files that never change after being written and are only ever deleted. However, with clustered systems you also have to keep the topology of the primary and standby systems identical, and there may be data integrity problems if the primary system fails.
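To make the distinction concrete, here is a minimal sketch of the incremental idea for databases whose data files are immutable once written: a file that already exists on the backup side never has to be copied again. The paths are hypothetical and deleted files are not handled; this is an illustration, not a real backup tool.

```python
# A minimal sketch: incremental copy of immutable data files.
# Paths are hypothetical; handling of deleted files is omitted.
import os
import shutil

def incremental_copy(data_dir, backup_dir):
    """Copy only the data files that appeared since the last run."""
    os.makedirs(backup_dir, exist_ok=True)
    copied = 0
    for name in os.listdir(data_dir):
        src = os.path.join(data_dir, name)
        dst = os.path.join(backup_dir, name)
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)  # immutable: "already present" means "already backed up"
            copied += 1
    return copied

# incremental_copy("/var/warehouse/data", "/mnt/backup/warehouse")
```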
Sometimes replication is used to maintain a standby system. But it should be understood that replication eats into performance, because it requires writing a binary log and, if the replication is synchronous, waiting for synchronization. In analytical applications with a large data flow, where thousands or tens of thousands of records per second have to be loaded into the database continuously, this may be unacceptable.
What to do?
A while ago we devised and implemented a scheme for cloning, or multiplexing, systems that works not at the database level but at the source level. We maintain several "almost" identical systems that are not connected to each other but load the same data in the same way. Since users never write directly to the analytical databases, this is possible. Such cloning has another important advantage: you can have one or more test systems with real production data and load. Yet another advantage is staged deployment and QA: the behavior of the system with new functionality can be compared against the current production one, and errors caught in time.
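As a rough illustration, here is a minimal sketch of what source-level multiplexing looks like in the loader. The table, the targets, and the sqlite3 stand-in are all hypothetical; in reality each target would be an independent warehouse, possibly from a different vendor.

```python
# A rough sketch of source-level multiplexing: the ETL loader writes the same
# processed batch into every clone instead of relying on replication.
# sqlite3 stands in for the real warehouse connections.
import sqlite3

TARGETS = ["primary.db", "standby.db", "staging.db"]  # independent systems

def load_batch(rows):
    """Load one processed batch into every target; failures are isolated."""
    failed = []
    for dsn in TARGETS:
        try:
            conn = sqlite3.connect(dsn)
            conn.execute(
                "CREATE TABLE IF NOT EXISTS stats (day TEXT, campaign TEXT, clicks INT)"
            )
            conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", rows)
            conn.commit()
            conn.close()
        except sqlite3.Error as exc:
            failed.append((dsn, exc))  # a broken clone must not block the others
    return failed

if __name__ == "__main__":
    batch = [("2012-05-01", "campaign_a", 120), ("2012-05-01", "campaign_b", 75)]
    for dsn, err in load_batch(batch):
        print(f"target {dsn} failed: {err}")  # re-queue the batch for this target only
```

Each clone is loaded independently, so a lagging or failed clone never holds up the rest; its batches are simply retried later.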
So, cloning allows you to:
- Have one or more hot standby systems that are always ready
- Have identical systems for different purposes or for load distribution
- Have systems with the same data, but different settings (for example, optimized for light or heavy queries)
- Have a test system with production data and load
- Conduct a gradual deployment of new functionality, reducing the risk of errors
- Restore one system's data from another (by copying)
- Manage all of this transparently
And all this without any performance penalty and with minimal risk. However, there are difficulties worth mentioning:
- Controlling data integrity across systems
- Bootstrapping a new system from scratch
Both of these problems are quite difficult to solve with 100% accuracy, but we did not need that. It is enough that the financially significant statistics match; the detailed data may differ slightly or even be missing. In both cases the data can be synchronized by copying the meaningful data from a live system. Thus we always had full control and room to choose: an exact copy that takes a few days, or a less exact one, faster.
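Sketched below (reusing the hypothetical stats table from the loader sketch above) is the kind of check this boils down to: compare the financially significant aggregates between two clones and re-copy only the days that diverge from the system treated as the reference.

```python
# A sketch of the reconciliation check between two clones.
# The schema and file names are the hypothetical ones from the loader sketch.
import sqlite3

def daily_totals(dsn):
    conn = sqlite3.connect(dsn)
    rows = conn.execute("SELECT day, SUM(clicks) FROM stats GROUP BY day").fetchall()
    conn.close()
    return dict(rows)

def days_to_resync(reference_dsn, clone_dsn, tolerance=0):
    """Return the days whose totals differ by more than the tolerance."""
    reference = daily_totals(reference_dsn)
    clone = daily_totals(clone_dsn)
    return [day for day, total in reference.items()
            if abs(total - clone.get(day, 0)) > tolerance]

# Days returned here would then be copied from primary.db into standby.db.
print(days_to_resync("primary.db", "standby.db"))
```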
The described approach has helped us a lot. Moreover, it allowed us to run systems on databases from different vendors but working on the same algorithms, which simplified migration from one database to another.
Update: After receiving comments, some clarification was needed.
I probably should have started by describing what kind of data we process. I am sure the proposed approach works in many different cases, but the specifics won't hurt.
We process and load data into the warehouse from several types of sources:
- Logs from runtime servers that record statistics and context of advertising campaigns
- Ontology and descriptions that allow correct interpretation of the logs
- Data from our partner sites
All this data is loaded into the warehouse and used by our clients, partners, our own users, and various algorithms that decide, based on the data, what to display, where, and how. A database failure means not only stopping the business and losing money, but also having to "catch up" later with the data accumulated during the outage. And timeliness matters a lot. So the question of a standby system is not an idle one.
The amount of data is large. Raw logs take up a few terabytes per day, although in processed form it is much less. The production database grows steadily and now occupies several terabytes.
Cloning, or multiplexing, at the source level is, in our opinion, a good, simple, and relatively inexpensive way to solve the redundancy problem, and it has a number of additional advantages. The purpose of this article was to describe a working idea, not a specific implementation. The idea is quite universally applicable whenever data enters the warehouse only through ETL.