
Big Data Backup

When designing and operating our data warehouse, the question of how to do backups or replication came up several times. My answer was always the same: we don't. Let me explain why.



Backups of large databases (hundreds of gigabytes and up) are fairly useless for one simple reason: restoring from a backup can take days. If the database serves the business continuously and data is loaded into it in a constant stream, that is unacceptable. The situation is somewhat better with an incremental backup to a standby system that can be brought up directly on top of the backed-up files. However, this method does not suit every database, only those that never modify a file once it has been written to disk. MySQL, for example, is a poor fit: all tables live either in a single tablespace (InnoDB) or in per-table files (MyISAM) that are modified in place. For Vertica it is a viable option, since data is written to files that never change after being written and are only deleted. But with cluster systems you also have to keep the topology of the primary and standby systems identical, and there can be data-integrity problems if the primary system fails mid-load.
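To make the file-level incremental idea concrete, here is a minimal sketch, assuming the data directory contains only immutable files (as with Vertica-style storage containers): only new files are copied, and files removed on the source are removed on the backup. The paths and function name are hypothetical, not part of our actual tooling.

```python
import shutil
from pathlib import Path

def incremental_file_backup(src_dir: str, dst_dir: str) -> None:
    """Copy only files that are new on the source; because data files are
    never modified after being written, an existing file on the backup side
    never has to be re-copied, only removed if it disappeared from the source."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    src_files = {p.relative_to(src) for p in src.rglob("*") if p.is_file()}
    dst_files = {p.relative_to(dst) for p in dst.rglob("*") if p.is_file()}

    for rel in src_files - dst_files:          # new files -> copy
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, target)

    for rel in dst_files - src_files:          # deleted files -> remove
        (dst / rel).unlink()

if __name__ == "__main__":
    # Hypothetical paths; in practice dst_dir would live on a standby host.
    incremental_file_backup("/data/warehouse", "/backup/warehouse")
```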



Sometimes replication is used to maintain a standby system. But it should be understood that replication costs performance: it requires writing a binary log, and with synchronous replication, waiting for the replicas to acknowledge. In analytical applications with a heavy data flow, where thousands or tens of thousands of records per second have to be loaded into the database continuously, that may be unacceptable.


What to do? Quite a while ago we devised and implemented a scheme of cloning, or multiplexing, that happens not at the database level but at the source level. We maintain several "almost" identical systems that are not connected to each other but load the same data in the same way. Since users never write directly to the analytical databases, this is possible. Such cloning has another important advantage: you can keep one or more test systems running with real production data and load. Yet another benefit is staged deployment and QA: the behavior of a system with new functionality can be compared against the current production one, and errors caught in time.
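A minimal sketch of this fan-out, assuming the warehouses are loaded only through the ETL layer. The DSNs, table name, and load_batch() helper are hypothetical placeholders; a real pipeline would use whatever bulk-load mechanism each target database provides.

```python
from typing import Iterable, Sequence

# Several independent, "almost" identical targets: production, a standby
# clone, and a test/QA system that may even run on a different vendor.
TARGETS = [
    "vertica://etl@prod-dwh/analytics",
    "vertica://etl@standby-dwh/analytics",
    "mysql://etl@test-dwh/analytics",
]

def load_batch(dsn: str, table: str, rows: Sequence[tuple]) -> None:
    """Bulk-load one batch into a single target (placeholder)."""
    print(f"loading {len(rows)} rows into {table} on {dsn}")

def multiplex(table: str, batches: Iterable[Sequence[tuple]]) -> None:
    """Load every batch into every target; targets never talk to each other."""
    for batch in batches:
        for dsn in TARGETS:
            try:
                load_batch(dsn, table, batch)
            except Exception as exc:
                # A failed clone is noted and re-synchronized later;
                # the remaining systems keep receiving data.
                print(f"skipping {dsn}: {exc}")

if __name__ == "__main__":
    multiplex("events", [[("2012-06-01", "click", 1)],
                         [("2012-06-01", "view", 2)]])
```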



So, cloning allows you to:



And all this with no performance penalty and minimal risk. However, there are difficulties that need to be mentioned:



Both of these problems are quite difficult to solve with 100% accuracy, but we did not need that. It is enough that the financially significant statistics match; the detailed data may differ slightly or even be missing. In both cases the data can be brought back in sync by copying the meaningful data from the live system. So we always had full control and room to choose: an exact copy, but available only after a few days, or a less accurate one, but sooner.
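A rough sketch of that repair step, under the assumption that a daily revenue aggregate is the "financially significant" figure to match: compare the aggregate between the live system and a clone and re-copy only the days that diverge. fetch_daily_revenue() and overwrite_day() are hypothetical helpers stubbed out for illustration.

```python
from typing import Dict

def fetch_daily_revenue(system: str) -> Dict[str, float]:
    """Return {day: revenue} for one warehouse (stubbed for illustration)."""
    if system == "live":
        return {"2012-06-01": 1000.0, "2012-06-02": 1200.0}
    return {"2012-06-01": 1000.0, "2012-06-02": 1185.5}   # clone drifted on day 2

def overwrite_day(clone: str, live: str, day: str) -> None:
    """Re-copy all detail rows for one day from the live system (placeholder)."""
    print(f"re-copying {day} from {live} to {clone}")

def reconcile(live: str, clone: str, tolerance: float = 0.01) -> None:
    """Find days where the clone's aggregate diverges and refresh only those."""
    live_rev = fetch_daily_revenue(live)
    clone_rev = fetch_daily_revenue(clone)
    for day, value in live_rev.items():
        if abs(value - clone_rev.get(day, 0.0)) > tolerance:
            overwrite_day(clone, live, day)

if __name__ == "__main__":
    reconcile("live", "standby")
```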



The described approach has helped us a lot. Moreover, it allowed us to run systems on databases from different vendors while working on the same algorithms, which simplified migration from one database to another.



Update: after receiving comments, a few clarifications are in order.




Probably, I should have started by describing what kind of data we process. I am sure the proposed approach works in other cases as well, but specifics never hurt.



We process and load data into the warehouse from several types of sources:



All this data is loaded into the warehouse and used by our clients, partners, our own users, and various algorithms that decide, based on the data, what to display, where, and how. A database failure therefore means not only stopping the business and losing money, but also having to "catch up" later on the data that accumulated during the outage. And timeliness matters a great deal. So the question of a backup system is not an idle one.



The amount of data is large. Raw logs take up a few terabytes per day, although in processed form it is much less. The production database grows steadily and now occupies several terabytes.



Cloning, or multiplexing, at the source level is, in our opinion, a good, simple, and relatively inexpensive way to solve the redundancy problem, with a number of extra advantages. The purpose of this article was to describe a working idea, not a specific implementation. The idea applies quite universally wherever data gets into the warehouse only through ETL.



Source: https://habr.com/ru/post/146249/


