
The pitfalls of backing up and restoring deduplicated data in a disaster recovery scenario



Continuing the theme of backup and recovery on storage systems with a new architecture, let us consider the nuances of working with deduplicated data in a disaster recovery scenario, where storage systems with their own deduplication are being protected: namely, how this space-efficient storage technology can help or hinder data recovery.


The previous article is here: Pitfalls of backup in hybrid storage systems.

Introduction


Since deduplicated data takes up less disk space, it is logical to assume that its backup and recovery should take less time. Indeed, why not back up and restore deduplicated data directly in its compact, deduplicated form? In that case:


But if you look at the situation more closely, it turns out that not everything is so simple: the direct path is not always the more efficient one, if only because general-purpose storage systems and backup storage systems use different kinds of deduplication.

Deduplication in general purpose storage systems


Deduplication, as a method of eliminating redundant data and increasing storage efficiency, has been and remains one of the key areas of development in the storage industry.


The principle of deduplication.

In the case of production data, deduplication is intended not only, and not so much, to reduce disk space usage as to increase the speed of access to data thanks to its denser placement on fast media. In addition, deduplicated data is convenient to cache.

One deduplicated block in the cache, on the top tier of multi-tier storage, or simply placed on flash, can correspond to dozens or even hundreds of identical user data blocks that previously occupied space on physical disks at completely different addresses and therefore could not be cached effectively.
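To make the caching argument concrete, here is a minimal, purely illustrative Python sketch (the class and names are invented for this example and are not taken from any real array) of a block map in which many logical addresses reference one shared physical block, so caching that single physical block serves every logical address that points to it:

    import hashlib

    class DedupStore:
        """Toy model: many logical block addresses map to one shared physical block."""

        def __init__(self):
            self.physical = {}   # fingerprint -> payload, stored exactly once
            self.block_map = {}  # logical block address -> fingerprint
            self.cache = {}      # fingerprint -> payload (the "hot" physical blocks)

        def write(self, lba, payload):
            fp = hashlib.sha256(payload).hexdigest()
            self.physical.setdefault(fp, payload)   # keep only one physical copy
            self.block_map[lba] = fp

        def read(self, lba):
            fp = self.block_map[lba]
            if fp not in self.cache:                # one cache entry per physical block
                self.cache[fp] = self.physical[fp]
            return self.cache[fp]

    store = DedupStore()
    same_block = b"A" * 4096
    for lba in range(100):           # 100 identical logical blocks at different addresses
        store.write(lba, same_block)

    for lba in range(100):           # all reads are served by a single cached entry
        store.read(lba)

    print(len(store.physical), "physical block for", len(store.block_map),
          "logical blocks,", len(store.cache), "cache entry")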

Today, deduplication on general purpose storage is very effective and profitable. For example:



Efficiency of solving storage problems using a combination of deduplication and tiering. In each variant, equal performance and capacity is achieved.

Deduplication in backup storage systems


Deduplication initially became widespread in backup systems. Because data blocks of the same kind are copied into backups dozens or even hundreds of times, eliminating that redundancy yields substantial space savings. At one time this was the reason for the "offensive" that disk backup libraries with deduplication mounted against tape. Disk squeezed tape hard, because the cost of storing backups on disk became very competitive.


The advantage of deduplicated backup to disk.

As a result, even such tape adherents as Quantum began to develop disk libraries with deduplication.

Which deduplication is better?


Thus, there are currently two different approaches to deduplication in the storage world: in backup systems and in general-purpose systems. The technologies they use also differ: variable-block and fixed-block, respectively.

The distinction between the two methods of deduplication.

Fixed-block deduplication is easier to implement. It is well suited to data that is accessed regularly, so it is more commonly used in general-purpose storage systems. Its main disadvantage is a weaker ability to recognize identical data sequences in the overall stream: two identical streams with a small offset will be perceived as completely different and will not be deduplicated.

Variable-block deduplication recognizes repetitions in a data stream better, but it needs more processor resources to do so. In addition, it is poorly suited for providing block or multi-threaded access to data. This is due to the way the deduplicated information is stored: to put it simply, it is also stored in variable-size blocks.
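A rough Python sketch of the difference (the chunking rule here is deliberately simplistic and invented for illustration; real variable-block engines use more sophisticated rolling hashes): fixed-size blocks stop matching after a small offset, while content-defined boundaries re-align with the data.

    import hashlib, os

    def fixed_chunks(data, size=4096):
        # General-purpose storage style: cut at fixed offsets
        return [data[i:i + size] for i in range(0, len(data), size)]

    def variable_chunks(data, window=16, mask=0x3F, min_len=512, max_len=8192):
        # Naive content-defined chunking: cut where a rolling byte sum hits a pattern,
        # so boundaries depend on the data itself, not on absolute offsets
        chunks, start, i = [], 0, 0
        while i < len(data):
            i += 1
            length = i - start
            at_pattern = sum(data[max(start, i - window):i]) & mask == 0
            if (length >= min_len and at_pattern) or length >= max_len:
                chunks.append(data[start:i])
                start = i
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    def shared_fraction(chunker, a, b):
        # Jaccard overlap of chunk fingerprints between two streams
        fa = {hashlib.sha256(c).hexdigest() for c in chunker(a)}
        fb = {hashlib.sha256(c).hexdigest() for c in chunker(b)}
        return len(fa & fb) / len(fa | fb)

    base = os.urandom(256 * 1024)    # 256 KiB of data
    shifted = b"\x00" * 7 + base     # the same data with a 7-byte offset in front

    print("fixed-block overlap:   ", shared_fraction(fixed_chunks, base, shifted))
    print("variable-block overlap:", shared_fraction(variable_chunks, base, shifted))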

Both methods cope with their own tasks perfectly well, but with tasks outside their design everything is much worse.

Let's look at the situation that arises at the junction of the two technologies.

Problems with backing up deduplicated data


In the absence of coordinated interaction, the difference between the two approaches means that when a storage system holding already deduplicated data is backed up with deduplication, the data is first rehydrated ("un-deduplicated") every time and then deduplicated all over again as it is written to the backup system.

For example, suppose 10 TB of deduplicated production data is physically stored, with an overall ratio of 5:1. Then the following happens during backup:


In terms of processor resource usage, this is like pressing the gas and the brake at the same time. The question arises: can this be optimized somehow?
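A minimal arithmetic sketch of this round trip, assuming the 5:1 ratio applies uniformly and that the backup appliance re-deduplicates the rehydrated stream at a similar, purely illustrative ratio:

    def backup_round_trip(physical_tb, source_ratio, target_ratio):
        # The array must rehydrate its deduplicated data to feed the backup stream
        logical_tb = physical_tb * source_ratio
        # The backup appliance then deduplicates the very same stream all over again
        stored_on_target_tb = logical_tb / target_ratio
        return logical_tb, stored_on_target_tb

    logical, stored = backup_round_trip(physical_tb=10, source_ratio=5.0, target_ratio=5.0)
    print(f"Rehydrated and pushed through the backup infrastructure: ~{logical:.0f} TB")  # ~50 TB
    print(f"Physically stored on the backup target after re-dedup:   ~{stored:.0f} TB")  # ~10 TB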

The problem of restoring deduplicated data


When restoring data to volumes with deduplication enabled, the whole process has to be repeated in the opposite direction. Not all storage systems run this process on the fly; many solutions use the "post-process" principle: the data is first written to physical disks (even if those are flash) as is, then analyzed, data blocks are compared, duplicates are detected, and only then is the space reclaimed.


Comparison of in-line and post-process deduplication.

This means that during the first pass there may simply not be enough space in the storage system to fully restore all the rehydrated (non-deduplicated) data. In that case the restore has to be done in several passes, each of which can take a long time, made up of the recovery time plus the time to deduplicate and free space on the general-purpose storage system.
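A hedged sketch of this multi-pass problem under simplifying assumptions (a uniform dedup ratio, no metadata overhead, and space reclaimed only after each post-process run); the numbers are illustrative:

    def restore_passes(logical_tb, free_physical_tb, dedup_ratio):
        # Each pass can write only as much rehydrated data as there is free raw space;
        # post-process dedup then reclaims most of it, but the deduplicated residue stays.
        passes, remaining = 0, logical_tb
        while remaining > 0:
            if free_physical_tb <= 0:
                raise RuntimeError("not enough physical capacity to finish the restore")
            written = min(remaining, free_physical_tb)
            remaining -= written
            free_physical_tb -= written / dedup_ratio   # space kept by the deduplicated residue
            passes += 1
        return passes

    # 50 TB of logical data, 20 TB of free physical space, 5:1 ratio after post-processing
    print(restore_passes(logical_tb=50, free_physical_tb=20, dedup_ratio=5.0))   # -> 4 passes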

This possible scenario applies not so much to restoring individual data from a backup (mitigating risks of the Data Loss class) as to recovering from a catastrophically large data loss (which is classified as a disaster, i.e. Disaster). Such Disaster Recovery is, to put it mildly, far from optimal.

Besides, in the event of a catastrophic failure it is not at all necessary to restore all the data at once; it is enough to start with only the most essential data.

As a result, backup, which is meant to be the means of last resort you turn to when nothing else has worked, does not work optimally with deduplicating general-purpose storage systems.

Why, then, do we need a backup from which, in the event of a disaster, we can recover only with great difficulty, and almost certainly not completely? After all, the production storage system has built-in replication tools (mirroring, snapshots) that have no significant impact on performance (for example, VNX Snapshots, XtremIO Snapshots). The answer to this question remains the same. Still, any decent engineer would try to somehow optimize and improve this situation.

How to combine the two worlds?


Such an approach to handling data during backup and restore looks strange, to say the least. Therefore, many attempts have been made to optimize the backup and restore of deduplicated data, and a number of the problems have been solved.

Here are just a few examples:


But these are just "patches" at the level of operating systems and individual, isolated servers. They do not solve the problem at the level of the storage systems themselves, where it is genuinely hard to address.

The fact is that general-purpose storage systems and backup systems use different, purpose-built deduplication algorithms: fixed-block and variable-block, respectively.

On the other hand, a full backup is not always necessary, and a full restore is needed even less often, nor does all production data have to be deduplicated and compressed. Still, the nuances must be kept in mind, because catastrophic data loss has not gone anywhere. To prevent it, standard industry solutions have been developed, and they should be provided for in the operating regulations. If data cannot be recovered from a backup within an acceptable time, it can cost the people responsible their careers.



Let's look at how best to prepare for this situation and avoid unpleasant surprises.

Backup


Full data recovery, Disaster Recovery


These are some of my thoughts on backing up and restoring deduplicated data. I will be glad to hear your feedback and opinions on the subject.

And I must say that one interesting special case, which deserves separate consideration, has not been touched on here yet. So, to be continued.
Denis Serov

Source: https://habr.com/ru/post/270907/

