
Continuing the theme of backup and recovery on storage systems with a new architecture, let's look at the nuances of working with deduplicated data in a disaster recovery scenario, where the protected storage systems have their own deduplication: namely, how this space-saving technology can help or hinder data recovery.
The previous article is here:
Pitfalls of backup in hybrid storage systems.
Introduction
Since deduplicated data takes up less disk space, it is logical to assume that backup and recovery should also take less time. Indeed, why not back up and restore deduplicated data directly in its compact, deduplicated form? In that case:
- Only unique data is placed in the backup.
- No need to rehydrate (re-expand) the data on the production system.
- No need to deduplicate the data again on the backup system.
- Conversely, you can restore only the unique data needed for reconstruction, and nothing extra.
But if you look at the situation more closely, it turns out that not everything is so simple, and the direct path is not always the most efficient, if only because general purpose storage systems and backup storage systems use different kinds of deduplication.
Deduplication in general purpose storage systems
Deduplication, as a method of eliminating redundant data and increasing storage efficiency, has been and remains one of the key areas of development in the storage industry.
The principle of deduplication.
For production data, deduplication is intended not only, and not so much, to save disk space as to speed up access to data thanks to its denser placement on fast media. In addition, deduplicated data is convenient to cache.
A single deduplicated block in the cache, on the top tier of multi-tier storage, or simply placed on flash can correspond to dozens or even hundreds of identical user data blocks that previously occupied physical disks at completely different addresses and therefore could not be cached effectively.
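To make the mechanics concrete, here is a minimal, purely illustrative Python sketch of fixed-block deduplication with reference counting (the class and block size are hypothetical, not any vendor's implementation): one stored block can serve many logical addresses, which is exactly what makes it so cache-friendly.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size

class DedupStore:
    """Toy fixed-block dedup store: identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}       # content hash -> the single "physical" copy
        self.refcount = {}     # content hash -> number of logical blocks pointing to it
        self.logical_map = {}  # (volume, lba) -> content hash

    def write(self, volume, lba, data):
        h = hashlib.sha256(data).hexdigest()
        if h not in self.blocks:           # new unique block: store it once
            self.blocks[h] = data
            self.refcount[h] = 0
        self.refcount[h] += 1              # one more logical address maps to it
        self.logical_map[(volume, lba)] = h

    def read(self, volume, lba):
        return self.blocks[self.logical_map[(volume, lba)]]

store = DedupStore()
same_block = b"A" * BLOCK_SIZE
for lba in range(100):                     # 100 logical blocks with identical content
    store.write("vol1", lba, same_block)

# One physical block serves 100 logical addresses, so caching it once
# effectively caches all 100 of them.
print(len(store.blocks), store.refcount[hashlib.sha256(same_block).hexdigest()])
```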
Today, deduplication on general purpose storage is very effective and profitable. For example:
- On an all-flash array, you can fit significantly more logical data than its raw capacity would normally allow.
- In hybrid systems, deduplication helps to identify "hot" data blocks, since only unique data is retained. And the higher the deduplication ratio, the more accesses hit the same blocks, which means the higher the efficiency of tiered storage.
Solving storage tasks with a combination of deduplication and tiering. In each variant, equal performance and capacity are achieved.
Deduplication in backup storage systems
Deduplication first became widespread in backup systems. Because data blocks of the same kind are copied to the backup storage dozens, or even hundreds, of times, eliminating that redundancy yields substantial space savings. At one time this enabled the "offensive" of disk backup libraries with deduplication against tape: disk put serious pressure on tape, because the cost of storing backups on disk became very competitive.
The advantage of deduplicated disk backup.
As a result, even such tape adherents as Quantum began to develop disk libraries with deduplication.
Which deduplication is better?
So at the moment the storage world has two different approaches to deduplication: in backup systems and in general purpose systems. The technologies they use differ as well: variable-block and fixed-block, respectively.
The difference between the two methods of deduplication.
Fixed-block deduplication is easier to implement. It is well suited for data that requires regular access, so it is more commonly used in general purpose storage systems. Its main disadvantage is a weaker ability to recognize identical data sequences in the overall stream: two identical streams with a small offset will be treated as completely different and will not be deduplicated.
Variable-block deduplication recognizes repetitions in a data stream better, but it needs more processor resources to do so. In addition, it is poorly suited for providing block-level or multi-threaded access to data. This is due to the way the deduplicated information is stored: put simply, it is also kept in variable-size blocks.
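The offset sensitivity described above is easy to demonstrate. The sketch below is a toy comparison, not a production algorithm: fixed-size chunking versus a simplified content-defined (variable-block) chunker whose boundary rule is an arbitrary hash condition chosen for illustration. With a one-byte shift, the fixed-block chunker finds essentially no common chunks, while the content-defined chunker resynchronizes after the first boundary.

```python
import hashlib
import random

def fixed_chunks(data, size=64):
    """Fixed-block chunking: cut at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=16, divisor=32, min_size=16):
    """Simplified content-defined chunking: cut where a hash of the last
    `window` bytes meets an arbitrary condition, so boundaries follow content."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if i - start < min_size:
            continue
        if hashlib.sha1(data[i - window:i]).digest()[-1] % divisor == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def overlap(a, b, chunker):
    """Share of chunks the two streams have in common (Jaccard index)."""
    sa, sb = set(chunker(a)), set(chunker(b))
    return len(sa & sb) / len(sa | sb)

rnd = random.Random(0)
base = bytes(rnd.getrandbits(8) for _ in range(16384))  # 16 KB of data
shifted = b"X" + base                                   # same data, shifted by one byte

print("fixed-block overlap:    %.2f" % overlap(base, shifted, fixed_chunks))
print("variable-block overlap: %.2f" % overlap(base, shifted, variable_chunks))
```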
Both methods cope with their own tasks perfectly well; with tasks outside their design, things are much worse.
Let's look at the situation that arises at the interface of the two technologies.
Problems of backing up deduplicated data
In the absence of coordinated interaction, the difference between the two approaches means that when a storage system already holding deduplicated data is backed up with deduplication, the data is rehydrated every time and then deduplicated all over again as it is written to the backup system.
For example, suppose 10 TB of production deduplicated data is physically stored with an overall ratio of 5:1. Then, during the backup process, the following happens (a rough calculation follows the list):
- Not 10 but a full 50 TB are copied.
- The production system holding the original data has to do the work of rehydrating ("re-expanding") it, while at the same time serving production applications and feeding the backup data stream. That is three simultaneous heavy processes loading the I/O buses, cache memory, and processor cores of both storage systems.
- The target backup system has to deduplicate the data all over again.
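A back-of-the-envelope calculation for this example (the 2 GB/s end-to-end throughput is an assumed figure, purely for illustration):

```python
# Back-of-the-envelope for the example above; the throughput figure is assumed.
physical_tb = 10        # deduplicated data physically stored on the production array
dedup_ratio = 5         # overall deduplication ratio 5:1
throughput_gb_s = 2     # assumed end-to-end backup throughput, GB/s

logical_tb = physical_tb * dedup_ratio              # what actually has to travel
hours = logical_tb * 1024 / throughput_gb_s / 3600  # copy time at that throughput

print(f"Rehydrated data to copy: {logical_tb} TB")
print(f"Time at {throughput_gb_s} GB/s: {hours:.1f} hours")
```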
In terms of processor resource usage, this is like pressing the gas and the brake at the same time. The question arises: can this be optimized somehow?
The problem of restoring deduplicated data
When restoring data to volumes with deduplication enabled, the whole process has to be repeated in the opposite direction. Not all storage systems run this process on the fly; many solutions use the "post-process" principle. That is, the data is first written to physical disks (even if flash) as is, then analyzed, data blocks are compared, duplicates are detected, and only then is space reclaimed.
Comparison of in-line and post-process deduplication.
This means that during the first pass the storage system may simply not have enough space to fully restore all of the non-deduplicated data. The restore then has to be done in several passes, each of which can take a long time, consisting of restore time plus deduplication time to free up space on the general purpose storage system.
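A rough model of such a multi-pass restore might look like the sketch below. It is only an estimate under stated assumptions: the 15 TB of initial free space is a hypothetical figure, and the model assumes the deduplication ratio holds after each post-process run.

```python
def restore_passes(logical_tb, free_buffer_tb, dedup_ratio):
    """Estimate restore passes on a post-process array: each pass writes at most
    the currently free space worth of rehydrated data, which shrinks to
    written / dedup_ratio once the post-process deduplication has run."""
    passes, remaining, free = 0, logical_tb, free_buffer_tb
    while remaining > 0:
        if free <= 0:
            return None                   # the restore will not fit at all
        passes += 1
        written = min(free, remaining)    # rehydrated data written in this pass
        remaining -= written
        free -= written / dedup_ratio     # space permanently consumed after dedup
    return passes

# The article's example: 50 TB logical at 5:1, with an assumed 15 TB free at the start.
print(restore_passes(logical_tb=50, free_buffer_tb=15, dedup_ratio=5))  # -> 5 passes
```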
This scenario relates not so much to recovering from ordinary data loss (minimizing risks of the Data Loss class) as to recovering from a catastrophically large loss (classified as a Disaster). And such Disaster Recovery is, to put it mildly, not optimal.
In addition, after a catastrophic failure it is not at all necessary to restore all the data at once; it is enough to start with the most essential.
As a result, backup, which is meant to be the means of last resort, the one you turn to when nothing else has worked, does not work optimally in the case of deduplicating general purpose storage systems.
Why, then, do we need a backup from which, in the event of a disaster, we can recover only with great difficulty and almost certainly not completely? After all, there are replication tools built into the production storage system (mirroring, snapshots) that do not have a significant impact on performance (for example, VNX Snapshots, XtremIO Snapshots).
The answer to that question remains the same. Still, any normal engineer would try to somehow optimize and improve this situation.
How to combine the two worlds?
The old way of handling data during backup and restore looks strange, to say the least. Therefore, many attempts have been made to optimize the backup and restore of deduplicated data, and a number of problems have been solved.
Here are just a few examples:
But these are just "patches" at the level of operating systems and individual, isolated servers. They do not solve the problem at the level of the storage hardware itself, where it is genuinely hard to address.
The fact is that general purpose storage systems and backup systems use different, purpose-built deduplication algorithms: fixed-block and variable-block, respectively.
On the other hand, a full backup is not always necessary, and a full restore is needed even less often. Nor is it necessary to deduplicate and compress all production data. But the nuances must be kept in mind, because catastrophic data loss has not been canceled. To prevent it, standard industry solutions have been developed, and they should be provided for in the regulations. If data cannot be recovered from a backup within a reasonable time, it may cost the responsible people their careers.

Let's look at how best to prepare for this situation and avoid unpleasant surprises.
Backup
- If possible, use incremental backups and synthetic full copies. In Networker, for example, this feature is available starting from version 8 (a sketch of the synthetic full idea follows this list).
- Leave more time for full backups, taking into account the need to rehydrate the data. Choose a window of minimal processor utilization. During backups, keep an eye on the utilization of the production storage processors: it is better if it does not exceed 70%, at least on average, over the backup window.
- Use deduplication deliberately. If the data does not deduplicate or compress well, why waste processor power on it during backup? If the system always deduplicates, it should be powerful enough to handle all the work at once.
- Take into account the processor power allocated for deduplication in the storage system. This feature is now found even in entry-level systems, which do not always cope with running all tasks simultaneously.
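For illustration, here is a minimal, product-agnostic sketch of the synthetic full idea mentioned in the first item: a new full copy is assembled on the backup side from the previous full plus the incrementals, so the production storage does not have to rehydrate and resend everything. The data model (dicts of path to content) is purely hypothetical.

```python
def synthetic_full(last_full, incrementals):
    """Assemble a new full backup from the last full plus incrementals.
    Backups are modeled as dicts of path -> content; None marks a deletion."""
    result = dict(last_full)
    for inc in incrementals:               # apply incrementals oldest to newest
        for path, content in inc.items():
            if content is None:
                result.pop(path, None)     # the file was deleted in this increment
            else:
                result[path] = content     # the file was added or changed
    return result

full_sunday = {"/db/data1": "v1", "/db/data2": "v1"}
inc_monday = {"/db/data2": "v2"}
inc_tuesday = {"/db/data3": "v1", "/db/data1": None}

print(synthetic_full(full_sunday, [inc_monday, inc_tuesday]))
# {'/db/data2': 'v2', '/db/data3': 'v1'}
```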
Full data recovery, Disaster Recovery
- Prepare a sensible Disaster Recovery or Business Continuity plan that takes into account the behavior of storage systems with deduplication. Many vendors, including EMC, as well as system integrators, offer such planning as a service, because each organization has its own unique combination of factors affecting how applications are brought back online.
- If the general purpose storage system uses post-process deduplication, I would recommend reserving a free capacity buffer in it for recovery from backup. For example, the buffer size can be taken as 20% of the logical capacity of the deduplicated data. Try to maintain this margin at least on average.
- Look for opportunities to archive old data so that it does not get in the way of a quick recovery. Even if deduplication is good and effective, do not wait for a crash after which you will have to restore from backup and then deduplicate a volume of many dozens of TB. All non-operational or historical data is better moved to an online archive (for example, based on InfoArchive).
- On-the-fly (in-line) deduplication in general purpose storage systems has a speed advantage over post-process, and it can play a special role in recovery from a catastrophic loss.
These are some of my thoughts on backing up and restoring deduplicated data. I will be glad to hear your feedback and opinions on this topic.
I should add that one interesting special case, which deserves separate consideration, has not been touched on here yet. So, to be continued.
Denis Serov