The market for disk-based backup storage is measured in billions of dollars. It includes a number of companies whose products are known throughout the world: EMC DataDomain, Symantec NetBackup, HP StoreOnce, IBM ProtectTier, ExaGrid, and others. How did this market start, in which technological direction is it developing now, and how can different deduplication software products and appliances be compared with each other?

The first storage systems with deduplication appeared in the early 2000s. They were created to solve the problem of backing up exponentially growing data. As data in companies' production systems grew, full backups to tape took so long that they no longer fit into the backup window, while the disk storage systems available at the time lacked the capacity to serve as backup storage. As a result, backups could fail either for lack of time (in the case of tapes) or for lack of space (in the case of disks). The space problem could be solved by buying a high-capacity storage system, but then the cost of storage became the problem.
Backup software products were originally designed on the assumption that the backup target is a tape drive and that backups follow a father-son-grandson rotation scheme:
- "Father" (a full backup once a week),
- "Son" (incremental copies on the other six days of the week),
- "Grandson" (an older full backup, usually sent to off-site storage).
This approach produced a rather large volume of backup data. It was relatively cheap with tapes, but with disks its cost rose significantly.
In those days, only a small number of backup software products offered built-in deduplication of backup data. Storage systems with built-in deduplication appeared precisely to solve this problem: reducing the cost of storing data on disks (in the long term, down to the level of tape). A key factor in the success of these new devices was that storage-side deduplication worked transparently and did not require any modifications to existing backup software.
However, since then almost all backup software products have acquired built-in deduplication, and the cost of disks (the original problem of disk storage) has dropped significantly. Moreover, many backup products can now deduplicate on the source side, that is, backup data is deduplicated before it is transferred to the backup repository. This reduces the load on the network, speeds up backups, and shrinks the backup window. For this reason, many disk storage systems now include functions for integrating with such software products.
Currently, storage systems positioned as backup targets face additional competitive pressure from systems designed to serve as primary storage for the production network (Primary Storage), since deduplication functionality is often included in them for free.
A logical question arises: why, then, are specialized Backup Target storage systems needed, and how should they be used? Summarizing information from different manufacturers of such systems, they follow three strategies:
- Claiming that (under certain conditions) deduplication on a Backup Target storage system has advantages over the deduplication built into backup products;
- Positioning their storage systems not only as a place to keep the backup repository, but also as a possible location for the organization's electronic document archive;
- Bundling backup software with the storage system, or integrating the storage system with backup software (including software from other manufacturers).
Strategy #1 (which deduplication is better?)
When comparing, manufacturers' arguments come down to a comparative analysis of deduplication ratios, backup-window duration, total equivalent storage capacity, and replication efficiency. In fact, however, this analysis depends heavily on "environmental factors", that is, on the test conditions: if the client's actual conditions differ from the test ones, the measured ratios will differ as well.
Take, for example, the deduplication ratio. Here we need to define precisely what is measured and how. Some manufacturers state that their products achieve a deduplication ratio of 30 to 1. That sounds impressive, of course. Yet other manufacturers report a ratio ten times lower, say 3 to 1. Does this mean the first group's products are better than the second's? No, because the ratios were calculated on different data sets, which is why the results differ so much. A "deduplication ratio" quoted as a constant is therefore more of a marketing term: different manufacturers measure it on different data, so it cannot be used to compare products unless you can test the products yourself on the same prepared data set. At the moment there is no industry (or even de facto) standard for assessing the deduplication ratio. In the antivirus industry, for example, there is the EICAR standard test file, which any antivirus must detect. A reference test data set could likewise be created for measuring the deduplication ratio of different software products and storage systems, but in reality no such reference exists.
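To make the data dependence concrete, here is a minimal sketch of fixed-size chunk deduplication (not tied to any vendor's actual algorithm; the chunk size and the synthetic data sets are assumptions for illustration). The same code reports wildly different ratios depending only on how repetitive the input stream is.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products vary

def dedup_ratio(data: bytes) -> float:
    """Logical size divided by physical size after fixed-size chunk deduplication."""
    unique_hashes = set()
    total_chunks = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        unique_hashes.add(hashlib.sha256(chunk).hexdigest())
        total_chunks += 1
    return total_chunks / len(unique_hashes)

# Highly repetitive stream (think weekly full backups of barely changing data)
repetitive = (b"A" * CHUNK_SIZE) * 300
# Mostly unique stream (think already compressed or encrypted backups)
unique_data = os.urandom(CHUNK_SIZE * 300)

print(f"repetitive data: {dedup_ratio(repetitive):.0f}:1")   # ~300:1
print(f"unique data:     {dedup_ratio(unique_data):.0f}:1")  # ~1:1
```

The same engine yields roughly 300:1 on one stream and 1:1 on the other, which is exactly why a ratio quoted without the underlying data set says very little.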
A discrepancy in deduplication ratios can also arise from differences in the backup process itself when different products are used. Suppose a backup software product runs a scheme with one full copy per week and incremental copies on the other days, and performs deduplication and compression of the backups itself. Now compare this with a Backup Target storage system that, say, receives a full copy of the disk volume every time and deduplicates it before writing the data to disk. In the second case the deduplication ratio will be much higher, while the actual savings in repository disk space will, on the contrary, be much smaller.
At the same time, the repository disk space saved over a given period of time (and not the deduplication ratio) is ultimately the correct criterion for comparing deduplication tools. Unfortunately, it usually cannot be determined in advance, before the purchase.
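A back-of-the-envelope illustration of the previous two paragraphs, with made-up numbers (a 1 TB volume, 5% daily change, one week of backups, and assumed 2:1 reduction factors):

```python
volume_tb = 1.0       # size of the protected volume (assumed)
daily_change = 0.05   # share of the volume that changes each day (assumed)

# Scheme A: backup software sends 1 full + 6 incrementals, then dedups/compresses ~2:1
ingested_a = volume_tb + 6 * volume_tb * daily_change      # 1.3 TB of logical data
stored_a = ingested_a / 2.0                                # 0.65 TB on disk

# Scheme B: a full copy goes to the Backup Target appliance every day, dedup on arrival
ingested_b = 7 * volume_tb                                 # 7 TB of logical data
stored_b = volume_tb + 6 * volume_tb * daily_change * 0.5  # 1.15 TB on disk
                                                           # (changed blocks assumed to shrink 2:1)

print(f"Scheme A: ratio {ingested_a / stored_a:.1f}:1, stored {stored_a:.2f} TB")
print(f"Scheme B: ratio {ingested_b / stored_b:.1f}:1, stored {stored_b:.2f} TB")
```

With these assumed numbers, scheme B reports a ratio of roughly 6:1 against 2:1 for scheme A, yet occupies almost twice as much repository space, which is precisely the gap between the advertised ratio and real savings.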
"
Equivalent storage capacity " (or the size of storage that is required to store data without deduplication) is another, but also purely marketing criterion, since it is based on the same deduplication factor and calculated through it (manufacturers simply multiply the actual usable capacity deduplication storage ratio). In the end, using one controversial coefficient, get another controversial coefficient.
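A minimal sketch of how this derived figure is obtained, assuming only the definition given above (usable capacity multiplied by the quoted ratio):

```python
def equivalent_capacity_tb(usable_tb: float, quoted_dedup_ratio: float) -> float:
    """'Equivalent capacity' is just usable capacity scaled by the quoted dedup ratio."""
    return usable_tb * quoted_dedup_ratio

# The same 100 TB appliance looks radically different depending on the ratio assumed
print(equivalent_capacity_tb(100, 30))  # 3000 TB "equivalent"
print(equivalent_capacity_tb(100, 3))   # 300 TB "equivalent"
```

Whatever uncertainty sits in the quoted ratio is inherited, multiplied, by the "equivalent" figure.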
Sometimes an equivalent backup performance ratio is used. The idea behind it is that the user installs a special software client that performs initial deduplication on the source side (to minimize network traffic) and then sends the data to the Backup Target storage system, where global deduplication is performed (to minimize disk space). Such clients are usually installed on database servers, application servers, and the backup server. Equivalent backup performance, measured in terabytes per hour, is defined as the amount of data actually written to the storage system per hour multiplied by... the deduplication ratio. Obviously, comparing different storage systems by this figure, if it is quoted in product materials, would be incorrect. That said, the very idea of combining two kinds of deduplication (on the source side and on the storage side) is a good one and can be used by IT providers working with different clients, or within a company for centralized backup of distributed servers.
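As an illustration of combining source-side and storage-side deduplication, here is a simplified, hypothetical client flow (the chunking scheme and the way the repository's known hashes are exposed are assumptions, not any vendor's actual protocol): the client hashes chunks locally and ships only the chunks the repository does not already hold.

```python
import hashlib
import os
from typing import Iterable

CHUNK_SIZE = 4096  # assumed fixed-size chunking, for simplicity

def chunks(data: bytes) -> Iterable[bytes]:
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

def backup(data: bytes, repo_hashes: set) -> int:
    """Return the number of chunks that actually had to cross the network.
    repo_hashes stands in for the exchange in which the appliance reports
    which chunk hashes it already stores (global, storage-side dedup)."""
    sent = 0
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repo_hashes:   # only previously unseen chunks are transferred
            repo_hashes.add(digest)
            sent += 1
    return sent

repo = set()
first = os.urandom(CHUNK_SIZE * 100)                        # initial full backup
second = first[:CHUNK_SIZE * 99] + os.urandom(CHUNK_SIZE)   # next day, one chunk changed
print(backup(first, repo), backup(second, repo))            # ~100 chunks, then ~1 chunk
```

In a real product the hash lookup is a network exchange with the appliance, which then deduplicates globally across all clients; the sketch only shows why source-side hashing cuts traffic so sharply when most of the data is unchanged.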
Only the transfer rate of the original data can be considered an objective metric.
Strategy #2 (Backup Target storage as an electronic archive)
Repositioning Backup Target storage systems as systems that can store not only the backup repository but also the organization's electronic archives is a good idea. However, the requirements in these two cases differ significantly. Archives, unlike backups, by their very nature rarely contain duplicate information. Archives must support fast search for individual items, whereas backup copies are accessed relatively rarely. These differences in requirements suggest that storage systems for the two tasks should still have different architectures. Manufacturers are taking steps in this direction, for example by changing the file system architecture of their storage systems; in doing so, however, they are essentially moving toward a universal file system and a universal storage system (and the competition with universal storage systems was already mentioned above).
Strategy #3 (integration of storage systems with backup software)
As for integrating backup software with storage systems, the idea looks very reasonable, provided the integration happens not only in marketing materials but at the technological level. For example, storage systems take hardware snapshots of their own disks as efficiently as possible (giving the lowest practically achievable RPO, since any software implementation by a third-party vendor would most likely be slower). Meanwhile, backup software products handle the other important backup functions well: building the repository and organizing long-term backup storage, testing backups, and quickly restoring data after a failure (minimizing RTO). Such a technological "symbiosis" between backup software vendors and hardware storage vendors yields the most effective solutions for the user.
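A schematic, hypothetical sketch of such a division of labor (the StorageArray and BackupCatalog interfaces below are illustrative placeholders, not a real vendor API): the array captures the snapshot quickly, while the backup software catalogs, verifies, and manages it over the long term.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Snapshot:
    volume: str
    created_at: datetime

class StorageArray:
    """Stands in for the array: hardware snapshots are near-instant, keeping RPO low."""
    def create_snapshot(self, volume: str) -> Snapshot:
        return Snapshot(volume=volume, created_at=datetime.now())

class BackupCatalog:
    """Stands in for the backup software: retention, verification, restore (the RTO side)."""
    def __init__(self) -> None:
        self.entries: list = []
    def register(self, snap: Snapshot) -> None:
        self.entries.append(snap)
    def verify(self, snap: Snapshot) -> bool:
        return True  # a real product would mount and test the copy here

array, catalog = StorageArray(), BackupCatalog()
snap = array.create_snapshot("prod_db_volume")   # array side: fast capture
catalog.register(snap)                           # software side: long-term management
assert catalog.verify(snap)
```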
In conclusion
- Over the past 10 years, the market for deduplication products and devices has evolved technologically: it has become advantageous for them to complement each other in functionality. There has been a shift from deduplication in the backup repository to deduplication on the source side, or to a combination of the two approaches.
- There is no point in comparing deduplication effectiveness by "deduplication ratios" and metrics derived from them, since they depend strongly on the source data, the nature of its daily changes, network bandwidth, and other "environmental" factors.
- When designing a backup infrastructure today, it is best not to look separately at "hardware storage systems with deduplication" and "backup software products", but at integrated, complementary bundles of software plus storage.
Additional links