In the field of business continuity there are many problems associated with the rapid growth of data in modern IT infrastructures. In my opinion, two stand out: where to store all this data, and how to back it up in an acceptable amount of time.
Indeed, data growth of terabytes per year is a very real scenario for a large organization today. But how do you store and back it all up efficiently? After all, there are only 24 hours in a day, and the backup window cannot grow indefinitely (unlike the data itself). In this article I want to show how deduplication can help reduce the severity of this problem.
In a broad sense, there are two main types of deduplication: file-level (single-instance storage, SIS) and block-level.
As a rule, the more granular the deduplication scheme, the greater the space savings in storage.
What is SIS? The essence of the method is simple: if two files are absolutely identical, one of them is replaced by a link to the other. This mechanism works well in mail servers (Exchange, for example) and in databases. For instance, if a corporate mail user sends an email with an attachment to several recipients, the attachment is stored in the Exchange database only once.
Sounds great! But only as long as the files are absolutely identical. If one of the identical files is changed by even a single byte, a separate full copy of it is created, and deduplication efficiency drops.
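To make the idea concrete, here is a minimal sketch of file-level deduplication in Python: files with identical content are found by hashing, and duplicates are replaced with hard links. It is an illustration only, not how Exchange implements SIS, and the directory path is hypothetical.

```python
import hashlib
import os
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of the whole file; identical files have identical digests."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def single_instance_store(root: Path) -> None:
    """Replace byte-for-byte identical files under `root` with hard links."""
    seen: dict[str, Path] = {}  # digest -> first file seen with that content
    for path in root.rglob("*"):
        if not path.is_file() or path.stat().st_nlink > 1:
            continue  # skip directories and files that are already links
        digest = file_digest(path)
        original = seen.get(digest)
        if original is None:
            seen[digest] = path          # first occurrence: keep the data
        else:
            path.unlink()                # duplicate: keep only a link to the original
            os.link(original, path)

if __name__ == "__main__":
    single_instance_store(Path("/data/mailstore"))  # hypothetical directory
```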
Block deduplication works at the level of the data blocks written to disk, using hash functions to decide whether a block is a duplicate or unique. The deduplication system keeps a hash table for all data blocks stored in it. As soon as it finds matching hashes for different blocks, it stores the block as a single instance plus a set of references to it. Blocks coming from different computers can also be compared (global deduplication), which further increases efficiency, since the disks of different machines running the same operating system hold a lot of repeating data. It is worth noting that the highest efficiency is achieved by reducing the block size and maximizing the block repetition ratio. Hence there are two methods of block deduplication: with a fixed (predetermined) block length and with a variable block length (chosen dynamically for the specific data).
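To illustrate the mechanics, here is a toy sketch of fixed-length block deduplication with a hash table. Real products use far more elaborate indexing and often variable-length chunking; the file names and sizes below are made up.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed (predetermined) block length

class BlockStore:
    """Toy block-level deduplicating store: unique blocks plus references."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}     # hash -> unique block, stored once
        self.files: dict[str, list[str]] = {}  # file name -> ordered block hashes

    def put(self, name: str, data: bytes) -> None:
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # A matching hash means the block is already stored; keep only a reference.
            self.blocks.setdefault(digest, block)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name: str) -> bytes:
        # Reassemble the original data from the referenced blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

store = BlockStore()
store.put("vm1.img", b"A" * 8192 + b"B" * 4096)
store.put("vm2.img", b"A" * 8192 + b"C" * 4096)   # shares blocks with vm1.img
assert store.get("vm1.img") == b"A" * 8192 + b"B" * 4096
print(f"logical blocks: 6, unique blocks stored: {len(store.blocks)}")  # -> 3
```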
Most vendors of products with deduplication support focus on the backup market. There, over time, backup copies can take up two to three times more space than the original data itself. File-level deduplication has therefore been used in backup products for a long time, although under certain conditions it may not be enough. Adding block deduplication can significantly improve storage utilization and make it easier to meet system resiliency requirements.
Another way to use deduplication is on the production servers themselves. This can be done by the OS, by additional software, or by the storage hardware. Here you need to be careful: Windows Server 2008, for example, is positioned as capable of data deduplication but actually does only SIS. Storage systems, on the other hand, can deduplicate at the block level while presenting the file system to the user in its expanded (original) form, hiding all the details of deduplication. Suppose a storage system holds 4 GB of data that deduplicates down to 2 GB. When the user accesses this storage, they will still see 4 GB of data, and that is exactly the amount that will end up in backup copies.
The percentage of disk space saved is the figure most easily manipulated in claims like "95% reduction in backup sizes." However, the algorithm used to calculate this ratio may have little to do with your particular situation. The first variable to take into account is file type: formats such as ZIP, CAB, JPG, MP3, and AVI are already compressed and yield a lower deduplication rate than uncompressed data. Equally important are how often the data changes and how much historical data is retained. If you are using a product that deduplicates existing data on a file server, there is little to worry about. But if deduplication is part of a backup system, you also need to consider how much of the data changes between backups and how long those backups are kept.
Deduplication savings are easy to estimate online with special calculators, but such estimates may tell you little about your particular situation. As you can see, the percentage depends on many factors: in theory it can reach 95%, but in practice it may amount to only a few percent.
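To get a feel for where figures like 95% come from, here is a back-of-the-envelope estimate. All numbers are assumptions for illustration (daily backups of a mostly static data set); compressed or rapidly changing data will give a far lower result.

```python
# Rough estimate of backup storage with and without deduplication.
# All numbers below are illustrative assumptions; substitute your own.
primary_data_gb = 1000          # size of the protected data set
daily_change_rate = 0.02        # ~2% of blocks change between daily backups
retained_backups = 30           # daily backups kept for a month

without_dedup = primary_data_gb * retained_backups  # 30 full copies on disk
with_dedup = primary_data_gb + primary_data_gb * daily_change_rate * (retained_backups - 1)

savings = 1 - with_dedup / without_dedup
print(f"without dedup: {without_dedup:.0f} GB")
print(f"with dedup:    {with_dedup:.0f} GB")
print(f"savings:       {savings:.0%}")  # ~95% under these optimistic assumptions
```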
Speaking about backup deduplication, it is also important to know where and how fast it is performed. There are three main approaches: deduplication at the source, at the target, and in transit.
Source deduplication runs on the device where the original data is located. Any data marked for backup is divided into blocks, and a hash is calculated for each of them. There are three potential problems with this approach.
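A minimal sketch of the source side of such a scheme, assuming a hypothetical repository interface (the `missing` and `store` methods are invented for illustration):

```python
import hashlib

BLOCK_SIZE = 4096

def backup_from_source(data: bytes, repository) -> list[str]:
    """Source-side deduplication sketch: hash blocks on the client and send
    only those the repository does not already hold.

    `repository` is a hypothetical object with two methods:
      missing(hashes) -> subset of hashes not yet stored
      store(digest, block) -> persist one unique block
    """
    blocks = {}
    order = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()  # hashing costs CPU on the production host
        blocks[digest] = block
        order.append(digest)

    # Only blocks unknown to the repository cross the network.
    for digest in repository.missing(list(blocks)):
        repository.store(digest, blocks[digest])

    return order  # the "recipe" needed to restore the original data
```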
Target deduplication works differently: suppose data from all computers is sent to a single backup repository. Once the data has been received, the repository can build a hash table of its blocks. The first advantage of this method is the larger volume of data: the bigger the data pool, the bigger the hash table and, accordingly, the greater the chance of finding identical blocks. The second advantage is that the whole process takes place outside the production network.
However, this option does not solve all the problems. There are some points that need to be taken into account.
Transit deduplication is described as a process that happens while data is transferred from source to target. The term is a bit misleading: data is not actually deduplicated "on the wire." What it really means is that the data collected in the target device's memory is deduplicated there before being written to disk. This takes disk seek time out of the equation. Transit deduplication can be considered the best form of target deduplication: it has all the advantages of a global view of the data, offloads the hashing process from the source, and avoids the drawbacks of slow disk I/O.
However, it still means a lot of network traffic and the possibility of hash collisions, and of all the methods listed it requires the most computing resources (CPU and memory) on the target device.
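A minimal sketch of the inline (transit) variant, assuming blocks arrive at the target one at a time and unique blocks are stored as individual files named by their hash; this is a simplification for illustration, not how any particular appliance works.

```python
import hashlib
from pathlib import Path

class InlineTarget:
    """Transit (inline) deduplication sketch: incoming blocks are deduplicated
    in the target's memory, so only unique blocks ever reach the disk."""

    def __init__(self, store_dir: Path) -> None:
        self.store_dir = store_dir
        self.known: set[str] = set()  # in-memory hash index, no disk lookups

    def receive(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.known:  # decision made in memory, before any write
            (self.store_dir / digest).write_bytes(block)
            self.known.add(digest)
        return digest  # reference to record in the backup metadata
```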
Deduplication technologies can help reduce storage costs, but the type of deduplication should be chosen carefully. Ultimately, deduplication allows a company to slow the growth of the cost of storing its ever-growing data.
Source: https://habr.com/ru/post/203614/