Today, many companies use RAID arrays of hard drives instead of traditional tape libraries for backup. The benefit of such a replacement is obvious: writing backups and restoring the original data from them is significantly faster, and it is easier to find the right backup or to verify that a backup matches the original. However, despite the gradual decline in the cost per gigabyte of hard disk capacity, hard drives remain much more expensive than magnetic tape by this measure.
StorageWorks D2D
In addition, tape libraries use removable media: a full tape cartridge can be removed from the library and sent to storage, with a blank cartridge inserted in its place. The capacity of a disk array cannot be scaled this way, so when no free space remains on the disks, you must either delete some old backups or attach additional disk shelves to the array (the latter is not always possible because of limitations of the array itself or a lack of space in the rack where the array is mounted).
To reduce the cost of storing backups on hard drives, many vendors offer their own implementations of deduplication, a technology that reduces the total volume of backups by identifying identical blocks of source data. Each such duplicate is stored only once and, depending on the type of source data, the reduction in backup volume can reach two orders of magnitude.
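The basic idea can be illustrated with a minimal sketch (my own illustration, not any vendor's implementation): the backup stream is split into chunks, each chunk is identified by its cryptographic hash, and only chunks whose hash has not been seen before are actually written to the store. The chunk size and data structures here are arbitrary choices for clarity.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for illustration

chunk_store = {}   # hash -> chunk data: the deduplicated store

def backup(stream: bytes) -> list[str]:
    """Store a data stream, returning the list of chunk hashes (its 'recipe')."""
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk  # new data: store it once
        recipe.append(digest)            # duplicates add only a reference
    return recipe

def restore(recipe: list[str]) -> bytes:
    """Reassemble the original stream from the stored chunks."""
    return b"".join(chunk_store[d] for d in recipe)
```

Backing up two identical or largely similar streams with this scheme consumes disk space only for the unique chunks; each repeated backup adds little more than a list of hashes.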
At the HP Technology Forum 2010 at the end of June, Hewlett-Packard presented its own approach to deduplication. Unlike its main competitors in the storage market, HP did not spend money acquiring companies specializing in niche deduplication solutions, but instead drew on the work of scientists at HP Labs.
One of the main problems with inline deduplication is the need to analyze, on the fly, a data stream arriving at several hundred megabytes per second and look up duplicates in an index of previously stored data. As long as the entire index fits in the computer's RAM, such lookups are fast, but as backup volumes grow the index no longer fits in RAM and has to be pushed out to the hard disk. Read/write speed then drops sharply, and access to the index starts to bottleneck the deduplication process.
The StoreOnce technology created at HP Labs uses an algorithm called Sparse Indexing, in which only a sample of the index is kept in RAM while the bulk of the index resides on the hard disk. Sparse Indexing relies on the fact that duplicates usually arrive in clusters: if duplicates already exist for the first data block, duplicates will most likely be found for the subsequent blocks as well. Sparse Indexing writes the hash pointers of a series of data blocks to the hard disk sequentially, so when a duplicate of a new data block is found in the in-RAM sample, pointers to the likely duplicates of the following blocks can be quickly loaded from the hard disk into RAM (a detailed description of this technology from HP Labs: www.hpl.hp.com/personal/Mark_Lillibridge/Sparse/final.pdf ).
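A hedged sketch of this idea follows; the names (SAMPLE_MASK, dedupe_segment, the dict-based "disk") are my own simplifications, not HP's API, and the real algorithm in the paper is considerably more involved. Only a deterministic sample of chunk hashes ("hooks") lives in the RAM index; each hook points at on-disk manifests of previously stored segments. When a hook of an incoming segment matches, the matching manifests are loaded from disk and used to deduplicate the neighboring chunks too, exploiting the clustering of duplicates described above.

```python
import hashlib

SAMPLE_MASK = 0x3F  # sample ~1/64 of hashes: those whose low 6 bits are zero

def chunk_hash(chunk: bytes) -> int:
    return int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")

sparse_index = {}   # in RAM: hook hash -> ids of manifests containing it
manifests = {}      # stands in for on-disk storage: manifest id -> set of hashes

def dedupe_segment(segment_chunks: list[bytes], manifest_id: int) -> int:
    """Deduplicate one segment; return how many of its chunks were duplicates."""
    hashes = [chunk_hash(c) for c in segment_chunks]
    hooks = [h for h in hashes if h & SAMPLE_MASK == 0]

    # Load the manifests that share a hook with this segment (simulated disk
    # reads). Their full chunk lists let us catch duplicates whose hashes were
    # never sampled into the RAM index.
    known = set()
    for hook in hooks:
        for mid in sparse_index.get(hook, []):
            known |= manifests[mid]

    duplicates = sum(1 for h in hashes if h in known)

    # Record the new segment: full manifest to "disk", only hooks to RAM.
    manifests[manifest_id] = set(hashes)
    for hook in hooks:
        sparse_index.setdefault(hook, []).append(manifest_id)
    return duplicates
```

The design point is that RAM holds roughly 1/64 of the hashes in this toy configuration, yet a single hook match pulls in an entire segment's worth of pointers, so most duplicates in a clustered stream are still found.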
HP will use StoreOnce for deduplication across its entire StorageWorks D2D line of disk backup systems, including the twelve-disk D2D4312 model announced at the HP Technology Forum 2010, which scales to 36 TB of usable capacity. The hardware of all these systems consists of standard-architecture HP ProLiant DL servers; for example, the HP StorageWorks D2D4312 shown in the photo is built on the two-socket HP ProLiant DL370. In addition, the company plans to integrate StoreOnce with its HP Data Protector backup package and with similar software from other vendors, as well as to use it in storage systems and to implement the technology in virtual machines.