Among the heads of large corporate IT consumers, the focus has finally shifted from business applications to the data those applications process. In the phrase "data processing center," the emphasis now deservedly falls on "data" rather than "processing." Along with the understanding of data's central role in business came a panicked fear of losing it. And not without reason: according to IDC statistics, after a prolonged loss of access to operational data, most companies face bankruptcy.
There are two fundamentally different approaches to ensuring the reliability of data storage. The first is backup. Two key concepts are associated with it: RPO (recovery point objective) and RTO (recovery time objective). RPO is the point in time whose state of the data the backup reflects; RTO is the time the backup/restore process takes. Naturally, as corporate data grows, RTO grows in proportion to its volume, and recovery points become less and less frequent. This means that the freshest, most valuable data is also the most vulnerable, and its share keeps growing.
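A back-of-the-envelope sketch in Python makes the scaling concrete. The restore throughput and nightly-backup schedule below are made-up figures for illustration, not measurements:

```python
# Back-of-the-envelope RPO/RTO estimate. All figures are
# illustrative assumptions, not measurements.

def rto_hours(data_tb: float, restore_tb_per_hour: float) -> float:
    """RTO grows linearly with the volume that must be restored."""
    return data_tb / restore_tb_per_hour

# Suppose a full backup runs nightly: the worst-case RPO is ~24 hours,
# i.e. a crash just before the next backup loses a full day of changes.
worst_case_rpo_hours = 24.0

for volume in (1, 10, 50):  # TB of corporate data
    print(f"{volume:>3} TB -> RTO ~ {rto_hours(volume, 0.5):.0f} h, "
          f"worst-case RPO ~ {worst_case_rpo_hours:.0f} h")
```

At an assumed 0.5 TB/hour, restoring 50 TB already takes over four days, while the worst-case RPO stays stuck at the backup interval: exactly the squeeze the paragraph above describes.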
The second approach is "data is always there": protecting the data directly in the storage system, right where it lives. That means an effectively real-time RPO and an RTO approaching zero. This approach is heavily promoted by the storage grandees (EMC in particular). The most popular way to provide protection under this concept is RAID (redundant array of independent disks; incidentally, "independent" used to be "inexpensive," which is hardly applicable to modern Fibre Channel disks). The principle is to combine several disks into a group that stores both the data and redundant information. There is little point in walking through all the RAID levels here, since we are interested in the most popular one: RAID5.
In a RAID5 group the data is "smeared" across all the disks, and so is the parity, the redundant information needed to reconstruct the data. The overhead is one disk's worth of capacity per group; for a five-disk group that works out to 25% of the useful data. RAID5 is designed so that the group survives the failure of a single disk at a time.
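The arithmetic behind RAID5 is plain XOR: the parity block is the XOR of the data blocks, and any one lost block is the XOR of everything that survives. A toy Python sketch (real controllers rotate parity across the disks and work at sector granularity, this only shows the math):

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR the blocks byte by byte to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A toy stripe across four data disks plus one parity disk.
stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p = parity(stripe)

# Simulate losing disk 2: XOR the survivors with the parity to rebuild it.
survivors = [blk for i, blk in enumerate(stripe) if i != 2]
rebuilt = parity(survivors + [p])
assert rebuilt == stripe[2]  # the lost block is fully recovered
```

The same property explains why a second simultaneous loss is fatal: with two blocks of a stripe missing, one XOR equation is not enough to recover them both.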
It would seem that with such storage technology the data really is always there. Let's see how much "always." The subtle point is that the group survives the failure of only one disk at a time. Even if you replace the failed disk instantly, the group needs time to rebuild, that is, to reconstruct the data and parity onto the new disk. The data remains available during this time, but if another disk fails during the rebuild, the group is destroyed. The more disks in the group and the larger each disk, the more often one of them fails and the longer the rebuild takes. It can reach the point where a RAID5 group built from many large, inexpensive disks collapses completely several (3-4) times a year!
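How likely is that second failure? A minimal sketch under the textbook assumption of independent, exponentially distributed failures (the MTBF and rebuild times below are illustrative):

```python
import math

def p_second_failure(disks: int, rebuild_hours: float, mtbf_hours: float) -> float:
    """Probability that at least one of the n-1 surviving disks fails
    during the rebuild window (independent exponential failures)."""
    return 1 - math.exp(-(disks - 1) * rebuild_hours / mtbf_hours)

# Illustrative numbers: 1M hours MTBF per disk, rebuild windows that
# grow with disk size (large cheap disks can rebuild for a day or more).
for n, rebuild in ((6, 8), (12, 24), (24, 48)):
    print(f"{n:>2} disks, {rebuild:>2} h rebuild: "
          f"P(2nd failure) ~ {p_second_failure(n, rebuild, 1_000_000):.4%}")
```

Even under this optimistic model the risk grows with both group size and rebuild length; as the paragraphs below argue, reality is worse, because the independence assumption itself breaks down.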
The solution to this problem is double parity: RAID6 or RAID-DP. Such a group survives the failure of two disks at one "time" (as we found out above, for large groups that "moment" is the rather long rebuild window). Two disks failing in a row is not a frequent event. Theoretically, for groups under 20 TB, RAID6 provides about two orders of magnitude better data protection (measured as time to data loss) than RAID5 for disks with average parameters.
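The classic back-of-envelope MTTDL (mean time to data loss) formulas show where that gap comes from. The absolute numbers they produce are not to be trusted, because they too assume independent failures; only the ratio is instructive. A sketch with illustrative parameters:

```python
def mttdl_raid5(n: int, mtbf: float, rebuild: float) -> float:
    """Textbook approximation: data is lost when a 2nd disk dies mid-rebuild."""
    return mtbf ** 2 / (n * (n - 1) * rebuild)

def mttdl_raid6(n: int, mtbf: float, rebuild: float) -> float:
    """RAID6 loses data only on a 3rd failure during two overlapping rebuilds."""
    return mtbf ** 3 / (n * (n - 1) * (n - 2) * rebuild ** 2)

n, mtbf, rebuild = 12, 1_000_000, 24  # illustrative values, hours
r5 = mttdl_raid5(n, mtbf, rebuild)
r6 = mttdl_raid6(n, mtbf, rebuild)
print(f"RAID5 MTTDL ~ {r5 / 8766:.0f} years, RAID6 ~ {r6 / 8766:.0f} years, "
      f"gain ~ {r6 / r5:.0f}x")
```

On paper the gain is enormous, tens of thousands of years of MTTDL, which is exactly the kind of theoretical comfort the next paragraph calls into question.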
Practice, however, makes one doubt probability theory: a second disk failing during a rebuild turns out to be quite likely, especially on systems under serious load. Two factors contribute. First, a rebuild on a production system loads the disks heavily: the number of read/write operations grows significantly on an already busy system. Second, at the current level of microelectronics, disks come off the assembly line as alike as clones, so a parameter as important as MTBF is nearly identical across them. Thus one disk that has reached the end of its operating life brings increased load onto the whole group, a faster-than-normal exhaustion of the other disks' resource and, as a result, a higher probability that another disk fails. A cascading failure of sorts.
Storage manufacturers fight this as best they can. IBM, for example, ships disks from different manufacturers and different batches in its storage systems, introducing heterogeneity into the disks' MTBF and reducing the likelihood of two disks in a group failing simultaneously. Still, the "data is always there" concept does not save you on its own, and alongside in-place data protection, backup is used as well. Which, by the way, also does not give 100% protection against hardware failure...
Keep this in mind: your business is exactly as vulnerable as your data. Absolute data protection is impossible, but with a combined approach to data protection, reliable hardware and full redundancy of the storage systems, the probability of losing corporate data can be minimized.