Why is RAID5 “must have”?

A short but, I hope, reasoned answer to the topic “Why is RAID-5 a ‘mustdie’?”.
Below I make the simplest possible calculation of the reliability of RAID10 and RAID5, compare their characteristics, and point out some fundamental flaws of RAID1 and RAID10.

A short introduction:

We will consider the simplest cases: RAID10 on 4 disks and RAID5 on 3 disks. All drives in the system are assumed to be identical.
In the original version of the article, RAID0 + 1 was mentioned instead of RAID10, but this causes unnecessary confusion. The correct name is of course RAID10; mea culpa.

Let n be the probability of failure of one disk;

So - RAID10:

The number of disks in the array - 4;
The price of the array is equal to the cost of four disks;
The capacity of the array is equal to twice the capacity of a single disk;
The maximum read speed is equal to twice the speed of one disk;
The probability of failure of the array in the best case (when the controller implements RAID1+0 as a single array and is able to combine drives in an arbitrary manner):
The probability that one particular disk fails and the rest survive: P1 = n * (1-n)^3;
The probability that two particular disks fail: P2 = n^2 * (1-n)^2;
The probability that three particular disks fail: P3 = n^3 * (1-n);
The probability that all four disks fail: P4 = n^4;
The probability of no-failure operation: P0 = (1-n)^4;
Total probability: 4 * P1 + 6 * P2 + 4 * P3 + P4 + P0 = 1;
Array failure probability: P(RAID10) = 2 * P2 + 4 * P3 + P4;
* In the first term the coefficient is 2 rather than 6, because only two of the six two-disk combinations (both disks holding the same data) leave the array unrecoverable.

Separately, I note that most controllers cannot combine drives this way, which means that the failure of any two drives leads to data loss, and the reliability of the array as a whole is much lower.
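
The coefficients 2, 4 and 1 in the best-case RAID10 formula can be checked by brute force. The sketch below is only an illustration: it assumes the four disks form two mirrored pairs, (0, 1) and (2, 3), and that disks fail independently with the same probability n. The names `MIRROR_PAIRS`, `is_fatal`, `fatal_counts` and `raid10_failure` are made up for this example.

```python
from collections import Counter
from itertools import product

# Assumption: disks 0 and 1 mirror each other, as do disks 2 and 3.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def is_fatal(failed):
    """The array is lost when both disks of some mirror pair have failed."""
    return any(failed[a] and failed[b] for a, b in MIRROR_PAIRS)


def fatal_counts():
    """Count fatal failure patterns, grouped by the number of failed disks."""
    counts = Counter()
    for failed in product([False, True], repeat=4):
        if is_fatal(failed):
            counts[sum(failed)] += 1
    return dict(counts)


def raid10_failure(n):
    """Sum the probabilities of all fatal patterns for per-disk probability n."""
    total = 0.0
    for failed in product([False, True], repeat=4):
        if is_fatal(failed):
            k = sum(failed)
            total += n**k * (1 - n) ** (4 - k)
    return total


print(fatal_counts())
print(raid10_failure(0.01))
```

The counts come out as 2, 4 and 1 for two, three and four failed disks, and the summed probability simplifies to 2 * n^2 - n^4.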

RAID5:

Number of disks in the array - 3;
The price of the array is equal to the cost of three disks;
The capacity of the array is equal to the capacity of two disks;
The maximum read speed is equal to one and a half times the read speed of one disk;
The array fails when at least two of its disks fail:
The probability that one particular disk fails and the rest survive: P1 = n * (1-n)^2;
The probability that two particular disks fail: P2 = n^2 * (1-n);
The probability that all three disks fail: P3 = n^3;
The probability of no-failure operation: P0 = (1-n)^3;
Total probability: 3 * P1 + 3 * P2 + P3 + P0 = 1;
Array failure probability: P(RAID5) = 3 * P2 + P3;
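
The same result follows from the binomial distribution: a 3-disk RAID5 is lost as soon as at least two disks have failed. A minimal sketch, assuming independent failures with the same probability n; `array_failure` and `raid5_failure` are hypothetical helper names, not part of any library.

```python
from math import comb


def array_failure(disks, tolerated, n):
    """Probability that more than `tolerated` of `disks` independent disks fail."""
    return sum(
        comb(disks, k) * n**k * (1 - n) ** (disks - k)
        for k in range(tolerated + 1, disks + 1)
    )


def raid5_failure(n):
    """A 3-disk RAID5 tolerates exactly one failed disk."""
    return array_failure(disks=3, tolerated=1, n=n)


print(raid5_failure(0.01))
```

Expanding the sum gives 3 * n^2 - 2 * n^3, which is the same as 3 * P2 + P3 above.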

Findings:

We begin, of course, with the probability of failure: we subtract the failure probability of RAID5 from the failure probability of RAID10:
P(RAID10) - P(RAID5) = 2 * n^2 * (n-1)^2 - n^3 + n^4 + 3 * n^2 * (n-1) - 4 * n^3 * (n-1) = -n^2 * (1-n)^2
For n -> 0 (and in fact for any 0 < n < 1) P(RAID10) - P(RAID5) < 0, i.e. the reliability of RAID5 is LOWER than the reliability of RAID10. The difference is quite small, but it is in favor of RAID10;
If we assume that drives cannot be combined arbitrarily, then RAID5 is more reliable.
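
To get a feel for the size of the difference, the formulas can be evaluated for a few values of n. This is a rough sketch under the article's assumptions (independent failures, the same n for every disk). The "worst" function takes the statement that any two failed drives destroy a RAID10 at face value, and the sample values of n are arbitrary.

```python
def raid10_best(n):
    # Only the 2 same-pair combinations of two failed disks are fatal.
    return 2 * n**2 * (1 - n) ** 2 + 4 * n**3 * (1 - n) + n**4


def raid10_worst(n):
    # Assumption: any of the 6 combinations of two failed disks is fatal.
    return 6 * n**2 * (1 - n) ** 2 + 4 * n**3 * (1 - n) + n**4


def raid5(n):
    return 3 * n**2 * (1 - n) + n**3


for n in (0.001, 0.01, 0.05):
    print(
        f"n={n}: RAID10 best={raid10_best(n):.3e}  "
        f"RAID5={raid5(n):.3e}  RAID10 worst={raid10_worst(n):.3e}"
    )
```

For every n between 0 and 1 the ordering is RAID10 (best case) < RAID5 < RAID10 (worst case): the two gaps are n^2 * (1-n)^2 and 3 * n^2 * (1-n)^2 respectively.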
Price ratio: RAID5 is 1.333 times cheaper.
Speed ratio: RAID5 is 1.333 times slower than RAID10, but at the same time it is one and a half times faster than a single drive.
Now the key question: which option is better? ~~The one that is more expensive and less reliable, albeit a little faster?~~ ~~Or the one that is cheaper and more reliable?~~
Personally, my opinion is ~~leaning toward the more reliable and cheaper RAID5~~ not leaning either way.

Addition:
In the comments, the respected track argued that in some cases RAID-5 can be much slower than RAID1. In my humble opinion these must be very, very specific cases, but it is worth keeping in mind.

Miscellaneous remarks:

Recovery time:

Ideally, RAID10 recovery takes as long as copying the entire amount of data.
For RAID5 the situation is more complicated, since the data has to be reconstructed from correction codes (parity).
In a software implementation, the RAID5 recovery time is determined by the speed of the processor.
In a hardware implementation, the RAID5 recovery time is equal to the RAID10 recovery time.
Considering that modern processors easily handle a data stream on the order of 100 MB/s (the approximate peak read speed of modern drives), a correctly implemented software RAID5 should recover not much slower than RAID10.
On reliability during recovery: for the case under consideration there is nothing to discuss at all, you need to make backups! In the general case, keep in mind that during recovery a RAID10 contains more disks than a RAID5, so the probability of another failure is higher, and it cannot be said that RAID10 is definitely more reliable during recovery.
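
The "recovery by correction codes" mentioned above is, for RAID5, a byte-wise XOR of the surviving blocks. The sketch below is a simplified illustration: it assumes one stripe with two data blocks and one parity block and ignores how parity is rotated across disks. `xor_blocks` is a name chosen for this example.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    result = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# One stripe of a 3-disk array: two data blocks and their parity.
d0 = b"RAID"
d1 = b"five"
parity = xor_blocks(d0, d1)

# The disk holding d1 is lost: rebuild it from the two survivors.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == d1)
```

Every rebuilt byte needs a read from each surviving disk plus an XOR, which is why the rebuild cost lands on the CPU in a software implementation.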

Addition:
If RAID-5EE is used, then in the event of a first failure, it is “compressed” into RAID-5, which can take a very long time. However, it should be noted that the result is a full-fledged RAID-5, which is resistant to single failures, i.e. in fact (with some limitations), the system can survive two failures in a row.

CPU load:

Software implementation of RAID5 loads the processor. For modern processors, this is usually not critical, but for fast drives you need to keep in mind that the faster the drive, the greater the load on the processor.

And again, reliability is the last nail in the coffin lid:

For some reason, when people talk about RAID10 and especially about RAID1, everyone overlooks one very important point.
Yes, if a drive fails physically, the data can be restored from the copy, but what if the drives return different data? In RAID1 there is no way to find out which data is correct! You can try to judge the correctness of the data from its content, but that is a non-trivial task that can only be done manually, and by no means always.
It is for this reason that I do not consider RAID1 here at all: it provides no mechanism for verifying data integrity. Neither, in general, does RAID10.
RAID5 (6?), on the other hand, provides one quite well: if one of the three drives returns incorrect data, it will be known for certain that the data cannot be trusted.
How can this (corrupted data) happen?
Overheating drives. Power problems. Drive firmware problems. Lots of options! Up to complete burnout of the drive electronics caused by a breakdown of the computer power supply. In that case you can try to revive the disks by fitting boards from similar devices, but there is no guarantee that all data on the disks is intact.
And one more small nail. The topic that started all this said a lot about BER (bit error rate). Without going into details, I note that, first, for hard drives it is still more customary to talk about MTBF (mean time between failures); second, if you do talk about BER, then it should be UBER (uncorrectable bit error rate); and third, it becomes an argument in favor of RAID5: if the drives return corrupted data (which has already passed all correction procedures), how can you tell which drive to believe?
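
What a parity check can reveal is easy to see on a single stripe. The sketch below is an illustration only: it assumes a stripe of two data blocks plus XOR parity and a controller that actually verifies parity on every read, which a real controller may not do. `xor_bytes` and `stripe_is_consistent` are hypothetical helpers.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_is_consistent(d0: bytes, d1: bytes, parity: bytes) -> bool:
    """True when d0 XOR d1 equals the stored parity block."""
    return xor_bytes(d0, d1) == parity


d0, d1 = b"\x10\x20", b"\x0f\x0f"
parity = xor_bytes(d0, d1)

# One drive silently returns a flipped bit without reporting an error.
bad_d1 = bytes([d1[0] ^ 0x01]) + d1[1:]

print(stripe_is_consistent(d0, d1, parity))      # healthy stripe
print(stripe_is_consistent(d0, bad_d1, parity))  # stripe with silent corruption
```

The mismatch shows that something in the stripe is wrong; with a single parity block it does not by itself identify which of the three blocks is the bad one.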

Addition:
Wiki says the opposite: recovery information is not used until one of the disks fails. My own experience says otherwise, but that was a long time ago and I don’t even remember which controller it was (it may have been some non-standard RAID level). So data integrity can be spoken of unequivocally only for ZFS / RAID-6.

Verdict:

~~The verdict is simple - if you don’t need extra problems out of the blue, then you don’t need to build RAID1 or RAID0 + 1 - you need to look towards RAID5, 5E, 6, ZFS~~
The verdict against “pure” RAID5 is not unambiguous :)

Update:
Corrected the probability calculation; the conclusion has not changed. Corrected "RAID0 + 1" to "RAID10". I note that in the case described, “RAID0 + 1” is identical to “RAID1 + 0”. But the correct name is of course "RAID10".

Update 2:
And so, just like that, the meaning of the article has changed, if not to the opposite, then certainly radically.

Source: https://habr.com/ru/post/78348/
