Family photo without a grandmother, or what to do when ECC fails?

Not long ago I wrote about an interesting error correction algorithm called LDPC. But what if error correction cannot do its job? The LSI blog recently published a good note by Kent Smith on this subject, and I decided to translate it.

Many users do not even suspect that while reading data from storage devices, whether traditional HDDs or SSDs, computers constantly encounter a large number of errors. Error correction codes (ECC) are therefore used to fix erroneous bits before incorrect data is returned to the user. But the capabilities of ECC are limited: if the number of errors exceeds a certain threshold, the error correction code fails. For this reason, most companies in the data storage industry develop more sophisticated algorithms. LSI goes noticeably “deeper” in its SandForce controllers to protect user data.
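To make the idea of a correction limit concrete, here is a minimal sketch using a Hamming(7,4) code. This is only a stand-in chosen for brevity: real SSD controllers use much stronger codes (such as the LDPC mentioned above), and the function names are made up for this illustration. The code repairs any single flipped bit, but two flipped bits exceed its limit and it silently returns wrong data.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (illustrative stand-in for SSD ECC)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the suspected bad bit, 0 = clean
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

one_error = list(codeword)
one_error[4] ^= 1  # a single bit error: within the code's limit

two_errors = list(codeword)
two_errors[0] ^= 1  # two bit errors: beyond the code's limit
two_errors[1] ^= 1

print(hamming74_decode(one_error) == data)   # the single error is corrected
print(hamming74_decode(two_errors) == data)  # the double error is "corrected" into wrong data
```

The second case is exactly the situation the rest of this note deals with: the code has run out of correcting power, and something else has to recover the data.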

What happens when ECC fails?

If ECC algorithms cannot cope, only backup mechanisms can come to the rescue. There are three alternatives. The first is for users to make backups themselves, guarding against ECC failures and other threats that could damage the data or make it inaccessible. These range from natural disasters that damage buildings and everything inside them (earthquakes, mudflows, landslides) to more exotic problems, from lightning damage to computers without proper surge protection to plain theft. According to recent research, less than 10% of data is properly backed up. Not a very comforting figure.

The second solution is RAID (Redundant Array of Independent Disks). Data is automatically stored with redundancy across several disks (sometimes even attached to different computers), and if one of them fails, this redundancy makes it possible to recover the lost data. The technology is very widely used in the corporate sector, but it is still rare among home users.

Is there a simple automatic solution that works for a single disk?

Yes, and that answer is the third solution: RAISE™ data protection, implemented by LSI in SandForce controllers. The technology was introduced in 2009 with the first SandForce controller. RAISE stands for Redundant Array of Independent Silicon Elements; the name sounds like RAID, and the technologies are similar in some ways. RAISE treats individual flash elements inside the SSD like disks in a RAID array, storing data with some redundancy. The original RAISE Level 1 can protect against the failure of an entire flash page (I have already written about flash pages here; translator's note), which is definitely beyond the power of classic ECC.
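The RAID-like idea can be sketched with simple XOR parity, as in RAID 5. This is an assumption made purely for illustration: the note does not say how RAISE computes its redundancy, and the names `xor_pages` and `rebuild` are hypothetical. One parity page is stored alongside a stripe of data pages, and any single lost page can be rebuilt from the survivors.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def rebuild(stripe, parity, failed_index):
    """Reconstruct the page at failed_index from the surviving pages and the parity page."""
    survivors = [page for i, page in enumerate(stripe) if i != failed_index]
    return xor_pages(survivors + [parity])


# Three data pages, each imagined to live on a different flash element.
stripe = [b"page-one", b"page-two", b"page-333"]
parity = xor_pages(stripe)  # the redundancy, stored on yet another element

# Page 1 becomes unreadable (ECC gave up); rebuild it without ever reading it.
recovered = rebuild(stripe, parity, failed_index=1)
print(recovered == stripe[1])
```

With a single parity page, only one lost page per stripe can be recovered, which is the price of keeping the redundancy small.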

Introduced last December, the SF3700 makes RAISE even more flexible, allowing it to protect user data further. The original RAISE Level 1 required a fixed portion of the flash to be set aside exclusively for redundancy data. On a 64 GB drive, the capacity available to the user was 60 or even 55 GB. Such losses hurt on so small a drive, and the only way to avoid them was to disable RAISE protection. In the new version, these losses are optional. The new “fractional” RAISE option lets the technology use a minimal amount of memory while still protecting data and preserving a sufficient level of over-provisioning (the latter is especially important because it helps keep write amplification down, which maintains SSD speed and protects the drive from excessive wear; read more about over-provisioning in this article).
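As a quick check on how large those losses are, the figures above can be turned into percentages. The helper name `reserved_fraction` is made up for this sketch, and the note does not break down how much of the reserved space goes to RAISE and how much to other spare area.

```python
def reserved_fraction(raw_gb, user_gb):
    """Share of the raw capacity that the user cannot use."""
    return (raw_gb - user_gb) / raw_gb


for user_gb in (60, 55):
    share = reserved_fraction(64, user_gb)
    print(f"{user_gb} GB usable out of 64 GB: {share:.2%} reserved")
```

So the 60 GB configuration gives up 6.25% of the flash, and the 55 GB configuration about 14%.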

Best protection with RAISE Level 2

The new RAISE Level 2 protects users from even larger failures, from read errors spanning several pages to the failure of an entire chip. It automatically redistributes data based on the number of errors on each chip: if a chip is close to failure, the protection mechanism moves its data to the others. Since this reduces the capacity available to the user, RAISE Level 2 can "roll back" to Level 1 protection, so that the user does not lose available storage.
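The redistribution logic described above might look roughly like the sketch below. Everything in it is hypothetical: the `Die` class, the error threshold, and the "least-loaded healthy chip" placement rule are assumptions for illustration, not the controller's actual algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Die:
    """A single flash chip with a running error counter (hypothetical model)."""
    name: str
    error_count: int = 0
    blocks: list = field(default_factory=list)


def retire_failing_dies(dies, error_threshold):
    """Move all blocks off dies whose error count has reached the threshold.

    Returns the list of dies that were emptied.
    """
    healthy = [d for d in dies if d.error_count < error_threshold]
    failing = [d for d in dies if d.error_count >= error_threshold]
    if failing and not healthy:
        raise RuntimeError("no healthy die left to receive the data")
    for bad in failing:
        while bad.blocks:
            # Place each block on the healthy die that currently holds the least data.
            target = min(healthy, key=lambda d: len(d.blocks))
            target.blocks.append(bad.blocks.pop())
    return failing


dies = [
    Die("A", error_count=2, blocks=["a1"]),
    Die("B", error_count=50, blocks=["b1", "b2"]),  # close to failure
    Die("C", error_count=1, blocks=["c1"]),
]
retired = retire_failing_dies(dies, error_threshold=40)
print([d.name for d in retired])
```

Note that the emptied chip no longer contributes capacity, which is why the drive then has less room to work with.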

Another feature of the new controller is an additional (ninth) flash channel, which lets manufacturers build higher-capacity drives; this in turn makes it possible to use RAISE Level 1 without reducing the capacity available to the user (without it, drive capacities drop to 60, 120 and 240 GB respectively).

Of course, RAISE will not protect you from theft or from a catastrophe such as a power surge or flood, but such events are clearly less likely than an ordinary ECC failure. The best strategy, therefore, is to buy a drive with RAISE and make periodic backups to protect against large-scale problems.

Source: https://habr.com/ru/post/210350/
