
Should I buy ECC memory?

Jeff Atwood, arguably the most widely read programmer-blogger, has published a post arguing against the use of ECC memory. As I understand it, his arguments are:

  1. Google did not use ECC when they built their servers in 1999.
  2. Most RAM errors are systematic, not random errors.
  3. RAM errors rarely occur because the hardware has improved.
  4. If ECC memory were really important, it would be used everywhere, not just in servers. Paying extra for an optional feature like this is clearly a waste.

Let's look at these arguments one by one:

1. Google did not use ECC in 1999


If you are going to do something just because Google once did it, then also try the following:

A. Put your servers in shipping containers.


People are still writing articles about what a great idea this is, even though Google's experiment with it was judged a failure. It turns out that even Google's experiments don't always succeed. In fact, their well-known fondness for "moonshots" means they have more failed experiments than most companies. In my opinion, that is a significant competitive advantage for them. You shouldn't make that advantage any bigger by blindly copying their failed experiments.

B. Cause fires in your own data center.


Part of Atwood's post discusses how amazing these servers were:

Some may look at these early Google servers and see an amateurish fire hazard. Not me. I see a far-sighted understanding of how inexpensive commodity hardware would shape the modern Internet.

The last part is true. But there is some truth in the first part as well. When Google began designing its own boards, one generation had "growing pains" (1) that caused a non-zero number of fires.

By the way, if you go to Jeff's post and look at the photo the quotation refers to, you will see a lot of jumper wires on the boards. These caused problems and were fixed in the next generation of hardware. You can also see some rather sloppy cabling, which caused additional problems and was also quickly fixed. There were other problems as well, but I will leave those as an exercise for the reader.

C. Build servers that injure your employees.


The sharp edges of one generation of Google servers earned them a reputation for being made of "razor blades and hate".

D. Create your own weather in your data center.


Having talked with employees of many large technology companies, it seems that most of them have had climate-control mishaps severe enough to produce clouds or fog inside their data centers. You could call it Google's prudent and insidious plan to replicate Seattle's weather and lure away Microsoft employees. Or a plan to create literal "cloud computing." Or maybe not.

Note that all of the above are things Google tried and then changed. Making mistakes and then fixing them is routine in any organization that develops successfully. If you are going to idolize someone's engineering practice, at least hold to their modern practice, not what they did in 1999.

When Google ran servers without ECC back in 1999, they exhibited a number of symptoms that, as it turned out, were caused by memory corruption, including a search index that returned effectively random results for queries. The actual failure mode here is instructive. I often hear that ECC can be ignored on these machines because errors in individual results are acceptable. But even if you consider random errors acceptable, ignoring them means risking wholesale data corruption, unless you do a careful analysis to make sure that a single error can only slightly distort a single result.

Studies of file systems have repeatedly shown that, despite heroic attempts to build systems resilient to a single error, doing so is extremely difficult. Essentially every heavily tested file system can suffer a serious failure from a single error (see the results of Andrea and Remzi's research group at Wisconsin if you are interested). I am not attacking file system developers; they are better at this kind of analysis than 99.9% of programmers. It has simply been shown again and again that the problem is hard enough that people cannot reliably reason about it, and automated tooling for this kind of analysis is still far from a push-button process. Google discusses error detection and correction in its book on warehouse-scale computing, and ECC memory is treated there as the canonical case where hardware error correction is obviously the right choice (2).

Google has excellent infrastructure. From what I have heard about infrastructure at other major IT companies, Google's seems to be the best in the world. But that does not mean everything they do should be copied. Even considering only their good ideas, it makes no sense for most companies to copy them. They built a replacement for the Linux scheduler that uses both runtime hardware information and static traces to let them exploit new hardware in Intel's server processors that allows the cache to be dynamically partitioned between cores. Used across their entire fleet, that saves Google more money in a week than Stack Exchange has spent on all of its machines in its entire history. Does that mean you should copy Google? No, not unless manna from heaven has already fallen on you, for example in the form of your core infrastructure being written in highly optimized C++ rather than Java or (God forbid) Ruby. And the fact is that for the overwhelming majority of companies, writing programs in a language that imposes a 20-fold reduction in performance is a perfectly reasonable decision.

2. Most RAM errors are systematic errors.


The argument against ECC reproduces the following section of a DRAM error study (emphasis Jeff's):

Our study has several major results. First, we found that approximately 70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient (intermittent) faults. Second, we found that large multi-bit faults, such as faults affecting an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we found that almost 5% of DRAM faults affect board-level circuitry, such as data (DQ) or strobe (DQS) lines. Finally, we found that Chipkill reduced the system failure rate caused by DRAM faults by a factor of 36.

The quote is rather ironic, since it reads less like an argument against ECC than an argument for Chipkill, a particular class of ECC. Setting that aside, Jeff's post points out that systematic errors occur twice as often as random errors, and then reports that they run memtest on their machines when systematic errors occur.

First, a 2:1 ratio is not so lopsided that you can simply ignore random errors. Second, the post implies Jeff's belief that systematic errors are fixed in place and cannot appear over time. That is not true. Electronics wears out just as mechanical devices do. The mechanisms are different, but the effects are similar. Indeed, if you compare chip reliability analysis with other kinds of reliability analysis, you can see that they often use the same families of distributions to model failures. Third, Jeff's line of reasoning implies that ECC cannot help in detecting or correcting errors, which is not only wrong but directly contradicts the quote.
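To make "a class of ECC" concrete, here is a minimal sketch, in Python, of a SECDED Hamming code, the basic scheme behind how ECC both detects and corrects errors. This is purely illustrative and of my own construction; real ECC DIMMs implement much wider codes in hardware (e.g., 72 stored bits per 64 data bits), and Chipkill-class codes can survive the loss of an entire DRAM chip.

```python
# A minimal sketch of SECDED (single-error-correcting, double-error-
# detecting) Hamming coding: 4 data bits -> 8-bit codeword.

def encode(data4):
    """Encode 4 data bits (0/1) into an 8-bit SECDED codeword."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4                    # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                    # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                    # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]  # Hamming(7,4), positions 1..7
    p0 = 0
    for b in word:                       # overall parity bit: this is what
        p0 ^= b                          # upgrades detection to double errors
    return word + [p0]

def decode(code8):
    """Return (status, data4). status is 'ok', 'corrected', or 'double'."""
    w = list(code8[:7])
    syndrome = 0
    for pos in range(1, 8):              # XOR of 1-based positions of set bits:
        if w[pos - 1]:                   # 0 for a clean word, otherwise the
            syndrome ^= pos              # position of a single flipped bit
    overall = 0
    for b in code8:
        overall ^= b
    if syndrome and overall:             # one bit flipped: repair it
        w[syndrome - 1] ^= 1
        return 'corrected', [w[2], w[4], w[5], w[6]]
    if syndrome and not overall:         # two bits flipped: detect, can't fix
        return 'double', None
    # syndrome == 0: either clean, or only the overall parity bit flipped;
    # either way the data bits are intact.
    return 'ok', [w[2], w[4], w[5], w[6]]

codeword = encode([1, 0, 1, 1])
codeword[5] ^= 1                         # simulate a single bit flip in "RAM"
print(decode(codeword))                  # ('corrected', [1, 0, 1, 1])
```

The point of the sketch: the same machinery that corrects a single flipped bit also reports that the flip happened, which is exactly the signaling function discussed below.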

So, how often are you going to run memtest on your machines trying to catch these systematic errors, and how much data loss are you willing to live with? One of the key uses of ECC is not correcting errors but signaling them, so that hardware can be replaced before silent data corruption occurs. Who would agree to take everything on a machine down every day to run memtest? It would be far more expensive than simply buying ECC memory. And even if you could talk people into scheduled memory testing, memtest would not detect nearly as many errors as ECC can.

When I worked at a company with a fleet of about a thousand machines, we noticed strange failures in our data-integrity checks, and after about six months we realized that failures were more likely on some machines than on others. The failures were quite rare (maybe a couple of times a week on average), so it took a long time to accumulate data and figure out what was happening. Without knowing the cause, analyzing the logs to determine that the errors were (with high probability) caused by isolated single-bit flips was also nontrivial. We were lucky that, as a side effect of the process we used, checksums were computed in a separate process on a different machine at a different time, so an error could not corrupt a result and then propagate that corruption into the checksum.
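As a hedged reconstruction (the actual tooling isn't described here) of the core test in that kind of log analysis: the telltale signature of an isolated bit flip is that the written value and the read-back value differ in exactly one bit position.

```python
# Check whether a written/read-back pair is explainable by one bit flip.

def is_single_bit_flip(written: int, read_back: int) -> bool:
    diff = written ^ read_back                   # 1s mark the changed bits
    return diff != 0 and diff & (diff - 1) == 0  # exactly one bit set?

print(is_single_bit_flip(0b1011, 0b1111))  # True: one bit differs
print(is_single_bit_flip(0b1011, 0b0110))  # False: three bits differ
```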

If you only try to protect yourself with in-memory checksums, there is a good chance you will compute a checksum over data that is already corrupted and end up with a correct checksum of incorrect data, unless you are doing some genuinely unusual computation that produces self-validating checksums. And if you are serious about error correction, you probably end up using ECC anyway.
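Here is a minimal sketch of that failure mode, using zlib's CRC-32 as a stand-in for whatever checksum a service might actually use:

```python
# If the bit flips in RAM *before* the checksum is computed,
# verification happily passes.

import zlib

data = bytearray(b"important payload")
data[3] ^= 0x04                   # corruption happens in memory first...
checksum = zlib.crc32(data)       # ...then we checksum the bad data

# Later verification succeeds: a correct checksum of incorrect data.
assert zlib.crc32(data) == checksum
print("verification passed despite corruption")
```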

In any case, once the analysis was done, we found that memtest could not detect any problems, but replacing the RAM in the bad machines reduced the error rate by one to two orders of magnitude. Most services don't have the kind of checksums we had; those services will simply write corrupted data silently to persistent storage and won't see the problem until a customer complains.

3. Thanks to hardware improvements, RAM errors have become very rare


The post doesn't contain enough data to support this assertion. Note that since RAM usage has been increasing, and continues to increase, exponentially, RAM failure rates would have to decrease at a faster exponential rate just to hold the frequency of data corruption constant. Moreover, as chips continue to shrink, features get smaller, which makes the wear-out issues discussed in point 2 more relevant. For example, at 20nm a DRAM capacitor might hold something like 50 electrons, and that number will be lower for the next generation of DRAM as the downward trend continues.

The 2012 study that Atwood cites has this graph of corrected errors (a subset of all errors) on ten randomly selected failed nodes (6% of the nodes had at least one failure):


Fig. 1. Corrected errors per month for randomly selected failed nodes; the labels indicate the type of fault that occurred at each node.

We are talking about somewhere between 10 and 10,000 corrected errors per month for a typical failed node, and this is a study hand-picked by a post arguing that you don't need ECC memory. Note that these nodes had only 16 GB of RAM, an order of magnitude less than modern servers, and that the study was conducted on an older process node that was less vulnerable to noise than today's.

For those used to dealing with reliability issues who just want the FIT value (a unit of failure rate: failures per billion device-hours): the study gives a FIT rate of 0.057-0.071 faults per Mbit (which, contrary to Atwood's statement, is not an astoundingly low number).

If you take the most optimistic FIT value, 0.057, and compute the rate for a server with a fairly modest amount of RAM (I'll use 128 GB here, since the servers I see these days typically carry 128 GB to 1.5 TB), you get an expected value of 0.057 * 1,000,000 * 8,760 / 1,000,000,000 ≈ 0.5 faults per year per server (128 GB is roughly a million Mbit). Note that this counts faults, not errors. As the graph above shows, one fault can easily produce hundreds or thousands of errors per month. Note also that several nodes had no errors at the start of the study but developed them later.
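For anyone who wants to reproduce the arithmetic, here is the same calculation as a small Python sketch (the 128 GB figure and the 0.057 FIT/Mbit rate are the ones used above):

```python
# FIT is failures per billion (1e9) device-hours; the study quotes FIT
# per Mbit, so we scale by the number of Mbits of RAM in the server.

FIT_PER_MBIT = 0.057       # the study's most optimistic figure
RAM_GB = 128               # a modest server by current standards
HOURS_PER_YEAR = 8760

mbits = RAM_GB * 1024 * 8  # 128 GB = 1,048,576 Mbit (~1e6, as used above)
faults_per_hour = FIT_PER_MBIT * mbits / 1e9
print(faults_per_hour * HOURS_PER_YEAR)  # ~0.52 faults/year per server
```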

Sun/Oracle ran into this in a serious way a couple of decades ago. DRAM transistors and capacitors were shrinking, just as they are now, and memory usage and cache sizes were growing, just as they are now. Faced on one side with ever-smaller transistors that were less resistant to transient upsets and harder to manufacture, and on the other side with growing on-die caches, the overwhelming majority of server vendors added ECC to their caches. Sun decided to save a few dollars and skip ECC. The direct result was that a number of Sun customers reported intermittent data corruption. It took Sun several years to develop a new architecture with an ECC-protected cache, and in the meantime Sun forced customers to sign NDAs to get replacement chips.

Of course, it's impossible to keep such things hidden forever, and when they surfaced, Sun's reputation for building reliable servers took a serious blow, much like when Sun tried to hide its poor benchmark results by banning benchmarking in its terms of use.

One more note: when you pay for ECC, you are not just paying for ECC memory; you are paying for parts (processors, boards) that are of higher quality overall. This is easy to see in disk failure rates, and I have heard many people confirm it from their own observations.

To cite publicly available research: as far as I recall, Andrea and Remzi's group published a SIGMETRICS paper a few years ago showing that SATA disks were 4 times more likely to suffer read failures than SCSI disks, and 10 times more likely to suffer silent data corruption. This ratio held even for disks from the same manufacturer. There is no particular reason to think the SCSI interface should be more reliable than the SATA interface; this is not about the interface, but about buying highly reliable server-grade components versus client-grade ones. Perhaps disk reliability specifically doesn't worry you because you checksum everything and corruption is easy to detect, but some kinds of corruption are harder to catch.

4. If ECC memory were really important, then it would be used everywhere, and not only in servers.


Rephrasing this argument slightly: "if this feature were really important for servers, it would also be used in non-servers." You could apply that argument to a rather large share of server hardware. In fact, this is one of the most vexing problems facing the major cloud providers.

They have enough leverage to get most components at good prices. But negotiation only works where there is more than one viable supplier.

One of the few areas with no viable competition is CPUs and GPUs. Fortunately for the large providers, they usually don't need GPUs; they need CPUs, lots of them, and that has been true for a long time. There have been several attempts by processor vendors to enter the server market, but every such attempt had fatal flaws from the very beginning that made its doom obvious (and these are often projects requiring at least five years, i.e., a lot of time committed with no confidence of success).

Qualcomm's effort has generated a lot of noise, but when I talk to my contacts at Qualcomm, they all tell me that the chip built so far is a test chip: Qualcomm needed to learn how to build a server chip with all the experts it poached from IBM, and the next chip will be the first one that could, hopefully, be competitive. I have high hopes for Qualcomm, and for ARM's efforts to build good server parts, but these efforts have not yet produced the desired result.

The near-total unsuitability of current ARM (and POWER) variants (apart from hypothetical variants of Apple's impressive ARM chips) for most server workloads in terms of performance per dollar of total cost of ownership (TCO) is a bit off topic, so I'll save it for another post. The point is that Intel holds a market position that lets it charge a premium for server features, and it does. In addition, some features genuinely matter more for servers than for mobile devices with a few gigabytes of RAM and a power budget of a few watts, devices that are still expected to crash and reboot periodically anyway.

Conclusion


Should I buy ECC RAM? It depends. For servers, it's probably a good call given the costs, though a real cost/benefit analysis is hard to carry out, because it's difficult to put a number on the damage from silent data corruption, or on the cost of losing half a year of a developer's time chasing intermittent failures only to discover they were caused by non-ECC memory.

For desktops I'm also in favor of ECC. But if you don't do regular backups, investing in backups will do you more good than investing in ECC memory. And if you have backups but no ECC, you can easily write corrupted data to your primary store and then replicate that corruption into your backups.

Thanks to Prabhakar Ragda, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also thanks (or perhaps no-thanks) to Leah for convincing me to write up this oral impromptu as a blog post. Apologies for any errors, the lack of references, and the unpolished prose; this is essentially a transcript of half of a discussion, and I haven't explained terms, provided references, or checked facts at the level of detail I usually do.

1. One of the funniest examples (at least to me) is the magical self-healing fuse. Although there are many implementations, think of an on-chip fuse as a kind of resistor. Pass a small current through it and you have a connection. If the current is too large, the resistor heats up and eventually breaks down. This is commonly used to disable features on a chip, or for things like setting the clock frequency. The basic principle is that once a fuse is blown, there is no way to restore it to its original state.

A long time ago, there was a semiconductor manufacturer that rushed its production process a bit and shaved the tolerances slightly in one process generation. After a few months (or years), the connection between the two ends of such a fuse could grow back and restore it. If you were lucky, the fuse would be something like the high-order bit of a clock multiplier, which, if changed, would simply disable the chip. If you were unlucky, it would lead to silent data corruption.

I heard from many people at different companies about problems with that manufacturer's process generation, so these were not isolated cases. When I say it's funny, I mean it's funny to hear the story in a bar. It is less amusing to discover, after a year of testing, that some of your chips don't work because their fuse settings have become meaningless, and you have to respin your chip and delay the release by three months. Incidentally, this fuse-regrowth situation is another example of a class of errors whose severity can be mitigated with ECC.

This was not a Google problem; I mention it only because the people I talk to are often surprised at the ways hardware can fail.

2. If you don't want to wade through the whole book, here is the relevant fragment:

In a system that can tolerate a number of failures at the software level, the minimum requirement on the hardware is that its failures are always detected and reported to the software promptly enough to allow the software infrastructure to contain them and take appropriate recovery actions. The hardware does not need to handle all failures. This does not mean that hardware for such systems should be designed without error correction. Whenever error-correction functionality can be offered at a reasonable cost or complexity, supporting it often pays off. It means that if hardware error correction would be extremely expensive, the system could get by with a cheaper version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much harder, since it would mean that every software component would be burdened with checking its own correct execution.

Early in its history, Google had to deal with servers whose DRAM lacked even parity checking. Producing a web search index consists essentially of a very large sort/merge operation that uses many machines over a long period of time. In 2000, one of the monthly updates to Google's web index failed pre-release checks when it was discovered that a subset of tested queries returned seemingly random documents. After some investigation, a pattern was found in the new index files that corresponded to a bit being stuck at zero at a fixed position in the data structures, a negative side effect of streaming a large amount of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature occurred. Note, however, that this technique does not guarantee 100% error detection in the indexing pass, since not all memory positions are checked; instructions, for example, remain unchecked. It worked because the index data structures were so much larger than all the other data involved in the computation that the presence of these self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following generation of Google machines did include memory parity detection, and once the price of ECC memory dropped to competitive levels, all subsequent generations have used ECC DRAM.

Source: https://habr.com/ru/post/328370/
