
Fault statistics in server memory



In 2009, at the annual SIGMETRICS scientific conference, a group of researchers from the University of Toronto, working with data collected and submitted for study by Google, published an extremely interesting paper, " DRAM Errors in the Wild: A Large-Scale Field Study ", devoted to failure statistics in server RAM (DRAM). Although similar studies had been carried out earlier (for example, a 2007 study that observed a fleet of 300 computers), this was the first survey that covered such a large server fleet, numbering in the thousands of machines, over more than two years, and provided such comprehensive statistical information.

I also note that the same group of researchers, led by Bianca Schroeder, then a graduate student and now a professor at the University of Toronto, had previously, in 2007, published an equally interesting study on hard drive failure statistics in Google's data centers (a brief popular digest of Failure Trends in a Large Disk Drive Population (pdf, 242 KB), if you do not feel like reading the entire report, can be found here: http://blog.aboutnetapp.ru/archives/tag/google ). In addition, several other works come from the same group, in particular on the effect of temperature and cooling, and on the statistics of RAM failures caused, presumably, by high-energy cosmic rays. Links to the publications can be found on Schroeder's home page on the university server.

Briefly, on how the statistical data were collected: for quite a long time now (the published work analyzed a period of about 2.5 years), Google's data centers have been collecting various monitoring data and other events in the life of the equipment into a large database, whose contents can later be analyzed over any desired period of time.


(The photo, by the way, shows the authentic look of a Google server platform; it is from these "bricks" that Google clusters, many thousands of nodes in size, are built. However, they have already been written about here.)

The results of this analysis are presented in the published paper, and they are surprising in many ways, forcing a different look at reliability questions and at the customary assumptions about server equipment reliability.

The study convincingly demonstrated that the impact of RAM failures is significantly underestimated , and that RAM failures occur more often than was previously believed. Finally, many assumptions turned out to be incorrect and in need of revision: for example, that RAM practically does not "age" (in the sense of an increasing likelihood of failure over time) the way components with moving parts, such as hard drives, do, or that overheating has a detrimental effect on the operation of RAM.

There is no doubt that over the past few years, due to the relative cheapening of DRAM and the widespread use of server virtualization systems, which are extremely hungry for memory, more and more RAM is being concentrated in a single server system, which in turn raises the reliability requirements placed on it.

The study showed that approximately every third server (or 8% of memory modules) in the observed data centers encountered a RAM malfunction during the 2.5 years of the study . The number of failures registered by the monitoring system exceeded 4000 per year! Most of them, of course, were eliminated by the ECC (Error Correcting Code) used in the RAM and its more sophisticated variants, such as Chipkill (which can correct multi-bit errors, for example across a whole group of cells). However, uncorrectable errors, that is, errors that could not be fixed and which almost certainly led to fatal consequences such as a BSOD or a kernel panic, are much more common than is generally believed. And when memory is used without ECC, each of these errors almost certainly means a BSOD, a kernel panic, or a serious application failure. After all, many people, for example, keep database data in memory to speed up its operation.
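To make the error-correction part more concrete: the idea behind ECC memory is that extra check bits stored alongside the data allow a single flipped bit to be detected and silently corrected (real server DIMMs use a SECDED code over 64-bit words, and Chipkill-class codes additionally survive the failure of a whole DRAM chip). The toy Hamming(7,4) sketch below is not the code actual hardware uses; it only illustrates the mechanism of correcting a single-bit upset:

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# Real ECC DIMMs use a stronger SECDED (72,64) code, but the principle is the same.

def hamming74_encode(d):
    """Encode data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                     # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                     # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                     # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def hamming74_decode(c):
    """Return (corrected data bits, 1-based position of the flipped bit or 0)."""
    c1, c2, c3, c4, c5, c6, c7 = c
    s1 = c1 ^ c3 ^ c5 ^ c7
    s2 = c2 ^ c3 ^ c6 ^ c7
    s3 = c4 ^ c5 ^ c6 ^ c7
    syndrome = s1 + 2 * s2 + 4 * s3       # 0 means no single-bit error detected
    c = c[:]
    if syndrome:
        c[syndrome - 1] ^= 1              # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[5] ^= 1                              # simulate a single-bit upset in "DRAM"
decoded, pos = hamming74_decode(word)
assert decoded == data                    # the error is corrected transparently
print(f"corrected a flipped bit at codeword position {pos}; data intact: {decoded}")
```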

Compared to a previously published study, the work of the Schroeder group sharply raised the "expected" failure rates. They estimated the rate of error events at 25-70 thousand per billion server hours (a quick conversion to per-server figures is sketched below), which is almost fifteen times higher than the earlier estimate made on a smaller population.
Failures caused by uncorrectable errors (those that ECC or Chipkill could not correct) were encountered by 1.3% of servers per year, or affected about 0.22% of DIMMs.
Systems using "multi-bit" correction mechanisms, such as Chipkill, had 4-10 times fewer such failures than systems with conventional ECC.
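To put these rates into perspective, here is a quick back-of-envelope conversion (a sketch only; it assumes the figure can be read as events per billion hours of server operation and that a server runs around the clock):

```python
# Rough conversion of "25,000-70,000 failures per billion server hours"
# into per-server and per-fleet yearly figures.
HOURS_PER_YEAR = 24 * 365                  # ~8760 hours of continuous operation

for rate_per_1e9_hours in (25_000, 70_000):
    per_server_year = rate_per_1e9_hours / 1e9 * HOURS_PER_YEAR
    fleet_1000_per_year = per_server_year * 1000
    print(f"{rate_per_1e9_hours:>6} per 1e9 h -> "
          f"{per_server_year:.2f} events per server-year, "
          f"~{fleet_1000_per_year:.0f} events per year across 1000 servers")
```

In other words, even the lower bound works out to roughly one error event per server every four to five years, and hundreds of events per year in a fleet of a thousand machines.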

Other interesting findings from the published paper are:

Operating temperature, and increases in it, correlate very weakly with the probability of DRAM failure . This is yet another indication that the prevailing industry opinion about the harmful effect of high temperature on semiconductors and computer equipment (an opinion based on studies from the 1980s) should be radically revised today. It is another confirmation of a fact already established, for example, in the work on hard drives: paradoxically, the smallest number of HDD failures was observed at temperatures around 40-45 degrees, and as the temperature was lowered, the number of failures actually increased (!).
In the case of DRAM, the correlation between temperature (in an observed range of about 20 degrees between the lowest and the highest values) and failures was extremely small.



(hereinafter on the slides: CE stands for correctable errors, errors registered but corrected by ECC; UE for uncorrectable errors)

However , failures correlated significantly with memory utilization and the rate of memory traffic (high memory load partly affects its temperature, of course, but not always). It is likely that intensive traffic and a large relative amount of memory filled with data greatly increase the likelihood that a failure is detected quickly.



It was found that the probability of a repeated failure in a memory module that has already failed once is hundreds of times higher than in a module that has never failed . This can be caused both by the presence of a hard-to-detect manufacturing defect, and by the fact that a failure, for example the strike of a charged cosmic-ray particle, does not pass without a trace for the memory, even if the error was corrected by ECC.
In 70-80% of cases when an uncorrectable error was registered in a memory module, that module had already had a correctable ECC or Chipkill error in the same or the previous month.
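A practical consequence of this correlation is that correctable-error counters are worth watching: a module that has started logging CEs is a good candidate for preemptive replacement before it produces a UE. Below is a minimal sketch of such a check for Linux, where ECC counters are exposed by the EDAC subsystem in sysfs; the exact layout (per-controller ce_count/ue_count files, per-DIMM dimm_ce_count) depends on the kernel version and the memory-controller driver, so treat the paths as an assumption to verify on your own hosts:

```python
#!/usr/bin/env python3
"""Minimal check of Linux EDAC ECC counters (a sketch, not a monitoring agent)."""

import glob
import os

def read_count(path):
    """Return an integer counter from sysfs, or 0 if the file is absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

suspects = []

# One directory per memory controller: /sys/devices/system/edac/mc/mc0, mc1, ...
for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    ce = read_count(os.path.join(mc, "ce_count"))   # corrected by ECC/Chipkill
    ue = read_count(os.path.join(mc, "ue_count"))   # not corrected: data at risk
    print(f"{os.path.basename(mc)}: CE={ce} UE={ue}")

    # Newer kernels also expose per-DIMM counters, which is what you need
    # in order to decide which physical module to swap out.
    for dimm in sorted(glob.glob(os.path.join(mc, "dimm*"))):
        dimm_ce = read_count(os.path.join(dimm, "dimm_ce_count"))
        dimm_ue = read_count(os.path.join(dimm, "dimm_ue_count"))
        if dimm_ce or dimm_ue:
            suspects.append((dimm, dimm_ce, dimm_ue))

if suspects:
    print("\nDIMMs with non-zero error counters (replacement candidates):")
    for path, ce, ue in suspects:
        print(f"  {path}: CE={ce} UE={ue}")
else:
    print("\nNo per-DIMM ECC errors reported by EDAC.")
```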



It was found that relatively new modules, made with higher density and a finer process technology, do not show a higher failure rate . Apparently, DRAM technology has not yet reached the technological limit beyond which reliability problems begin. The observed fleet of modules included about six different types and generations of memory (DDR1, DDR2 and FBDIMM of various kinds), and no correlation was found between higher density and the number of errors and failures.

Finally, the effect of "aging" in DRAM modules was demonstrated with frightening clarity. Moreover, in memory it manifested itself even more clearly than, for example, in HDDs, where the threshold after which failures grow severalfold was about 3-4 years.



Paradoxically, the statistics show an ever-increasing rate of correctable errors as the modules age , but a decreasing rate of uncorrectable errors; most likely, however, this is simply the result of the planned replacement of memory in servers that have been noticed to fail.



Surprisingly, DRAM, which has no moving parts at all, shows a significant and continuing growth of correctable errors after about a year and a half of operation.

Summing up, I would like to note that these statistics force us to revise the principles of building server platforms and data centers that many people take for granted, based on "life experience" and attitudes such as "the colder the better", "memory does not wear out", "if the server is assembled correctly, it does not break" and "ECC DRAM is a waste of money, because my desktop works without ECC and nothing happens". And the sooner such homespun attitudes are eliminated from such a serious field as data center construction, the better.
And I would also like to recommend to you, ladies and gentlemen, the publications of the annual USENIX conferences as an inexhaustible source of delight, intellectual exercise and food for the brain: not the marketing bulletins we are all so used to, but real, serious science that you cannot simply dismiss.

Source: https://habr.com/ru/post/171407/

