
DRAM Errors, or Don't Rush to Blame the Software


When a computer hangs or shows the notorious BSOD, software usually gets the blame (along with buggy drivers, the hands of under-educated programmers, Microsoft and Bill Gates personally, and so on). But over the past few years researchers have begun to look more closely at hardware failures and have uncovered another serious class of problems, one that shows up far more often than many people think. That is what this article is about.

Chip manufacturers put a lot of effort into making sure their products are thoroughly tested and work properly. What they are less keen to mention is that keeping chips working correctly over long periods is not easy. Since the late 1970s it has been known that latent hardware problems can cause bits inside a chip to flip unexpectedly from one state to another. The steady shrinking of transistors only increases the likelihood that a passing particle will flip their state. Such failures are called “soft errors”, and their significance will only grow as process nodes shrink, since a single particle can then do much more damage.

But “soft errors” are only part of the problem. Over the past five years researchers have monitored several very large data centers and found that in many cases the cause of failures was simply faulty memory chips. Over time, heat or manufacturing defects can lead to component failures (conductive bonds breaking down, or new, unwanted ones forming). These are “hard errors”.

Soft errors


“Soft errors” worry the developers of next-generation chips greatly because of one important factor: power consumption. The coming generation of supercomputers will contain even more microprocessors and memory chips, and that enormous number of transistors will need more and more energy to avoid uncontrolled bit flips.
The problem itself comes down to basic physics. As manufacturers make the interconnects inside chips ever thinner, electrons simply “run away”, like water droplets from a leaky hose. The thinner the connection, the more energy it takes to keep it operating correctly.
The problem is serious enough that Intel is working with the US Department of Energy and a number of other government organizations to address it. Using a future 5 nm process, Intel intends to build supercomputers 1000 times more powerful than today’s by the end of the decade. But it looks as if such supercomputers will not only be much faster, they will also be real energy hogs.
“We have a way to achieve this [a 1000-fold increase in performance] without worrying about power consumption. But if you want us to solve the power consumption problem as well, that is beyond our plans.”


The graph above is not the most recent data, and it refers to a different type of memory; I could not find comparable figures specifically for DRAM. But the general trend is visible: raising the voltage reduces the number of failures.


Manufacturers do not like to talk about how often their products falter: such information is treated as confidential, and research on the topic is hard to find. Often companies simply forbid their customers to discuss the frequency of hardware failures.
“This is an area of active research. We do not talk about it openly because it is a very sensitive topic.”

Hard errors


“Soft errors” are one problem, but there are others that hardware manufacturers talk about even less. According to research from the University of Toronto, when computer memory fails, the cause is far more likely to be aging or manufacturing defects (“hard errors”) than “soft errors” caused by cosmic radiation.

In 2007 a group of researchers was given access to Google’s data centers, where they collected data on how often the search giant’s specialized Linux systems failed. They recorded ten times more failures than expected: where previous studies had reported between 200 and 5,000 failures per billion hours of operation, Google’s data showed between 25,000 and 75,000.
Even more interesting, about 8% of the memory chips were responsible for more than 90% of the failures.
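Failure rates like these are usually expressed in FIT, failures per billion device-hours of operation. To get a feel for what the range above means in practice, here is a minimal back-of-the-envelope sketch in Python; it assumes the quoted rates apply per memory module, which is a simplification, since the underlying studies normalize the numbers in their own ways.

```python
# Rough illustration: converting FIT-style rates into expected failures per year.
# FIT = failures per 10^9 device-hours; the per-module assumption is ours, not the study's.

HOURS_PER_YEAR = 24 * 365  # 8760

def expected_failures_per_year(fit_rate: float) -> float:
    """Expected failures of one device over a year of continuous operation."""
    return fit_rate * 1e-9 * HOURS_PER_YEAR

for fit in (200, 5_000, 25_000, 75_000):
    print(f"{fit:>6} FIT -> ~{expected_failures_per_year(fit):.3f} failures per device-year")
```

At the low end of the older studies the expected rate is a few failures per thousand device-years; at the top of Google's range it is roughly one failure every year and a half per device, which is a very different operational picture.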

A closer look revealed that failures tend to occur on the older machines in the fleet: after about 20 months of operation, the number of failures starts to grow rapidly. It is probably no coincidence that a typical IT infrastructure refresh happens around the three-year mark. And the results of these studies will likely become one more argument that postponing planned upgrades soon starts to cost more than it saves.

So the problems found turned out to be “hard errors” rather than “soft errors”, and there were more of them than even the wildest predictions suggested.
Subsequent studies showed a similar picture for the memory chips used by IBM in its Blue Gene systems and for the Canadian SciNet supercomputer. Across all of these systems, the memory failure rate was roughly the same.

Studies by AMD have also shown that hard errors in DRAM chips are much more common than soft errors. But AMD, like Intel, has never published data on the failure rates of the SRAM used inside its microprocessors.
Vilas Sridharan, a reliability architect at AMD and one of the authors of papers on this topic, said:
“This is not a new problem. Errors in DRAM modules were first observed in 1979, and we have been learning about them ever since.”
And Samsung, the world’s largest DRAM manufacturer, said it has
“no detailed information that it could share on this subject.”


Chip manufacturers should pay more attention to hard errors. Today there are many ways to deal with “soft errors”, from error-correcting codes (ECC) to housing servers in old lead mines. With “hard errors”, the situation is not nearly as good.
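To give a sense of what ECC actually does, here is a minimal, illustrative sketch of a Hamming(7,4) single-error-correcting code in Python. It is only a toy model: real server ECC typically protects 64-bit words with wider SECDED codes, but the principle of locating and flipping back a single corrupted bit is the same.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# A single flipped bit can be located and corrected.

def encode(d):
    """d is a list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Return (corrected data bits, 1-based position of the corrected bit or None)."""
    c = list(c)
    # Each syndrome bit re-checks one parity group (1-based codeword positions).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = no error, otherwise the error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], (syndrome or None)

word = [1, 0, 1, 1]
codeword = encode(word)
codeword[4] ^= 1                      # simulate a single-bit "soft error"
data, fixed = decode(codeword)
print(data == word, fixed)            # True 5
```

The catch is that such codes assume a single, transient flip. A chip with a hard error keeps producing faults in the same place, which is exactly the scenario the scheme was not designed around.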
At the same time, “hard errors” cause more failures than most people assume. And while high-end supercomputers and servers can use ECC, that is not the case for PCs: most mobile devices, laptops and desktops have no ECC at all. This is partly because the accepted failure model attributes most errors to “soft errors”, and that model suits the manufacturers. Users do their part by voting with their wallets. When you last chose memory modules for a home (and not only a home) computer, did you treat ECC support as an important criterion?

Meanwhile, ECC matters even more than it used to seem: it is often the difference between a recoverable error and a catastrophic one that leads to forced downtime. Little wonder that the builders of data centers and supercomputers insist on it.
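On Linux servers with ECC memory, that difference between corrected and uncorrected errors is visible through the kernel's EDAC subsystem. Below is a minimal sketch for reading those counters; the exact sysfs layout depends on the platform and kernel, and on machines without ECC the directory simply will not exist, so treat this as an assumption about a typical server setup.

```python
# Read EDAC correctable/uncorrectable error counters on a Linux machine with ECC.
# Requires a kernel with EDAC support and an ECC-capable memory controller.
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_counters():
    if not EDAC_ROOT.exists():
        print("No EDAC memory controllers found (no ECC, or EDAC driver not loaded).")
        return
    for mc in sorted(EDAC_ROOT.glob("mc*")):
        ce = (mc / "ce_count").read_text().strip()  # corrected (recoverable) errors
        ue = (mc / "ue_count").read_text().strip()  # uncorrected (potentially fatal) errors
        print(f"{mc.name}: corrected={ce} uncorrected={ue}")

if __name__ == "__main__":
    read_counters()
```

A steadily growing corrected-error count on one module is often an early sign of exactly the kind of hard error discussed above, and a reason to swap the DIMM before it produces an uncorrectable failure.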
A similar situation, by the way, can be seen with SSDs. Choosing between a 240 GB and a 256 GB model at the same price, most people will pick the second. Only a handful will notice that the underlying capacity is actually the same and that the first model reserves 16 GB for error correction, and even fewer will let that sway the choice toward the first one. I will not name specific models and vendors: that is not what matters here.
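The arithmetic behind that example is trivial but easy to overlook. A tiny sketch, assuming both models really do carry the same 256 GB of raw flash (an assumption for illustration, not a statement about any specific product):

```python
# Over-provisioning arithmetic for the 240 GB vs 256 GB example above.
raw_flash_gb = 256          # both models are assumed to carry the same raw flash
advertised_gb = 240         # the "smaller" model exposes less of it to the user

reserved_gb = raw_flash_gb - advertised_gb
overprovisioning = reserved_gb / advertised_gb
print(f"Reserved: {reserved_gb} GB (~{overprovisioning:.1%} over-provisioning)")
# Reserved: 16 GB (~6.7% over-provisioning)
```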

Unfortunately, these days a BSOD can often be seen on billboards, information displays, ATMs, at airports and in many other places. Who knows whether this will change for the better in the future?

And finally, a thematic demotivator :)

Source: https://habr.com/ru/post/171395/

