
Why computer chips are "aging" faster and what to do about it

Last week the industry publication Semiconductor Engineering ran an article highlighting the trend of chip "aging" in data centers. We decided to take a closer look at the material and explain what is happening in this area.


/ photo JD Hancock CC

Page six of a McKinsey & Company report noted that in 2008 average server utilization in data centers did not exceed 6%. With the growth of cloud data centers and the rising popularity of virtual infrastructure and IaaS, that began to change. According to the NRDC's Data Center Efficiency Assessment report, by 2014 server utilization in cloud environments had reached 65%.
This is because availability is now one of the basic criteria for choosing a cloud provider, so providers try to minimize the permitted downtime of their platforms. For example, if a provider's SLA promises “three nines” (99.9%) availability, downtime can total no more than about 8.8 hours per year. Such terms place heavy demands on the infrastructure, so providers use load balancers to allocate CPU and memory efficiently and keep customer workloads running without interruption.
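
The downtime budget behind an availability target is simple arithmetic. The sketch below is only an illustration: the function name `max_downtime_hours` is made up for this example, and it assumes a 365-day year with availability measured over the whole year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


# Compare "two nines", "three nines" and "four nines".
for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {max_downtime_hours(target):.2f} hours of downtime per year")
```

For "three nines" this gives roughly 8.76 hours per year, which is where the "about 9 hours" figure comes from.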

This approach also saves on equipment cooling and maintenance: according to a study by the Uptime Institute, optimizing server fleets in data centers worldwide would free up about $30 billion. This would let data centers and IaaS providers lower the cost of their services and make them more efficient.

The overheating problem


However, as the author of the Semiconductor Engineering article notes, in a number of data centers the heavier load on processors now makes them run hotter, which accelerates chip aging. As a rule of thumb, for a device with an activation energy of about 0.8 eV operating at 75–125 °C, every 10 °C above the normal temperature can roughly halve its service life.
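
The "twice per 10 degrees" rule of thumb can be checked with the Arrhenius model, which is commonly used to relate activation energy and temperature to failure rates. The article does not name the model, so treat this as an assumption; the function name `arrhenius_acceleration` is likewise invented for this sketch.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_ref_c: float, t_hot_c: float,
                           activation_energy_ev: float = 0.8) -> float:
    """Return how many times faster a device ages at t_hot_c than at t_ref_c.

    Assumes aging follows the Arrhenius model (an assumption, not a claim
    made by the article). Temperatures are given in degrees Celsius.
    """
    t_ref_k = t_ref_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1 / t_ref_k - 1 / t_hot_k)
    return math.exp(exponent)


# Acceleration for a 10-degree rise at several points in the 75-125 C range.
for t in (75, 95, 115):
    factor = arrhenius_acceleration(t, t + 10)
    print(f"{t} C -> {t + 10} C: aging is about {factor:.2f}x faster")
```

Under these assumptions the factor stays close to 2 across the whole range, slightly above it at the cooler end and slightly below it at the hotter end.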

Higher temperatures can also cause failures that are quite difficult to diagnose. One cause is electromigration: the gradual displacement of metal atoms in a chip's conductors by the current flowing through them, which heat accelerates. It shows up as voltage spikes, intermittent short circuits across one or more contacts, and circuit malfunctions (added delays or even open circuits). One example is the failure of a batch of WD hard drives after a year of operation; the cause was electromigration in one of the controllers used in the drives.

A challenge for engineers


Companies use various technologies to reduce the "stress level" on chips and slow the wear of electronics. One is computer-aided design (CAD) software that simulates how a chip will operate before it goes into production. These simulations check interconnects and power supply parameters, analyze static failure risks, and evaluate electromagnetic fields.

CAD systems also help assess the effects of electromigration and flag places where the interconnects between transistors need to be widened or the number of contacts increased to prevent premature failure.

As for thermal modeling, Ralph Iverson, an engineer in the CAD division at Synopsys, says that a “random walk” model is used to track overheating. It is used to optimize the objective function (the heat propagation paths) and to predict how temperature affects boards and chips.
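
To give a feel for how random walks relate to heat, here is a minimal sketch of the classic Monte Carlo approach to steady-state heat conduction on a grid. It is not Synopsys's method. The grid size, the edge temperatures and the names `estimate_temperature` and `edge_temperature` are all assumptions made for illustration.

```python
import random


def estimate_temperature(x, y, width, height, boundary_temp, walks=2000, rng=None):
    """Estimate the steady-state temperature at interior grid cell (x, y).

    Each walk moves to a random neighbouring cell until it reaches the edge
    of the grid. The average edge temperature over all walks approximates
    the solution of the discrete Laplace (steady-state heat) equation.
    """
    rng = rng or random.Random()
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    total = 0.0
    for _ in range(walks):
        cx, cy = x, y
        # Keep walking while the current cell is strictly inside the grid.
        while 0 < cx < width - 1 and 0 < cy < height - 1:
            dx, dy = rng.choice(steps)
            cx, cy = cx + dx, cy + dy
        total += boundary_temp(cx, cy)
    return total / walks


def edge_temperature(cx, cy, width=11):
    """Hypothetical boundary: 40 C on the left edge rising linearly to 100 C on the right."""
    return 40.0 + 60.0 * cx / (width - 1)


# Temperature estimate at the centre of an 11x11 grid (a fixed seed makes the run repeatable).
print(estimate_temperature(5, 5, 11, 11, edge_temperature, rng=random.Random(0)))
```

With this linear boundary the exact answer at the centre is 70 °C, so the estimate should land close to that value. More walks reduce the statistical noise.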


/ photo IT-GRAD Unboxing Cisco UCS M4308 servers

Another direction is the development of systems that track chip "aging" in real time. For example, researchers at the Technical University of Munich proposed estimating a circuit's degradation by monitoring how long signals take to propagate through it. A dedicated software controller measures the signal propagation delay and reports when the permissible level of degradation has been exceeded. The system can then automatically lower the chip's clock frequency and adjust its operating voltage until the device is replaced.
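
The control logic described above could look roughly like the sketch below. It is not the Munich researchers' implementation. The names `OperatingPoint` and `adjust_for_aging`, the 10% degradation limit and the 5% frequency step are hypothetical. The source also does not say in which direction the voltage is adjusted, so the voltage change is a signed parameter that defaults to zero.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    frequency_mhz: float
    voltage_v: float


def adjust_for_aging(point, measured_delay_ns, fresh_delay_ns,
                     max_degradation=0.10, frequency_step=0.05, voltage_delta_v=0.0):
    """Return (operating_point, alarm) for a measured signal propagation delay.

    Degradation is the relative growth of the delay compared with a fresh
    chip. All thresholds and step sizes are illustrative assumptions.
    """
    degradation = measured_delay_ns / fresh_delay_ns - 1.0
    if degradation <= max_degradation:
        return point, False  # Still within the permitted limit, nothing to change.
    derated = OperatingPoint(
        frequency_mhz=point.frequency_mhz * (1 - frequency_step),  # lower the clock
        voltage_v=point.voltage_v + voltage_delta_v,  # signed, direction left to the caller
    )
    return derated, True  # Alarm: the device should be scheduled for replacement.


current = OperatingPoint(frequency_mhz=3000.0, voltage_v=1.0)
new_point, alarm = adjust_for_aging(current, measured_delay_ns=1.2, fresh_delay_ns=1.0)
print(new_point, alarm)
```

A real controller would read delays from on-chip sensors and apply the new settings through the platform's power management interface. Here both ends are reduced to plain numbers.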

Search for new materials


Electronics developers are also starting to look at new materials that can withstand higher loads than silicon. One candidate considered as a silicon replacement is gallium nitride (GaN).

This semiconductor has higher charge carrier mobility and higher thermal conductivity. As a result, gallium nitride transistors are smaller and have higher power ratings. For example, they are used to build and deploy broadband wireless networks, including those that serve data centers.

Materials such as antimonides and bismuthides are also being investigated. They could form the basis of infrared sensors for telecommunications equipment. Another option is compounds of zinc and cadmium with tellurium, which could be useful in particular for alternative energy sources (solar panels).

Still, scientists are not ready to write off silicon. Researchers at Tufts University's REAP Labs are "giving silicon a new life."

They work in the field of silicon photonics, building electro-optical circuits on a single silicon die. This lets chips communicate through optical rather than electrical signals, which speeds up the transfer of large volumes of data and reduces the effect of electromagnetic interference on the system.

IBM is working in this area as well. The company has already managed to place devices made with silicon photonics technology directly on a processor chip.

Such technologies should make it possible to build fundamentally new computing systems that can withstand higher loads in operation.




Source: https://habr.com/ru/post/349480/

