
Temperature control in the data center: why it can sometimes run hotter

Today we will talk about cooling the data center. A group of researchers from the University of Toronto has published a study of an approach to data center cooling in which the temperature is deliberately raised. We took a closer look at this work and what it shows.

/ photo by Emilio Küffer CC

Data centers now account for a significant share of electricity consumption and carbon emissions. A large part of that power goes to cooling, which was the main motivation for research into temperature management. Interestingly, it is still not fully clear what temperature a data center should actually be kept at.

Most companies set the temperature recommended by their equipment vendors, but it is not clear how raising it affects the systems. At the same time, existing research suggests that raising the temperature by just 1 degree can reduce energy consumption by 2-5%.
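To get a feel for what the 2-5% per degree figure could mean over several degrees, here is a minimal Python sketch. It assumes that the per-degree saving compounds, which the article does not state; the function names are hypothetical and only illustrate the arithmetic.

```python
def remaining_energy_fraction(degrees_raised: int, saving_per_degree: float) -> float:
    """Fraction of the original energy use left after raising the setpoint.

    Assumption (not from the study): each extra degree saves the same
    relative share of whatever consumption is left, so savings compound.
    """
    if degrees_raised < 0:
        raise ValueError("degrees_raised must be non-negative")
    if not 0.0 <= saving_per_degree < 1.0:
        raise ValueError("saving_per_degree must be in [0, 1)")
    return (1.0 - saving_per_degree) ** degrees_raised


def saving_range(degrees_raised: int, low: float = 0.02, high: float = 0.05) -> tuple[float, float]:
    """Lower and upper estimate of the total saving for the 2-5% per degree range."""
    return (
        1.0 - remaining_energy_fraction(degrees_raised, low),
        1.0 - remaining_energy_fraction(degrees_raised, high),
    )


if __name__ == "__main__":
    low, high = saving_range(5)
    print(f"Raising the setpoint by 5 degrees: {low:.1%} to {high:.1%} saved")
```

Under this compounding assumption, a 5 degree increase would correspond to a saving of roughly 9.6% to 22.6%. Real savings depend on the cooling plant and the climate, so this is only a rough bound.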

For this reason, the researchers set out to answer the question of how temperature in data centers should be managed. To do this, they collected an extensive set of data from production equipment, which made it possible to study the effect of temperature on the hardware, including the reliability of the storage subsystem, the memory subsystem, and the server as a whole.

Foreword


Although raising the temperature in the data center looks like the easiest way to save electricity and reduce carbon emissions, it comes with several problems. One of them is a possible reduction in system reliability. Unfortunately, there is very little detailed information about how high temperatures affect servers, and what does exist is contradictory.

Some studies report that every 10 °C above 21 °C increases the probability of electronics failure by 50%. Other papers say that every 15 °C doubles the failure rate of hard drives, while a recent Google study found that, on the contrary, it is low temperatures that do more harm to storage devices.

Raising the temperature in data centers brings another problem: reduced server performance. When the temperature reaches a critical point, the processor starts throttling and the fans spin faster. All of this leads to additional leakage power and higher power consumption.

Temperature and reliability


Let's first take a look at two specific hardware components — hard drives and DRAM, since they are replaced most frequently in modern data centers.

Temperature and latent sector errors (LSE) on hard disks

A latent sector error (LSE) is one of the most common types of disk error: individual sectors become inaccessible and the data stored on them is lost (unless the system has redundancy and can recover it). About 3-4% of all disks experience LSEs, and this share only grows as disk capacities grow.

Equipment reliability is influenced by a huge number of factors (load, humidity, voltage fluctuations, how the devices are handled), so the results for each disk model were broken down by data center. As expected, the likelihood of LSEs increases with temperature. However, the increase is much slower than standard estimation models suggest (for example, a model based on the Arrhenius equation). Such models assume an exponential relationship between temperature and the number of errors, which leads to a doubling of the failure rate for every additional 10-15 °C.
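The "doubling every 10-15 °C" rule of thumb can be written down in a few lines of Python. This is only a sketch of the simplified exponential model that the study compares against, not the full Arrhenius equation; the function name and parameters are hypothetical.

```python
def exponential_rate_multiplier(
    delta_celsius: float,
    interval_celsius: float,
    factor: float = 2.0,
) -> float:
    """Failure-rate multiplier predicted by a simple exponential model.

    The rate is multiplied by `factor` for every `interval_celsius`
    degrees of temperature increase (factor=2.0 means "doubles").
    """
    if interval_celsius <= 0:
        raise ValueError("interval_celsius must be positive")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return factor ** (delta_celsius / interval_celsius)


if __name__ == "__main__":
    # Predicted increase for a 30 degree rise under both ends of the 10-15 degree range.
    for interval in (10.0, 15.0):
        multiplier = exponential_rate_multiplier(30.0, interval)
        print(f"doubling every {interval:.0f} degrees -> x{multiplier:.1f}")
```

For a 30 °C rise, the model predicts a failure rate 4 to 8 times higher. The measured growth reported in the study is much slower than that. Calling the function with `factor=1.5` and `interval_celsius=10.0` reproduces the "50% per 10 °C" rule mentioned earlier.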

The researchers ran a statistical analysis and found that, for disks that already experience LSEs, higher temperatures do not increase the number of LSEs. This suggests that the causes of latent sector errors are the same for hot and cold disks. At the same time, the LSE rate for a single disk model can vary from data center to data center.

Within the age range covered by the data, from 0 to 36 months, old disks are as likely to encounter LSEs as new ones. The researchers measured read load as the number of operations performed per month and placed a disk in the low-load group if that number was below the median for the data set, and in the high-load group otherwise. Based on this analysis, they concluded that disk utilization does not affect how the likelihood of LSEs changes with increasing temperature.
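The median split described above can be sketched in Python as follows. The disk identifiers and operation counts are made-up illustration values, not data from the study, and the function name is hypothetical.

```python
from statistics import median


def split_by_median_load(ops_per_month: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split disks into low-load and high-load groups around the median.

    A disk goes to the low-load group if its monthly operation count is
    strictly below the median of the data set, otherwise to the high-load group.
    """
    if not ops_per_month:
        raise ValueError("ops_per_month must not be empty")
    threshold = median(ops_per_month.values())
    low = [disk for disk, ops in ops_per_month.items() if ops < threshold]
    high = [disk for disk, ops in ops_per_month.items() if ops >= threshold]
    return low, high


if __name__ == "__main__":
    # Hypothetical monthly read-operation counts per disk.
    sample = {"disk-a": 100, "disk-b": 500, "disk-c": 900, "disk-d": 300}
    low_load, high_load = split_by_median_load(sample)
    print("low load:", low_load)
    print("high load:", high_load)
```

Once the disks are grouped this way, the LSE rate at each temperature can be compared between the two groups to see whether load changes the effect of temperature.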

Temperature and drive failures


This section looks at how temperature affects the disk failure rate. To answer the question as fully as possible, the researchers took into account the effect of workload as well as the differences between disk models and data centers. The analysis is based on data for 5 different drive models, collected from January 2007 to May 2009 across 19 different Google data centers.

For temperatures below 50 °C, the disk failure rate grows much more slowly than classic models suggest: the number of failures increases only slightly as the temperature rises. Following the same methodology as for LSEs, the disks were grouped by load and by age. As it turned out, neither factor significantly affects the disk failure rate.

The effect of temperature on performance


To study the effect of ambient temperature on server performance, the researchers built a test bench with a thermal chamber. The chamber was large enough to hold an entire server and made it possible to control the temperature in the range from -10 °C to 60 °C with a precision of 0.1 °C.

One of the most popular servers, the Dell PowerEdge R710, was chosen for the experiment. It has a quad-core Intel Xeon 5520 processor running at 2.26 GHz, 8 MB of L3 cache, and 16 GB of DDR3 ECC memory, and it runs Ubuntu 10.04 Server with the Linux 2.6.32-28-server kernel. Hard disks (SAS and SATA) from different vendors were connected to it.

A series of stress tests was carried out using microbenchmarks and macrobenchmarks designed to simulate the workload that real applications create. The benchmarks and workloads used were: STREAM, GUPS, Dhrystone, Whetstone, random write / random read, sequential write / sequential read, OLTP-Mem, OLTP-Disk, DSS-Mem, DSS-Disk, PostMark, BLAST.

All SAS disks and one SATA disk (Hitachi Deskstar) show some decrease in performance at high temperatures, ranging from 5-10% up to 30%. Given that the drop occurs in the same temperature range for all models (and not at an arbitrary point), and that none of the disks reported errors, it is reasonable to assume that the performance degradation is caused by the drives' protective mechanisms kicking in.

Increased server power consumption


Raising the temperature of the air entering electronic equipment can affect the amount of energy it dissipates. Many IT vendors configure their fans to spin faster once the ambient temperature reaches a certain threshold.

Although the amount of energy consumed varies greatly between workloads, in every case it begins to increase once the ambient temperature reaches 30 °C and keeps rising up to 40 °C. The increase in energy consumption reaches 50%, which is substantial.

It can be said with confidence that the differences in power consumption come from the fans: fan speed increases at the same temperatures at which power consumption rises. So as the ambient temperature goes up, energy consumption goes up too, and most of the increase is due to faster fan rotation. Leakage power is extremely small.

Findings


Raising the temperature in data centers could potentially save a huge amount of money on electricity and reduce carbon emissions. Unfortunately, the difficulties this entails are not fully understood, so many data centers still keep their rooms cool. The study shows that temperature has a much smaller impact on equipment reliability than is commonly assumed: DRAM errors and server node failures are only weakly correlated with high temperatures.

These encouraging results shift attention to other temperature-related aspects, such as the increase in the energy consumption of individual servers as the temperature of the air supplied to them rises. The study found that this increase is due to the cooling fans spinning faster, while leakage power is negligible. Much of this energy is wasted because of poorly designed fan speed control algorithms.

However, the picture is not simple enough to allow general recommendations or predictions about what the temperature in a data center should be and how much energy can be saved. The answers depend on too many factors related to the data center's location and purpose. Still, the results show that most organizations can "warm up" their equipment a little without sacrificing system performance or reliability.

Source: https://habr.com/ru/post/273575/

