
Dumb ways to die, or why data centers “go down”



Running a data center or a server room is a bit like driving on a highway. While the road is empty, you can take risks and break the rules, and nothing terrible will happen. But once the road fills with cars, any wrong maneuver or overlooked pothole can cause an accident. It is the same with data centers and server rooms: the heavier the load, the higher the cost of a mistake.


Today I’ll talk about the design, construction and operation mistakes that can cause an outage in a data center.


Errors at the design and construction stage


I already have a separate article on design mistakes. It mainly lists things that make a data center inconvenient to operate; here I’ll cover the ones that really hurt.


The design leaves out an entire system. Some people believe a data center can do without a guaranteed power supply system, that is, without a diesel generator set (DGU). One customer whose data center design I was auditing asked what Uptime fault-tolerance level the facility would get without a DGU. I could not come up with anything better than Tier 0.


Many see the DGU as a backup that can be dropped if need be: it is only a spare, after all. In fact, it is treated as the primary source, because it is the only power supply we fully control.


Single point of failure. Here are possible options:



Calculation errors. Here are the most painful mistakes in the power distribution system:



And for the cooling system:



Errors in operation


Even a data center built from an exemplary design can be ruined by poor operation. Below are the mistakes in managing the engineering infrastructure that can lead to outages.


Unbalanced load across phases and feeds. Cable and circuit breaker capacity is used efficiently only when the load is spread evenly across the phases. When one or two phases are overloaded and the others are underloaded, you get what is called phase imbalance, and the available capacity is wasted. In the worst case, the cable overheats and the breaker trips.
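To make the imbalance tangible, here is a minimal sketch. The current readings and the 25 A breaker limit are made-up numbers for illustration, and the deviation-from-mean metric is just one simple way to express imbalance, not a standard the article prescribes.

```python
def phase_imbalance(currents_a):
    """Largest deviation from the mean phase current, as a fraction of the mean."""
    mean = sum(currents_a) / len(currents_a)
    if mean == 0:
        return 0.0
    return max(abs(i - mean) for i in currents_a) / mean


def overloaded_phases(currents_a, breaker_limit_a):
    """Indices of phases whose current exceeds the breaker limit."""
    return [idx for idx, i in enumerate(currents_a) if i > breaker_limit_a]


# Hypothetical readings in amperes for phases L1, L2, L3.
# Both sets carry the same total of 60 A.
balanced = [20.0, 20.0, 20.0]
skewed = [30.0, 20.0, 10.0]
```

With the same total load, the balanced set stays under a 25 A limit on every phase, while the skewed set overloads the first phase.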


With feeds, the story is as follows: in a data center with 2N power redundancy, when one feed is disconnected, the other takes over its load. For the remaining feed to carry the doubled load, each feed must be loaded to no more than half of its rated capacity, inrush currents included. Otherwise, the redundancy on the second feed will not save you.
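The 2N condition boils down to a single inequality, sketched below. The kW figures and the `inrush_margin` parameter are assumptions for illustration; how much margin to allow for inrush currents depends on the actual equipment.

```python
def survives_feed_loss(feed_a_kw, feed_b_kw, feed_rated_kw, inrush_margin=0.0):
    """True if one feed alone can carry the combined load of both.

    inrush_margin is a hypothetical allowance for inrush currents,
    expressed as a fraction of the combined load (0.1 means +10%).
    """
    combined_kw = (feed_a_kw + feed_b_kw) * (1.0 + inrush_margin)
    return combined_kw <= feed_rated_kw


# Hypothetical feeds rated at 100 kW each.
ok = survives_feed_loss(40.0, 45.0, 100.0)          # 85 kW fits on one feed
too_much = survives_feed_loss(60.0, 55.0, 100.0)    # 115 kW does not
tight = survives_feed_loss(48.0, 48.0, 100.0, 0.1)  # 96 kW plus 10% does not
```

Note the third case: each feed sits just under half of its rating, yet the check fails once the margin is included.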


Both conditions must hold at the same time. Monitoring the load distribution all the way from the transformers to the PDUs lets you watch the system at as many points as possible. How to set this up is described in this article.


Breaker settings. To maintain selectivity, the trip current of a circuit breaker is deliberately set below its nominal rating. Later, during operation, when an extra load has to be connected, people forget about the setting and go by the breaker's nominal rating. As a result, if the connected load exceeds the setting, the breaker trips.
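A small sketch of that trap, with an invented breaker (100 A nominal, set down to 63 A). The `Breaker` class and its field names are hypothetical and only illustrate which number a new load should be checked against.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    nominal_a: float  # rating printed on the device, in amperes
    setting_a: float  # trip setting configured for selectivity, in amperes


def headroom_a(breaker, present_load_a):
    """Amperes that can still be added before the configured setting is reached."""
    return breaker.setting_a - present_load_a


def can_connect(breaker, present_load_a, extra_load_a):
    """Check the new total against the setting, not the nominal rating."""
    return present_load_a + extra_load_a <= breaker.setting_a


# Hypothetical breaker: 100 A nominal, set down to 63 A.
feeder = Breaker(nominal_a=100.0, setting_a=63.0)
```

With 50 A already connected, another 20 A looks fine against the 100 A nominal rating but exceeds the 63 A setting, so the breaker would trip.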


Operating instructions and procedures. The server room or data center is in a pre-emergency state, and the engineer has little idea what to do or whom to call. It is even worse when the person on duty decides to do nothing. Procedures and instructions save you from confusion and lost time during an emergency.


But not all procedures are equal: if one was written just to tick a box, has never been updated, and has never been tested in drills, you can assume there is no procedure at all.


Even when every scenario has been worked out, the procedures and instructions should always be at hand, on paper and in electronic form, so that no time is wasted looking for them during an emergency. Hang posters with brief instructions at the workplace of the engineer who starts the recovery operation. Put equipment operating instructions directly on the equipment enclosure. You can add checklists to the instructions so that the engineer marks off each action; this makes it less likely that a step is skipped.


An equipment layout plan helps to locate a problem in the data center quickly; it must also be up to date and within the engineers' reach.


Labeling. What does labeling have to do with outages, you might ask? Everything. For example, switching a tripped breaker back on takes a couple of minutes. But with no diagrams and no labels, it turns into a real quest with a good chance of long downtime. Or another situation: some equipment has to be switched off for repair. You open the panel, and all the breakers look identical and carry no labels. Estimate for yourself how likely you are to switch off the wrong one.


Monitoring. In small server rooms, monitoring of the engineering infrastructure may be absent altogether, or may not cover all systems. Then you get situations like these: an air conditioner shuts down on Sunday evening, but the engineer finds out only on Monday morning, when the server room has turned into a sauna. Or utility power failed and the diesel generator did not start; it was noticed only when alerts arrived about a problem on one of the server power feeds. In both cases, a major outage could have been prevented by minimal monitoring with SMS or email alerts.


Data center monitoring has its own nuances: it has to be configured properly. For example, set sensible thresholds. If the dashboard is permanently red with critical errors, the monitoring is misconfigured. Such monitoring quickly stops being informative for the engineer: false alarms pile up, and real incidents go unnoticed among routine alerts.
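A rough sketch of what checking the thresholds themselves can look like. The temperature values, the 27 °C and 32 °C limits, and the idea of looking at the share of critical readings are illustrative assumptions, not recommended settings.

```python
def classify(value, warn, crit):
    """Map a reading to 'ok', 'warning' or 'critical' by two thresholds."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def critical_share(readings, crit):
    """Fraction of readings at or above the critical threshold."""
    if not readings:
        return 0.0
    return sum(1 for r in readings if r >= crit) / len(readings)


# Hypothetical inlet temperatures in degrees Celsius.
history = [33.0, 34.0, 35.0, 31.0]
share = critical_share(history, crit=32.0)
```

If most of the history counts as critical, as it does here, either the hall really is in trouble or the threshold does not match how the hall normally runs.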


What else could cause an accident?


Let's see what can go wrong in cooling, power supply (power distribution, uninterruptible power supply, diesel generator set) and fire suppression.


Cooling. In a cooling system, it can all start with a few air conditioners failing, for example because their outdoor units are clogged with poplar fluff. If the hall is heavily loaded and there is no longer enough cooling, local overheating sets in. Freon-based air conditioners are very sensitive to inlet air temperature, so as it rises, the other units start shutting down on fault. This “domino effect” leaves the hall without cooling.


For chiller systems, the worst case is loss of pressure in the loop, for example because of a leak. Then the whole system stops, not just a single air conditioner. To catch this in time, monitor the pressure, install more leak sensors, and consider topping up the loop with the help of storage tanks and additional pumps.
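Pressure monitoring can watch the trend as well as the absolute value. Below is a minimal sketch; the sample values, the 1.5 bar minimum and the 0.01 bar per minute limit are invented for illustration and would have to come from the actual system's documentation.

```python
def pressure_drop_rate(samples):
    """Change in bar per minute between the first and last sample.

    samples is a list of (minutes, bar) pairs in time order;
    a negative result means the pressure is falling.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    return (p1 - p0) / (t1 - t0)


def leak_suspected(samples, min_bar, max_drop_per_min):
    """Flag a low absolute pressure or a pressure that falls too fast."""
    current_bar = samples[-1][1]
    falling_fast = -pressure_drop_rate(samples) > max_drop_per_min
    return current_bar < min_bar or falling_fast


# Hypothetical loop pressure readings: (minutes, bar).
leaking = [(0, 2.5), (10, 2.4), (20, 2.1)]
stable = [(0, 2.5), (20, 2.5)]
```

The leaking series is still above the minimum pressure, but it is flagged because it loses about 0.02 bar per minute.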


Uninterruptible power supply. Apart from outright UPS failure, which maintenance and timely repair can prevent, there is an interesting issue: the real battery runtime does not match the estimate on the UPS display. I mean, of course, the case where the display shows more than there really is. For example, during maintenance of the switchboards between the DGU and the UPS, when the batteries carry the entire load, the operations team counts on one runtime but actually gets a couple of minutes less.


You can avoid this embarrassment by periodically running a “controlled” battery discharge and plotting the discharge curve. The battery runtime is calculated from this curve, and the reading on the UPS screen is calibrated against it. To be on the safe side, round the result down. It is like a watch: better to have it run fast and arrive early than to be late.
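As a sketch of the calculation, assume the discharge test gives a series of (minutes, volts) points at a constant load and that runtime ends when the voltage reaches a cutoff. The curve, the 42 V cutoff and the linear interpolation between points are all illustrative assumptions; real calibration follows the UPS vendor's procedure.

```python
import math


def runtime_minutes(curve, cutoff_v):
    """Minutes until the voltage first reaches cutoff_v, rounded down.

    curve is a list of (minutes, volts) pairs in time order.
    Returns None if the cutoff is never reached in the recorded data.
    """
    for (t0, v0), (t1, v1) in zip(curve, curve[1:]):
        if v0 >= cutoff_v >= v1 and v0 != v1:
            # Linear interpolation inside the segment that crosses the cutoff.
            t = t0 + (v0 - cutoff_v) / (v0 - v1) * (t1 - t0)
            return math.floor(t)
    return None


# Hypothetical discharge curve: (minutes, volts).
curve = [(0, 54.0), (5, 50.0), (10, 46.0), (15, 40.0)]
runtime = runtime_minutes(curve, cutoff_v=42.0)
```

Here the cutoff is crossed a little after the 13-minute mark, and rounding down gives 13 minutes as the figure to plan with.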


Guaranteed power supply. Failures can occur at any stage of DGU operation:



Fire suppression:



I’ll stop here, although these are of course not all the reasons a data center can “go down”. Share your stories in the comments. If you have had an outage and could not work out the cause, write here or to consulting@dtln.ru. Let's try to figure it out together.


Other articles on the design and operation of data centers:


Monitoring of engineering infrastructure in the data center. Part 1. Highlights
Monitoring of engineering infrastructure in the data center. Part 2. Power supply system
Maintenance of engineering data center systems: what should be in the contract
Errors in the project of the data center that you feel only during the operation phase
The path of electricity in the data center
How to test the DGU in the data center
DataLine experience: how we prepare duty engineers for our data centers
DataLine experience: how the technical support service works



Source: https://habr.com/ru/post/329744/

