The topic of major accidents in modern data centers raised questions that the first article did not answer, so we decided to develop it further.
According to Uptime Institute statistics, most incidents in data centers are related to power supply failures, which account for 39% of cases. The human factor comes next with another 24% of accidents. The third most common cause (15%) was air conditioning system failure, and natural disasters came fourth (12%). All other troubles together account for only 10%. Without questioning the data of a reputable organization, we will highlight what different accidents have in common and try to understand whether they could have been avoided. Spoiler: in most cases, yes.
To put it simply, there are only two problems with power supply: either there is no contact where there should be one, or there is contact where there should not be. You can talk at length about the reliability of modern uninterruptible power supply systems, but they do not always save the day. Take, for example, the much-publicized case of the data centers used by British Airways, whose parent company is International Airlines Group. Two such facilities, Boadicea House and Comet House, sit not far from Heathrow Airport. In the first of them, on May 27, 2017, an accidental power outage led to an overload and failure of the UPS system. As a result, part of the IT equipment was physically damaged, and recovering from the accident took three days.
The airline had to cancel or reschedule more than a thousand flights, and about 75 thousand passengers could not fly on time; $128 million was spent on compensation, not counting the cost of restoring the data centers. The story of what caused the blackout remains murky. According to the results of an internal investigation announced by Willie Walsh, chief executive of International Airlines Group, it happened because of an engineer's error. However, the uninterruptible power supply system should have withstood such a shutdown - that is exactly why it was installed. The data center was managed by specialists from the outsourcing company CBRE Managed Services, so British Airways tried to recover the damages through a court in London.
Power supply accidents follow a similar scenario: first the power goes out through the fault of the electricity supplier, sometimes due to bad weather or internal problems (including personnel errors), and then the uninterruptible power supply system fails to cope with the load, or a brief interruption of the sine wave brings down a set of services whose recovery takes a great deal of time and money. Can such accidents be avoided? Of course, if the system is designed correctly, although even the builders of large data centers are not immune to errors.
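To give a sense of why correct design matters, here is a back-of-the-envelope sketch in Python of how redundancy changes the theoretical availability of a UPS installation. The 98% per-unit figure and the assumption of independent failures are illustrative only, not taken from the article or any vendor datasheet.

    # Rough illustration: probability that at least one of N independent
    # UPS units is available at any given moment. The 0.98 per-unit
    # availability is an assumed value for the example.

    def parallel_availability(unit_availability: float, units: int) -> float:
        """Probability that at least one of `units` independent units works."""
        return 1.0 - (1.0 - unit_availability) ** units

    single = 0.98          # assumed availability of one UPS unit
    for n in (1, 2, 3):    # N, N+1, N+2 configurations for a load served by one unit
        a = parallel_availability(single, n)
        downtime_hours = (1.0 - a) * 8760
        print(f"{n} unit(s): availability {a:.6f}, ~{downtime_hours:.1f} h downtime/year")

Even one redundant unit cuts the theoretical downtime from days to hours per year, but the calculation assumes the units fail independently, and an operator error during switching, as in the British Airways case, breaks that assumption for every unit at once.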
When the wrong actions of data center personnel become the immediate cause of an incident, the problems most often (but not always) affect the software part of the IT infrastructure. Such accidents happen even in large corporations. In February 2017, an incorrectly typed command by a member of the technical maintenance team took some of the Amazon Web Services servers offline. The error occurred while debugging the billing process for Amazon Simple Storage Service (S3) cloud storage clients. The employee tried to remove a number of virtual servers used by the billing system, but accidentally hit a much larger set.
As a result of the engineer's error, servers running important Amazon cloud storage software were removed. The first to suffer was the indexing subsystem, which holds the metadata and location information for all S3 objects in the US-EAST-1 region. The incident also affected the subsystem used to host data and manage the available storage space. After the virtual machines were removed, both subsystems required a complete restart, and then Amazon engineers were in for a surprise: for a long time the public cloud storage could not handle customer requests.
The effect was massive, since many large resources use Amazon S3. The outages affected Trello, Coursera, IFTTT and, most painfully, the services of major Amazon partners from the S&P 500 list. The damage in such cases is hard to count, but its order of magnitude was in the region of hundreds of millions of US dollars. As you can see, one wrong command is enough to take down the service of the largest cloud platform. This is not an isolated case: on May 16, 2019, during maintenance work, the Yandex.Cloud service deleted users' virtual machines in the ru-central1-c zone that had at least once been in SUSPENDED status. Client data was affected, and part of it was irretrievably lost. Of course, people are imperfect, but modern information security systems have long been able to control the actions of privileged users before the commands they enter are executed. If such solutions had been implemented at Yandex or Amazon, incidents like these could have been avoided.
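The kind of safeguard mentioned above can be sketched in a few lines. Below is a minimal, hypothetical Python example of a guard that checks a privileged removal request against safety limits before executing it; all names and numbers in it (guarded_removal, MAX_REMOVAL_FRACTION, the fleet sizes) are invented for illustration and are not taken from Amazon's or Yandex's actual tooling.

    # Hypothetical guard rail: before a privileged command removes capacity,
    # check how much it would take out and refuse anything that exceeds a
    # safe limit or drops the fleet below its minimum serving capacity.

    MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a fleet at once
    MIN_REMAINING = 100           # never drop below the minimum serving capacity

    def guarded_removal(fleet_size: int, requested: int) -> int:
        """Return how many servers may actually be removed, or raise if unsafe."""
        if requested > fleet_size * MAX_REMOVAL_FRACTION:
            raise ValueError(
                f"Refusing to remove {requested} of {fleet_size} servers: "
                f"exceeds the {MAX_REMOVAL_FRACTION:.0%} safety limit"
            )
        if fleet_size - requested < MIN_REMAINING:
            raise ValueError("Refusing: fleet would fall below minimum capacity")
        return requested

    # A small, legitimate request goes through:
    print(guarded_removal(fleet_size=5000, requested=12))

    # A mistyped, oversized request is rejected instead of being executed:
    try:
        guarded_removal(fleet_size=5000, requested=1200)
    except ValueError as err:
        print("blocked:", err)

The point is not the specific thresholds but the principle: a destructive command entered by a privileged user is validated against the state of the system before it takes effect, so a typo cannot silently turn into an outage.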
In January 2017, a major accident occurred in MegaFon's Dmitrov data center. The temperature in the Moscow region had dropped to −35 °C, which led to the failure of the facility's cooling system. The operator's press service said little about the causes of the incident - Russian companies are extremely reluctant to talk about accidents at their facilities, and in terms of openness we are far behind the West. On social networks a version circulated about coolant freezing in pipes laid outdoors and an ethylene glycol leak. According to that version, the operations service could not quickly obtain 30 tonnes of coolant because of the long holidays and got by with improvised means, organizing makeshift free cooling in violation of the system's operating rules. The severe cold aggravated the problem - in January, winter suddenly arrived in Russia, even though no one expected it. As a result, the staff had to de-energize part of the server racks, which is why some of the operator's services were unavailable for two days.
One could probably call this a weather anomaly, but such frosts are not unusual for the capital region. Winter temperatures in the Moscow region can drop even lower, which is why data centers there are designed for stable operation at −42 °C. Most often, cooling systems fail in the cold because of an insufficient concentration of glycol and an excess of water in the coolant solution. Problems also arise from pipe installation mistakes or from miscalculations in the design and testing of the system, usually driven by the desire to save money. As a result, a serious accident that could well have been prevented happens out of the blue.
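As an illustration of the glycol point, here is a small Python sketch that checks whether a coolant mix stays liquid at the design temperature. The freezing points in the table are rounded, textbook-style values for ethylene glycol and water mixtures, included only to show the trend, not to be used as engineering data.

    # Illustrative check: does a given ethylene glycol concentration keep the
    # coolant liquid at the site's design temperature? Values are approximate.

    FREEZING_POINT_C = {   # % glycol by volume -> approximate freezing point, °C
        0: 0, 10: -3, 20: -8, 30: -15, 40: -24, 50: -37, 60: -52,
    }

    def survives(glycol_percent: int, design_temp_c: float) -> bool:
        """True if the mixture is still liquid at the design temperature."""
        # pick the closest tabulated concentration at or below the given one
        known = max(c for c in FREEZING_POINT_C if c <= glycol_percent)
        return FREEZING_POINT_C[known] < design_temp_c

    print(survives(30, -42))   # False: too much water, the outdoor loop freezes
    print(survives(60, -42))   # True: enough glycol for the Moscow design point

Diluting the coolant saves money on glycol right up until the first serious frost, which is exactly the failure mode described above.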
Most often it is thunderstorms and/or hurricanes that disrupt the engineering infrastructure of a data center, leading to service outages and/or physical damage to equipment. Weather-provoked incidents happen quite often. In 2012, Hurricane Sandy, accompanied by heavy rain, swept across the east coast of the United States. The Peer 1 data center, located in a high-rise building in Lower Manhattan, lost its external power supply after salt water flooded the basements. The facility's emergency generators were placed on the 18th floor, and their fuel supply was limited: the rules introduced in New York after the 9/11 terrorist attacks prohibit storing large amounts of fuel on upper floors.
The fuel pump also failed, so for several days the staff hauled diesel up to the generators by hand. The team's heroism saved the data center from a serious accident, but was it really necessary? We live on a planet with a nitrogen-oxygen atmosphere and plenty of water; thunderstorms and hurricanes are common here, especially in coastal areas. The designers should probably have taken these risks into account and built an appropriate uninterruptible power supply system, or at least chosen a more suitable location for the data center than a high-rise on an island.
In this category, the Uptime Institute lumps together a variety of incidents, among which it is hard to pick a typical one. Theft of copper cables, cars crashing into data centers, power transmission poles and transformer substations, fires, excavators damaging fiber optics, rodents (rats, rabbits and even wombats, which are actually marsupials), as well as people who like to practice shooting at wires - the menu is extensive. Even an illegal marijuana plantation can cause a power outage. In most cases, specific people are to blame for the incident, so we are again dealing with the human factor: the problem has a first and last name. Even if at first glance an accident seems tied to a technical malfunction or a natural disaster, it can be avoided if the facility is properly designed and properly operated. The only exceptions are cases of critical damage to the data center's infrastructure or the destruction of buildings and structures in a natural disaster. That really is force majeure; all other problems are caused by the gasket between the computer and the chair - perhaps the most unreliable component of any complex system.
Source: https://habr.com/ru/post/452962/