
Major accidents in data centers: causes and effects

Modern data centers are reliable, but any equipment breaks down from time to time. In this short note we have collected the most significant incidents of 2018.



The impact of digital technology on the economy is growing, the volume of information being processed is increasing, and new facilities are being built. All of this is good as long as everything works. Unfortunately, the economic impact of data center outages has also grown since companies began placing business-critical IT infrastructure in them: this is an inevitable consequence of digitalization. Below is a short selection of the most notable accidents that occurred in different countries last year.



USA


This country is a recognized leader in data center construction. The US hosts most of the large commercial and corporate data centers that serve global services, so the consequences of incidents there are the most significant. In early March, a powerful cyclone caused power supply problems at four Equinix facilities. The sites housed Amazon Web Services (AWS) equipment, and the outage made many popular services unavailable: GitHub, MongoDB, NewVoiceMedia, Slack, Zillow, Atlassian, Twilio, and Capital One were affected, as was the Amazon Alexa virtual assistant.


In September, severe weather hit Microsoft data centers in Texas: a thunderstorm disrupted the power supply of the entire region, and in one data center that had switched to diesel generator power, the cooling shut down for reasons that were not established. It took several days to eliminate the consequences of the accident. Thanks to load balancing the failure did not become critical, but users around the world noticed a slight slowdown in Microsoft cloud services.


Russia


The most serious accident occurred on August 20 in one of Rostelecom's data centers. It left the servers of the Unified State Register of Real Estate down for 66 hours, so they had to be moved to a backup site. Rosreestr was able to restore the handling of applications across all channels only on September 3, and the state agency is now seeking to recover a large sum from Rostelecom for violating the service level agreement.


On February 16, problems in the Lenenergo grid triggered the backup power supply system at the Xelnet data center (St. Petersburg). A brief interruption of the AC sine wave disrupted many services: the large cloud provider 1cloud was affected, but the most noticeable problem for Russian Internet users was that the social network VKontakte became unreachable. Most interestingly, it took about 12 hours to fully eliminate the effects of this short power failure.


European Union


Several serious incidents were recorded in the EU in 2018. In March, the data center of the air carrier KLM failed: the power supply was cut for 10 minutes, and the capacity of the diesel generator sets was insufficient to run the equipment. Some of the servers shut down, and the airline had to cancel or reschedule dozens of flights.


This was not the only incident affecting air traffic: in April, the power supply system of the Eurocontrol data center failed. The organization manages air traffic in the European Union, and while specialists spent 5 hours eliminating the consequences of the accident, passengers again had to endure delayed and rescheduled flights.


Accidents in data centers serving the financial sector cause very serious problems. Interruptions in transaction processing are usually costly here, and the facilities are built to a correspondingly high level of reliability, but this does not prevent incidents. On April 18, the Nordic NASDAQ stock exchange (Helsinki, Finland) could not trade throughout Northern Europe during the day: a gas fire suppression system was triggered without authorization in the DigiPlex commercial data center, which then suffered an abnormal power shutdown.


On June 7, data center disruptions forced the London Stock Exchange (LSE) to postpone the start of trading by an hour. Also in June, the services of the international payment system VISA were down for a whole day due to a data center failure; the details of the incident were not disclosed.


Japan


In the summer of 2018, a fire broke out on an underground level of an Amazon data center under construction in a Tokyo suburb, killing 5 workers and injuring at least 50. The fire damaged about 5,000 m² of the facility's premises. The investigation found that human error was the cause: careless handling of acetylene torches ignited the insulation.


Causes of failure


The above list of incidents is far from complete. Customers of banks and telecom operators suffer from accidents in data centers, cloud service providers go offline, and even emergency services are disrupted. Even a brief interruption in service can lead to serious losses. According to the Uptime Institute, the largest share of failures (39%) is related to the power supply system. The human factor is in second place (24%), and the air conditioning system is third (15%). Natural phenomena account for only 12% of accidents in data centers, and only 10% occur for reasons other than those listed.
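
The breakdown above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the figures are the percentages quoted in the text, while the dictionary name and category labels are chosen here for illustration and do not come from the Uptime Institute.

```python
# Shares of data center failures by cause, in percent, as quoted in the text
# (attributed there to the Uptime Institute). Labels are illustrative.
failure_shares = {
    "power supply system": 39,
    "human factor": 24,
    "air conditioning system": 15,
    "natural phenomena": 12,
    "other causes": 10,
}

# The five categories should cover all failures.
total = sum(failure_shares.values())

# Rank causes from most to least frequent.
ranked = sorted(failure_shares.items(), key=lambda item: item[1], reverse=True)

# Combined share of the two leading causes.
top_two_share = sum(share for _, share in ranked[:2])

print(f"total: {total}%")
for cause, share in ranked:
    print(f"{cause}: {share}%")
print(f"top two causes combined: {top_two_share}%")
```

The five categories add up to 100%, and the two leading causes, power supply and the human factor, together account for 63% of the failures in this breakdown.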


Despite strict reliability and safety standards, no facility is immune to incidents. Most of them are caused by power outages or human error. Owners of data centers and server rooms should address these two factors first, and customers should understand that even market leaders cannot guarantee absolute reliability. If the equipment or cloud service supports critical business processes, it is worth considering a backup site.


Photo source: telecombloger.ru



Source: https://habr.com/ru/post/451834/

