
Last year, Delta lost more than $150 million. The cause was a failure at the Delta data center, which we wrote about at the time. The company in question is Delta Air Lines, and the failure, at its data center in Atlanta, USA, left many thousands of passengers unable to fly. Like those of almost any company, Delta Air Lines' data centers have redundant systems that take over if something goes wrong. Tens of millions of US dollars had been invested in backup systems, but when they were needed they simply did not work properly.
Power never switched from the main supply to the backup generator, and the servers simply shut down once the UPS batteries were drained. The incident took the company's data center out of service. This month, almost a year later, James Hamilton, a vice president at Amazon Web Services, analyzed what happened. In particular, he said the problem arose from several rare failures following one after another. According to him, though, this happens more often than is commonly assumed.
He has already seen this very rare combination twice in his career, and the Delta case is the third. This particular case is also the most instructive. First, its negative impact was quite high. Second, the incident has already been analyzed in detail. Third, such events are rare enough that few people manage to prepare before zero hour arrives.
To begin with, recall that Delta had to cancel 1,000 flights on the first day, 775 the next day, and 90 the day after that. As mentioned above, the company lost about $150 million. Airline margins are thin to begin with, so a loss of this size can only be recovered over several years.
Incidentally, data center problems occur much more often than is publicly reported. In this particular case everything simply came to light: however much it might have wanted to, the airline could not hide anything.
But what actually happened? The report stated that "the mechanism for switching the main power supply to the emergency one failed, as a result of which the backup system did not turn on." To better understand the nature of the problem, it helps to recall what equipment is normally used for this switchover.
In a normal situation, electricity enters the data center through medium-voltage transformers and switchgear and reaches the uninterruptible power supplies (UPS), which are the final power source for critical equipment such as servers, storage, and network gear. In that same normal situation, the switchgear usually only monitors the quality of the incoming power.
Photo: A Delta Air Lines employee helps a passenger whose flight has been canceled understand the situation.
If the switchgear detects a fault, it waits a few seconds (in most cases) for the situation to return to normal. If power is still absent or its parameters are out of range, the emergency generators are started. A generator also needs only a few seconds to come up. Once it has reached stable operation and the power it produces meets the required parameters, the load is switched to the generator and disconnected from the main source. During the few seconds the switchgear needs to assess the situation and act, the UPS batteries supply the necessary current; there is no way to do without them here. As soon as the main source "comes back to life", the load is switched back.
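The switchover sequence described above can be sketched as a small timing model. This is only an illustration: the `TransferConfig` class, the `simulate_outage` function, and all timing values are assumptions made for the example, not figures from the incident report or from any real switchgear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferConfig:
    # All values are illustrative assumptions, not data from the Delta incident.
    utility_wait_s: float = 5.0      # wait for utility power to recover on its own
    generator_start_s: float = 10.0  # time for the generator to reach stable output
    ups_runtime_s: float = 300.0     # how long the UPS batteries can carry the load


def simulate_outage(outage_s: float, generator_starts: bool,
                    cfg: TransferConfig = TransferConfig()) -> str:
    """Return how the load fares during a utility outage of `outage_s` seconds."""
    if outage_s <= cfg.utility_wait_s:
        # Short glitch: utility power returned before a generator start was needed.
        return "ups_bridged"
    if generator_starts:
        # The UPS only has to cover the wait plus the generator start-up time.
        time_on_ups = min(outage_s, cfg.utility_wait_s + cfg.generator_start_s)
        return "generator" if time_on_ups <= cfg.ups_runtime_s else "load_dropped"
    # No generator: the UPS must carry the load for the whole outage.
    return "ups_bridged" if outage_s <= cfg.ups_runtime_s else "load_dropped"


if __name__ == "__main__":
    for outage, gen_ok in [(2, True), (3600, True), (3600, False)]:
        print(outage, gen_ok, simulate_outage(outage, gen_ok))
```

With these assumed numbers, an hour-long outage is survivable only if the generator actually starts. Without it, the load drops as soon as the batteries run out.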
In most cases everything goes as it should. Problems are so rare that the vast majority of companies never experience a switchgear failure in their power infrastructure. But if the switchgear does fail, the company can face problems and losses, as Delta did. How can it fail? Generator manufacturers use special software that monitors the voltage on the network during a fault. If the voltage is too high, or the control logic "does not like" something else, the generator simply does not start. A generator can cost a million dollars or more, and the equipment manufacturer considers it best not to put the generator at risk.
But in some cases a million dollars is nothing compared to the total losses from an outage, so data center engineers may prefer to start the generator even at the risk of damaging it. In Delta's case the technicians could do nothing, because the control logic locked out the expensive generator (it was mentioned at the start for a reason that tens of millions of US dollars had been invested in the backup system). Within 5 to 10 minutes the UPS is drained and the servers and other equipment shut down. Delta also had a fire.
So where does Amazon come in? The vice president of that company once ran into a similar problem himself. He had left the data center and was a fair distance away when messages about UPS outages began arriving one after another. When he returned, he understood exactly what had happened: the situation was similar to the one at the Delta data center, only without a fire. Surprisingly, the switchgear manufacturer refused to help remove the lockout from the generator and start it, even though the data center team was prepared to accept the risk of equipment damage. As a result, Amazon also suffered losses, though not as significant as Delta's. In Amazon's case, the team then worked with the switchgear manufacturer, and custom software was created that starts the generator in such fault conditions whenever the situation requires it.
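The difference between the default lockout behavior and the custom behavior described above comes down to one policy decision. The sketch below is hypothetical: the `Policy` enum and the `generator_start_allowed` function are names invented for this illustration and do not reflect any vendor's actual firmware or the software Amazon commissioned.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical policy names, introduced only for this example.
    PROTECT_GENERATOR = "protect_generator"  # lock out on any anomaly (default described in the article)
    PROTECT_LOAD = "protect_load"            # start anyway, accepting possible generator damage


def generator_start_allowed(voltage_anomaly: bool, policy: Policy) -> bool:
    """Decide whether the generator may start after a utility failure."""
    if not voltage_anomaly:
        # Nothing suspicious on the network: both policies allow the start.
        return True
    # Anomaly detected: only the load-first policy still starts the generator.
    return policy is Policy.PROTECT_LOAD


if __name__ == "__main__":
    for policy in Policy:
        print(policy.name, generator_start_allowed(True, policy))
```

Which policy is right depends on what the operator is willing to risk. The article's point is that the choice should belong to the data center team, not only to the equipment manufacturer.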
In most cases the generator will run normally, although the load may be slightly higher than usual. It makes no sense to protect the generator while the data center is without power; that is the wrong priority. When hundreds of millions of US dollars are at stake, losing another few hundred thousand or a million is minor. In Delta's case, locking out the generator led to the consequences already described and a loss of about $150 million.
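The cost argument can be checked with simple arithmetic, using the article's round figures of about $1 million for a generator and about $150 million for the outage. The function name and the `damage_probability` parameter are assumptions for this sketch. It deliberately ignores safety risks and the possibility that a forced start still fails to keep the load up.

```python
def cheaper_to_force_start(generator_cost: float, outage_loss: float,
                           damage_probability: float = 1.0) -> bool:
    """Return True if the expected generator damage costs less than the outage.

    damage_probability is a hypothetical input; 1.0 is the worst case in which
    a forced start always destroys the generator.
    """
    expected_damage = damage_probability * generator_cost
    return expected_damage < outage_loss


if __name__ == "__main__":
    # Round figures taken from the article: ~$1M generator, ~$150M outage loss.
    print(cheaper_to_force_start(1_000_000, 150_000_000))
```

Even under the worst-case assumption that the generator is always destroyed, the comparison favors starting it when the outage loss is two orders of magnitude larger.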
