
Amazon EC2 Failure Report

Amazon has finally published a detailed report on the causes of the failure on Thursday, April 21, during which one of the Availability Zones in the US East region was almost completely unavailable for two days, while other zones in the same region experienced intermittent problems.

The root cause of the failure was a mistake in a network configuration change in an Amazon Elastic Block Store ("EBS") cluster during routine scaling work. The change was meant to increase the capacity of the primary network. One of the standard steps in this procedure is shifting traffic away from one of the loaded routers so it can be upgraded; that traffic is supposed to stay on the primary network. Because of the error, the traffic was rerouted incorrectly: instead of the primary network, it was sent to a low-bandwidth network (EBS uses two networks: the primary one for client traffic and a second one that the EBS nodes in a cluster use to communicate with each other and to replicate data).
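
To make the two-network picture concrete, here is a toy sketch of the routing step; the network names and capacity figures are invented for illustration and are not taken from Amazon's report.

    PRIMARY_CAPACITY_GBPS = 100        # assumed capacity of the primary (client traffic) network
    REPLICATION_CAPACITY_GBPS = 10     # assumed capacity of the low-bandwidth replication network

    def network_can_absorb(load_gbps, target):
        # Returns True if the chosen network can carry the redirected load.
        capacity = PRIMARY_CAPACITY_GBPS if target == "primary" else REPLICATION_CAPACITY_GBPS
        return load_gbps <= capacity

    # Intended change: move the router's traffic onto the rest of the primary network.
    print(network_can_absorb(60, "primary"))       # True  -> upgrade proceeds normally
    # Actual change: the traffic was shifted onto the replication network instead.
    print(network_can_absorb(60, "replication"))   # False -> the network saturates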

The second network is not designed for such traffic volumes, so the EBS nodes lost contact with each other. When a node's replication peer stops responding, the node starts looking for another node to replicate to. Once the network was restored, this triggered a chain of events, including a re-mirroring storm. Free space in the EBS cluster was exhausted almost instantly. In addition, under this extreme load some EBS nodes began to fail, which further increased the number of requests.
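
A rough simulation of this re-mirroring cascade might look as follows; the node count, capacities and "stuck" behaviour are assumptions made for illustration, not details from the report.

    import random

    NODES = 20
    CAPACITY_PER_NODE = 10                                       # replica slots per node (assumed)
    cluster = [CAPACITY_PER_NODE // 2 for _ in range(NODES)]     # cluster starts half full

    def find_new_replica_target(cluster, size=1):
        # A node whose replication peer stopped responding looks for any node with free space.
        candidates = [i for i, used in enumerate(cluster) if used + size <= CAPACITY_PER_NODE]
        return random.choice(candidates) if candidates else None

    # When the replication network comes back, every volume whose peer looked dead
    # tries to re-mirror at once: far more requests than there is free space.
    stuck_volumes = 0
    for _ in range(NODES * CAPACITY_PER_NODE):
        target = find_new_replica_target(cluster)
        if target is None:
            stuck_volumes += 1               # volume stays blocked for reads and writes
        else:
            cluster[target] += 1             # each re-mirror consumes free space

    print("free slots left:", sum(CAPACITY_PER_NODE - used for used in cluster))
    print("volumes left stuck:", stuck_volumes)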

Another important detail: while a volume is being re-mirrored, it is temporarily blocked for reads and writes (to preserve data integrity). Since the entire EBS cluster in one of the zones was unavailable for reading and writing, all EC2 instances that used this storage also became stuck.
To restore these instances and stabilize the EBS cluster in this zone, Amazon had to disable all of the EBS control APIs (including Create Volume, Attach Volume, Detach Volume and Create Snapshot) for almost the entire duration of the recovery work.
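
From a client's point of view, those control-plane operations are ordinary API calls. The sketch below uses today's boto3 library, which did not exist in 2011; the region, size and retry policy are assumptions, and it shows roughly what callers would have seen while the APIs were disabled.

    import time
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def create_volume_with_backoff(az, size_gib, attempts=5):
        # Call CreateVolume, backing off when the control plane returns errors.
        for attempt in range(attempts):
            try:
                return ec2.create_volume(AvailabilityZone=az, Size=size_gib)
            except ClientError as err:
                # While the control APIs were disabled, every call ended up here.
                print("CreateVolume failed:", err.response["Error"]["Code"])
                time.sleep(2 ** attempt)
        raise RuntimeError("EBS control API unavailable")

    # The other disabled operations looked the same from the caller's side:
    #   ec2.attach_volume(VolumeId="vol-...", InstanceId="i-...", Device="/dev/sdf")
    #   ec2.detach_volume(VolumeId="vol-...")
    #   ec2.create_snapshot(VolumeId="vol-...")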

On the first day, there were two periods when the damaged EBS cluster in the affected zone degraded the operation of other zones in the US East region, even though in theory this should never happen (according to Amazon's service terms, Availability Zones, even within one region, are completely independent of each other). Yet it happened here: because of the degradation of the EBS API, calls to these APIs from other zones were slow and returned a high rate of errors during the two periods mentioned. The reason is that the EBS API control plane for the entire region is effectively a single service, even though it is geographically distributed across all Availability Zones. As a result, users in other zones also received error messages when they tried to create new instances.
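
The regional nature of the control plane is visible even in client code: whatever Availability Zone you target, the request goes to the same regional endpoint. A minimal sketch with modern boto3 follows; the AMI and zone names are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    print(ec2.meta.endpoint_url)       # https://ec2.us-east-1.amazonaws.com

    for az in ("us-east-1a", "us-east-1b"):
        # The Placement parameter picks the zone, but the request still goes
        # through the single regional control plane that degraded during the outage.
        ec2.run_instances(
            ImageId="ami-12345678",    # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )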

The situation was aggravated by the fact that the Amazon Relational Database Service (RDS) also relies on EBS to store databases and logs, so after the failure of one EBS cluster, part of the RDS databases became unavailable: at the peak, up to 45% of the instances in this zone that had chosen the "single-AZ" option were affected. In other words, the impact on hosting customers was much worse than that of the EBS failure itself (which affected 18% of the volumes at its peak), because a single RDS database uses several EBS volumes to improve performance.
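
The "single-AZ" choice mentioned above is just a flag on the database instance. A hypothetical sketch with today's boto3 API shows the trade-off; the identifiers, engine and sizes are made up.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.create_db_instance(
        DBInstanceIdentifier="example-db",
        DBInstanceClass="db.t3.micro",
        Engine="mysql",
        MasterUsername="admin",
        MasterUserPassword="change-me-please",
        AllocatedStorage=20,
        MultiAZ=True,   # False means "single-AZ": the database lives in one
                        # Availability Zone and fails together with that zone's EBS cluster
    )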

After 12 hours of struggle, the engineers managed to isolate the problem to a single EBS cluster, after which the painstaking restoration of that cluster began. Some volumes had to be restored manually, and 0.07% of volumes (and 0.4% of single-AZ databases) were lost completely; they could only be restored from backups. Although API functionality and access to the cluster were restored on April 23 (two days after the initial failure), restoration of the last volumes continued until Monday.
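
In RDS terms, restoring from backup means recreating the instance from a snapshot. A hedged sketch with modern boto3; the instance and snapshot identifiers are placeholders.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Recreate a lost single-AZ database from its most recent automated snapshot.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="example-db-restored",
        DBSnapshotIdentifier="rds:example-db-2011-04-21-00-00",
    )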

Amazon promised all customers with instances in the affected Availability Zone, regardless of whether they were in the failed EBS cluster or not, 10 days of free use of EBS volumes, EC2 instances and RDS database instances, based on their current usage. Customers do not need to do anything to receive the compensation: it will be applied automatically to the next bill. You can check whether the credit applies to you on the AWS Account Activity page.

Independent experts praised Amazon for its openness and the unusually deep description of the EBS and EC2 technical infrastructure. The report's only shortcoming is a certain vagueness about the initial failure: nothing is said about what exactly the error in the network configuration was. Although it is not stated explicitly, one phrase in the report suggests that human error is to blame: "We will audit our change process and increase the level of automation to prevent similar mistakes in the future," the official statement says.

As for the EBS problems themselves, the report calls the failure "a very complex operational event caused by several interdependent factors, which gives us many opportunities to protect the service from any repetition of similar events in the future."

Source: https://habr.com/ru/post/118434/

