
AWS Is Down: Why the Skies Fell

On April 21 at 01:41 Pacific time, a serious failure occurred in one of the data centers of Amazon Web Services, the "cloud" behind many sites. Several major projects (Reddit, Quora, Foursquare) went offline or were badly hit. I have already seen plenty of misinformation implying that the affected sites' problems stem purely from the laziness of their engineers, but in this case the reason is different. Here is why.



AWS has two availability concepts: Regions and Availability Zones (AZs). There are five regions: two in the USA (the west and east coasts), one in Europe (Ireland) and two in Asia (Tokyo and Singapore). Each region contains several AZs, which are supposed to be isolated from one another and share no common point of failure, short of a natural disaster or something of similar scale.



AWS says that by "launching instances in separate Availability Zones, you can protect your applications from the failure of a single location." By "location" they mean "physically distinct, independent infrastructure." Apparently these are separate data centers (although this is not obvious from Amazon's description: they may be different floors or rooms within one data center, with independent power supplies and separate network links; translator's note). They should not fail simultaneously unless a catastrophe strikes the entire region.



Availability Zones also offer "inexpensive, low-latency network connectivity to other Availability Zones in the same Region." Inter-region traffic, on the other hand, goes over the public Internet and is comparatively expensive, slow and unreliable.


These are the "rules of the game." So if you play by AWS's rules and set up a master/slave MySQL database (to take the most common example), you should logically put the master and the slave in the same region, but make sure they are in different Availability Zones. You would not place them in different regions, because then you would have to push replication traffic over the slow, expensive inter-region channels, and you would most likely run into more database synchronization problems. You are at risk only if, say, a hurricane hits the east coast and destroys the data center(s); barring that, everything should be fine, at least if AWS keeps its promises.
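The placement logic above can be sketched as a toy Python snippet. This is not an AWS API call, and the region and AZ names are illustrative examples; it only models the decision rule: same region for cheap, low-latency replication, different AZs to avoid a shared point of failure.

```python
# Toy illustration of the "rules of the game" (not a real AWS API):
# choose two distinct Availability Zones in one region for a
# MySQL master/slave pair. Region and AZ names are examples.

REGION_AZS = {
    "us-east-1": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "eu-west-1": ["eu-west-1a", "eu-west-1b"],
}

def place_master_slave(region: str, azs=REGION_AZS) -> dict:
    zones = azs[region]
    if len(zones) < 2:
        raise ValueError("need at least two AZs in the region for failover")
    # Same region: replication stays on the cheap, low-latency
    # intra-region links. Different AZs: no common point of failure,
    # at least as promised by AWS.
    return {"master": zones[0], "slave": zones[1]}

placement = place_master_slave("us-east-1")
print(placement)  # {'master': 'us-east-1a', 'slave': 'us-east-1b'}
```

If either assumption breaks (as happened in this outage, where multiple AZs in one region failed together), this placement no longer protects you.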



So, in the end, we have a problem. Yesterday several zones in the east-coast region became unavailable at once. AWS violated its own promises about Availability Zone failure scenarios. This means AWS has some common point of failure (assuming nothing utterly improbable happened, such as simultaneous physical damage to several independent infrastructures). The sites that are still down followed the "rules of the game" quite correctly; the problem is that AWS did not follow its own specification. Whether this happened through incompetence, dishonesty, or something more forgivable, we simply do not know yet. But the developers at Quora, Foursquare and Reddit are very competent, and it would be wrong to point accusations at them.



Of course, it is theoretically possible to protect yourself even against catastrophic failures spanning several Availability Zones, but for most businesses the extra development effort and cost are not worth it (and may even be counterproductive by complicating the system). I am sure that all the sites currently down have backups. The problem is that bringing them back online may be difficult and risky: in practice you need to move everything to another region, otherwise the network latency between your servers will be too high. On AWS this is extremely hard: servers in different regions have different sets of options and different AMI identifiers, and I believe reserved instances cannot be moved between regions at all; in reality, failing over to another region is almost impossible. It probably takes about as much effort as migrating to a completely different cloud, which may well be the best recovery option after a disaster like this. As far as we know, Quora started this process the very minute AWS crashed and has not finished it yet; it can take a day.



So, in short, the blame here lies solely with AWS, which "guaranteed" conditions it then violated. Mistakes happen, but this one is AWS's mistake.



And this is not a failure of cloud hosting as such. What this case shows is that you need to choose your provider carefully. I suspect many people will now reconsider their choice of AWS.



A few more thoughts:

Source: https://habr.com/ru/post/118001/


