Cloud fall

While preparing to announce ETegro Therascale, our new integrated data center solution aimed at cloud services (we will try to cover it in detail very soon), we took an interest in how often the largest cloud services fail. The resulting collection of facts seemed interesting enough that we decided to share it with you. It holds no discoveries or secrets, and the list does not claim to be exhaustive, but it may give you something to think about regarding cloud services.

We will start, however, not with the largest provider but with one well known to Habr readers: Selectel. On the evening of September 24, problems with switches triggered a complex failure that lasted 11 hours. We will not go into the details here, since they are laid out well in the company's blog.

A fairly recent and rather anecdotal incident involved Windows Azure. On August 2, this cloud service was unavailable to users in Western Europe for two and a half hours. The cause was a "safety valve" mechanism, designed to prevent cascading failures in the network, which did not work properly while capacity was being added.

In June, Amazon suffered from power problems and generator failures. On the 29th, this led to a 20-minute server shutdown followed by an hour and a half of recovery. It affected 7% of instances in one Availability Zone (AZ) of the US-East-1 region. The victims included such well-known companies as Netflix and Instagram. Curiously, the failure also exposed a bug in ELB that significantly slowed the transfer of load to other AZs.

On February 29, Windows Azure was unavailable for about 7 hours. The culprit in this case was the date itself, which caused an error in the handling of a security certificate (a real "Y2K problem strikes back").

And on January 20, problems at the Equinix data center in the famous Silicon Valley ruined the lives of 5 million users of Zoho services for several hours. The data center lost power for only a few seconds, but repairing the databases took far longer.

And that is only this year. Plenty more comes to mind from 2011.

For example, on August 7, problems with a 10-kW generator in an Irish data center, at first mistakenly attributed to a lightning strike, knocked out Microsoft Business Productivity Online Suite and Amazon EC2 for 3 hours, and Amazon needed more than 24 hours to recover. More problems followed the next day, this time in the Americas region and caused by network links.

Before that came 13 hours of trouble with the same Amazon EC2 in the US-East-1 region, caused by problems with EBS (Elastic Block Store). An added irony was the date: April 21, 2011, exactly the day on which Skynet declared war on humanity in one well-known film. Artificial intelligence had nothing to do with it, of course, but instances in Northern Virginia were fully restored only after 3 days.

But why do we keep talking about Amazon? In September 2011, a day apart, Google Docs first became inaccessible for half an hour, and then almost all Microsoft cloud services went down for a few hours: SkyDrive, Hotmail, Office 365.

Gmail is worth remembering too: in the last days of February 2011, 0.02% of its users found their mailboxes empty. Fortunately, nothing was lost, and the data was restored within 30 hours. But the incident reminded the IT world once again that software errors can affect even multiple copies of the same data, and that tape backups, by virtue of how they work, can protect even against that.

This is not a complete list, only the largest cases. Looking at the failure statistics, it is easy to see that most incidents have one of two causes: power problems or software errors. We are glad that hardware, which is what we actually deal with, does not appear in these "reports", and that everything ended without serious data loss, although hardly anyone can estimate the losses from the downtime. We deliberately refrain from making our own assessment and instead ask you a question. You personally, and your company: how much do you trust cloud services? Are you ready to use them, or are you using them already?

Source: https://habr.com/ru/post/158317/
