In late February, Amazon faced an outage that disrupted not only large (and not-so-large) websites but also Internet of Things applications. It took about five hours to restore service, and during that time the company could not even update the status page for its own services. Launching new EC2 instances in the affected AWS region was also impossible.
The unavailability of Amazon S3 buckets in Northern Virginia disrupted services such as Docker's Registry Hub, GitHub, GitLab, Quora, Medium, Twitch.tv, Heroku, Coursera, Bitbucket, and others.
Photo: Emilio Küffer / CC
Amazon engineers regained control of the situation during the day and updated all status information. According to official statements, service functionality was restored within five hours of the first error report.
On March 2, Amazon published a detailed report on the incident and its causes. The company stated that the outage was triggered by an employee who, while working on a performance issue in the billing system, entered an incorrect command in the production environment.
“The Amazon Simple Storage Service team was debugging an issue that was causing the S3 billing system to run slower than expected. At approximately 9:37 AM PST (19:37 Moscow time), an authorized team member, following an approved playbook, executed a command that was intended to remove a small number of active servers in one of the S3 subsystems,” Amazon representatives said. “Unfortunately, one of the command's parameters was entered incorrectly, and a larger set of servers was removed than intended.”
S3 has experienced tremendous growth over the past few years, so restarting the affected subsystems and running the necessary safety checks took longer than the support team expected.
Such incidents can happen even to the most prominent providers; we have run into problems of very different kinds ourselves. What matters is how the company responds. The human factor plays a huge role here and is often the primary cause of outages.
One of the largest incidents we have had affected about 10% of our clients: the network services at the edge of our cloud failed completely. Within a couple of hours, however, we had identified the fault and analyzed the actions that led to it.
Rapid analysis in such a situation is possible only with prior experience of solving similar problems and enough competent specialists who are ready to get on a call even at night. A quick initial assessment is the first thing to do, and that is exactly what we did: we set up an emergency conference call instead of waiting for the next working day.
The next step is notifying customers, and here it is important to share the most complete information available at that point. Once a more detailed investigation is done, the amount of compensation owed to customers under the SLA should be determined.
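As an illustration, here is a minimal sketch of how such an SLA credit might be estimated from measured downtime; the availability tiers and credit percentages are hypothetical and will differ from one contract to another.

```python
# Hypothetical SLA credit calculator: tiers and percentages are illustrative only.
HOURS_IN_MONTH = 30 * 24

# (minimum monthly availability %, credit as a fraction of the monthly fee)
CREDIT_TIERS = [
    (99.9, 0.00),   # SLA met, no credit
    (99.0, 0.10),   # below 99.9% but at least 99.0%
    (95.0, 0.25),   # below 99.0% but at least 95.0%
    (0.0,  0.50),   # below 95.0%
]

def sla_credit(downtime_hours: float, monthly_fee: float) -> float:
    """Return the credit owed for a given amount of downtime in one month."""
    availability = 100.0 * (1 - downtime_hours / HOURS_IN_MONTH)
    for min_availability, credit_fraction in CREDIT_TIERS:
        if availability >= min_availability:
            return monthly_fee * credit_fraction
    return monthly_fee * CREDIT_TIERS[-1][1]

# Five hours of downtime is roughly 99.3% monthly availability,
# which lands in the 10% credit tier under these made-up numbers.
print(sla_credit(downtime_hours=5, monthly_fee=100.0))  # -> 10.0
```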
For an outage of a few hours, the payment owed to any individual client may not look significant, but for a young IaaS provider the total compensation can be a serious blow to its finances. It is important to know your limits and weigh the reputational risks.
For us, good relationships with customers are far more valuable than short-term savings, so we decided to round the SLA payments up in most cases. This move, combined with the quick resolution of the problem, brought in a large number of reviews praising how we handled the situation.
Events like this should always be followed by a thorough post-mortem. Amazon says it will take precautions against a repeat of the situation, including limiting how much capacity its operational tools can remove at once and partitioning the service into smaller cells to reduce the potential damage. The company also plans to improve its audit processes to ensure the necessary checks are in place.
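A minimal sketch of such a safeguard is shown below; the function names and minimum-capacity thresholds are hypothetical and only illustrate the idea of refusing to remove more capacity than a subsystem can afford to lose.

```python
# Hypothetical safeguard around a capacity-removal tool: refuse any request
# that would take a subsystem below its minimum required capacity.
class CapacityError(Exception):
    pass

MIN_REQUIRED_SERVERS = {"index": 40, "placement": 20}  # made-up thresholds

def remove_servers(subsystem: str, count: int, active_servers: dict) -> None:
    """Remove `count` servers from `subsystem`, but only if enough capacity remains."""
    remaining = active_servers[subsystem] - count
    minimum = MIN_REQUIRED_SERVERS[subsystem]
    if remaining < minimum:
        raise CapacityError(
            f"refusing to remove {count} servers from '{subsystem}': "
            f"{remaining} would remain, below the required minimum of {minimum}"
        )
    active_servers[subsystem] = remaining
    print(f"removed {count} servers from '{subsystem}', {remaining} remain")

servers = {"index": 50, "placement": 30}
remove_servers("index", 5, servers)      # fine: 45 servers remain
try:
    remove_servers("index", 10, servers)  # would drop below the minimum of 40
except CapacityError as err:
    print(err)
```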
Note that many services suffered from Amazon's mistake because not all developers distribute their applications across multiple data centers, which is what keeps the failure of one key location from dragging the entire platform down with it. This is also worth paying attention to when choosing an IaaS provider: its infrastructure should anticipate failures and rebalance the load if any single element goes down.
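As a simple illustration of that idea, here is a sketch of a read path that falls back to a replica in a second region when the primary S3 region is unreachable. The bucket names and regions are hypothetical, and the approach assumes the data has already been replicated to the second bucket.

```python
# Sketch of a read path with a cross-region fallback (bucket/region names are hypothetical).
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REPLICAS = [
    ("us-east-1", "myapp-assets-us-east-1"),
    ("eu-west-1", "myapp-assets-eu-west-1"),  # assumed to hold a replicated copy
]

def fetch_object(key: str) -> bytes:
    """Try each region in order and return the first successful read."""
    last_error = None
    for region, bucket in REPLICAS:
        s3 = boto3.client("s3", region_name=region)
        try:
            response = s3.get_object(Bucket=bucket, Key=key)
            return response["Body"].read()
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # this replica is unavailable, try the next one
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error

# data = fetch_object("images/logo.png")
```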
P.S. In the next installment we will talk about our experience with various payment gateways, the difficulties of integrating them, and the solutions we arrived at.