After reading “Cloudmouse deleted all virtual servers” a while back, along with comments in the style of “serves me right, I should have trusted a proven cloud,” I decided to tell my own horror story with the highly respected Amazon cloud (AWS). I briefly mentioned it on the podcast, but here, it seems to me, the details and impressions of the whole nightmare are worth telling in full.
I have been using AWS for work and personal projects for a relatively long time (over 3 years) and consider myself a fairly advanced user. Out of AWS's rich catalog of services I have worked with, or at least tried out, most of the offering, but a handful of basic services undoubtedly get used most often: EC2 (instances / virtual machines) inside a VPC (private networks), the EBS volumes attached to them, ELB (load balancer), plus Route53 (DNS). From these five you can assemble all kinds of networked virtual machine configurations, and if you add S3 for data storage, you get what is probably the minimal gentleman's set of the most popular AWS services.
The reliability of these systems varies, and they come with different SLAs, mostly very impressive ones. In practical use I had never had any critical problems with AWS. That does not mean everything runs absolutely without interruption, but with a properly organized system, where a sensible user does not put all the eggs in one basket and, at a minimum, distributes services across AZs (availability zones), it was possible to come out of all the rare outages and glitches without any real losses and with minimal headache.
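To make "not all eggs in one basket" a bit more concrete, here is a minimal sketch of the idea using boto3; the AMI ID, instance type, and the subnet-to-AZ mapping are placeholders of my own, and a real setup would more likely use an Auto Scaling group spanning several AZs.

# Minimal sketch: spread identical instances across several availability zones
# so the loss of a single AZ does not take out the whole service.
# AMI ID and subnet IDs are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# One subnet per availability zone (placeholder IDs).
subnets_by_az = {
    "us-east-1a": "subnet-aaaa1111",
    "us-east-1b": "subnet-bbbb2222",
    "us-east-1c": "subnet-cccc3333",
}

for az, subnet_id in subnets_by_az.items():
    resp = ec2.run_instances(
        ImageId="ami-12345678",    # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,        # the subnet pins the instance to its AZ
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "worker"},
                     {"Key": "az", "Value": az}],
        }],
    )
    print(az, resp["Instances"][0]["InstanceId"])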
Of what I encountered in real use, the most common situations were a planned reboot (“your Amazon EC2 instances were scheduled to be rebooted for required host maintenance”) and a planned migration to new hardware (“EC2 has detected degradation of the underlying hardware”). In both cases nothing crippling happened, and the instance came back after the reboot with all of its data on EBS intact. A couple of times something odd happened to an Elastic IP address, which suddenly detached itself from its instance, and once I completely lost routing to one of the virtual machines. All of these cases fell into the category of “yes, it happens, but rarely and painlessly” and did not cause any particular fear or anger. If something unexpected did happen, I could always contact their support service and get help, or at least a clear explanation.
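Both kinds of notices mentioned above (scheduled reboots and hardware degradation or retirement) are also exposed through the API, so they can be polled rather than waited for in e-mail. A minimal sketch with boto3; the region is a placeholder:

# Minimal sketch: list scheduled maintenance/retirement events for EC2 instances,
# the same kind of events announced by the "scheduled to be rebooted" and
# "degradation of the underlying hardware" notices.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

statuses = ec2.describe_instance_status(IncludeAllInstances=True)
for st in statuses["InstanceStatuses"]:
    for event in st.get("Events", []):
        # Event codes include instance-reboot, system-reboot,
        # system-maintenance, instance-retirement, instance-stop.
        print(st["InstanceId"], event["Code"],
              event.get("NotBefore"), event["Description"])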
And then it happened. On January 26, one of the instances, which was supposed to come up automatically around 5 pm, refused to start. Attempts to start it from the AWS console put it into the “initializing” state, and after a few seconds it dropped back to a hopeless “stopped”. No logs were produced, because things apparently never got as far as booting the OS. And at first glance there was no explanation of what exactly had happened. On closer inspection I managed to spot a suspicious message next to the volume list: “Internal error” with an error code. Going to the section where all my EBS volumes are listed, I found both volumes from the deceased instance in a red state and with the laconic message “Error”.
It was strange and unpleasant, but not fatal. After all, being a true paranoid, I take snapshots of every volume on every instance daily and keep them anywhere from one week to six months. Restoring from snapshots on AWS is a trivial task. But then came the truly strange and very scary part: all the snapshots of these two volumes had also turned into an “error” and could not be used. When I say “all,” I mean the entire history, all 7 days' worth that were being kept. I don't know about you, but I found it hard to believe my eyes. I had never seen anything like it and had never heard of anything like it. The sheer unreality of it kept me from panicking: I was sure this was a glitch in their console and that, of course, it could not be that both EBS volumes and their snapshots were lost at the same time. After all, the theory that all of this is stored side by side, or even on a single disk array that suddenly died, completely contradicts their description of how this machinery works.
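For context, the daily snapshot routine mentioned above is easy to automate; below is a minimal sketch of that kind of job using boto3. The tag used to mark automated snapshots and the retention period are my own arbitrary choices, not a description of the exact setup I ran.

# Minimal sketch of a daily snapshot job with simple retention:
# snapshot every attached volume, then delete automated snapshots
# older than RETENTION_DAYS. Tag name and retention are arbitrary.
import boto3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 7
ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# 1. Snapshot every volume currently attached to an instance.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.status", "Values": ["attached"]}]
)["Volumes"]
for vol in volumes:
    snap = ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description="daily backup of " + vol["VolumeId"],
    )
    ec2.create_tags(Resources=[snap["SnapshotId"]],
                    Tags=[{"Key": "made-by", "Value": "daily-backup"}])

# 2. Prune automated snapshots older than the retention window.
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
old = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:made-by", "Values": ["daily-backup"]}],
)["Snapshots"]
for snap in old:
    if snap["StartTime"] < cutoff:
        ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])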
Calling support (a separate, paid service), I got an Indian technician. Until then all my calls to support had landed on local and very competent specialists who usually impressed me with how quickly they grasped the problem. This one was also intelligent, but very slow. After I explained what was wrong, he disappeared for about 15 minutes, off running his checks. From time to time he came back to report that he and a team of specialists were investigating the problem. There were several such deep dives, and I spent about an hour on the phone with him. The end result was disheartening: everything was lost. Naturally, I demanded an explanation and a full analysis of the causes of the incident, but all he could tell me was “please accept our apologies, but nothing can be done, your data has been lost.” To my question “how can that be, I did everything right, I had snapshots, how did this happen?” he offered to refund the money charged for storing those lost snapshots, which of course is laughter through tears. I kept pressing for explanations, though, and he reluctantly admitted that the problem was connected to a failure of the new instance type (C4) and had already been fixed. He did not explain exactly how it was connected or what exactly had been repaired, but promised to send an e-mail with a full report and all the answers.
From the report they sent the next day:
During a recent integrity check we discovered unrecoverable corruption in 1 of your volumes. We have changed the state of the volume to “error.” In addition, snapshots made from these volumes are also unrecoverable, so we have changed their status from “Completed” to “Error.” You will no longer be charged […] you can use it at your convenience.

Please note that you will not be able to launch instances using AMIs referencing these snapshots. Instructions for removing AMIs is available here (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/deregister-ami.html).

Although EBS volumes are designed for reliability, including being backed by multiple physical drives, we are still exposed to durability risks when multiple component failures occur. We publish our durability expectations on the EBS detail page here (http://aws.amazon.com/ebs/details).

We apologize for the inconvenience this may have caused you. If you have any further questions, please contact us at: aws.amazon.com/support.
There is not even a whiff of an answer here. Apart from a factual error (“in 1 of your volumes” when there were 2) and the standard boilerplate brush-off, there is nothing useful in it. I could not accept an answer like that, however, and turned to our PM from AWS (Amazon assigns one to more or less significant customers), telling him about the catastrophe that had occurred. I called late at night and left a message. I did not particularly mince words or hide the degree of my shock, and made it very clear what exactly I thought about all this. Thirty minutes later he called me back and offered to set up a call for the morning between me and everyone at AWS who had the answers.
By the way, for all the severity of the failure, it did not cause any painful consequences for our production systems. First, this node was not the only one of its kind, and second, it was fully dockerized and almost completely immutable. Restoring it, or rather building a new one, was a matter of 10 minutes: ansible rolled out everything needed to launch and manage containers, and then delivered and started all the required containers. Since this instance was part of an end-of-day data processing system, there was no unique data on it, and it pulled everything it needed for its work from external sources. However, if something like this happens to an instance holding genuinely important and unique data, with no independent backup yet in place, it can turn into a very big problem.
Amazon brought an impressive team to the call. In addition to our PM and the solutions architect assigned to us, there were several specialists from the EBS group, a couple of engineers from the support service (it seemed to me, the one I had spoken with and his boss), and a certain person who introduced himself by first name only. The conversation lasted about an hour and went in the direction I set. I wanted answers to 3 questions:
what exactly happened,
what they are doing so that it does not happen again, and
what I can do to protect my systems from similar things in the future. I also tried to understand for myself whether Amazon shared my sense of horror at such an event.
Yes, of course, they clearly understood that this was something out of the ordinary. Concern and utmost seriousness about the incident sounded in every word, and more than once they said in plain terms that this was indeed a critical problem and that such a thing should never have happened. From their analysis of the incident I understood the following: the disaster did not happen on the 26th but a week earlier, during the initial creation of the instance. And for the entire week the volume was partially corrupted, or at the very least something was found in it, upon discovering which they decided to make it inaccessible. All my attempts to clarify what exactly was broken did not get very far; all they could say was that integrity had been violated at the logical level and that such a problem was impossible to repair. At this point the connection with the lost snapshots became clear: they had all been taken from the problem volumes and were therefore marked as failed.
So it turns out that for a week I had been using a virtual machine with volumes that had something wrong with them from the very beginning. And for a week their systems did not detect any problems, nor, indeed, did I notice anything strange. Naturally, I asked the reasonable question: why did their detection systems not notice anything for a whole week, and what happened that made them notice it now? And, in addition: if I had lived on such a faulty service for a week, why act so harshly and in such a hurry now? Why not warn me in advance and give me time to take action before the data and virtual machines were deleted?
The answer to the detection question was given along the lines that this was precisely the root of all the problems, and not a problem with the C4 type at all, as the first-line support engineer had told me. I must say this statement sounded somewhat uncertain; maybe something there really was tied to the C4s, but they did not admit it. At the same time, everyone fervently assured me that these new C4s are quite reliable and that I can safely use them. Looking ahead, I will say that over the past six months I have used them repeatedly, very actively, in many CPU-heavy tasks, and these instance types have not caused any more strange problems.
But the answer to the question of what all the haste was for I did not get at all. In my opinion, they were forbidden to discuss it with the customer, for reasons I can only guess at. In the spirit of pure conspiracy theory, I can suppose some kind of leak of someone else's data onto my volumes, in which case this whole harsh story could at least somehow be justified.
As an answer to the question “what are they doing so that this does not happen again,” they assured me that they are doing everything possible, but gave no details, citing the fact that I had refused to sign a non-disclosure agreement and that more detailed answers were therefore impossible. Incidentally, this was not the first time they had offered me an NDA, but I have always refused because of the absurdity of the NDA form they offered, under which (if you follow its letter literally) I could not even leave a critical review on the Amazon site where they sell books and all sorts of other things. As for the last question (“what can I do to protect my systems”), here they talked a lot, willingly, and mostly in platitudes. In effect there is nothing I can do here except always be prepared for the worst, which I understood even without this team of specialists.
Summing up, in the part where something instructive is customary, let me remind you once again: everything breaks eventually, and even the best of cloud providers can fail. And you have to be prepared for this every day. In my case a combination of partial readiness and luck saved me, but I took away a few lessons from this incident, running through my head “how would I survive if another part of the system broke in several places at the same time.” As a result I took a number of rather paranoid measures, such as additionally backing up certain data to other, non-Amazon clouds, reviewing all the unique nodes and making them at least duplicated, and shifting the entire cloud infrastructure much further toward being disposable, i.e. anything can be killed and rebuilt without sacrificing overall functionality.
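The non-Amazon backups mentioned above depend on whichever second provider you pick, but even within AWS one cheap extra bit of paranoia follows directly from this story: verify that recent snapshots are actually in the “completed” state (mine flipped to “error” without any notification) and keep copies of the good ones in a second region. Below is a minimal sketch with boto3; the region names and the one-day window are placeholders of my own.

# Minimal sketch: check the health of recent snapshots and copy completed
# ones to a second region as an additional layer of backup.
# Region names and the time window are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

SRC_REGION = "us-east-1"
DST_REGION = "us-west-2"

src = boto3.client("ec2", region_name=SRC_REGION)
dst = boto3.client("ec2", region_name=DST_REGION)

cutoff = datetime.now(timezone.utc) - timedelta(days=1)
snapshots = src.describe_snapshots(OwnerIds=["self"])["Snapshots"]

for snap in snapshots:
    if snap["StartTime"] < cutoff:
        continue
    if snap["State"] == "error":
        # The failure mode from this story: a snapshot silently marked as error.
        print("ALERT: snapshot", snap["SnapshotId"], "of volume",
              snap["VolumeId"], "is in error state")
    elif snap["State"] == "completed":
        # Keep an extra copy outside the original region.
        copy = dst.copy_snapshot(
            SourceRegion=SRC_REGION,
            SourceSnapshotId=snap["SnapshotId"],
            Description="cross-region copy of " + snap["SnapshotId"],
        )
        print("copied", snap["SnapshotId"], "->", copy["SnapshotId"])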
This incident could have been the most serious blow to my initiative to move from 3 physical data centers to the cloud, had the whole company not already been worn out to the limit by supporting its own hardware. Over the year and a half I had worked at this organization, we survived a fire (a real one, with smoke, flames, and the loss of an entire rack), attacks from other machines in neighboring racks, numerous hardware failures, wild drops in bandwidth due to external causes, a sysadmin drunk for a week, and other great adventures. So even an incident this serious could not shake our, or rather my, determination to move completely to the clouds, which we finished doing in April of this year.