If the power goes out at your home, most likely nothing serious will happen to your computer. The power supply might burn out or the disk might die, which is unpleasant for you but not fatal. But what happens when the power goes out in a large data center serving hundreds of thousands of users?
On February 24, the data center hosting Google App Engine lost power. For more than two hours App Engine was unavailable. A full account of the incident can be found on the mailing list; a translation of the outage timeline and some conclusions follow below.
Early morning.
7:48. Monitoring graphs begin to show that something is wrong in the primary data center and that the error rate is steadily rising. At about the same time, user reports of problems accessing App Engine appear on the mailing list.
7:53. Site Reliability Engineers notify a large group of on-call engineers that the primary data center has lost power. Data centers have backup generators for exactly this case, but this time roughly 25% of the machines did not receive backup power in time and went down. At this point the App Engine on-call engineer is paged.
8:01. By this time the on-call engineer has assessed the scope of the failure and determined that App Engine is down. Following procedure, he alerts product and engineering managers of the need to inform users about the outage. A few minutes later the first message from the App Engine team ("We are investigating this issue.") appears on the mailing list.
8:22. After analysis it is established that although power has been restored to the data center, many machines did not survive the outage and cannot handle the traffic. In particular, the GFS and Bigtable clusters are unhealthy due to the loss of a large number of machines, and therefore the Datastore in the primary data center cannot be used. The on-call engineers discuss an emergency failover to the alternate data center. The decision is made to follow the emergency procedure for an unexpected power outage.
8:36. The App Engine team sends additional outage notices to the appengine-downtime-notify group and to the App Engine status site.
8:40. The primary on-call engineer discovers two conflicting sets of failover procedures, the result of a recent change in operational processes caused by the Datastore migration. Discussion with the other on-call engineers does not produce consensus, and the engineers try to reach those responsible for the procedure change to resolve the situation.
8:44. While everyone tries to figure out which of the emergency procedures is correct, the on-call engineer attempts to move traffic to the alternate data center in read-only mode. The traffic is moved, but due to an unexpected configuration problem with this procedure it is not served correctly.
9:08. Several engineers work on diagnosing the problems with read-only traffic in the alternate data center. Meanwhile, the primary on-call engineer receives data suggesting that the primary data center has recovered and can serve traffic again. He has no clear guidance for making this call, guidance that would have told him that recovery at this point was unlikely. In an attempt to restore service, traffic is moved back to the primary data center, while the other engineers continue to investigate the read-only traffic problem in the alternate one.
9:18. The primary on-call engineer establishes that the primary data center has not recovered and cannot serve traffic. At this point it is clear to the on-call engineers that the signal was false, the primary data center is still down, and they need to focus on the alternate data center and the emergency procedure.
9:35. Contact is made with an engineer familiar with the emergency procedure, and he takes the lead. Traffic is moved to the alternate data center, initially in read-only mode.
9:48. Serving to external users resumes from the alternate data center in "read only" mode. Applications that correctly handle read-only periods should now be working, albeit with reduced functionality.
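As an aside on what "correctly handle read-only periods" means: here is a minimal sketch of how a Python App Engine handler of that era could degrade gracefully when Datastore writes are disabled. The model and handler names are hypothetical; the Capabilities API and CapabilityDisabledError come from the old App Engine SDK.

```python
# Sketch: degrade gracefully during a read-only Datastore period.
# GuestbookEntry and SignHandler are made-up names for illustration.
from google.appengine.api import capabilities
from google.appengine.ext import db, webapp
from google.appengine.runtime import apiproxy_errors


class GuestbookEntry(db.Model):
    message = db.StringProperty()


class SignHandler(webapp.RequestHandler):
    def post(self):
        # Check up front whether Datastore writes are currently enabled.
        writable = capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()
        if not writable:
            self.response.set_status(503)
            self.response.out.write('Maintenance: read-only mode, try later.')
            return
        try:
            GuestbookEntry(message=self.request.get('message')).put()
        except apiproxy_errors.CapabilityDisabledError:
            # Writes can be disabled between the check and the put().
            self.response.set_status(503)
            self.response.out.write('Maintenance: read-only mode, try later.')
```

An application written this way keeps serving reads and fails writes politely instead of throwing 500 errors at users.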
9:53. After an online consultation with the right engineer, the correct documentation for the emergency procedure is confirmed and ready for the on-call engineer to use. The actual read-write failover procedure begins.
10:09. The emergency procedure completes without problems. Traffic is served in normal read-write mode again. At this point App Engine is considered operational.
10:19. A message is sent to the appengine-downtime-notify list stating that App Engine is working normally.
You can exhale.
As the great Joel taught us, stop talking about backups and start talking about restores. When trouble strikes, a lot of gears have to turn smoothly. First, your monitoring system (What? You don't know what a monitoring system is? Then you hardly need to worry about the remaining gears) should alert you within minutes of the start of the trouble, not two weeks later in the form of surprised emails from users. Second, you must have spare capacity. Third, you must know exactly, or have precise instructions on, how to bring that spare capacity into service. If the App Engine team had had a single (correct) set of documentation, the outage would have lasted about 20 minutes instead of more than two hours.
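To make the first gear concrete, here is a toy external health check in Python 2 (matching the App Engine runtime of the time): it polls a status URL once a minute and raises an alert after several consecutive failures. The URL, thresholds, and send_alert() implementation are placeholders for illustration, not anything Google ran.

```python
# Toy external health check: poll an endpoint, page after repeated failures.
import time
import urllib2

STATUS_URL = 'http://example.com/healthz'  # hypothetical status endpoint
CHECK_INTERVAL = 60        # seconds between checks
FAILURE_THRESHOLD = 3      # consecutive failures before alerting


def check_once(timeout=10):
    """Return True if the endpoint answers with HTTP 200 in time."""
    try:
        return urllib2.urlopen(STATUS_URL, timeout=timeout).getcode() == 200
    except Exception:
        return False


def send_alert(message):
    # Placeholder: a real system would page the on-call engineer
    # (SMS gateway, pager, email), not just print to a console.
    print('ALERT: %s' % message)


def main():
    failures = 0
    while True:
        if check_once():
            failures = 0
        else:
            failures += 1
            if failures == FAILURE_THRESHOLD:
                send_alert('%s failed %d checks in a row'
                           % (STATUS_URL, failures))
        time.sleep(CHECK_INTERVAL)


if __name__ == '__main__':
    main()
```

Even something this crude beats finding out about an outage from user emails; a real setup would also watch error rates and latency, not just a single URL.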
Still, in my opinion, the App Engine team handled the problem well. Put yourself in the place of the on-call engineer whose morning starts with the message: "Good morning, friend! Right now your company is losing hundreds of dollars in money and God knows how much in reputation. And you know what? This is your problem. Good luck!"