How Netflix reduced failover time from 45 minutes to 7 at no additional cost
Image: Florida Memory, modified by Opensource.com. CC BY-SA 4.0
In the winter of 2012, Netflix suffered an extended outage, going down for seven hours because of problems with the AWS Elastic Load Balancer service in the US-East region. (Netflix runs on AWS; we have no data centers of our own. All of your interactions with Netflix are served from AWS, except for the video stream itself. As soon as you hit Play, the video is served from our own CDN.) During the outage, none of the traffic entering US-East reached our services.
To prevent this from happening again, we decided to build a regional failover system that is resilient to failures of our underlying service providers. Failover is a fault-tolerance technique in which standby equipment automatically takes over when the main system fails.
Expanding to multiple regions reduces the risk
We expanded to three AWS regions: two in the United States (US-East and US-West) and one in the European Union (EU), and reserved enough capacity in each to take over if any one region goes down.
A typical failover looks like this:
- Realize that one of the regions is having trouble.
- Scale up the two rescuer regions.
- Proxy traffic from the troubled region to the rescuers.
- Switch DNS away from the troubled region to the rescuer regions.
Let's explore each step.
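As a rough illustration of how the steps fit together, the whole flow can be pictured as a small orchestration function. Every helper here is a hypothetical placeholder, not Netflix's actual tooling; the sections below walk through what each step really involves.

```python
# A rough sketch of the overall failover flow. The callables passed in via
# `steps` are hypothetical placeholders standing in for the real tooling.

def failover(sick_region, rescuer_regions, steps):
    """steps: dict of callables for each stage of the evacuation."""
    # 1. Confirm the SPS drop is isolated to the sick region.
    if not steps["drop_is_isolated"](sick_region):
        raise RuntimeError("Drop spans multiple regions; cannot evacuate")

    # 2. Scale up the rescuer regions to absorb the extra traffic.
    for region in rescuer_regions:
        steps["scale_up"](region)

    # 3. Progressively proxy traffic away from the sick region via Zuul.
    steps["ramp_proxy"](sick_region, rescuer_regions)

    # 4. Point DNS away from the sick region once the rescuers are healthy.
    steps["switch_dns"](sick_region, rescuer_regions)
```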
1. Identify the problem
We need metrics, and ideally a single metric, that tell us about the health of the system. At Netflix we use a business metric called stream starts per second (SPS for short): the number of clients that have successfully started streaming.
The data is segmented by region, so at any moment we can plot the SPS graph for each region and compare the current value against the value from a day or a week ago. When we notice a drop in SPS, we know that clients are unable to start streaming, and therefore that something is wrong.
The problem is not necessarily in the cloud infrastructure. It could be bad code in one of the hundreds of microservices that make up the Netflix ecosystem, a cut submarine cable, and so on. We may not know the cause; we only know that something is wrong.

If the SPS drop is confined to a single region, it is an excellent candidate for failover. If it spans several regions, we are out of luck, because we can only evacuate one region at a time. This is precisely why we deploy microservices to the regions one at a time: if a deployment causes a problem, we can evacuate immediately and debug it later. Likewise, we want to avoid failing over when the problem would simply follow the redirected traffic (as happens during a DDoS attack).
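As a minimal sketch of that check, one could compare each region's current SPS against the value at the same time a week earlier and flag any region whose relative drop exceeds a threshold. The `fetch_sps` helper and the 25% threshold are illustrative assumptions, not Netflix's actual monitoring.

```python
# Illustrative SPS health check: compare each region's current SPS with the
# value a week ago and flag large relative drops. fetch_sps() is a
# hypothetical metric query; the 25% threshold is an arbitrary example.

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
DROP_THRESHOLD = 0.25  # flag a region if SPS falls by more than 25%


def fetch_sps(region, days_ago=0):
    """Hypothetical helper returning SPS for a region, now or `days_ago` back."""
    raise NotImplementedError


def regions_with_sps_drop():
    degraded = []
    for region in REGIONS:
        current = fetch_sps(region)
        baseline = fetch_sps(region, days_ago=7)
        if baseline > 0 and (baseline - current) / baseline > DROP_THRESHOLD:
            degraded.append(region)
    return degraded


def failover_candidate():
    """A failover only makes sense when exactly one region is degraded."""
    degraded = regions_with_sps_drop()
    return degraded[0] if len(degraded) == 1 else None
```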
2. Scale up the rescuers
Once we have identified the affected region, we need to prepare the other regions (the "rescuers") to receive its traffic. Before the evacuation begins, the infrastructure in the rescuer regions has to be scaled up accordingly.
What does scaling mean in this context? Netflix's traffic pattern changes throughout the day. There are peak viewing hours, usually around 6-9 pm, but that window arrives at different moments in different parts of the world. Peak traffic in US-East is three hours ahead of US-West, which in turn is eight hours behind the EU region.
When US-East fails unexpectedly, we send East Coast traffic to the EU region and South American traffic to US-West. This keeps latency low and the quality of service as high as possible.
With that in mind, we can use linear regression to predict the traffic that will be routed to the rescuer regions at that time of day (and day of the week), based on the historical scaling behavior of each microservice.
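As an illustration of that prediction, here is a minimal sketch using NumPy's least-squares fit. It assumes you have historical samples of (redirected requests per second, instances the microservice needed) for the relevant hour and day of week; this is just the general idea, not Netflix's actual model.

```python
import numpy as np

# Minimal linear-regression sketch: fit a line through historical pairs of
# (RPS handled, instances needed) and predict the cluster size required for
# the traffic we expect the rescuer region to receive.

def predict_cluster_size(historical_rps, historical_instances, expected_rps):
    # Ordinary least squares: instances = slope * rps + intercept.
    slope, intercept = np.polyfit(historical_rps, historical_instances, deg=1)
    predicted = slope * expected_rps + intercept
    # Round up and never go below a single instance.
    return max(1, int(np.ceil(predicted)))


# Example usage with made-up numbers:
# rps = np.array([2000, 4000, 6000, 8000, 10000])
# instances = np.array([10, 19, 30, 41, 50])
# predict_cluster_size(rps, instances, 12000)  # -> predicted cluster size
```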
Once we have determined the appropriate size for each microservice, we trigger scaling by setting the desired size of each cluster, and then let AWS do its magic.
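For AWS Auto Scaling groups, that trigger can be as simple as setting the desired capacity. Here is a sketch using boto3; the group names and the mapping of sizes are illustrative, not Netflix's real naming.

```python
import boto3

# Sketch: push predicted sizes to a rescuer region's Auto Scaling groups and
# let AWS provision the instances. `desired_sizes` maps ASG name -> size.

def scale_rescuer_region(region, desired_sizes):
    autoscaling = boto3.client("autoscaling", region_name=region)
    for asg_name, size in desired_sizes.items():
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=size,
            HonorCooldown=False,  # emergency scale-up; skip cooldown periods
        )
```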
3. Proxy the traffic
Now that the microservice clusters are scaled, we start proxying traffic from the affected region to the rescuer regions. Netflix built a high-performance, cross-regional edge proxy called Zuul, which we have released as open source.
These proxies are designed to authenticate requests, shed load, retry failed requests, and so on. Zuul can also proxy across regions; we use this capability to redirect traffic away from the affected region and then progressively ramp up the share of redirected traffic until it reaches 100%.
This progressive proxying lets services rely on their own scaling policies to respond to the incoming traffic. It compensates for any change in traffic between the moment the scaling prediction was made and the time it takes to actually scale each cluster.
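The ramp itself can be pictured as a simple loop that raises the cross-region routing percentage in steps and pauses between steps so the rescuers' scaling policies can catch up. `set_cross_region_weight` is a hypothetical hook standing in for whatever actually updates the Zuul routing rules.

```python
import time

# Illustrative progressive ramp: shift traffic away from the sick region in
# steps, pausing between steps so autoscaling in the rescuers can react.

def set_cross_region_weight(sick_region, percent):
    """Hypothetical hook: route `percent`% of the sick region's traffic
    to the rescuer regions via Zuul."""
    raise NotImplementedError


def ramp_cross_region_proxy(sick_region,
                            steps=(10, 25, 50, 75, 100),
                            pause_seconds=60):
    for percent in steps:
        set_cross_region_weight(sick_region, percent)
        time.sleep(pause_seconds)  # give scaling policies time to respond
```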
Zuul does the heavy lifting of redirecting incoming traffic from the sick region to the healthy ones. But there comes a point when the affected region must be abandoned entirely. That is where DNS comes in.
4. Change DNS
The final step of the evacuation is to update the DNS records that point to the affected region so that they point to the healthy regions instead. This moves the remaining client traffic over completely. Clients that have not yet expired their DNS cache will still be redirected by Zuul in the affected region.
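If the records live in a DNS service with an API such as Amazon Route 53 (an assumption for this sketch; the article does not say which DNS provider is used), the switch amounts to an UPSERT of the affected records. The hosted zone ID, record name, and target below are placeholders.

```python
import boto3

# Sketch: repoint a DNS record away from the sick region toward a rescuer.

def switch_dns(hosted_zone_id, record_name, rescuer_target):
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Regional failover: evacuate the sick region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,   # e.g. api.example.com (placeholder)
                    "Type": "CNAME",
                    "TTL": 60,             # short TTL so clients move quickly
                    "ResourceRecords": [{"Value": rescuer_target}],
                },
            }],
        },
    )
```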
That is the general outline of how we evacuate a Netflix region. The process used to take a long time, about 45 minutes (on a good day).

Speeding up the evacuation
We noticed that most of that time (roughly 35 minutes) was spent waiting for the rescuer regions to scale up. Although AWS can provision new instances within a few minutes, the lion's share of the scaling time goes to starting the services, warming them up, and handling the other tasks required before an instance is registered UP in discovery.
We decided that was too long. We wanted evacuations to finish in under ten minutes, without adding operational burden and without increasing infrastructure costs.
We already reserve capacity in all three regions to absorb the failure of one. If we are paying for that capacity anyway, why not use it? Thus began Project Nimble.
The idea was to maintain a pool of instances in hot standby for each microservice. When we are ready to evacuate, we simply inject the hot standby instances into the live clusters to take the traffic.
The unused reserved capacity is known as the "trough." A few Netflix teams sometimes use part of the trough for their batch jobs, so we cannot simply turn all of it into hot standby. Instead, we maintain a shadow cluster for each microservice, sized so that at every hour of the day it has enough instances to absorb the evacuated traffic if the need arises. The remaining instances stay available for batch jobs.
During an evacuation, instead of the traditional AWS scaling, we inject instances from the shadow cluster into the live cluster. The process takes about four minutes, as opposed to the previous 35.
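One way to picture the injection, assuming both the shadow and live clusters are AWS Auto Scaling groups (an assumption for this sketch), is to detach already-warm instances from the shadow group and attach them to the live one. The group names and instance selection are placeholders.

```python
import boto3

# Illustrative "injection": move running instances from a shadow Auto Scaling
# group into the live group so they start taking traffic immediately.

def inject_shadow_instances(region, shadow_asg, live_asg, instance_ids):
    autoscaling = boto3.client("autoscaling", region_name=region)

    # Remove the instances from the shadow group without terminating them.
    autoscaling.detach_instances(
        InstanceIds=instance_ids,
        AutoScalingGroupName=shadow_asg,
        ShouldDecrementDesiredCapacity=True,
    )

    # Attach the same (already warm) instances to the live group.
    autoscaling.attach_instances(
        InstanceIds=instance_ids,
        AutoScalingGroupName=live_asg,
    )
```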
Because the capacity injection is so fast, we no longer need to cautiously shift traffic through the proxy to give scaling policies time to react. We can simply switch DNS and open the floodgates, shaving off a few more precious minutes of downtime.
We added filters to the shadow cluster so that its instances do not report metrics. Otherwise they would pollute the metric space and distort the picture of normal operating behavior.
We also changed our discovery client so that instances in shadow clusters do not register themselves UP in discovery. They stay in the shadows until an evacuation begins.
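Conceptually, that discovery change boils down to registering shadow instances with a non-serving status and flipping them to UP only when a failover starts. The sketch below is a hypothetical illustration of the idea, not Netflix's actual discovery client code.

```python
import os

# Hypothetical illustration: shadow instances register OUT_OF_SERVICE so no
# traffic is routed to them, and switch to UP only when a failover begins.
# The wrapped discovery_client and its methods are placeholders.

class ShadowAwareDiscoveryClient:
    def __init__(self, discovery_client):
        self._client = discovery_client
        # Shadow membership could come from instance metadata or config.
        self._is_shadow = os.environ.get("SHADOW_CLUSTER") == "true"

    def register(self):
        status = "OUT_OF_SERVICE" if self._is_shadow else "UP"
        self._client.register(status=status)

    def activate_for_failover(self):
        # Called by the failover tooling when shadow instances are injected.
        self._client.set_status("UP")
```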
We can now fail over a region in just seven minutes. Because we use existing reserved capacity, we incur no additional infrastructure costs. The software that orchestrates the failover is written in Python by a team of three engineers.