How Netflix Finds Failed Servers

Let's see how Netflix engineers identify faulty servers.

Netflix is a US-based streaming media company that distributes films and television series. Founded in 1997, the company had over 69 million customers worldwide as of January 2016. In North America alone, the service accounts for 34% of peak traffic sent to end users.

Photo by Emran Kassim / CC
Previously, we talked about how Netflix planned to expand its total data center area by almost 5 thousand square meters with the help of industrial systems manufacturer Schneider Electric. Bluebird's underground data center serves as the point where the access networks of major national providers such as AT&T and Verizon interconnect. There the emphasis was on safety: the underground location protects the data center from the tornadoes and storms that are frequent in the region.

The company's engineers begin their troubleshooting story with the series "Daredevil", which is part of the Marvel Cinematic Universe. One of the main characters has heightened senses that let him notice what others cannot; for example, he can "feel" when a person is lying. This idea formed the basis for automating the detection of anomalies in server behavior.

Under the hood, Netflix has many non-standard automation solutions. Not long ago, we wrote about technology journalists who uncovered a truly astounding approach to tagging video content with 77 thousand different descriptions and tags.

The tagging story ended with Todd Yellin, the system's developer, inviting the journalists to his office and trying to explain the essence of his content description system. According to the engineer, writing the documentation alone for the new project, named "Netflix Quantum Theory", took the company's specialists several months.

Today's topic is not nearly as complicated as Yellin's system. Netflix currently runs on several tens of thousands of servers, and usually fewer than one percent of them are malfunctioning. A slow or faulty server is much worse than one that is completely down: its negative effect on the system can be so small that it stays within the tolerance of the monitoring systems, and an engineer reviewing the graphs can overlook it.

Is there a way to detect such deviations automatically from time series data? To find the needle in the haystack, the company's engineers turned to cluster analysis, an unsupervised machine learning method. The purpose of cluster analysis is to group the most similar objects together. The algorithm chosen for the task was DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which labels points that belong to no dense group as noise and so makes them natural outlier candidates.
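To make the idea concrete, here is a minimal pure-Python sketch of DBSCAN applied to per-server metric vectors. It is not Netflix's implementation: the server names, the latency readings, and the values of eps and min_pts are all made up for illustration.

```python
import math

NOISE = -1  # label given to points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks the outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, keep expanding
    return labels


# Hypothetical data: three consecutive latency readings (ms) per server.
servers = {
    "srv-a": (100.0, 102.0, 101.0),
    "srv-b": (101.0, 100.0, 103.0),
    "srv-c": (99.0, 101.0, 100.0),
    "srv-d": (102.0, 103.0, 101.0),
    "srv-e": (140.0, 150.0, 145.0),  # noticeably slower than its peers
}
names = list(servers)
labels = dbscan([servers[n] for n in names], eps=5.0, min_pts=3)
suspects = [n for n, label in zip(names, labels) if label == NOISE]
print(suspects)
```

The four healthy servers sit close together and form one cluster, while the slow server has no neighbors within eps and is reported as noise. In practice eps and min_pts have to be tuned to the metric being watched.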

The method relies on the Atlas telemetry platform, which prepares the data for the DBSCAN algorithm; the algorithm then returns the set of servers suspected of malfunctioning. In addition to the metric to monitor, the service owner sets the minimum amount of time that must pass before a server is regarded as faulty.
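The minimum-duration rule can be sketched as follows. This is one possible reading, not the actual Netflix logic: it assumes the detector runs once per evaluation window and that "time" is counted in consecutive windows. The function name confirmed_faulty and the sample history are hypothetical.

```python
def confirmed_faulty(outlier_history, min_consecutive):
    """Return servers whose most recent outlier streak is long enough.

    outlier_history maps a server name to a list of booleans, oldest
    first, with one entry per evaluation window (True = flagged).
    """
    confirmed = []
    for server, flags in outlier_history.items():
        streak = 0
        # Walk backwards from the newest window and count the unbroken run.
        for is_outlier in reversed(flags):
            if not is_outlier:
                break
            streak += 1
        if streak >= min_consecutive:
            confirmed.append(server)
    return confirmed


# Hypothetical history: srv-a had a single blip, srv-e keeps misbehaving.
history = {
    "srv-a": [False, False, True, False],
    "srv-e": [False, True, True, True],
}
faulty = confirmed_faulty(history, min_consecutive=3)
print(faulty)
```

Requiring a streak filters out short spikes, at the cost of reacting later to servers that really are broken.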

The team evaluated the approach on a week of data that was reviewed manually to mark which servers should have been classified as suspicious and warranted attention. This made it possible to tune the system's parameters and raise the likelihood of finding truly faulty machines.
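One way to turn such a manually labeled week into a parameter choice is to score each candidate setting against the labels. The sketch below uses precision, recall, and F1 as the score. The article does not say which measure the team used, so treat this as an assumption; the server names and candidate results are invented.

```python
def precision_recall(flagged, truly_faulty):
    """Compare the detector's output with the manually labeled set."""
    flagged, truly_faulty = set(flagged), set(truly_faulty)
    hits = len(flagged & truly_faulty)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_faulty) if truly_faulty else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical manual labels and detector output for three eps values.
labeled_faulty = {"srv-e", "srv-k"}
flagged_by_eps = {
    2.0: ["srv-b", "srv-e", "srv-k"],  # too sensitive: one false alarm
    5.0: ["srv-e", "srv-k"],
    20.0: ["srv-k"],                   # too lax: misses srv-e
}

scores = {
    eps: f1_score(*precision_recall(flagged, labeled_faulty))
    for eps, flagged in flagged_by_eps.items()
}
best_eps = max(scores, key=scores.get)
print(best_eps, scores)
```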

The current version of the system uses a mini-batch approach: a window of data is collected and then used to make a decision. Compared with real-time processing, this has a drawback: the accuracy of outlier detection depends on the window size. Parameter selection could be improved by adding two services: a data tagger for building training data sets, and a model server that scores model performance and retrains models on a suitable data set collected by the tagger.
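The mini-batch idea can be illustrated with a small windowed detector. This is only a sketch: the class name WindowedDetector and its tolerance parameter are hypothetical, and a simple distance-from-median check stands in for DBSCAN to keep the example short.

```python
from collections import deque
from statistics import mean, median


class WindowedDetector:
    """Buffers readings per server and decides only on full windows."""

    def __init__(self, window_size, tolerance):
        self.window_size = window_size
        self.tolerance = tolerance
        self.samples = {}  # server name -> deque of its latest readings

    def add(self, server, value):
        window = self.samples.setdefault(
            server, deque(maxlen=self.window_size))
        window.append(value)  # the oldest reading drops out automatically

    def evaluate(self):
        """Return sorted suspects, or None until every window is full."""
        windows = self.samples.values()
        if not windows or any(len(w) < self.window_size for w in windows):
            return None
        averages = {s: mean(w) for s, w in self.samples.items()}
        typical = median(averages.values())
        return sorted(s for s, avg in averages.items()
                      if abs(avg - typical) > self.tolerance)


# Hypothetical latency readings (ms) arriving in three rounds.
detector = WindowedDetector(window_size=3, tolerance=20)
rounds = [
    {"srv-a": 100, "srv-b": 101, "srv-e": 140},
    {"srv-a": 102, "srv-b": 99, "srv-e": 150},
    {"srv-a": 101, "srv-b": 100, "srv-e": 145},
]
early = None
for number, readings in enumerate(rounds, start=1):
    for server, value in readings.items():
        detector.add(server, value)
    if number == 2:
        early = detector.evaluate()  # windows are not full yet
suspects = detector.evaluate()
print(early, suspects)
```

A larger window smooths out noise but delays the decision, while a smaller one reacts faster and is easier to fool with a short spike. This is the window-size trade-off described above.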

Netflix's cloud infrastructure is constantly growing, and automating operational decisions opens up new opportunities: it improves service availability and reduces the number of situations that require human intervention.

Source: https://habr.com/ru/post/267201/