On our Habr blog, we write not only about the technical side of the 1cloud cloud service but also about how we organize our work. For example, we recently discussed how our technical support operates.
Today we decided to liven up the Friday Habr feed with a review of published material on the inner workings of Netflix, the service that delivers films and series to viewers using streaming media technology.
Netflix was founded in 1997 by Marc Randolph and Reed Hastings in Scotts Valley, California. At first it had 30 employees and offered 925 movies and TV shows by subscription. Today, Netflix is the world's leading Internet TV network, serving tens of millions of customers.
The company actively publishes information online about the technologies it uses, its infrastructure, and its unusual solutions. For example, on their blog, Netflix employees described what the company does to keep the content it serves consistently high in quality; they developed a whole methodology for this. The company's content delivery network (CDN) takes into account the devices its customers use and supports a wide range of Internet connections, from mobile Internet at under 0.5 Mb/s to high-speed links (over 100 Mb/s).
Netflix uses adaptive streaming, which adjusts audio and video quality to the user's connection speed, but the company also lets customers set the video quality themselves. The large number of codec and bitrate combinations at Netflix means "the obligatory encoding of one film in 120 different ways before transferring it to any streaming platform."
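As a rough illustration of the adaptive selection described above, the sketch below picks the highest rung of a bitrate ladder that fits within the measured throughput, with an optional cap for a quality level chosen by the user. The ladder values, the safety margin, and the function name are assumptions made for this example; the text does not describe Netflix's actual ladder or selection logic.

```python
# Hypothetical bitrate ladder in kbit/s; the real ladder is not given in the text.
LADDER_KBPS = [235, 560, 1050, 2350, 4300, 5800]


def pick_bitrate(throughput_kbps, ladder=LADDER_KBPS, safety=0.8, user_cap_kbps=None):
    """Return the highest ladder rung that fits the usable bandwidth.

    safety is an assumed headroom factor; user_cap_kbps models a quality
    limit set manually by the viewer.
    """
    budget = throughput_kbps * safety
    if user_cap_kbps is not None:
        budget = min(budget, user_cap_kbps)
    candidates = [rate for rate in sorted(ladder) if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return candidates[-1] if candidates else min(ladder)
```

A player would call `pick_bitrate` again each time its throughput estimate changes, which is what makes the stream adaptive.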
The entire video encoding process runs in the cloud, which gives the company plenty of room to scale: if the need to process more films rises sharply, additional resources can easily be brought in. The reverse is also true: when necessary, the amount of computing power can be reduced.
Long video processing jobs are split into smaller tasks that run in parallel, which significantly reduces the time they take. The figure below shows the whole process. After it is received, the source video is encoded in different formats and at different quality levels, and only then does it reach the CDN.
Netflix receives source video files from its own studio or from its partners, and some of those files may have defects introduced during storage or repeated processing. For this reason, the first stage checks whether the sources meet the necessary requirements, which catches content that could spoil the viewing experience.
If a source file does not meet the requirements, the system automatically notifies the company's partners about the problem and requests a new source. If everything is in order, metadata needed for encoding is attached to the file. To work more efficiently with large files (for example, 4K video), the files are chunked: each file is logically divided into small parts that are checked in parallel.
During inspection, errors are detected and metadata describing the timing of each chunk is generated. Once the analysis is complete, the results for all chunks are collected by a dedicated aggregator, which decides whether to pass the file on to the next stage of the pipeline. As a result, a 4K video is checked for defects in less than 15 minutes.
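The chunk, inspect, and aggregate flow described above can be sketched as follows. This is a minimal illustration rather than Netflix's pipeline: the frame list, the "a frame of `None` is corrupt" check, the chunk size, and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(total_frames, chunk_size):
    """Logically divide a file into (start, end) frame ranges; end is exclusive."""
    return [(start, min(start + chunk_size, total_frames))
            for start in range(0, total_frames, chunk_size)]


def inspect_chunk(frames, chunk):
    """Stand-in check: report positions of frames marked as corrupt (None)."""
    start, end = chunk
    bad = [i for i in range(start, end) if frames[i] is None]
    return {"start": start, "end": end, "bad_frames": bad}


def inspect_source(frames, chunk_size=4, workers=4):
    """Inspect chunks in parallel, then aggregate into a pass/fail decision."""
    chunks = split_into_chunks(len(frames), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        reports = list(pool.map(lambda c: inspect_chunk(frames, c), chunks))
    # Aggregator: the file moves on only if no chunk reported a defect.
    bad_frames = [i for report in reports for i in report["bad_frames"]]
    return {"accepted": not bad_frames, "bad_frames": bad_frames,
            "chunks": len(reports)}
```

Because each chunk report carries its own start and end positions, the aggregator can point at the exact location of a defect when a new source has to be requested.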
Encoding, like quality control, is carried out in parallel. After the chunks have been processed successfully, a dedicated program assembles all the parts into a single whole.
Before parallel processing was introduced, the system took several days to process a 1080p film. Today, the same film can be fully inspected and encoded in various formats in just a few hours.
Before quality control was integrated into the Netflix system, errors surfaced only when customers contacted technical support. Most often, those users were very unhappy, which affected the company's finances. After the quality control system was introduced, the Netflix team began catching errors in good time, and the encoding reliability of inspected sources now exceeds 99.99%. When the company runs into a problem that the automated checks failed to catch, employees develop new checks to keep it from recurring.
Building on its content delivery system, Netflix runs large-scale experiments on video files, for example comparing different codecs and optimizing the encoding process. According to the company's employees, Netflix solves the problems it needs to solve for optimal delivery of high-quality video streams while also benefiting the community.
The Netflix Open Connect content delivery network offers its services to larger ISPs with more than 100,000 customers. For a certain fee, dedicated low-power, high-storage-density appliances placed in ISPs' data centers serve content from there instead of from the Netflix site, which reduces the cost of carrying data over the Internet. This hardware runs the FreeBSD operating system with the nginx web server and the BIRD routing daemon.
Monitoring failed servers
Netflix currently runs on several tens of thousands of servers, and fewer than one percent of them are malfunctioning. A faulty server is not necessarily offline, and therein lies the main danger: it responds to health checks and its system metrics stay within the normal range, yet it performs far below the optimal level.
A slow or misbehaving server is much worse than a dead one. Even a small negative impact can be enough for customers to notice. For this reason, Netflix set out to automate the search for faulty servers that monitoring systems do not detect.
In the video below, Cody Rioux talks about what prompted the company to tackle these problems, and also shares the technical details of the algorithms used (with examples).
The company makes extensive use of cluster analysis, a form of unsupervised machine learning. When choosing a clustering algorithm, the Netflix team settled on DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
DBSCAN takes a set of points as input and marks as cluster members those that have many close neighbors. Points that lie in lower-density regions are treated as outliers. For a point to belong to a cluster, it must lie within a certain distance of other points in that cluster, as measured by a distance function. Naftali Harris has published an excellent visualization of DBSCAN on his blog.
DBSCAN is used as follows: Atlas, the company's main telemetry platform, tracks a given metric and produces a window of data. This data is then passed to DBSCAN, which returns the set of servers suspected of malfunctioning. The image below shows the data fed into DBSCAN; the area highlighted in red is the current data window.
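To make the approach concrete, here is a small textbook DBSCAN in plain Python applied to one data window. It is a sketch under stated assumptions, not Netflix's production code: the window layout (server id mapped to a tuple of metric values), the `eps` and `min_pts` values, and the function names are all invented for this example.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; -1 marks noise (outliers)."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def suspect_servers(window, eps=5.0, min_pts=3):
    """window maps a server id to its metric values in the current data window."""
    names = list(window)
    labels = dbscan([tuple(window[name]) for name in names], eps, min_pts)
    return [name for name, label in zip(names, labels) if label == -1]
```

Servers whose metrics sit close together form a dense cluster, while a server whose window drifts away from the group ends up labeled as noise and is returned as a suspect.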
Once the servers have been identified, control passes to the alerting system, which takes one or more of the following actions: it emails the service owner, removes the server from service without stopping it, collects forensic data for further investigation, or stops the server so that it can be replaced.
Failed servers cannot be identified perfectly, but the probability of identifying them correctly is quite high. An imperfect solution is acceptable in a cloud environment, because the cost of a mistake is small. Since servers that are shut down are immediately replaced with fresh ones, mistakenly stopping a server or removing it from service has little effect on the system, if any.
Netflix's cloud infrastructure is constantly growing, and automating operational decisions opens up new opportunities: it improves service availability and reduces the number of situations that require human intervention. Detecting faulty servers is just one example of this automation.
Recommendation algorithms
Netflix ran a competition called the Netflix Prize, which concluded in 2009. The company opened access to its data and invited participants to improve the algorithm that predicts the rating a viewer will give a film based on the ratings of other viewers. The winning team managed to improve the algorithm's accuracy by 10.06%.
The recommender system consists of several algorithms, some of which are responsible for building the personalized home page: for example, an algorithm that predicts the rating a user will give a movie, an algorithm that ranks videos within each row, and an algorithm that groups movies.
Currently, movies and shows on the Netflix home page are organized into themed rows, each with its own title, and users can scroll the page both horizontally and vertically. This layout lets a user quickly decide whether a whole group of films deserves attention or whether to move on to the next row.
Machine learning is used to build the personalized page: the algorithm is trained on historical data about previously generated home pages and about how users interacted with them.
Netflix has a fairly large set of features that can be used to represent a row to the learning algorithm. Since a row is a set of videos, any properties of those videos, taken together or individually, can serve as features. These may be ordinary metadata or more useful recommendation signals that show how interesting a given title will be to a specific user.
Whatever algorithm generates a page, it is extremely important to evaluate the page's quality. In devising evaluation metrics, Netflix employees borrowed ideas from information retrieval. For example, the company found a use for recall, which is the ratio of the number of relevant objects (films the customer chose) in the sample to the total number of relevant objects.
The company decided to use Rec[m; n], which is the number of relevant objects in the first m rows and n columns of the page divided by the total number of relevant objects. Thus Rec[3; 4] measures the quality of the page for a specific user on a device whose display shows only 3 rows of 4 videos each. The advantage of this metric is that by fixing one of the values (m or n) and varying the other, you can see how recall changes as the page is scrolled.
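The metric defined above is simple enough to compute directly. The sketch below assumes a page is represented as a list of rows, each a list of video identifiers, and that the relevant set holds the videos the user actually chose; this representation and the function name are assumptions for illustration.

```python
def recall_at(page, relevant, m, n):
    """Rec[m; n]: share of relevant videos found in the first m rows and n columns."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here to avoid dividing by zero
    visible = {video for row in page[:m] for video in row[:n]}
    return len(visible & relevant) / len(relevant)


def recall_curve(page, relevant, m, max_n):
    """Fix m and vary n to see how recall grows with horizontal scrolling."""
    return [recall_at(page, relevant, m, n) for n in range(1, max_n + 1)]
```

Calling `recall_curve(page, relevant, m=3, max_n=5)` gives the sequence Rec[3; 1] through Rec[3; 5], which is the kind of sweep the paragraph describes.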
When building the home page, it is also very important to understand how users view it, that is, which parts of the screen get the most attention. Placing the most relevant videos in the most frequently viewed places (usually the upper left corner of the screen) should help shorten the search for a movie for the evening.
The variety of genres is a separate story altogether. If you are a Netflix user, you have most likely noticed that you are sometimes offered strange, even absurd, categories of films. Emotional documentaries about people fighting the system? Period pieces about royalty based on real events? Foreign stories about Satanists from the 1980s?
Interestingly, it turns out that Netflix has 76,897 different ways to describe movies. The company set up special teams whose members tagged each movie with metadata. The process proved so complicated that participants were given a 36-page guide describing how to rate films for sexual content and the amount of blood, how to judge how romantic a film is, and how to assess the level of acting. Even the morality of the characters was rated.
Alexis Madrigal, a journalist at The Atlantic, decided to look into the matter more closely and discovered that Netflix has an absurdly large number of genres (genre identifiers ran past 90,000). To shed some light on the situation, Madrigal wrote a script that pulled all the existing genres from the company's website.
“I immediately noticed that the site does not have films in every genre listed,” notes Madrigal. “The presence of a genre in the database means that, according to the algorithm, such films may appear later, or that material fitting the description already exists but has not yet been added to the site.” For example, category #91300 was called “Good romantic TV shows in Spanish” and was empty, category #91307 was titled “Beautiful Latin American comedies” and contained two films, and category #6307, “Beautiful romantic dramas”, contained 20 films.
However, if you analyze all the genres, Netflix's grammar becomes fairly transparent. For example, Oscar-winning status always comes at the beginning: “Oscar-winning romantic dramas”, while time periods always come at the end: “Oscar-winning romantic dramas of the 50s”. Descriptors of a film's subject matter are also placed near the end: “Oscar-winning romantic films about weddings”.
It turned out that for each descriptor there is a strict hierarchy. In short, the genre is formed according to the established pattern:
Location + Adjectives + Noun + Based on ... + Filmed at ... + From the director ... + About ... + For ages X to Y
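The fixed ordering in the pattern above can be sketched as a small function that joins whichever slots are filled, always in the same order. The slot keys and the function name are assumptions introduced for this example; the text does not describe how Netflix actually assembles genre names.

```python
# Slot order follows the pattern above; the key names are illustrative only.
SLOT_ORDER = ["location", "adjectives", "noun", "based_on", "filmed_at",
              "from_director", "about", "ages"]


def build_genre(**slots):
    """Join the filled slots in the fixed order, whatever order they were passed in."""
    unknown = set(slots) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return " ".join(slots[key] for key in SLOT_ORDER if key in slots)
```

For instance, `build_genre(about="about weddings", noun="films", adjectives="Oscar-winning romantic")` yields "Oscar-winning romantic films about weddings": the slots are passed in an arbitrary order, but the output follows the hierarchy.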
All 76,897 genres that Madrigal's bot found were built from these basic components. Netflix Vice President Todd Yellin explained more about how these “micro-genres” are formed. In 2006, Yellin and a team of engineers began work on a document called Netflix Quantum Theory. In this context, a quantum is a small “packet of energy” (a micro-tag) that forms part of each movie.
Yellin said that genres are constrained by three main factors:
The name of a genre must not exceed 50 characters because of the way the user interface is built.
For the algorithm to form a genre, a “critical mass” of content that fits its description must have accumulated.
Registered genres must be syntactically correct.
In the world of Netflix there are no genres with more than five descriptors. Four descriptors are very rare, but they do occur: “Cult horror films about mad scientists from the 1970s.” Three descriptors are fairly common: “Good foreign comedies for incorrigible romantics.” Two descriptors are used very often: “Films saturated with riddles.” Single descriptors are also very common: “Unusual films.”
However, the magic of Netflix is that not all tags were assigned by people; some are derived by the system itself. For example, the adjective “encouraging” is attached to films that have a certain set of features, the most important of which is a happy ending. In other words, “encouraging” is not a direct tag but a category computed from a set of tags.
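A computed category of this kind can be sketched as a rule that fires when all of its required tags are present on a title. Only the happy ending is mentioned in the text; the other tag name, the rule table, and the function name are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical rule table: "light_tone" is an invented tag; the text names
# only the happy ending as a feature behind "encouraging".
RULES = {"encouraging": {"happy_ending", "light_tone"}}


def derived_adjectives(tags, rules=RULES):
    """Return adjectives whose required tags are all present on the title."""
    present = set(tags)
    return sorted(adjective for adjective, required in rules.items()
                  if required <= present)
```

A title tagged with both `happy_ending` and `light_tone` would receive the adjective, while a title with only one of them would not.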