
"Under the hood" of Netflix: an analysis of world cinema



/ photo by Brian Cantoni CC

Earlier on this blog we discussed how big data is changing the way companies operate and looked at interesting ways to use cloud services. Today we look at how the film landscape has changed since services like Netflix entered the market.

If you use Netflix, you have probably noticed that it sometimes suggests films in oddly specific genres. Alexis Madrigal of The Atlantic found that the video provider can label its films and TV shows with 77 thousand different descriptions and tags.

Of course, even a partial reverse engineering of the recommendation algorithm of a company like Netflix takes considerable time, but early in the work Alexis established that the company had carefully analyzed and tagged every film and TV show.

According to Todd Yellin, the person who devised this system, the company paid people to watch films and record the relevant metadata, following specially designed training material for rating various aspects of each work.

In this way Netflix built a database of American viewing preferences, which proved useful when it began producing its own shows such as "House of Cards."

The data was collected with UBot Studio, a tool that simplifies writing scripts for the web, on an ordinary Asus laptop that needed about a day to finish the job. Below are just a few examples from the resulting list of genres:

Emotional independent sports movies
Spy and adventure films of the 1930s
Cult horror movies with evil children
Cult sports films
1970s sentimental European dramas

An initial analysis of the data showed that Netflix has its own vocabulary, and that the descriptions also indicate the source material on which a script was based. Judging by the number of entries, marriage and the lives of high society turned out to be the most popular topics.

The researchers summarized the basic pattern by which a genre is formed as follows:

Location + Adjectives + Noun + Based on ... + Filmed at ... + From the director ... + About ... + For ages X to Y
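
The pattern can be read as an ordered list of optional slots around a mandatory noun. Here is a minimal Python sketch of that reading. The function name, the parameter names and the sample values are my own illustrative labels for the positions in the pattern, not anything taken from Netflix or from the researchers' generator.

```python
from typing import Optional, Sequence, Tuple


def build_genre(
    noun: str,
    location: Optional[str] = None,
    adjectives: Sequence[str] = (),
    based_on: Optional[str] = None,
    filmed_at: Optional[str] = None,
    from_director: Optional[str] = None,
    about: Optional[str] = None,
    ages: Optional[Tuple[int, int]] = None,
) -> str:
    """Assemble a genre title by filling the slots in the order of the pattern.

    Only the noun is required; every other slot is skipped when empty.
    All slot names are hypothetical labels chosen for this sketch.
    """
    parts = []
    if location:
        parts.append(location)
    parts.extend(adjectives)
    parts.append(noun)
    if based_on:
        parts.append(f"Based on {based_on}")
    if filmed_at:
        parts.append(f"Filmed at {filmed_at}")
    if from_director:
        parts.append(f"From the Director {from_director}")
    if about:
        parts.append(f"About {about}")
    if ages:
        parts.append(f"For Ages {ages[0]} to {ages[1]}")
    return " ".join(parts)


# Illustrative values only.
print(build_genre("Horror Movies", adjectives=["Cult"], about="Evil Children"))
print(build_genre("Dramas", location="European", adjectives=["Sentimental"]))
```

Keeping the slots as separate keyword arguments makes it easy to experiment with which combinations yield readable titles, which is roughly the kind of trial and error the article describes.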

For a fuller decoding of the grammar, the researchers used AntConc, a free program developed by a professor in Japan. It is typically used by linguists and digital humanities centers to process large volumes of text.

AntConc essentially turns text into a manageable data set. It can count how often words and phrases occur in a text, in this case the Netflix genre database. For example, searching for phrases starting with “For ...” shows that the company has content for children aged 0 to 2, 0 to 4, 2 to 4, 5 to 7, 8 to 10 and 11 to 12.
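
The same kind of count can be approximated in a few lines of Python. This is a stand-in for illustration and not how AntConc works internally; the regular expression, the function name and the sample genre strings are all assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed wording of the age suffix; the real database may phrase it differently.
AGE_RANGE = re.compile(r"\bfor ages (\d+) to (\d+)\b", re.IGNORECASE)


def count_age_ranges(genres: Iterable[str]) -> Counter:
    """Count how many genre titles mention each (low, high) age range."""
    counts: Counter = Counter()
    for genre in genres:
        match = AGE_RANGE.search(genre)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            counts[(low, high)] += 1
    return counts


# Hypothetical sample, not real scraped data.
sample = [
    "Animal Tales for Ages 0 to 2",
    "Cartoons for Ages 0 to 2",
    "Family Adventures for Ages 8 to 10",
    "Cult Sports Films",
]
counts = count_age_ranges(sample)
for age_range, n in sorted(counts.items()):
    print(age_range, n)
```

Run over the full list of scraped genre titles, the keys of the counter would give the set of age brackets that appear in the catalog.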

Several grammars were proposed on the basis of this vocabulary. During the work, the number of adjectives allowed in a title was adjusted and various grammatical structures were tried, but the essence of the original system was never fully captured. So the researchers decided to meet with representatives of the company, who helpfully offered a conversation with the person who built the system.



/ photo by Austen Squarepants CC

Todd Yellin invited the journalists to his office and tried to explain the essence of his content description system. The old way of recommending Netflix content differs greatly from the current one. According to the engineer, writing the documentation for the new project, called “Netflix Quantum Theory”, alone took the company's specialists several months.

The main bet was on descriptions written in plain language that would make recommendations more accurate. Some of the “micro tags” were made scalar (rated from 1 to 5), and genres were constrained by three main conditions (incidentally, no genre has more than five descriptors):

1) the title is at most 50 characters long;
2) enough content has accumulated for the genre;
3) the genre is syntactically "correct".
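
The three conditions above can be sketched as a simple filter over candidate genres. Only the 50-character limit comes from the article. The minimum number of titles and the toy syntax check are assumptions made up for this example, since the article does not say how Netflix defines either.

```python
from typing import Callable

MAX_TITLE_LENGTH = 50  # stated in the article
MIN_TITLES = 5         # assumption: the article gives no actual threshold

# Assumption: a toy notion of "syntactically correct" for this sketch only.
DANGLING_WORDS = {"for", "from", "about", "on", "in", "at", "the", "and"}


def ends_cleanly(name: str) -> bool:
    """Reject empty names and names that end on a dangling function word."""
    words = name.split()
    return bool(words) and words[-1].lower() not in DANGLING_WORDS


def keep_genre(
    name: str,
    matching_titles: int,
    is_well_formed: Callable[[str], bool] = ends_cleanly,
    min_titles: int = MIN_TITLES,
) -> bool:
    """Return True only if a candidate genre passes all three conditions."""
    return (
        len(name) <= MAX_TITLE_LENGTH
        and matching_titles >= min_titles
        and is_well_formed(name)
    )


print(keep_genre("Cult Sports Films", matching_titles=12))
print(keep_genre("Cult Sports Films from the", matching_titles=12))
```

Passing the syntax check in as a function keeps the sketch honest: the length and volume conditions are easy to state, while the real well-formedness rule is unknown and can be swapped in later.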

Of course, the journalists could not account for such nuances, and their generator produced rather funny descriptions. Still, the study itself suggests that machine learning, algorithms and syntax have great potential both to improve and to reduce people's ability to understand what is happening around them. In that case, the eternal question “What to watch?” can lead us to very, very controversial results.

P.S. We try not only to share our own experience of running the 1cloud virtual infrastructure service, but also to write about related fields of knowledge.

Don't forget to subscribe to our blog on Habr, friends!

Source: https://habr.com/ru/post/258753/

