
City Data Analysis School: for schoolchildren and students


Hi, Habr. This year we have run some quite successful experiments in getting young programmers into data analysis:



School


Inspired by the kids' persistence, we decided to start involving older students as well. The school we have conceived will take place in Moscow from August 1 to August 8 at the Faculty of Computer Science of the Higher School of Economics; everyone under the age of 22 is invited to participate.


The school program consists of two blocks: an intensive course with case studies from leading industry experts, and team work on a project under an experienced mentor.


Selection


To participate, you must pass selection: solve a real problem that our partner E-Contenta faced while developing the recommendation engine for Tviz.tv. We accept solutions of any kind until July 25; we are interested in seeing non-standard ideas, and perhaps someone will even surpass the partner's own solution. Experienced participants have a chance to prove themselves and win a grant for free tuition.


Our goal is to give young people the opportunity to immerse themselves in Data Science without paying the 180 thousand rubles that "adult" courses charge. The selection is aimed primarily at testing motivation.


Qualifying task


The main task is to analyze the interests of users based on the available data and build an algorithm for recommending new films.
To begin with, you are asked to solve several simple analytical problems (tasks 1 and 2) on the provided data, and then to attempt a solution to the creative problem.
An ideal solution contains: an answer, code that reproduces that answer, and a description of what you did. We recommend submitting solutions in Python (2.7 or 3.4/3.5). You may use any libraries, as long as you are ready to explain in conversation how they work. If you borrow materials from the Internet, please cite the source. Borrowing with attribution is not penalized, provided you can explain what exactly happens in the code and why it solves the problem you use it for. Copying without citing the source is always penalized.
Code that runs on machines other than your own is welcome (especially for the advanced track); for example, it is useful to pin your dependencies in requirements.txt. An obvious or documented entry point is also desirable. Git repositories with signed commits, published to IPFS, will receive special consideration :)
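For instance, a pinned requirements.txt might look like this (the package set and versions here are our illustration, not a requirement of the task):

    # requirements.txt: exact versions, so the code runs the same everywhere
    numpy==1.11.1
    scipy==0.17.1
    pandas==0.18.1
    scikit-learn==0.17.1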


Data


The archive with the data can be downloaded here. Part of the data has been anonymized and part is missing: in the real world, servers break and data disappears. An example of reading the data is also available.


In it you will find:


A sample of which viewers liked which movies.


train_likes.csv

A table with rows of the form user_id, item_id, channel, time.


  • user_id - viewer id
  • item_id - movie id
  • channel - channel identifier
  • time - time (timestamp)

Each such quadruple means that at moment time, user user_id liked the movie item_id, which was airing on channel.
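A minimal sketch of loading this table, assuming pandas is installed, the archive is unpacked into the working directory, and the CSV has a header row:

    import pandas as pd

    # columns: user_id, item_id, channel, time
    likes = pd.read_csv("train_likes.csv")
    print(likes.head())
    print("users:", likes.user_id.nunique(), "movies:", likes.item_id.nunique())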


Movie descriptions


items.json

Movie metadata in JSON format (it opens with any standard json module). Each record contains:


  • id - movie id
  • duration - the duration of the movie
  • year - production year (transformed; see below)
  • genre - genre number (categorical variable, 10 genres in total)
  • f_{number} - additional features (see below).

The duration and year values are arithmetic transformations of the film's actual duration and release year, applied in order to protect personal data.


The f_* features are various anonymized attributes of the film. Examples of such attributes are "production country is USA" or "budget is over $100k" (yes: 1; no: 0 or the key is absent).


Important: not every film has a description record; only about 2/3 of the films that received likes are described.
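A minimal sketch of reading the metadata, assuming the file holds a JSON list of records (adjust if it turns out to be one JSON object per line):

    import json

    with open("items.json") as f:
        items = json.load(f)  # list of dicts: id, duration, year, genre, f_*

    # index by movie id; remember that only ~2/3 of liked movies appear here
    items_by_id = dict((item["id"], item) for item in items)
    print(len(items_by_id), "movies described")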


Movie schedule


schedule.csv

A table with rows of the form time_end, time_start, item_id, channel.


  • time_end - broadcast end time
  • time_start - broadcast start time
  • item_id - movie id
  • channel - channel id

Task 1

Not all channels and not all movies are equally popular. First, compute the average number of likes (from train_likes) per channel. Then count the number of movies that have at least 5 likes. The recommended output format is one real number (the average number of likes per channel) and one integer (the number of movies with 5+ likes).
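A minimal sketch of one possible solution, assuming pandas and the file layout described above:

    import pandas as pd

    likes = pd.read_csv("train_likes.csv")

    # one real number: average likes per channel
    avg_per_channel = float(len(likes)) / likes.channel.nunique()

    # one integer: movies with at least 5 likes
    movies_5plus = (likes.item_id.value_counts() >= 5).sum()

    print(avg_per_channel, movies_5plus)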


Task 2

Viewers usually have their own genre preferences, and these preferences may vary from channel to channel. Note that the genre is not known for every film; such films will have to be counted separately. First, compute the number of likes for each genre, and separately for movies whose genre is unknown. Then compute the same per-genre like counts for each of the 10 most liked channels.
The recommended output format is a row of numbers: the number of likes for each genre, in ascending order of genre number, with the count for films of unknown genre at the end. Then 10 such rows, one for each of the top 10 channels by popularity. Be sure to indicate which channel each row corresponds to.
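A minimal sketch under the same assumptions (genre ids that never occur will simply be absent from the row):

    import json
    import pandas as pd

    likes = pd.read_csv("train_likes.csv")
    with open("items.json") as f:
        items = json.load(f)

    genre_of = dict((it["id"], it.get("genre")) for it in items)
    likes["genre"] = likes.item_id.map(genre_of)  # NaN where genre is unknown

    def genre_row(df):
        by_genre = df.genre.value_counts().sort_index()  # ascending genre number
        return list(by_genre) + [df.genre.isnull().sum()]  # unknown genre last

    print("all", genre_row(likes))
    for channel in likes.channel.value_counts().index[:10]:  # top 10 channels
        print(channel, genre_row(likes[likes.channel == channel]))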


Task 3

Your main goal is to learn how to recommend movies to users so that they like them. There are many ways to do this using fairly simple assumptions. For example:



Parting words and help


Your first task is to produce a minimal workable solution. If you have not used machine learning before, you can try relying on the intuitive considerations above and a hand-coded prediction algorithm.


Not all such ideas will work, so you will need to check whether a given hypothesis holds. For example, to check how well the idea of "looking at similar users" performs, you can use the mAP@k or NDCG metric.
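A minimal sketch of mAP@k; this is the standard formulation of the metric, not code shipped with the task:

    def apk(liked, ranked, k=10):
        """Average precision at k for one user: liked is a set of item ids,
        ranked is the model's recommendation list, best first."""
        hits, score = 0, 0.0
        for pos, item in enumerate(ranked[:k]):
            if item in liked:
                hits += 1
                score += hits / float(pos + 1)  # precision at this cut-off
        return score / min(len(liked), k) if liked else 0.0

    def mapk(liked_by_user, ranked_by_user, k=10):
        """Mean of apk over users; both arguments are dicts keyed by user_id."""
        users = list(liked_by_user)
        return sum(apk(liked_by_user[u], ranked_by_user[u], k)
                   for u in users) / float(len(users))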


You can also propose your own quality evaluation methodology and explain why you chose it. Be sure to describe all your attempts, successes and failures, in the report: that way we will know what to teach you. Those who cannot attend the school or are over 22, but would like to practice, can also send us solutions; we will send feedback and invite them to visit.


A lot of good material on collaborative filtering below ↓


Baseline to help

tl;dr: https://github.com/goto-ru/2016.09-school-baseline.


The baseline we have prepared is based on the second idea: collaborative filtering.
Our algorithm uses only the like data and ignores the film features. This is a very rough simplification; adding information about the film's content will surely improve the result noticeably. Try it.


The idea at the root of our method is as follows:


  • a user likes the same movies as similar users
  • a movie is watched by the same users who watch movies similar to it

In other words, each user has some unknown set of interests that determines which films he likes. A film, in turn, has a profile that determines how much its audience likes it. We do not know the psychological profile of each user or the audience of each film, but we can recover them using the ideas above.


Cosine measure


First, let us consider an intuitive way of solving the problem, based on how well the user fits the movie's audience.
We represent the data as a mapping user -> set of liked movies. The recommendation score of movie item for user user is computed as follows:


  • For each movie the user has liked, we find the other people who liked that movie.
  • We pool all such "like-friends" together and call them the user's neighbors (the neighborhood).
  • For the movie item, we find its audience: the set of users who liked it.
  • The suitability of the movie for the user is how much the user's "like-friends" like this movie.

A detailed description of the algorithm is available.
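A minimal sketch of this scoring, assuming the likes table from train_likes.csv (for simplicity it counts co-likes instead of applying cosine normalization):

    import pandas as pd
    from collections import Counter

    likes = pd.read_csv("train_likes.csv")
    audience = likes.groupby("item_id").user_id.apply(set)  # movie -> its users
    liked = likes.groupby("user_id").item_id.apply(set)     # user -> his movies

    def suitability(user_id, item_id):
        # neighbors: everyone sharing at least one liked movie with the user,
        # weighted by how many liked movies they share
        neighbors = Counter()
        for movie in liked.get(user_id, set()):
            for other in audience[movie]:
                if other != user_id:
                    neighbors[other] += 1
        # how much the user's "like-friends" like the candidate movie
        fans = audience.get(item_id, set())
        return sum(w for u, w in neighbors.items() if u in fans)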


This method is probably better than random guessing; however, it treats all of the user's "neighbors" identically, guided only by the number of films they have watched in common.
And since users usually watch rather few films, it will often happen that two users with almost perfectly matching interests have no films in common, even though each would have liked the other's films had he known about them.


SVD


To correct this shortcoming, we will try to move from "users who have watched the movies another user watched" to explicitly recovering the "user interests" and "movie audience" vectors. This step will greatly improve the quality of the result.


So, the math.



The rich inner world of the user and the artistic depth of the film must somehow be encoded, in such a way that it is easy to tell how close two users are, how close two films are, or how well a film fits a user.


We probably don't want to do this by hand, so, like all normal mathematicians, we simply declare that the user's soul reduces to a vector of several numbers. Mindful of the complexity and versatility of human nature, we will allocate a whole 100 numbers (arbitrary real numbers). What exactly these numbers mean, we don't know yet.


For good measure, we will match each film with 100 numbers as well. In the language of mathematicians, we have just introduced two vectors: a "user interests" vector and a "film audience" vector. In the language of programmers, we have declared two arrays, not initialized them at all, and are showing off for no reason.


Now let's figure out what we want from these vectors.
Let's say that two vectors are "similar" if their dot product is large. How large, we again won't say: we are mathematicians! We will only say that similar users have a larger dot product than dissimilar ones. The same goes for movies: let similar films have a larger dot product, and dissimilar films a smaller one.


Finally, the most important part: we have a user vector and we have a movie vector. We want the dot product of the user's vector with the film's vector to be as close as possible to 1 if the user liked this movie, and to 0 if he did not like it or did not watch it.
Why so? Because that is what we want. It might have been nicer to use -1 for a film the user disliked and 0 for one he did not watch, but unfortunately we only have users' likes: there are no dislikes in the sample.


Now let's scale up. We have several tens of thousands of users, and even more movies.
We also know which users liked which movies.


We will be generous and give 100 numbers to each user and to each film. We can afford it: a float32 number weighs 4 bytes, and by a rough estimate we have (number of users + number of films) ~ 10^5 entities, so about 10^5 * 100 * 4 bytes ~ 40 MB in total.
We need the main condition to hold: the dot product of the user's soul with the movie's audience should be close to 1 if the user liked the movie, and to 0 if not.


Let's call the vector of the i-th user U_i and the vector of the j-th film V_j. For brevity, U denotes all the user vectors and V all the movie vectors.


Achieving perfect equality is most likely impossible, and the fewer numbers we allocate per soul/film, the worse the approximation will be; but we will be satisfied if the dot product is merely close enough to the correct value.
In the language of mathematicians, this is called an optimization problem: we want to choose user and film vectors so that the error is minimal:


\sum_{i,j} \left( (U_i, V_j) - like_{i,j} \right)^2 \to \min,

where like_{i,j} = 1 if user U_i liked film V_j, and 0 otherwise.


Finally, to fit the methods of well-known mathematicians, we introduce an "importance" weight for each of the 100 components, the same for all users and films. Formally, this is a vector S, again containing 100 numbers.


\sum_{i,j} \left( \sum_k U_{ik} S_k V_{jk} - like_{i,j} \right)^2 \to \min.

Gradient descent

One way to solve this problem can be described as follows.
First, we fill U and V for all users and movies with random numbers.
Then, in a loop:


  • We pick a user U_i and a film V_j.
  • We compute the weighted dot product of the vectors U_i, S, V_j.
  • We compute the "error": if we undershot (negative error), we nudge the numbers in U_i, S and V_j so that the dot product slightly increases; if we overshot (positive error), we nudge the same numbers in the opposite direction.
  • Finally, if the error is zero, change nothing.


Now we pick the next user and repeat the process until U, S and V settle down to roughly stable values. Because we choose a new user and movie each time, these small increases and decreases will, averaged over many iterations, reduce our "error" and satisfy our wish list as well as possible. Thus, after many iterations, we obtain vectors suited to our problem.
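A minimal sketch of this loop in numpy, on toy data (the sizes, the randomly sampled targets and the learning rate are ours, purely for illustration; real (user, movie, like) triples would go in their place):

    import numpy as np

    n_users, n_items, k = 1000, 2000, 100
    U = np.random.normal(scale=0.1, size=(n_users, k))
    V = np.random.normal(scale=0.1, size=(n_items, k))
    S = np.ones(k)   # per-component "importance"
    lr = 0.01        # how far we nudge per step

    for _ in range(100000):
        i = np.random.randint(n_users)
        j = np.random.randint(n_items)
        like = np.random.randint(2)           # toy target; real data goes here
        err = (U[i] * S * V[j]).sum() - like  # positive if we overshot
        # move each factor against the gradient of the squared error
        gU, gS, gV = err * S * V[j], err * U[i] * V[j], err * S * U[i]
        U[i] -= lr * gU
        S -= lr * gS
        V[j] -= lr * gV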



Deliberately limiting the vectors to 100 numbers, which prevents us from fitting all the likes and non-likes perfectly, actually works in our favor:


  • We train our model on the likes present in the sample. If it produced perfect ones for liked films and perfect zeros for unliked ones, we could not recommend the user any films beyond those he has already watched.
  • Because our algorithm is deliberately coarse, for some of the unliked films it will produce a large dot product "by mistake". In other words, the user's interests seem to match the film's typical audience, and yet the user has not actually liked this film.
  • But in fact this is no mistake at all: if the movie suits the user so well and he simply has not liked it yet, then it is exactly the movie to recommend, and the user will probably like it.

Moving from idea to implementation, and from concept to concrete methods and data structures: in the simplest form, it suffices to take only the user_id <-> item_id pairs from train_likes.csv; each such pair means that user user_id liked film item_id (the same file also has time and channel, which we ignore for now). From these relationships we build a sparse adjacency matrix, to which we apply Truncated Singular Value Decomposition. This method compresses the original matrix, trying to find a hidden feature space in which the feature vectors of users and of the films they like are close; we obtain a rank-k approximation. After this compression we reconstruct the original matrix, with some error. It is precisely thanks to this error that everything works: films that the user has not watched, but that users similar to him like, will now have a value not of 0 but of, say, 0.4, because the compression could not preserve all the data and the user got "mixed" with those similar to him.
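A minimal sketch of this pipeline with scipy and scikit-learn (the index mappings and the choice of k=100 are ours, for illustration):

    import numpy as np
    import pandas as pd
    from scipy.sparse import csr_matrix
    from sklearn.decomposition import TruncatedSVD

    likes = pd.read_csv("train_likes.csv")
    u_idx = dict((u, n) for n, u in enumerate(likes.user_id.unique()))
    i_idx = dict((i, n) for n, i in enumerate(likes.item_id.unique()))

    # sparse user x movie matrix with ones where a like exists
    M = csr_matrix((np.ones(len(likes)),
                    (likes.user_id.map(u_idx), likes.item_id.map(i_idx))),
                   shape=(len(u_idx), len(i_idx)))

    svd = TruncatedSVD(n_components=100)
    user_vecs = svd.fit_transform(M)  # "user interests", 100 numbers each
    item_vecs = svd.components_.T     # "movie audience", 100 numbers each

    def confidence(user_id, item_id):
        # reconstructed matrix entry: no longer 0/1, but a degree of fit
        return user_vecs[u_idx[user_id]].dot(item_vecs[i_idx[item_id]])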


The baseline code follows exactly this plan: matrix decomposition, plus reproducible notebooks to view for free and without SMS.


Once again, we emphasize that the above uses only the minimum possible data: whether the user liked a movie or not. Taking additional metadata into account (for example, genre or airing time) can significantly improve prediction quality; combining collaborative and content-based filtering (for example, by initializing the film feature matrix from metadata) is the next logical step down the slippery slope of data analysis in this problem.


Solution interface

Recommended interface: it is desirable for your program to have a function (see the example) that takes (user_id, film_id, additional information) and returns the model's predicted confidence of a like. To compute ranking metrics you will also need a ranking interface: (user_id, k) -> Sequence[confidence].
What matters most for grading is that the solution works (significantly better than random) and that your quality evaluation is sound; the final quality comes only after that. A simple working solution that is honestly analyzed always beats an insane mixture of everything you found on the Internet, whose workings are unclear and whose quality is overestimated. Machine learning is desirable but not required.
In addition to the system itself, we would like to receive a report in free format from which it is clear what exactly you did and why, and how you evaluated the quality of your recommender system. The recommended maximum report length is 3000 characters including spaces: (assert len(u""" """) <= 3000).
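A minimal sketch of such an interface (the function names are ours, confidence is the scoring function above, and all_film_ids is assumed to be the collection of known movie ids; any equivalent shape is fine):

    def predict_like(user_id, film_id, extra=None):
        """Model's confidence (larger = more likely a like)."""
        return confidence(user_id, film_id)  # e.g. the SVD score above

    def rank(user_id, k):
        """Confidences of the top-k recommendations for user_id."""
        scores = sorted((predict_like(user_id, f) for f in all_film_ids),
                        reverse=True)
        return scores[:k]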


We await solutions of every kind, and solutions without ML with particular impatience. What matters most is motivation, and not knowing ML is a sure sign that the school will be useful to you.



Source: https://habr.com/ru/post/305526/

