In previous articles I touched on simple ratings. In the comments I was asked to cover personalized ratings, which give every user their own predictions.
We need to use the ratings of other users to predict a movie's rating for the current user. In other words, our task is to forecast the ratings of a particular user.
Introduction
The user whose ratings we use as the basis for prediction we will call the critic. The user for whom we predict the rating we will call the user.
Although users and critics live in the same database and overlap, it is more convenient to call them differently. If a user has only one vote, we can still predict something for them, but a critic with a single vote is useless. You can also add the average movie rating, or the IMDB or Kinopoisk rating, as an extra critic.
We assume that the average rating of any film is statistically reliable (the film has many votes), as is the average rating of any critic (the critic has voted many times). We can predict these two values and drop critics and films with too few ratings from the rating.
We will assume that all ratings are on a 10-point scale from 1 to 10. The approach holds for any scale, although the more gradations, the better. However, for ratings I have doubts that it will work; for "like" or "purchase" signals the method will work, though there are other options there.
Choosing the best critic
Let's start with a simple example. We go to the cinema every Saturday. To avoid buying a pig in a poke, we first read the film columns in 5-6 newspapers. Lately we have become too lazy to read five newspapers and want to pick just one: the one whose film critic's ratings are as similar to ours as possible, i.e. whose taste coincides with ours as much as possible.
For example, here is a table comparing your ratings with those of two film critics.
Your ratings:  | 5 | 8 | 7
Film critic 1: | 5 | 8 | 4
Film critic 2: | 4 | 6 | 8
The first film critic's ratings match yours exactly, except for one, and that one is very different. The second one's ratings all differ slightly. The question is which of them is closer to you.
How do we get a numerical measure of how close the critic's taste is to yours? There are infinitely many metrics that define this. The two simplest are the Euclidean distance (the distance between two points from the school curriculum) and the Manhattan distance (named after the New York borough).
Euclid vs Manhattan
The Manhattan metric is so named because it reflects the distance you have to walk in a big city with a perpendicular street grid, where you can only move parallel to the coordinate axes.

If we compute the Manhattan distance:
Film critic 1: |5-5| + |8-8| + |7-4| = 3
Film critic 2: |5-4| + |8-6| + |7-8| = 4
The first is better.
If we compute the Euclidean distance:
Film critic 1: (5-5)^2 + (8-8)^2 + (7-4)^2 = 9
Film critic 2: (5-4)^2 + (8-6)^2 + (7-8)^2 = 6
The second is better. I did not take the square root, but the inequality holds whether we take it or not.
Metrics reflect distance in multidimensional space. In mathematics a metric is considered a characteristic of the space, something that is simply given, so which one is better is a philosophical question. From a philosophical point of view, the simpler hypothesis is more likely to be correct, and from that standpoint Euclid is better: it does not assume any obstacles that restrict movement to directions parallel to the coordinate axes. The square is a smooth function and a special case of multiplication, which in turn is a special case of summation, while the absolute value is a piecewise (conditional) function. Moreover, there is the method of least squares, with whose help even the planet Ceres was found.
In practice it is not so simple. The square amplifies the significance of single large deviations, as with the last film in the table. On the one hand this is good: big deviations are more indicative. On the other hand, it also amplifies random errors, both the user's and the critic's.
$$
d(X, Y) =
\begin{cases}
\text{Manhattan: } & \sum_i |X_i - Y_i| \\
\text{Euclid: } & \sqrt{\sum_i (X_i - Y_i)^2}
\end{cases}
$$
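A minimal sketch in plain Python (the function and variable names are my own, not from the article) that reproduces the numbers from the table above:

```python
from math import sqrt

def manhattan(xs, ys):
    # Sum of absolute differences between two rating vectors.
    return sum(abs(x - y) for x, y in zip(xs, ys))

def euclid(xs, ys):
    # Square root of the sum of squared differences.
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

me       = [5, 8, 7]
critic_1 = [5, 8, 4]
critic_2 = [4, 6, 8]

print(manhattan(me, critic_1), manhattan(me, critic_2))  # 3 4        -> critic 1 is closer
print(euclid(me, critic_1), euclid(me, critic_2))        # 3.0 2.449  -> critic 2 is closer
```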
The distance between the user and the critic is the distance between their ratings of the same films. Since different critics share a different number of films with the user (films that both have rated), we need to divide by the number of such common films; this gives the average distance. Also, since there are statistical errors, we will replace the average distance with its prediction (hereinafter PSR, the predicted average distance).
$$
\hat{d} = \bar{d} + f(n)\,(\bar{D} - \bar{d})
$$
Here \bar{d} is the measured average distance, \bar{D} is the global average distance of all critics' ratings from all user ratings (it can be replaced by a constant), and f(n) is an estimate of the unreliable fraction of the distance, in the simplest case 0.5/\sqrt{n}, where n is the number of films rated by both. The unreliable part of \bar{d} is replaced by the global average.
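A sketch of the same idea in code, assuming ratings are stored as dicts mapping film id to rating and using the 0.5/sqrt(n) estimate above (the names and the data layout are my assumptions):

```python
from math import sqrt

def predicted_avg_distance(user, critic, global_avg_d):
    """Predicted average Manhattan distance between a user and a critic.

    user, critic -- dicts {film id: rating}
    global_avg_d -- the global per-film average distance D (may be a constant)
    """
    common = set(user) & set(critic)          # films rated by both
    n = len(common)
    if n == 0:
        return global_avg_d                   # nothing in common: fall back to D
    d = sum(abs(user[f] - critic[f]) for f in common) / n   # measured average distance
    f_n = min(1.0, 0.5 / sqrt(n))             # estimated unreliable fraction of d
    return d + f_n * (global_avg_d - d)       # pull d toward D by that fraction
```

With the table above and a made-up global average of 2.5, the first critic gets d = 1.0, f(3) ≈ 0.29, and a predicted distance of roughly 1.43.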
Courage and snobbery
Another problem with Euclid is that it punishes courage. The smaller the spread of a critic's ratings around the average, the greater the chance that a random user will end up matched with that critic. For example, if the average rating on the site is 5, then a critic who votes only with fives will get the most users, and a critic who alternates evenly between 1 and 9 will get the fewest. Whether critics should be balanced so that they get an equal chance at a user regardless of courage is something to check in practice.
One can also measure snobbery (the average distance between a person's ratings and the average movie ratings) for both users and critics and take it into account when matching a user to a critic. In the simplest form, add a virtual film "snobbery" to both the user and the critic.
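A possible sketch of this "virtual film" trick, under the same dict-of-ratings assumption (the "snobbery" key and the film_averages argument are illustrative names of my own):

```python
def with_snobbery(ratings, film_averages):
    # ratings       -- {film id: rating} for one user or critic
    # film_averages -- {film id: average rating of that film}
    rated = [f for f in ratings if f in film_averages]
    if not rated:
        return dict(ratings)
    # Snobbery: average distance between the person's ratings and the film averages.
    snob = sum(abs(ratings[f] - film_averages[f]) for f in rated) / len(rated)
    # Append it as an extra "virtual film" so the distance calculation picks it up.
    return {**ratings, "snobbery": snob}
```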
A variant with several critics
It is handled similarly. The question is how to combine the critics' scores. You can take their weighted average, giving each critic a coefficient inversely proportional to its PSR. Here is the formula for predicting the rating the user will give to the film:

$$
\hat{R} = \frac{\sum_i R_i / \hat{d}_i}{\sum_i 1 / \hat{d}_i}
$$

where \hat{d}_i is the point prediction of the distance from the user to critic i, and R_i is critic i's rating of this film.
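A sketch of this weighting in code, under the same assumptions as above (a small epsilon is my addition so that a critic with a zero predicted distance does not blow up the division):

```python
def predict_rating(critic_ratings, critic_distances, eps=1e-9):
    # critic_ratings   -- {critic id: that critic's rating of the film}
    # critic_distances -- {critic id: predicted average distance to the user}
    weights = {c: 1.0 / (critic_distances[c] + eps) for c in critic_ratings}
    total = sum(weights.values())
    # Weighted average: the closer the critic, the larger its weight.
    return sum(critic_ratings[c] * weights[c] for c in critic_ratings) / total
```

For example, a critic at distance 1.4 who gave the film an 8 and a critic at distance 3.0 who gave it a 6 yield a prediction of about 7.4.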
As you can see, there is nothing difficult in the theory; the problems come in practice, when you need to optimize all of this for speed and tune the rating.