So, the topic of rating systems continues to haunt users' minds. New schemes, formulas, and tests keep appearing, and each time it comes down to the same question: how to combine the average user rating with our confidence in that rating. For example, if one film received 80 positive and 20 negative votes, and another received 9 positive and 1 negative, which film is better? Without claiming to create a new universal rating system, I would like to offer one possible approach to this particular problem.
Approximation by normal distribution
In general, the wording itself — to evaluate a certain value and our confidence in it — suggests the use of a probability distribution model, for example, a normal distribution.
What is a normal distribution? For those who skipped their mathematical statistics classes, here is a reminder of what a normal distribution is, and what a probability distribution is in general. Suppose we arrive at a bus stop and see a bus pull away right in front of us. We know that the next one will arrive in about 15 minutes. Well, maybe in 16. Or, the other way round, in 14. In principle, the driver could hurry and arrive in 12 minutes, but that is much less likely. The graph below shows the probability distribution of the bus's arrival for each minute: most likely it will arrive at the 15th minute, with slightly lower probability at the 14th or 16th, and with very low probability at the 12th or 18th.

It should be understood that the value along the Y axis is not a probability but the value of the probability density function (PDF). The probability itself is calculated as the area under the graph between two values X1 and X2; for example, the probability that the bus arrives between the 15th and 16th minute is 0.248 in this case. But more about that later.
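The "area under the graph" can be computed as a difference of two CDF values. The sketch below is only an illustration: the text does not state the spread of the bus distribution, so the standard deviation of 1.5 minutes is an assumption, chosen because it gives a value close to the figure quoted above.

```python
import scipy.stats as ss

# Assumption: mean of 15 minutes (from the text) and a standard deviation
# of 1.5 minutes (NOT given in the text, picked for illustration only).
bus = ss.norm(loc=15, scale=1.5)

# Probability of arriving between the 15th and 16th minute:
# the area under the PDF between 15 and 16, i.e. CDF(16) - CDF(15).
p_15_16 = bus.cdf(16) - bus.cdf(15)
print(round(p_15_16, 3))
```

With a different assumed standard deviation the number changes, but the recipe (subtract two CDF values) stays the same.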
A normal distribution is characterized by two parameters: the mean (here, 15 minutes) and the variance, which shows how uncertain we are about the value: the greater the variance, the wider the graph and the less sure we are about when the bus will come.
A rating, as a rule, is just a number, the final result of an evaluation. What we will actually estimate is the quality of a film (or coffee grinder, article, user; pick whichever applies). The graph below shows the distributions for two hypothetical films.

The first film (blue line) drew conflicting reviews (the mean of the distribution is 0.5). The second film (green line), in contrast, received more positive than negative ratings, but far fewer people voted, so we are much less certain about the result (the variance is much greater than in the first graph).
In principle, the normal distribution already models a rating well (the central limit theorem gives a theoretical justification for this). However, statistics offers a more convenient tool for the job.
Beta distribution
Like the normal distribution, the beta distribution is defined by two parameters, alpha > 0 and beta > 0 (written as X ~ B(alpha, beta)). However, unlike the normal distribution, which is always bell-shaped, the beta distribution is much more flexible. In particular, for alpha = 1 and beta = 1 it becomes uniform (the dark blue line in the figure below), for alpha < 1 and beta < 1 the density takes the shape of a well (green line), and for alpha > 1 and beta > 1 it resembles the normal distribution (red and light blue lines).

Programming exercise. It would be unfair to keep showing graphs without explaining how to draw them and play with the parameters, so here and below I give the code that generates each image. The examples are in Python and use the NumPy, SciPy and matplotlib libraries (all three are available from pip), but they can easily be ported to R, Matlab / Octave, Java, and even JavaScript.
For all examples, the following imports will be needed:
from numpy import *
import scipy.stats as ss
import pylab as plt
The previous chart was generated by the following code:
x = arange(101) / 100.
plt.plot(x, ss.beta(1, 1).pdf(x))
plt.plot(x, ss.beta(.7, .7).pdf(x))
plt.plot(x, ss.beta(5, 5).pdf(x))
plt.plot(x, ss.beta(10, 5).pdf(x))
plt.show()
In addition, the beta distribution has several interesting properties:
- It is limited to a finite interval. If we want to “lock in” possible values in the range from 0 to 1, then the beta distribution is just what we need.
- It is symmetric with respect to its parameters: the graph of B(alpha, beta) is the mirror image of the graph of B(beta, alpha).
- alpha and beta act on opposite sides of the density plot. As alpha increases, the graph shifts and leans to the right; as beta increases, it shifts to the left.
- The variance shrinks as the parameters grow: for a fixed ratio of alpha to beta, the larger their sum, the narrower the graph.
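These properties are easy to check numerically. The following is a small sketch using the same SciPy distribution objects as the plotting code; the specific parameter values (10 and 5, 5 and 5, 50 and 50) are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as ss

x = np.linspace(0.01, 0.99, 99)

# Mirror symmetry: the density of B(10, 5) at x equals
# the density of B(5, 10) at 1 - x.
mirror_ok = np.allclose(ss.beta(10, 5).pdf(x), ss.beta(5, 10).pdf(1 - x))

# A larger alpha pulls the mean to the right, a larger beta to the left.
mean_right = ss.beta(10, 5).mean()   # alpha / (alpha + beta) = 10 / 15
mean_left = ss.beta(5, 10).mean()    # 5 / 15

# With the mean held at 0.5, a larger alpha + beta gives a smaller variance.
var_small_counts = ss.beta(5, 5).var()
var_large_counts = ss.beta(50, 50).var()

print(mirror_ok, mean_right, mean_left, var_small_counts, var_large_counts)
```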
User Ratings
And what if we use the number of positive and the number of negative user ratings as the alpha and beta parameters, respectively? The beta distribution can then be initialized with ones for both parameters (which, generally speaking, corresponds to Laplace smoothing). Initially, our estimate of the film's quality is uniformly distributed (we know nothing about it), and each vote increases one of the parameters, generally reduces the variance, and shifts the graph either to the right (alpha, positive votes) or to the left (beta, negative votes). At the same time, our estimate of the film's quality never leaves the interval [0..1] and, in effect, shows the probability that the film will please the average viewer.
Consider a few examples. When a new film appears and no one has voted on it yet, its alpha and beta parameters both equal one, and the density graph is that of the uniform distribution:
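The update rule described above fits in a few lines. This is a minimal sketch; the helper name rating_distribution and the vote counts are made up for illustration.

```python
import scipy.stats as ss

def rating_distribution(positive, negative):
    """Beta distribution after the given votes, starting from uniform B(1, 1).

    Hypothetical helper: alpha = 1 + positive votes, beta = 1 + negative votes.
    """
    return ss.beta(1 + positive, 1 + negative)

no_votes = rating_distribution(0, 0)     # B(1, 1), the uniform distribution
some_votes = rating_distribution(3, 1)   # B(4, 2)

# The mean of B(alpha, beta) is alpha / (alpha + beta).
print(no_votes.mean(), some_votes.mean())
```

The mean of the resulting distribution is the Laplace-smoothed share of positive votes, (positive + 1) / (total + 2).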

It turns out that the information about the film was uploaded by the director himself. He uploaded it himself and voted himself. Positively, of course. He also asked five of his assistants to help. The result: alpha = 1 + 1 + 5 = 7, beta = 1.

The director's ex-wife saw the film's page and decided to spoil the rating by voting negatively, together with her lover. The result: alpha = 7, beta = 1 + 2 = 3:

Programming exercise:
plt.plot(x, ss.beta(7, 3).pdf(x))
plt.show()
After 8 votes, the average score with Laplace smoothing is alpha / (alpha + beta) = 7/10 = 0.7. However, the graph shows that the variance of the resulting distribution is still high, which means our confidence in this estimate is low.
Suppose that during the first week of release another 90 people vote on the film, so that alpha ends up at 70 and beta at 30. The average rating is 70/100 = 0.7, as before, but the graph changes significantly:

Programming exercise:
plt.plot(x, ss.beta(70, 30).pdf(x))
plt.show()
The variance in the second graph is much smaller. That is, as the number of votes grows, so does our confidence in the estimate of the film's quality.
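The same comparison can be made in numbers rather than by eye. A short sketch, using the two parameter pairs from the examples above:

```python
import scipy.stats as ss

early = ss.beta(7, 3)     # after the first 8 votes
later = ss.beta(70, 30)   # after the first week

# Both distributions have the same mean, alpha / (alpha + beta) = 0.7 ...
print(early.mean(), later.mean())

# ... but the standard deviation (square root of the variance)
# is roughly three times smaller for the second one.
print(early.std(), later.std())
```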
Rating
All this is fine, but users do not want to look at strange graphs. They need a rating: a single number that tells them whether to watch the movie or go read a book instead. In principle, given the parameters of the beta distribution, you can compute the mean and the variance and try to combine them somehow (for example, divide the mean by the logarithm of the variance). But there is a more statistically sound way.
To make the discussion more concrete, take two films: the one from the previous section with the distribution B(70, 30), and another, more popular one with the distribution B(650, 350). Their density plots are shown below:

Programming exercise:
plt.plot(x, ss.beta(70, 30).pdf(x))
plt.plot(x, ss.beta(650, 350).pdf(x))
plt.show()
On the one hand, the average of the ratings for the first film is higher - 0.7 versus 0.65. However, the second film was watched by many more people, so it is not yet known what the rating of the first film after the same number of reviews would have been. So how do you compare them?
One option is to compute the minimum trust quality of the film: a number indicating the lowest rating the film could end up with after an infinite number of votes. In statistics it is not customary to push everything to the absolute, so as the confidence level we take not 100% but the standard 95%. This means we want to be 95% sure that the film is no worse than X. Graphically, 95% of the area under the graph should lie to the right of X:

Programming exercise. Virtually all statistical libraries provide, for every distribution they implement, a cumulative distribution function (CDF), which takes a value and returns the probability that the random variable is less than that value. In other words, the CDF at X returns the area under the graph between zero and X. This differs from what we need in two respects.
First, we need the area on the other side, from X to 1. Fortunately, as mentioned above, the beta distribution is symmetric with respect to its parameters, so instead of the original distribution B(alpha, beta) we can work with its mirror image, B(beta, alpha): the area of B(alpha, beta) to the right of X equals the area of B(beta, alpha) to the left of 1 - X.
Second, we need a function that, for a given confidence level (a fraction of the total area under the graph), returns the corresponding value of X. In most math packages this function is called the inverse CDF or something similar, but SciPy uses the name PPF (percent point function; in the literature it is also known as the quantile function).
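The relationship between the CDF, the PPF and the mirrored distribution can be verified directly. The sketch below uses B(70, 30) as an example; it also shows that the same point can be obtained without mirroring, by asking the original distribution for its 5th percentile.

```python
import scipy.stats as ss

alpha, beta = 70, 30
dist = ss.beta(alpha, beta)
mirrored = ss.beta(beta, alpha)

# The PPF inverts the CDF: the point below which 5% of the area lies.
x_low = dist.ppf(0.05)
roundtrip = dist.cdf(x_low)          # back to (approximately) 0.05

# The mirror trick: take the 95% point of B(beta, alpha) and flip it.
x_via_mirror = 1 - mirrored.ppf(0.95)

# Area to the right of x_low; sf is SciPy's survival function, 1 - CDF.
area_right = dist.sf(x_low)

print(x_low, x_via_mirror, area_right)
```

Both routes give the same X, so which one to use is a matter of taste.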
Putting it together, the following code can be used to get the minimum trust quality of a movie:
dist1 = ss.beta(70, 30)
The calculations show that with 95% probability the first film will ultimately be liked by at least 0.6227 of all viewers, and the second by at least 0.6250 of them. The difference is only about two thousandths, but if you rank the films by this value, the second film will be higher in the list despite its lower average rating.
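One way to reproduce this comparison is sketched below. It assumes that the minimum trust quality is the point with 95% of the area to its right, i.e. the 5th percentile of B(alpha, beta); the function name min_trust_quality and the dictionary of films are introduced here for illustration only.

```python
import scipy.stats as ss

def min_trust_quality(alpha, beta, confidence=0.95):
    """Value X such that `confidence` of the area of B(alpha, beta) lies right of X.

    Hypothetical helper: computed as the (1 - confidence) quantile via SciPy's ppf.
    """
    return ss.beta(alpha, beta).ppf(1 - confidence)

scores = {
    "first film": min_trust_quality(70, 30),
    "second film": min_trust_quality(650, 350),
}

# Rank the films by their lower bound, best first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```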
The same calculation can be repeated for the films mentioned at the very beginning of the post: for the film with 80/20 votes the minimum trust quality is 0.731, and for the film with 9/1 votes it is 0.717, i.e. the number of votes again outweighs the average rating. However, add just one more vote “for” to the second film and its coefficient becomes 0.741, putting it in first place.
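These three numbers appear to correspond to using the raw vote counts as alpha and beta, without adding the initial ones; that is an inference from the figures, not something the text states. Under that assumption, a sketch of the calculation looks like this (lower_bound is a hypothetical name):

```python
import scipy.stats as ss

def lower_bound(positive, negative, confidence=0.95):
    # Assumption: raw vote counts are used as alpha and beta (no +1 prior).
    return ss.beta(positive, negative).ppf(1 - confidence)

many_votes = lower_bound(80, 20)   # film with 80 positive, 20 negative votes
few_votes = lower_bound(9, 1)      # film with 9 positive, 1 negative vote
one_more = lower_bound(10, 1)      # the same film after one more positive vote

# Sanity check: for B(n, 1) the CDF is x**n, so the bound is 0.05 ** (1 / n).
closed_form = 0.05 ** (1 / 9)

print(many_votes, few_votes, one_more, closed_form)
```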
Variations, advantages and disadvantages
All the coefficients used here were chosen, by and large, by eye. They seem to give fairly sensible results, but in a real application it makes sense to try different values. For example, when a large number of users vote on films, it may make sense to increase the parameters not by 1 but by, say, 0.5 per vote. Or even to introduce a damping factor, so that each subsequent vote carries less weight than the previous one; this slows the growth of the parameters.
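Both variations can be expressed as one small update loop. This is only a sketch of one possible scheme: the function name, the geometric form of the damping and the default values are all assumptions, not part of the method described above.

```python
def weighted_parameters(votes, weight=0.5, damping=1.0):
    """Accumulate alpha and beta from votes given in chronological order.

    votes   -- iterable of booleans, True for a positive vote
    weight  -- increment contributed by the first vote (hypothetical default)
    damping -- factor applied to the increment after every vote;
               1.0 means no damping, values below 1 make later votes count less
    """
    alpha, beta = 1.0, 1.0   # start from the uniform distribution
    step = weight
    for positive in votes:
        if positive:
            alpha += step
        else:
            beta += step
        step *= damping
    return alpha, beta

votes = [True, True, False, True]
half_weight = weighted_parameters(votes)                       # 0.5 per vote
damped = weighted_parameters(votes, weight=1.0, damping=0.5)   # 1, 0.5, 0.25, ...
print(half_weight, damped)
```

Note that with geometric damping the parameters stay bounded no matter how many votes arrive, so in practice a gentler schedule may be preferable.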
In addition, the initial estimate for a film can be improved. In this article I assumed that initially we know nothing about the film or about other films in our system, so a film starts with the uniform distribution (alpha = 1, beta = 1). In practice, however, we usually know something about the film in advance and can use it as a prior. For example, we can compute the average rating of the director's previous films and initialize the parameters of the beta distribution accordingly. Even if we know nothing about the director (producer, screenwriter, cast), we can use the average rating of all films in our database.
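One possible way to turn an average rating into initial parameters is to treat it as the mean of the prior and decide how many "virtual votes" it is worth. The sketch below assumes exactly that; prior_parameters and prior_strength are hypothetical names, and how much weight the prior deserves is a judgment call.

```python
def prior_parameters(prior_mean, prior_strength=2.0):
    """Initial (alpha, beta) whose mean is prior_mean.

    prior_mean     -- expected share of positive votes, strictly between 0 and 1
    prior_strength -- alpha + beta, i.e. how many virtual votes the prior is worth
    """
    alpha0 = prior_mean * prior_strength
    beta0 = (1 - prior_mean) * prior_strength
    return alpha0, beta0

uniform_start = prior_parameters(0.5)          # the uniform B(1, 1) from the article
good_director = prior_parameters(0.8, 10.0)    # previous films averaged 0.8

# Real votes are then added on top of the prior as usual.
alpha = good_director[0] + 3   # three positive votes
beta = good_director[1] + 1    # one negative vote
print(uniform_start, good_director, alpha / (alpha + beta))
```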
In principle, the method can be extended to finer-grained scales, for example from 0 to 10. In that case, ratings above 5 are added to alpha, ratings below 5 to beta, and a rating of exactly 5 increases both alpha and beta by 0.5 (hello, Habr!).
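The mapping from a 0 to 10 scale to the two parameters can be sketched as follows; the function name and the sample scores are made up for illustration.

```python
def parameters_from_scores(scores):
    """Convert ratings on a 0..10 scale into (alpha, beta), starting from B(1, 1)."""
    alpha, beta = 1.0, 1.0
    for score in scores:
        if score > 5:
            alpha += 1      # counts as a positive vote
        elif score < 5:
            beta += 1       # counts as a negative vote
        else:
            alpha += 0.5    # a neutral 5 is split evenly
            beta += 0.5
    return alpha, beta

params = parameters_from_scores([9, 7, 5, 2])
print(params)
```

This keeps only the sign of each rating; a 6 and a 10 count the same, which may or may not be acceptable for a given application.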
Finally, one can vary the required confidence level, or even change the approach and use, instead of the minimum trust quality, the area under the graph within some fixed interval.
Beta distribution plots for this article.
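Both alternatives are one-liners with the CDF and the PPF. In the sketch below the interval from 0.6 to 0.8 and the 99% confidence level are arbitrary values chosen for illustration, and area_between is a hypothetical helper name.

```python
import scipy.stats as ss

def area_between(alpha, beta, low, high):
    """Probability that the quality lies between low and high under B(alpha, beta)."""
    dist = ss.beta(alpha, beta)
    return dist.cdf(high) - dist.cdf(low)

# Alternative 1: area within a fixed interval (0.6..0.8 is an arbitrary choice).
film1_area = area_between(70, 30, 0.6, 0.8)
film2_area = area_between(650, 350, 0.6, 0.8)

# Alternative 2: a stricter confidence level, 99% instead of 95%.
bound_95 = ss.beta(70, 30).ppf(0.05)
bound_99 = ss.beta(70, 30).ppf(0.01)

print(film1_area, film2_area, bound_95, bound_99)
```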