About sorting content based on user ratings: Part 3

In the previous article I derived a formula that predicts an article's rating from the votes it has received and the site-wide average rating. In this article I intended to show how good its forecasts are and to improve them by taking the variance into account. However, there is another problem.

This formula predicts the rating the article will have in the future.

Minority issue

However, even if our forecast is 100% correct, it says nothing about the rating that all users of the site would give the article. It predicts the average rating given by the users who open the article and vote.

That is, there are two conditions:

The user must open the article
The user must vote

The second condition is not that significant: we can assume that a random fraction of users votes. For the first condition we cannot make such an assumption, because users read the article's title and announcement before deciding whether to open it. For example, on Habr an article about Photoshop will interest only a minority; however, if it is well written, all the designers will give it 5 stars and it will overtake articles with a wider audience. A porn site shows the problem even better: some of the pictures there show all sorts of perversions that interest only a minority and disgust the majority.

In place of the "one vote" problem we get the "minority" problem. If this problem has arisen on your site, you need to make the rating depend linearly on the number of votes. For example, instead of the root of n, write simply n in the formula at the beginning of this article. In that case, however, the rich will get richer.

You can exclude perversions from the overall rating; but what do you do if, as on Habr, there are no topics that the absolute majority of users like?

Just a plus

This picture best describes a rating with one possible outcome: just a plus. There is no one-vote problem, but the "rich get richer" problem is strong, since the number of votes depends linearly on the number of views. We can reduce it.

Suppose our task is to increase the number of "Like" clicks through sorting: this is a good indicator of the positive emotions the site evokes. In addition, clicking this button brings extra traffic to the site from social networks if it is a VKontakte or Facebook button. This is an excellent metric, and it is less abstract than a rating.

We can make the prediction exactly as we did with the five-star rating. To do this, we compute the average CTR and use one of the formulas from the wiki, for example the formula on the glass (it is the same formula as Wilson's). Then we determine what share of the weight is unreliable and replace it with the average.

Wilson's formula is too long, so I will give an example for the normal approximation.

We again get the root of n, since the formula from the previous article is derived from the normal distribution, and this is a normal approximation of the binomial distribution. However, in many cases the resulting listing will be worse than when sorting simply by the number of pluses.
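The two interval formulas mentioned above can be sketched in a few lines of Python. This is only an illustration, not the author's exact formula: the helper names `wald_interval` and `wilson_interval` are hypothetical, and z = 1.96 (roughly a 95% interval) is an assumed default. It shows where the root of n comes from: the width of the normal-approximation interval is proportional to 1/sqrt(n).

```python
import math


def wald_interval(likes, views, z=1.96):
    """Normal-approximation (Wald) interval for the like rate likes/views."""
    if views <= 0:
        raise ValueError("views must be positive")
    p = likes / views
    # The standard error shrinks with the square root of the number of views.
    margin = z * math.sqrt(p * (1 - p) / views)
    return p - margin, p + margin


def wilson_interval(likes, views, z=1.96):
    """Wilson score interval for the same rate, shown for comparison."""
    if views <= 0:
        raise ValueError("views must be positive")
    p = likes / views
    denom = 1 + z * z / views
    center = (p + z * z / (2 * views)) / denom
    margin = z * math.sqrt(p * (1 - p) / views + z * z / (4 * views * views)) / denom
    return center - margin, center + margin
```

With the same like rate, four times as many views gives an interval half as wide, which is the root-of-n dependence. The Wilson interval behaves better for small n or rates near 0, where the normal approximation is poor.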

What do we predict?

Suppose we developed a Cassandra algorithm that makes predictions with 100% accuracy, even if nobody has opened the article yet. In most cases the number of clicks will still decrease, because the number of page views decreases.

We did not take into account that the number of visits to an article depends on its topic, title and announcement. A great article on a narrow topic that interests only a tenth of users may push out a good article on a topic of general interest.

To avoid this, we need to somehow predict the CTR of the article's announcement. The simplest method is to count the number of clicks from the main menu to the article's category (or take the number of visits to the category page from Google Analytics) and treat the article's CTR as proportional to the popularity of its category. The CTR itself is of little interest to us; what matters is how many times the CTR of one category exceeds that of another. If you multiply all the ratings by the same positive number, the sort order does not change.
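A minimal sketch of this weighting, under stated assumptions: the function names, the (title, category, like_rate) tuple layout and the visit counts in the usage note are all hypothetical, and the score is simply the category's share of visits multiplied by the predicted like rate.

```python
def category_weights(category_visits):
    """Relative popularity of each category from raw visit counts."""
    total = sum(category_visits.values())
    return {name: visits / total for name, visits in category_visits.items()}


def expected_likes_score(category, like_rate, weights):
    """Score proportional to the expected likes per showing of the announcement."""
    return weights[category] * like_rate


def rank(articles, weights, scale=1.0):
    """Sort (title, category, like_rate) tuples by score, best first.

    `scale` only demonstrates that a common positive factor leaves the
    order unchanged.
    """
    return sorted(
        articles,
        key=lambda a: scale * expected_likes_score(a[1], a[2], weights),
        reverse=True,
    )
```

For example, with 900 visits to a general category and 100 to a Photoshop category, a general article with a like rate of 0.1 scores 0.09 and outranks a Photoshop article with a like rate of 0.5, which scores 0.05. Only the ratio between category weights matters, so raw visit counts would give the same order as normalized shares.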

However, the CTR of an article depends not only on its announcement and topic but also on its position on the page. We can assume that each article has a certain CTR in 10th place, and that positions 1 through 9 each have a coefficient describing how much that CTR increases. However, if the listing is conservative, the computed coefficients will be statistically unreliable, since each position will almost always be occupied by the same article.

This implies the need to "dilute" the listing. Either we run automatic A/B testing, or we compute the difference in CTR between positions by placing one article in each position in turn. Which of these options is faster and easier, I will think over at leisure.
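Once the listing has been diluted, the position coefficients could be estimated roughly as follows. This is a sketch under assumptions, not a worked-out method: the log format of (position, impressions, clicks) rows and the function name are hypothetical, and it assumes articles were rotated through positions often enough that position and article quality are not confounded.

```python
from collections import defaultdict


def position_coefficients(log, baseline_position=10):
    """Estimate how many times higher the CTR is at each position
    than at the baseline position.

    `log` is an iterable of (position, impressions, clicks) rows pooled
    over the articles that were rotated through the positions.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for position, shown, clicked in log:
        impressions[position] += shown
        clicks[position] += clicked
    if impressions[baseline_position] == 0 or clicks[baseline_position] == 0:
        raise ValueError("no usable data for the baseline position")
    base_ctr = clicks[baseline_position] / impressions[baseline_position]
    return {
        pos: (clicks[pos] / impressions[pos]) / base_ctr
        for pos in impressions
        if impressions[pos] > 0
    }
```

Dividing an article's observed CTR by the coefficient of the position it occupied gives an estimate of its CTR in the baseline position, which can then be compared across articles.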

Source: https://habr.com/ru/post/150931/
