While preparing a problem for the entrance test to the GoTo summer school, we found that there is practically no good description of the main ranking metrics in Russian (the problem concerned a particular case of the ranking task: building a recommender algorithm). At E-Contenta we actively use various ranking metrics, so we decided to fix this omission by writing this article.
The ranking task now arises everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, goods, music... In a word, the topic is hot. There is even a dedicated direction in machine learning that studies ranking algorithms able to learn from data: learning to rank. To choose the best among the variety of algorithms and approaches, one must be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.
A brief overview of the ranking task
Ranking is the task of sorting a set of elements by their relevance. Most often, relevance is understood with respect to a certain object. For example, in information retrieval the object is a query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query; in a recommendation task the object is a user, the elements are pieces of recommended content (goods, videos, music), and relevance is the probability that the user will consume (buy / like / view) that content.
Formally, consider $N$ objects and $M$ elements. The result of running a ranking algorithm for the $i$-th object is a mapping $r^{(i)}\colon E \to \mathbb{R}$ that assigns to each element $e \in E = \{e_1, \dots, e_M\}$ a weight $r^{(i)}(e)$ characterizing the degree of relevance of the element to the object (the greater the weight, the more relevant the element). The set of weights, in turn, defines a permutation $\pi^{(i)}$ on the set of elements (we assume the set of elements is ordered), obtained by sorting the elements by decreasing weight $r^{(i)}$.
To assess the quality of a ranking, we need some "gold standard" against which the algorithm's results can be compared. Let $y^{(i)}\colon E \to [0, 1]$ be the reference relevance function characterizing the "true" relevance of the elements to a given object ($y^{(i)}(e) = 1$ means the element is perfectly relevant, $y^{(i)}(e) = 0$ means completely irrelevant), and let $\pi^{y^{(i)}}$ be the corresponding permutation (by decreasing $y^{(i)}$).
There are two main ways to obtain $y^{(i)}$:
1. Based on historical data. For example, in content recommendation one can take the user's views (likes, purchases), assign the corresponding elements $y = 1$, and all the rest $y = 0$.
2. Based on expert judgment. For example, in a search task one can involve a team of assessors who manually evaluate the relevance of documents to each query.
It is worth noting that when $y^{(i)}$ takes only the extreme values 0 and 1, the permutation $\pi^{y^{(i)}}$ is usually not considered, and only the set of relevant elements, those with $y^{(i)}(e) = 1$, is taken into account.
The purpose of a ranking quality metric is to determine how well the relevance scores $r^{(i)}$ obtained by the algorithm, and the corresponding permutation $\pi^{(i)}$, agree with the true relevance values $y^{(i)}$. Let us look at the basic metrics.
Mean average precision
Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. To understand how it works, let's start with the basics.
Note: "* precision" metrics are used in binary problems, whereaccepts only two values: 0 and 1.
Precision at K
Precision at K (p@K), precision on the top K elements, is the basic ranking quality metric for a single object. Suppose our ranking algorithm produced a relevance score $r^{(i)}(e)$ for every element. Selecting the first $K$ elements with the greatest $r^{(i)}$, we can compute the share of relevant ones among them. This is exactly what precision at K does:

$$p@K(i) = \frac{1}{K}\sum_{k=1}^{K} y^{(i)}_{(k)}$$

Note: by $y^{(i)}_{(k)}$ we denote the reference relevance of the element that, under the permutation $\pi^{(i)}$, ended up at the $k$-th position. So $y^{(i)}_{(1)}$ is the relevance of the element with the greatest $r^{(i)}$, $y^{(i)}_{(2)}$ is that of the element with the second greatest $r^{(i)}$, and so on.
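To make the definition concrete, here is a minimal Python sketch, assuming binary relevances are given as a list already sorted by the algorithm's score (the function name and this representation are our own illustration conventions, not a library API):

```python
def precision_at_k(y_sorted, k):
    """p@K: the share of relevant items among the top-K.

    y_sorted -- reference relevances y ordered by decreasing algorithm score r,
                i.e. y_sorted[0] corresponds to the top-ranked element.
    """
    return sum(y_sorted[:k]) / k


# Example: relevant items at positions 1 and 3 of the top-5
print(precision_at_k([1, 0, 1, 0, 0], 5))  # 0.4
```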
Average precision at K
Precision at K is a metric that is easy to understand and implement, but it has an important drawback: it does not take into account the order of the elements within the "top". So, if out of ten elements we guessed only one, it does not matter whether it stood first or last: in either case $p@10 = 0.1$. Yet the first option is obviously much better.
This drawback is addressed by the ranking metric average precision at K (ap@K), which equals the sum of p@k over indices $k$ from 1 to $K$ for the relevant elements only, divided by $K$:

$$ap@K(i) = \frac{1}{K}\sum_{k=1}^{K} y^{(i)}_{(k)} \cdot p@k(i)$$
So, if out of three elements the only relevant one ended up in the last place, then $ap@3 = \frac{1}{3}\cdot\frac{1}{3} \approx 0.11$; if it ended up in the first place, then $ap@3 = \frac{1}{3}\cdot 1 \approx 0.33$; and if all the relevant elements were guessed, then $ap@3 = 1$.
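A minimal sketch of ap@K in the same hypothetical conventions (relevances given as a list already sorted by the algorithm's score):

```python
def average_precision_at_k(y_sorted, k):
    """ap@K: sum of p@k at the positions of relevant items, divided by K."""
    score = 0.0
    hits = 0
    for idx in range(min(k, len(y_sorted))):
        if y_sorted[idx] == 1:
            hits += 1
            score += hits / (idx + 1)   # hits / (idx + 1) equals p@(idx + 1)
    return score / k


# The three cases from the text (a single relevant item out of three):
print(average_precision_at_k([0, 0, 1], 3))  # ~0.11
print(average_precision_at_k([1, 0, 0], 3))  # ~0.33
print(average_precision_at_k([1, 1, 1], 3))  # 1.0
```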
Now it is easy to deal with map@K as well.
Mean average precision at K
Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. p@K and ap@K evaluate ranking quality for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of map@K is to compute ap@K for each object and average:

$$map@K = \frac{1}{N}\sum_{i=1}^{N} ap@K(i)$$
Note: this idea is quite logical under the assumption that all users are equally needed and equally important. If that is not the case, then instead of simple averaging one can use a weighted average, multiplying the ap@K of each object by a weight reflecting its "importance".
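Putting it together, a sketch of map@K over several objects (the data and names here are illustrative assumptions, not part of any library):

```python
def mean_average_precision_at_k(y_sorted_per_object, k):
    """map@K: ap@K (as defined above) averaged over all objects (users, queries)."""
    def ap_at_k(y_sorted):
        score, hits = 0.0, 0
        for idx in range(min(k, len(y_sorted))):
            if y_sorted[idx] == 1:
                hits += 1
                score += hits / (idx + 1)
        return score / k

    return sum(ap_at_k(y) for y in y_sorted_per_object) / len(y_sorted_per_object)


# Two hypothetical users, relevances ordered by the algorithm's ranking
users = [
    [1, 0, 1, 0, 0],  # relevant items at positions 1 and 3
    [0, 0, 0, 1, 0],  # relevant item at position 4
]
print(mean_average_precision_at_k(users, 5))  # (0.333 + 0.05) / 2 ≈ 0.19
```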
Normalized Discounted Cumulative Gain
Normalized discounted cumulative gain (nDCG) is another common ranking quality metric. As in the case of map@K, let's start with the basics.
Cumulative Gain at K
Consider again one object and the $K$ elements with the greatest $r^{(i)}$. Cumulative gain at K (CG@K) is a basic ranking metric that uses a simple idea: the more relevant the elements in this top, the better:

$$CG@K(i) = \sum_{k=1}^{K} y^{(i)}_{(k)}$$
This metric has obvious flaws: it is not normalized and does not take into account the position of the relevant elements.
Note that, unlike p@K, CG@K can also be used when the reference relevance values are non-binary: $y^{(i)}(e) \in [0, 1]$.
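A tiny sketch of CG@K in the same conventions; note that the relevances here may be fractional:

```python
def cumulative_gain_at_k(y_sorted, k):
    """CG@K: sum of reference relevances in the top-K (relevances may be non-binary)."""
    return sum(y_sorted[:k])


print(cumulative_gain_at_k([0.9, 0.0, 0.4, 0.7], 3))  # 1.3
```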
Discounted Cumulative Gain at K
Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes into account the order of the elements in the list by multiplying the relevance of an element by a weight equal to the inverse logarithm of its position number:

$$DCG@K(i) = \sum_{k=1}^{K} \frac{2^{y^{(i)}_{(k)}} - 1}{\log_2(k+1)}$$
Note: if $y^{(i)}$ takes only the values 0 and 1, then $2^{y^{(i)}_{(k)}} - 1 = y^{(i)}_{(k)}$, and the formula takes a simpler form:

$$DCG@K(i) = \sum_{k=1}^{K} \frac{y^{(i)}_{(k)}}{\log_2(k+1)}$$
The use of the logarithm as the discount function can be explained by the following intuitive consideration: in terms of ranking, positions at the top of the list differ much more strongly than positions at the end. In a search engine there is a whole gulf between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of results), while between positions 101 and 111 there is no particular difference: few people get that far. These subjective considerations are nicely captured by the logarithm: the discount $1/\log_2(k+1)$ falls quickly over the first positions and almost flattens out towards the tail.
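A sketch of DCG@K in the same conventions, covering both the simplified binary formula and the general one (the `binary` flag is our own illustration convention):

```python
import math

def dcg_at_k(y_sorted, k, binary=True):
    """DCG@K with the logarithmic position discount.

    binary=True  -> simplified formula: sum of y / log2(position + 1)
    binary=False -> general formula:    sum of (2^y - 1) / log2(position + 1)
    """
    dcg = 0.0
    for idx, y in enumerate(y_sorted[:k]):
        gain = y if binary else (2 ** y - 1)
        dcg += gain / math.log2(idx + 2)   # position idx + 1 -> discount log2(position + 1)
    return dcg


print(dcg_at_k([1, 0, 1, 0, 0], 5))  # 1.0 + 0.5 = 1.5
```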
Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but only exacerbates the lack of normalization: if $y^{(i)}$ varies within $[0, 1]$, then $DCG@K(i)$ takes values on a segment from 0 to $\sum_{k=1}^{K}\frac{1}{\log_2(k+1)}$, which is not very convenient to interpret. The following metric is designed to solve this problem.
Normalized Discounted Cumulative Gain at K
As the name suggests, normalized discounted cumulative gain at K (nDCG@K) is nothing more than a normalized version of DCG@K:

$$nDCG@K(i) = \frac{DCG@K(i)}{IDCG@K(i)}$$
where $IDCG@K(i)$ is the maximum (I stands for ideal) value of $DCG@K(i)$, i.e. the DCG@K of the ideal permutation $\pi^{y^{(i)}}$. Since we agreed that $y^{(i)}$ takes values in $[0, 1]$, $IDCG@K(i)$ does not exceed $\sum_{k=1}^{K}\frac{1}{\log_2(k+1)}$.
In this way, nDCG@K inherits from DCG@K the accounting for element positions in the list and, at the same time, takes values in the range from 0 to 1.
Note: by analogy with map@K, one can compute nDCG@K averaged over all objects.
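A self-contained sketch of nDCG@K using the simplified binary DCG formula (the names are illustrative):

```python
import math

def ndcg_at_k(y_sorted, k):
    """nDCG@K = DCG@K / IDCG@K, where IDCG@K is the DCG of the ideal ordering."""
    def dcg(values):
        return sum(y / math.log2(pos + 2) for pos, y in enumerate(values[:k]))

    ideal = sorted(y_sorted, reverse=True)   # ideal ordering: sort by true relevance
    idcg = dcg(ideal)
    return dcg(y_sorted) / idcg if idcg > 0 else 0.0


print(ndcg_at_k([1, 0, 1, 0, 0], 5))  # 1.5 / 1.63 ≈ 0.92
```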
Mean reciprocal rank
Mean reciprocal rank (MRR) is another commonly used ranking quality metric. It is given by the following formula:

$$MRR = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ is the rank of the first correctly guessed (i.e. relevant) element for the $i$-th object, $\mathrm{rank}_i = \min\{k : y^{(i)}_{(k)} = 1\}$. The summand $\frac{1}{\mathrm{rank}_i}$, the reciprocal rank, is a very simple quantity: the inverse rank of the first correctly guessed element.
Mean reciprocal rank varies in the range [0, 1] and takes element positions into account. Unfortunately, it does so for only one element, the first correctly predicted one, paying no attention to all subsequent ones.
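A minimal MRR sketch under the same conventions (each object's relevances are given sorted by the algorithm's score):

```python
def mean_reciprocal_rank(y_sorted_per_object):
    """MRR: average of 1 / (rank of the first relevant element) over all objects."""
    total = 0.0
    for y_sorted in y_sorted_per_object:
        for idx, y in enumerate(y_sorted):
            if y == 1:
                total += 1.0 / (idx + 1)
                break   # only the first relevant element matters
    return total / len(y_sorted_per_object)


print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```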
Rank Correlation Metrics
Separately, it is worth highlighting the ranking quality metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let us consider the two most common rank correlation coefficients: the Kendall and Spearman coefficients.
Kendall rank correlation coefficient
The first of them, the Kendall rank correlation coefficient, is based on counting concordant (and discordant) pairs of the two permutations, i.e. pairs of elements to which the permutations assigned the same (respectively, a different) relative order:

$$\tau = \frac{C - D}{C + D} = \frac{C - D}{\binom{M}{2}}$$

where $C$ is the number of concordant pairs and $D$ the number of discordant ones.
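A straightforward O(M²) sketch of this coefficient for tie-free permutations; in practice one would rather use a ready-made implementation such as scipy.stats.kendalltau:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau: (concordant - discordant) / total number of pairs (no ties assumed).

    rank_a[j], rank_b[j] -- positions of element j in the two permutations.
    """
    concordant = discordant = 0
    for j, m in combinations(range(len(rank_a)), 2):
        if (rank_a[j] - rank_a[m]) * (rank_b[j] - rank_b[m]) > 0:
            concordant += 1   # the pair is ordered the same way in both permutations
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # (5 - 1) / 6 ≈ 0.67
```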
Spearman's rank correlation coefficient
The second, the Spearman rank correlation coefficient, is essentially nothing more than the Pearson correlation computed on the rank values. There is a rather convenient formula expressing it directly in terms of the ranks:

$$\rho = 1 - \frac{6\sum_{j=1}^{M}\left(\pi^{(i)}(j) - \pi^{y^{(i)}}(j)\right)^2}{M(M^2 - 1)}$$
where $\pi^{(i)}(j) - \pi^{y^{(i)}}(j)$ is the difference of the ranks assigned to element $j$ by the two permutations; in the absence of ties this value coincides with the Pearson correlation coefficient computed on the ranks.
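A sketch of the rank-difference formula, again assuming tie-free permutations (scipy.stats.spearmanr handles the general case, including ties):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rho via the rank-difference formula (no ties assumed)."""
    m = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (m * (m ** 2 - 1))


print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))  # 1 - 6*2 / (4*15) = 0.8
```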
Metrics based on rank correlation have a drawback we have already met: they give no special weight to the positions of the elements (even worse than p@K, since the correlation is computed over all elements rather than over the K elements with the highest rank). That is why in practice they are used extremely rarely.
Metrics based on the cascade behavior model
Up to this point, we have not gone into how the user (hereafter we consider the particular case of an object: a user) examines the elements offered to him. In fact, we implicitly assumed that viewing each element is independent of viewing the others, a kind of "naivety". In practice, however, the user often views the elements one after another, and whether the user looks at the next element depends on his satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents at positions 1 and 2 turned out to be extremely relevant, then the probability that the user will view the document at position 3 is small, since he is quite satisfied with the first two.
Such patterns of user behavior, in which the user examines the offered elements sequentially and the probability of viewing an element depends on the relevance of the previous ones, are called cascade models.
Expected reciprocal rank
Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the following formula:

$$ERR(i) = \sum_{k=1}^{K} \frac{1}{k}\, p^{(i)}_{(k)} \prod_{m=1}^{k-1}\left(1 - p^{(i)}_{(m)}\right)$$

where the rank is understood in the order of decreasing $r^{(i)}$, and $p^{(i)}_{(k)}$ is the probability that the user is satisfied by the element at rank $k$. The most interesting part of this metric is these probabilities: under the assumptions of the cascade model, the term $\frac{1}{k}\, p^{(i)}_{(k)} \prod_{m=1}^{k-1}\left(1 - p^{(i)}_{(m)}\right)$ corresponds to the event that the user, not having been satisfied by the first $k-1$ elements, stops at position $k$.
The probabilities themselves are computed from the values $y^{(i)}$. Since in our case $y^{(i)} \in \{0, 1\}$, we can consider the simple option

$$p^{(i)}_{(k)} = y^{(i)}_{(k)}$$

which can be read as the true relevance of the element that ended up at position $k$ after sorting by decreasing $r^{(i)}$. In the general (graded) case, the following formula is most often used:

$$p^{(i)}_{(k)} = \frac{2^{y^{(i)}_{(k)}} - 1}{2^{y_{max}}}$$
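To close, a sketch of ERR that uses the general satisfaction-probability formula above (the `max_grade` parameter and the function name are our own illustration conventions):

```python
def expected_reciprocal_rank(y_sorted, max_grade=1):
    """ERR under the cascade model.

    y_sorted  -- reference relevance grades ordered by decreasing algorithm score r.
    max_grade -- maximum possible grade (1 for binary relevance).
    """
    err = 0.0
    p_reached = 1.0                                   # probability the user got to this position
    for idx, y in enumerate(y_sorted):
        p_satisfied = (2 ** y - 1) / 2 ** max_grade   # probability of being satisfied here
        err += p_reached * p_satisfied / (idx + 1)
        p_reached *= 1 - p_satisfied                  # the user continues only if unsatisfied
    return err


print(expected_reciprocal_rank([1, 0, 1, 0, 0]))  # 0.5/1 + 0.5*0.5/3 ≈ 0.58
```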