
Ranking Quality Metrics

While preparing a task for the entrance test to the GoTo summer school, we found that there is practically no good description of the main ranking metrics in Russian (the task concerned a particular case of the ranking problem: building a recommender algorithm). At E-Contenta we actively use various ranking metrics, so we decided to fix this gap by writing this article.




The ranking task now arises everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, goods, music... In a word, the topic is hot. There is even a special direction in machine learning devoted to self-learning ranking algorithms: learning to rank. To choose the best among the variety of algorithms and approaches, one needs to be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.


The ranking task in brief


Ranking is the task of sorting a set of elements by their relevance. Most often, relevance is understood with respect to a certain object. For example, in an information retrieval task the object is a search query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query; in the recommendation task the object is the user, the elements are pieces of recommended content (goods, videos, music), and relevance is the probability that the user will use (buy / like / view) that content.

Formally, consider $N$ objects and $M$ elements. The result of the ranking algorithm for a given object is a mapping that assigns to each element $e$ a weight $\hat r(e)$ characterizing the degree of relevance of the element to the object (the larger the weight, the more relevant the element). The set of weights $\{\hat r(e)\}$ in turn defines a permutation $\hat\pi$ on the (ordered) set of elements, obtained by sorting the elements in decreasing order of $\hat r(e)$.

To assess the quality of a ranking, we need some “gold standard” against which to compare the algorithm's results. Let $r(e)$ be the reference relevance function characterizing the “true” relevance of the elements for the given object ($r(e) = 1$: the element is perfectly relevant, $r(e) = 0$: completely irrelevant), and let $\pi$ be the corresponding permutation (by decreasing $r(e)$).

There are two main ways to obtain $r(e)$:
1. Based on historical data. For example, in the case of content recommendations, one can take the user's views (likes, purchases) and assign the corresponding elements relevance 1 ($r(e) = 1$), and all the rest 0.
2. Based on expert judgment. For example, in the search task, one can hire a team of assessors who manually evaluate the relevance of documents to each query.

It is worth noting that when $r(e)$ takes only the extreme values 0 and 1, the permutation $\pi$ is usually not considered; only the set of relevant elements, those with $r(e) = 1$, is taken into account.

The purpose of a ranking quality metric is to determine how well the relevance estimates $\hat r(e)$ produced by the algorithm, and the corresponding permutation $\hat\pi$, agree with the true relevance values $r(e)$. Let us consider the basic metrics.


Mean average precision


Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. To understand how it works, let's start with the basics.

Note: "* precision" metrics are used in binary problems, where inline_formula accepts only two values: 0 and 1.


Precision at K


Precision at K (p@K), the precision on the top K elements, is the basic ranking quality metric for a single object. Suppose our ranking algorithm produced relevance scores $\hat r(e)$ for every element. Having selected the first $K$ elements with the largest $\hat r(e)$, we can compute the share of relevant ones among them. This is exactly what precision at K does:

$$p@K = \frac{1}{K}\sum_{k=1}^{K} r(e_k)$$

Note: by $e_k$ we denote the element that, as a result of the permutation $\hat\pi$, ended up in the $k$-th position. Thus $e_1$ is the element with the largest $\hat r$, $e_2$ is the element with the second largest $\hat r$, and so on.
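
To make the definition concrete, here is a minimal Python sketch (the function name and the list representation of a ranking are ours, purely for illustration):

```python
def precision_at_k(relevance, k):
    """p@K: share of relevant items among the top-K positions.

    `relevance` lists the true relevances r(e_1), r(e_2), ... of the elements,
    already sorted by decreasing predicted score r_hat.
    """
    return sum(relevance[:k]) / k

# Only the 3rd of the top-5 items is relevant -> p@5 = 0.2
print(precision_at_k([0, 0, 1, 0, 0], k=5))  # 0.2
```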


Average precision at K


Precision at K is a metric that is easy to understand and implement, but it has an important drawback: it does not take into account the order of the elements within the “top”. So, if out of ten elements we guessed only one, it does not matter where it stood, first or last: in any case $p@10 = 0.1$. It is obvious, however, that the first option is much better.

This shortcoming is addressed by the ranking metric average precision at K (ap@K), which equals the sum of p@k over those indices k from 1 to K at which the elements are relevant, divided by K:

$$ap@K = \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}\{r(e_k) = 1\} \cdot p@k$$

So, if out of three elements only the one in the last place was relevant, then $ap@3 = \frac{1}{3}\cdot\frac{1}{3} = \frac{1}{9} \approx 0.11$; if only the one in the first place was guessed, then $ap@3 = \frac{1}{3} \approx 0.33$; and if all were guessed, then $ap@3 = 1$.
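
A sketch of ap@K reusing precision_at_k from above (again our own illustration; the division by K follows the convention stated in the text):

```python
def average_precision_at_k(relevance, k):
    """ap@K: sum of p@k over the positions k <= K holding a relevant item, divided by K."""
    score = 0.0
    for idx in range(min(k, len(relevance))):
        if relevance[idx] == 1:
            score += precision_at_k(relevance, idx + 1)
    return score / k

# The three examples from the text:
print(average_precision_at_k([0, 0, 1], k=3))  # ~0.11
print(average_precision_at_k([1, 0, 0], k=3))  # ~0.33
print(average_precision_at_k([1, 1, 1], k=3))  # 1.0
```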

Now we can handle map@K as well.


Mean average precision at K


Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. In p@K and ap@K the ranking quality is evaluated for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of map@K is to compute ap@K for every object and average the results:

$$map@K = \frac{1}{N}\sum_{i=1}^{N} ap@K(i)$$

Note: this idea is quite logical if we assume that all users are equally important. If this is not the case, then instead of a simple average one can use a weighted one, multiplying the ap@K of each object by the weight corresponding to its “importance”.
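
A sketch of the averaging step (illustrative names; each object's ranking is again a list of true relevances ordered by predicted score):

```python
def mean_average_precision_at_k(relevance_per_object, k):
    """map@K: ap@K averaged over all objects (users, queries, ...)."""
    scores = [average_precision_at_k(rel, k) for rel in relevance_per_object]
    return sum(scores) / len(scores)

# Two users with different top-3 rankings
print(mean_average_precision_at_k([[0, 0, 1], [1, 0, 0]], k=3))  # ~0.22
```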


Normalized Discounted Cumulative Gain


Normalized discounted cumulative gain (nDCG) is another common ranking quality metric. As with map@K, let's start with the basics.


Cumulative Gain at K


Consider again a single object and the $K$ elements with the largest $\hat r(e)$. Cumulative gain at K (CG@K) is a basic ranking metric built on a simple idea: the more relevant elements in this top, the better:

$$CG@K = \sum_{k=1}^{K} r(e_k)$$

This metric has obvious flaws: it is not normalized and does not take into account the position of the relevant elements.

Note that, unlike p@K, CG@K can also be used when the reference relevance values $r(e)$ are not binary.


Discounted Cumulative Gain at K


Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes the order of the elements in the list into account by multiplying the gain from an element by a weight equal to the inverse logarithm of its position number:

$$DCG@K = \sum_{k=1}^{K} \frac{2^{r(e_k)} - 1}{\log_2(k + 1)}$$

Note: if $r(e)$ takes only the values 0 and 1, then $2^{r(e)} - 1 = r(e)$, and the formula takes a simpler form:

$$DCG@K = \sum_{k=1}^{K} \frac{r(e_k)}{\log_2(k + 1)}$$

The use of the logarithm as the discount function can be explained by the following intuitive consideration: from the point of view of ranking, positions at the beginning of the list differ much more strongly than positions at its end. Thus, in the case of a search engine there is a whole gulf between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of search results), while there is no particular difference between positions 101 and 111: few people ever get there. These subjective considerations are nicely captured by the logarithm:

$$\frac{1}{\log_2(1 + 1)} - \frac{1}{\log_2(11 + 1)} \approx 0.72, \qquad \frac{1}{\log_2(101 + 1)} - \frac{1}{\log_2(111 + 1)} \approx 0.003$$

Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but only aggravates the problem of the lack of normalization: if $r(e)$ varies within $[0, 1]$, then $DCG@K$ takes values on a segment that is hard to interpret. The next metric is designed to solve this problem.


Normalized Discounted Cumulative Gain at K


As the name suggests, normalized discounted cumulative gain at K (nDCG@K) is nothing more than a normalized version of DCG@K:

$$nDCG@K = \frac{DCG@K}{IDCG@K}$$

where $IDCG@K$ is the maximum (I for “ideal”) value of $DCG@K$, achieved by ordering the elements by decreasing true relevance. Since we agreed that $r(e)$ takes values in $[0, 1]$, we have $nDCG@K \in [0, 1]$.

In this way, $nDCG@K$ inherits from $DCG@K$ the accounting for the positions of the elements in the list and at the same time takes values in the range from 0 to 1.

Note: by analogy with map@K, one can compute $nDCG@K$ averaged over all objects.
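
A minimal sketch of DCG@K and nDCG@K, assuming the $2^{r} - 1$ gain and the $\log_2(k + 1)$ discount described above (function names are illustrative):

```python
import math

def dcg_at_k(relevance, k):
    """DCG@K with gain 2^r - 1 and discount 1 / log2(position + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevance[:k])
    )

def ndcg_at_k(relevance, k):
    """nDCG@K: DCG@K divided by the DCG@K of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# The only relevant element sits at position 3 instead of position 1
print(ndcg_at_k([0, 0, 1, 0], k=4))  # 0.5
```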


Mean reciprocal rank


Mean reciprocal rank (MRR) is another commonly used ranking quality metric. It is given by the following formula:

$$MRR = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$

where $\frac{1}{\mathrm{rank}_i}$ is the reciprocal rank for the $i$-th object: a very simple quantity, equal to the reciprocal of the rank of the first correctly guessed element.


Mean reciprocal rank varies in the range $[0, 1]$ and takes the positions of the elements into account. Unfortunately, it does so only for one element, the first correctly predicted one, and pays no attention to all the subsequent ones.
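
A sketch of MRR under the same list representation (illustrative; objects with no relevant items contribute 0, a common convention not spelled out in the text):

```python
def mean_reciprocal_rank(relevance_per_object):
    """MRR: average of 1 / (position of the first relevant item), positions counted from 1."""
    total = 0.0
    for relevance in relevance_per_object:
        for pos, rel in enumerate(relevance, start=1):
            if rel == 1:
                total += 1 / pos
                break
    return total / len(relevance_per_object)

# First relevant items at positions 2 and 1 -> (1/2 + 1) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # 0.75
```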


Rank Correlation Metrics


The ranking quality metrics based on one of the rank correlation coefficients deserve separate mention. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let us consider the two most common rank correlation coefficients: Kendall's and Spearman's.


Kendall rank correlation coefficient


The first of these, the Kendall rank correlation coefficient, is based on counting the concordant (and discordant) pairs of the two permutations, i.e. the pairs of elements to which the permutations assigned the same (respectively, different) relative order:

$$\tau = \frac{P - Q}{\binom{M}{2}},$$

where $P$ is the number of concordant pairs and $Q$ is the number of discordant pairs.

Spearman's rank correlation coefficient


The second, Spearman's rank correlation coefficient, is essentially nothing more than the Pearson correlation computed on the rank values. There is a rather convenient formula expressing it directly from the ranks:

$$\rho = 1 - \frac{6 \sum_{j} d_j^2}{M(M^2 - 1)},$$

where $d_j$ is the difference between the ranks assigned to element $j$ by the two permutations; in the absence of ties this expression coincides with the Pearson correlation coefficient computed on the ranks.

Metrics based on rank correlation have a drawback we already know: they do not give extra weight to the positions at the top of the list (and in this respect are even worse than p@K, since the correlation is computed over all elements rather than over the K elements with the highest rank). Therefore, in practice they are used extremely rarely.
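
A naive $O(M^2)$ sketch of both coefficients for two rank vectors without ties (illustrative names):

```python
def kendall_tau(ranks_a, ranks_b):
    """Kendall's tau: (concordant - discordant pairs) / total number of pairs."""
    m = len(ranks_a)
    concordant = discordant = 0
    for i in range(m):
        for j in range(i + 1, m):
            sign = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (m * (m - 1) / 2)

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    m = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (m * (m ** 2 - 1))

# Two rankings of the same four elements (rank 1 = best)
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))   # ~0.67
print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.8
```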


Metrics based on cascade behavior patterns


Up to this point, we have not gone into how the user (from here on we consider the particular case where the object is a user) examines the elements offered to him. In fact, we implicitly assumed that viewing any element is independent of viewing the other elements, a kind of “naivety”. In practice, however, the elements are often viewed by the user one by one, and whether the user views the next element depends on his satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents at positions 1 and 2 turned out to be extremely relevant, then the probability that the user will view the document at position 3 is small, since he will be quite satisfied with the first two.

Models of user behavior in which the elements offered to the user are examined sequentially and the probability of viewing an element depends on the relevance of the previous ones are called cascade models.


Expected reciprocal rank


Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the following formula:

$$ERR@K = \sum_{k=1}^{K} \frac{1}{k}\, P(\text{the user stops at the element with rank } k),$$


where rank is understood with respect to the ordering by decreasing $\hat r$. The most interesting thing about this metric is the probabilities; they are computed using the assumptions of the cascade model:

$$P(\text{the user stops at the element with rank } k) = p_k \prod_{i=1}^{k-1}(1 - p_i),$$


where $p_i$ is the probability that the user will be satisfied with the element with rank $i$. These probabilities are computed from the values of $r(e_i)$. Since in our case $r(e) \in \{0, 1\}$, we can use the simple option

$$p_i = r(e_i),$$

which can be read as: the true relevance of the element that ended up at position $i$ after sorting by decreasing $\hat r$.
In the general case, the following formula is most often used:

$$p_i = \frac{2^{r(e_i)} - 1}{2^{r_{max}}},$$

where $r_{max}$ is the maximum value that the reference relevance can take.
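
A sketch of ERR@K using the simple binary option $p_i = r(e_i)$ (illustrative; for graded relevance one would substitute $p_i = (2^{r(e_i)} - 1) / 2^{r_{max}}$):

```python
def expected_reciprocal_rank(relevance, k):
    """ERR@K under the cascade model with p_i = r(e_i) for binary relevance."""
    err = 0.0
    p_look = 1.0  # probability that the user reaches the current position
    for pos, p_satisfied in enumerate(relevance[:k], start=1):
        err += p_look * p_satisfied / pos   # (1/k) * P(user stops here)
        p_look *= 1 - p_satisfied           # the user continues only if not satisfied
    return err

# The first relevant item is at position 2, so the user stops there: ERR@3 = 1/2
print(expected_reciprocal_rank([0, 1, 0], k=3))  # 0.5
```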





PFound


PFound is a ranking quality metric proposed by our compatriots at Yandex and using a cascade-like model:

$$pFound@K = \sum_{k=1}^{K} pLook_k \cdot pRel_k,$$

where
- $pLook_k$ is the probability that the user looks at the element at position $k$; it is defined recursively as $pLook_1 = 1$ and $pLook_k = pLook_{k-1} \cdot (1 - pRel_{k-1}) \cdot (1 - pBreak)$;
- $pRel_k$ is the probability that the element at position $k$ turns out to be relevant (in the simplest case $pRel_k = r(e_k)$);
- $pBreak$ is the probability that the user stops browsing for an external reason.
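
A sketch of pFound@K under these definitions, assuming $pRel_k = r(e_k)$ (the value of pBreak below is purely for demonstration, not a prescribed constant):

```python
def pfound_at_k(relevance, k, p_break=0.15):
    """pFound@K for the cascade-like model: sum of pLook_k * pRel_k over the top K."""
    pfound = 0.0
    p_look = 1.0  # pLook_1 = 1
    for p_rel in relevance[:k]:
        pfound += p_look * p_rel
        p_look *= (1 - p_rel) * (1 - p_break)  # recursion for pLook at the next position
    return pfound

# Relevant document at position 1 vs. only at position 3
print(pfound_at_k([1, 0, 1], k=3))  # 1.0
print(pfound_at_k([0, 0, 1], k=3))  # ~0.72
```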





In conclusion, here are some useful links on the topic:
- Wikipedia articles on: learning to rank, MRR, MAP and nDCG.
- The official list of metrics used at ROMIP 2010.
- Descriptions of the MAP and nDCG metrics on kaggle.com.
- The original articles on the cascade model, ERR and PFound.

Written using StackEdit .
Many thanks to SeptiM for the amazing habratex .

Source: https://habr.com/ru/post/303458/

