While preparing a problem for the entrance test to the GoTo summer school, we found that there is practically no good description of the main ranking metrics in Russian (the problem concerned a particular case of the ranking task: building a recommender algorithm). At E-Contenta we actively use various ranking metrics, so we decided to fix this omission by writing this article.
The ranking task now arises everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, goods, music... In a word, the topic is hot. There is even a dedicated direction in machine learning that studies ranking algorithms able to learn from data: learning to rank. To choose the best among the variety of algorithms and approaches, one must be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.
A brief overview of the ranking task
Ranking is the task of sorting a set of elements by their relevance. Most often, relevance is understood with respect to a certain object. For example, in information retrieval the object is a query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query; in a recommendation task the object is a user, the elements are pieces of recommended content (goods, videos, music), and relevance is the probability that the user will consume (buy / like / view) that content.
Formally, consider $N$ objects and $M$ elements. The result of running a ranking algorithm for the $i$-th object is a mapping $r^{(i)}\colon E \to \mathbb{R}$ that assigns to each element $e \in E = \{e_1, \dots, e_M\}$ a weight $r^{(i)}(e)$ characterizing the degree of relevance of the element to the object (the greater the weight, the more relevant the element). The set of weights, in turn, defines a permutation $\pi^{(i)}$ on the set of elements (we assume the set of elements is ordered), obtained by sorting the elements by decreasing weight $r^{(i)}$.
To assess the quality of a ranking, we need some "gold standard" against which the algorithm's results can be compared. Let $y^{(i)}\colon E \to [0, 1]$ be the reference relevance function characterizing the "true" relevance of the elements to a given object ($y^{(i)}(e) = 1$ means the element is perfectly relevant, $y^{(i)}(e) = 0$ means completely irrelevant), and let $\pi^{y^{(i)}}$ be the corresponding permutation (by decreasing $y^{(i)}$).
There are two main ways to obtain $y^{(i)}$:
1. Based on historical data. For example, in content recommendation one can take the user's views (likes, purchases), assign the corresponding elements $y = 1$, and all the rest $y = 0$.
2. Based on expert judgment. For example, in a search task one can involve a team of assessors who manually evaluate the relevance of documents to each query.
It is worth noting that when $y^{(i)}$ takes only the extreme values 0 and 1, the permutation $\pi^{y^{(i)}}$ is usually not considered, and only the set of relevant elements, those with $y^{(i)}(e) = 1$, is taken into account.
The purpose of a ranking quality metric is to determine how well the relevance scores $r^{(i)}$ obtained by the algorithm, and the corresponding permutation $\pi^{(i)}$, agree with the true relevance values $y^{(i)}$. Let us look at the basic metrics.
Mean average precision
Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. To understand how it works, let's start with the basics.
Note: "* precision" metrics are used in binary problems, whereaccepts only two values: 0 and 1.
Precision at K
Precision at K (p@K), precision on the top K elements, is the basic ranking quality metric for a single object. Suppose our ranking algorithm produced a relevance score $r^{(i)}(e)$ for every element. Selecting the first $K$ elements with the greatest $r^{(i)}$, we can compute the share of relevant ones among them. This is exactly what precision at K does:

$$p@K(i) = \frac{1}{K}\sum_{k=1}^{K} y^{(i)}_{(k)}$$

Note: by $y^{(i)}_{(k)}$ we denote the reference relevance of the element that, under the permutation $\pi^{(i)}$, ended up at the $k$-th position. So $y^{(i)}_{(1)}$ is the relevance of the element with the greatest $r^{(i)}$, $y^{(i)}_{(2)}$ is that of the element with the second greatest $r^{(i)}$, and so on.
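To make the definition concrete, here is a minimal Python sketch, assuming binary relevances are given as a list already sorted by the algorithm's score (the function name and this representation are our own illustration conventions, not a library API):

```python
def precision_at_k(y_sorted, k):
    """p@K: the share of relevant items among the top-K.

    y_sorted -- reference relevances y ordered by decreasing algorithm score r,
                i.e. y_sorted[0] corresponds to the top-ranked element.
    """
    return sum(y_sorted[:k]) / k


# Example: relevant items at positions 1 and 3 of the top-5
print(precision_at_k([1, 0, 1, 0, 0], 5))  # 0.4
```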
Average precision at K
Precision at K is a metric that is easy to understand and implement, but it has an important drawback: it does not take into account the order of the elements within the "top". So, if out of ten elements we guessed only one, it does not matter whether it stood first or last: in either case $p@10 = 0.1$. Yet the first option is obviously much better.
This drawback is addressed by the ranking metric average precision at K (ap@K), which equals the sum of p@k over indices $k$ from 1 to $K$ for the relevant elements only, divided by $K$:

$$ap@K(i) = \frac{1}{K}\sum_{k=1}^{K} y^{(i)}_{(k)} \cdot p@k(i)$$
So, if out of three elements the only relevant one ended up in the last place, then $ap@3 = \frac{1}{3}\cdot\frac{1}{3} \approx 0.11$; if it ended up in the first place, then $ap@3 = \frac{1}{3}\cdot 1 \approx 0.33$; and if all the relevant elements were guessed, then $ap@3 = 1$.
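A minimal sketch of ap@K in the same hypothetical conventions (relevances given as a list already sorted by the algorithm's score):

```python
def average_precision_at_k(y_sorted, k):
    """ap@K: sum of p@k at the positions of relevant items, divided by K."""
    score = 0.0
    hits = 0
    for idx in range(min(k, len(y_sorted))):
        if y_sorted[idx] == 1:
            hits += 1
            score += hits / (idx + 1)   # hits / (idx + 1) equals p@(idx + 1)
    return score / k


# The three cases from the text (a single relevant item out of three):
print(average_precision_at_k([0, 0, 1], 3))  # ~0.11
print(average_precision_at_k([1, 0, 0], 3))  # ~0.33
print(average_precision_at_k([1, 1, 1], 3))  # 1.0
```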
Now it is easy to deal with map@K as well.
Mean average precision at K
Mean average precision at K (map@K) is one of the most commonly used ranking quality metrics. p@K and ap@K evaluate ranking quality for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of map@K is to compute ap@K for each object and average:

$$map@K = \frac{1}{N}\sum_{i=1}^{N} ap@K(i)$$
Note: this idea is quite logical under the assumption that all users are equally needed and equally important. If that is not the case, then instead of simple averaging one can use a weighted average, multiplying the ap@K of each object by a weight reflecting its "importance".
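Putting it together, a sketch of map@K over several objects (the data and names here are illustrative assumptions, not part of any library):

```python
def mean_average_precision_at_k(y_sorted_per_object, k):
    """map@K: ap@K (as defined above) averaged over all objects (users, queries)."""
    def ap_at_k(y_sorted):
        score, hits = 0.0, 0
        for idx in range(min(k, len(y_sorted))):
            if y_sorted[idx] == 1:
                hits += 1
                score += hits / (idx + 1)
        return score / k

    return sum(ap_at_k(y) for y in y_sorted_per_object) / len(y_sorted_per_object)


# Two hypothetical users, relevances ordered by the algorithm's ranking
users = [
    [1, 0, 1, 0, 0],  # relevant items at positions 1 and 3
    [0, 0, 0, 1, 0],  # relevant item at position 4
]
print(mean_average_precision_at_k(users, 5))  # (0.333 + 0.05) / 2 ≈ 0.19
```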
Normalized Discounted Cumulative Gain
Normalized discounted cumulative gain (nDCG) is another common ranking quality metric. As in the case of map@K, let's start with the basics.
Cumulative Gain at K
Consider again one object and the $K$ elements with the greatest $r^{(i)}$. Cumulative gain at K (CG@K) is a basic ranking metric that uses a simple idea: the more relevant the elements in this top, the better:

$$CG@K(i) = \sum_{k=1}^{K} y^{(i)}_{(k)}$$
This metric has obvious flaws: it is not normalized and does not take into account the position of the relevant elements.
Note that, unlike p@K, CG@K can also be used when the reference relevance values are non-binary: $y^{(i)}(e) \in [0, 1]$.
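A tiny sketch of CG@K in the same conventions; note that the relevances here may be fractional:

```python
def cumulative_gain_at_k(y_sorted, k):
    """CG@K: sum of reference relevances in the top-K (relevances may be non-binary)."""
    return sum(y_sorted[:k])


print(cumulative_gain_at_k([0.9, 0.0, 0.4, 0.7], 3))  # 1.3
```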
Discounted Cumulative Gain at K
Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes into account the order of the elements in the list by multiplying the relevance of an element by a weight equal to the inverse logarithm of its position number:

$$DCG@K(i) = \sum_{k=1}^{K} \frac{2^{y^{(i)}_{(k)}} - 1}{\log_2(k+1)}$$
Note: if $y^{(i)}$ takes only the values 0 and 1, then $2^{y^{(i)}_{(k)}} - 1 = y^{(i)}_{(k)}$, and the formula takes a simpler form:

$$DCG@K(i) = \sum_{k=1}^{K} \frac{y^{(i)}_{(k)}}{\log_2(k+1)}$$
The use of the logarithm as the discount function can be explained by the following intuitive consideration: in terms of ranking, positions at the top of the list differ much more strongly than positions at the end. In a search engine there is a whole gulf between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of results), while between positions 101 and 111 there is no particular difference: few people get that far. These subjective considerations are nicely captured by the logarithm: the discount $1/\log_2(k+1)$ falls quickly over the first positions and almost flattens out towards the tail.
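A sketch of DCG@K in the same conventions, covering both the simplified binary formula and the general one (the `binary` flag is our own illustration convention):

```python
import math

def dcg_at_k(y_sorted, k, binary=True):
    """DCG@K with the logarithmic position discount.

    binary=True  -> simplified formula: sum of y / log2(position + 1)
    binary=False -> general formula:    sum of (2^y - 1) / log2(position + 1)
    """
    dcg = 0.0
    for idx, y in enumerate(y_sorted[:k]):
        gain = y if binary else (2 ** y - 1)
        dcg += gain / math.log2(idx + 2)   # position idx + 1 -> discount log2(position + 1)
    return dcg


print(dcg_at_k([1, 0, 1, 0, 0], 5))  # 1.0 + 0.5 = 1.5
```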
Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but only exacerbates the lack of normalization: if $y^{(i)}$ varies within $[0, 1]$, then $DCG@K(i)$ takes values on a segment from 0 to $\sum_{k=1}^{K}\frac{1}{\log_2(k+1)}$, which is not very convenient to interpret. The following metric is designed to solve this problem.
Normalized Discounted Cumulative Gain at K
As the name suggests, normalized discounted cumulative gain at K (nDCG@K) is nothing more than a normalized version of DCG@K:

$$nDCG@K(i) = \frac{DCG@K(i)}{IDCG@K(i)}$$
where $IDCG@K(i)$ is the maximum (I stands for ideal) value of $DCG@K(i)$, i.e. the DCG@K of the ideal permutation $\pi^{y^{(i)}}$. Since we agreed that $y^{(i)}$ takes values in $[0, 1]$, $IDCG@K(i)$ does not exceed $\sum_{k=1}^{K}\frac{1}{\log_2(k+1)}$.
In this way, nDCG@K inherits from DCG@K the accounting for element positions in the list and, at the same time, takes values in the range from 0 to 1.
Note: by analogy with map@K, one can compute nDCG@K averaged over all objects.
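A self-contained sketch of nDCG@K using the simplified binary DCG formula (the names are illustrative):

```python
import math

def ndcg_at_k(y_sorted, k):
    """nDCG@K = DCG@K / IDCG@K, where IDCG@K is the DCG of the ideal ordering."""
    def dcg(values):
        return sum(y / math.log2(pos + 2) for pos, y in enumerate(values[:k]))

    ideal = sorted(y_sorted, reverse=True)   # ideal ordering: sort by true relevance
    idcg = dcg(ideal)
    return dcg(y_sorted) / idcg if idcg > 0 else 0.0


print(ndcg_at_k([1, 0, 1, 0, 0], 5))  # 1.5 / 1.63 ≈ 0.92
```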
Mean reciprocal rank
Mean reciprocal rank (MRR) is another commonly used ranking quality metric. It is given by the following formula:

$$MRR = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ is the rank of the first correctly guessed (i.e. relevant) element for the $i$-th object, $\mathrm{rank}_i = \min\{k : y^{(i)}_{(k)} = 1\}$. The summand $\frac{1}{\mathrm{rank}_i}$, the reciprocal rank, is a very simple quantity: the inverse rank of the first correctly guessed element.
Mean reciprocal rank varies in the range [0, 1] and takes element positions into account. Unfortunately, it does so for only one element, the first correctly predicted one, paying no attention to all subsequent ones.
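A minimal MRR sketch under the same conventions (each object's relevances are given sorted by the algorithm's score):

```python
def mean_reciprocal_rank(y_sorted_per_object):
    """MRR: average of 1 / (rank of the first relevant element) over all objects."""
    total = 0.0
    for y_sorted in y_sorted_per_object:
        for idx, y in enumerate(y_sorted):
            if y == 1:
                total += 1.0 / (idx + 1)
                break   # only the first relevant element matters
    return total / len(y_sorted_per_object)


print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```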
Rank Correlation Metrics
Separately, it is worth highlighting the ranking quality metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let us consider the two most common rank correlation coefficients: the Kendall and Spearman coefficients.
Kendall rank correlation coefficient
The first of them, the Kendall rank correlation coefficient, is based on counting concordant (and discordant) pairs of the two permutations, i.e. pairs of elements to which the permutations assigned the same (respectively, a different) relative order:

$$\tau = \frac{C - D}{C + D} = \frac{C - D}{\binom{M}{2}}$$

where $C$ is the number of concordant pairs and $D$ the number of discordant ones.
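A straightforward O(M²) sketch of this coefficient for tie-free permutations; in practice one would rather use a ready-made implementation such as scipy.stats.kendalltau:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau: (concordant - discordant) / total number of pairs (no ties assumed).

    rank_a[j], rank_b[j] -- positions of element j in the two permutations.
    """
    concordant = discordant = 0
    for j, m in combinations(range(len(rank_a)), 2):
        if (rank_a[j] - rank_a[m]) * (rank_b[j] - rank_b[m]) > 0:
            concordant += 1   # the pair is ordered the same way in both permutations
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # (5 - 1) / 6 ≈ 0.67
```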
Spearman's rank correlation coefficient
The second, the Spearman rank correlation coefficient, is essentially nothing more than the Pearson correlation computed on the rank values. There is a rather convenient formula expressing it directly in terms of the ranks:

$$\rho = 1 - \frac{6\sum_{j=1}^{M}\left(\pi^{(i)}(j) - \pi^{y^{(i)}}(j)\right)^2}{M(M^2 - 1)}$$
where $\pi^{(i)}(j) - \pi^{y^{(i)}}(j)$ is the difference of the ranks assigned to element $j$ by the two permutations; in the absence of ties this value coincides with the Pearson correlation coefficient computed on the ranks.
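A sketch of the rank-difference formula, again assuming tie-free permutations (scipy.stats.spearmanr handles the general case, including ties):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rho via the rank-difference formula (no ties assumed)."""
    m = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (m * (m ** 2 - 1))


print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))  # 1 - 6*2 / (4*15) = 0.8
```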
Metrics based on rank correlation have a drawback we have already met: they give no special weight to the positions of the elements (even worse than p@K, since the correlation is computed over all elements rather than over the K elements with the highest rank). That is why in practice they are used extremely rarely.
Metrics based on the cascade behavior model
Up to this point, we have not gone into how the user (hereafter we consider the particular case of an object: a user) examines the elements offered to him. In fact, we implicitly assumed that viewing each element is independent of viewing the others, a kind of "naivety". In practice, however, the user often views the elements one after another, and whether the user looks at the next element depends on his satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents at positions 1 and 2 turned out to be extremely relevant, then the probability that the user will view the document at position 3 is small, since he is quite satisfied with the first two.
Such patterns of user behavior, in which the user examines the offered elements sequentially and the probability of viewing an element depends on the relevance of the previous ones, are called cascade models.
Expected reciprocal rank
Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the following formula:

$$ERR(i) = \sum_{k=1}^{K} \frac{1}{k}\, p^{(i)}_{(k)} \prod_{m=1}^{k-1}\left(1 - p^{(i)}_{(m)}\right)$$

where the rank is understood in the order of decreasing $r^{(i)}$, and $p^{(i)}_{(k)}$ is the probability that the user is satisfied by the element at rank $k$. The most interesting part of this metric is these probabilities: under the assumptions of the cascade model, the term $\frac{1}{k}\, p^{(i)}_{(k)} \prod_{m=1}^{k-1}\left(1 - p^{(i)}_{(m)}\right)$ corresponds to the event that the user, not having been satisfied by the first $k-1$ elements, stops at position $k$.
The probabilities themselves are computed from the values $y^{(i)}$. Since in our case $y^{(i)} \in \{0, 1\}$, we can consider the simple option

$$p^{(i)}_{(k)} = y^{(i)}_{(k)}$$

which can be read as the true relevance of the element that ended up at position $k$ after sorting by decreasing $r^{(i)}$. In the general (graded) case, the following formula is most often used:

$$p^{(i)}_{(k)} = \frac{2^{y^{(i)}_{(k)}} - 1}{2^{y_{max}}}$$
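To close, a sketch of ERR that uses the general satisfaction-probability formula above (the `max_grade` parameter and the function name are our own illustration conventions):

```python
def expected_reciprocal_rank(y_sorted, max_grade=1):
    """ERR under the cascade model.

    y_sorted  -- reference relevance grades ordered by decreasing algorithm score r.
    max_grade -- maximum possible grade (1 for binary relevance).
    """
    err = 0.0
    p_reached = 1.0                                   # probability the user got to this position
    for idx, y in enumerate(y_sorted):
        p_satisfied = (2 ** y - 1) / 2 ** max_grade   # probability of being satisfied here
        err += p_reached * p_satisfied / (idx + 1)
        p_reached *= 1 - p_satisfied                  # the user continues only if unsatisfied
    return err


print(expected_reciprocal_rank([1, 0, 1, 0, 0]))  # 0.5/1 + 0.5*0.5/3 ≈ 0.58
```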