Performance metrics for vertical search results based on a click model

In 2014, Yandex published a report describing the details and conclusions of an experiment on how user behavior affects the metrics used to evaluate the quality of search results. The translation was carried out with the support of the Research Department of the ALTWeb Group, which studies the influence of behavioral factors on result ordering and ranking methods. ALTWeb Group uses the results of its own research to develop and implement modern solutions in digital commerce. Publications from open sources are used for scientific purposes.

The Yandex report presented here covers one aspect of how behavioral factors shape the approach to building search results pages. The text of the study is reproduced in full and for informational purposes.

Abstract

Modern search engines show users heterogeneous information that originates from sources of various types, also called "verticals". Evaluating systems of this kind is an important and difficult task that has yet to be solved. In this paper, we consider the hypothesis that models capturing user behavior on heterogeneous search results pages make it possible to improve the quality of offline metrics. We propose two vertical-aware metrics based on user click models for federated search and evaluate them on query logs collected by the Yandex search engine. We show that, depending on the type of vertical, the proposed metrics correlate more closely with online user behavior than other state-of-the-art metrics.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval.

Keywords

Click models, evaluation, federated search.

1. Introduction

When evaluating a web search system, it is usually assumed that users get a results page with ten snippets, also known as “ten blue links,” and that they scan these snippets from top to bottom. However, modern search engines go beyond the “ten blue links” paradigm and show the user heterogeneous information from various search sources, also known as verticals (for example, images, news, maps, etc.). In this case, user behavior differs significantly from that on a standard results page [3, 10]. Although such changes in user behavior should be taken into account when building heterogeneous results pages, little research has been done in this direction so far [12].

The quality of a results page can be evaluated in two ways: online or offline. Online evaluation, such as split testing, collects feedback directly from users. Typically, this feedback includes clicks, time on the page, mouse movements, and other signals, and the quality of the system is assessed from them. Search quality can also be evaluated offline, using manual judgments of the entire search engine results page (SERP) and/or its parts. Such an evaluation can be carried out with or without offline effectiveness metrics. Recently, a hybrid evaluation method was proposed in which offline metrics are built on user behavior models whose parameters are estimated from search query logs [4]. The evaluation is thus performed offline and gives results immediately, yet it still uses feedback from users (in the form of clicks), which makes it possible to take the preferences of real users into account.

In this paper, we consider the problem of evaluating heterogeneous search results in light of the above. In particular, we develop effectiveness metrics for heterogeneous results pages based on click models.

Our main research question is the following: can the quality of offline effectiveness metrics for web search be improved by using data on user behavior in the presence of various vertical results?

The contributions of this study are as follows. First, we develop two effectiveness metrics for vertical results based on click models for federated search [3, 10]. Second, we evaluate the proposed metrics on a large sample of search logs, using search sessions with various types of vertical results, namely images, video, maps and news.

2. Metrics based on a user click model

Web search effectiveness metrics should reflect how users perceive the quality of the results they are shown. Accordingly, these metrics increasingly rely on user behavior data. Traditional metrics, such as the various precision-based measures, assume that users are interested in relevant documents and therefore focus on relevance. In addition to relevance, more advanced metrics such as nDCG [7] and RBP [8] assume that users scan the results from top to bottom, and discount the relevance of each document according to its rank on the results page.

Recently, however, a number of evaluation metrics based on a user click model have been proposed. Such a model estimates the probability of a click on each document shown to the user on the results page. Metrics based on a user behavior model, in turn, use these probabilities to measure the quality of the search results. The Expected Reciprocal Rank (ERR) metric [2] (a ranking metric used by Yahoo; translator's note) uses a simplified version of the DBN click model [1], in which the user scans the results from top to bottom until finding a relevant document and then leaves the search. Expected Browsing Utility (EBU) [11] (a method proposed by Microsoft; translator's note) is also based on a simplified DBN model but, unlike ERR, which uses predefined parameter values, EBU estimates its parameters directly from click logs.

Chuklin and coauthors [4] proposed a general way to convert click models into metrics for evaluating the quality of a results page. They applied this idea to existing click models, such as DBN [1], DCM [6] and UBM [5]. As a result, a number of utility-based and effort-based metrics were proposed. All of them outperformed standard metrics that do not take click models into account.
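The cascade behavior behind ERR described above can be illustrated with a short Python sketch. It uses the standard ERR definition from [2], where a grade R is mapped to a stopping probability (2^R - 1) / 2^R_max. The function name and the default five-grade scale (grades 0 to 4) are assumptions made for this illustration only.

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """ERR of a ranked list of relevance grades (integers in 0..max_grade).

    Illustrative sketch: the user scans top-down and stops at rank k with
    probability r_k, given that no earlier result satisfied them.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        r = (2 ** grade - 1) / 2 ** max_grade  # chance this result satisfies
        err += p_reach * r / rank
        p_reach *= 1.0 - r  # user continues only if not satisfied
    return err
```

Moving a highly relevant document down the list lowers the score, because the user is less likely to reach it and its reciprocal rank is smaller.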

3. Metrics for vertical data sources

The metrics mentioned above have proved effective in the standard user scenario. However, existing offline evaluation methods for web search do not take into account the presence of vertical results on the results page. Recent studies have shown that in this case user behavior deviates significantly from the standard scenario [3, 10].

The following click models have been proposed to capture these deviations: the Federated Click Model (FCM) [3] and the Vertical-aware Click Model (VCM) [10]. These models fit real click data more closely and with lower error than click models for standard search. However, no corresponding effectiveness metrics have been developed. We try to fill this gap by converting FCM and VCM into offline click model-based metrics.

We hypothesize that these metrics correlate better with the results of online experiments than existing offline metrics when results from various (vertical) sources are present on the search results page.

Both FCM and VCM extend the User Browsing Model (UBM) for web search [5] (although DCM and DBN could also be used as the base model).

Following [4], UBM can be used to create a utility-based metric, and we focus on metrics of this kind in this paper.

A utility-based metric can be defined as follows:

\[ \mathrm{uMetric} = \sum_{k=1}^{N} P(C_k = 1) \cdot r_k \tag{1} \]

where N denotes the number of documents on the results page, P(C_k = 1) is the probability that the k-th document is clicked, and r_k denotes the relevance of the k-th document on the results page.

In expression (1), the relevance r_k is an offline parameter, while the click probability P(C_k = 1) is calculated from the user click model. We use the definition of relevance r adopted in the work on ERR [2], based on the relevance grade R: r = (2^R - 1) / 2^{R_max}.
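As a small illustration of expression (1), the Python sketch below multiplies each document's click probability by its relevance and sums over the page. The function names are hypothetical, the click probabilities are taken as given (they would come from a click model), and grades are assumed to be integers from 0 to R_max = 4.

```python
def grade_to_relevance(grade, max_grade=4):
    """Map a relevance grade R to r = (2^R - 1) / 2^R_max, as in ERR [2]."""
    return (2 ** grade - 1) / 2 ** max_grade


def utility_metric(click_probs, grades, max_grade=4):
    """Utility-based metric: sum over ranks of P(C_k = 1) * r_k.

    click_probs[k] and grades[k] describe the document at rank k + 1.
    """
    if len(click_probs) != len(grades):
        raise ValueError("click_probs and grades must have the same length")
    return sum(
        p * grade_to_relevance(g, max_grade)
        for p, g in zip(click_probs, grades)
    )
```

For example, `utility_metric([1.0, 0.5], [4, 0])` counts only the first document, since a grade of 0 maps to zero relevance.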

According to the UBM click model, a document is clicked if and only if it is examined by the user and attractive to them:

\[ C = 1 \iff E = 1 \text{ and } A = 1 \]

where E and A are random variables recording the events that the document is examined and that it is attractive. In UBM, the attractiveness depends on the document and the query q, while the examination probability depends on the document's rank and its distance from the last clicked document.

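During offline evaluation of web search, clicks are not available, so the distance d to the last clicked document is unknown. This distance therefore has to be marginalized out in order to calculate the final click probability. According to [4], P_UBM(C = 1) can be defined by the following formula:

\[ P_{UBM}(C_k = 1) = P(A_k = 1) \sum_{j=0}^{k-1} \gamma_{k,k-j} \, P_{UBM}(C_j = 1) \prod_{i=j+1}^{k-1} \bigl(1 - P(A_i = 1)\,\gamma_{i,i-j}\bigr) \tag{2} \]

where \gamma_{k,d} is the UBM examination probability at rank k given distance d to the last click, and where for simplicity it is assumed that P_{UBM}(C_0 = 1) = 1.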
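The marginalization over the unknown distance can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it sums over the position j of the previous click, treats position 0 as a virtual click that always happens, and takes the examination parameters as a caller-supplied function `gamma(rank, distance)`. All names are hypothetical.

```python
def ubm_click_probabilities(attractiveness, gamma):
    """Marginal click probabilities P(C_k = 1) under a UBM-style model.

    attractiveness[k - 1] is P(A_k = 1) for the document at rank k.
    gamma(k, d) is the examination probability at rank k when the
    previous click was d positions above (d = k means no earlier click).
    """
    n = len(attractiveness)
    p_click = [1.0] + [0.0] * n  # p_click[0]: virtual click before rank 1
    for k in range(1, n + 1):
        total = 0.0
        for j in range(k):
            # Probability that ranks j + 1 .. k - 1 were all skipped,
            # given that the last click was at position j.
            skipped = 1.0
            for i in range(j + 1, k):
                skipped *= 1.0 - attractiveness[i - 1] * gamma(i, i - j)
            total += p_click[j] * skipped * gamma(k, k - j)
        p_click[k] = attractiveness[k - 1] * total
    return p_click[1:]
```

A quick sanity check on the logic: if every document is always examined (`gamma` returns 1.0 everywhere), the click probability at each rank reduces to that document's attractiveness.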

FCM-based metric

Research on user behavior in federated search shows that the presence of vertical results affects the probability that other documents on the results page are examined [3, 10]. To model this difference from standard search, FCM introduces an additional hidden variable F, which indicates whether user behavior changes when vertical results are present on the page. In this paper, we call this “vertical attractiveness.” According to FCM, the probability that a document is examined is given by the following equations:

where t represents the type of the vertical result, v represents its position, and l is the distance between the vertical result and the other results on the page, which can be either positive or negative. Thus, the examination probability in the FCM model can be calculated as follows:

To obtain the click probability P_FCM(C = 1), one substitutes the examination probability P_FCM(E = 1) into equation (2) in place of P_UBM(E = 1) = γ_kd. The uFCM metric is then obtained by plugging P_FCM(C = 1) into equation (1).

VCM-based metric

Like FCM, VCM assumes that the probability of examining a document changes when an attractive vertical result is present on the results page (F = 1). VCM also assumes that in this case the user looks at the vertical result first and only then examines the other results, either going back to the top of the page or continuing downward. This is controlled by a hidden variable B. Thus, VCM models the examination probability as follows:

These equations describe three possible examination paths through the results page:

(i) from the top of the page downward (F = 0),

(ii) starting at the vertical result, then returning to the top of the results page (F = 1; B = 1), and

(iii) from the vertical result to the end of the results page (F = 1; B = 0).

The overall examination probability in VCM is calculated as the weighted average of the examination probabilities along these three paths:

where d, d' and d'' denote the distances to the last clicked document along each of the paths.

The overall click probability of the VCM model cannot be substituted directly into expression (2), because it uses different distances for the different paths of user behavior. It is therefore necessary to compute the click probability for each path separately and so eliminate each of the distances from the equation. The total click probability for the VCM model can then be written as follows:

where P_i denotes the examination probability along the i-th path. The uVCM metric is calculated by substituting P_VCM(C = 1) into expression (1).

4. Evaluation

4.1 Experimental setup

To evaluate the proposed metrics on search with vertical results, we collected user search sessions from the click logs of Yandex, a large commercial search engine. As in [3, 10], we used three types of vertical results: images and video as multimedia verticals, news as a text vertical, and maps as a mixed vertical containing both text and visual data. We sampled sessions containing one of these vertical results in November 2013. The top 10 documents in each session were rated on a standard five-grade scale (perfect, excellent, good, fair, bad). The collected sessions were split by user ID into training and test sets (see Table 1). The uneven number of sessions across verticals reflects how often each vertical appears in sessions sampled from the click logs.

Table 1

Following [2, 4], we assess the quality of the proposed metrics by their agreement with online metrics, such as UCTR and Max/Mean/MinRR. UCTR is a binary variable indicating whether there was a click in a session (the opposite of abandonment). MeanRR is the mean reciprocal rank of the clicks in a session. MaxRR is the maximum reciprocal rank among the clicks in a session, and MinRR is the minimum. For these online metrics, only clicks on search results are considered.
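The online metrics above can be computed per session from the ranks of the clicked results. The Python sketch below is a minimal illustration: the function name and the dictionary layout are hypothetical, and returning zero for all reciprocal-rank metrics in a session without clicks is an assumption of this sketch.

```python
def online_click_metrics(click_ranks):
    """Per-session online metrics from 1-based ranks of clicked results.

    Returns UCTR (1 if the session has any click, else 0) and the
    maximum, mean and minimum reciprocal ranks of the clicks.
    """
    if not click_ranks:
        # Assumption: an abandoned session scores zero on every metric.
        return {"UCTR": 0, "MaxRR": 0.0, "MeanRR": 0.0, "MinRR": 0.0}
    reciprocal = [1.0 / rank for rank in click_ranks]
    return {
        "UCTR": 1,
        "MaxRR": max(reciprocal),
        "MeanRR": sum(reciprocal) / len(reciprocal),
        "MinRR": min(reciprocal),
    }
```

For a session with clicks at ranks 2 and 4, the reciprocal ranks are 0.5 and 0.25, so MaxRR is 0.5, MinRR is 0.25 and MeanRR is 0.375.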

Since the results page for the same query may differ depending on the user, their location and other personalization factors, we focus on configurations [2], each of which represents a query with a fixed results page (see the statistics in Table 1). Offline metrics give the same value for all sessions with the same configuration, while online metrics are averaged over all sessions with the same configuration. The correlation between offline and online metrics is then computed over all configurations, as in [2]:

where N is the total number of configurations, n_c is the number of times configuration c occurs, m_{ic} is the value of metric m_i for configuration c, and \bar{m}_i represents the mean value of the metric m_i.
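The correlation formula itself is not reproduced here. A weighted Pearson correlation, with each configuration weighted by how often it occurs, is one form consistent with the quantities described above; the Python sketch below implements that form as an assumption, not necessarily the exact expression from [2]. The function and argument names are hypothetical.

```python
import math


def weighted_correlation(offline, online, counts):
    """Weighted Pearson correlation between two metrics over configurations.

    offline[c] and online[c] are the metric values for configuration c,
    and counts[c] is the number of sessions observed with it.
    """
    if not (len(offline) == len(online) == len(counts)):
        raise ValueError("all inputs must have the same length")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    mean_off = sum(n * x for n, x in zip(counts, offline)) / total
    mean_on = sum(n * y for n, y in zip(counts, online)) / total
    cov = sum(
        n * (x - mean_off) * (y - mean_on)
        for n, x, y in zip(counts, offline, online)
    )
    var_off = sum(n * (x - mean_off) ** 2 for n, x in zip(counts, offline))
    var_on = sum(n * (y - mean_on) ** 2 for n, y in zip(counts, online))
    if var_off == 0 or var_on == 0:
        raise ValueError("correlation is undefined for a constant metric")
    return cov / math.sqrt(var_off * var_on)
```

Weighting by session counts means that frequent configurations influence the correlation more than rare ones, which matches the idea of averaging online metrics over all sessions that share a configuration.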

We compare our vertical-aware metrics with two types of baselines:

(i) static offline metrics, whose parameters are fixed (DCG and ERR), and
(ii) click model-based metrics for web search, whose parameters are estimated from click logs (EBU, uDCM, uDBN and uUBM). When estimating these model parameters, the attractiveness probability P(A = 1) (and the satisfaction probability P(S = 1) for DBN) is assumed to depend only on the relevance grade of the document for the given query, as in [4].

4.2 Results and discussion

The correlation between offline and online metrics for the different types of vertical results is shown in Tables 2-5, with the best values in bold.

Tables 2 and 3

Table 2 presents the results for the news vertical. News snippets contain mostly text and are therefore similar to standard web snippets. As a result, most offline metrics (with the exception of DCG) correlate with online metrics to a similar degree. At the same time, the proposed vertical-aware metrics, uFCM and uVCM, correlate slightly better than the other metrics.

Tables 3 and 4 show the results for the multimedia verticals, namely images and video. In both cases, uFCM shows higher correlation with all online metrics than the baselines. This result is intuitive, given that user behavior, according to the logs, changes significantly when visual stimuli (for example, an image) are present in a vertical result [3, 10]. The FCM model captures these changes, which, in turn, results in a higher correlation between uFCM and online metrics.

The uVCM metric ranks second in terms of correlation with online metrics. However, it does not match uFCM. This can be explained as follows. The click models FCM and VCM use a vertical attractiveness parameter, which shows how much user behavior differs from the standard web search scenario when a vertical result of type t is present at rank v. The lower the value of this parameter, the closer the vertical-aware model is to the corresponding UBM model. After training FCM and VCM on the image and video verticals, we observed that the expected value of this parameter turns out to be relatively high for FCM, which in turn means that FCM departs substantially from UBM. In contrast, its value for VCM turned out to be very low, which keeps VCM much closer to UBM. Indeed, Tables 3 and 4 show that the correlation of uVCM with online metrics is close to that of uUBM.

Tables 4 and 5

Table 5 presents the results for the maps vertical, which contains data in both text and visual formats. DCG has the highest correlation with the online RR-based metrics, followed by uDCM (which has the highest correlation with UCTR) and EBU. We used A/B testing to look into this result. The test was conducted on real users of the search engine, with the maps vertical turned off for one week.
This experiment showed that the abandonment rate was much higher when the vertical result was shown than when it was not. We see two reasons for this phenomenon: (i) users are satisfied with the information presented directly on the results page (address, phone, hours of operation, etc.) and leave the search without a click, which is considered good abandonment; (ii) some users perceive the maps vertical as a banner (especially if the vertical takes the top position on the page) and skip it, which can be seen as a form of banner blindness.

For route queries, this results in no clicks on the results page. In both cases, online metrics such as MeanRR and UCTR do not provide a full picture of user behavior. Thus, the low correlation of offline metrics observed in Table 5 cannot be interpreted as a negative result. Other means of assessing the quality of offline metrics should be applied in this case (for example, classifying abandonments as “positive” and “negative” as in [9] and calculating the correlation for the latter type only), which will be the subject of our future work.

Our work reveals several important trends. First, our results confirm the findings of previous work on user behavior in federated search, namely that user behavior depends on the type of vertical result included in the results page: visually appealing verticals, such as video, affect user behavior more than text verticals, such as news. Verticals with mixed content, such as maps, provoke more complex user behavior that requires further investigation.

Second, in answer to the research question formulated in Section 1, we showed that, depending on the type of vertical, the proposed click model-based metrics for federated search correlate more strongly with online user behavior than offline metrics for web search. In particular, uFCM has the highest correlation when visually appealing verticals, such as images and video, are included in the results page. The uVCM metric, by contrast, is more conservative and stays closer to the corresponding UBM-based metric.

5. Conclusions and further research

In this paper, we looked at the problem of offline evaluation of heterogeneous search results, where standard search results compete with vertical results. We investigated how data on user behavior on such mixed results pages can help improve the quality of offline metrics. To this end, we took the existing click models for federated search, namely FCM and VCM, and converted them into click model-based effectiveness metrics. Experimental results showed that, depending on the type of vertical, the proposed metrics correlate more strongly with online metrics, especially when visually attractive vertical results, such as images and videos, are shown on the results page.

In future work, we plan to extend the proposed metrics to evaluate not only web results but the results page as a whole, including vertical results, sponsored search and other components. We also plan to study in more detail how users behave when results of the maps vertical are shown on the results page. First of all, we would like to understand the reason for the high abandonment rates that we observed in this case, after which we plan to develop methods for separating good and bad abandonment for a more accurate assessment of the quality of offline metrics.

Acknowledgments. The authors would like to thank Evgeny Krokhlev and Sergey Protasov for discussions that inspired this work and for technical support. This study was partially funded by a grant from the Swiss Science Foundation, P2T1P2_152269 (for the other organizations, grants and programs that supported this work, see the original text; translator's note).

Bibliography

[1] O. Chapelle and Y. Zhang. A dynamic bayesian network click model for web search ranking. In WWW '09, pages 1–10, 2009.
[2] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM '09, pages 621–630, 2009.
[3] D. Chen, W. Chen, H. Wang, Z. Chen, and Q. Yang. Beyond the ten blue links: In WSDM '12, pages 463–472, 2012.
[4] A. Chuklin, P. Serdyukov, and M. de Rijke. Click model-based information retrieval metrics. In SIGIR '13, pages 493–502, 2013.
[5] G. E. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR '08, pages 331–338, 2008.
[6] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In WSDM '09, pages 124–131, 2009.
[7] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Information Systems, 20(4):422–446, 2002.
[8] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Information Systems, 27(1):2:1–2:27, 2008.
[9] Y. Song, X. Shi, R. W. White, and A. Hassan. Context-aware web search abandonment prediction. In SIGIR '14, 2014.
[10] C. Wang, Y. Liu, M. Zhang, S. Ma, M. Zheng, J. Qian, and K. Zhang. Incorporating vertical results into search click models. In SIGIR '13, pages 503–512, 2013.
[11] E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM '10, pages 1561–1564, 2010.
[12] K. Zhou, T. Sakai, M. Lalmas, Z. Dou, and J. M. Jose. Evaluating heterogeneous information access. In Proc. MUBE workshop, 2013.

Source: https://habr.com/ru/post/237285/
