
How a search robot can order its visits to new pages by predicting web-page popularity (Part II)

How likely is a new web page to become popular?

We return to the broad topic of behavioral factors and continue our translation of the Yandex paper on building a mathematical model for choosing an optimal policy for indexing new pages. The algorithm, based on machine learning and probability theory, predicts the popularity of a new page measured as visits (user clicks) recorded by the browser toolbar, and also works with the rate at which that popularity declines, which the algorithm likewise predicts. The translation is published with the support of the analytical department of the ALTWeb Group and the team of the SERPClick special project, which improves the ranking of web pages by directly influencing behavioral factors. This second part of the article is abridged.


3. Algorithm
3.1 System
First, let us describe the overall design of the proposed algorithm. Following previous studies in this area [7, 16, 20], we assume a uniform indexing cost across all pages. In other words, we take a fixed time τ that the robot needs to load one page. Thus, every τ seconds the robot selects the next page to load from its queue of pages awaiting indexing. This queue is updated continuously as new links to pages are discovered, and it can be updated for a number of reasons: new URLs may be found by the robot itself or may come from browser logs, for example. In this paper we do not consider the problem of discovering new URLs; in our experiments, new URLs were taken from the browser toolbar.
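The fixed-rate loop described above can be sketched as follows. The queue contents, the batching of newly discovered URLs, and the scoring function are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the fixed-rate indexing loop: every "tick"
# (one tick = tau seconds), pick the best-scoring queued URL.
# `url_batches` simulates URLs arriving between ticks (e.g. from
# browser toolbar logs); `score` stands in for the expected utility.

def run_crawler(url_batches, score, steps):
    queue = set()          # dynamically updated queue of discovered URLs
    indexed = []
    batches = iter(url_batches)
    for _ in range(steps):
        # New URLs discovered since the last tick join the queue.
        queue.update(next(batches, []))
        if not queue:
            continue
        best = max(queue, key=score)   # page with highest expected utility
        queue.remove(best)
        indexed.append(best)
    return indexed
```

For example, with scores {"a": 3, "b": 5, "c": 1} and "c" discovered only on the second tick, the crawl order is "b", "a", "c": the queue is re-ranked on every tick as it changes.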

3.2 Metric
The following metric for measuring the efficiency of the algorithm was proposed in [12]. The utility we gain from indexing page i with some delay after its appearance is the total number of clicks that this page will receive from the search results page after it is indexed. It is important to note that we consider only pages that are of interest in the short term. The quality of the algorithm at time t, averaged over intervals of length T, is then expressed as follows:
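The formula itself is not reproduced in this translation; a plausible formalization consistent with the surrounding text (average utility per unit time over the preceding window of length T) is:

```latex
% Plausible reconstruction, not the original rendering:
% Q(t) averages, over the window [t - T, t], the utility gained from
% each page u indexed in that window, where \nu_u(\Delta_u) is the
% number of clicks (later in the text: toolbar visits) that page u
% receives after being indexed with delay \Delta_u.
Q(t) \;=\; \frac{1}{T} \sum_{u \,:\, u \text{ indexed in } [t-T,\, t]} \nu_u(\Delta_u)
```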

It is also important to note that the number of clicks on the results page depends on several factors, including user behavior and the ranking system, and not only on the indexing algorithm. Therefore, it is not always obvious how to interpret a change in the number of clicks when the indexing policy changes. For example, does the number of clicks change only because of the change in the indexing policy, or does it also depend on the ranking method, in which case the number may differ across equally good indexing policies? For this reason, we believe it is wiser to turn to the user logs obtained from the toolbar and take the number of visits as the main measure in place of the click count; this is sufficient to measure the effectiveness of indexing. Thus, from here on, the utility obtained by indexing page i with some delay denotes the number of visits this page receives after it has been indexed.
As will be shown in section 4.1, we evaluate the success of the algorithm on a dataset collected over one month. We therefore set T = 1 month, so the metric expresses the average benefit (utility) per second of applying a given indexing policy over that month.

3.3 Indexing Strategy

As agreed in section 3.1, every τ seconds (τ being the fixed page-load time) our algorithm must select a page from the page queue for indexing. Our task is to choose the appropriate URL each time, given that the indexing queue changes dynamically.
To do this, we take the dynamics of popularity into account as follows. It was shown in [12] that the utility obtained by indexing page u with delay Δ decays exponentially with time.
Thus, this utility can be approximated by μ(u)·e^(−λ(u)Δ) with some parameters μ(u) and λ(u).

We can see that when Δ = 0 we get μ(u), which expresses the overall popularity of page u, that is, in our case, the total number of users visiting the page since its creation, while λ(u) marks the rate of decline of its popularity.

For each page u we predict the parameters μ(u) and λ(u) as described in section 3.4. Finally, the following strategy is chosen. Every τ seconds we select the page with the maximum expected utility. That is, every τ seconds we compute for each page u a rating R(u) = μ(u)·e^(−λ(u)·a(u)), which is then used to rank all new pages in the queue for indexing, where a(u) is the time that has passed since the discovery of page u. We then index the page with the highest rating in the queue.
For simplicity, the time when the page was discovered is approximated by the time when it was created. That is, we assume that the first visit to the page recorded in the toolbar user logs is close to the time the page was created.
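A minimal sketch of this rating-based selection step, assuming predicted values mu and lam are already available (the variable names and the example pages are illustrative, not from the paper):

```python
import math

def rating(mu, lam, age_seconds):
    """Expected remaining utility of indexing a page discovered
    age_seconds ago: R(u) = mu(u) * exp(-lam(u) * a(u))."""
    return mu * math.exp(-lam * age_seconds)

def pick_next(queue):
    """Select the queued page with the highest rating.
    `queue` maps url -> (predicted mu, predicted lam, age in seconds)."""
    return max(queue, key=lambda u: rating(*queue[u]))

# Hypothetical queue: a very popular but fast-fading page versus a
# moderately popular, slowly fading one, both discovered 10 hours ago.
pages = {
    "news/hot":  (1000.0, 1e-4, 36000.0),  # high mu, popularity mostly spent
    "blog/post": (200.0,  1e-6, 36000.0),  # lower mu, but ages slowly
}
```

Note the effect of the decay term: the page with the higher total popularity μ is not necessarily indexed first if most of its visits have already passed.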

3.4 Expected page popularity

In this section, we look at the method for predicting the parameters μ(u) and λ(u) for a specific URL u. We use a decision tree model based on gradient boosting, with the list of features described later in this section. Recall that μ(u) is the total number of visits to page u recorded through the toolbar. Since the distribution of μ(u) over all pages u in our dataset spans several orders of magnitude, we deal with very large values. To construct an indexing policy we need to know the order of magnitude of the popularity, while the exact number is less important. As is common for quantities with such distributions [19], we predict the value log μ(u) by minimizing the standard deviation of the error on the training set.

We also need to determine the decline in popularity λ(u). Let ν_t(u) denote the number of visits within time t (measured in hours in our experiments) after URL u was discovered. We train a decision tree model that predicts the ratio of ν_t(u) to the total number of visits to page u, that is, the value r = ν_t(u)/μ(u) (again minimizing the standard deviation of the error). We can then determine the rate of decline as follows. From the equation ν_t(u) = μ(u)·(1 − e^(−λ(u)t)) it follows that e^(−λ(u)t) = 1 − r. Taking the logarithm, we get −λ(u)·t = ln(1 − r), and thus we can define λ(u) through λ(u) = −ln(1 − r)/t.
Thus, the predicted utility of indexing page u with delay Δ after its appearance is μ(u)·e^(−λ(u)Δ), where μ(u) and λ(u) are the predicted values described above.
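The derivation above can be checked numerically. The numbers here are illustrative rather than real toolbar data; the model is the exponential decay fit from section 3.3:

```python
import math

def decay_rate(ratio, t_hours):
    """Recover lam from the predicted share `ratio` = nu_t / mu of all
    visits arriving within the first t_hours: lam = -ln(1 - ratio) / t."""
    return -math.log(1.0 - ratio) / t_hours

def predicted_utility(mu, lam, delay_hours):
    """Predicted visits gained by indexing with the given delay."""
    return mu * math.exp(-lam * delay_hours)

# Illustrative numbers: 90% of all visits arrive within the first 12 hours.
lam = decay_rate(0.9, 12.0)
# Sanity check: the share of visits within 12 hours implied by lam is 0.9,
# matching the ratio we started from.
share = 1.0 - math.exp(-lam * 12.0)
```

With these numbers, indexing with zero delay captures the full predicted popularity μ(u), and the predicted utility falls off sharply as the delay grows, which is exactly what drives the ranking in section 3.3.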

Bibliography

1. Abiteboul, S., Preda, M., Cobena, G.: Adaptive on-line page importance computation. In: Proc. WWW Conference (2003)
2. Abramson, M., Aha, D.: What's in a URL? Genre classification from URLs. In: Conference on Artificial Intelligence, pp. 262-263 (2012)
3. Bai, X., Cambazoglu, B.B., Junqueira, F.P.: Discovering URLs through user feedback. In: Proc. CIKM Conference, pp. 77-86 (2011)
4. Baykan, E., Henzinger, M., Marian, L., Weber, I.: A comprehensive study of features and algorithms for URL-based classification. ACM Trans. Web (2011)
5. Baykan, E., Henzinger, M., Weber, I.: Efficient discovery of authoritative resources. ACM Trans. Web (2013)
6. Cho, J., Schonfeld, U.: RankMass crawler: a crawler with high personalized PageRank coverage guarantee. In: Proc. VLDB (2007)
7. Edwards, J., McCurley, K.S., Tomlin, J.A.: An adaptive model for optimizing performance of an incremental web crawler. In: Proc. WWW Conference (2001)
8. Fetterly, D., Craswell, N., Vinay, V.: The impact of crawl policy on web search effectiveness. In: Proc. SIGIR Conference, pp. 580-587 (2009)
9. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction. Springer, New York (2001)
10. Kan, M.Y.: Web page classification without the web page. In: Proc. WWW Conference, pp. 262-263 (2004)
11. Kumar, R., Lang, K., Marlow, C., Tomkins, A.: Efficient discovery of authoritative resources. Data Engineering (2008)
12. Lefortier, D., Ostroumova, L., Samosvat, E., Serdyukov, P.: Timely crawling of high-quality ephemeral new content. In: Proc. CIKM Conference, pp. 745-750 (2011)
13. Lei, T., Cai, R., Yang, J.M., Ke, Y., Fan, X., Zhang, L.: A pattern tree-based approach to learning URL normalization rules. In: Proc. WWW Conference, pp. 611-620 (2010)
14. Liu, M., Cai, R., Zhang, M., Zhang, L.: User browsing behavior-driven web crawling. In: Proc. CIKM Conference, pp. 87-92 (2011)
15. Olston, C., Najork, M.: Web crawling. Foundations and Trends in Information Retrieval 4(3), 175-246 (2010)
16. Pandey, S., Olston, C.: User-centric web crawling. In: Proc. WWW Conference (2005)
17. Pandey, S., Olston, C.: Crawl ordering by search impact. In: Proc. WSDM Conference (2008)
18. Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E.: Modeling and predicting behavioral dynamics on the web. In: Proc. WWW Conference, pp. 599-608 (2012)
19. Tsur, O., Rappoport, A.: What's in a hashtag?: content based prediction of spread of ideas in microblogging communities. In: Proc. WSDM Conference, pp. 643-652 (2012)
20. Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J., Ozsen, L.: Optimal crawling strategies for web search engines. In: Proc. WWW Conference (2002)

This and other translations can be found in the ALTWeb Group blog on Habr. The company's analytical department researches and reviews current search problems and applies the resulting knowledge in the development of its own products, such as SERPClick, a tool for improving site ranking based on behavioral factors that currently has almost no domestic and no foreign analogues. Thank you for the translation provided. Subscribe to our blog if you want to stay up to date!

Source: https://habr.com/ru/post/239453/

