
How to order a search robot's visits to new pages based on predicted web page popularity (Part I)

[Image: a search robot crawling a web page]

This report was published in the Yandex Technologies hub in March of this year. The study, conducted by a group at Yandex, aims to determine the order in which new pages should be indexed. The first part of the work reviews previous research on the topic. The method proposed by the research group takes into account predicted user behavior on a page, which again brings us back to the relationship between behavioral factors and both ranking and indexing speed. The translation is published with the support of the SERPClick project team, whose goal is to improve your site's ranking by directly influencing its behavioral factors.



Article summary



In this paper, we focus on how a search engine should order the crawling of new pages. Since it is impossible to index all new pages immediately after they appear, the most important (or popular) pages should be indexed first. The most natural indicator of a page's importance is the number of users who visit it. However, the popularity of a new page cannot be known immediately, so it must be predicted from the characteristics of the page or its site. We consider several methods for predicting the popularity of new pages using previously studied crawler performance metrics, and we also propose a new evaluation setup for measuring this effectiveness that is closer to the real-world situation. In particular, we distinguish the short-term and long-term popularity of new pages using data on the decline of popularity over time. Our experiments show that data on the decline of popularity can be successfully used to order the pages a search robot visits. Further research should focus on finer tuning of this mechanism.

Keywords: indexing sequence, new web pages, popularity prediction.

1. Introduction


Crawl scheduling is responsible for choosing which URL will be taken from the waiting list and visited next by the search robot. Although a single strategy can pursue several goals, it is primarily aimed at the following two tasks:
- downloading newly discovered web pages that are not yet reflected in the index; and
- refreshing copies of pages on which important updates have appeared.

In our work, we focus on the first task: indexing new web pages. Because the number of pages on the web grows so quickly, it is impossible to index all new pages immediately after they appear, even with the resources of a major search engine. Therefore, the most important pages should be indexed first.
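To make the task concrete, a crawl frontier can be thought of as a priority queue ordered by a predicted importance score. Below is a minimal Python sketch of this idea; the `predict_importance` function and the placeholder scorer are our illustrative assumptions, not part of the paper.

```python
import heapq

class CrawlFrontier:
    """A crawl frontier as a priority queue: highest predicted score first.

    `predict_importance` is a hypothetical scoring function, standing in
    for the popularity prediction discussed later in the paper.
    """

    def __init__(self, predict_importance):
        self.predict_importance = predict_importance
        self._heap = []

    def add(self, url):
        score = self.predict_importance(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

# Usage with a placeholder scorer (shorter URLs score higher, purely
# as a stand-in for a real popularity predictor).
frontier = CrawlFrontier(lambda url: -len(url))
frontier.add("http://example.com/very/deep/old/archive/page")
frontier.add("http://example.com/breaking-news")
print(frontier.next_url())  # http://example.com/breaking-news
```

Everything that follows in the paper is, in effect, about how to compute a better score for this queue.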

There are several ways to measure the importance of a page; such a measure both defines a specific order in which the search robot visits pages and provides a way to assess the success of indexing. Among the many indicators of page importance, alongside the link graph (used, for example, by PageRank, the most prominent method of this kind), there is also user activity in search, recorded in the search engine's logs.

The goal of any approach [to measuring page importance] is to capture the overall usefulness of indexed pages for the search engine. From this point of view, it is justified to measure the importance of a page by its popularity, i.e., the number of user transitions to (or visits of) that page. This is the approach based on user behavior in search proposed in [14]. It has been shown that the popularity of almost any page is short-lived: a page is popular for some time after its creation, and then user interest declines over time. In this paper, we focus precisely on such pages with short-term user interest, and we predict the peak that this indicator reaches after the page is indexed.

The popularity of a new page cannot be known in advance, so it must be predicted from the parameters of the page that are known at the moment of its discovery on the web. We analyze the problem of predicting popularity for new pages; in particular, we take the dynamics of popularity into account, predicting both the overall popularity of a page and its decline for new URLs. The crawl ordering proposed earlier in [14] is based on predicting the overall popularity of a page and therefore ignores how this indicator changes over time. In fact, under that approach, given two new pages, one of which is popular today while the other will be even more popular but only in a few days, the search robot will index the currently popular page last and thus miss the traffic it could have brought to the search engine right now.
We believe that data on the dynamics of popularity can be used effectively to optimize the behavior of the search robot, although these dynamics are difficult to predict.

We predict the total number of visits that a new page will accumulate over time. Unlike [14], our prediction is based on a model that takes into account signals from various sources, including the page's URL and its domain. We model the dynamics of a page's popularity over time with an exponential function, as suggested in [12].
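As a rough sketch of what such an exponential model can look like (the specific parametrization below is our assumption; the text only states that an exponential form following [12] is used): if a page is expected to receive V visits in total and loses interest at rate λ, its instantaneous visit rate can be written as v(t) = V · λ · e^(−λt), and V · e^(−λt) visits remain after time t.

```python
import math

def visit_rate(t, total_visits, decay_rate):
    """Instantaneous visit rate at time t (e.g., hours since page creation).

    Assumed form: v(t) = V * lam * exp(-lam * t), which integrates to V
    total visits over [0, infinity). The paper's exact parametrization
    may differ; this follows the general shape suggested by [12].
    """
    return total_visits * decay_rate * math.exp(-decay_rate * t)

def visits_remaining(t, total_visits, decay_rate):
    """Expected number of visits still to come after time t."""
    return total_visits * math.exp(-decay_rate * t)

# A page expected to collect 1000 visits in total: after 24 hours,
# a fast-fading page (lam = 0.2/hour) has ~8 visits left, while a
# slow-fading one (lam = 0.02/hour) still has ~619 to come.
for lam in (0.2, 0.02):
    print(lam, round(visits_remaining(24, 1000, lam), 1))
```

The decay rate λ is exactly what distinguishes a short-lived news page from a slowly fading reference page, which is why overall popularity and its decline are predicted as separate quantities.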

We evaluate various ways of prioritizing page indexing based on popularity prediction. The algorithm we propose in this paper takes into account the predicted decline in the popularity of web pages and dynamically reorders the indexing queue in accordance with popularity dynamics. It is worth noting that evaluating a behavior-based crawl ordering method requires experiments in real conditions, where the changing nature of the task itself must be taken into account: indexing delays, the appearance of new pages, and previously popular pages that no longer receive visits. As far as we know, such experiments had not been conducted before. We compare various prioritization strategies for the search robot by testing them in real conditions, and we compare the results with the dynamic indexing success metrics proposed in [12].
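A minimal sketch of such dynamic reordering, continuing the assumed exponential model above: at each scheduling step the queue is re-scored by each page's current visit rate, so a page that is popular right now outranks a page whose larger audience brings little urgency. The field names and the rate-based score are our illustrative choices, not the paper's actual prioritization function; note also that a future popularity peak cannot rise under a simple exponential, so it is approximated here by a very slow decay.

```python
import math

def current_visit_rate(page, now):
    """Visits per unit time the page is attracting right now."""
    age = now - page["created_at"]
    return (page["total_visits"] * page["decay_rate"]
            * math.exp(-page["decay_rate"] * age))

def pick_next(queue, now):
    """Re-score the whole queue at scheduling time, pick the most urgent page."""
    return max(queue, key=lambda p: current_visit_rate(p, now))

# Page /a: smaller total audience but popular right now (fast decay).
# Page /b: larger total audience spread out over many days (slow decay),
# so delaying its indexing costs little traffic.
queue = [
    {"url": "/a", "created_at": 0.0, "total_visits": 800, "decay_rate": 0.3},
    {"url": "/b", "created_at": 0.0, "total_visits": 1200, "decay_rate": 0.01},
]
print(pick_next(queue, now=1.0)["url"])  # /a wins: ~178 vs ~12 visits/hour
```

Ranking by total popularity alone would put /b first and lose /a's current traffic, which is the failure mode described above.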

We conclude that an ordering strategy that takes the decline of page popularity into account is more effective than methods that rely on popularity alone. This confirms our assumption that it is more important to index the pages that are popular right now, so as not to lose the share of traffic that could pass through the search engine.

To summarize, this study makes the following two contributions:

- We solve the problem of predicting both the overall popularity and the rate of popularity decline for new web pages, and we propose a method for predicting overall popularity that is more effective than the one currently in use.

- We test various behavior-based crawl ordering strategies in real-world conditions and find evidence that a strategy that takes the change in popularity into account is more effective than one based only on overall popularity; accordingly, we propose an effective method for forecasting the decline in a new page's popularity.

The rest of the paper is organized as follows. In the next section, we review previous research on indexing new pages and predicting page popularity. In Section 3, we describe the principles and method of the indexing algorithm proposed in this paper. In Section 4, we present the results of testing the new algorithm and compare it with the strategy currently in use. Section 5 concludes the work.

2. Previous studies


A number of works are devoted to predicting the popularity of various elements of the Internet: texts, news, social network users, tweets, Twitter hashtags, videos, and so on. However, only a few works address the popularity of pages as measured by user visits. One of them proposes a model that predicts, for a given query, the query's popularity and the number of clicks from search results on a given page, considering query-page pairs. This model is based on log data about the past dynamics of the given query and of clicks on the corresponding document. Therefore, this approach cannot be applied to predicting the popularity of new pages, for which the search engine does not yet have enough log data, since they have not been indexed yet.

Another study concerns recently discovered pages and predicts the traffic that will pass through them. There, the forecast is based solely on the page's URL. This is an important constraint for planning a crawl order, because we need to predict a page's popularity even before we start downloading it.
Our work can be seen as a continuation of that study: we predict the popularity of new pages over time, combining a forecast of a page's overall popularity with a prediction of its popularity decline.

Moreover, our machine learning algorithm substantially improves on the current approach to predicting the overall popularity of a page. Since the problem of determining popularity from the page's URL alone is relatively new, there are only a few studies on predicting various parameters of a page from its URL before the content is downloaded. Some of these papers offer approaches that can be used in building our popularity prediction model.
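As a hedged illustration of what predicting from the URL alone can look like in practice (the concrete features below are our assumptions, not the paper's feature set): a URL can be broken into tokens and simple structural signals that any standard regressor can consume.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Extract simple features from a URL, before downloading the page.

    These features are illustrative; the paper does not publish its exact
    feature set. They could feed any standard model predicting a page's
    total popularity and decay rate.
    """
    parsed = urlparse(url)
    path_tokens = [t for t in re.split(r"[/\-_.]+", parsed.path) if t]
    return {
        "host": parsed.netloc,
        "path_depth": parsed.path.count("/"),
        "num_tokens": len(path_tokens),
        "has_digits": any(t.isdigit() for t in path_tokens),  # ids, dates
        "has_query": bool(parsed.query),
        "tokens": path_tokens,  # e.g., for a bag-of-words model
    }

print(url_features("http://news.example.com/2014/09/new-iphone-review.html"))
```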

The pioneering work [16] proposes evaluating crawling effectiveness by the usefulness of indexed pages for search users, relying on a fixed ranking method and search query logs. The authors define the quality of a page as an average over all user queries and compare how this indicator changes under various crawl scheduling strategies. They propose an algorithm for efficiently re-crawling pages in order to keep their local copies up to date. The benefit of re-crawling a specific page is estimated from logs that reflect the benefit [to the search engine] of its previous crawls. Because of this limitation, that work does not consider the ordering of newly discovered pages.
Our work, on the contrary, focuses on predicting the utility of a new page from the parameters of its URL, which we can determine without downloading the page. The question of the order in which to crawl new URLs was considered in [17]. In our work, as in [16], the effectiveness of the whole algorithm is measured by the usefulness of indexed pages under the existing ranking method and according to search query logs. When applied to new pages, their expected utility must be computed based only on the page's URL, inbound links, domain indicators, and the corresponding anchor texts.

The method for evaluating a crawl ordering strategy proposed in [16] and [17] can be interpreted as the expected number of clicks an indexed page will receive under the existing ranking method, based on search query logs recorded over a certain time period. Indeed, if a query set Q consists of queries and their frequencies, the authors define the overall usefulness of a page p as:

U(p) = Σ_{q ∈ Q} f(q) · I(p, q),

where f(q) is the frequency of query q, and I(p, q) is the probability that document p receives a click on the results page generated by the current ranking method in response to query q. The query set Q is assumed to come from user query logs over a period of time close to the present moment. Thus, the usefulness of page p is the expected frequency of user transitions to this page from search results. Unlike [16] and [17], we measure not only the current popularity of pages but also their overall usefulness to the search engine, expressed as the number of future visits. Our quality measure thus reflects the total gain the search engine obtains by indexing a given page, not only the gain at the present moment. In particular, our approach takes into account the fact that every page loses popularity at its own rate.
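A small sketch of this utility computation on toy data (the queries, frequencies, and click probabilities below are made up purely for illustration):

```python
def page_utility(page, query_log, click_probability):
    """U(p) = sum over queries q in Q of f(q) * I(p, q).

    query_log: maps each query q to its frequency f(q).
    click_probability: I(p, q), the chance that page p is clicked on the
    results page for q under the current ranking method.
    """
    return sum(freq * click_probability(page, q)
               for q, freq in query_log.items())

# Toy query log: two queries with different frequencies.
log = {"new iphone review": 500, "iphone 6 battery": 120}

def toy_click_probability(page, query):
    # Hypothetical: the page ranks well only for the first query.
    return {"new iphone review": 0.35, "iphone 6 battery": 0.02}.get(query, 0.0)

print(page_utility("example.com/review", log, toy_click_probability))
# 500 * 0.35 + 120 * 0.02 ≈ 177.4 expected clicks
```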

In [12], strategies were proposed for crawling pages in which user interest has recently appeared. That work also considers how to split the crawler's capacity between indexing new pages and re-crawling old ones (in order to discover new links). However, in [12] the popularity of a new page was predicted only from data about the domain linking to it (more precisely, about the page on which the link was found). Our work, in contrast, offers a prediction model that makes it possible to decide which page to index first even when the links were found on the same page or on similar clusters of pages.

From the translators: the text that follows describes the algorithm for solving this problem, with all the corresponding mathematical calculations. Was the part of the article above sufficient for your purposes, or would you like all the details of the study? Your opinion is important to us!

Source: https://habr.com/ru/post/239153/

