📜 ⬆️ ⬇️

PageRank disassembled for formulas

Approximately 95% of the text in 25 billion documents indexed by Google , made up of a small dictionary of ten thousand words. This means that almost any search query will produce millions of documents. Thus, the calculation of the relevance of the document is a nontrivial mathematical task. For this, a combination of the most complicated mathematical methods is used. In addition, the content of the web is constantly changing, so the relevance score needs to be constantly recalculated. Central to the ranking system of Google is the PageRank algorithms.

We all know that the end result of PageRank is a certain “importance” of the PR page, which takes values ​​from PR0 to PR10 and is calculated by analyzing the incoming links. Their quantity and quality indicate the importance of this page for the online community.

The PR level that we see is a highly rounded value, and the exact indicator is known only to Google programmers. The PR is changed on a logarithmic scale, that is, the value of PR5 is an order of magnitude larger than PR4.

What formulas are used to calculate the PR? This is described in a detailed article on the website of the American Mathematical Society.
')
This is how PageRank works. Suppose that there are lj links on the Pj page. If one of these links leads to the Pi page, then Pj will transmit 1 / lj to its “importance” to the Pi page (the transfer of karma on the Habré works approximately according to the same algorithm).

The importance level (i.e., PR) of the Pi page is the sum of all such values ​​from all incoming links. If we present a set of pages that refer to the Pi page as Bi , then the “importance” Pi is calculated using the following formula:



All this looks like a chicken and egg problem. To find out the PR of a page, we first need to know the PR of all the pages that link to it. However, mathematical methods allow us to solve this problem.

For this, a hyperlink matrix is ​​created. , in which row i of column j has the following form:



This is a stochastic matrix, that is, a matrix in which all columns and / or rows are rows of non-negative real numbers giving a total of one.

Generate a vector whose elements are PR values, that is, the “importance” of all pages. According to our conditions, the vector is stationary.

Consider the situation on the example of a small matrix of eight web pages, hyperlinks between which are displayed by arrows.



This situation corresponds to the matrix



and stationary vector



The calculation shows that page 8 wins the popularity contest. Here is the same picture, where the most "authoritative" pages are colored in a lighter color.



This is roughly how PageRank works, from a mathematical point of view. These are only the basic principles of the algorithm. Details can be found in the original article .

Source: https://habr.com/ru/post/4809/


All Articles