
On removing unimportant parts of pages during site indexing

The question of separating the necessary, useful content from the rest of the clutter quite often confronts anyone who collects this or that information on the Web.

I see no particular reason to dwell on the algorithm for parsing HTML into a tree, especially since, in its generalized form, students learn to write such parsers around their third or fourth year at university. A regular stack, a few tricks to skip attributes (except those needed later), and the output is a parse tree. The text is split into words right during parsing, and the words go into a separate list where, besides general information, all their positions in the document are also stored. Naturally, the words in that list are already in normalized (dictionary) form; I have already written about morphology, so here I will just copy it from the previous article.
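
To make the idea concrete, here is a minimal sketch of such a stack-based parse that also records word positions. It is not the author's actual parser: it leans on Python's standard html.parser for tokenization, drops all attributes, and the class and field names are my own invention.

```python
from html.parser import HTMLParser
from collections import defaultdict
import re

class PageParser(HTMLParser):
    """Builds a crude parse tree with a stack and records word positions."""

    def __init__(self):
        super().__init__()
        self.stack = [("root", [])]          # nodes are (tag, children) pairs
        self.words = defaultdict(list)       # word -> list of positions in the document
        self.position = 0

    def handle_starttag(self, tag, attrs):
        node = (tag, [])                     # attributes are simply skipped here
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # split the text into words on the fly and remember every position
        for word in re.findall(r"\w+", data.lower()):
            self.words[word].append(self.position)
            self.position += 1
            self.stack[-1][1].append(word)   # keep the word in the tree as a leaf

parser = PageParser()
parser.feed("<html><body><b>crawler</b> indexes pages</body></html>")
print(dict(parser.words))   # {'crawler': [0], 'indexes': [1], 'pages': [2]}
```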


First, using Zaliznyak's morphological dictionary, we find the longest matching stem, cut off the ending, and substitute the first dictionary form. This whole process is compiled into a tree for fast lookup, with the final leaves holding the possible ending variants. We walk along the word, descending the tree in parallel letter by letter, until we reach the deepest reachable leaf; there, based on the ending, we substitute the normalized form.
If no normal form was found, we fall back to stemming: from the text of books downloaded from lib.ru I built a table of ending frequencies; we look up the most common ending that fits (also via a tree) and replace it with the normal-form ending. Stemming copes well even when the word did not exist in the language 5-10 years ago - it will happily reduce "crawlers" to "crawler".
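
A minimal sketch of this two-step normalization is below. It is deliberately simplified: the real implementation walks letter-by-letter down trees built from Zaliznyak's dictionary and lib.ru frequency data, whereas here the dictionaries are tiny hypothetical placeholders and the lookup is a plain prefix/suffix scan.

```python
# Toy stand-ins for the real data: known stems with their dictionary forms,
# and a table of ending frequencies supposedly built from a corpus.
LEMMA_BY_STEM = {"index": "index", "pag": "page"}             # hypothetical entries
ENDING_FREQUENCY = {"ers": 120, "er": 300, "s": 900}          # hypothetical counts

def normalize(word: str) -> str:
    """Dictionary lookup by the longest matching stem, then a stemming fallback."""
    # 1. Dictionary step: try prefixes from the longest down, looking for a known stem.
    for cut in range(len(word), 0, -1):
        stem = word[:cut]
        if stem in LEMMA_BY_STEM:
            return LEMMA_BY_STEM[stem]
    # 2. Stemming fallback: strip the most frequent ending that actually matches.
    for ending in sorted(ENDING_FREQUENCY, key=ENDING_FREQUENCY.get, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word

print(normalize("pages"))     # -> "page"    via the dictionary stem "pag"
print(normalize("crawlers"))  # -> "crawler" via the stemming fallback
```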

After much experimentation with HTML parsing, I noticed that identical blocks in the HTML obviously produce identical subtrees: roughly speaking, if you take two pages, build two trees, and XOR them against each other, only the desired content remains. Or, to put it more simply, the intersection of most of these trees within one site gives a probabilistic model: the more often a block occurs, the less important it is. Everything that occurs on more than 20-30% of pages I simply throw out - there is no point wasting time on duplicate content.

The obvious solution: learn to compute a kind of CRC over each subtree; for every subtree it is then easy to count the number of repetitions. On re-parsing, zeroing out the tree vertices that occurred too often is trivial, and the page text can always be reassembled from the remaining tree (although in practice this is never actually needed).
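
Here is a sketch of that idea using the same (tag, children) tuples as in the parser sketch above. MD5 stands in for whatever checksum the author actually used; the function names and the 30% threshold default are my own, chosen to match the figures in the text.

```python
import hashlib
from collections import Counter

def subtree_hash(node) -> str:
    """A stable fingerprint of a subtree: the tag name plus the hashes of its
    children. Identical blocks on different pages collide on purpose."""
    if isinstance(node, str):                        # word / text leaf
        payload = "text:" + node
    else:
        tag, children = node
        payload = tag + "(" + ",".join(subtree_hash(c) for c in children) + ")"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def count_subtrees(root, counter: Counter):
    """First pass: record how often every subtree fingerprint occurs on the site."""
    counter[subtree_hash(root)] += 1
    if not isinstance(root, str):
        for child in root[1]:
            count_subtrees(child, counter)

def prune(root, counter: Counter, pages: int, threshold: float = 0.3):
    """Second pass: drop subtrees seen on more than `threshold` of the pages."""
    if isinstance(root, str):
        return root
    if counter[subtree_hash(root)] > threshold * pages:
        return None                                  # "zero out" the boilerplate block
    tag, children = root
    kept = [c for c in (prune(c, counter, pages, threshold) for c in children)
            if c is not None]
    return (tag, kept)
```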

So, in two passes over all pages of the site - first collecting the statistics, then indexing - the problem of isolating templates is solved easily. We also get a nice bonus: constructions like <td></td> , <b></b> and other meaningless ones are thrown out first.
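
The two passes could be glued together roughly as follows; this driver is hypothetical and simply reuses PageParser, count_subtrees and prune from the sketches above. Note that empty constructs such as <td></td> or <b></b> hash identically everywhere, so they are among the first subtrees to cross the frequency threshold and be discarded.

```python
def index_site(pages_html):
    """Two passes over a site's pages: collect subtree statistics, then keep
    only what survives pruning (the part that would actually be indexed)."""
    counter = Counter()
    trees = []
    for html in pages_html:                  # pass 1: build trees, gather statistics
        p = PageParser()
        p.feed(html)
        trees.append(p.stack[0])             # the ("root", children) tuple
        count_subtrees(trees[-1], counter)
    cleaned = []
    for tree in trees:                       # pass 2: drop over-frequent blocks
        cleaned.append(prune(tree, counter, pages=len(pages_html)))
    return cleaned
```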

The full content and list of my articles will be updated here: http://habrahabr.ru/blogs/search_engines/123671/

Source: https://habr.com/ru/post/123882/

