How it is done: parsing articles

For me, there has always been a kind of magic in how Getpocket, Readability and VKontakte take a link to a page and offer a ready-made article for reading, without ads, sidebars and menus. And they are almost never wrong. Recently a similar task came up in our project, and I decided to dig a little deeper. I'll say right away that this is "white" parsing: webmasters use our service voluntarily.

In an ideal world, all the information on a page would be semantically marked up. Smart people have come up with many useful things such as Microdata, Open Graph, the article and nav tags and so on, but I wouldn't rush to rely on webmasters' awareness of semantics. It is enough to look at the page source of popular sites. Open Graph, by the way, is the most popular format: everyone wants to look good on social networks.
Extracting the title of the article and its picture is beyond the scope of this post: the title is usually taken from the title tag or from og, and the picture, if it is not taken from og:image, is a separate story.

Let's move on to the most interesting part: extracting the body of the article.
It turns out that there are quite a few scientific papers devoted to this problem (including some by Google employees). There is even a CleanEval competition with a set of test pages from which data must be extracted, where algorithms compete on who does it most accurately.

The following approaches are distinguished:

Extracting data using only the html document (at the DOM and text level). This technique is discussed below.
Extracting data from the rendered document using computer vision. This is a very accurate approach, but also the most complex and resource-hungry. You can see how it works, for example, here: www.diffbot.com (a project by the guys from Stanford).
Extracting data at the level of the whole site, by comparing pages of the same type and finding the differences between them (the differing blocks are in fact the content we need). This is what the big search engines do.

Here we are interested in approaches to extracting an article with only a single html document in hand. Along the way, we can also solve the problem of detecting pages that contain paginated lists of articles. This post is about methods and approaches, not a final algorithm.

We will parse the page http://habrahabr.ru/post/198982/

List of candidates to become an article

We take all the elements that mark up the page structure (for simplicity, div) and the text they contain (if any). Our task is to get a flat list of pairs: DIV element -> the text in it.

For example, the menu block on Habr gives us an element containing the text "posts q & a events of company hubs".

If a div contains nested div elements, their content is discarded; the child divs are processed in their own turn. Example:

We get two elements: one with the text "© habrahabr.ru", and the other with "Support service Mobile version".

We assume that in the 21st century, elements semantically intended for structural markup (div) are not used to mark up paragraphs within the text, and this holds for the top 100 news sites.

As a result, we get a flat set from the tree:

So we have a set in which we need to identify the article. Next, using various fairly simple algorithms, we will lower or raise for each element a coefficient: the probability that the article is in it.

We do not throw the DOM tree away; we will need it in the algorithms.
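A minimal sketch of building such a flat list, assuming Python's standard html.parser module. The names DivTextCollector and flat_candidates are hypothetical and are not taken from our actual implementation.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects, for every div, only the text that belongs to it directly."""

    def __init__(self):
        super().__init__()
        self.candidates = []   # finished entries: (attrs dict, text)
        self._stack = []       # currently open divs: [attrs dict, text chunks]

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([dict(attrs), []])

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            attrs, chunks = self._stack.pop()
            text = " ".join(" ".join(chunks).split())  # normalize whitespace
            if text:
                self.candidates.append((attrs, text))

    def handle_data(self, data):
        # Text goes to the innermost open div only, so a parent never
        # receives the text of its child divs.
        if self._stack:
            self._stack[-1][1].append(data)


def flat_candidates(html):
    """Return a flat list of (div attributes, own text) pairs."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Divs are emitted when they close, so inner divs appear in the list before their parents. The sketch does not skip script or style contents, which a real implementation would probably want to do.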

Finding repeating patterns

Among all the elements of the DOM tree, we look for elements with repeating patterns in their attributes (class, id, ...). For example, if you look at the comments, it becomes clear what a repeating pattern is:

The same set of element classes
The same text substring in the id

We pessimize all these elements and their "children", that is, we apply a reduction factor that depends on the number of repetitions found.
By "children" I mean that all nested elements (including those that made it into our candidate set) are pessimized too. So, for example, the element with the text of a comment also gets caught.
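A rough sketch of this pessimization in Python. Several things here are assumptions made for illustration: elements are given as (attributes, parent index) pairs with parents listed before their children, an id "repeats" if it matches another id once its digits are masked, and the reduction factor is 0.9 per extra repetition.

```python
import re
from collections import Counter


def repetition_penalties(elements, base=0.9):
    """Return one multiplier per element; smaller means more pessimized.

    elements: list of (attrs dict, parent index or None), parents first.
    """
    class_sets = [frozenset((a.get("class") or "").split()) for a, _ in elements]
    # Assumption: ids like "comment_123" share the stem "comment_#".
    id_stems = [re.sub(r"\d+", "#", a.get("id") or "") for a, _ in elements]

    class_counts = Counter(c for c in class_sets if c)
    id_counts = Counter(s for s in id_stems if "#" in s)

    penalties = []
    for (_, parent), classes, stem in zip(elements, class_sets, id_stems):
        repeats = max(class_counts.get(classes, 1), id_counts.get(stem, 1))
        own = base ** (repeats - 1)            # no repetition -> 1.0
        inherited = penalties[parent] if parent is not None else 1.0
        penalties.append(own * inherited)      # children inherit the penalty
    return penalties
```

With two comments that each contain a message block, the comments are reduced once and the nested message blocks twice, while a unique post block keeps a multiplier of 1.0.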

The ratio of links to plain text in the element

The idea is clear: in menus and sidebars we see nothing but links, which clearly does not look like an article. We go through the elements of our set and assign a coefficient to each.
For example, the elements in the Freelance block (which have already received a minus for the repeating class) get another minus on top for an ugly link-to-text ratio of one. Clearly, the smaller this coefficient, the more the text looks like a meaningful article.
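One way this coefficient could be computed, as a sketch: the share of non-whitespace characters that sit inside a tags. The class LinkRatio and the function link_text_ratio are hypothetical names, and measuring in characters rather than words is an assumption.

```python
from html.parser import HTMLParser


class LinkRatio(HTMLParser):
    """Counts characters inside <a> elements versus all characters."""

    def __init__(self):
        super().__init__()
        self.link_chars = 0
        self.total_chars = 0
        self._in_link = 0      # nesting depth of open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))   # ignore whitespace
        self.total_chars += n
        if self._in_link:
            self.link_chars += n


def link_text_ratio(fragment):
    """Return link characters / all characters, from 0.0 to 1.0."""
    parser = LinkRatio()
    parser.feed(fragment)
    parser.close()
    if not parser.total_chars:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A block that consists only of links yields 1.0, while a paragraph with a single short link stays close to 0.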

The ratio of markup to text

The more markup of any kind there is in a block (lists, line breaks, span, ...), the lower the chance that it is an article. For example, the advertisement of what is probably a respectable SEO company does not look much like an article, since all of it is a list. The lower the ratio of markup to text, the better.
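A small sketch of this ratio, assuming we simply divide the number of opening tags by the number of non-whitespace text characters. Both the formula and the names MarkupCounter and markup_text_ratio are illustrative assumptions.

```python
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Counts opening tags and non-whitespace text characters."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

    def handle_data(self, data):
        self.text_chars += len("".join(data.split()))


def markup_text_ratio(fragment):
    """Return tags per character of text; lower looks more like an article."""
    counter = MarkupCounter()
    counter.feed(fragment)
    counter.close()
    return counter.tags / max(counter.text_chars, 1)
```

A list of short items wrapped in li and span scores far higher than a plain paragraph of the same length.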

The number of periods (sentences) in the text

Here we have almost crawled into the territory of numerical linguistics. The point is that headings and menus contain practically no periods, while the body of an article has plenty of them.
If some menus and lists of new materials on the site have still made it through the previous filters, you can finish them off by counting periods. There are not many of them in the "best" block.

The more periods, the better: we raise this element's chances of earning the proud title of article.
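Counting can be as simple as the sketch below. It assumes that a sentence ends with ".", "!" or "?" followed by whitespace or the end of the text, so that a dot inside a domain name is not counted; count_sentences is a hypothetical name.

```python
import re

# Assumption: sentence-ending punctuation is followed by whitespace or
# by the end of the text. Dots inside words ("habrahabr.ru") do not count.
SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def count_sentences(text):
    """Return the number of sentence endings found in the text."""
    return len(SENTENCE_END.findall(text))
```

A menu line such as "posts q & a events of company hubs" gives zero, while ordinary prose gives one hit per sentence.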

The number of blocks with text of about the same length

Many blocks with text of about the same length are a bad sign, especially if the text is short. We pessimize such blocks. The idea works well on a layout like this:

The example is not from Habr, since this algorithm works better on stricter grids. On Habr comments, for example, it will not work very well.
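A sketch of how such blocks might be detected. The 20% tolerance, the 200-character threshold for "short" and the 0.9 factor are all made-up numbers for illustration, as are the function names.

```python
def similar_length_counts(lengths, tolerance=0.2):
    """For each block, count the other blocks whose text length is
    within `tolerance` (relative to the longer one) of its own."""
    counts = []
    for i, a in enumerate(lengths):
        n = 0
        for j, b in enumerate(lengths):
            if i != j and abs(a - b) <= tolerance * max(a, b):
                n += 1
        counts.append(n)
    return counts


def similar_length_penalties(lengths, short=200, base=0.9):
    """Multiplier per block: more similar neighbours means a lower value,
    and short blocks are reduced twice as fast."""
    penalties = []
    for length, n in zip(lengths, similar_length_counts(lengths)):
        exponent = n * (2 if length < short else 1)
        penalties.append(base ** exponent)
    return penalties
```

For text lengths [120, 110, 125, 115, 3000] the four short teaser-like blocks each find three similar neighbours and are pessimized, while the long block finds none and keeps a multiplier of 1.0.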

The length of the text in the element

Here the relationship is direct: the longer the text in the element, the greater the chance that it is the article.

Moreover, the contribution of this parameter to the element's final score is very significant. In 90% of cases, extracting the article can be solved by this method alone. All the preceding checks raise that to 95%, but they also eat the lion's share of processor time.

But imagine a comment the size of the article itself. If you identify the article simply by the length of the text, you get an embarrassing result. There is, however, a good chance that the previous algorithms will clip the wings of our graphomaniac commenter a little, since the element will be pessimized for the repeating pattern in its id or class.
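To make the interplay concrete, here is a deliberately simplified sketch in which text length is the dominant term and the other checks act as multipliers. The formula, the numbers and the names article_score and pick_article are assumptions for illustration only, not our final algorithm.

```python
def article_score(text_length, repetition_penalty=1.0, link_ratio=0.0):
    """Length dominates; the other checks can only scale it down.

    repetition_penalty: multiplier in (0, 1] from the repeating-pattern check.
    link_ratio: share of the text that is links, from 0.0 to 1.0.
    """
    return text_length * repetition_penalty * (1.0 - link_ratio)


def pick_article(candidates):
    """candidates: dict of name -> keyword arguments for article_score."""
    return max(candidates, key=lambda name: article_score(**candidates[name]))


candidates = {
    "post": {"text_length": 4000},
    # A comment longer than the post, pessimized for its repeated class/id.
    "long_comment": {"text_length": 4500, "repetition_penalty": 0.5},
    "menu": {"text_length": 300, "link_ratio": 1.0},
}
best = pick_article(candidates)
```

By length alone the long comment would win; with the repetition multiplier applied, the post comes out on top.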

Or one more case - a weighty drop-down menu made using




Source: https://habr.com/ru/post/200394/
