Greetings to all!
My previous articles were mostly about the theoretical side of Data Mining; today I want to talk about a practical example used in a Ph.D. thesis (for that reason, at this stage of development it cannot be considered a full-fledged working project, but it can be considered a prototype).
We are going to clear web pages of "information noise".
So what's the problem?
The problem is that a good half of websites contain a bunch of unnecessary information on the pages - the so-called “information noise”. This includes navigation, related links, design elements, and, of course, advertising.
At the moment, existing content-filtering solutions work at the technology level - blocking pop-up windows, JavaScript and Flash, searching for forbidden words, or removing ad blocks based on a database of known ad hosts. Of course, there are other approaches, but that's not the topic here...
The concept of information blocks
The idea is to treat a web page not as a single whole, but as a set of information blocks, each of which we consider a unit of displayed information. This approach lets us pick out the needed sections of content and analyze them in the context of the entire page. More on this below.
Determining the importance of user information
It is clear that not all information is equally useful (important) to the user. It is also clear that assessing importance is subjective, and I don't claim to be definitive here. Still, intuitively, information can be divided into three types by importance:
* important information (main content)
* unimportant information (related links, most viewed, “they also buy with this product”, etc.)
* information garbage (header, footer, advertising, etc.)
You could, of course, split this into more levels, but I think these three will do for now.
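To make the block concept a bit more concrete, here is a minimal illustrative sketch of how a block and its importance level could be represented in .NET; the names (InformationBlock, ImportanceLevel and so on) are my own and are not the actual types from the prototype.

```csharp
// Purely illustrative data model; the type and member names are hypothetical.
public enum ImportanceLevel
{
    MainContent,   // important information
    Secondary,     // related links, "most viewed", etc.
    Garbage        // header, footer, advertising
}

public class InformationBlock
{
    public string Text { get; set; }           // plain text extracted from the segment
    public double[] Parameters { get; set; }   // the ~20 numeric characteristics (see below)
    public double Score { get; set; }          // regression estimate of importance
    public ImportanceLevel Level { get; set; } // assigned after clustering
}
```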
Problems and tasks
To achieve the goal, the following tasks had to be solved:
* division of the web page into information blocks
* creating a model to assess the importance of a particular block
* logic for deciding which type of importance to assign to which block
Splitting a web page into information blocks
Not as trivial a task as it may seem at first glance. Earlier approaches relied on 1) analyzing the DOM tree, or 2) analyzing a large number of pages within a site and deriving its so-called "template". As a rule, these approaches did not give good results.
The solution was to use the VIPS (Vision-based Page Segmentation) algorithm from Microsoft Research Asia. In short, it takes a combined approach: analysis of the DOM tree plus the algorithm's own segmentation rules, derived by experts or experimentally.
Observed drawbacks:
* the library is unmanaged C++, so there is no tight integration with the newer controls (for example, WebBrowser), and I had to fiddle with the interop
* the algorithm has a Granularity property - the minimum distance between information blocks. Obviously this distance differs from site to site, so for now it has to be picked by hand; automatic selection of granularity is a topic for separate research
* the algorithm's output is something like XML, only much worse, and it took me quite a while to write a parser that "understands" this format
Despite all this, VIPS is quite convenient, and for lack of alternatives it was taken as the basis.
Creating a model to assess the importance of a particular unit
Things get much more interesting here, since this part was done entirely by me.
The main task was to determine the evaluation criteria and the rules by which different blocks can be told apart. More specifically: by what features can you distinguish a set of links from the main content? In a link block, the percentage of words that belong to links is certainly higher than in ordinary text. Also, the average sentence length of the main content is usually greater than in a link block. So analyzing the main parameters (characteristics) of a block can tell us what kind of block we are looking at.
The model is based on multivariate analysis and regression. I selected about 20 parameters that could theoretically influence the determination of the content type. Next, it was necessary to determine the weight of each parameter in the regression model.
Among the parameters were the following (a rough sketch of how a few of them could be computed is given after the list):
* average sentence length
* number of words
* number of stop words
* number of links, pictures, lists, etc.
* relative parameters such as the number of link words / the number of all words, the percentage of occurrences of stop words
* etc.
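To illustrate, here is a rough sketch of how a few of these parameters could be computed from a block's text; the tiny stop-word list, the way link text is passed in, and all the names here are simplifying assumptions of mine rather than the prototype's real code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlockFeatures
{
    // Tiny illustrative stop-word list; a real one would be far larger.
    static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "the", "a", "an", "and", "or", "of", "to", "in", "is", "on" },
        StringComparer.OrdinalIgnoreCase);

    // Computes a handful of the parameters listed above from a block's plain
    // text and from the text contained in its links.
    public static Dictionary<string, double> Compute(string blockText, string linkText)
    {
        var words = Regex.Matches(blockText, @"[\p{L}\p{Nd}]+")
                         .Cast<Match>().Select(m => m.Value).ToList();
        int linkWords = Regex.Matches(linkText, @"[\p{L}\p{Nd}]+").Count;
        var sentences = Regex.Split(blockText, @"[.!?]+")
                             .Where(s => s.Trim().Length > 0).ToList();

        double wordCount = words.Count;
        double stopWordCount = words.Count(w => StopWords.Contains(w));

        var features = new Dictionary<string, double>();
        features["WordCount"] = wordCount;
        features["AvgSentenceLength"] = sentences.Count > 0
            ? sentences.Average(s => Regex.Matches(s, @"[\p{L}\p{Nd}]+").Count)
            : 0.0;
        features["StopWordRatio"] = wordCount > 0 ? stopWordCount / wordCount : 0.0;
        features["LinkWordRatio"] = wordCount > 0 ? linkWords / wordCount : 0.0;
        return features;
    }
}
```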
Regression model of the importance of information blocks
For this, a program was developed that:
* picked 100-200 sites from Google
* divided each web page into information blocks
* parsed each block's content into the 20 parameters
* stored all of this in a database
Then several experts manually assigned an "importance" rating to each block at their own discretion.
The result was a database on which a regression analysis was performed, giving each parameter a weight (its degree of influence on the importance rating). The regression was built in the SPSS statistical package.
The result is a regression model of the type:
y (param1, ..., param20) = coef1 * param1 + coef2 * param2 + coef3 * param3 + ... + coef20 * param20
I would say that the most "important" parameter was the percentage of stop words :)
Having this model, we feed in the parameters of a specific block and get a quantitative (numeric) score for it. Clearly, a block that receives a larger value is more "important" to the user.
The accuracy of this model can be improved by analyzing a larger number of web pages: 200 pages is not enough for a really accurate model, but it is enough for our prototype.
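Applying the model is then just a weighted sum of the block's parameters. The sketch below assumes the coefficients exported from the SPSS regression are stored in a plain array; the values shown are placeholders, not the real coefficients.

```csharp
using System;

public static class ImportanceModel
{
    // Placeholder coefficients; the real values came out of the SPSS regression.
    static readonly double[] Coefficients = { 0.42, -0.10, 0.07 /* ..., coef20 */ };

    // y(param1, ..., param20) = coef1*param1 + coef2*param2 + ... + coef20*param20
    public static double Score(double[] parameters)
    {
        double y = 0.0;
        int n = Math.Min(Coefficients.Length, parameters.Length);
        for (int i = 0; i < n; i++)
            y += Coefficients[i] * parameters[i];
        return y;
    }
}
```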
The definition of "important" blocks
At first I used fixed "boundaries" on the importance score, along the lines of "more than 20 - main content; more than 10 but less than 20 - secondary content", and so on. But this approach did not work: every page is different, and the scores could vary significantly from one page to another.
A good solution was to use the fuzzy c-means clustering algorithm, which, for a given page, "decides" how to cluster the blocks by their numeric scores and "spreads" them into three clusters (three, because there are three types of importance).
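For reference, below is a compact, textbook-style sketch of fuzzy c-means applied to the one-dimensional block scores, with three clusters and the standard fuzzifier m = 2; it is my own illustration, not the code used in the prototype.

```csharp
using System;

public static class FuzzyCMeans
{
    // Clusters 1-D block scores into 'c' fuzzy clusters and returns, for each
    // score, the index of the cluster it belongs to most strongly.
    public static int[] Cluster(double[] scores, int c, double m, int maxIter, double eps)
    {
        int n = scores.Length;
        var rnd = new Random(0);

        // Random initial membership matrix u[i,j] with each row summing to 1.
        var u = new double[n, c];
        for (int i = 0; i < n; i++)
        {
            double rowSum = 0;
            for (int j = 0; j < c; j++) { u[i, j] = rnd.NextDouble() + 1e-6; rowSum += u[i, j]; }
            for (int j = 0; j < c; j++) u[i, j] /= rowSum;
        }

        var centers = new double[c];
        for (int iter = 0; iter < maxIter; iter++)
        {
            // Centers are membership-weighted means of the scores.
            for (int j = 0; j < c; j++)
            {
                double num = 0, den = 0;
                for (int i = 0; i < n; i++)
                {
                    double w = Math.Pow(u[i, j], m);
                    num += w * scores[i];
                    den += w;
                }
                if (den > 0) centers[j] = num / den;
            }

            // Memberships are recomputed from the distances to the centers.
            double maxChange = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < c; j++)
                {
                    double dij = Math.Abs(scores[i] - centers[j]) + 1e-12;
                    double sum = 0;
                    for (int k = 0; k < c; k++)
                    {
                        double dik = Math.Abs(scores[i] - centers[k]) + 1e-12;
                        sum += Math.Pow(dij / dik, 2.0 / (m - 1));
                    }
                    double newU = 1.0 / sum;
                    maxChange = Math.Max(maxChange, Math.Abs(newU - u[i, j]));
                    u[i, j] = newU;
                }

            if (maxChange < eps) break;
        }

        // Hard assignment: each block goes to the cluster with the highest membership.
        var labels = new int[n];
        for (int i = 0; i < n; i++)
        {
            int best = 0;
            for (int j = 1; j < c; j++) if (u[i, j] > u[i, best]) best = j;
            labels[i] = best;
        }
        return labels;
    }
}

// Usage sketch: three clusters for the three importance levels; the cluster
// whose center ends up highest would then be treated as the main content.
// int[] labels = FuzzyCMeans.Cluster(blockScores, 3, 2.0, 100, 1e-5);
```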
What is the result?
As a result, we get ONLY the main content (ideally, of course - see the problems above).
The prototype is implemented as a browser codenamed "SmartBrowser" and can be freely downloaded from http://smartbrowser.codeplex.com/ .
Requirements:
* Windows 32-bit (the VIPS dll needs to be registered in the system)
* .NET Framework 3.5
Examples
Examples can be found on the page where the news about the program was first published.
This is what the program looks like:

Reviews
"Some" people from America wanted to sponsor further development, but then the crisis hit and that was the end of it.
The guys from Microsoft Research Asia (the authors of the VIPS algorithm) spoke positively about the idea and wished me luck in developing it.
If you have comments, want to develop this topic further, or can simply help with advice - I'm always glad to hear it. If someone already has work of their own in this area - let's cooperate :)