
Getting the main content of web pages programmatically

Clearing web pages of information noise is one of the pressing problems of information retrieval: the goal is to strip away everything extraneous and keep only the main content.

Consider an example:

[Screenshot: an example web page containing the main content along with navigation, advertising, and other noise]

The main content can be considered to be this part of the page:

[Screenshot: the main content portion of the same page]

Where it can be applied:

Before proceeding to the description of my own solution, I will briefly survey the existing ones. With the advent of HTML5, the problem of finding the main content should, in theory, disappear, since the specification introduces new semantic elements. Let us consider them in more detail.

HTML5 semantic elements


The specification currently provides the following semantic elements: <header>, <nav>, <section>, <article>, <aside> and <footer>.

The classic structure of a blog post:

[Image: the classic structure of a blog post marked up with HTML5 semantic elements]
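If a page actually uses these elements, extraction becomes almost trivial. A minimal sketch of the idea in C#, using the HtmlAgilityPack parser (the parser choice and the URL are my assumptions, not part of the original article):

using System;
using HtmlAgilityPack;

class ArticleExtractor
{
    static void Main()
    {
        // HtmlAgilityPack is an assumption here; any HTML parser would do.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/some-post");

        // On a page with HTML5 semantics the main content is simply
        // the first <article> element.
        HtmlNode article = doc.DocumentNode.SelectSingleNode("//article");
        Console.WriteLine(article != null
            ? article.InnerText
            : "No <article> element; heuristics are needed.");
    }
}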

Using this approach currently runs into a number of problems, the biggest of which is the existence of billions of pages that were written long ago and are unlikely to ever be rewritten against the new standard. That is why the task of identifying the main content remains important and relevant.

Readability


Website: http://lab.arc90.com/experiments/readability/

Readability is a development by Arc90 Lab: a small bookmarklet that brings web pages into a readable form. Readability analyzes the DOM with its own metrics to identify the "useful" content.
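Arc90 has not published a formal specification of these metrics, but the core idea, scoring DOM nodes by how much plain text they contain relative to link text (menus and "related links" blocks are link-heavy), can be sketched like this. The weights and candidate tags below are my illustrative assumptions, not Readability's actual values:

using System.Linq;
using HtmlAgilityPack;

static class ReadabilityLikeScorer
{
    // Lots of text is good; a high share of link text (typical for
    // menus and footers) is bad. Illustrative, not Arc90's real metric.
    static double Score(HtmlNode node)
    {
        int textLength = node.InnerText.Length;
        int linkText = node.Descendants("a").Sum(a => a.InnerText.Length);
        double linkDensity = textLength > 0 ? (double)linkText / textLength : 1.0;
        return textLength * (1.0 - linkDensity);
    }

    public static HtmlNode FindMainContent(HtmlDocument doc)
    {
        // Real implementations also weigh class/id names, paragraph
        // counts and punctuation; this only ranks block containers.
        return doc.DocumentNode
                  .Descendants()
                  .Where(n => n.Name == "div" || n.Name == "td")
                  .OrderByDescending(Score)
                  .FirstOrDefault();
    }
}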

Readability example:

[Screenshot: a page processed by Readability]

At the moment there are plug-ins for various browsers, and in Safari this feature is known as Safari Reader. For those who simply read on the web this should be enough, but what about those who want to use such a tool in their own scripts? That is what the rest of this article is about.

Studies on the importance of information blocks


A number of my previous articles were devoted to the study of this problem, and I invite readers to get acquainted with those publications. My own development, SmartBrowser, is a browser prototype that clears web pages of information noise; it has been available for a long time on CodePlex at http://smartbrowser.codeplex.com/.

A new version of SmartBrowser will soon be available on the site. The percentage of correctly recognized web pages has increased, while the models and algorithms have become simpler as a result of experiments and research.

Currently, SmartBrowser looks like this:

[Screenshot of the SmartBrowser interface]

Let us see SmartBrowser in action. Processing a web page looks like this:

// Download the page at the given URL and extract its main content
MainContentExtractor r = new MainContentExtractor(new Uri(tbUrl.Text.Trim()));
var html = r.GetContent();
// Render the extracted title and content in the WebBrowser control
webBrowser1.DocumentText = r.GetTitle().InnerText + html;


All the logic is encapsulated in the MainContentExtractor class from the Data Extracting SDK library, which I have already written about several times (this functionality is not yet published on the site).
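For scripting scenarios the same class can be used outside of a WinForms form. Here is a sketch under the assumption that the API matches the snippet above; the namespace, file name and console wrapper are mine, not part of the SDK:

using System;
using System.IO;
using DataExtracting; // assumed namespace; the SDK's real one may differ

class Program
{
    static void Main(string[] args)
    {
        // Same MainContentExtractor API as in the snippet above.
        var extractor = new MainContentExtractor(new Uri(args[0]));
        string html = extractor.GetContent();
        string title = extractor.GetTitle().InnerText;

        // Save the cleaned page for offline reading (file name is arbitrary).
        File.WriteAllText("main-content.html", "<h1>" + title + "</h1>" + html);
        Console.WriteLine("Saved main content of " + args[0]);
    }
}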

As a result, I obtained the following results for a number of well-known sites:

[Screenshots of extraction results for several well-known sites]
There are still problems with some sites, for example with Habr, so research and development continue; I hope that in the near future I will be able to announce a stable build.

Thank you for your attention.

Source: https://habr.com/ru/post/105582/

