
Approaches to extracting data from web resources

In the previous article we reviewed the basic concepts and terminology of Data Mining. Today we will take a closer look at Web Mining and the approaches to extracting data from web resources.

Web Mining is the process of extracting data from web resources and, as a rule, it has a more practical bent than a theoretical one. The main goal of Web Mining is collecting data (parsing) and then saving it in the required format. In practice the task comes down to writing HTML parsers, so let's talk about that in more detail.

There are several approaches to extracting data:
  1. DOM tree analysis, using XPath.
  2. String parsing.
  3. Regular expressions.
  4. XML parsing.
  5. Visual approach.
Let's consider each approach in more detail.

DOM tree analysis


This approach is based on analyzing the DOM tree. Data can be obtained directly by the identifier, name, or other attributes of a tree element (such an element can be a paragraph, a table, a block, etc.). In addition, if an element is not marked by any identifier, it can be reached via a unique path down the DOM tree, for example:

body -> p[10] -> a[1] -> link text

or go through the collection of similar elements, for example:

body -> links -> 5th element -> link text
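
As a rough illustration, here is a minimal sketch of walking such paths manually with HtmlAgilityPack (covered below); the file name and indices are assumptions for the example:

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("page.htm"); // hypothetical local copy of the page

// body -> p[10] -> a[1] -> link text
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
HtmlNode p10 = body.Elements("p").ElementAt(9);  // the 10th <p> child (0-based index)
HtmlNode a1 = p10.Elements("a").First();         // the 1st <a> inside it
string linkText = a1.InnerText;

// body -> links -> 5th element -> link text
string fifthLinkText = doc.DocumentNode.Descendants("a").ElementAt(4).InnerText;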

This approach can be used together with the Microsoft.mshtml library, which is, in fact, the core component of Internet Explorer.

Data Extracting SDK uses Microsoft.mshtml for DOM tree analysis, but is essentially a wrapper over the library that makes it more convenient to use:

// Load and parse the page, then query its links with LINQ
UriHtmlProcessor proc = new UriHtmlProcessor(new Uri("http://habrahabr.ru/new/page1/"));
proc.Initialize();

// Keep links whose class is "topic" and whose URL ends with an integer
var links = from l in proc.Links
            where l.Class == "topic" && EndsWithInt(l.Href)
            select new ResultItem
            {
                Link = l.Href,
                TopicName = l.Text.ToWindows1251()
            };


The next evolutionary step in DOM tree analysis is the use of XPath, i.e. paths that are widely used when parsing XML data. The essence of this approach is to describe the path to an element with a simple syntax, without having to step down the DOM tree gradually. This approach is used by the well-known jQuery library and by the HtmlAgilityPack library:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
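
For comparison, the unique path from the DOM-tree example above collapses into a single XPath expression (a sketch, assuming the same page structure):

HtmlNode a1 = doc.DocumentNode.SelectSingleNode("//body/p[10]/a[1]");
string linkText = a1.InnerText; // the link text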



String parsing


Although this approach cannot be used to write serious parsers, I will say a few words about it.

Sometimes data is displayed using a certain template (for example, a table of mobile phone specifications), where the parameter names are standard and only their values change. In this case, the data can be obtained without analyzing the DOM tree, by parsing strings, as done, for example, in the Data Extracting SDK:

Data:

Company: Microsoft
Headquarters: Redmond

Code:
string data = "<p>Company: Microsoft</p><p>Headquarters: Redmond</p>";
string company = data.GetHtmlString("Company: ", "</p>");
string location = data.GetHtmlString("Headquarters: ", "</p>");

// output
// company = "Microsoft"
// location = "Redmond"
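
GetHtmlString here is an extension method from the Data Extracting SDK. Its actual implementation is not shown here, but a minimal sketch of such a helper could look like this:

static class StringParsingExtensions
{
    // Returns the substring between the first occurrence of 'from'
    // and the next occurrence of 'to', or null if either marker is missing.
    public static string GetHtmlString(this string source, string from, string to)
    {
        int start = source.IndexOf(from);
        if (start < 0) return null;
        start += from.Length;
        int end = source.IndexOf(to, start);
        return end < 0 ? null : source.Substring(start, end - start);
    }
}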


Using a set of string-parsing methods is sometimes (most often in simple, template-like cases) more efficient than DOM tree analysis or XPath.

Regular expressions and XML parsing


One often sees HTML parsed entirely with regular expressions. This is a fundamentally wrong approach: it will bring you more problems than benefits.

Regular expressions should only be used to extract data that has a strict format: email addresses, phone numbers, etc.; in rare cases, addresses or other template-like data.
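
For instance, here is a minimal sketch of pulling e-mail addresses out of text with a regular expression (the pattern is deliberately simplified for illustration):

using System;
using System.Text.RegularExpressions;

string text = "Contact us: support@example.com or sales@example.org";
// A simplified e-mail pattern; real-world validation is more involved
Regex emailPattern = new Regex(@"[\w.+-]+@[\w-]+\.[\w.-]+");
foreach (Match m in emailPattern.Matches(text))
    Console.WriteLine(m.Value); // support@example.com, sales@example.org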

Another inefficient approach is to treat HTML as XML data. The reason is that HTML is rarely valid, i.e. rarely well-formed enough to be processed as XML. Libraries that implemented this approach had to spend more effort on converting HTML to XML than on the actual data parsing. Therefore, it is better to avoid this approach.
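
A quick way to see the problem: perfectly browsable HTML is usually not well-formed XML, so a strict XML parser rejects it outright. A small sketch:

using System;
using System.Xml.Linq;

string html = "<p>unclosed paragraph<br>line break";
try
{
    XDocument.Parse(html); // throws: <p> and <br> are never closed
}
catch (System.Xml.XmlException e)
{
    Console.WriteLine(e.Message);
}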

Visual approach


At the moment, the visual approach is at an early stage of development. The essence of the approach is that the user can “set up” the system without a programming language or API and obtain the necessary data of any complexity and nesting. I have already written about something similar (though applicable in another area): methods of analyzing web pages at the level of information blocks. I think the parsers of the future will be visual.

Problems and general recommendations


Typical problems when parsing HTML data:
  1. The use of JavaScript / AJAX / asynchronous loading makes writing parsers much harder.
  2. Different HTML rendering engines can produce different DOM trees (and the engines themselves can have bugs that later affect parser results).
  3. Large volumes of data require writing distributed parsers, which entails additional synchronization costs.

It is impossible to single out one approach that will be 100% applicable in all cases, so modern libraries for parsing HTML data usually combine several approaches. For example, HtmlAgilityPack allows you to analyze the DOM tree (using XPath), and recently support for Linq to XML has been added. Data Extracting SDK uses DOM tree analysis, contains a set of additional methods for string parsing, and also allows you to use Linq to query the page's DOM model.
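
As an illustration of such a combination, here is a sketch of the earlier "topic links" query written as LINQ over HtmlAgilityPack nodes instead of XPath:

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("page.htm"); // hypothetical local copy of the page

var topicLinks = doc.DocumentNode
    .Descendants("a")                                        // all <a> elements
    .Where(a => a.GetAttributeValue("class", "") == "topic") // keep class="topic"
    .Select(a => a.GetAttributeValue("href", ""));           // project the URLs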

Today, the undisputed leader among HTML parsing libraries for .NET developers is HtmlAgilityPack, but for the sake of interest you can also take a look at other libraries.

Source: https://habr.com/ru/post/99918/

