Data Extracting SDK is written for .NET Framework 3.5 and contains tools for extracting and analyzing data from text files and web resources. Following the results of the survey, I am publishing the first version, Data Extracting SDK CTP (Community Technical Preview), for everyone to try.
Key features:
HTML processing: downloading and parsing HTML
DOM analysis: extracting links, images, tables
link extraction with filters, the ability to write your own filter, and deep site analysis
extraction of email addresses, phone numbers, URLs, etc.
site content analysis (number of elements, word density)
tools for SEO analysis
Let's take a closer look at the main features of the SDK.
How to use Data Extracting SDK
Areas of use:
programs that collect required information
development of analytical services, site and content analysis
programs for building databases and lists
competitor analysis software
SEO programs
automation of work with web resources
crawler programs
HtmlProcessor and ContentAnalyzer classes
The HtmlProcessor class is designed for loading and processing HTML.
Key features:
proxy support
UserAgent support
extraction of titles, meta tags, images, links, keywords, etc.
working with tables (search, filters)
support for GET and POST requests
The ContentAnalyzer class extends HtmlProcessor and contains tools for statistical analysis of content; a usage sketch follows the HtmlProcessor examples below.
The class diagram is presented below:
An example of working with HtmlProcessor:
HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"),
    new WebProxy("http://111.111.11.1/", true));
proc.Initialize();
string html = proc.InnerHtml;   // full HTML of the page
string text = proc.InnerText;   // page text without tags
// "Access and connect with thousands of
// Microsoft Certified companies to find products and services"
string value = proc.GetHtmlString("Microsoft Pinpoint", "</div></div>").RemoveHtmlTags();
List<ImageInfo> images = proc.Images;   // images found on the page
The WebProxy argument can be omitted from the constructor.
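For example, loading the same page without a proxy is just a shorter call (a minimal sketch that reuses only the constructor overload implied above and the members already shown):

HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"));
proc.Initialize();
string text = proc.InnerText; // page text without HTML tags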
To send a POST request, you need to use the following code:
HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"),
    new WebProxy("http://11.11.1.1:111/", true));
proc.HttpMethod = HttpMethods.POST;
var parameters = new NameValueCollection();
parameters.Add("name", "value");
proc.PostParameters = parameters;
proc.Initialize();
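ContentAnalyzer is used in the same way, but its members are not listed in this post, so the snippet below is only a rough sketch: the Uri constructor, GetWordDensity and ElementsCount are assumed names standing in for the statistical tools mentioned above, not the documented API.

// hypothetical sketch: member names below are assumptions
ContentAnalyzer analyzer = new ContentAnalyzer(new Uri("http://www.microsoft.com/"));
analyzer.Initialize(); // inherited from HtmlProcessor
double density = analyzer.GetWordDensity("Microsoft"); // assumed: density of the word on the page
int elementsCount = analyzer.ElementsCount;            // assumed: number of HTML elements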
The LinksExtractor class
The LinksExtractor class is designed for extracting links from web pages; a usage sketch follows the feature list below.
Key features:
extracting links from a given URL
proxy support
ability to extract visible and invisible links
deep (recursive) analysis of pages
limiting the maximum number of extracted links
support for filters and extraction rules
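This post does not include a LinksExtractor snippet, so the code below is only a hedged sketch of the feature list above; the constructor, the Depth, MaxLinksCount and Links members, and the LinkInfo type are assumed names, not the confirmed API.

// hypothetical sketch: member names below are assumptions
LinksExtractor extractor = new LinksExtractor(new Uri("http://www.microsoft.com/"));
extractor.Depth = 2;            // assumed: follow links two levels deep
extractor.MaxLinksCount = 100;  // assumed: cap on the number of extracted links
extractor.Initialize();
foreach (LinkInfo link in extractor.Links)
{
    Console.WriteLine(link.Href);
}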
Class diagram:
Rules:
TextMustContainCondition - the link text must contain a given value
TextMustNotContainCondition - the link text must not contain a given value
SameDomainCondition - links must be in the same domain as the page (internal links only)
LinkIdMustContainCondition - the link id must contain a given value
LinkIdMustNotContainCondition - the link id must not contain a given value
HrefMustContainCondition - the link's href must contain a given value
HrefMustNotContainCondition - the link's href must not contain a given value
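How the rules are attached to the extractor is not shown here either; continuing the sketch above, the Conditions collection and the condition constructor argument below are assumptions rather than the documented API.

// hypothetical: assumes LinksExtractor exposes a Conditions collection
extractor.Conditions.Add(new SameDomainCondition());                // internal links only
extractor.Conditions.Add(new TextMustContainCondition("download")); // assumed constructor argument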
To add your own rule, you need to implement a simple interface:
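The interface itself is not reproduced in this post, so the sketch below invents a minimal single-method contract; the names ILinkCondition, IsSatisfied and LinkInfo are illustrative assumptions, not the SDK's actual definitions.

// hypothetical sketch: interface and member names are assumptions
public class HrefMustEndWithCondition : ILinkCondition
{
    private readonly string suffix;

    public HrefMustEndWithCondition(string suffix)
    {
        this.suffix = suffix;
    }

    // return true when the link passes the rule
    public bool IsSatisfied(LinkInfo link)
    {
        return link.Href != null
            && link.Href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase);
    }
}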
You can read about the WebScreenshotExtractor class and the program that uses it here.
We will talk about EmailsExtractor, PhonesExtractor, UrlsExtractor, GuidExtractor, SEO and other features next time, but some examples can be found here.