
Data Extracting SDK: Part 1

Data Extracting SDK is written for the .NET Framework 3.5 and contains tools for extracting and analyzing data from text files and web resources. In response to the survey results, I am publishing the first version, the Data Extracting SDK CTP (Community Technology Preview), for everyone to try.

Key features:
Let us look at the main features of the SDK in more detail.

How to use Data Extracting SDK


Areas of use:

HtmlProcessor and ContentAnalyzer classes


The HtmlProcessor class is designed for loading and processing HTML.

Key features:
The ContentAnalyzer class is an extension of the HtmlProcessor class and contains tools for statistical analysis of content.
[Figure: class diagram of HtmlProcessor]

An example of working with HtmlProcessor:

using System;
using System.Collections.Generic;
using System.Data;
using System.Net;

// Create a processor for the page; the request goes through the given proxy
HtmlProcessor proc = new HtmlProcessor(
    new Uri("http://www.microsoft.com/"),
    new WebProxy("http://111.111.11.1/", true));

proc.Initialize();             // load the page
string html = proc.InnerHtml;  // HTML of the page
string text = proc.InnerText;  // text of the page

// Get the table with index 0 as a DataTable
DataTable dt = proc.GetDataTableByTableIndex(0);

// Returns "Access and connect with thousands of
// Microsoft Certified companies to find products and services"
string value = proc.GetHtmlString("Microsoft Pinpoint", "</div></div>").RemoveHtmlTags();

// Images found on the page
List<ImageInfo> images = proc.Images;




The WebProxy argument can be omitted from the constructor.
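
Judging by the comment in the example above, the GetHtmlString(...).RemoveHtmlTags() chain takes the fragment that lies between two markers and then drops the markup. The SDK source is not shown here, so the following is only a minimal sketch of that idea under my own assumptions: MarkerExtraction, Between and StripTags are hypothetical names, the markers themselves are excluded from the result, and the tag stripping is a naive regular expression rather than a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration; this is NOT the SDK's implementation.
public static class MarkerExtraction
{
    // Returns the text strictly between the first occurrence of `start`
    // and the next occurrence of `end`; empty string if either is missing.
    public static string Between(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0) return string.Empty;
        from += start.Length;

        int to = source.IndexOf(end, from, StringComparison.Ordinal);
        if (to < 0) return string.Empty;

        return source.Substring(from, to - from);
    }

    // Naive tag stripper: removes anything that looks like <...> and trims the result.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty).Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string page = "<div><h2>Microsoft Pinpoint</h2><p>Find <b>products</b> and services</p></div></div>";

        // Fragment between the two markers, markup still included
        string fragment = MarkerExtraction.Between(page, "Microsoft Pinpoint", "</div></div>");

        // Prints "Find products and services"
        Console.WriteLine(MarkerExtraction.StripTags(fragment));
    }
}
```

A regular expression is enough for a sketch like this, but it will misbehave on scripts, comments and attribute values that contain angle brackets, so real pages call for a proper parser.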

To send a POST request, you need to use the following code:

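using System;
using System.Collections.Specialized;
using System.Net;

HtmlProcessor proc = new HtmlProcessor(
    new Uri("http://www.microsoft.com/"),
    new WebProxy("http://11.11.1.1:111/", true));

// Switch the request method to POST and attach the form parameters
proc.HttpMethod = HttpMethods.POST;
var parameters = new NameValueCollection();
parameters.Add("name", "value");
proc.PostParameters = parameters;

// Send the request
proc.Initialize();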




LinksExtractor class


The LinksExtractor class is designed to extract links.

Key features:
[Figure: class diagram of LinksExtractor]

Rules:
To add your own rule, you need to implement a simple interface:

public interface ICondition
{
    bool Satisfied(LinkInfo linkInfo, string value);
    bool Satisfied(string linkInfo, string value);
}
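
As an illustration, a custom rule might look like the sketch below. Only the ICondition interface is taken from the article. The LinkInfo class here is a stand-in with an assumed Href property (the real SDK type may look different), and HrefMustEndWithCondition is a hypothetical rule invented for this example.

```csharp
using System;

// Stand-in for the SDK's LinkInfo; the Href property is an assumption.
public class LinkInfo
{
    public string Href { get; set; }
}

// The interface exactly as given in the article.
public interface ICondition
{
    bool Satisfied(LinkInfo linkInfo, string value);
    bool Satisfied(string linkInfo, string value);
}

// Hypothetical custom rule: the link's href must end with the given value.
public class HrefMustEndWithCondition : ICondition
{
    public bool Satisfied(LinkInfo linkInfo, string value)
    {
        // Delegate to the string overload so both overloads behave the same way
        return linkInfo != null && Satisfied(linkInfo.Href, value);
    }

    public bool Satisfied(string linkInfo, string value)
    {
        return linkInfo != null
            && value != null
            && linkInfo.EndsWith(value, StringComparison.OrdinalIgnoreCase);
    }
}

public static class Program
{
    public static void Main()
    {
        ICondition rule = new HrefMustEndWithCondition();

        // Prints "True": the comparison ignores case
        Console.WriteLine(rule.Satisfied(new LinkInfo { Href = "http://example.com/report.PDF" }, ".pdf"));

        // Prints "False"
        Console.WriteLine(rule.Satisfied("http://example.com/index.html", ".pdf"));
    }
}
```

Assuming AddRule accepts any ICondition implementation, such a rule would presumably be registered in the same way as a built-in one, for example ext.AddRule(".pdf", new HrefMustEndWithCondition()).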


Example of use:

LinksExtractor ext = new LinksExtractor(new Uri("http://microsoft.com/"));

// Rule: the href must contain "microsoft"
ext.AddRule("microsoft", new HrefMustContainCondition());

// Extract no more than 10 links
ext.Maximum = 10;

// Also extract hidden links
ext.ExtractHidden = true;

// Run the extraction
ext.Extract();

// The extracted links
var links = ext.Links;




In the CTP version, the Maximum property is limited to 100.

A real-world example: How to get a list of sites in a given zone.

Other classes


You can read about the WebScreenshotExtractor class and the program that uses it here.

We will talk about EmailsExtractor, PhonesExtractor, UrlsExtractor, GuidExtractor, SEO and other features next time, but some examples can be found here.

Download Data Extracting SDK v1.0 CTP from the CodePlex website

A few words about real use


With the help of this SDK the following applications were developed:

Feedback


I would like to hear:


And finally, if you have data extraction tasks, please contact me :)

Thanks for your attention!

Source: https://habr.com/ru/post/68150/

