Data Extracting SDK is written for .NET Framework 3.5 and contains tools for extracting and analyzing data from text files and web resources. Following the results of the survey, I am publishing the first version, Data Extracting SDK CTP (Community Technical Preview), for everyone to try.
Key features:
HTML processing: downloading and parsing HTML
DOM analysis: extracting links, images, tables
link extraction with filters, the ability to write your own filter, and deep site analysis
extraction of email addresses, phone numbers, URLs, etc.
site content analysis (number of elements, word density)
tools for SEO analysis
Let's take a closer look at the main features of the SDK.
How to use Data Extracting SDK
Areas of use:
programs that collect required information
development of analytical services, site and content analysis
programs for building databases and lists
competitor analysis software
SEO programs
automation of work with web resources
crawler programs
HtmlProcessor and ContentAnalyzer classes
The HtmlProcessor class is designed for loading and processing HTML.
Key features:
proxy support
UserAgent support
extraction of titles, meta tags, images, links, keywords, etc.
working with tables (search, filters)
support for GET and POST requests
The ContentAnalyzer class extends HtmlProcessor and contains tools for statistical analysis of content; a usage sketch follows the HtmlProcessor examples below.
The class diagram is presented below:
An example of working with HtmlProcessor:
HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"),
    new WebProxy("http://111.111.11.1/", true));
proc.Initialize();
string html = proc.InnerHtml;   // full HTML of the page
string text = proc.InnerText;   // page text without tags
// "Access and connect with thousands of
// Microsoft Certified companies to find products and services"
string value = proc.GetHtmlString("Microsoft Pinpoint", "</div></div>").RemoveHtmlTags();
List<ImageInfo> images = proc.Images;   // images found on the page
The WebProxy argument can be omitted from the constructor.
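For example, loading the same page without a proxy is just a shorter call (a minimal sketch that reuses only the constructor overload implied above and the members already shown):

HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"));
proc.Initialize();
string text = proc.InnerText; // page text without HTML tags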
To send a POST request, you need to use the following code:
HtmlProcessor proc = new HtmlProcessor(new Uri("http://www.microsoft.com/"),
    new WebProxy("http://11.11.1.1:111/", true));
proc.HttpMethod = HttpMethods.POST;
var parameters = new NameValueCollection();
parameters.Add("name", "value");
proc.PostParameters = parameters;
proc.Initialize();
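ContentAnalyzer is used in the same way, but its members are not listed in this post, so the snippet below is only a rough sketch: the Uri constructor, GetWordDensity and ElementsCount are assumed names standing in for the statistical tools mentioned above, not the documented API.

// hypothetical sketch: member names below are assumptions
ContentAnalyzer analyzer = new ContentAnalyzer(new Uri("http://www.microsoft.com/"));
analyzer.Initialize(); // inherited from HtmlProcessor
double density = analyzer.GetWordDensity("Microsoft"); // assumed: density of the word on the page
int elementsCount = analyzer.ElementsCount;            // assumed: number of HTML elements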
The LinksExtractor class
The LinksExtractor class is designed for extracting links from web pages; a usage sketch follows the feature list below.
Key features:
extracting links from a given URL
proxy support
ability to extract visible and invisible links
deep (recursive) analysis of pages
limiting the maximum number of extracted links
support for filters and extraction rules
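This post does not include a LinksExtractor snippet, so the code below is only a hedged sketch of the feature list above; the constructor, the Depth, MaxLinksCount and Links members, and the LinkInfo type are assumed names, not the confirmed API.

// hypothetical sketch: member names below are assumptions
LinksExtractor extractor = new LinksExtractor(new Uri("http://www.microsoft.com/"));
extractor.Depth = 2;            // assumed: follow links two levels deep
extractor.MaxLinksCount = 100;  // assumed: cap on the number of extracted links
extractor.Initialize();
foreach (LinkInfo link in extractor.Links)
{
    Console.WriteLine(link.Href);
}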
Class diagram:
Rules:
TextMustContainCondition - the link text must contain a given value
TextMustNotContainCondition - the link text must not contain a given value
SameDomainCondition - links must be in the same domain as the page (internal links only)
LinkIdMustContainCondition - the link id must contain a given value
LinkIdMustNotContainCondition - the link id must not contain a given value
HrefMustContainCondition - the link's href must contain a given value
HrefMustNotContainCondition - the link's href must not contain a given value
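How the rules are attached to the extractor is not shown here either; continuing the sketch above, the Conditions collection and the condition constructor argument below are assumptions rather than the documented API.

// hypothetical: assumes LinksExtractor exposes a Conditions collection
extractor.Conditions.Add(new SameDomainCondition());                // internal links only
extractor.Conditions.Add(new TextMustContainCondition("download")); // assumed constructor argument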
To add your own rule, you need to implement a simple interface:
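The interface itself is not reproduced in this post, so the sketch below invents a minimal single-method contract; the names ILinkCondition, IsSatisfied and LinkInfo are illustrative assumptions, not the SDK's actual definitions.

// hypothetical sketch: interface and member names are assumptions
public class HrefMustEndWithCondition : ILinkCondition
{
    private readonly string suffix;

    public HrefMustEndWithCondition(string suffix)
    {
        this.suffix = suffix;
    }

    // return true when the link passes the rule
    public bool IsSatisfied(LinkInfo link)
    {
        return link.Href != null
            && link.Href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase);
    }
}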
You can read about the WebScreenshotExtractor class and the program that uses it here.
We will talk about EmailsExtractor, PhonesExtractor, UrlsExtractor, GuidExtractor, SEO and other features next time, but some examples can be found here.