One of the advantages of the general drop in the cost of hardware and Internet access is that collecting information from various online sources costs almost nothing and can be done without much trouble. The task of obtaining and processing large amounts of data is commercially attractive: customers regularly ask for websites to be read ("scraped"), usually under the label of "social media analysis". And the work itself is quite interesting, at least compared to the routine development of sites, reports and so on.
In this article I will start a story about how data collection and processing can be implemented on the .Net platform. It would be interesting to hear how the same thing is done on the Java stack, so if someone wants to join this article as a collaborator, you are welcome.
All sources are here:
http://bitbucket.org/nesteruk/datagatheringdemos

So, we have what is probably the vaguest of possible task statements: obtaining, processing and storing data. To get a working system, we need to know:
- Where the data lives and how to access it properly
- How to process the data so that we keep only what we need
- Where and how to store the data

Let's look at the data sources from which you need to receive information:
- Forums
- Twitter
- Blogs
- News sites
- Catalogs, Listings
- Public Web Services
- Application Software
I want to emphasize right away that the web page is not the only source of data. Working with web services or, say, the API of a social platform is a fairly well-understood task that does not require much effort; parsing HTML is much harder. And HTML is not the limit: sometimes you have to parse JavaScript, or even visual information from images (for example, to get past a CAPTCHA).
Another problem is that content is sometimes loaded dynamically via AJAX, which forces you to keep track of page state so that you grab the content exactly when it becomes available.
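One common way of coping with AJAX-loaded content is to poll until a probe succeeds. The helper below is a minimal sketch of that idea; the probe itself (for example, re-reading the browser's body HTML and checking for the expected element) is an assumption for illustration, not something from the libraries discussed here:

```csharp
using System;
using System.Threading;

// Minimal sketch: poll a probe until it succeeds or a timeout expires.
static class DynamicContentWaiter
{
    public static bool WaitUntil(Func<bool> probe, TimeSpan timeout, TimeSpan interval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (probe())
                return true; // the content has appeared
            Thread.Sleep(interval);
        }
        return false; // gave up waiting
    }

    static void Main()
    {
        // Toy probe: succeeds on the third attempt. In real use the probe
        // would re-parse the page and look for the AJAX-delivered element.
        int attempts = 0;
        bool ok = WaitUntil(() => ++attempts >= 3,
                            TimeSpan.FromSeconds(5),
                            TimeSpan.FromMilliseconds(50));
        Console.WriteLine(ok + " after " + attempts + " attempts");
    }
}
```

The same loop works regardless of how the page is fetched, as long as the probe re-reads the current state of the page on each call.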

Data processing is the most labour-intensive and (from a potential customer's point of view) the most expensive operation. It may seem that plain HTML should be easy to handle with existing tools, but in practice it is not. Firstly, HTML is in most cases not valid XHTML; in other words, calling XElement.Parse() on it simply throws an exception. So at the very least you need to be able to "fix up" badly written HTML.
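To illustrate the point, here is a small self-contained sketch showing that a strict XML parser rejects the kind of markup found on real pages (the sample fragments are made up for illustration):

```csharp
using System;
using System.Xml.Linq;

class StrictParsingDemo
{
    // Returns true if the fragment parses as strict XML.
    static bool ParsesAsXml(string fragment)
    {
        try { XElement.Parse(fragment); return true; }
        catch (System.Xml.XmlException) { return false; }
    }

    static void Main()
    {
        // Unclosed <li> and <br> tags are perfectly normal in real-world
        // HTML, but fatal for a strict XML parser.
        Console.WriteLine(ParsesAsXml("<ul><li>one<li>two<br></ul>")); // False
        Console.WriteLine(ParsesAsXml("<ul><li>one</li></ul>"));       // True
        // The same malformed markup is accepted without complaint by
        // new HtmlDocument().LoadHtml(...) from the HTML Agility Pack.
    }
}
```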
Even with well-formed data you will still have plenty of problems: any reasonably complex web page is a projection of the site owner's multidimensional database structure onto a one-dimensional space. Restoring the relationships and dependencies is therefore a necessary step before the extracted information can be stored in a relational database.
We should not forget more "down to earth" processing either, that is, transformations or arbitrary actions on the data obtained. For example, given an IP address you may want to know its geographic location or whether a web server is reachable at that address, which requires additional requests. Or, say, every time new data arrives you need to recompute a moving average (streaming OLAP).
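A moving average of the kind just mentioned is easy to maintain incrementally. The sketch below (class and window size are illustrative choices, not from the original article) updates the running sum in O(1) per observation, so the statistic is always current as new data streams in:

```csharp
using System;
using System.Collections.Generic;

// Streaming moving average: each new observation updates the window sum
// in constant time instead of re-summing the whole window.
class MovingAverage
{
    private readonly int windowSize;
    private readonly Queue<double> window = new Queue<double>();
    private double sum;

    public MovingAverage(int windowSize) { this.windowSize = windowSize; }

    public double Add(double value)
    {
        window.Enqueue(value);
        sum += value;
        if (window.Count > windowSize)
            sum -= window.Dequeue(); // evict the oldest observation
        return sum / window.Count;
    }
}

class MovingAverageDemo
{
    static void Main()
    {
        var ma = new MovingAverage(3);
        foreach (var v in new[] { 1.0, 2.0, 3.0, 4.0 })
            Console.WriteLine(ma.Add(v)); // 1, 1.5, 2, 3
    }
}
```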

After the data has been received, it needs to be stored somewhere. There are many storage options: serialization, text files, object-oriented and document-oriented databases, and of course relational databases. In commercial work the choice of storage usually depends either on the customer ("we want MySQL") or on the customer's budget. In .Net development the default database is SQL Server Express. If you are building a repository for yourself, you can use whatever you like, be it MongoDB, db4o or, say, SQL Server 2008 R2 Datacenter Edition.
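The very first option on that list, plain serialization, is often enough for small crawls. A minimal sketch with the standard XmlSerializer follows; the record shape (ScrapedItem) and file name are hypothetical examples:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// A hypothetical record for one scraped page.
public class ScrapedItem
{
    public string Url { get; set; }
    public string Title { get; set; }
    public DateTime FetchedAt { get; set; }
}

class SerializationStorageDemo
{
    static void Main()
    {
        var item = new ScrapedItem
        {
            Url = "http://example.com",
            Title = "Hello",
            FetchedAt = DateTime.UtcNow
        };

        var serializer = new XmlSerializer(typeof(ScrapedItem));

        // Write the record to disk...
        using (var writer = new StreamWriter("item.xml"))
            serializer.Serialize(writer, item);

        // ...and read it back.
        using (var reader = new StreamReader("item.xml"))
        {
            var loaded = (ScrapedItem)serializer.Deserialize(reader);
            Console.WriteLine(loaded.Title); // Hello
        }
    }
}
```

This obviously does not scale to large data sets, but it gets a prototype storing results within minutes.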
In most cases the data warehouse does not need to be anything special, since users simply export the data into Excel (or SPSS, SAS, etc.) and then analyse it with familiar tools. Options like SSAS (SQL Server Analysis Services) are used much less often (not least because of the minimum price tag of $7,500), but they are also worth knowing about.

Let's look at the minimum piece of code that will help us download and “parse” the page. For these tasks, we will use two packages:
- WatiN is a library for testing web interfaces. It is good for automating button clicks, selecting items from lists and similar things. WatiN also exposes an object model of the captured page, but I would not use it. The reason is the same as usual: WatiN is an unstable and rather capricious library that should be used with care (and only in 32-bit mode!) purely to drive the browser.
- HTML Agility Pack is a library for parsing HTML. The HTML itself can be taken from WatiN or downloaded separately, and even if it is badly formed, the Agility Pack will let you search and select from it using XPath.
Here is a minimal example of how you can use these two frameworks together to get a page from the site:
[STAThread]
static void Main()
{
    using (var browser = new IE("http://www.pokemon.com"))
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(browser.Body.OuterHtml);
        var h3 = doc.DocumentNode.SelectNodes("//h3").First();
        Console.WriteLine(h3.InnerText);
    }
    Console.ReadKey();
}
In the example above we fetched the page through WatiN, loaded the page body into the HTML Agility Pack, found the first H3 element and wrote its contents to the console.

It is probably obvious to you that writing data to storage is not done from a console application. In most cases a Windows service is used for this. And what the service does is, in most cases, polling: regularly downloading the resource and updating our picture of it. Downloads typically happen every N minutes/hours/days.
public partial class PollingService : ServiceBase
{
    private static readonly ILog log =
        LogManager.GetLogger(typeof(PollingService));
    private readonly Thread workerThread;

    public PollingService()
    {
        InitializeComponent();
        workerThread = new Thread(DoWork);
        workerThread.SetApartmentState(ApartmentState.STA);
    }

    protected override void OnStart(string[] args)
    {
        workerThread.Start();
    }

    protected override void OnStop()
    {
        workerThread.Abort();
    }

    private static void DoWork()
    {
        while (true)
        {
            log.Info("Doing work…");
            // do some work, then
            Thread.Sleep(1000);
        }
    }
}
For the service to behave well, a few more useful features are needed. First, it is useful to make the service startable from the console; this helps with debugging.
var service = new PollingService();
ServiceBase[] servicesToRun = new ServiceBase[] { service };

if (Environment.UserInteractive)
{
    // Note: ServiceBase.OnStart() is protected, so PollingService needs
    // to expose a public Start() wrapper for interactive use.
    Console.CancelKeyPress += (x, y) => service.Stop();
    service.Start();
    Console.WriteLine("Running service, press a key to stop");
    Console.ReadKey();
    service.Stop();
    Console.WriteLine("Service stopped. Goodbye.");
}
else
{
    ServiceBase.Run(servicesToRun);
}
Another useful feature is self-registration, so that instead of using installutil you can install the service with myservice /i. A separate class takes care of this…
class ServiceInstallerUtility
{
    private static readonly ILog log =
        LogManager.GetLogger(typeof(Program));
    private static readonly string exePath =
        Assembly.GetExecutingAssembly().Location;

    public static bool Install()
    {
        try { ManagedInstallerClass.InstallHelper(new[] { exePath }); }
        catch { return false; }
        return true;
    }

    public static bool Uninstall()
    {
        try { ManagedInstallerClass.InstallHelper(new[] { "/u", exePath }); }
        catch { return false; }
        return true;
    }
}
The installer class uses the little-known System.Configuration.Install assembly. It is invoked directly from Main():
if (args != null && args.Length == 1 && args[0].Length > 1
    && (args[0][0] == '-' || args[0][0] == '/'))
{
    switch (args[0].Substring(1).ToLower())
    {
        case "install":
        case "i":
            if (!ServiceInstallerUtility.Install())
                Console.WriteLine("Failed to install service");
            break;
        case "uninstall":
        case "u":
            if (!ServiceInstallerUtility.Uninstall())
                Console.WriteLine("Failed to uninstall service");
            break;
        default:
            Console.WriteLine("Unrecognized parameters.");
            break;
    }
}
Well, the last feature is of course logging. I use the log4net library, and for writing logs to the console there is a very tasty appender called ColoredConsoleAppender. The logging process itself is straightforward.
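For completeness, here is a sketch of an app.config fragment wiring up ColoredConsoleAppender; the level-to-color mappings and pattern are illustrative choices, so adjust them to taste:

```xml
<log4net>
  <appender name="ColoredConsoleAppender"
            type="log4net.Appender.ColoredConsoleAppender">
    <!-- Map log levels to console colors -->
    <mapping>
      <level value="ERROR" />
      <foreColor value="Red, HighIntensity" />
    </mapping>
    <mapping>
      <level value="INFO" />
      <foreColor value="Green" />
    </mapping>
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%date %-5level %message%newline" />
    </layout>
  </appender>
  <root>
    <level value="INFO" />
    <appender-ref ref="ColoredConsoleAppender" />
  </root>
</log4net>
```

With this in place, log.Info("Doing work…") from the polling loop above shows up in green in the console.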

That is enough information for a first installment. To finish, let me recall a few simple rules:
- Running IE requires a single-threaded apartment (STA); personally I use Firefox - I like Firebug
- WatiN must run in a 32-bit (x86) process
- The polling shown above is not ideal, because it ignores the fact that WatiN itself is sluggish and HTML parsing is also a slow operation
Speaking of which: instead of a service you could in principle build an EXE and run it through a scheduler, but that is somehow untidy.
Thanks for your attention. To be continued :)