
Data acquisition, part 4

In the previous parts I outlined the process of collecting data from web sources. In this post I will show how to build a generic host that processes various sites using WatiN. I will also address multithreading when working with WatiN. The sources, as always, are here.

Generic WatiN host using MEF


Since running several WatiN-driven services at the same time is dangerous, we need to control the process with a service (host) that implements a plug-in architecture. To begin with, let's define the contract (an abstract base class) that WatiN-driven services will implement:

public abstract class WatinDataAcquisitionService : DataAcquisitionService
{
    /// <summary>
    /// This method must be implemented by any scraping service that needs to
    /// use WatiN.
    /// </summary>
    /// <param name="browser">A preinitialized <c>Browser</c> object
    /// that one can use for scraping.</param>
    /// <param name="log">The host's logger, so that the service can
    /// write to the host's log.</param>
    /// <remarks>Do not pass the <c>browser</c> object into other
    /// threads or asynchronous operations.</remarks>
    public abstract void AcquireData(Browser browser, ILog log);
}

Our contract has only one method, which does the scraping. The host passes the service an already initialized object of type Browser (it can be IE or FireFox) as well as a reference to the main service's logger, which lets the plugin log its progress through the main host.

To get all the available WatiN services, our host uses MEF, declaring that it wants to import every object of type WatinDataAcquisitionService:
[ImportMany(typeof(WatinDataAcquisitionService))]
public WatinDataAcquisitionService[] WatinServices { get; set; }

The available services are loaded when the host service itself is initialized. In our case, we simply pick up all the DLLs in the plugins subdirectory:

cat = new DirectoryCatalog("plugins");
cc = new CompositionContainer(cat);
cc.ComposeParts(this);

Our usual DoWork() method looks rather fancy this time. Here it is first:

private void DoWork()
{
    while (true)
    {
        log.InfoFormat("Found {0} WatiN services", WatinServices.Length);
        if (WatinServices.Length > 0)
            using (var browser = new IE())
            {
                browser.Visible = false;
                foreach (var s in WatinServices)
                {
                    using (var timer = new MyTimer(s.GetType().FullName, log))
                    {
                        // prevent errors from bleeding through
                        try
                        {
                            s.AcquireData(browser, log);
                        }
                        catch (Exception ex)
                        {
                            log.Error(
                                string.Format("WatiN service {0} threw an exception", s.GetType().FullName),
                                ex);
                        }
                    }
                }
            }
        // do some work, then
        Thread.Sleep(pollingFrequency);
    }
}

There are a few things happening here: measuring time, running the services, and logging errors in case their authors let exceptions break through the meningeal barrier (yes, you should watch House). Since the services are called sequentially, they all use the same browser without interfering with each other.
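
The MyTimer class used above is not shown in this post. As a rough illustration of the idea (time a block and report it when the using scope ends), here is a minimal sketch. It is an assumption, not the original class: in particular, it takes a reporting delegate instead of the ILog that the host passes, so that the sketch needs no logging library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the MyTimer used by the host. The original takes
// an ILog; this sketch takes a delegate so that it has no external dependencies.
public sealed class MyTimer : IDisposable
{
    private readonly string name;
    private readonly Action<string> report;
    private readonly Stopwatch stopwatch = Stopwatch.StartNew();

    public MyTimer(string name, Action<string> report)
    {
        this.name = name;
        this.report = report;
    }

    // Runs when the using block ends, whether or not an exception was thrown.
    public void Dispose()
    {
        stopwatch.Stop();
        report(string.Format("{0} took {1} ms", name, stopwatch.ElapsedMilliseconds));
    }
}
```

With this version, the call in the host would look like `using (new MyTimer(s.GetType().FullName, msg => log.Info(msg)))`.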

As for our plugin, everything is very simple: it is a DLL containing one or more classes marked with the Export attribute. Like this:

[Export(typeof(WatinDataAcquisitionService))]
public class PokemonService : WatinDataAcquisitionService
{
    public override void AcquireData(Browser browser, ILog log)
    {
        log.Info("Pokemon service running");
        browser.GoTo("http://www.pokemon.com");
        var doc = new HtmlDocument();
        doc.LoadHtml(browser.Body.OuterHtml);
        var h3 = doc.DocumentNode.SelectNodes("//h3").First();
        log.Info(h3.InnerText);
    }
}

The beauty of MEF is that the resulting DLL can simply be copied into the plugins folder and everything will work. Danger, Will Robinson: the plugin's dependencies must also be copied to this folder, or merged into the plugin with ILMerge (the latter is preferable).

Seriously, what about multithreading?


In fact, multithreaded use of WatiN is certainly possible. After all, we can open several copies of IE at the same time, right? But it is not that simple.

First, you cannot open, say, 100 copies of IE at once. What exactly breaks is unclear (COM exceptions are so informative...), but problems are guaranteed. On the other hand, you can open, for example, 2*Environment.ProcessorCount copies and everything more or less works.

The second problem is that if you use, say, the TPL, you need your own StaTaskScheduler that creates STA threads instead of MTA ones. Fortunately, such a solution already exists online (on MSDN), and I have included it in the examples. Here is how you can run 4 copies of IE at a time:

var po = new ParallelOptions();
po.TaskScheduler = new StaTaskScheduler(4);
Parallel.For(0, 100, po, x =>
{
    using (var browser = new IE("http://news.bbc.co.uk"))
    {
        browser.Visible = false;
        var doc = new HtmlDocument();
        doc.LoadHtml(browser.Body.OuterHtml);
        var h3 = doc.DocumentNode.SelectNodes("//h3").First();
        Console.WriteLine(h3.InnerText);
    }
});
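
The StaTaskScheduler itself is not listed here, but the mechanism it relies on is small: a thread's apartment state has to be set to STA before the thread starts. The following is only a sketch of that mechanism, not the MSDN scheduler. StaRunner is a hypothetical name, and the code assumes Windows, where COM apartments exist.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs a function on a dedicated STA thread and
// exposes the result as a Task. Not the StaTaskScheduler from MSDN.
public static class StaRunner
{
    public static Task<T> Run<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>();
        var thread = new Thread(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        // The apartment state must be set before Start() is called.
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
```

A scheduler built on this idea keeps a fixed number of such threads alive and feeds tasks to them, instead of creating a new thread per call as this sketch does.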

By analogy with this approach, our host service could open not just one browser but keep a whole pool of, say, 10 browsers and hand them out selectively to the services it manages.
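
One possible shape for such a pool is sketched below. It is an assumption, not code from the sources: StaWorkerPool is a hypothetical class, and the design choice is that each STA worker thread creates and owns its own browser and pulls jobs from a shared queue, so that a browser object never crosses threads (in line with the remark on AcquireData). The resource type is generic so that the sketch compiles without WatiN.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper, not part of WatiN or the original host.
public sealed class StaWorkerPool<TResource> : IDisposable
    where TResource : IDisposable
{
    private readonly BlockingCollection<Action<TResource>> queue =
        new BlockingCollection<Action<TResource>>();
    private readonly List<Thread> workers = new List<Thread>();

    public StaWorkerPool(int size, Func<TResource> factory)
    {
        for (int i = 0; i < size; i++)
        {
            var worker = new Thread(() =>
            {
                // The resource is created, used and disposed on this thread only.
                using (var resource = factory())
                {
                    foreach (var job in queue.GetConsumingEnumerable())
                    {
                        // One failing job must not kill the worker.
                        try { job(resource); }
                        catch (Exception ex) { Console.Error.WriteLine(ex); }
                    }
                }
            });
            worker.SetApartmentState(ApartmentState.STA); // Windows only
            worker.Start();
            workers.Add(worker);
        }
    }

    public void Enqueue(Action<TResource> job)
    {
        queue.Add(job);
    }

    // Stops accepting jobs and waits until the queued ones have finished.
    public void Dispose()
    {
        queue.CompleteAdding();
        foreach (var worker in workers) worker.Join();
        queue.Dispose();
    }
}
```

In the host this might be used as `new StaWorkerPool<IE>(10, () => new IE())`, with each service queued as `pool.Enqueue(b => s.AcquireData(b, log))`. Whether 10 browsers is a safe number on a given machine still has to be checked by experiment.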

Source: https://habr.com/ru/post/94960/

