Asynchronous web-mining using node.js

I would like to share my experience with web-mining tasks: collecting specific information from a given list of resources. Let me note right away that this is not an attempt to build your own "search engine"; completely different approaches are used for that. The purpose of web mining is to pull out one particular piece of information, for example contact data when a resource supports microformats in the form of "business cards" (vCard).

Now about the implementation: why node.js? I had no restrictions on technology; anything from C++ and Java/.NET to Perl/Python would have done. Here is why I chose node.js:

Asynchronous I/O. Other languages can do asynchrony too, sometimes very simply (F# has async blocks), but in node.js asynchrony comes out of the box and is the preferred way to perform operations.
The most familiar syntax with the fewest redundant constructs. This is a "holy war" item, of course, but JavaScript really is closer to what users of C/C++, Java, and C# know than F# or Python is.
An HTTP client and regular expressions out of the box, with no additional modules to install.
Execution speed. V8 does have a weak spot, context switching, but for this task it should not be a bottleneck; "linear" speed matters more, and that is exactly what V8 boasts (NB: make a benchmark to back this up with numbers).

Install node.js

Installation on my server (FreeBSD, amd64) went more than smoothly: "cd /usr/ports/www/node; make install" and node.js is ready for use.

For Windows, installing via Cygwin is the most accessible option. I did not find good instructions, although I did come across an implementation of node.js written purely in .NET.

For Ubuntu it also goes without any problems, and good instructions are available.

Next comes reading the rather pleasant manual. It looks really nice, but it covers only the basics: when I wanted my web miner to behave like most other classes and raise events, it turned out that the manual did not describe this at all. More on that later.

Page loader

Starting from the http.Client example and adding URL parsing, request construction, and waiting for the entire document to load, I ended up with this "class":

    var sys = require('sys'),
        http = require('http'),
        url = require('url'),
        events = require('events');

    // Downloads a page and emits a 'page' event with its body and parsed URL.
    var webDownloader = function () {
        events.EventEmitter.call(this);
        this.load = function (sourceUrl) {
            var src = url.parse(sourceUrl);
            // Default to port 80 when the URL does not name a port.
            var webClient = http.createClient(src.port == undefined ? 80 : src.port, src.hostname);
            // Request path: pathname plus the query string, if any.
            var get = src.pathname + (src.search == undefined ? '' : src.search);
            sys.log('loading ' + src.href);
            var request = webClient.request('GET', get, { 'host': src.hostname });
            request.end();
            var miner = this;
            request.on('response', function (response) {
                // console.log('STATUS: ' + response.statusCode);
                // console.log('HEADERS: ' + JSON.stringify(response.headers));
                response.setEncoding('utf8');
                var body = '';
                // Accumulate chunks until the whole document has arrived.
                response.on('data', function (chunk) {
                    body += chunk;
                });
                response.on('end', function () {
                    miner.emit('page', body, src);
                });
            });
        };
    };
    sys.inherits(webDownloader, events.EventEmitter);
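
The URL-handling step in the loader (default port, path plus query string) can be tried on its own. The following is only a sketch under stated assumptions: it uses the WHATWG URL class available as a global in current Node.js instead of the url.parse call from the listing, and buildRequestOptions is a hypothetical helper name that does not appear in the original code.

```javascript
// Sketch: turn a URL string into the pieces an HTTP GET request needs.
// buildRequestOptions is a hypothetical helper, not part of the original code.
function buildRequestOptions(sourceUrl) {
  var src = new URL(sourceUrl); // WHATWG URL, a global in current Node.js
  return {
    hostname: src.hostname,
    // URL.port is an empty string when the scheme's default port is used;
    // like the original, this sketch assumes plain http and falls back to 80.
    port: src.port === '' ? 80 : Number(src.port),
    // search is '' (not undefined) when the URL has no query string.
    path: src.pathname + src.search,
    headers: { host: src.hostname }
  };
}

var opts = buildRequestOptions('http://www.flickr.com/people/olostan/?x=1');
console.log(opts.hostname, opts.port, opts.path);
```

With the WHATWG class, the undefined checks from the listing become comparisons against an empty string, which is the main difference to watch for when porting.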

The interesting part here is how the class becomes an event source:

First, call the EventEmitter constructor inside our own constructor: events.EventEmitter.call(this);
"Inherit" the class from EventEmitter: sys.inherits(webDownloader, events.EventEmitter);
"Emit" an event with the emit method: miner.emit('page', body, src);

Working with EventEmitter is exactly the part that is still poorly documented, so I had to google a bit.

Now we can subscribe to the event raised once a page has fully loaded:

    var loader = new webDownloader();
    loader.on('page', vcardSearch);
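
The three EventEmitter steps and the subscription can also be seen in isolation, without any network access. This is a minimal sketch, not the author's code: FakeDownloader is a hypothetical stand-in that emits a stub body immediately, and it uses util.inherits, the current name for what the listing calls sys.inherits.

```javascript
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Hypothetical stand-in for webDownloader: it downloads nothing and simply
// emits a stub body, so the event wiring can be observed on its own.
function FakeDownloader() {
  EventEmitter.call(this); // step 1: run the EventEmitter constructor
}
util.inherits(FakeDownloader, EventEmitter); // step 2: inherit on() and emit()

FakeDownloader.prototype.load = function (sourceUrl) {
  // step 3: notify subscribers; a real loader would emit after the response ends
  this.emit('page', '<html>stub for ' + sourceUrl + '</html>', { href: sourceUrl });
};

var received = [];
var fakeLoader = new FakeDownloader();
fakeLoader.on('page', function (body, src) {
  received.push(src.href); // listeners run synchronously inside emit()
});
fakeLoader.load('http://example.com/');
console.log(received);
```

Methods are attached to the prototype after the inherits call here; the original listing instead assigns load inside the constructor, which works equally well for a single method.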

Search vCard data

Now for a less interesting function that simply pulls vCard data out of a page. I did not want to spend much time on a proper implementation, so I did it head-on: by searching for elements with the relevant classes.

There is nothing especially interesting here apart from the use of the Apricot module to parse the page (htmlparser alone would really have been enough, but Apricot was much quicker to set up). At first I tried to build a CSS selector for the desired elements and use Apricot's find function (which in turn uses Sizzle for the search), but as it turned out, a recursive traversal of all elements was faster.

The result was this function:

    // Scans a downloaded page for elements carrying vCard class names.
    // The vCard helper is defined elsewhere (it is not shown in this article).
    var vcardSearch = function (body, src) {
        sys.log('scanning ' + src.href);
        Apricot.parse(body, function (doc) {
            var vcardClasses = [
                // required
                'fn',
                'family-name', 'given-name', 'additional-name', 'honorific-prefix', 'honorific-suffix',
                'nickname',
                // optional
                'adr', 'contact',
                'email',
                'post-office-box', 'extended-address', 'street-address', 'locality', 'region', 'postal-code', 'country-name',
                'bday', 'logo', 'org', 'photo', 'tel'
            ];
            var vcard = new vCard();
            // Recursively visit an element and all of its descendants.
            var scanElement = function (el) {
                if (el == undefined) return;

                if (el.className != undefined && el.className != '') {
                    var classes = el.className.split(' ');
                    for (var n = 0; n < classes.length; n++) {
                        if (vcardClasses.indexOf(classes[n]) >= 0) {
                            // Strip any markup left in the element text.
                            var value = el.text.trim().replace(/<\/?[^>]+(>|$)/g, '');
                            if (value != '') vcard.Values[classes[n]] = value;
                        }
                    }
                }
                for (var i in el.childNodes) scanElement(el.childNodes[i]);
            };
            scanElement(doc.document.body);
            if (!vcard.isEmpty())
                sys.log('vCard = ' + vcard.toString());
            else
                sys.log('no vCard found on ' + src.href);
        });
    };
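
The recursive traversal itself does not depend on Apricot and can be tried on a plain object tree. The sketch below is an illustration under assumptions: the node shape ({ className, text, childNodes }), the shortened class list, and the names stripTags and collect are all hypothetical and chosen only to mirror the properties the function above reads.

```javascript
// Hypothetical, shortened class list; the full list is in the function above.
var wanted = ['fn', 'org', 'tel'];

// Same tag-stripping regular expression as in the original function.
function stripTags(text) {
  return text.replace(/<\/?[^>]+(>|$)/g, '').trim();
}

// Visit el and every descendant, recording the text of matching elements.
function collect(el, found) {
  if (!el) return found;
  var classes = (el.className || '').split(/\s+/);
  for (var n = 0; n < classes.length; n++) {
    if (wanted.indexOf(classes[n]) >= 0) {
      var value = stripTags(el.text || '');
      if (value !== '') found[classes[n]] = value;
    }
  }
  var children = el.childNodes || [];
  for (var i = 0; i < children.length; i++) collect(children[i], found);
  return found;
}

// Hand-built tree standing in for a parsed page.
var tree = {
  className: 'vcard',
  text: '',
  childNodes: [
    { className: 'fn n', text: '<b>Jane Doe</b>', childNodes: [] },
    { className: 'note', text: 'ignored', childNodes: [
      { className: 'tel', text: ' 555-0100 ', childNodes: [] }
    ] }
  ]
};
var result = collect(tree, {});
console.log(result);
```

Each element is visited exactly once and tested against a short list, which is one plausible reason a plain walk can beat a selector with many alternatives; the article does not give measurements, so treat that as a guess.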

Summary

Using the result is easy:

    loader.load('http://www.google.com/profiles/olostan');
    loader.load('http://www.flickr.com/people/olostan/');

I should say right away that this was never meant as a finished or even slightly serious product; it is a proof of concept and a way to get a feel for node.js.

Complete code (uploaded to Google Docs, may require a google account)

P.S. This is a repost of my post from the sandbox. I apologize if that is not appropriate, but I would be interested to hear comments. Thanks to Romachev for the invite. I do not have enough karma to post to the thematic blog.

Source: https://habr.com/ru/post/102840/
