
Writing a parser in Node.js

Previously, the main parsing library was JSDOM, which was excessively heavy and noticeably slowed down the parsing process. But times have changed, and now there is cheerio. It does almost the same thing, drops everything unnecessary from the process, and implements the part of jQuery that is needed for parsing. Thanks to this, you can finally write a parser that does not bog down, without falling back on regexps for the sake of performance. It also copes with XML; you just need to call it with {xmlMode: true}. Below is how you can easily parse pages in Node.js.
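For instance, loading an XML document with cheerio looks like the usual HTML case plus the xmlMode option. A minimal sketch (the XML string and the selectors are made up purely for illustration):

var cheerio = require('cheerio');

// Hypothetical XML snippet, just for illustration.
var xml = '<catalog><book id="1">First</book><book id="2">Second</book></catalog>';

// xmlMode: true switches cheerio to its XML parsing rules.
var $ = cheerio.load(xml, { xmlMode: true });

console.log($('book').length);             // 2
console.log($('book').first().attr('id')); // "1"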

Technology

We will use Q to create deferreds and build an asynchronous queue, request to fetch the content, and cheerio for the parsing itself.
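If Q's deferred/promise pattern is unfamiliar, here is a minimal sketch of the idea (the function name and the timeout are assumptions for illustration, not part of the original article):

var Q = require('q');

// Wrap an asynchronous result in a promise via a deferred.
function delayedValue(value){
  var deferred = Q.defer();
  setTimeout(function(){
    deferred.resolve(value);   // fulfil the promise later
  }, 100);
  return deferred.promise;     // consumers only ever see the promise
}

delayedValue(42).then(function(result){
  console.log(result);         // 42
});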

Example in a vacuum №1

var request = require('request');
var cheerio = require('cheerio');

// url — the page to fetch
request(url, function(err, res, body){
  if(err){
    console.log(err);
  } else {
    var $ = cheerio.load(body);
    var cards = [];
    $('.card').each(function(){
      cards.push({
        title: $('.title', this).text(),
        url: $('a', this).attr('href')
      });
    });
  }
});


In this simple way you can parse a page.
But what if there is more than one page? If we solve this without using promises, we run into two problems. The first is growing the call stack; the second is growing memory consumption through duplicated scopes. The root of all evil is, of course, a recursive function, which does not suit us well for parsing; accordingly, we need to build an asynchronous queue without increasing the nesting level of scopes.

To do this, we split our program into 2 stages:
Stage 1: fetch a page with a paginator and determine the total number of pages.
Stage 2: create an asynchronous queue into which we hook our parsing function.

The function that will be executed in the asynchronous queue can be written in 2 ways.
The first: generate a sub-scope for each of the calls in advance (the code below needs some work before going to production):

// chain — a promise chain created earlier (e.g. with Q.fcall), l — the total number of pages
for(var i = 0; i < l; i++){
  chain = chain.then(asyncF.bind({page: i}));   // each call gets its own context object
}


Inside the asynchronous function, the page number must then be read from the context via this.page.
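A rough sketch of such a function, assuming it is called asyncF and fetches pages by number (both the name and the URL pattern are assumptions for illustration, not from the original article):

// Hypothetical queue worker for the bind() approach; the URL pattern is made up.
function asyncF(){
  var page = this.page;              // context object supplied by bind()
  var defer = Q.defer();
  request('/pager/' + page, function(err, res, body){
    if(err){
      defer.reject(err);
      return;
    }
    var $ = cheerio.load(body);
    // ...collect whatever data the page contains here...
    defer.resolve();
  });
  return defer.promise;
}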

The other way is to keep a single shared data flow and, inside the asynchronous functions, simply pass along a number that is incremented in the asynchronous function itself, as is done below:

Example in a vacuum №2

var request = require('request');
var cheerio = require('cheerio');
var Q = require('q');

// Stage 1: fetch the page with the paginator and find the total number of pages.
request('pager', function(err, res, body){
  var $ = cheerio.load(body);
  var pager = $('.pager');
  var limitPage = parseInt(pager.eq(pager.length - 1).text().trim(), 10);

  // Stage 2: the function that will be hooked into the asynchronous queue.
  function parsePage(page){
    var defer = Q.defer();
    request('/pager/' + page, function(err, res, body){
      if(page <= limitPage){
        defer.resolve(page + 1); // pass the next page number down the chain
      } else {
        defer.reject();          // no more pages, stop the chain
      }
    });
    return defer.promise;        // the promise is returned right away
  }

  var chain = Q.fcall(function(){
    return parsePage(1);
  });

  for(var i = 2; i < limitPage; i++){
    chain = chain.then(function(page){
      return parsePage(page);    // page arrives here from the previous resolve
    });
  }
});


UPDATE:

Problems with encodings when working with Node.js:

To work with an encoding that Node.js does not support out of the box, you need to receive the data as a Buffer (call request with encoding: null) and then convert it with the iconv library before parsing.

Encoding example

var request = require('request');
var Iconv = require('iconv').Iconv;

var fromEnc = 'cp1251';
var toEnc = 'utf-8';
var translator = new Iconv(fromEnc, toEnc);

request(
  {
    url: 'http://winrus.com/cpage_r.htm',
    encoding: null          // get the body as a raw Buffer
  },
  function(err, res, body){
    console.log(translator.convert(body).toString());
  }
);

Source: https://habr.com/ru/post/210166/

