
Writing a simple web page parser

Hello. My name is Serezha. I want to talk about how I wrote a very simple web spider.
Since this is a non-commercial project driven solely by my own enthusiasm, I followed these principles:

1. A minimum of necessary functions (scanning the web, saving what we need to a DB, a simple UI for access)
2. Zero financial costs:
- As a server I use a netbook, an Acer Aspire One KAV60 that I bought back in the day for my studies. It was quite budget-level even at the time of purchase (2008), and now its 1600 MHz Atom processor is not enough even for comfortable work in MS Office
- The internet is my wired home connection. Luckily, my IP address has not changed in half a year, so I did not have to order a static one
3. A minimum of time costs. The project was done in my free time after work.

As software, I used Node.js (with the jsdom, mysql, and cron modules) and a MySQL database.
There is a popular entertainment resource on the Russian internet (let's call it Picaba; I won't give a link, so it doesn't count as advertising). It consists of entertaining posts submitted by users, plus comments on them. The resource's moderators strictly (in my opinion, sometimes even too strictly) police the content of the comments. As a result of their work, instead of a comment you often see a placeholder saying it was deleted.

Our robot will scan the resource at regular intervals looking for new comments and add them to the database, so that you can always see what the moderators did not like.

We will need the following modules:

```javascript
var jsdom = require("jsdom");
var mysql = require('mysql');
var CronJob = require('cron').CronJob;
```
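
The code further down also uses a DBconnection object whose creation the original snippets do not show. Here is a minimal sketch of how it might be set up with the mysql module; the host, credentials, and database name are placeholder assumptions:

```javascript
// Hypothetical setup for the DBconnection object used below;
// host, credentials, and database name are placeholders.
var DBconnection = mysql.createConnection({
    host:     'localhost',
    user:     'spider',
    password: 'secret',
    database: 'pikabu'
});
DBconnection.connect(function(err){
    if (err) throw err;
});
```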

Our robot will be based on two functions:

```javascript
// request the front page and collect links to pages with comments
function get_links(){
    jsdom.env('http://pikabu.ru', function(err, window){
        if (err) return false;
        var page = window.document.getElementsByClassName('to-comments');
        // the elements come in pairs, so take every second one
        for (var i = 0; i < page.length; i += 2)
            get_comments(page[i].getAttribute('href'));
    });
}
```

With this function we request the main page of the resource and look for elements with the class "to-comments"; they contain links to the pages with comments. Since these elements come in pairs, we only need every second one.

The jsdom module helps us a lot in this function. It converts HTML code into a DOM tree, in which we can easily find the desired element.

As we can see, this function calls get_comments().

```javascript
function get_comments(link){
    jsdom.env(link, function(err, window){
        if (err) return false;
        var comment = window.document.getElementsByClassName('comment');
        for (var i = 0; i < comment.length; i++){
            var id = comment[i].getAttribute('data-id');
            var author = comment[i].getElementsByClassName('post_author')[0].textContent;
            // skip comments already removed by the moderators: their text is
            // a stub starting with "Комментарий" ("Comment deleted")
            var block = comment[i].getElementsByClassName('comment_text')[0].getElementsByTagName('span');
            if (block.length > 0 && block[0].textContent.substr(0, 11) == 'Комментарий'){
                console.log(block[0].baseURI + '#comment_' + id);
                continue;
            }
            var com = comment[i].getElementsByClassName('comment_text')[0].outerHTML
                .replace(/"/g, '&quot;')                           // escape double quotes so they don't break the SQL string
                .replace(/\n|\t/g, '')                             // strip newlines and tabs
                .replace('previews_gif_comm', 'big_size_comm_an'); // swap thumbnails for full-size images
            var query = 'INSERT IGNORE comments (id, user, comment) VALUES (' + id + ', "' + author + '", "' + com + '")';
            DBconnection.query(query, function(err, rows, fields){
                if (err) throw err;
            });
        }
        console.log(new Date() + ' DATA ADDED...');
    });
}
```

Here we also walk the tree and look for elements with the "comment" class, then pick out what we need from them: the comment id and the author. We weed out deleted comments, escape special characters, slightly alter the markup (replacing thumbnails with full-size images), and put all of this into the database. In the comments table the id field is unique, so MySQL itself ensures that there are no duplicate comments.
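
For completeness, here is a minimal sketch of what the comments table might look like. The column types are my assumptions; the article only requires that id be unique so that INSERT IGNORE silently skips duplicates:

```javascript
// Hypothetical table definition; the column types are assumptions,
// only the uniqueness of id is required for INSERT IGNORE to work.
DBconnection.query(
    'CREATE TABLE IF NOT EXISTS comments (' +
    '  id INT UNSIGNED NOT NULL PRIMARY KEY,' +
    '  user VARCHAR(64) NOT NULL,' +
    '  comment TEXT NOT NULL' +
    ')',
    function(err){ if (err) throw err; }
);
```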

It remains for us to start a timer that wakes the robot every 5 minutes. In Node.js this can be implemented with the cron module's CronJob, an analogue of the cron scheduler in Linux.

```javascript
var job = new CronJob('*/5 * * * *', function(){
    get_links();
});
job.start();
```
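
The robot now runs indefinitely. If you want it to shut down cleanly on Ctrl+C, you could stop the scheduler and close the database connection like this (my addition, not part of the original code):

```javascript
// Optional cleanup on Ctrl+C: stop the scheduler and close the
// MySQL connection before exiting (not in the original article).
process.on('SIGINT', function(){
    job.stop();
    DBconnection.end(function(){
        process.exit(0);
    });
});
```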

That's all for now. Our spider has learned to crawl the resource and save comments. If you are interested, I can write an article about the web interface for this robot, or about a Chrome plugin for it.

Source: https://habr.com/ru/post/309428/
