Writing a Telegram bot that parses job vacancies in JS

Building bots for Telegram is becoming increasingly popular and attracts programmers who want to try their hand at it. Sooner or later everyone has an idea or a task that a purpose-built bot could solve. For me, as a JS programmer, one such pressing task is monitoring the job market in the areas I care about.

One of the most popular languages for building bots, however, is Python, which offers a huge number of good libraries for processing and parsing text from all kinds of sources. I wanted to do the same in JavaScript, one of my favorite languages.

Task

The main task is to create a detailed job feed with tagging and clean visual markup. It can be divided into separate subtasks:

interaction with the Telegram API;
parsing RSS feeds of job sites;
parsing a single job;
thematic tagging;
visual formatting of the information;
duplication prevention.

At first I considered using a generic ready-made bot such as @TheFeedReaderBot. But a closer look showed that it had no tagging at all and that the options for customizing how content is displayed were severely limited. Fortunately, modern JavaScript offers many libraries that help solve these problems. But first things first.

Bot skeleton

It would of course be possible to talk to the Telegram REST API directly, but a ready-made solution takes less effort. I therefore chose the slimbot npm package, which is referenced in the official bot tutorials. Although we will only send messages, the package makes life much easier by letting us wrap the bot in a small internal API of our own:

    const Slimbot = require('slimbot');
    const config = require('./config.json');

    const bot = new Slimbot(config.TELEGRAM_API_KEY);
    bot.startPolling();

    function logMessageToAdmin(message, type = 'Error') {
        bot.sendMessage(config.ADMIN_USER, `<b>${type}</b>\n<code>${message}</code>`, {
            parse_mode: 'HTML'
        });
    }

    function postVacancy(message) {
        bot.sendMessage(config.TARGET_CHANNEL, message, {
            parse_mode: 'HTML',
            disable_web_page_preview: true,
            disable_notification: true
        });
    }

    module.exports = { postVacancy, logMessageToAdmin };
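One thing to keep in mind with parse_mode 'HTML': Telegram expects the characters <, > and & to be escaped when they are not part of a tag or entity, and an error message can easily contain them. A minimal sketch of escaping before interpolation follows; escapeHtml and formatAdminMessage are hypothetical helper names, not part of slimbot or of the original bot.

```javascript
// Hypothetical helper: escape the three characters Telegram's HTML mode treats specially.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so the entities added below stay intact
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Builds the same markup as logMessageToAdmin, but with escaped content.
function formatAdminMessage(message, type = 'Error') {
  return `<b>${escapeHtml(type)}</b>\n<code>${escapeHtml(message)}</code>`;
}

console.log(formatAdminMessage('a < b & c'));
```

The resulting string could then be passed to bot.sendMessage in place of the raw template.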

A plain setInterval serves as the scheduler, feed-read parses the RSS feeds, and the vacancy sources are Moi Krug (My Circle) and hh.ru.

    const feed = require('feed-read');
    const config = require('./config.json');
    const HhAdapter = require('./adapters/hh');
    const MoikrugAdapter = require('./adapters/moikrug');
    const bot = require('./bot');
    const { FeedItemModel } = require('./lib/models');

    function processFeed(articles, adapter) {
        articles.forEach(article => {
            if (adapter.isValid(article)) {
                const key = adapter.getKey(article);
                // A rejected save (for example, a duplicate key) is ignored,
                // so a vacancy that is already stored is not posted again.
                new FeedItemModel({ key, data: article }).save().then(
                    model => adapter.parseItem(article).then(bot.postVacancy),
                    () => {}
                );
            }
        });
    }

    // Poll both feeds at a fixed interval.
    setInterval(() => {
        feed(config.HH_FEED, function (err, articles) {
            if (err) {
                bot.logMessageToAdmin(err);
                return;
            }
            processFeed(articles, HhAdapter);
        });
        feed(config.MOIKRUG_FEED, function (err, articles) {
            if (err) {
                bot.logMessageToAdmin(err);
                return;
            }
            processFeed(articles, MoikrugAdapter);
        });
    }, config.REQUEST_PERIOD_TIME);
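The two feed callbacks above differ only in the adapter they pass on, so the shared error handling could be factored out. The following is a sketch under that assumption; makeFeedHandler is a hypothetical helper, and the stand-in functions replace processFeed and bot.logMessageToAdmin so the example runs without the real modules.

```javascript
// Hypothetical helper: builds the (err, articles) callback once per source.
function makeFeedHandler(adapter, processFeed, logError) {
  return function handleFeed(err, articles) {
    if (err) {
      logError(err);
      return;
    }
    processFeed(articles, adapter);
  };
}

// Stand-ins that only record what they were called with.
const processed = [];
const errors = [];
const handler = makeFeedHandler(
  { name: 'stub' },
  (articles, adapter) => processed.push({ count: articles.length, adapter: adapter.name }),
  err => errors.push(err.message)
);

handler(null, [{ link: 'https://example.com/1' }]); // success path
handler(new Error('feed unavailable'));             // error path
console.log(processed, errors);
```

With such a helper, each source would need a single line inside the interval callback, for example feed(config.HH_FEED, makeFeedHandler(HhAdapter, processFeed, bot.logMessageToAdmin)).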

Parsing a single job

Because the vacancy pages are structured differently on each source site, the parsing is implemented differently for each one. Adapters that expose a unified interface hide this difference. For working with the DOM on the server, the jsdom library turned out to be a good fit: it supports the standard operations we use heavily, such as finding an element by CSS selector and reading an element's contents.
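The unified interface consists of three functions: getKey, isValid and parseItem. A minimal sketch of what the calling code relies on is given here; implementsAdapter and stubAdapter are hypothetical names introduced only for illustration.

```javascript
// The three functions every adapter in the article exports.
const REQUIRED_METHODS = ['getKey', 'isValid', 'parseItem'];

// Hypothetical check that an object can be used wherever an adapter is expected.
function implementsAdapter(candidate) {
  return candidate != null &&
    REQUIRED_METHODS.every(name => typeof candidate[name] === 'function');
}

// A stub adapter: no network access, it just formats the feed item itself.
const stubAdapter = {
  getKey: item => item.link,
  isValid: () => true,
  parseItem: item => Promise.resolve(`<b>${item.title}</b>\n${item.link}`)
};

console.log(implementsAdapter(stubAdapter));
console.log(implementsAdapter({ getKey: item => item.link }));
```

A stub like this can also stand in for a real adapter when testing the feed-processing loop without hitting the job sites.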

MoikrugAdapter

    const request = require('superagent');
    const jsdom = require('jsdom');
    const { JSDOM } = jsdom;
    const { getTags } = require('../lib/tagger');
    const { getJobType } = require('../lib/jobType');
    const { render } = require('../lib/render');

    function parseItem(item) {
        return new Promise((resolve, reject) => {
            request
                .get(item.link)
                .end(function (err, res) {
                    if (err) {
                        console.log(err);
                        reject(err);
                        return;
                    }

                    const dom = new JSDOM(res.text);
                    const element = dom.window.document.querySelector('.vacancy_description');
                    const salaryElem = dom.window.document.querySelector('.footer_meta .salary');
                    const salary = salaryElem ? salaryElem.textContent : ' .';
                    const locationElem = dom.window.document.querySelector('.footer_meta .location');
                    const location = locationElem && locationElem.textContent;
                    const title = dom.window.document.querySelector('.company_name').textContent;
                    const titleFooter = dom.window.document.querySelector('.footer_meta').textContent;
                    const pureContent = element.textContent;

                    resolve(render({
                        tags: getTags(pureContent),
                        salary: `: ${salary}`,
                        location,
                        title,
                        link: item.link,
                        description: element.innerHTML,
                        jobType: getJobType(titleFooter),
                        important: Array.from(element.querySelectorAll('strong')).map(e => e.textContent)
                    }));
                });
        });
    }

    function getKey(item) {
        return item.link;
    }

    function isValid() {
        return true;
    }

    module.exports = { getKey, isValid, parseItem };

HhAdapter

    const request = require('superagent');
    const jsdom = require('jsdom');
    const { JSDOM } = jsdom;
    const { getTags } = require('../lib/tagger');
    const { getJobType } = require('../lib/jobType');
    const { render } = require('../lib/render');

    function parseItem(item) {
        // The RSS item content is a sequence of paragraphs: title, date, region, salary.
        // Only region and salary are taken from the feed; the title comes from the page.
        const parts = item.content.split(/\n<p>|<\/p><p>|<\/p>\n/).filter(i => i);
        const [, , region, salary] = parts;

        return new Promise((resolve, reject) => {
            request
                .get(item.link)
                .end(function (err, res) {
                    if (err) {
                        console.log(err);
                        reject(err);
                        return;
                    }

                    const dom = new JSDOM(res.text);
                    const element = dom.window.document.querySelector('.b-vacancy-desc-wrapper');
                    const title = dom.window.document.querySelector('.companyname').textContent;
                    const pureContent = element.textContent;
                    const tags = getTags(pureContent);

                    resolve(render({
                        title,
                        location: region.split(': ')[1] || region,
                        salary: `: ${salary.split(': ')[1] || salary}`,
                        tags,
                        description: element.innerHTML,
                        link: item.link,
                        jobType: getJobType(pureContent),
                        important: Array.from(element.querySelectorAll('strong')).map(e => e.textContent)
                    }));
                });
        });
    }

    function getKey(item) {
        return item.link;
    }

    function isValid() {
        return true;
    }

    module.exports = { getKey, isValid, parseItem };
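The handling of the RSS item content in HhAdapter can be tried in isolation. In this minimal sketch the sample content string is invented for illustration and only assumes paragraphs wrapped in p tags; the real hh.ru feed may look different. splitParagraphs and valueAfterLabel are hypothetical names.

```javascript
// Same regular expression as in the adapter: split on paragraph boundaries
// and drop the empty strings left at the start and end.
function splitParagraphs(content) {
  return content.split(/\n<p>|<\/p><p>|<\/p>\n/).filter(part => part);
}

// Takes the part after "Label: " or falls back to the whole field when there is no label.
function valueAfterLabel(field) {
  return field.split(': ')[1] || field;
}

// Invented sample, NOT real feed data.
const sampleContent =
  '\n<p>Frontend developer</p><p>Date: 2017-09-01</p><p>Region: Moscow</p><p>Salary: 100000</p>\n';

const [, , region, salary] = splitParagraphs(sampleContent);
console.log(valueAfterLabel(region)); // Moscow
console.log(valueAfterLabel(salary)); // 100000
```

Note that the destructuring depends on the order of paragraphs, so a feed item with a missing field would shift the values.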

Formatting

After parsing, the information has to be presented in a convenient form, but the Telegram API offers few options for this: messages can contain only a limited set of tags and Unicode symbols (emoji and stickers do not count). As input we have a few semantic fields plus the description itself as raw HTML. A short search turns up a solution, the html-to-text library. After studying its API and implementation in detail, one cannot help wondering why the formatting functions are called through a closure rather than looked up from a dynamic config, which negates many of the advantages of configuration parameters. So, to render bullets nicely in place of li elements in lists, we have to cheat a little:

    const htmlToText = require('html-to-text');

    const whiteSpaceRegex = /^\s*$/;

    function render({ title, location, salary, tags, description, link, important = [], jobType = '' }) {
        let formattedDescription = htmlToText
            .fromString(description, {
                wordwrap: null,
                noLinkBrackets: true,
                hideLinkHrefIfSameAsText: true,
                format: {
                    unorderedList: function formatUnorderedList(elem, fn, options) {
                        let result = '';
                        const nonWhiteSpaceChildren = (elem.children || []).filter(
                            c => c.type !== 'text' || !whiteSpaceRegex.test(c.data)
                        );
                        nonWhiteSpaceChildren.forEach(function (elem) {
                            result += ' <b>●</b> ' + fn(elem.children, options) + '\n';
                        });
                        return '\n' + result + '\n';
                    }
                }
            })
            .replace(/\n\s*\n/g, '\n');

        important.filter(text => text.includes(':')).forEach(text => {
            formattedDescription = formattedDescription.replace(
                new RegExp(text, 'g'),
                `<b>${text}</b>`
            );
        });

        const formattedTags = tags.map(t => '#' + t).join(' ');
        const locationFormatted = location ? `#${location.replace(/ |-/g, '_')} ` : '';

        return `<b>${title}</b>\n${locationFormatted}#${jobType}\n<b>${salary}</b>\n${formattedTags}\n${formattedDescription}\n${link}`;
    }

    module.exports = { render };
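The strings in important come straight from the vacancy page, so they may contain characters such as ( or + that have a special meaning inside a regular expression built with new RegExp. One possible safeguard is to escape the text first. This is a minimal sketch; escapeRegExp and boldAll are hypothetical helper names, and the sample string is invented.

```javascript
// Escape every character that is special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every literal occurrence of phrase in <b> tags.
function boldAll(source, phrase) {
  const pattern = new RegExp(escapeRegExp(phrase), 'g');
  // A function replacement avoids special handling of "$" sequences in the phrase.
  return source.replace(pattern, () => `<b>${phrase}</b>`);
}

const sample = 'Salary (net): 100. Salary (net): 200';
console.log(boldAll(sample, 'Salary (net):'));
// <b>Salary (net):</b> 100. <b>Salary (net):</b> 200
```

Without the escaping step, the parentheses in the phrase would be read as a capture group and the literal text would not match at all.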

Tagging

Suppose we now have nicely formatted job descriptions but still no tagging. To solve this, I tokenized the Russian-language text with the az library. The token stream is filtered down to words, and each word that has an entry in the tag dictionary is replaced with the corresponding tag.

    const Az = require('az');
    const namesMap = require('../resources/tagNames.json');

    function onlyUnique(value, index, self) {
        return self.indexOf(value) === index;
    }

    function getTags(pureContent) {
        const tokens = Az.Tokens(pureContent).done();
        const tags = tokens
            .filter(t => t.type.toString() === 'WORD')
            .map(t => t.toString().toLowerCase().replace('-', '_'))
            .map(name => namesMap[name])
            .filter(t => t)
            .filter(onlyUnique);
        return tags;
    }

    module.exports = { getTags };
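The dictionary lookup itself can be illustrated without the az library. The sketch below swaps the az tokenizer for a plain regular-expression match, which is a much cruder stand-in, and uses a Map so that words like "constructor" cannot accidentally hit properties inherited by a plain object. getTagsSketch and the small dictionary are assumptions made for the example.

```javascript
// A tiny dictionary in the same word-to-tag shape as tagNames.json.
const dictionary = new Map(Object.entries({
  js: 'JS',
  javascript: 'JS',
  react: 'React',
  node: 'NodeJS'
}));

// Stand-in tokenizer: runs of letters, digits and underscores (NOT the az tokenizer).
function getTagsSketch(text, tagsByWord) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}_]+/gu) || [];
  const tags = words
    .map(word => tagsByWord.get(word))
    .filter(Boolean);
  return [...new Set(tags)]; // keep first occurrences, drop repeats
}

console.log(getTagsSketch('We use React and Node.js; JavaScript, react constructor', dictionary));
// [ 'React', 'NodeJS', 'JS' ]
```

A Set preserves insertion order, so the tags appear in the order in which they are first mentioned in the text.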

Dictionary format

    {
        "js": "JS",
        "javascript": "JS",
        "sql": "SQL",
        "": "Angular",
        "angular": "Angular",
        "angularjs": "Angular",
        "react": "React",
        "reactjs": "React",
        "": "React",
        "node": "NodeJS",
        "nodejs": "NodeJS",
        "linux": "Linux",
        "ubuntu": "Ubuntu",
        "unix": "UNIX",
        "windows": "Windows"
        ....
    }

Deploy and everything else

To publish each vacancy only once, I used a MongoDB database and reduced the problem to the uniqueness of the vacancy links. To monitor the processes and their logs on the server I chose the pm2 process manager, and deployment is done with a plain bash script. The server, by the way, is the simplest Droplet from DigitalOcean.
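The idea behind the deduplication can be shown with an in-memory stand-in. In the bot the check is delegated to MongoDB, and the FeedItemModel schema is not shown in this article, so the sketch below is only an assumption about the behaviour: the first attempt to store a link succeeds and any later attempt is rejected. createSeenStore and claim are hypothetical names.

```javascript
// In-memory stand-in for a store that accepts each key only once.
function createSeenStore() {
  const seen = new Set();
  return {
    // Returns true the first time a key is offered, false for every repeat.
    claim(key) {
      if (seen.has(key)) {
        return false;
      }
      seen.add(key);
      return true;
    }
  };
}

const store = createSeenStore();
const links = [
  'https://example.com/vacancy/1',
  'https://example.com/vacancy/2',
  'https://example.com/vacancy/1' // repeated in a later feed poll
];

// Only links that were claimed for the first time would be posted.
const toPost = links.filter(link => store.claim(link));
console.log(toPost);
```

Unlike this Set, a database keeps the keys across restarts, which matters because the feeds are polled again after every deploy.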

Deploy script

    #!/usr/bin/env bash

    # rs is the SSH host alias of the server
    rsync ./ rs:/var/www/js_jobs_bot --delete -r --exclude=node_modules

    ssh rs "
    . ~/.nvm/nvm.sh
    cd /var/www/js_jobs_bot/
    mv prod-config.json config.json
    npm i && pm2 restart processes.json
    "

Conclusions

Making simple bots turned out not to be difficult: all you need is the desire, knowledge of some programming language (preferably Python or JS), and a couple of days of free time. The results of my bot's work (and the themed vacancy feed itself) can be found in the corresponding channel, @javascriptjobs.

P.S. The full source code can be found in my repository.

Source: https://habr.com/ru/post/337940/
