Classifying Russian text with the Natural library on Node.js

Preamble

It will surprise no one that a modern person, and a programmer in particular, takes in a lot of information every day. My RSS client alone delivers about 500 articles a week, and it is far from my only source of information.

I got the idea of building myself an RSS client on Node.js with a trainable article filter. RSS readers for Node already exist, and so do ready-made neural networks and classifiers, so writing a prototype did not look like a particularly hard task.

I decided to start by testing the neural networks I had at hand. I took a small set of input data: the positive examples I copied from articles about Node.js on Habr, and the negative ones I found on lenta.ru. The classifier's job was to separate articles about programming and Node.js from ordinary news that does nothing for my professional growth.

I would rather not show the results I got with Brain and Fann, because I don't think I have the expertise to judge them. I can only say that out of the box they did not suit me at all: on my input they did not produce an acceptable number of correct answers. The Natural library, on the other hand, impressed me a lot.

Below I show how I trained the classifier, checked its work, and made it understand Russian.

Input data

The data on which I trained and tested the classifier can be viewed here. There is too much of it to fit in the article, so I keep it outside the text.

Code

    'use strict';

    var data = require('./data');
    var natural = require('natural'),
        porterStemmer = natural.PorterStemmerRu,
        classifier = new natural.BayesClassifier(porterStemmer);

    // Feed the classifier the labeled training data.
    for (var i = 0; i < data.good.length; i++) {
        classifier.addDocument(data.good[i], 'good');
    }
    for (var i = 0; i < data.bad.length; i++) {
        classifier.addDocument(data.bad[i], 'bads');
    }

    // Train the classifier.
    classifier.train();

    // Run the test data through the classifier.
    console.log('START CLASSIFICATION');

    console.log('Test on good');
    for (var i = 0; i < data.test_good.length; i++) {
        console.log("> ", classifier.classify(data.test_good[i]));
    }

    console.log('Test on bad');
    for (var i = 0; i < data.test_bad.length; i++) {
        console.log("> ", classifier.classify(data.test_bad[i]));
    }

Result

  START CLASSIFICATION
 Test on good
 > good
 > good
 > good
 > good
 Test on bad
 > bads
 > bads
 > bads
 > bads
 > good
 > bads
 > bads
 > good
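
To put a number on the listing above, the predictions can be tallied into an accuracy figure. This is a minimal sketch: the arrays are transcribed by hand from the output, and `countCorrect` is a helper name introduced here for illustration, not part of Natural.

```javascript
'use strict';

// Predicted labels, transcribed from the output above.
const onGood = ['good', 'good', 'good', 'good'];
const onBad = ['bads', 'bads', 'bads', 'bads', 'good', 'bads', 'bads', 'good'];

// Hypothetical helper: how many predictions match the expected label.
function countCorrect(predictions, expected) {
  return predictions.filter((label) => label === expected).length;
}

const correct = countCorrect(onGood, 'good') + countCorrect(onBad, 'bads');
const total = onGood.length + onBad.length;

console.log(correct + ' of ' + total + ' correct');
console.log('accuracy: ' + (correct / total).toFixed(2));
```

On this small sample all four programming articles are recognized, and two of the eight news items slip through as `good`.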

Russian language support

For good classification quality, Natural relies on a "stemmer" component, which splits text into an array of words, removes uninformative words (so-called stopwords), and strips word endings.
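
The three steps just described (split into words, drop stopwords, strip endings) can be illustrated with a toy pipeline in plain JavaScript. This is only a sketch of the idea, not Natural's implementation: the stopword list, the list of endings, and all function names below are assumptions made up for the example, and a real Porter-style stemmer for Russian is far more careful.

```javascript
'use strict';

// Hypothetical, tiny stopword list; a real stemmer ships a much longer one.
const STOPWORDS = new Set(['и', 'в', 'на', 'не', 'что', 'для']);

// Hypothetical list of endings, longest first. Purely illustrative.
const ENDINGS = ['ами', 'ов', 'ы', 'и', 'а', 'у', 'е'];

// Step 1: lowercase and split on anything that is not a Cyrillic letter.
function tokenize(text) {
  return text.toLowerCase().split(/[^а-яё]+/).filter(Boolean);
}

// Step 3: cut the first matching ending, keeping at least three letters.
function stripEnding(word) {
  for (const ending of ENDINGS) {
    if (word.endsWith(ending) && word.length - ending.length >= 3) {
      return word.slice(0, -ending.length);
    }
  }
  return word;
}

// All three steps together; step 2 is the stopword filter.
function toyTokenizeAndStem(text) {
  return tokenize(text)
    .filter((word) => !STOPWORDS.has(word))
    .map(stripEnding);
}

const stems = toyTokenizeAndStem('Модули и пакеты для сервера');
console.log(stems);
```

The point of the reduction is that different grammatical forms of a word collapse into one feature, so the classifier sees "сервер" and "сервера" as the same token.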

By default the classifier ignores Russian words, although the project does support Russian. To make the classifier understand Russian, initialize it with the Russian stemmer, which replaces the default English one. This is very easy to do:

 var classifier = new natural.BayesClassifier(natural.PorterStemmerRu);

Now the text inside the classifier will be processed correctly, taking into account the peculiarities of the Russian language.
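
One plausible explanation for why an English-oriented setup drops Russian words, offered here as an assumption rather than something confirmed in Natural's source: a tokenizer built around the JavaScript regex class `\W` treats every Cyrillic letter as a separator, because `\w` only covers `[A-Za-z0-9_]`. The snippet below demonstrates just that regex behavior; `asciiTokenize` is a hypothetical helper, not Natural's code.

```javascript
'use strict';

// Hypothetical ASCII-oriented tokenizer: split on runs of non-word characters.
function asciiTokenize(text) {
  return text.split(/\W+/).filter(Boolean);
}

const english = asciiTokenize('Node.js streams and buffers');
const russian = asciiTokenize('Потоки и буферы в Node.js');

console.log(english); // Latin words survive
console.log(russian); // only the Latin fragments of the Russian sentence survive
```

With a tokenizer like this, a purely Russian document would contribute no tokens at all, which matches the behavior described in this section.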

For those who like to experiment

I created a repository with a working classifier specifically for this. Installation is trivial:

 git clone git@github.com:shuvalov-anton/classifier.git
 cd classifier
 npm i
 node app.js

Then change the data in data.js to your own and see the result.

PS

To be honest, I don't have the experience in text classification to judge the result properly, but as an ordinary user I was very impressed by what Natural produced. Unfortunately, I did not find any serious project documentation beyond the README on GitHub, and to figure out how to enable Russian I had to dig through the source code. There was nothing overly complex in it, though, and I believe the result was worth it!

Source: https://habr.com/ru/post/193738/
