Lately there have been frequent articles on Habr about natural language processing, and it just so happens that I have been working in this area recently.
Sentiment analysis was covered very well, as was the part-of-speech tagger pymorphy. But I would like to tell you about the tools I used for NLP, and about some things that have not been mentioned here yet.
First, I was surprised that no one has mentioned nltk, a very useful thing, especially if you are interested not in Russian but in other European languages. Its capabilities include (a short example follows the list):
- tokenization
- stemming
- part-of-speech tagging
- named entity recognition
- classifiers as a bonus
- labeled corpora as a bonus (not for Russian)
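Here is a minimal sketch of those capabilities in action. The `nltk.download` package names are the ones used by recent NLTK releases and may differ in older versions:

```python
import nltk

# One-time model downloads (names valid for recent NLTK releases)
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg)

text = "Mark Twain lived in New York."

tokens = nltk.word_tokenize(text)              # tokenization
stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]      # stemming
tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
tree = nltk.ne_chunk(tagged)                   # named entity recognition

print(tokens, stems, tagged, sep="\n")
print(tree)  # entities appear as labeled subtrees, e.g. (PERSON ...)
```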
And most importantly, there is an excellent beginners' guide, the cookbook. The barrier to entry is very low: you do not even need to know Python itself, because everything in the book is explained in great detail. You cannot get by without English, though. There is, however, a translation into Japanese, if of course you know that language. It even has a very useful chapter on using nltk with Japanese, which should really help with parsing a language written in hieroglyphs.
In general, none of the above works with Russian, which is a shame. The nltk cookbook is now being translated into Russian; you can read about it and help out in the google group.
But the biggest treasure I found on the internet is freeling. FreeLing is a text processing library. It is written in C++, which gives it decent speed, and it has APIs for all (or almost all) popular languages.
The main features of FreeLing (a pipeline sketch follows the list):
- Tokenization
- Morphological analysis
- Lemmatization (reducing a word to its base form)
- Part-of-speech tagging, including for unknown words
- Normalization and detection of dates and numbers
- Named entity recognition and classification
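To give an idea of how this looks in practice, here is a rough sketch of a tagging pipeline through FreeLing's Python bindings. The module name (freeling vs. pyfreeling), the install paths, and the constructor signatures all vary between FreeLing versions, so treat everything below as an assumption rather than a ready recipe:

```python
import freeling  # may be "pyfreeling" depending on the build

freeling.util_init_locale("default")

DATA = "/usr/local/share/freeling/"  # assumed install prefix
LANG = "ru"

# Morphological analyzer options: punctuation, dictionary, affixes,
# multiwords, proper nouns, quantities, tag probabilities
op = freeling.maco_options(LANG)
op.set_data_files("", DATA + "common/punct.dat",
                  DATA + LANG + "/dicc.src",
                  DATA + LANG + "/afixos.dat", "",
                  DATA + LANG + "/locucions.dat",
                  DATA + LANG + "/np.dat",
                  DATA + LANG + "/quantities.dat",
                  DATA + LANG + "/probabilitats.dat")

tk = freeling.tokenizer(DATA + LANG + "/tokenizer.dat")
sp = freeling.splitter(DATA + LANG + "/splitter.dat")
mf = freeling.maco(op)
# depending on the version you may also need mf.set_active_options(...)
tg = freeling.hmm_tagger(DATA + LANG + "/tagger.dat", True, 2)

words = tk.tokenize("Мама мыла раму.")  # tokenization
sents = sp.split(words)   # sentence splitting (some versions need a session)
sents = mf.analyze(sents)  # morphology + lemmas
sents = tg.analyze(sents)  # part-of-speech disambiguation

for s in sents:
    for w in s.get_words():
        print(w.get_form(), w.get_lemma(), w.get_tag())
```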
Spanish, Catalan, Galician, Italian, English, Russian, Portuguese and Welsh are currently supported. Unfortunately, Russian got shortchanged: named entity recognition and classification do not work for it, but the out-of-the-box part-of-speech tagging is already very good. More on cross-language support.
What surprised me most was that I found almost no mention of freeling on the Runet, despite the fact that there is a good manual in Russian.
Why were these tools useful to me? The task was to automatically classify texts into one category or another. To start, the first texts were tagged and categorized by hand (it is hard to call the result a corpus). New texts were then processed automatically: first a RegExp pass highlighted the necessary tags based on a dictionary; then the text was parsed with FreeLing, which covers tokenization and part-of-speech analysis, and stored in that form.
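The dictionary-based RegExp pass looked roughly like this; the dictionary entries and tag names below are made up for illustration:

```python
import re

# Hypothetical dictionary mapping indicative terms to tags
TAG_DICT = {
    "матч": "SPORT",
    "гол": "SPORT",
    "выборы": "POLITICS",
}

# One alternation pattern built from the dictionary keys
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TAG_DICT)) + r")\b",
    re.IGNORECASE,
)

def extract_tags(text):
    """Return the set of dictionary tags that occur in the text."""
    return {TAG_DICT[m.group(1).lower()] for m in PATTERN.finditer(text)}

print(extract_tags("Вчера был матч, завтра выборы"))  # {'SPORT', 'POLITICS'}
```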
The next step was to give each category its own classifier; by that time I was only superficially familiar with the topic of classification (thanks to Irokez for the article). I used the first thing that came to hand, the Naive Bayes classifier from nltk. As features I passed in individual words (unigrams) and the tags, with the number of tags doubled, since it is impossible to set feature weights explicitly in the Naive Bayes classifier, and the tags are much more informative than plain words. Parts of speech were used to filter out unnecessary words: pronouns, numerals, interjections (emotional coloring was not important), and so on.
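A sketch of that feature extraction with toy training data. The POS labels are hypothetical stand-ins for whatever tagset FreeLing emits, and where the real system trained a separate classifier per category, a single multi-class classifier is shown here for brevity. Since nltk features are dict keys, the "doubling" of a tag is approximated by emitting it under two distinct names:

```python
from nltk.classify import NaiveBayesClassifier

# Assumed labels for pronouns, numerals and interjections
SKIP_POS = {"PRON", "NUM", "INTJ"}

def make_features(tagged_words, dict_tags):
    """Boolean features from unigrams plus dictionary tags."""
    feats = {}
    for word, pos in tagged_words:
        if pos not in SKIP_POS:        # drop uninformative parts of speech
            feats["word:" + word.lower()] = True
    for tag in dict_tags:
        # nltk's Naive Bayes has no per-feature weights, so each tag
        # is "doubled" by emitting it as two distinct features
        feats["tag:" + tag] = True
        feats["tag2:" + tag] = True
    return feats

train = [
    (make_features([("матч", "NOUN")], {"SPORT"}), "sport"),
    (make_features([("выборы", "NOUN")], {"POLITICS"}), "politics"),
]
clf = NaiveBayesClassifier.train(train)
print(clf.classify(make_features([("гол", "NOUN")], {"SPORT"})))  # sport
```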
UPD. from kmike: you can also add scikit-learn.org/ to this list. It is a library that contains almost everything needed for machine learning: many different well-implemented classifiers, tf-idf computation, a Cython implementation of HMM (in the dev version on github), and so on.
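For example, a minimal scikit-learn text classification pipeline on toy data (using the current API, which postdates the HMM-in-dev era mentioned above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the hand-categorized texts
texts = ["вчера прошёл футбольный матч", "в стране прошли выборы"]
labels = ["sport", "politics"]

# tf-idf features feeding a multinomial Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["сегодня снова матч"]))  # expected: ['sport']
```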