Lately there have been frequent articles on Habr about natural language processing, and it just so happens that I have been working in this area recently.
Sentiment analysis was covered very well, as was the part-of-speech tagger pymorphy. But I would like to tell you about the tools I used for NLP, and about some things that have not been mentioned here yet.
First, I was surprised that no one has mentioned nltk, a very useful thing, especially if you are interested not in Russian but in other European languages. Its capabilities include (a short example follows the list):
- tokenization
- stemming
- part-of-speech tagging
- named entity recognition
- classifiers as a bonus
- labeled corpora as a bonus (not for Russian)
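Here is a minimal sketch of those capabilities in action. The `nltk.download` package names are the ones used by recent NLTK releases and may differ in older versions:

```python
import nltk

# One-time model downloads (names valid for recent NLTK releases)
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg)

text = "Mark Twain lived in New York."

tokens = nltk.word_tokenize(text)              # tokenization
stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]      # stemming
tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
tree = nltk.ne_chunk(tagged)                   # named entity recognition

print(tokens, stems, tagged, sep="\n")
print(tree)  # entities appear as labeled subtrees, e.g. (PERSON ...)
```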
And most importantly, there is an excellent beginners' guide, the cookbook. The barrier to entry is very low: you do not even need to know Python itself, because everything in the book is explained in great detail. You cannot get by without English, though. There is, however, a translation into Japanese, if of course you know that language. It even has a very useful chapter on using nltk with Japanese, which should really help with parsing a language written in hieroglyphs.
In general, none of the above works with Russian, which is a shame. The nltk cookbook is now being translated into Russian; you can read about it and help out in the google group.
But the biggest treasure I found on the internet is freeling. FreeLing is a text processing library. It is written in C++, which gives it decent speed, and it has APIs for all (or almost all) popular languages.
The main features of FreeLing (a pipeline sketch follows the list):
- Tokenization
- Morphological analysis
- Lemmatization (reducing a word to its base form)
- Part-of-speech tagging, including for unknown words
- Normalization and detection of dates and numbers
- Named entity recognition and classification
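To give an idea of how this looks in practice, here is a rough sketch of a tagging pipeline through FreeLing's Python bindings. The module name (freeling vs. pyfreeling), the install paths, and the constructor signatures all vary between FreeLing versions, so treat everything below as an assumption rather than a ready recipe:

```python
import freeling  # may be "pyfreeling" depending on the build

freeling.util_init_locale("default")

DATA = "/usr/local/share/freeling/"  # assumed install prefix
LANG = "ru"

# Morphological analyzer options: punctuation, dictionary, affixes,
# multiwords, proper nouns, quantities, tag probabilities
op = freeling.maco_options(LANG)
op.set_data_files("", DATA + "common/punct.dat",
                  DATA + LANG + "/dicc.src",
                  DATA + LANG + "/afixos.dat", "",
                  DATA + LANG + "/locucions.dat",
                  DATA + LANG + "/np.dat",
                  DATA + LANG + "/quantities.dat",
                  DATA + LANG + "/probabilitats.dat")

tk = freeling.tokenizer(DATA + LANG + "/tokenizer.dat")
sp = freeling.splitter(DATA + LANG + "/splitter.dat")
mf = freeling.maco(op)
# depending on the version you may also need mf.set_active_options(...)
tg = freeling.hmm_tagger(DATA + LANG + "/tagger.dat", True, 2)

words = tk.tokenize("Мама мыла раму.")  # tokenization
sents = sp.split(words)   # sentence splitting (some versions need a session)
sents = mf.analyze(sents)  # morphology + lemmas
sents = tg.analyze(sents)  # part-of-speech disambiguation

for s in sents:
    for w in s.get_words():
        print(w.get_form(), w.get_lemma(), w.get_tag())
```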
Spanish, Catalan, Galician, Italian, English, Russian, Portuguese and Welsh are currently supported. Unfortunately, Russian got shortchanged: named entity recognition and classification do not work for it, but the out-of-the-box part-of-speech tagging is already very good. More on cross-language support.
What surprised me most was that I found almost no mention of freeling on the Runet, despite the fact that there is a good manual in Russian.
Why were these tools useful to me? The task was to automatically classify texts into one category or another. To start, the first texts were tagged and categorized by hand (it is hard to call the result a corpus). New texts were then processed automatically: first a RegExp pass highlighted the necessary tags based on a dictionary; then the text was parsed with FreeLing, which covers tokenization and part-of-speech analysis, and stored in that form.
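The dictionary-based RegExp pass looked roughly like this; the dictionary entries and tag names below are made up for illustration:

```python
import re

# Hypothetical dictionary mapping indicative terms to tags
TAG_DICT = {
    "матч": "SPORT",
    "гол": "SPORT",
    "выборы": "POLITICS",
}

# One alternation pattern built from the dictionary keys
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TAG_DICT)) + r")\b",
    re.IGNORECASE,
)

def extract_tags(text):
    """Return the set of dictionary tags that occur in the text."""
    return {TAG_DICT[m.group(1).lower()] for m in PATTERN.finditer(text)}

print(extract_tags("Вчера был матч, завтра выборы"))  # {'SPORT', 'POLITICS'}
```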
The next step was to give each category its own classifier; by that time I was only superficially familiar with the topic of classification (thanks to Irokez for the article). I used the first thing that came to hand, the Naive Bayes classifier from nltk. As features I passed in individual words (unigrams) and the tags, with the number of tags doubled, since it is impossible to set feature weights explicitly in the Naive Bayes classifier, and the tags are much more informative than plain words. Parts of speech were used to filter out unnecessary words: pronouns, numerals, interjections (emotional coloring was not important), and so on.
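A sketch of that feature extraction with toy training data. The POS labels are hypothetical stand-ins for whatever tagset FreeLing emits, and where the real system trained a separate classifier per category, a single multi-class classifier is shown here for brevity. Since nltk features are dict keys, the "doubling" of a tag is approximated by emitting it under two distinct names:

```python
from nltk.classify import NaiveBayesClassifier

# Assumed labels for pronouns, numerals and interjections
SKIP_POS = {"PRON", "NUM", "INTJ"}

def make_features(tagged_words, dict_tags):
    """Boolean features from unigrams plus dictionary tags."""
    feats = {}
    for word, pos in tagged_words:
        if pos not in SKIP_POS:        # drop uninformative parts of speech
            feats["word:" + word.lower()] = True
    for tag in dict_tags:
        # nltk's Naive Bayes has no per-feature weights, so each tag
        # is "doubled" by emitting it as two distinct features
        feats["tag:" + tag] = True
        feats["tag2:" + tag] = True
    return feats

train = [
    (make_features([("матч", "NOUN")], {"SPORT"}), "sport"),
    (make_features([("выборы", "NOUN")], {"POLITICS"}), "politics"),
]
clf = NaiveBayesClassifier.train(train)
print(clf.classify(make_features([("гол", "NOUN")], {"SPORT"})))  # sport
```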
UPD. from kmike: you can also add scikit-learn.org/ to this list. It is a library that contains almost everything needed for machine learning: many different well-implemented classifiers, tf-idf computation, a Cython implementation of HMM (in the dev version on github), and so on.
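For example, a minimal scikit-learn text classification pipeline on toy data (using the current API, which postdates the HMM-in-dev era mentioned above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the hand-categorized texts
texts = ["вчера прошёл футбольный матч", "в стране прошли выборы"]
labels = ["sport", "politics"]

# tf-idf features feeding a multinomial Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["сегодня снова матч"]))  # expected: ['sport']
```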