
Comparison and creation of morphological analyzers in the NLTK

Hello. This article is about comparing the existing morphological analyzers in the NLTK library and creating your own.

Introduction


NLTK is a suite of libraries and programs for symbolic and statistical natural language processing, written in Python. It is well suited for people studying computational linguistics, machine learning, and information retrieval [1].
In this article I will accompany the examples with Python code (version 2.7).

Let's get started


Before you begin the process, you need to install and configure the NLTK package itself.

This can be done via pip:
pip install nltk 

Now configure the package. To do this, in the Python GUI you need to enter:

 >>> import nltk
 >>> nltk.download()

A window will open in which you can install additional NLTK packages, including the Brown corpus that we need. Mark the desired package and click "Download". That's it, the setup is finished. Now you can get to work.

Sampling and training


How will the testing be done? Before testing an analyzer, we need to train it, and training is done on ready-made tagged words. We will use the Brown corpus, or rather the part of it called "news": a fairly large category of material in the corpus, consisting, oddly enough, mainly of news texts.

90% of the entire sample will be used for training, and the remaining 10% will be used for testing. We will check the result using the method

 tagger.evaluate(test_sents) 

As a result, we obtain a value from 0 to 1. It can be multiplied by 100 to get the percentage.
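
For illustration, here is a minimal self-contained sketch of such an evaluation (the two tagged sentences are invented for the example; note also that newer NLTK versions rename evaluate() to accuracy(), so the sketch computes the score by hand to stay version-independent):

```python
import nltk

# A tiny hand-tagged corpus (invented sentences) stands in for the
# Brown corpus, so this sketch runs without downloading any data.
tagged_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'AT'), ('cat', 'NN'), ('ran', 'VBD')],
]

default_tagger = nltk.DefaultTagger('NN')

# Compare the tagger's output against the gold tags and compute the
# fraction of correct tags -- the same value evaluate() returns.
gold = [tag for sent in tagged_sents for (_, tag) in sent]
words = [[word for (word, _) in sent] for sent in tagged_sents]
predicted = [tag for sent in words for (_, tag) in default_tagger.tag(sent)]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy)  # 2 of the 6 tokens are nouns, so the score is 1/3
```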

First, we define the training and test sentences. Let's find out how many sentences make up 90% of the Brown corpus.

 >>> training_count = int(len(nltk.corpus.brown.tagged_sents(categories='news')) * 0.9)
 >>> training_count
 4160

4160 is the number of training sentences for each analyzer. The rest, as mentioned, will be used for testing.

Let's define the samples we will work with. We also save the test sentences without tags, to demonstrate the analyzers at work:

 >>> training_sents = nltk.corpus.brown.tagged_sents(categories='news')[:training_count]
 >>> testing_sents = nltk.corpus.brown.tagged_sents(categories='news')[training_count+1:]
 >>> test_sents_notags = nltk.corpus.brown.sents(categories='news')[training_count+1:]

Existing analyzers


There are several morphological analyzers in the NLTK package. The most popular are: the default analyzer, the unigram analyzer, the N-gram analyzer, and the regular-expression analyzer. You can also create your own based on them (more on that later). Let's look at each of them in more detail:

  1. The default analyzer.

    Perhaps the simplest of all the analyzers in NLTK. It assigns the same tag to every word. This analyzer can be used if you want to assign the most frequently used tag. Let's find that tag:

     >>> tags = [tag for (word, tag) in nltk.corpus.brown.tagged_words(categories='news')]
     >>> nltk.FreqDist(tags).max()
     'NN'

    As a result, we get NN (noun). In the listing below we create a default analyzer and immediately check how it works:

     >>> default_tagger = nltk.DefaultTagger('NN')
     >>> default_tagger.tag(test_sents_notags[10])
     [('The', 'NN'), ('evidence', 'NN'), ('in', 'NN'), ('court', 'NN'), ('was', 'NN'), ('testimony', 'NN'), ('about', 'NN'), ('the', 'NN'), ('interview', 'NN'), (',', 'NN'), ('which', 'NN'), ('for', 'NN'), ('Holmes', 'NN'), ('lasted', 'NN'), ('an', 'NN'), ('hour', 'NN'), (',', 'NN'), ('although', 'NN'), ('at', 'NN'), ('least', 'NN'), ('one', 'NN'), ('white', 'NN'), ('student', 'NN'), ('at', 'NN'), ('Georgia', 'NN'), ('got', 'NN'), ('through', 'NN'), ('this', 'NN'), ('ritual', 'NN'), ('by', 'NN'), ('a', 'NN'), ('simple', 'NN'), ('phone', 'NN'), ('conversation', 'NN'), ('.', 'NN')]

    As mentioned, all the words (and even the non-words) are marked with a single tag. This analyzer is rarely used on its own, because it is too coarse.

    Now we find out the accuracy:

     >>> default_tagger.evaluate(testing_sents)
     0.1262832652247583

    An accuracy of only ~13% is a very poor result.
    Let us turn to more complex analyzers.
  2. Regular expression based analyzer.

    This is a very interesting analyzer, in my opinion. It assigns a tag based on a pattern. For example, we can assume that every word ending in -ed is the past form of a verb, and every word ending in -ing is a gerund.

    Let's create a parser and check it right away:

     >>> patterns = [
     ...     (r'.*ing$', 'VBG'),              # gerunds
     ...     (r'.*ed$', 'VBD'),               # simple past
     ...     (r'.*es$', 'VBZ'),               # 3rd singular present
     ...     (r'.*ould$', 'MD'),              # modals
     ...     (r'.*\'s$', 'NN$'),              # possessive nouns
     ...     (r'.*s$', 'NNS'),                # plural nouns
     ...     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
     ...     (r'.*', 'NN')                    # nouns (default)
     ... ]
     >>> regexp_tagger = nltk.RegexpTagger(patterns)
     >>> regexp_tagger.tag(test_sents_notags[10])
     [('The', 'NN'), ('evidence', 'NN'), ('in', 'NN'), ('court', 'NN'), ('was', 'NNS'), ('testimony', 'NN'), ('about', 'NN'), ('the', 'NN'), ('interview', 'NN'), (',', 'NN'), ('which', 'NN'), ('for', 'NN'), ('Holmes', 'VBZ'), ('lasted', 'VBD'), ('an', 'NN'), ('hour', 'NN'), (',', 'NN'), ('although', 'NN'), ('at', 'NN'), ('least', 'NN'), ('one', 'NN'), ('white', 'NN'), ('student', 'NN'), ('at', 'NN'), ('Georgia', 'NN'), ('got', 'NN'), ('through', 'NN'), ('this', 'NNS'), ('ritual', 'NN'), ('by', 'NN'), ('a', 'NN'), ('simple', 'NN'), ('phone', 'NN'), ('conversation', 'NN'), ('.', 'NN')]

    As you can see, most words are still marked with the "default" tag NN, but some are tagged differently because they matched a pattern.

    Check accuracy:

     >>> regexp_tagger.evaluate(testing_sents)
     0.2047244094488189

    20%: this analyzer already does noticeably better than the default analyzer.
  3. Unigram analyzer.

    It uses a simple statistical tagging algorithm: each word (token) is assigned the tag that is most likely for that word.

    First, we create and train the analyzer, and also show it in our work:

     >>> unigram_tagger = nltk.UnigramTagger(training_sents)
     >>> unigram_tagger.tag(test_sents_notags[10])
     [('The', 'AT'), ('evidence', 'NN'), ('in', 'IN'), ('court', 'NN'), ('was', 'BEDZ'), ('testimony', 'NN'), ('about', 'IN'), ('the', 'AT'), ('interview', 'NN'), (',', ','), ('which', 'WDT'), ('for', 'IN'), ('Holmes', None), ('lasted', None), ('an', 'AT'), ('hour', 'NN'), (',', ','), ('although', 'CS'), ('at', 'IN'), ('least', 'AP'), ('one', 'CD'), ('white', 'JJ'), ('student', 'NN'), ('at', 'IN'), ('Georgia', 'NP-TL'), ('got', 'VBD'), ('through', 'IN'), ('this', 'DT'), ('ritual', None), ('by', 'IN'), ('a', 'AT'), ('simple', 'JJ'), ('phone', 'NN'), ('conversation', 'NN'), ('.', '.')]

    The result is already much better than the default analyzer's. But you can see that some words remain untagged (their tag is None). This means those words did not appear in the training data. Let's check the accuracy of the analyzer:

     >>> unigram_tagger.evaluate(testing_sents)
     0.8110236220472441

    ~81% is a very good result. Only 19% of the words are either tagged incorrectly or never appeared during training.
  4. N-grams.

    The previous analyzer assigns a tag based only on the word itself, without taking context into account. For example, the word wind will be tagged the same way regardless of whether it is preceded by to or the . An analyzer based on N-grams solves this problem: it is a generalization of the unigram analyzer in which the tags of the previous n-1 words are used when choosing the tag for the current word.
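
    The wind example can be reproduced with a tiny hypothetical corpus (the sentences and tags below are invented for the sketch):

```python
import nltk

# An invented corpus in which "wind" is a verb after "to" and a noun
# otherwise; NN wins the overall frequency count for "wind".
train = [
    [('to', 'TO'), ('wind', 'VB')],
    [('the', 'AT'), ('wind', 'NN')],
    [('a', 'AT'), ('wind', 'NN')],
]

unigram = nltk.UnigramTagger(train)
bigram = nltk.BigramTagger(train)

# The unigram analyzer ignores context: "wind" is always NN.
print(unigram.tag(['to', 'wind']))  # [('to', 'TO'), ('wind', 'NN')]
# The bigram analyzer looks at the previous tag: after TO it picks VB.
print(bigram.tag(['to', 'wind']))   # [('to', 'TO'), ('wind', 'VB')]
```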

    Now let's check the work of the BigramTagger, an analyzer that uses the current word and the tag of the previous one.

     >>> bigram_tagger = nltk.BigramTagger(training_sents)
     >>> bigram_tagger.tag(test_sents_notags[10])
     [('The', 'AT'), ('evidence', 'NN'), ('in', 'IN'), ('court', 'NN'), ('was', 'BEDZ'), ('testimony', None), ('about', None), ('the', None), ('interview', None), (',', None), ('which', None), ('for', None), ('Holmes', None), ('lasted', None), ('an', None), ('hour', None), (',', None), ('although', None), ('at', None), ('least', None), ('one', None), ('white', None), ('student', None), ('at', None), ('Georgia', None), ('got', None), ('through', None), ('this', None), ('ritual', None), ('by', None), ('a', None), ('simple', None), ('phone', None), ('conversation', None), ('.', None)]

    And here the main problem of this analyzer immediately shows up: many unmarked words. As soon as a new word is encountered in the text, the analyzer cannot assign it a tag. It then fails on the next word as well, because the context "previous tag is None" never occurred during training. The result is a whole chain of unmarked words.
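
    This chain effect is easy to reproduce with a tiny hypothetical corpus (one invented training sentence is enough):

```python
import nltk

# A single invented training sentence.
train = [[('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')]]
bigram = nltk.BigramTagger(train)

# "cat" never appeared in training, so it gets None -- and "barked"
# gets None too, even though it was seen, because the context
# "previous tag is None" never occurred during training.
print(bigram.tag(['the', 'cat', 'barked']))
# [('the', 'AT'), ('cat', None), ('barked', None)]
```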

    Because of this problem, this analyzer will have a small accuracy:

     >>> bigram_tagger.evaluate(testing_sents)
     0.10216286255357321

    Only 10%: a very low figure. This way of tagging words is not used on its own because of the low accuracy, but it is a very powerful tool when analyzers are combined.

    There is also the TrigramTagger, which works on the same principle as the bigram analyzer, except that the tags of the two previous words are used. On its own, its accuracy will of course be even lower.

Combinations from different analyzers


Finally, we have reached the most interesting part: building combinations of analyzers. For example, you can combine the bigram analyzer, the unigram analyzer, and the default analyzer. This is done with the backoff parameter: each analyzer (except the default one) can point to another analyzer to fall back on, forming a multi-pass analyzer.

Let's create it:

 >>> default_tagger = nltk.DefaultTagger('NN')
 >>> unigram_tagger = nltk.UnigramTagger(training_sents, backoff=default_tagger)
 >>> bigram_tagger = nltk.BigramTagger(training_sents, backoff=unigram_tagger)

Let's check the analyzer:

 >>> bigram_tagger.tag(test_sents_notags[10])
 [('The', 'AT'), ('evidence', 'NN'), ('in', 'IN'), ('court', 'NN'), ('was', 'BEDZ'), ('testimony', 'NN'), ('about', 'IN'), ('the', 'AT'), ('interview', 'NN'), (',', ','), ('which', 'WDT'), ('for', 'IN'), ('Holmes', 'NN'), ('lasted', 'NN'), ('an', 'AT'), ('hour', 'NN'), (',', ','), ('although', 'CS'), ('at', 'IN'), ('least', 'AP'), ('one', 'CD'), ('white', 'JJ'), ('student', 'NN'), ('at', 'IN'), ('Georgia', 'NP'), ('got', 'VBD'), ('through', 'IN'), ('this', 'DT'), ('ritual', 'NN'), ('by', 'IN'), ('a', 'AT'), ('simple', 'JJ'), ('phone', 'NN'), ('conversation', 'NN'), ('.', '.')]

As you can see, all words are marked. Now let's check the accuracy of this analyzer:

 >>> bigram_tagger.evaluate(testing_sents)
 0.8447124489185687

As a result, we get ~84%, a very good result. You can combine other analyzers or take a larger training sample to achieve an even better result.
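
For example, the backoff chain can be extended with a TrigramTagger on top. A minimal sketch, using an invented toy corpus in place of training_sents:

```python
import nltk

# An invented toy corpus; with real data you would pass training_sents
# from the Brown corpus here instead.
train = [
    [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('a', 'AT'), ('cat', 'NN'), ('ran', 'VBD')],
]

# Each analyzer falls back to the next one when it cannot decide.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)
t3 = nltk.TrigramTagger(train, backoff=t2)

# Even a completely unseen word gets a tag: NN, from the last fallback.
print(t3.tag(['the', 'elephant', 'ran']))
# [('the', 'AT'), ('elephant', 'NN'), ('ran', 'VBD')]
```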

Conclusion


What can we conclude? The best results, of course, come from a combination of analyzers. But the unigram analyzer performed almost as well, and took less time to build.

I hope this article will help you choose an analyzer. Thanks for your attention.

Links


  1. NLTK package. Official site
  2. NLTK. Documentation

Source: https://habr.com/ru/post/340404/

