Hello. This article is about the syntactic analysis of sentences and how parses are represented. We will use the NLTK package and the Python programming language (version 2.7) to parse sentences.
Introduction
In my previous article, we looked at morphological analyzers and their use. I strongly recommend reading it in order to better understand this article. It also covers installation and configuration of the NLTK package.
Let's get started
What is parsing? Parsing is the process of determining whether a given sequence of lexemes belongs to the language generated by a grammar. For ease of understanding, the result is usually presented as a tree.
Consider some grammatical features of the English language:
Unlimited possibilities. For example, an interesting feature of sentences is that they can be embedded inside larger sentences. Consider the following sentences:
Usain Bolt broke the 100m record
The Jamaica Observer reported that Usain Bolt broke the 100m record
Andre said The Jamaica Observer reported that Usain Bolt broke the 100m record
I think Andre said the Jamaica Observer reported that Usain Bolt broke the 100m record
If we replace the preceding sentence with the symbol S, then each following sentence is built according to a pattern such as Andre said S or I think S. These templates and other similar ones (for example, S but S or S when S) allow us to build larger sentences on the basis of smaller ones.
With the help of these templates, a huge sentence was built in a Winnie-the-Pooh story by A. A. Milne, "In which Piglet is Entirely Surrounded by Water":
You can imagine Piglet's joy when at last the ship came in sight of him. In after-years he liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull's egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment, luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke the Piglet up and just gave him time to jerk himself back into safety and say, "How interesting, and did she?" when, well, you can imagine his joy when at last he saw the good ship, Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him...
In fact, this sentence has a simple structure: it is built from templates such as S but S and S when S. From this example we can conclude that language has constructions that, it seems, allow a sentence to be extended indefinitely.
Ambiguity in sentences. A well-known example of ambiguity comes from the Groucho Marx film "Animal Crackers" (1930):
While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know.
This ambiguity can be demonstrated with a program. The process of defining a grammar and building a tree is considered in detail a little later in the article; for now, let's just look at an example of ambiguity.
First we define a simple grammar and build the parse trees:
>>> grammar = nltk.parse_cfg("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
>>> sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
>>> parser = nltk.ChartParser(grammar)
>>> trees = parser.nbest_parse(sent)
>>> for tree in trees: print tree
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
The program has produced two different sentence structures. This is a sign of ambiguity.
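If you want to inspect these structures visually, the objects returned by the parser are NLTK trees, and (at least in the NLTK version used here, as far as I can tell) they provide a draw() method that opens a small window with a graphical rendering of the tree:

>>> trees[0].draw()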
Context-free grammar
Simple grammar. Consider the simplest context-free grammar. By definition, the left-hand symbol of the first rule of the grammar is the special start symbol S, and all trees must have this symbol as their root. The following listing shows how to define such a grammar and use it to parse the sentence "Mary saw Bob":
>>> grammar1 = nltk.parse_cfg("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
>>> sent = "Mary saw Bob".split()
>>> rd_parser = nltk.RecursiveDescentParser(grammar1)
>>> for tree in rd_parser.nbest_parse(sent): print tree
(S (NP Mary) (VP (V saw) (NP Bob)))
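A small side note: instead of embedding the grammar in a string, you can keep it in a separate file and load it from disk. The sketch below assumes you have saved the same rules to a file named mygrammar.cfg in the current directory; nltk.data.load with the file: prefix is, to the best of my knowledge, how this NLTK version loads such a file:

>>> grammar1 = nltk.data.load('file:mygrammar.cfg')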
The grammar defined above contains rules built from the syntactic categories listed in the table below:
Symbol   Meaning                 Example
S        sentence                the man walked
NP       noun phrase             a dog
VP       verb phrase             saw a park
PP       prepositional phrase    with a telescope
Det      determiner              the
N        noun                    dog
V        verb                    walked
P        preposition             in
Recursion in syntactic structures. A grammar is called recursive if a category on the left-hand side of one of its rules also appears on the right-hand side of some rule. In the following listing, the rule Nom -> Adj Nom contains direct recursion on the category Nom, while indirect recursion on S arises from the combination of two rules, S -> NP VP and VP -> V S.
grammar2 = nltk.parse_cfg("""
S -> NP VP
NP -> Det Nom | PropN
Nom -> Adj Nom | N
VP -> V Adj | V NP | V S | V NP PP
PP -> P NP
PropN -> 'Buster' | 'Chatterer' | 'Joe'
Det -> 'the' | 'a'
N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
Adj -> 'angry' | 'frightened' | 'little' | 'tall'
V -> 'chased' | 'saw' | 'said' | 'thought' | 'was' | 'put'
P -> 'on'
""")
The recursion in this grammar allows us to build, for example, trees with nested nominal expressions and nested sentences.
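As a quick illustration (a sketch assuming the same NLTK version and nbest_parse API used above), here is grammar2 parsing a sentence with nested nominals and a sentence nested inside another sentence; the output trees are shown with my own indentation:

>>> parser2 = nltk.ChartParser(grammar2)
>>> for tree in parser2.nbest_parse('the angry bear chased the frightened little squirrel'.split()): print tree
(S
  (NP (Det the) (Nom (Adj angry) (Nom (N bear))))
  (VP
    (V chased)
    (NP (Det the) (Nom (Adj frightened) (Nom (Adj little) (Nom (N squirrel)))))))
>>> for tree in parser2.nbest_parse('Chatterer said Buster thought the tree was tall'.split()): print tree
(S
  (NP (PropN Chatterer))
  (VP
    (V said)
    (S
      (NP (PropN Buster))
      (VP (V thought) (S (NP (Det the) (Nom (N tree))) (VP (V was) (Adj tall)))))))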
Parsing with context-free grammar
A parser is a program that processes an input sentence according to the rules of the grammar and builds one or more syntax structures that correspond to this grammar.
We will look at two simple parsing algorithms: the recursive descent algorithm (a top-down strategy) and the shift-reduce algorithm (a bottom-up strategy). Let's consider each in turn:
The recursive descent algorithm. This is the simplest parsing algorithm; it interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. For example, the initial goal is to find the element S. The rule S -> NP VP allows the parser to replace this goal with two others (first find NP, then VP). Each of these elements can in turn be replaced by other elements according to the grammar rules. This process continues until the goal is to match a single word, for example telescope. That word is then compared with the next word of the input sequence, and if they match, the algorithm continues its work. If not, the parser backtracks and tries other alternatives.
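To make the idea concrete, here is a minimal sketch of this top-down strategy in plain Python. It is not NLTK's implementation: the grammar is a hard-coded toy subset of the grammar used below, invented purely for illustration, trees are built as nested tuples, and backtracking simply means returning None and trying the next production.

# A toy recursive descent parser: expand goals top-down, backtrack on failure.
# Note: like any naive recursive descent parser, it would loop forever on
# left-recursive rules such as NP -> NP PP.

GRAMMAR = {
    'S':   [['NP', 'VP']],
    'NP':  [['Mary'], ['Det', 'N']],
    'VP':  [['V', 'NP']],
    'Det': [['a']],
    'N':   [['dog']],
    'V':   [['saw']],
}

def expand(goals, words):
    """Try to derive exactly `words` from `goals`; return a list of trees or None."""
    if not goals:
        return [] if not words else None      # success only if all words are used
    first, rest = goals[0], goals[1:]
    if first in GRAMMAR:                      # non-terminal: try each production
        for production in GRAMMAR[first]:
            result = expand(production + rest, words)
            if result is not None:
                n = len(production)
                return [(first, result[:n])] + result[n:]
        return None                           # no production worked: backtrack
    if words and words[0] == first:           # terminal: must match the next word
        result = expand(rest, words[1:])
        if result is not None:
            return [first] + result
    return None

print expand(['S'], 'Mary saw a dog'.split())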
In NLTK, the recursive descent algorithm is implemented by
nltk.RecursiveDescentParser(grammar)
It accepts an optional trace parameter. If the value of this parameter is greater than zero, the parser prints the results of all parsing steps to the screen.
Let's look at it in action. We will parse the sentence "Mary saw a dog". First we define the grammar, and then we run the parser:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
>>> rd_parser = nltk.RecursiveDescentParser(grammar)
>>> sent = 'Mary saw a dog'.split()
>>> for t in rd_parser.nbest_parse(sent): print t
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
Now let's run the parser with the parameter trace=2. The full output is hidden under the spoiler because it is very long:
Program code and output
>>> rd_parser = nltk.RecursiveDescentParser(grammar, trace=2)
>>> sent = 'Mary saw a dog'.split()
>>> for t in rd_parser.nbest_parse(sent): print t
Parsing 'Mary saw a dog'
    [ * S ]
  E [ * NP VP ]
  E [ * 'John' VP ]
  E [ * 'Mary' VP ]
  M [ 'Mary' * VP ]
  E [ 'Mary' * V NP ]
  E [ 'Mary' * 'saw' NP ]
  M [ 'Mary' 'saw' * NP ]
  E [ 'Mary' 'saw' * 'John' ]
  E [ 'Mary' 'saw' * 'Mary' ]
  E [ 'Mary' 'saw' * 'Bob' ]
  E [ 'Mary' 'saw' * Det N ]
  E [ 'Mary' 'saw' * 'a' N ]
  M [ 'Mary' 'saw' 'a' * N ]
  E [ 'Mary' 'saw' 'a' * 'man' ]
  E [ 'Mary' 'saw' 'a' * 'dog' ]
  M [ 'Mary' 'saw' 'a' 'dog' ]
  + [ 'Mary' 'saw' 'a' 'dog' ]
  E [ 'Mary' 'saw' 'a' * 'cat' ]
  E [ 'Mary' 'saw' 'a' * 'telescope' ]
  E [ 'Mary' 'saw' 'a' * 'park' ]
  E [ 'Mary' 'saw' * 'an' N ]
  E [ 'Mary' 'saw' * 'the' N ]
  E [ 'Mary' 'saw' * 'my' N ]
  E [ 'Mary' 'saw' * Det N PP ]
  E [ 'Mary' 'saw' * 'a' N PP ]
  M [ 'Mary' 'saw' 'a' * N PP ]
  E [ 'Mary' 'saw' 'a' * 'man' PP ]
  E [ 'Mary' 'saw' 'a' * 'dog' PP ]
  M [ 'Mary' 'saw' 'a' 'dog' * PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * P NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'in' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'on' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'by' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'with' NP ]
  E [ 'Mary' 'saw' 'a' * 'cat' PP ]
  E [ 'Mary' 'saw' 'a' * 'telescope' PP ]
  E [ 'Mary' 'saw' 'a' * 'park' PP ]
  E [ 'Mary' 'saw' * 'an' N PP ]
  E [ 'Mary' 'saw' * 'the' N PP ]
  E [ 'Mary' 'saw' * 'my' N PP ]
  E [ 'Mary' * 'ate' NP ]
  E [ 'Mary' * 'walked' NP ]
  E [ 'Mary' * V NP PP ]
  E [ 'Mary' * 'saw' NP PP ]
  M [ 'Mary' 'saw' * NP PP ]
  E [ 'Mary' 'saw' * 'John' PP ]
  E [ 'Mary' 'saw' * 'Mary' PP ]
  E [ 'Mary' 'saw' * 'Bob' PP ]
  E [ 'Mary' 'saw' * Det N PP ]
  E [ 'Mary' 'saw' * 'a' N PP ]
  M [ 'Mary' 'saw' 'a' * N PP ]
  E [ 'Mary' 'saw' 'a' * 'man' PP ]
  E [ 'Mary' 'saw' 'a' * 'dog' PP ]
  M [ 'Mary' 'saw' 'a' 'dog' * PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * P NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'in' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'on' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'by' NP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'with' NP ]
  E [ 'Mary' 'saw' 'a' * 'cat' PP ]
  E [ 'Mary' 'saw' 'a' * 'telescope' PP ]
  E [ 'Mary' 'saw' 'a' * 'park' PP ]
  E [ 'Mary' 'saw' * 'an' N PP ]
  E [ 'Mary' 'saw' * 'the' N PP ]
  E [ 'Mary' 'saw' * 'my' N PP ]
  E [ 'Mary' 'saw' * Det N PP PP ]
  E [ 'Mary' 'saw' * 'a' N PP PP ]
  M [ 'Mary' 'saw' 'a' * N PP PP ]
  E [ 'Mary' 'saw' 'a' * 'man' PP PP ]
  E [ 'Mary' 'saw' 'a' * 'dog' PP PP ]
  M [ 'Mary' 'saw' 'a' 'dog' * PP PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * P NP PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'in' NP PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'on' NP PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'by' NP PP ]
  E [ 'Mary' 'saw' 'a' 'dog' * 'with' NP PP ]
  E [ 'Mary' 'saw' 'a' * 'cat' PP PP ]
  E [ 'Mary' 'saw' 'a' * 'telescope' PP PP ]
  E [ 'Mary' 'saw' 'a' * 'park' PP PP ]
  E [ 'Mary' 'saw' * 'an' N PP PP ]
  E [ 'Mary' 'saw' * 'the' N PP PP ]
  E [ 'Mary' 'saw' * 'my' N PP PP ]
  E [ 'Mary' * 'ate' NP PP ]
  E [ 'Mary' * 'walked' NP PP ]
  E [ * 'Bob' VP ]
  E [ * Det N VP ]
  E [ * 'a' N VP ]
  E [ * 'an' N VP ]
  E [ * 'the' N VP ]
  E [ * 'my' N VP ]
  E [ * Det N PP VP ]
  E [ * 'a' N PP VP ]
  E [ * 'an' N PP VP ]
  E [ * 'the' N PP VP ]
  E [ * 'my' N PP VP ]
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
>>>
The recursive descent parser has three significant drawbacks:
left-recursive rules of the form NP -> NP PP lead to an infinite loop;
a lot of time is spent considering words and structures that do not correspond to the input sentence (this can be seen in the listing produced with trace=2);
backtracking discards already-parsed constituents, which then have to be built again.
The shift-reduce algorithm. This is a bottom-up parser. Like other bottom-up parsers, it tries to find sequences of words and phrases that match the right-hand side of a grammar rule and replaces them with the rule's left-hand side, until the sentence "collapses" to the symbol S. The shift-reduce parser repeatedly pushes the next input word onto a stack; this operation is called a "shift". If the top N items of the stack match the N items on the right-hand side of some rule, they are popped from the stack and the symbol from the left-hand side of the rule is pushed instead. This replacement of the top N elements of the stack by a single element is called a "reduce". The operation applies only to the top of the stack; reductions of elements lower in the stack must be completed before new items are pushed. The parser finishes its work when all of the input has been consumed and only one item remains on the stack: a parse tree rooted in S. By the way, you can examine this parser in detail in the demo nltk.app.srparser().
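Again, here is a minimal sketch of this loop in plain Python, purely for illustration and not NLTK's implementation: the grammar is a hand-written toy dictionary mapping right-hand sides to left-hand sides, the parser always reduces greedily when it can, and it does no backtracking, so it can run into exactly the kind of dead end discussed below.

# A toy shift-reduce parser: shift words onto a stack, reduce greedily.
# The stack holds (symbol, tree) pairs so we can build the parse tree.

RULES = {
    ('NP', 'VP'): 'S',
    ('V', 'NP'):  'VP',
    ('Det', 'N'): 'NP',
    ('Mary',):    'NP',
    ('saw',):     'V',
    ('a',):       'Det',
    ('dog',):     'N',
}

def shift_reduce(words):
    stack = []
    words = list(words)
    while words or len(stack) > 1:
        # Reduce: does the top of the stack match the right side of some rule?
        for rhs, lhs in RULES.items():
            n = len(rhs)
            if len(stack) >= n and tuple(sym for sym, _ in stack[-n:]) == rhs:
                children = [tree for _, tree in stack[-n:]]
                del stack[-n:]
                stack.append((lhs, (lhs, children)))
                break
        else:
            if not words:
                return None        # dead end: nothing to shift, nothing to reduce
            word = words.pop(0)    # Shift: push the next word onto the stack
            stack.append((word, word))
    return stack[0][1] if stack and stack[0][0] == 'S' else None

# Prints a nested-tuple tree equivalent to (S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
print shift_reduce('Mary saw a dog'.split())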
In NLTK, the shift-reduce algorithm is implemented by
nltk.ShiftReduceParser(grammar)
Let's immediately check it on the familiar sentence “Mary saw a dog” (we will use the grammar from the previous example too):
>>> sr_parse = nltk.ShiftReduceParser(grammar)
>>> sent = 'Mary saw a dog'.split()
>>> print sr_parse.parse(sent)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
The shift-reduce parser can reach a dead end and fail to produce any parse at all, even when the input sentence is perfectly consistent with the grammar. When this happens, no input words remain and the stack contains elements that cannot be reduced to S. The problem arises because choices made at earlier steps cannot be undone.
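For example, with the grammar used above, a sentence ending in a prepositional phrase can trip the parser up: by the time the PP has been built, the rest of the sentence has already been reduced to S, and nothing combines with the leftover PP. The sentence below is my own illustrative choice; on my understanding of the parser's greedy strategy it prints None, although a parser that resolves the shift-reduce conflict differently could return a full parse:

>>> sent = 'Mary saw a dog in the park'.split()
>>> print sr_parse.parse(sent)
None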
The advantage of the shift-reduce parser over the recursive descent parser is that it only builds structure that corresponds to the input words. Moreover, it builds each substructure only once; for example, NP(Det(the), N(man)) is built just once, regardless of whether it will later be used by the VP -> V NP PP rule or the NP -> NP PP rule.
Conclusion
In this article, we looked at parsing sentences using the NLTK package. We also considered two parsing algorithms: the shift-reduce algorithm and the recursive descent algorithm. Thanks for your attention.