Hello. This is a short continuation of the previous article, in which we covered the basics of parsing with the Natural Language Toolkit package (NLTK for short). As in the last article, I will accompany the examples with Python code (version 2.7).
Introduction
In the previous article, we looked at parsers and grammar types; I strongly recommend reading it if you have not done so. You can also read the first article, where we install and configure the NLTK package.
The simple parsers we have considered so far have a number of drawbacks that significantly limit both their efficiency and the kinds of results they can produce. To solve these problems, algorithms based on dynamic programming are used.
Dynamic programming stores intermediate results and reuses them when needed, which can significantly improve the efficiency of many algorithms.
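The idea can be illustrated with a tiny example unrelated to parsing: computing Fibonacci numbers with a cache, so that each subproblem is solved exactly once. This is only a toy sketch of the technique, not part of NLTK:

```python
# A toy illustration of dynamic programming: results of subproblems are
# cached, so each one is computed exactly once instead of exponentially often.
calls = []  # records which subproblems were actually computed

def fib(n, cache={}):
    # the mutable default argument keeps the cache between calls
    if n in cache:
        return cache[n]
    calls.append(n)
    cache[n] = n if n < 2 else fib(n - 1) + fib(n - 2)
    return cache[n]

print(fib(20))     # 6765
print(len(calls))  # 21: subproblems 0..20, each solved exactly once
```

Without the cache, the naive recursion would recompute the same subproblems thousands of times; with it, the work is linear in n.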
Dynamic programming allows us, when parsing the sentence "I shot an elephant in my pajamas" (which, by the way, we have already considered), to build the constituent "PP in my pajamas" only once. The result is stored in a special table called a WFST (well-formed substring table), and when needed, the constituents for building an NP or a VP can be taken from that same table.
Let's get started
Well-Formed Substring Tables
Consider the sentence “I shot an elephant in my pajamas” as input. It can be represented in the form shown in the figure; this structure is called a chart data structure.
In a WFST, word positions are recorded by filling in the cells of a triangular matrix, in which the start positions of substrings run vertically and the end positions run horizontally (thus, the word shot corresponds to the cell with coordinates (1, 2)). To simplify this representation, it is assumed that each word has exactly one lexical category (part-of-speech tag), and that this tag is what is stored in the matrix cell (for example, cell (1, 2) contains V).
The tag for the word shot in the token list can be found from the grammar:
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | 'I'
... VP -> V NP | VP PP
... Det -> 'an' | 'my'
... N -> 'elephant' | 'pajamas'
... V -> 'shot'
... P -> 'in'
... """)
>>> text = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
>>> grammar.productions(rhs=text[1])
[V -> 'shot']
To build a WFST, a matrix of dimension (n+1) by (n+1) is created, where n is the number of tokens. In Python, the matrix is built as a list of lists. In the following listing, we define several functions to build, populate, and display the WFST. The init_wfst() function fills only the main diagonal, placing there the tags of all words of the sentence. The display() function prints the result on the screen in a readable form.
def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

def display(wfst, tokens):
    print '\nWFST ' + ' '.join([("%-4d" % i) for i in range(1, len(wfst))])
    for i in range(len(wfst)-1):
        print "%d " % i,
        for j in range(1, len(wfst)):
            print "%-4s" % (wfst[i][j] or '.'),
        print
Let's check these functions right away. By the way, we will use the grammar described in the previous example.
>>> tokens = "I shot an elephant in my pajamas".split()
>>> wfst0 = init_wfst(tokens, grammar)
>>> display(wfst0, tokens)

WFST 1    2    3    4    5    6    7
0    NP   .    .    .    .    .    .
1    .    V    .    .    .    .    .
2    .    .    Det  .    .    .    .
3    .    .    .    N    .    .    .
4    .    .    .    .    P    .    .
5    .    .    .    .    .    Det  .
6    .    .    .    .    .    .    N
Now we add the complete_wfst() function, which fills in the rest of the table according to the given grammar. Its input is the initial WFST with the main diagonal filled in, the grammar, and a trace flag (its effect will be shown a little later).
def complete_wfst(wfst, tokens, grammar, trace=False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens+1):
        for start in range(numtokens+1-span):
            end = start + span
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1, nt2) in index:
                    wfst[start][end] = index[(nt1, nt2)]
                    if trace:
                        print "[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" % \
                            (start, nt1, mid, nt2, end, start, index[(nt1, nt2)], end)
    return wfst
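The same CYK-style fill can be sketched without NLTK at all. The lexicon and the binary-rule dictionary below are a hand-written restatement of the article's grammar (mapping a right-hand-side pair to its left-hand side), not NLTK objects:

```python
# Self-contained sketch of the WFST fill: binary rules A -> B C are
# stored as {(B, C): A}, mirroring the index built in complete_wfst().
tokens = "I shot an elephant in my pajamas".split()
lexicon = {'I': 'NP', 'shot': 'V', 'an': 'Det', 'elephant': 'N',
           'in': 'P', 'my': 'Det', 'pajamas': 'N'}
binary_rules = {('NP', 'VP'): 'S', ('P', 'NP'): 'PP',
                ('Det', 'N'): 'NP', ('V', 'NP'): 'VP', ('VP', 'PP'): 'VP'}

n = len(tokens)
wfst = [[None] * (n + 1) for _ in range(n + 1)]
for i, word in enumerate(tokens):
    wfst[i][i + 1] = lexicon[word]

# Fill spans of increasing length; every cell is computed once and reused,
# which is exactly the dynamic-programming saving described above.
for span in range(2, n + 1):
    for start in range(n + 1 - span):
        end = start + span
        for mid in range(start + 1, end):
            pair = (wfst[start][mid], wfst[mid][end])
            if None not in pair and pair in binary_rules:
                wfst[start][end] = binary_rules[pair]

print(wfst[0][n])  # 'S' -- the whole sentence is recognized
```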
Let's check how it works:
>>> wfst1 = complete_wfst(wfst0, tokens, grammar)
>>> display(wfst1, tokens)

WFST 1    2    3    4    5    6    7
0    NP   .    .    S    .    .    S
1    .    V    .    VP   .    .    VP
2    .    .    Det  NP   .    .    .
3    .    .    .    N    .    .    .
4    .    .    .    .    P    .    PP
5    .    .    .    .    .    Det  NP
6    .    .    .    .    .    .    N
Let's look at the detailed trace output of the complete_wfst() function:
>>> wfst1 = complete_wfst(wfst0, tokens, grammar, trace=True)
[2] Det [3]   N [4] ==> [2]  NP [4]
[5] Det [6]   N [7] ==> [5]  NP [7]
[1]   V [2]  NP [4] ==> [1]  VP [4]
[4]   P [5]  NP [7] ==> [4]  PP [7]
[0]  NP [1]  VP [4] ==> [0]   S [4]
[1]  VP [4]  PP [7] ==> [1]  VP [7]
[0]  NP [1]  VP [7] ==> [0]   S [7]
The algorithm terminates when the cell with coordinates (0, 7) contains S, indicating that a syntactic structure covering the entire input sequence has been found.
By the way, the completed WFST table can be represented as Chart Data Structure:

The WFST build program is a simple chart parser that has several drawbacks:
- The table we get is not itself a parse tree; it is a recognizer, telling us only whether a sentence is generated (admitted) by the grammar.
- The program requires the grammar rules to be binary (at most two symbols on the right-hand side). Any context-free grammar can be converted into such a form (Chomsky normal form), but it would be more convenient to work without this additional restriction.
- The bottom-up approach is highly redundant: it proposes placing non-terminals in cells for constituents that the grammar never uses in the final analysis.
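The binarization mentioned in the second point can be sketched as follows. This is an illustrative helper, not an NLTK function, and the fresh non-terminal names it invents (such as NP_Det) are hypothetical:

```python
# Sketch of binarizing a long rule: NP -> Det N PP becomes a chain of
# binary rules by introducing fresh non-terminals.
def binarize(lhs, rhs):
    """Split one rule with len(rhs) > 2 into a chain of binary rules."""
    rules = []
    current = lhs
    while len(rhs) > 2:
        fresh = '%s_%s' % (current, rhs[0])   # invented helper name
        rules.append((current, (rhs[0], fresh)))
        current, rhs = fresh, rhs[1:]
    rules.append((current, tuple(rhs)))
    return rules

for rule in binarize('NP', ['Det', 'N', 'PP']):
    print(rule)
# ('NP', ('Det', 'NP_Det'))
# ('NP_Det', ('N', 'PP'))
```

After such a transformation the WFST algorithm above can handle the grammar, at the cost of extra non-terminals that must be collapsed again when reading off results.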
Dependencies and Dependency Grammar
Grammars based on the phrasal structure of a sentence describe how words and word sequences combine into constituents (immediate constituents). An alternative, and complementary, approach called dependency grammar instead establishes relationships between individual words: a binary asymmetric relation is drawn between pairs of words in the sentence, marking one as the head and the other as the dependent. The head of a sentence is usually taken to be the verb (predicate).
A dependency tree is represented as a labeled directed graph, in which the nodes are lexical items and the labeled arcs express the dependency relations between head words and their dependents. The following figure shows such a graph; the arrows point to the dependent words.

Each arc in the figure is labeled with the type of relation established between the head and the dependent. For example, I is SBJ (the subject of shot, the head word of the sentence), and in is NMOD (a noun modifier of elephant). In dependency grammar, the relations between sentence members can thus be represented by dependency types.
Consider building a dependency grammar without specifying the dependency types:
>>> grammar = nltk.parse_dependency_grammar("""
... 'shot' -> 'I' | 'elephant' | 'in'
... 'elephant' -> 'an' | 'in'
... 'in' -> 'pajamas'
... 'pajamas' -> 'my'
... """)
>>> print grammar
Dependency grammar with 7 productions
  'shot' -> 'I'
  'shot' -> 'elephant'
  'shot' -> 'in'
  'elephant' -> 'an'
  'elephant' -> 'in'
  'in' -> 'pajamas'
  'pajamas' -> 'my'
The following example shows how this grammar handles the attachment ambiguity of the sentence:
>>> pdp = nltk.ProjectiveDependencyParser(grammar)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> trees = pdp.parse(sent)
>>> for tree in trees:
...     print tree
(shot I (elephant an (in (pajamas my))))
(shot I (elephant an) (in (pajamas my)))
Conclusion
Of course, these parsers are somewhat more complex than the ones considered earlier, but they open up new possibilities for syntactic analysis.
Thank you for your attention.
Links
- NLTK parse
- NLTK. Documentation