(The first parts: 1 2 3 4 5.) I hope the readers aren't tired of the conversation about natural language yet! In my opinion, the topic is genuinely interesting (although the popularity of the posts is clearly declining :)). Well, let's see how many more parts I have in me. I think we have already passed the equator, but there are still three or four topics worth touching on.
This time the note is devoted entirely to the XDG / XDK project, which I have been studying at my leisure. I can't call myself an XDG specialist yet, but I'm slowly getting there.
So, we left off at the decision to look at dependency parsing, if possible with non-projective constructions and with multiple output trees. This leads naturally to the XDG / XDK project, which satisfies all of these requirements. I came to the topic already sympathetic to lexicalized dependency parsing, but otherwise I am starting almost from scratch, and the choice of XDG is dictated by a cold-blooded sifting of the alternatives: I studied what exists and picked what looked best. As of today, XDG appears to be the strongest option (at least within the chosen approach).
What is it? XDG stands for Extensible Dependency Grammar, and XDK for XDG Development Kit. XDG is a formalism for describing the admissible connections between the words of a sentence. It is a lexicalized formalism: for each individual word of the language, you describe how, and to what, it can be attached. The XDK includes a set of tools, the most important of which is the Solver, also known as the parser. The Solver takes a grammar of a language and a sentence in that language, and outputs the set of parse trees for that sentence. As you can see, in words everything is very simple :)
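To give a first taste of the notation (all of it is explained below), a grammar is essentially a list of per-word entries that look roughly like this (a simplified sketch of the examples used later in this post; "sleeps" and its attachment requirements are my own illustration):

defentry {dim lex {word: "sleeps"}
          dim syn {in: {root?} out: {subj!}}}

The lex dimension holds the word itself; the syn dimension states what the word may be attached to and what it must attach.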
Writing a grammar
Where does the grammar come from? Here I would like to stress: yes, in the simplest scenario (though for the developer it is clearly not the easiest one :)) the grammar is written by hand. But let's not forget that grammar rules are merely a precise formal description of the constructs of a language. Where they come from, whether written by experts or somehow derived from existing texts, is a separate question. One way or another, accurate sentence analysis needs rules, and a grammar is just one way of writing them down.
As for the notation these rules are written in, it seems quite natural and adequate to me. I believe XDG could describe not only a natural language but also, say, Pascal. Of course, such a parser would be far from perfect, but in theory it should work :)
I don't even know why nobody has built this kind of thing (a dependency parser kit) before. The project is quite fresh: the current version is dated 2007 (although, as I have already said, I don't notice much enthusiasm around the XDK, and at the moment the project is not being developed).
It probably makes sense to give a small taste of the proposed grammar language. It rests on three basic principles:
Lexicalization. The elements of the grammar are the words of the language. Each word can be assigned a set of attributes that determine, in effect, what the word can attach to itself and what it can attach itself to. Each attribute has a name, a type, and a value. The most commonly used types are "string" and "set of strings".
Graph principles. In addition to the attachment constraints specified at the level of individual words, you can impose constraints on the parse graph of the whole phrase. Such constraints are called "principles". The simplest example is the tree principle, which requires the resulting graph to be a tree. We will talk about some other useful principles a bit later.
Multidimensionality. Each word attribute is defined within some (user-defined) namespace, and graph principles are likewise declared within a specific namespace. Each such namespace is called a "dimension". The parse graph of a phrase is built separately for each dimension, so you can obtain several parse graphs at once, each reflecting particular structural features of the sentence (that is, the user can look at the sentence from several different angles). For example, alongside the dependency graph you can build a degenerate graph, a simple chain showing the order of words in the sentence, or, say, a graph connecting all the noun phrases (see the sketch right below).
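Here is what multidimensionality looks like at the level of a single entry. The lex and syn dimensions are the ones used throughout this post; the lin dimension and its next label are purely illustrative names of my own for a word-order chain:

defentry {dim lex {word: "today"}
          dim syn {in: {adv?} out: {}}
          dim lin {in: {next?} out: {next?}}}

The Solver would then build the syn graph and the lin graph separately, over the same set of words.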
Valency principle
Now a few words about what the principles are. First of all, I would like to mention the valency principle: an edge labeled l may go from word w1 to word w2 only if l occurs in the out attribute of w1 and in the in attribute of w2. Let me explain with an example:
defentry {dim lex {word: "eats"}
dim syn {in: {root} out: {subj obj adv *}}}
defentry {dim lex {word: "Peter"}
dim syn {in: {subj? obj?} out: {}}}
defentry {dim lex {word: "spaghetti"}
dim syn {in: {subj? obj?} out: {}}}
defentry {dim lex {word: "today"}
dim syn {in: {adv?} out: {}}}
The out specification of the verb "eats" says that the word attaches exactly one subject (subj!), exactly one object (obj!), and any number of adverbial modifiers (adv*). The in specification of the word "Peter" allows it to serve as a subject or an object; the specification of "spaghetti" is the same, and "today" is described as a potential adverbial. So, if we feed the parser the phrase "Peter eats spaghetti today", the resulting tree will have "eats" at the root, with "Peter" and "spaghetti" attached to it as subject and object, and "today" as an adverbial.
There is also an "order principle" (which I am sacrificing for the sake of space), with which you can specify, for instance, that the word occurring earlier in the sentence is the one that becomes the subject (thus, "spaghetti" will not become the subject).
Agreement principle
This is another important principle that I cannot pass over. It imposes the following requirement: words to be linked must agree on an attribute. For example, you can specify that the verb "eats" requires its subject to be third person singular, while the verb "eat" accepts anything except third person singular:
defentry {dim lex {word: "eats"}
dim syn {agrs: {["3" sg]}
agree: {subj}}}
defentry {dim lex {word: "eat"}
dim syn {agrs: {["1" sg] ["2" sg]
["1" pl] ["2" pl] ["3" pl]}
agree: {subj}}
For the word "Peter" to attach now, we will have to specify its person and number attributes:
defentry {dim lex {word: "Peter"}
dim syn {agrs: {["3" sg]}
agree: {}}}
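(With these entries, a string like "Peter eat spaghetti today." should be rejected: the agrs sets of "eat" and "Peter" do not intersect, so the agreement requirement on subj cannot be satisfied.)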
As these examples show, the grammar has no direct support for morphology: the words "eat" and "eats" are simply different words to it. Then again, I don't quite see how support for "morphology in general" could be squeezed into a universal formalism. The problem is solved quite simply, though: run the input sentence through a morphological analyzer and generate the grammar entries (with the correct endings and word attributes) automatically, on the fly.
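For instance, if the morphological analyzer reports that "apples" is a third person plural noun form, the pipeline could emit an entry like this on the fly (a sketch in the same notation; the word and its attribute values are my own illustration):

defentry {dim lex {word: "apples"}
          dim syn {in: {subj? obj?} out: {}
                   agrs: {["3" pl]}
                   agree: {}}}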
The analysis is complete when the whole structure has been assembled and attached to a word carrying the "top of the tree" attribute. For example, you can describe the period as such a word (since a sentence ends with a period):
defentry {dim lex {word: "."}
          dim syn {in: {}
                   out: {root!}
                   agrs: top}}
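(Note the in: {} here: no edge may enter the period, so it can only end up at the very top of the tree. And agrs: top, as far as I understand, denotes the full set of agreement values, i.e. the period imposes no agreement restrictions at all.)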
Start!
If we feed this grammar (of course, only fragments of it are shown here!) to the parser together with the input string "Peter eats spaghetti today.", the expected tree is produced at the output:

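Schematically, in my own text rendering:

           "."
            |  root
          eats
       /    |     \
  subj/  obj|      \adv
  Peter  spaghetti  today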
Here I propose to put, if not a period, then a semicolon, because it will not be possible to continue "briefly". The next block of thoughts deserves a separate post (tomorrow, probably?..). We will most likely continue with XDG.
Except that, so as not to return to this question later, I will list the XDK problems that catch the eye. Firstly, the current build is compiled for ASCII with support for Western European languages, so Cyrillic turns into mojibake (it still gets parsed, but debugging is inconvenient). A minor problem, though, I think. Secondly, the author defended his thesis and went to work in industry; he no longer supports the project, and nobody else has picked it up yet, open source notwithstanding. Thirdly, "getting into the project" is not so simple, because everything is written in the relatively marginal Mozart/Oz programming environment, in the constraint programming paradigm. It is hard to criticize the author for this: the task is very specific. I have already said that full parsing amounts to an NP-complete problem. Accordingly, the author treated parsing as a constraint satisfaction problem: the graph nodes are given, along with a set of conditions constraining the edges, and a graph satisfying the conditions must be found. That is, he posed the task in a more general form and, as a result, chose a specialized tool for this kind of thing. Properly done. There is also something that I personally find lacking, but more on that next time.