
Natural language analysis: grammatical notation



I have been interested in AI for quite some time, especially in machine understanding of texts written in natural language. The classical theory of text analysis divides this process into three stages: morphological analysis, syntactic analysis (parsing) and semantic analysis.


The first stage is largely solved. Detailed morphological dictionaries cover the lion's share of words found in most texts, and for widely used languages there are rules that classify unknown word forms with sufficient accuracy.

The situation with parsing is much more complicated. Existing analyzers cannot claim to be correct and accurate in complex cases, and most of the high-quality products are released under proprietary licenses (this applies mostly to Russian; for English the problem does not seem to be as acute). So, to make a machine understand texts written in natural language, we need high-quality and accessible syntax analyzers.

Since I lack deep knowledge of neural networks, I decided to follow a more beaten path: develop a BNF-like grammatical notation and implement an analyzer that uses grammar rules written in it. With this approach, most of the work in building a practically useful analyzer lies in constructing a sufficient system of rules (which is still far from complete). In the next post I will describe the design of the implemented analyzer; for now I want to focus on the grammatical notation.

Basics of grammatical notation


I decided to start my research with a specific text. I chose “Ward No. 6”, the famous novella by A. P. Chekhov, written in 1892. As it turned out, I ran into significant difficulties from the very first sentence, because Chekhov's sentences are complex. Before considering specific examples of syntactic analysis, I must warn you that I am not a professional linguist or philologist, so my analysis may turn out to be incorrect from an academic point of view. My goal is to identify the general structure of a sentence, not to produce an analysis that matches the school curriculum.

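    В больничном дворе стоит небольшой флигель, окруженный целым лесом репейника, крапивы и дикой конопли.
    (In the hospital yard stands a small outhouse, surrounded by a whole forest of burdock, nettle and wild cannabis.)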

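This is a simple sentence consisting of a subject (outhouse) with an attribute (small) and a participial phrase (surrounded by a whole forest of burdock, nettle and wild cannabis) attached to it, plus a predicate (stands) with an adverbial modifier of place (in the hospital yard).

Let's divide this sentence into two large parts: the predicate group (the adverbial modifier of place and the predicate) and the subject group (the attribute, the subject and the participial phrase).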

Now we can already outline a few grammar rules that successfully parse the first sentence:

    sentence:
      - "{predicate_group} {subject_group}."
    predicate_group:
      - "{adverbial_modifier} {predicate}"
    subject_group:
      - "{attribute} {subject}, {participial_phrase}"

The grammar is described in YAML, which is convenient for editing rules and is what my analyzer uses. The definition of a non-terminal begins with its name followed by a colon. Then come one or more rules for that non-terminal. A rule is a sequence of terminals and non-terminals. A non-terminal inside a rule is written as "{non-terminal name}". For example, the only rule of the non-terminal "sentence" is the sequence: the non-terminal "predicate_group", the terminal " " (a space), the non-terminal "subject_group" and the terminal ".".
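To make the rule format concrete, here is a minimal Python sketch of how a single rule string could be split into terminals and non-terminals. It is not the analyzer's actual code: the function name `split_rule` and the token labels are assumptions made for illustration only.

```python
import re

# Matches either a non-terminal reference such as "{subject_group}"
# or a run of literal (terminal) characters between references.
TOKEN_RE = re.compile(r"\{([^{}]*)\}|([^{}]+)")


def split_rule(rule):
    """Split a rule string into ("nonterminal", name) / ("terminal", text) pairs.

    Hypothetical helper: the real analyzer may tokenize rules differently.
    """
    tokens = []
    for match in TOKEN_RE.finditer(rule):
        inner, literal = match.group(1), match.group(2)
        if inner is not None:
            tokens.append(("nonterminal", inner.strip()))
        else:
            tokens.append(("terminal", literal))
    return tokens


tokens = split_rule("{predicate_group} {subject_group}.")
for kind, text in tokens:
    print(kind, repr(text))
```

For the "sentence" rule this yields the same four elements listed in the paragraph above: two non-terminals separated by a space terminal, followed by the "." terminal.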

Besides leaving many of the non-terminals it uses undefined, our grammar also ignores the most important syntactic restrictions. Here, for example, is a completely incorrect sentence that this grammar nevertheless accepts:

      ,    ,    . 

The point is that this grammar places no restrictions on the most important characteristics, in particular case, number and gender. Let's modify the rules by adding the necessary restrictions.

    sentence:
      - "{predicate_group number=@1 gender=@2} {subject_group number=@1 gender=@2}."
    predicate_group:
      - "{adverbial_modifier} {predicate !number !gender !tense}"
    subject_group:
      - "{attribute} {subject case=nomn !number=@1 !gender=@2}, {participial_phrase number=@1 gender=@2}"

In the notation above, some non-terminals have attributes. An attribute can have a directly specified value: for example, the “case” attribute of the non-terminal “subject” in the rule for “subject_group” must be “nomn” (nominative).

An attribute name can begin with an exclamation mark. The value of such an attribute is exported from the corresponding non-terminal to the enclosing non-terminal, that is, the one whose rule we are looking at. For example, "predicate_group" must export the attributes "number", "gender" and "tense" to the rule of "sentence", but their values cannot be determined at this level, so it imports them from the non-terminal "predicate".

Sometimes attributes of different non-terminals in a rule must have the same value, whatever that value is (often it cannot be determined at the given level). In such cases special values preceded by the "@" symbol are used. For example, in the "sentence" rule the non-terminals "predicate_group" and "subject_group" must have the same number and gender. An attribute with a special value can also be exported (its name is then preceded by an exclamation mark), as in "!number=@1".
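The three kinds of attributes (direct value, exported, and "@" special value) can be told apart with very little code. The sketch below is only an illustration under my own assumptions: the `Attribute` class and `parse_reference` function are hypothetical names, not part of the analyzer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    name: str
    value: Optional[str] = None   # "nomn", "@1", or None when no value is given
    exported: bool = False        # name was written with a leading "!"
    is_variable: bool = False     # value is a special "@" value


def parse_reference(text):
    """Parse the inside of "{...}", e.g. 'subject case=nomn !number=@1'.

    Returns (name, attributes). The name is None when the reference starts
    directly with an attribute instead of a non-terminal name.
    """
    parts = text.split()
    name = None
    if parts and "=" not in parts[0] and not parts[0].startswith("!"):
        name = parts.pop(0)
    attrs = []
    for part in parts:
        exported = part.startswith("!")
        key, _, value = part.lstrip("!").partition("=")
        attrs.append(Attribute(key, value or None, exported, value.startswith("@")))
    return name, attrs


name, attrs = parse_reference("subject case=nomn !number=@1 !gender=@2")
print(name)
for attr in attrs:
    print(attr)
```

Here "case" comes out as a plain constraint, while "number" and "gender" are both exported and bound to the special values "@1" and "@2".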

Non-strict attribute constraints


In Russian, a verb in the past tense agrees in gender with its noun (“кот ел”, the male cat ate; “кошка ела”, the female cat ate), but in the present or future tense it does not (“кот ест”, “кошка ест”: the cat eats, whatever its gender). We could describe this pattern by creating a separate rule for each tense, but it is easier to rely on the non-strictness of attribute constraints: a constraint on an attribute is ignored when that attribute is not defined in one of the corresponding non-terminals.

Here is an example that demonstrates this. To analyze phrases such as “the male cat ate”, “the female cat ate”, “the male cat eats” and “the female cat eats”, we will use the following rule:

 - "{noun case=nomn number=@1 gender=@2} {verb number=@1 gender=@2}" 

In the present tense, the verb form (“eats”) has no gender attribute, so thanks to the non-strictness of attribute constraints the gender constraint is ignored. The same technique helps with number: we do not need a separate rule for the plural, because the gender of a verb is not defined in the plural and the corresponding constraint is ignored. In practice, this approach can significantly reduce the size and complexity of the grammar.
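The non-strict check itself is easy to express in code. The following Python sketch assumes that word forms are represented as plain dictionaries of attributes; the function name `agree` and the sample attribute sets are hypothetical and only meant to illustrate the idea.

```python
def agree(left, right, shared):
    """Return True if `left` and `right` agree on every attribute in `shared`.

    An attribute that is undefined on either side is skipped, which models
    the non-strict constraint: only two defined, different values conflict.
    """
    for name in shared:
        a, b = left.get(name), right.get(name)
        if a is not None and b is not None and a != b:
            return False
    return True


# Hypothetical attribute sets, as a morphological dictionary might supply them.
noun_masc = {"pos": "noun", "case": "nomn", "number": "sing", "gender": "masc"}
verb_past_femin = {"pos": "verb", "number": "sing", "gender": "femin", "tense": "past"}
verb_present = {"pos": "verb", "number": "sing", "tense": "pres"}

print(agree(noun_masc, verb_past_femin, ["number", "gender"]))  # gender conflict
print(agree(noun_masc, verb_present, ["number", "gender"]))     # gender undefined on the verb
```

The first call fails because both genders are defined and differ; the second succeeds because the present-tense verb has no gender, so a single rule covers both tenses.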

Assigning values to attributes


Sometimes you need to assign values to attributes so that they can later be used in constraints. For example:

    - "@{number=plur}{noun} и {noun}"

This rule assigns the plural number to a list of two nouns (“elephant and pug”). The assignment of attribute values goes at the very beginning of the rule; it is preceded by the "@" symbol and contains a list of attributes with directly specified values.
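As a rough illustration, the assignment prefix can be separated from the rest of the rule with one regular expression. This is a sketch under my own assumptions: `split_assignment` is a hypothetical helper, and the sample rule assumes that the terminal between the two nouns is the Russian conjunction "и" ("and").

```python
import re

# The assignment block is only recognized at the very beginning of the rule.
ASSIGN_RE = re.compile(r"^@\{([^{}]*)\}")


def split_assignment(rule):
    """Return (assigned_attributes, rest_of_rule) for a rule string."""
    match = ASSIGN_RE.match(rule)
    if match is None:
        return {}, rule
    # Each item inside the block has the form name=value.
    assigned = dict(pair.split("=", 1) for pair in match.group(1).split())
    return assigned, rule[match.end():]


assigned, rest = split_assignment("@{number=plur}{noun} и {noun}")
print(assigned)
print(rest)
```

A rule without the prefix comes back unchanged with an empty dictionary, so the same function can be applied to every rule in the grammar.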

Lexical terminals


The lowest-level terminals could be defined as follows:

    predicate:
      ...
      - "@{number=sing gender=femin tense=past}"
      - "@{number=sing gender=masc tense=past}"
      - "@{number=plur tense=past}"
      ...

However, this approach is not particularly realistic, given the huge number of word forms. Terminals of word forms should be extracted from the morphological dictionary. Therefore, such word forms are not listed in the rules, but are denoted by a set of attribute values:

    predicate:
      - "{pos=verb !number !gender !tense}"

As you can see, this notation resembles the assignment of values to attributes, except that it does not begin with the “@” symbol and can appear anywhere in a rule. The rule above matches a lexical terminal (word form) whose “pos” (part of speech) attribute has the value “verb”.
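Matching a lexical terminal against a dictionary entry could look roughly like this in Python. The sketch assumes that the morphological dictionary returns a dictionary of attributes for each word form; `match_lexical_terminal` and the sample entries are hypothetical, and "@" special values are simply skipped here.

```python
def match_lexical_terminal(spec, word_attrs):
    """Match a lexical terminal such as 'pos=verb !number !gender !tense'.

    Returns the exported attributes on success, or None when a directly
    specified value (for example pos=verb) disagrees with the word form.
    """
    exported = {}
    for part in spec.split():
        is_export = part.startswith("!")
        key, _, value = part.lstrip("!").partition("=")
        # Direct values must match exactly; "@" values are not handled in this sketch.
        if value and not value.startswith("@") and word_attrs.get(key) != value:
            return None
        # Export only attributes that the word form actually defines.
        if is_export and key in word_attrs:
            exported[key] = word_attrs[key]
    return exported


# Hypothetical dictionary entries for a present-tense verb and a noun.
verb_form = {"pos": "verb", "number": "sing", "tense": "pres"}
noun_form = {"pos": "noun", "case": "nomn", "number": "sing", "gender": "masc"}

spec = "pos=verb !number !gender !tense"
print(match_lexical_terminal(spec, verb_form))
print(match_lexical_terminal(spec, noun_form))
```

Because the present-tense verb has no gender, nothing is exported for it, which is exactly the situation in which a non-strict gender constraint higher up in the grammar is ignored.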

Source: https://habr.com/ru/post/255073/

