Universal grammar analyzer of natural languages from scratch. Release 1

Compilers, interpreters ... How many books and projects are dedicated to them! Basta, tired! But sushussya in the analysis of natural languages, and no information! And all that is somehow very difficult, incomprehensible and not universal. I had the idea to create a medieval linguistic novel. So that you can talk to the characters in some ancient natural or fictional language. In Latin for example? And on Quenya. And so they understand. Why not?

For all this, you need:

1) Develop a format for describing the grammar of an arbitrary language.
2) Write grammar for Quenya and Latin.
3) Develop a universal grammar analyzer and parser.
4) Make the connection between the behavior of the characters and the parser.

For example, the phrase “Legolas, go to the tree” would be interpreted as:
')
The verb script “go” is launched, the Legolas is transmitted as the action subject (by searching by tag we find the game object), the time (imperative) is indicated, which, without additional waiting conditions, causes the action subject to go to the position of the action object.

In this series of articles, we will develop a grammar analyzer from scratch to a completely stable version already on the githab (link at the end of the article):

1) Design the analyzer architecture
2) Develop a language for describing grammar summaries (so that ordinary linguists could write on it)
3) Let's teach our analyzer to read reports
4) Teach the analyzer on the basis of grammar summaries to analyze the text

The analyzer code will be:
1) Quality
2) Expandable
3) Easily maintained
4) Pleasant to read

The analyzer can be used:
1) From the command line
2) Remotely or locally, via RPC

You think it will be unreal a lot of code? If we wrote it in C ++, we would really have had little, but in the next issue, in secret, I will tell you about a very pleasant, concise and concise language in which you can write such a project in 2 months (in the evenings).
So, let's begin!

First, let's think about what the grammar description language should look like? Following the remarkable principles of SOLID, we fully move on to abstraction. What abstractions can be distinguished in linguistics? I selected a whole 'sheet':

L) Alphabet L (character set)
E) Essence E (verb, noun, etc.)
A) Attributes A (time, mood, etc.)
F) Comparison rule F (how to get its characteristics by a word)

F : L - > E * A

$F: L -> E * A$

Therefore, our analyzer will operate on these abstractions.

The specification of the language "Ololo".

In the repository there are summaries for the languages Quenya (well developed) and for Lingua Latina (very little). But we will write a summary for the simple fictional alien language "Ololo". You can find it in the same place. Open the etc / al / tpl specification:

0) Alphabet.
0.1) Class (a, o, u).
0.2) Consonants (l).

 etc/al/tpl/verb.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two verb: 'ololoo' and 'olalaa'. * The infinitive coinsides to vocabular stem. * Present time of the verb forms by shorthening of the last vowel. * Past time forms by replacing of the first vowel 'o' to any other vowel. */

1) We have two verbs (ololoo, olalaa).
1.1) The present tense is formed by the contraction of the last vowel (ololo, olala).
1.2) The past tense is formed by replacing the initial vowel with any other vowel (ulolloo, alallaa).

 etc/al/tpl/noun.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only one noun: 'll'. * But there are small assumption: between two consonants can be places * arbitrary count of arbitrary vowels. */

2) We have one noun (ll, lol, lool, luol, ...) without declensions.

 etc/al/tpl/prep.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two preposition: 'ao' and 'oa'. * But there are small assumption: adjacent prepositions can be concatenated. */

3) We have two prepositions (ao, oa).
3.1) Related prepositions in the text are glued together (ao oa -> aooa).

 etc/al/tpl/adj.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two adjective: 'lo' and 'ol'. * If adjective ends with consonant - it is consonant declension AD.C (ol). * If adjective ends with vowel - it is vowel declension AD.V (lo). * If adjective has not additional endings - it has singular number. * Plural number forms like this: * For AD.C by adding vowel 'a' * For AD.V by adding suffix 'la' * Superlative A.sup. and comparative A.cmp. are formed like this: * A.sup. by prefix 'al' * A.cmp. exists only for adjective starts with consonant and * forms by appending any short vowel (wildcard @ from 'etc.tpl.txt') * at the beggining of the word. * Superlative and comparative forms plural number on it's own manner. * Only adjectives ends with vowel can form plural A.sup. or A.cmp. by * appending the 'lolo' suffix to the end of the word. */

4) We have two adjectives (lo, ol) of different declinations.
4.1) The plural (lo -> lola, ol -> ola), respectively.
4.2) Superlatives (lo -> allo, ol -> alol).
4.3) Comparative for only one declination (lo -> alo, ulo, olo).
4.4) Plural for only one declension (lo -> lololo).

Competition. Learning poetry aliens.

ll ololo - Simple sentence (1 grammatical basis).
lol ulalaa oa lul olala allololo - Complex sentence (2 grammatical bases).

Competition: think up an initial translation of the above dictionary forms and write the most beautiful and sonorous eighth in the language of “Ololo” (observing all the rules of the grammar described above). The offer is valid for 2 weeks from the date of publication of the article.
Answers write in the comments and be sure to duplicate to the address apborezkiy@gmail.com, indicating the topic "ACC # 1".
The winner will be announced in subsequent articles and will receive a free Skype consultation on any of the topics in which I can be something useful.

Development through testing.

Using the principle of TDD, we first prepare test material for our analyzer.

var / test.tpl / test.in

 oa / * preposition * /
 aooaaooa / * concatenated preposition * /

 loool / * noun * /
 ll / * noun * /

 ololoo / * verb infinitive * /
 olalaa / * verb infinitive * /
 ulalaa / * verb past time * /
 uloloo / * verb past time * /
 ololo / * verb present time * /
 olala / * verb present time * /

 lo / * singular adjective vowel declension * /
 ol / * singular adjective consonant declension * /
 lola / * plural adjective vowel declension * /
 ola / * plural adjective consonant declension * /
 allo / * singular superlative adjective vowel declension * /
 alol / * singular superlative adjective consonant declension * /
 ulo / * singular comparative adjective vowel declension * /
 allololo / * plural superlative adjective vowel declension * /
 ulololo / * plural comparative adjective vowel declension * /

Looking ahead, I will say that we will teach our analyzer to issue at the output such a detailed analysis:

var / test.tpl / test.out

 oa {
   preposition {/ * 1.4.1.1.1.1.1.100.  * /
     pars_orationis = p.  / * preposition * /
     {
       mp_prep = oa / * voc_prepositions * /
     }
   }
 }
 aooaaooa {
   preposition {/ * 1.4.1.1.1.2.1.2.1.1.2.1.2.1.1.2.1.2.1.1.1.1.100.100.100.100.  * /
     pars_orationis = p.  / * preposition * /
     {
       mp_prep = oa / * voc_prepositions * /
       mp_prep = ao / * voc_prepositions * /
       mp_prep = oa / * voc_prepositions * /
       mp_prep = ao / * voc_prepositions * /
     }
   }
 }
 loool {
   noun {/ * 1.2.1.1.1.2.1.2.2.1.2.1.1.3.1.1.2.2.100.  * /
     pars_orationis = n.  / * noun * /
     {
       mn_stem = ll / * voc_nouns * /
       / * mn_vowel_left = ooo * /
       / * mn_vowel_left = o * /
       / * mn_vowel_left * /
       / * mn_vowel_right * /
     }
   }
 }
 ll {
   noun {/ * 1.2.1.1.1.1.1.3.1.1.2.100.  * /
     pars_orationis = n.  / * noun * /
     {
       mn_stem = ll / * voc_nouns * /
       / * mn_vowel_left * /
       / * mn_vowel_right * /
     }
   }
 }
 ololoo {
   verb {/ * 1.3.1.1.1.1.1.2.100.  * /
     pars_orationis = vb.  / * verb * /
     mvb_time = V.inf.  / * infinitive *
     {
       mvb_stem = ololoo / * voc_verbs * /
       / * mvb_time_past * /
     }
   }
 }
 olalaa {
   verb {/ * 1.3.1.2.2.1.1.1.100.  * /
     pars_orationis = vb.  / * verb * /
     mvb_time = V.inf.  / * infinitive *
     {
       mvb_stem = olalaa / * voc_verbs * /
       / * mvb_time_present * /
     }
   }
 }
 ulalaa {
   verb {/ * 1.3.1.1.1.2.1.2.100.  * /
     mvb_time = Vp / * past time * /
     pars_orationis = vb.  / * verb * /
     {
       mvb_stem = olalaa / * voc_verbs * /
       / * mvb_time_past = o * /
     }
   }
 }
 uloloo {
   verb {/ * 1.3.1.1.1.2.1.2.100.  * /
     mvb_time = Vp / * past time * /
     pars_orationis = vb.  / * verb * /
     {
       mvb_stem = ololoo / * voc_verbs * /
       / * mvb_time_past = o * /
     }
   }
 }
 ololo {
   verb {/ * 1.3.1.2.2.2.1.1.100.  * /
     mvb_time = V.pr.  / * present time * /
     pars_orationis = vb.  / * verb * /
     {
       mvb_stem = ololoo / * voc_verbs * /
       / * mvb_time_present = oo * /
     }
   }
 }
 olala {
   verb {/ * 1.3.1.2.2.2.1.1.100.  * /
     mvb_time = V.pr.  / * present time * /
     pars_orationis = vb.  / * verb * /
     {
       mvb_stem = olalaa / * voc_verbs * /
       / * mvb_time_present = aa * /
     }
   }
 }
 lo {
   adjective {/ * 1.1.1.1.1.3.1.3.2.1.2.100.  * /
     aa_number = A.sg.  / * singular number * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree * /
       / * ma_number * /
     }
   }
 }
 ol {
   adjective {/ * 1.1.1.1.1.3.1.3.1.1.2.100.  * /
     aa_number = A.sg.  / * singular number * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = ol / * voc_adjectives * /
       / * ma_degree * /
       / * ma_number * /
     }
   }
 }
 lola {
   adjective {/ * 1.1.1.1.1.3.1.3.4.1.2.100.  * /
     aa_degree = A.no.  / * no special degree * /
     aa_number = A.pl.  / * plural number * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree * /
       / * ma_number = la * /
     }
   }
 }
 ola {
   adjective {/ * 1.1.1.1.1.3.1.3.3.1.2.100.  * /
     aa_number = A.pl.  / * plural number * /
     aa_degree = A.no.  / * no special degree * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = ol / * voc_adjectives * /
       / * ma_degree * /
       / * ma_number = a * /
     }
   }
 }
 allo {
   adjective {/ * 1.1.1.1.1.1.1.3.2.2.2.100.  * /
     aa_number = A.sg.  / * singular number * /
     pars_orationis = adj.  / * adjective *
     aa_degree = A.sup.  / * superlative * /
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree = al * /
       / * ma_number * /
     }
   }
 }
 alol {
   adjective {/ * 1.1.1.1.1.1.1.3.1.1.2.100.  * /
     aa_number = A.sg.  / * singular number * /
     pars_orationis = adj.  / * adjective *
     aa_degree = A.sup.  / * superlative * /
     {
       ma_stem = ol / * voc_adjectives * /
       / * ma_degree = al * /
       / * ma_number * /
     }
   }
 }
 ulo {
   adjective {/ * 1.1.1.1.1.2.1.3.2.1.2.100.  * /
     aa_number = A.sg.  / * singular number * /
     aa_degree = A.cmp.  / * comparative * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree = ul * /
       / * ma_number * /
     }
   }
 }
 allololo {
   adjective {/ * 1.1.1.1.1.1.1.3.5.1.2.100.  * /
     aa_number = A.pl.  / * plural number * /
     pars_orationis = adj.  / * adjective *
     aa_degree = A.sup.  / * superlative * /
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree = al * /
       / * ma_number = lolo * /
     }
   }
 }
 ulololo {
   adjective {/ * 1.1.1.1.1.2.1.3.5.1.2.100.  * /
     aa_degree = A.cmp.  / * comparative * /
     aa_number = A.pl.  / * plural number * /
     pars_orationis = adj.  / * adjective *
     {
       ma_stem = lo / * voc_adjectives * /
       / * ma_degree = ul * /
       / * ma_number = lolo * /
     }
   }
 }

Numbers mean a chain of rules (horizontal and vertical). It is very convenient to track the sequence of analysis.

We write our first summary.

Alphabet

So what will the bulletin description language contain? First, the alphabet in the style of OOP.
Reserved lexemes are:
1) .alphabet - Indicates the beginning of the alphabet description
2) .base - Indicates the parent alphabet
3) = "" {}

 etc/al/tpl/etc.tpl.txt .alphabet short_vowel { a = "vowel (a)" o = "vowel (o)" u = "vowel (u)" } .alphabet long_vowel { aa = "long vowel (a)" oo = "long vowel (o)" } .alphabet vowel .base short_vowel long_vowel { } .alphabet consonant { l = "consonant l" } .alphabet phoneme .base vowel consonant { }

Hierarchy of linguistic entities

Secondly, the description of entities (parts of speech) and their attributes.
Reserved lexemes are:
1) .attribute - Indicates the beginning of the attribute description.
2) .class - Indicates the beginning of the description of the entity.
3) = "" {}

 etc/al/tpl/lang.tpl.txt .attribute pars_orationis 0 { n. = "noun" adj. = "adjective" vb. = "verb" p. = "preposition" }

 etc/al/tpl/verb.tpl.txt .attribute mvb_time 1 { V.inf. = "infinitive" V.pr. = "present time" Vp = "past time" } .class verb { pars_orationis mvb_time }

 etc/al/tpl/noun.tpl.txt .class noun { pars_orationis }

 etc/al/tpl/prep.tpl.txt .class preposition { pars_orationis }

 etc/al/tpl/adj.tpl.txt .attribute aa_declension 1 .verbose { AD.C = "consonant declension" AD.V = "vovel declension" } .attribute aa_number 2 { A.sg. = "singular number" A.pl. = "plural number" } .attribute aa_degree 3 { A.no. = "no special degree" A.sup. = "superlative" A.cmp. = "comparative" } .class adjective { pars_orationis aa_declension aa_number aa_degree }

The number after the attribute name will indicate the order in which the attribute is displayed in the output file (for clarity).

Vocabulary

It's time to enter the dictionary:

 etc/al/tpl/voc.tpl.txt .vocabulary voc_adjectives { lo adj. AD.V ol adj. AD.C } .vocabulary voc_nouns { ll n. } .vocabulary voc_prepositions { ao p. oa p. } .vocabulary voc_verbs { ololoo vb. olalaa vb. }

Reserved lexemes are:
1) .vocabulary - Indicates the beginning of the description of the dictionary.
2) {}

Here, each dictionary entry is associated with an entity and a set of attributes. Remember our magic formula?

Wildcard characters

You also need to enter the wildcard characters (aliases) that will be used in mutations and matching rules.

 etc/al/tpl/etc.tpl.txt .wildcard . phoneme .wildcard * vowel .wildcard # consonant .wildcard @ short_vowel

Here we set the wildcards for our alphabets, which we described above.
Now we need to describe the transformations (mutations) that occur in words.

 etc/al/tpl/etc.tpl.txt .mutation longify_vowel { a = aa o = oo } .mutation change_vowel_to_o { * = o }

Reserved lexemes are:
1) .mutation - Indicates the beginning of the description of the transformation.
2) {}
3) =

On the left, in mutations, you can write both alphabet characters and wildcard characters.
And now the wildcards themselves for our mutations:

 etc/al/tpl/etc.tpl.txt .wildcard (a>aa,o>oo) longify_vowel .wildcard (*>o) change_vowel_to_o

Matching rules

It remains the most difficult. Matching rules The idea is similar to a dictionary, but instead of dictionary forms, a mask with wildcard and alphabet characters is used.
Rules for matching nouns.

 etc/al/tpl/voc.tpl.txt .vocabular .inward mn_stem voc_nouns .match .forward mn_vowel_left { =## =#+* mn_vowel_left } .match .backward mn_vowel_right { =## mn_vowel_right +*=# } .match .inward-void m_noun { mn_vowel_left mn_stem mn_vowel_right | n. noun }

Reserved lexemes are:
1) .match - Indicates the beginning of the description of the matching rule.
2) .backward - The mask begins to match at the end of the word (convenient for suffixes and endings).
2) .forward - The mask starts to match from the beginning of the word (convenient for prefixes).
3) .inward-void - First from the end, then first, and so on to the vocabulary base.
4) | - After the vertical bar, the corresponding characteristics of the word, suitable under the mask, begin.
five) {}
6) + - =

The mask of the word is indicated on the left, on the right - the attributes or entity to which this mask corresponds. A mask consists of a sequence of rules, which can be either an independent rule, or a combination of wildcard characters with alphabet characters and with special signs "+", "-", "=".
We write the collation rules specification:

 etc/al/tpl/adj.tpl.txt /* * Match specification is the powerful easy mechanism for words recognision. * Each regular match expression has 3 mode: * * '=' match mode: * only comparation. * '+' rift mode: * comparation and rifting from subword copy, * appending detached part to rule 'value' field that could be * found in the output generated files. * '-' hold mode (comparation and holding) * comparation and holding (not detaching), * appending holded part to rule 'value' field that could be * found in the output generated files. * * Also regular expressions supports negotiation of the single next character * or wildcard (wildcard can has arbitrary name length) through the preceding * reserved symbol '~'. * * Examples: /* meaning */ * =~a /* not 'a' */ * =~ab /* not 'a' followed by 'b' */ * =~a~b /* not 'a' followed by not 'b' */ * =~# /* not any phoneme from wildcard '#' consistent alphabet tree */ */

Parse the rules for matching nouns

Now how it all works. On the example of nouns.

 .match .inward-void m_noun { mn_vowel_left mn_stem mn_vowel_right | n. noun }

First, from the end, the rule “m_noun” is considered, which immediately passes into consideration of the matching rule “from the end” - “mn_vowel_right”, which looks like this:

 .match .backward mn_vowel_right { =## mn_vowel_right +*=# }

The first mask is checked "= ##". First there is a pointer to the banal character-by-character comparison mode "=". Behind her are two of our above wildcard characters "#", meaning two consonants. So, at the end of the word there should be two consonants, the rule ends here.
Consider an alternative course of events “mn_vowel_right + * = #”. This rule is recursive. It means that at the end there is a consonant. After that, before this consonant, we must remove one vowel and write it as the result of this rule. And do so until we stumble upon the only possible option "= ##". Those. all posts that we remove will be the result of “mn_vowel_right”, which we should see in the analysis results.

Step by step the rule "mn_vowel_right". Take for example the word "loolool".

Word residue	Pattern mapping	Result
loolool	# in compare mode
loolol	* in split mode	o
loolol	# in compare mode	o
looll	* in split mode	oo
looll	## in compare mode	oo

By the end we got a “looll”. It will go further into mn_vowel_left. Similarly, by the end we get “lll”. It will go further to mn_stem and will be searched in the dictionary.
Since there is no such word in the dictionary, this chain of rules is considered inappropriate. But if we took "looool", we would get our vocabulary form of the noun "ll".
C noun figured out. Fuuuh. You are not tired yet? A little rest and it's time to take on the verb.

Parse the verb matching rules

 etc/al/tpl/verb.tpl.txt .vocabulary voc_verbs ; /* preemptive declaration, see 'voc.tpl.txt' */ .vocabular .inward mvb_stem voc_verbs .match .backward mvb_time_present { /* eg */ =. | V.inf. /* '.' is any phoneme ('etc.tpl.txt') */ /* 1 */ -(a>aa,o>oo) | V.pr. /* ololo -> ololoo, olala -> olalaa */ /* 2 */ } .match .forward mvb_time_past { /* eg */ =. | V.inf. /* '.' is any phoneme ('etc.tpl.txt') */ /* 1 */ =~o-(*>o) | Vp /* eloloo -> ololoo, ulalaa -> olalaa */ /* 2 */ } .match .inward-void m_verb { mvb_time_past mvb_stem | vb. verb /* 1 */ mvb_stem mvb_time_present | vb. verb /* 2 */ }

It is much more interesting here, we use the mutations described above to restore the effects of fusions. For the first time, the “-” hold mode is used, which instead of “+” splitting off keeps the character in place as “=”, but writes it to the result of the rule as “+”.
Let's go from the beginning. Take the word "ulolloo." Let's analyze the rules of the past and present times.

Rule "mvb_time_past".

Word residue	Pattern mapping	Result
ulolloo
ololloo	(*> o) on hold	u
ololloo	not “o” in compare mode	u

The analyzer considers all possible chains of events, but if there are conflicting characteristics, for example V.inf. and Vp then this scenario is terminated as impossible.

Rule "mvb_time_present". Take the word "olollo".

Word residue	Pattern mapping	Result
olollo
ololloo	(a> aa, o> oo) on hold	o

Parse the rules for matching adjectives.

Tilde before the characteristic denies it. "~ A.no." then in the future, either A.sup. or A.cmp. or nothing is allowed.

 .match .backward ma_number { /* eg */ =# | A.sg. AD.C /* al -> al */ =* | A.sg. AD.V /* lo -> lo */ =#+a | A.pl. AD.C A.no. /* ola -> ol */ =*+la | A.pl. AD.V A.no. /* lola -> lo */ =*+lolo | A.pl. AD.V ~ A.no. /* allololo -> allo */ } .match .forward ma_degree { /* eg */ +al | A.sup. /* allo -> lo, alol -> ol */ +@-# | A.cmp. /* ulo -> lo */ =. | ~ A.sup. ~ A.cmp. /* '.' is any phoneme wildcard ('etc.tpl.txt') */ } .match .inward-void m_adjective { ma_degree ma_stem ma_number | adj. adjective }

Take the word allololo . Consider the larger steps.

Source word	Rule	Matching pattern matching	Result	Word residue
allololo	ma_number	"= " or "= + lolo"	no either "lolo"	allololo either allo
allololo allo	ma_degree	"+ al" or "+ @ - #" or "=."	“Al” or “al” or not	lololo, lo either llololo, llo or allololo, allo
lololo, lo either llololo, llo or allololo, allo	ma_stem	vocabulary	not	not

Fits only lo . Thus, we have obtained the only suitable chain of comparisons of characteristics: “A.pl. AD.V ~ A.no. A.sup. adj. adjective. Therefore, the word allololo is unambiguously a “plural superlative adjective vowel declension”.
If it is interesting, for the language of the description of grammars we will highlight a separate cycle of articles. For now, bye!

Universal grammar analyzer available at:
github.com/ArseniyBorezkiy/arda_compiler_collection

In the next release we will proceed directly to the design and coding of our analyzer.

Source: https://habr.com/ru/post/323692/

All Articles

Universal grammar analyzer of natural languages ​​from scratch. Release 1