Universal grammar analyzer of natural languages from scratch. Release 1
Compilers, interpreters ... How many books and projects are dedicated to them! Basta, tired! But sushussya in the analysis of natural languages, and no information! And all that is somehow very difficult, incomprehensible and not universal. I had the idea to create a medieval linguistic novel. So that you can talk to the characters in some ancient natural or fictional language. In Latin for example? And on Quenya. And so they understand. Why not? For all this, you need:
1) Develop a format for describing the grammar of an arbitrary language. 2) Write grammar for Quenya and Latin. 3) Develop a universal grammar analyzer and parser. 4) Make the connection between the behavior of the characters and the parser.
For example, the phrase “Legolas, go to the tree” would be interpreted as: ')
The verb script “go” is launched, the Legolas is transmitted as the action subject (by searching by tag we find the game object), the time (imperative) is indicated, which, without additional waiting conditions, causes the action subject to go to the position of the action object.
In this series of articles, we will develop a grammar analyzer from scratch to a completely stable version already on the githab (link at the end of the article):
1) Design the analyzer architecture 2) Develop a language for describing grammar summaries (so that ordinary linguists could write on it) 3) Let's teach our analyzer to read reports 4) Teach the analyzer on the basis of grammar summaries to analyze the text
The analyzer code will be: 1) Quality 2) Expandable 3) Easily maintained 4) Pleasant to read
The analyzer can be used: 1) From the command line 2) Remotely or locally, via RPC
You think it will be unreal a lot of code? If we wrote it in C ++, we would really have had little, but in the next issue, in secret, I will tell you about a very pleasant, concise and concise language in which you can write such a project in 2 months (in the evenings). So, let's begin!
First, let's think about what the grammar description language should look like? Following the remarkable principles of SOLID, we fully move on to abstraction. What abstractions can be distinguished in linguistics? I selected a whole 'sheet':
L) Alphabet L (character set) E) Essence E (verb, noun, etc.) A) Attributes A (time, mood, etc.) F) Comparison rule F (how to get its characteristics by a word)
Therefore, our analyzer will operate on these abstractions.
The specification of the language "Ololo".
In the repository there are summaries for the languages Quenya (well developed) and for Lingua Latina (very little). But we will write a summary for the simple fictional alien language "Ololo". You can find it in the same place. Open the etc / al / tpl specification:
0) Alphabet. 0.1) Class (a, o, u). 0.2) Consonants (l).
etc/al/tpl/verb.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two verb: 'ololoo' and 'olalaa'. * The infinitive coinsides to vocabular stem. * Present time of the verb forms by shorthening of the last vowel. * Past time forms by replacing of the first vowel 'o' to any other vowel. */
1) We have two verbs (ololoo, olalaa). 1.1) The present tense is formed by the contraction of the last vowel (ololo, olala). 1.2) The past tense is formed by replacing the initial vowel with any other vowel (ulolloo, alallaa).
etc/al/tpl/noun.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only one noun: 'll'. * But there are small assumption: between two consonants can be places * arbitrary count of arbitrary vowels. */
2) We have one noun (ll, lol, lool, luol, ...) without declensions.
etc/al/tpl/prep.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two preposition: 'ao' and 'oa'. * But there are small assumption: adjacent prepositions can be concatenated. */
3) We have two prepositions (ao, oa). 3.1) Related prepositions in the text are glued together (ao oa -> aooa).
etc/al/tpl/adj.tpl.txt /* * It's time to introduce your experimental 'lol' language. * The 'lol' language has only two adjective: 'lo' and 'ol'. * If adjective ends with consonant - it is consonant declension AD.C (ol). * If adjective ends with vowel - it is vowel declension AD.V (lo). * If adjective has not additional endings - it has singular number. * Plural number forms like this: * For AD.C by adding vowel 'a' * For AD.V by adding suffix 'la' * Superlative A.sup. and comparative A.cmp. are formed like this: * A.sup. by prefix 'al' * A.cmp. exists only for adjective starts with consonant and * forms by appending any short vowel (wildcard @ from 'etc.tpl.txt') * at the beggining of the word. * Superlative and comparative forms plural number on it's own manner. * Only adjectives ends with vowel can form plural A.sup. or A.cmp. by * appending the 'lolo' suffix to the end of the word. */
4) We have two adjectives (lo, ol) of different declinations. 4.1) The plural (lo -> lola, ol -> ola), respectively. 4.2) Superlatives (lo -> allo, ol -> alol). 4.3) Comparative for only one declination (lo -> alo, ulo, olo). 4.4) Plural for only one declension (lo -> lololo).
Competition: think up an initial translation of the above dictionary forms and write the most beautiful and sonorous eighth in the language of “Ololo” (observing all the rules of the grammar described above). The offer is valid for 2 weeks from the date of publication of the article. Answers write in the comments and be sure to duplicate to the address apborezkiy@gmail.com, indicating the topic "ACC # 1". The winner will be announced in subsequent articles and will receive a free Skype consultation on any of the topics in which I can be something useful.
Development through testing.
Using the principle of TDD, we first prepare test material for our analyzer.
Numbers mean a chain of rules (horizontal and vertical). It is very convenient to track the sequence of analysis.
We write our first summary.
Alphabet
So what will the bulletin description language contain? First, the alphabet in the style of OOP. Reserved lexemes are: 1) .alphabet - Indicates the beginning of the alphabet description 2) .base - Indicates the parent alphabet 3) = "" {}
Secondly, the description of entities (parts of speech) and their attributes. Reserved lexemes are: 1) .attribute - Indicates the beginning of the attribute description. 2) .class - Indicates the beginning of the description of the entity. 3) = "" {}
etc/al/tpl/lang.tpl.txt .attribute pars_orationis 0 { n. = "noun" adj. = "adjective" vb. = "verb" p. = "preposition" }
The number after the attribute name will indicate the order in which the attribute is displayed in the output file (for clarity).
Vocabulary
It's time to enter the dictionary:
etc/al/tpl/voc.tpl.txt .vocabulary voc_adjectives { lo adj. AD.V ol adj. AD.C } .vocabulary voc_nouns { ll n. } .vocabulary voc_prepositions { ao p. oa p. } .vocabulary voc_verbs { ololoo vb. olalaa vb. }
Reserved lexemes are: 1) .vocabulary - Indicates the beginning of the description of the dictionary. 2) {}
Here, each dictionary entry is associated with an entity and a set of attributes. Remember our magic formula?
Wildcard characters
You also need to enter the wildcard characters (aliases) that will be used in mutations and matching rules.
It remains the most difficult. Matching rules The idea is similar to a dictionary, but instead of dictionary forms, a mask with wildcard and alphabet characters is used. Rules for matching nouns.
Reserved lexemes are: 1) .match - Indicates the beginning of the description of the matching rule. 2) .backward - The mask begins to match at the end of the word (convenient for suffixes and endings). 2) .forward - The mask starts to match from the beginning of the word (convenient for prefixes). 3) .inward-void - First from the end, then first, and so on to the vocabulary base. 4) | - After the vertical bar, the corresponding characteristics of the word, suitable under the mask, begin. five) {} 6) + - =
The mask of the word is indicated on the left, on the right - the attributes or entity to which this mask corresponds. A mask consists of a sequence of rules, which can be either an independent rule, or a combination of wildcard characters with alphabet characters and with special signs "+", "-", "=". We write the collation rules specification:
etc/al/tpl/adj.tpl.txt /* * Match specification is the powerful easy mechanism for words recognision. * Each regular match expression has 3mode: * * '=' match mode: * only comparation. * '+' rift mode: * comparation and rifting from subword copy, * appending detached part to rule 'value' field that could be * found in the output generated files. * '-' hold mode (comparation and holding) * comparation and holding (not detaching), * appending holded part to rule 'value' field that could be * found in the output generated files. * * Also regular expressions supports negotiation of the single next character * or wildcard (wildcard can has arbitrary name length) through the preceding * reserved symbol '~'. * * Examples: /* meaning */ * =~a /* not'a' */ * =~ab /* not'a' followed by 'b' */ * =~a~b /* not'a' followed by not'b' */ * =~# /* not any phoneme from wildcard '#' consistent alphabet tree */ */
First, from the end, the rule “m_noun” is considered, which immediately passes into consideration of the matching rule “from the end” - “mn_vowel_right”, which looks like this:
The first mask is checked "= ##". First there is a pointer to the banal character-by-character comparison mode "=". Behind her are two of our above wildcard characters "#", meaning two consonants. So, at the end of the word there should be two consonants, the rule ends here. Consider an alternative course of events “mn_vowel_right + * = #”. This rule is recursive. It means that at the end there is a consonant. After that, before this consonant, we must remove one vowel and write it as the result of this rule. And do so until we stumble upon the only possible option "= ##". Those. all posts that we remove will be the result of “mn_vowel_right”, which we should see in the analysis results.
Step by step the rule "mn_vowel_right". Take for example the word "loolool".
Word residue
Pattern mapping
Result
loolool
# in compare mode
loolol
* in split mode
o
loolol
# in compare mode
o
looll
* in split mode
oo
looll
## in compare mode
oo
By the end we got a “looll”. It will go further into mn_vowel_left. Similarly, by the end we get “lll”. It will go further to mn_stem and will be searched in the dictionary. Since there is no such word in the dictionary, this chain of rules is considered inappropriate. But if we took "looool", we would get our vocabulary form of the noun "ll". C noun figured out. Fuuuh. You are not tired yet? A little rest and it's time to take on the verb.
It is much more interesting here, we use the mutations described above to restore the effects of fusions. For the first time, the “-” hold mode is used, which instead of “+” splitting off keeps the character in place as “=”, but writes it to the result of the rule as “+”. Let's go from the beginning. Take the word "ulolloo." Let's analyze the rules of the past and present times.
Rule "mvb_time_past".
Word residue
Pattern mapping
Result
ulolloo
ololloo
(*> o) on hold
u
ololloo
not “o” in compare mode
u
The analyzer considers all possible chains of events, but if there are conflicting characteristics, for example V.inf. and Vp then this scenario is terminated as impossible.
Rule "mvb_time_present". Take the word "olollo".
Word residue
Pattern mapping
Result
olollo
ololloo
(a> aa, o> oo) on hold
o
Parse the rules for matching adjectives.
Tilde before the characteristic denies it. "~ A.no." then in the future, either A.sup. or A.cmp. or nothing is allowed.
.match .backward ma_number { /* eg */ =# | A.sg. AD.C /* al -> al */ =* | A.sg. AD.V /* lo -> lo */ =#+a | A.pl. AD.C A.no. /* ola -> ol */ =*+la | A.pl. AD.V A.no. /* lola -> lo */ =*+lolo | A.pl. AD.V ~ A.no. /* allololo -> allo */ } .match .forward ma_degree { /* eg */ +al | A.sup. /* allo -> lo, alol -> ol */ +@-# | A.cmp. /* ulo -> lo */ =. | ~ A.sup. ~ A.cmp. /* '.' is any phoneme wildcard ('etc.tpl.txt') */ } .match .inward-void m_adjective { ma_degree ma_stem ma_number | adj. adjective }
Take the word allololo . Consider the larger steps.
Source word
Rule
Matching pattern matching
Result
Word residue
allololo
ma_number
"= *" or "= * + lolo"
no either "lolo"
allololo either allo
allololo allo
ma_degree
"+ al" or "+ @ - #" or "=."
“Al” or “al” or not
lololo, lo either llololo, llo or allololo, allo
lololo, lo either llololo, llo or allololo, allo
ma_stem
vocabulary
not
not
Fits only lo . Thus, we have obtained the only suitable chain of comparisons of characteristics: “A.pl. AD.V ~ A.no. A.sup. adj. adjective. Therefore, the word allololo is unambiguously a “plural superlative adjective vowel declension”. If it is interesting, for the language of the description of grammars we will highlight a separate cycle of articles. For now, bye!