
How I created the parser

One of my projects needed an interesting feature: rephrasing text, so that, for example, the phrase “a cow grazed in a meadow” could be turned into “a spotted burenka chews juicy grass in a green meadow”. Of course, this kind of transformation requires a very large database of links between words and expressions, and the lack of one brought the whole effort to nothing. But that is another story. Here I will describe how I approached the parsing of sentences, which then had to be transformed into something new but still human-readable.

Wikipedia defines syntactic analysis as follows: “Syntactic analysis (parsing) is the process of matching a linear sequence of lexemes (words, tokens) of a language against its formal grammar. The result is usually a parse tree (syntax tree). It is usually used together with lexical analysis. A syntactic analyzer (parser) is a program, or part of a program, that performs parsing.”

To transform text successfully, it must first be split into paragraphs, and those, in turn, into sentences. This was done internally by the Solarix morphological engine, which I had bought for a synonymizer back in 2008. The output of the splitter was an array of paragraphs, each of which contained an array of sentences.
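
The shape of that output can be illustrated with a small sketch. It is not how Solarix works internally (the article does not say). It is a naive stand-in in Python, chosen only for illustration, that assumes paragraphs are separated by blank lines and sentences end with ".", "!" or "?". The function name `split_text` is hypothetical.

```python
import re


def split_text(text: str) -> list[list[str]]:
    """Return a list of paragraphs, each a list of sentences (naive rules)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Assumption: a sentence ends at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        result.append([s for s in sentences if s])
    return result


text = "The cat drank milk. The dog slept.\n\nMom washed the frame."
for paragraph in split_text(text):
    print(paragraph)
```

A real splitter also has to cope with abbreviations, initials and direct speech, which this sketch ignores.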
Just in case, a note: I deliberately do not name a programming language in this article. First, I want to avoid a holy war over languages; second, the subject here is the method of parsing a sentence, not a specific implementation of it. In addition, for a number of reasons I could not use the parser built into Solarix, which is why I had to reinvent the wheel.

The handler is a multi-pass analyzer that processes not only individual words but whole sentences. It draws on the context of sentences and paragraphs when it runs into homonymy or into incomplete or unclear sentences.

Splitting the text into paragraphs makes it possible to pick out the main idea of each paragraph (if it has one, of course) and thereby build a context. Formally, the context of a paragraph can be treated as the set of all subject-predicate pairs found in the sentences of that paragraph. For one-member sentences, only the single main member is used. If it can be found at all, of course.
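
As an illustration of this definition, here is a minimal Python sketch of a paragraph context as a set of subject-predicate pairs. The names `Sentence` and `paragraph_context` are hypothetical, the pairs are filled in by hand rather than by a real analyzer, and a one-member sentence is modeled by leaving the missing member as `None`.

```python
from dataclasses import dataclass, field
from typing import Optional

# A (subject, predicate) pair; either side may be missing in a one-member sentence.
Pair = tuple[Optional[str], Optional[str]]


@dataclass
class Sentence:
    text: str
    pairs: list[Pair] = field(default_factory=list)


def paragraph_context(sentences: list[Sentence]) -> set[Pair]:
    """Collect every subject-predicate pair found in the paragraph."""
    context: set[Pair] = set()
    for sentence in sentences:
        context.update(sentence.pairs)
    return context


paragraph = [
    Sentence("The cat drank milk.", [("cat", "drank")]),
    Sentence("Evening.", [("evening", None)]),  # one-member sentence
]
print(paragraph_context(paragraph))
```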

All the main work happens at the sentence level. The algorithm for analyzing a sentence is fairly simple and can be described as the states of a finite automaton. Unfortunately, although I know the theory of finite automata, I have no practical experience of building them. So I decided to do everything the old-fashioned way, with a procedural approach. Besides, my hands were itching, and I already had a clear idea of what I would do and how. I now know this was a mistake: a few hours later I was stuck in the thickets of Russian grammar and spent a lot of time studying it. Then again, there is no guarantee that spending another week on a finite state machine would have given a better result.

So I studied Russian grammar and built my analyzer in parallel. The analyzer rested on several axioms derived from the basic principles of phrase construction in Russian:

- a sentence may contain zero or more subjects;
- several subjects that stand in a row in the same case form a compound subject;
- a simple subject is linked to a predicate in the singular, and a compound subject to a predicate in the plural;
- a predicate may be simple or compound;
- a compound predicate consists of two or more predicates with the same number and tense;
- the secondary members of the sentence (adverbials, attributes, objects and appositions) are determined from their position relative to the main members of the sentence and from matching morphological features such as gender, number and tense.
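
The first three axioms can be sketched in a few lines of Python. This is an illustration under assumptions, not the author's code: the `Word` structure, its tag values ("nom", "sg", "pl") and the function names are hypothetical, and the morphological tags are assigned by hand instead of coming from a morphological engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    text: str
    pos: str                      # "noun", "verb", "conj", ...
    case: Optional[str] = None    # "nom", "acc", ...
    number: Optional[str] = None  # "sg" or "pl"


def find_subject(words: list[Word]) -> list[Word]:
    """Collect nominative nouns standing in a row (conjunctions between them are skipped)."""
    subject: list[Word] = []
    for word in words:
        if word.pos == "noun" and word.case == "nom":
            subject.append(word)
        elif word.pos == "conj" and subject:
            continue
        elif subject:
            break  # the run of subjects has ended
    return subject


def agrees(subject: list[Word], predicate: Word) -> bool:
    """Compound subject needs a plural predicate; a simple one must match in number."""
    if not subject:
        return True  # a sentence may have zero subjects
    if len(subject) > 1:
        return predicate.number == "pl"
    return predicate.number == subject[0].number


cat = Word("cat", "noun", case="nom", number="sg")
dog = Word("dog", "noun", case="nom", number="sg")
sit = Word("sit", "verb", number="pl")
subject = find_subject([cat, Word("and", "conj"), dog, sit])
print([w.text for w in subject], agrees(subject, sit))  # ['cat', 'dog'] True
```

For a simple subject the sketch compares grammatical number rather than insisting on the singular, so that a plural simple subject such as "cats" is also accepted.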

At first, only the simplest sentences were processed: “mom washed the frame”, “the cat and the dog sit on the floor”, “the cat drank milk”. These sentences pose no difficulty, since they contain very few combinations of morphological features. For example, “mom washed the frame” is very easy to analyze: here only the noun in the nominative case (mom) can be the subject. The predicate is the verb “washed”, which agrees with the subject in gender and number. The third word is an object, since it answers an oblique-case question (what?).

The attentive reader may ask: what about the third sentence, about the cat drinking milk, which contains two nouns in the nominative case, one of which can also be a verb? (In the Russian original, «кошка пила молоко», the word «пила» is both the verb “drank” and the noun “a saw”.) Here we are dealing with homonymy, and it is resolved quite simply: for a word that has the features of both a noun and a verb, we look for a noun that agrees with the verb in gender and number, or in number only when the subject is compound or is a simple subject in the plural. In addition, if there is a clarification introduced by a relative or interrogative pronoun (which, that), the homonymy is resolved even more easily: such a homonym is automatically treated as the predicate of the subordinate clause: “the cat that drank milk wanted to sleep” => “the cat (wanted (to sleep), drank (which) milk)”. Thus we get two subject-predicate pairs, the cat drank and the cat wanted, and the sentence itself is considered complex.
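
The agreement rule for homonyms can be sketched as follows. This is a simplified Python illustration with hypothetical names (`Reading`, `pick_reading`) and hand-written tags. It covers only the noun/verb case described above and assumes the candidate list is not empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    pos: str                      # "noun" or "verb"
    number: str                   # "sg" or "pl"
    gender: Optional[str] = None  # None when the form has no gender (e.g. plural past)
    case: Optional[str] = None


def pick_reading(candidates: list[Reading], context: list[Reading]) -> Reading:
    """Prefer the verb reading if some nominative noun in the sentence agrees with it."""
    verb = next((r for r in candidates if r.pos == "verb"), None)
    if verb is not None:
        for noun in context:
            if noun.pos != "noun" or noun.case != "nom":
                continue
            same_number = noun.number == verb.number
            # Gender is compared only when the verb form carries it.
            same_gender = verb.gender is None or noun.gender == verb.gender
            if same_number and same_gender:
                return verb
    # Otherwise fall back to the first non-verb reading (or the first reading at all).
    return next((r for r in candidates if r.pos != "verb"), candidates[0])


cat = Reading("noun", "sg", gender="fem", case="nom")
# The homonym: a noun reading ("a saw") and a verb reading ("drank").
homonym = [Reading("noun", "sg", gender="fem", case="nom"),
           Reading("verb", "sg", gender="fem")]
print(pick_reading(homonym, [cat]).pos)  # verb
```
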
Refining the analyzer further is not particularly difficult: we look for all the secondary members of the sentence, based on their position and on how their morphological features relate to the members already found.

By the way, to make parsing easier, the sentence should be simplified before the parsing itself: adverbs (well, quickly, easily) are tied to the verbs, adverbial participles or adjectives they stand next to and, where necessary, are joined into chains when they are linked by commas or conjunctions (and, or, but not).
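
This simplification step might look roughly like the Python sketch below. It is an assumption-laden illustration: tokens are hand-tagged `(text, pos)` pairs, the tag names ("adv", "verb", "gerund", "adj", "comma", "conj") and function names are hypothetical, and a chain is attached to the head that follows it or, failing that, to the head immediately before it.

```python
HEADS = {"verb", "gerund", "adj"}  # words an adverb may be tied to
LINKS = {"comma", "conj"}          # what may join adverbs into a chain


def flush(out: list, pending: list) -> None:
    """Attach a finished adverb chain to the head on its left, if there is one."""
    if not pending:
        return
    if out and out[-1][1] in HEADS:
        out[-1][2].extend(pending)
    else:
        out.extend([adv, "adv", []] for adv in pending)  # nothing to attach to
    pending.clear()


def attach_adverbs(tokens: list[tuple[str, str]]) -> list[list]:
    """Return [text, pos, adverbs] items with adverb chains folded into heads."""
    out: list[list] = []
    pending: list[str] = []  # adverb chain collected so far
    for i, (text, pos) in enumerate(tokens):
        next_is_adv = i + 1 < len(tokens) and tokens[i + 1][1] == "adv"
        if pos == "adv":
            pending.append(text)
        elif pos in LINKS and pending and next_is_adv:
            continue  # a comma or conjunction inside an adverb chain
        elif pos in HEADS:
            out.append([text, pos, list(pending)])  # chain precedes its head
            pending.clear()
        else:
            flush(out, pending)
            out.append([text, pos, []])
    flush(out, pending)
    return out


tokens = [("dog", "noun"), ("ran", "verb"),
          ("quickly", "adv"), ("and", "conj"), ("easily", "adv")]
print(attach_adverbs(tokens))
# [['dog', 'noun', []], ['ran', 'verb', ['quickly', 'easily']]]
```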

If the analysis of a particular sentence runs into difficulty, the sentence is marked as problematic and its analysis is postponed until the analyzer's second pass, by which time the context of the paragraph is more or less defined. Using this context, it is then possible to identify the problematic parts of speech in that sentence with a fairly high degree of confidence and to add the sentence's own context to the context of the paragraph.
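
The deferral scheme can be sketched in Python as two loops over the paragraph. The names `analyze_paragraph` and `toy_analyze` are hypothetical. The toy analyzer stands in for the real sentence analysis and "fails" only on a sentence that starts with the pronoun "it" while the context is still empty.

```python
def analyze_paragraph(sentences, analyze):
    """Two-pass analysis: sentences that fail on pass one are retried with context.

    `analyze(sentence, context)` returns a set of (subject, predicate) pairs,
    or None when the sentence cannot be resolved yet.
    """
    context = set()
    deferred = []
    for sentence in sentences:  # first pass
        pairs = analyze(sentence, context)
        if pairs is None:
            deferred.append(sentence)  # mark as problematic
        else:
            context |= pairs
    unresolved = []
    for sentence in deferred:  # second pass, with the paragraph context
        pairs = analyze(sentence, context)
        if pairs is None:
            unresolved.append(sentence)
        else:
            context |= pairs  # the sentence's context joins the paragraph's
    return context, unresolved


def toy_analyze(sentence, context):
    """Toy stand-in: the first word is the subject, the second the predicate."""
    words = sentence.rstrip(".").lower().split()
    subject, predicate = words[0], words[1]
    if subject == "it":  # needs a subject from the context
        known = sorted(s for s, _ in context)
        if not known:
            return None
        subject = known[0]
    return {(subject, predicate)}


context, unresolved = analyze_paragraph(["It slept.", "Cat drank milk."], toy_analyze)
print(sorted(context), unresolved)  # [('cat', 'drank'), ('cat', 'slept')] []
```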

Unfortunately, the parser was never finished, although I did manage to implement the analysis of some rather exotic sentences, both simple and complex. I also never built the database of links between words and phrases. The reasons were a loss of interest in the project (I hope to return to it some day) and a lack of time: working on several commercial projects in parallel left me almost no free time for a hobby.

Source: https://habr.com/ru/post/137799/

