
Natural language analysis: under the hood



Parser API


This is a continuation of my previous post. It is time to focus on the details of the parser's internals. I chose Go as the implementation language because I wanted a parallel (in the sense of using all available CPU cores) and productive tool at a low cost, without diving into the low-level C++ abyss.

The resulting code provides the following API:
    type Attribute struct {
        Name  string
        Value string
    }

    type ParseMatch struct {
        Text            string
        Nonterminal     string
        Rule            string
        Attributes      []Attribute
        Submatches      []ParseMatch
        Hypotheses      []string
        HypothesisCount uint
    }

    func Parse(text, nonterminal string, hypotheses_limit uint) []ParseMatch

A ParseMatch refers to child objects of the same type (Submatches), which correspond to the non-terminals or lexical terminals of the matched rule. In general, due to the ambiguity inherent in natural languages, several parses may correspond to the same text (for example, because of homonyms). The Parse function therefore returns a set of ParseMatch objects. This parsing ambiguity is meant to be resolved at the next (semantic) level of text analysis.
So, the Parse function takes text, the text to parse; nonterminal, the name of the non-terminal to match (for example, "sentence"); and hypotheses_limit, the maximum number of hypotheses (more on this just below). The nonterminal parameter may be empty, in which case the beginning of the text is matched against the lexical terminals found in the morphological database.

In terms of this analyzer, a hypothesis is the assumption that a violated attribute-value constraint is caused by an accidental error in the text. If the analyzer encounters a mismatch between an attribute value and a constraint imposed by the rule currently under consideration, and the number of hypotheses already made has not reached hypotheses_limit, the mismatch is ignored. Otherwise, the rule in question is discarded. This mechanism is convenient for debugging rules, but should be avoided in real work, because it slows parsing down monstrously.
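
For illustration, here is a minimal sketch of calling this API; the sample sentence and the printMatch helper are mine, not from the original post, and the morphological database is assumed to be already initialized (see below):

    import (
        "fmt"
        "strings"
    )

    // printMatch is a helper (mine, not from the post) that prints a
    // ParseMatch tree with indentation; each level corresponds to a
    // submatch of the parent rule.
    func printMatch(m ParseMatch, depth int) {
        fmt.Printf("%s%q  nonterminal=%q  rule=%q\n",
            strings.Repeat("  ", depth), m.Text, m.Nonterminal, m.Rule)
        for _, sub := range m.Submatches {
            printMatch(sub, depth+1)
        }
    }

    func demoParse() {
        // Each element of the returned slice is one alternative parse
        // of some prefix of the text; ambiguity shows up as multiple
        // elements. hypotheses_limit = 0 forgives no constraint
        // violations, which is the recommended mode outside debugging.
        for _, m := range Parse("мама мыла раму", "sentence", 0) {
            printMatch(m, 0)
        }
    }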

Nonterminal matching


Before diving into the description of this process, it is worth noting that, in general, it is not the entire text that is matched, but only its beginning. The analyzer finds all possible initial parts of the text that correspond to the given non-terminal (or all lexical terminals at the beginning of the text).

For this non-terminal, the analyzer extracts all the relevant rules from the grammar and tries to find a match for each of them. For each such rule, the analyzer first checks whether it starts with an attribute-value assignment ("@{...}", which must come first in a rule). Then the recursive parseRulePart function is called, which matches the parts of the rule one by one; a rough sketch of this recursion is given below.
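
The original post does not include the body of parseRulePart, so the following is only a rough sketch of how such a recursive matcher could look; the Rule and RulePart types and all logic details are my assumptions:

    // These two types are my assumptions; the grammar representation
    // in the original code may differ.
    type RulePart struct {
        Nonterminal string // empty means "match a lexical terminal"
    }

    type Rule struct {
        Parts []RulePart
    }

    func parseRulePart(text string, rule *Rule, index int,
        match ParseMatch, hypotheses_limit uint) []ParseMatch {

        if index == len(rule.Parts) {
            // Every part matched: the accumulated match covers the rule.
            return []ParseMatch{match}
        }

        var matches []ParseMatch
        // Each way of matching the current part spawns its own branch;
        // an empty Nonterminal makes Parse try lexical terminals instead.
        for _, sub := range Parse(text, rule.Parts[index].Nonterminal,
            hypotheses_limit) {

            next := match
            next.Text = match.Text + sub.Text
            next.Submatches = append(append([]ParseMatch(nil),
                match.Submatches...), sub)
            // Attribute constraints of the rule would be checked here,
            // possibly spending a hypothesis on a mismatch.

            matches = append(matches, parseRulePart(text[len(sub.Text):],
                rule, index+1, next, hypotheses_limit)...)
        }
        return matches
    }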


Matching lexical terminals


On the one hand, lexical terminals can easily be extracted from the text by searching for the characters that separate words from each other (space, comma, and so on). On the other hand, there are stable phrases that contain such characters. Treated as indivisible lexical units, they may have different attribute values and even a different meaning (which is especially important for the subsequent semantic analysis). Matching lexical terminals therefore has to take both cases into account; a sketch is given below.
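
Here is a sketch of what this could look like in code; the separator set, the matchTerminals name, and the early-termination remark are my assumptions, with FindTerminals taken from the API described below:

    import "strings"

    // wordSeparators is an assumed set of word-delimiting characters.
    const wordSeparators = " ,.!?;:"

    // matchTerminals is a sketch of extracting candidate lexical
    // terminals from the beginning of text: it tries the prefix ending
    // at every separator, so that stable phrases spanning separators
    // (for example, "потому что") can be found as single units.
    func matchTerminals(text string) []ParseMatch {
        var matches []ParseMatch
        for i := 0; i <= len(text); i++ {
            if i < len(text) &&
                !strings.ContainsRune(wordSeparators, rune(text[i])) {
                continue
            }
            separator := ""
            if i < len(text) {
                separator = string(text[i])
            }
            // The morphological database decides which candidates are
            // real: both the bare prefix (separator discarded) and the
            // stable phrase that keeps the separator are tried.
            matches = append(matches, FindTerminals(text[:i], separator)...)
            // A real implementation would stop extending the prefix
            // once no stable phrase in the database can start with it.
        }
        return matches
    }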


Morphological database


I use the morphological dictionary of the Russian language downloaded from OpenCorpora. To be usable by the parser, this data must be compacted and indexed. At first I tried SQLite for this purpose, but the resulting solution had unsatisfactory performance (even with caching turned on). So I had to implement a specialized morphological database. Measurements showed roughly a 9-fold speedup of lookups and a more than 300-fold speedup of the initial indexing.

The file format of the morphological database is fairly straightforward: a header plus two large parts. The first part consists of word forms separated by the '\0' character. The second part is a sorted array of structures, each containing a 32-bit offset of the corresponding word form within the first part and its attributes packed into 32-bit values. For high search speed, both parts of the morphological database are loaded into RAM in full.
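
The post does not show the corresponding structures, so here is a minimal in-memory sketch of the described layout; all names are mine, and the header is omitted:

    import "sort"

    // wordFormEntry mirrors the record layout described above: a 32-bit
    // offset into the text part plus attributes packed into 32 bits.
    type wordFormEntry struct {
        TextOffset uint32
        Attributes uint32
    }

    // morphBase holds both parts of the database fully in RAM.
    type morphBase struct {
        text    []byte          // word forms separated by '\0'
        entries []wordFormEntry // sorted by the word form they point to
    }

    // wordAt extracts the '\0'-terminated word form at a given offset.
    func (m *morphBase) wordAt(off uint32) string {
        end := off
        for m.text[end] != 0 {
            end++
        }
        return string(m.text[off:end])
    }

    // lookup binary-searches the sorted entry array for a word form and
    // returns the packed attributes of every matching entry.
    func (m *morphBase) lookup(form string) []uint32 {
        i := sort.Search(len(m.entries), func(i int) bool {
            return m.wordAt(m.entries[i].TextOffset) >= form
        })
        var attrs []uint32
        for ; i < len(m.entries) &&
            m.wordAt(m.entries[i].TextOffset) == form; i++ {
            attrs = append(attrs, m.entries[i].Attributes)
        }
        return attrs
    }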

The resulting code provides the following API:
    func BuildMorph(txt_filename, morph_filename string) error
    func InitMorph(morph_filename string) error
    func FinalizeMorph()
    func FindTerminals(prefix, separator string) []ParseMatch

The FindTerminals function takes an additional separator parameter because, in the case of a stable phrase, the separator is an integral part of it; otherwise the separator is unnecessary and must be discarded.
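
For example (assuming InitMorph has already been called; the word is my own choice), a lookup for the word "потому" followed by a space could return both the bare word and stable phrases such as "потому что" that keep the separator:

    import "fmt"

    func demoFindTerminals() {
        for _, m := range FindTerminals("потому", " ") {
            // m.Text keeps the separator only if it is part of a
            // stable phrase found in the morphological database.
            fmt.Println(m.Text, m.Attributes)
        }
    }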

Parse cache


Even after speeding up the lookup of word forms, overall parsing performance left much to be desired (parsing ordinary sentences took several seconds). Since the structure of the developed grammar involves matching the same text against the same non-terminals many times (there are many similar rules), caching suggested itself. So before starting the actual parsing, the Parse function checks whether the same parse has already been performed and stored in the cache.

Since the cache is based on a hash table, a protecting mutex is required to support parallel access. However, a single global mutex can become a bottleneck, so the cache is divided into the required number of banks, each of which consists of a hash table and its own protecting mutex.
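
A minimal sketch of such a banked cache, assuming the key is built from the text and the non-terminal and that doParse is the uncached parsing path; the bank count is an arbitrary example:

    import (
        "hash/fnv"
        "sync"
    )

    const bankCount = 64 // an assumed number of banks

    // cacheBank pairs a hash table with its own mutex, so goroutines
    // hitting different banks never contend with each other.
    type cacheBank struct {
        mu    sync.Mutex
        table map[string][]ParseMatch
    }

    var banks [bankCount]cacheBank

    func init() {
        for i := range banks {
            banks[i].table = make(map[string][]ParseMatch)
        }
    }

    // bankFor picks a bank by hashing the cache key.
    func bankFor(key string) *cacheBank {
        h := fnv.New32a()
        h.Write([]byte(key))
        return &banks[h.Sum32()%bankCount]
    }

    func cachedParse(text, nonterminal string,
        hypotheses_limit uint) []ParseMatch {

        key := nonterminal + "\x00" + text
        b := bankFor(key)

        b.mu.Lock()
        if matches, ok := b.table[key]; ok {
            b.mu.Unlock()
            return matches
        }
        b.mu.Unlock()

        matches := doParse(text, nonterminal, hypotheses_limit)
        b.mu.Lock()
        b.table[key] = matches
        b.mu.Unlock()
        return matches
    }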

Using the analyzer


To use the developed analyzer, you first need to build the morphological database with BuildMorph, load it with InitMorph, and then call Parse, as sketched below.
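
An end-to-end sketch under those assumptions (the file names are placeholders; the post does not show how the grammar itself is loaded, so that step is omitted):

    package main

    import (
        "fmt"
        "log"
    )

    func main() {
        // One-time step: convert the OpenCorpora text dump into the
        // compact indexed format.
        if err := BuildMorph("dict.opcorpora.txt", "dict.morph"); err != nil {
            log.Fatal(err)
        }

        // Load the morphological database into RAM.
        if err := InitMorph("dict.morph"); err != nil {
            log.Fatal(err)
        }
        defer FinalizeMorph()

        // Parse with hypotheses disabled (hypotheses_limit = 0).
        for _, m := range Parse("мама мыла раму", "sentence", 0) {
            fmt.Printf("%q matched rule %q\n", m.Text, m.Rule)
        }
    }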

Source: https://habr.com/ru/post/255711/

