
Tomita Parser: a Yandex parser for everyone

Yandex continues to develop its APIs, and the latest result is a new parser. Tomita Parser is a tool for extracting structured data (facts) from natural-language text. Facts are extracted using context-free grammars and keyword dictionaries. The parser lets you write your own grammars, add your own dictionaries, and run them on your texts.

Tomita Parser uses user-written patterns (context-free grammars) to extract chains of words, or facts broken down into fields, from text. For example, you can write patterns that pick out addresses. Here the fact is the address, and its fields are “city name”, “street name”, “house number”, and so on. The parser includes three standard linguistic processors: a tokenizer (splitting into words), a segmenter (splitting into sentences) and a morphological analyzer (mystem). The main components of the parser are a gazetteer, a set of context-free grammars, and a set of descriptions of the fact types that these grammars produce as a result of the interpretation procedure.
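To make the idea of a fact with fields concrete, here is a minimal sketch in Python. It is only an illustration, not Tomita Parser's own fact description format: the class name `AddressFact` and its field names are hypothetical, chosen to mirror the address example above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AddressFact:
    """Hypothetical fact type: one address and its fields."""
    city: Optional[str] = None          # "city name"
    street: Optional[str] = None        # "street name"
    house_number: Optional[str] = None  # "house number"


# A pattern that matches "Moscow, Tverskaya street, 7" could fill the fields
# like this; a field the pattern did not find simply stays None.
fact = AddressFact(city="Moscow", street="Tverskaya", house_number="7")
print(asdict(fact))
```

Every fact of the same type has the same set of fields, so the output of a grammar can be handled as ordinary structured records.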

How the parser works on one sentence and one grammar

1. The parser looks for occurrences of all keys from the gazetteer. If a key consists of several words (for example, “Nizhny Novgorod”), a new artificial word is created, which we call a “multiword”.

2. Of all the gazetteer keys found, only those mentioned in the grammar are selected.

3. Among the selected keys, multiwords may overlap with each other or contain single-word keys. The parser tries to cover the sentence with non-overlapping keys so that they capture as large a part of the sentence as possible.

4. The linear chain of words and multiwords is fed to the input of the GLR parser. Grammar terminals are mapped to the input words and multiwords.

5. On the resulting sequence of terminal sets, the GLR parser builds all possible parses. Of all the parses built, those that cover the sentence as widely as possible are selected.

6. The parser then runs the interpretation procedure on the constructed syntax tree. It selects specially marked subnodes, and the words that correspond to them are written into the fields of the facts generated by the grammar.
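The first steps of this algorithm can be sketched in Python. This is a simplified illustration, not the parser's actual implementation: the function names are hypothetical, the gazetteer is assumed to already contain only the keys mentioned in the grammar (step 2), and the cover in step 3 is chosen with a simple longest-key-first greedy rule, which may differ from what Tomita Parser really does.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indexes, end exclusive


def find_keys(tokens: List[str], gazetteer: Set[Tuple[str, ...]]) -> List[Span]:
    """Step 1: find every occurrence of every gazetteer key in the sentence."""
    spans: List[Span] = []
    max_len = max((len(key) for key in gazetteer), default=0)
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            if tuple(t.lower() for t in tokens[start:end]) in gazetteer:
                spans.append((start, end))
    return spans


def choose_cover(spans: List[Span]) -> List[Span]:
    """Step 3 (greedy stand-in): take longer keys first, skip overlapping ones."""
    chosen: List[Span] = []
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= c_start or start >= c_end for c_start, c_end in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def build_chain(tokens: List[str], cover: List[Span]) -> List[str]:
    """Step 4 input: plain words, with each multiword merged into one token."""
    ends: Dict[int, int] = dict(cover)
    chain: List[str] = []
    i = 0
    while i < len(tokens):
        end = ends.get(i, i + 1)
        chain.append("_".join(tokens[i:end]))
        i = end
    return chain


tokens = ["He", "moved", "to", "Nizhny", "Novgorod", "yesterday"]
gazetteer = {("nizhny", "novgorod"), ("novgorod",)}

spans = find_keys(tokens, gazetteer)   # both keys are found and they overlap
cover = choose_cover(spans)            # only the two-word key survives
chain = build_chain(tokens, cover)
print(chain)
```

Here the single-word key “Novgorod” lies inside the multiword “Nizhny Novgorod”, so only the longer key is kept and the chain passed on to the GLR parser contains one artificial token for it.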

What tasks can be solved? For example, extracting structured information about famous people: their dates and places of birth, the educational institutions where they studied, and so on. This is arguably the first serious text analyzer to be made freely available for solving new applied linguistic tasks in text processing. Developers have yet to realize the full power of this toolkit, but it is already clear that it will breathe new life into website development.

Source: https://habr.com/ru/post/175099/

