Extracting facts

The task of extracting information from text is not new: a good deal of work has been done in this direction, both by large companies such as Yandex and Google and by independent developers. Unfortunately, it cannot yet be said that the problem is fully solved. In this article I want to organize what I know about the subject by briefly surveying the approaches I have recently had to deal with.

Existing Solutions

Suppose we are given the sentence “The visit of the President to Denmark will give a new breath to the dialogue between the two countries.” To extract facts from this sentence, the following steps are needed:

I. Tokenization

At this stage the sentence has to be split into separate words. This should not cause any problems.

II. Morphological analysis

Once the sentence has been split, morphological information (part of speech, gender, case, number, and so on) and additional attributes (for example, whether the word is a personal name or a geographical name) must be obtained for each word. This requires special dictionaries: as a lexicon of Russian words you can use Zaliznyak's dictionary or its proprietary derivatives. As a ready-made solution, consider the Mystem utility from Yandex.
If Mystem is used, the original sentence is parsed as follows (the words and grammatical tags of the Russian original are rendered in English here):

 krestyaninov@localhost # echo "The visit of the President to Denmark will give a new breath to the dialogue of the two countries" | ./mystem -niwg
 Visit{visit=S,m,inan=(nom,sg|acc,sg)}
 President{president=S,m,anim=(gen,sg|acc,sg)}
 in{in=PR=|in=S,abbr=(nom,sg|nom,pl|gen,sg|gen,pl|dat,sg|dat,pl|acc,sg|acc,pl|ins,sg|ins,pl|abl,sg|abl,pl)}
 Denmark{Denmark=S,geo,f,inan=acc,sg}
 will give{give=V=inpraes,sg,indic,3p,pf}
 new{new=S,n,inan=(nom,sg|acc,sg)|new=A=(nom,sg,plen,n|acc,sg,plen,n)}
 breath{breath=S,n,inan=(nom,sg|acc,sg)}
 dialogue{dialogue=S,m,inan=dat,sg}
 two{two=NUM=(gen|acc,f,anim|acc,m,anim|abl)}
 countries{country=S,f,inan=gen,pl}
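Output of this kind is easy to turn into a data structure. Below is a minimal sketch, assuming each line has the shape `word{lemma=POS,tags|lemma=POS=tags}` with alternative readings separated by `|`. The names `Reading`, `split_top_level` and `parse_line` are illustrative and are not part of Mystem; the grammatical tags are kept as an unparsed string.

```python
import re
from dataclasses import dataclass


@dataclass
class Reading:
    lemma: str
    pos: str   # part-of-speech tag, e.g. "S", "A", "V", "PR"
    tags: str  # remaining grammatical tags, left unparsed


LINE_RE = re.compile(r"^\s*(?P<word>[^{]+?)\s*\{(?P<body>.*)\}\s*$")
POS_RE = re.compile(r"[^,=]*")


def split_top_level(text: str, sep: str = "|") -> list[str]:
    """Split on sep, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_line(line: str) -> tuple[str, list[Reading]]:
    """Parse one output line into the surface word and its candidate readings."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    readings = []
    for chunk in split_top_level(match["body"]):
        compact = "".join(chunk.split())           # tolerate stray whitespace
        lemma, _, grammar = compact.partition("=")
        pos = POS_RE.match(grammar).group(0)       # text before the first "," or "="
        tags = grammar[len(pos):].lstrip(",=")
        readings.append(Reading(lemma, pos, tags))
    return match["word"], readings


word, readings = parse_line(
    "new{new=S,n,inan=(nom,sg|acc,sg)|new=A=(nom,sg,plen,n|acc,sg,plen,n)}"
)
print(word, [r.pos for r in readings])
```

A word with more than one entry in `readings` is morphologically ambiguous and has to be resolved at a later stage.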

III. Syntax parsing

At this stage, groups of related words in the sentence are identified, for example the SUBJECT-PREDICATE pair (“visit”, “will give”). Establishing these relations makes it possible to resolve the ambiguities left by morphological analysis. For example, the phrase “new breath” makes it clear that “new” is an adjective and not a noun, which morphological analysis alone could not decide.
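The “new breath” case can be sketched with a simple agreement rule. This is only a toy heuristic under stated assumptions, not how a real syntactic parser works: each reading carries a set of possible (case, number, gender) triples, and the adjective reading is preferred when the next word has a noun reading that shares at least one triple with it. The `Reading` class and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    pos: str           # "S" for noun, "A" for adjective
    forms: frozenset   # possible (case, number, gender) triples


def agrees(adjective: Reading, noun: Reading) -> bool:
    """True if the two readings share at least one (case, number, gender) triple."""
    return bool(adjective.forms & noun.forms)


def disambiguate(readings, next_readings):
    """Keep the adjective readings that agree with a following noun, if there are any."""
    nouns = [r for r in next_readings if r.pos == "S"]
    adjectives = [
        r for r in readings
        if r.pos == "A" and any(agrees(r, n) for n in nouns)
    ]
    # Fall back to the full list when the rule does not apply.
    return adjectives or list(readings)


NEUTER_NOM_ACC = frozenset({("nom", "sg", "n"), ("acc", "sg", "n")})
new_as_noun = Reading("new", "S", NEUTER_NOM_ACC)
new_as_adjective = Reading("new", "A", NEUTER_NOM_ACC)
breath = Reading("breath", "S", NEUTER_NOM_ACC)

print(disambiguate([new_as_noun, new_as_adjective], [breath]))
```

Here only the adjective reading of “new” survives, because it agrees with the noun “breath” that follows it.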

IV. Semantic parsing

The task of semantic analysis is to build a complete tree of the relations between the words in a sentence. This process has many subtleties and is, in general, too complicated to describe in this article. More information about semantic analysis can be found here.

V. Extracting facts

Once the tree of word relations in the sentence has been built, we can proceed directly to extracting facts. The following techniques can be used:
- Search by anchor element: a particular word (for example, “President”) is located in the text, and a fact is built around it from the relation tree;
- Search by pattern: data is found with a regular expression (for example, to pick out dates);
- Search by ontology: data is found using predicate rules written in a special language. An example.
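The first two techniques can be sketched in a few lines. Both parts rest on assumptions made for illustration only: the relation tree is a hand-written dictionary with made-up relation names (a real one would come from the parsing stages), and dates are assumed to be written as `dd.mm.yyyy`.

```python
import re
from datetime import date

# Hypothetical relation tree for the example sentence: head -> [(relation, dependent)].
TREE = {
    "give": [("subject", "visit"), ("object", "breath"), ("recipient", "dialogue")],
    "visit": [("who", "President"), ("where", "Denmark")],
}


def facts_for(tree, anchor):
    """Search by anchor element: return (head, relation, dependent) triples involving the anchor."""
    return [
        (head, relation, dependent)
        for head, edges in tree.items()
        for relation, dependent in edges
        if anchor in (head, dependent)
    ]


# Search by pattern: dates in the assumed dd.mm.yyyy format.
DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_dates(text):
    """Return every well-formed calendar date found in the text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # matches the pattern but is not a real date, e.g. 31.02.2010
    return found


print(facts_for(TREE, "President"))
print(find_dates("The visit took place on 12.05.2010."))
```

Searching for “President” returns the single relation that links it to “visit”; from there the tree can be followed further to find where the visit was made.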

More information about fact extraction can be found in the following materials:
- Presentation “Yandex.Press-portraits”;
- Presentation “Automatic extraction of facts from the text”.

Existing Solutions

Of the existing fact extraction systems that work acceptably, the only one I managed to find is GATE. The system is quite interesting and promising but, unfortunately, has no native support for Russian. You can try it out with the help of this manual.

Among paid systems, the Russian-developed RCO is worth noting (thanks to rg_software).

PS

As mentioned above, this article is a compilation of the material on information extraction that I was able to find and understand. If you have information or ideas on the topic, I will be glad to see them in the comments.

Source: https://habr.com/ru/post/93641/
