
How to use Tomita-parser in your projects. Practical course

Hi, my name is Natalya. I work at Yandex as a developer on the fact extraction team. In the spring we talked about what Tomita-parser is and how it is used at Yandex, and this fall the parser's source code will be released to the public.

In the previous post, we promised to tell you how to use the parser and about the syntax of its internal language. That is what today's post is about.





After reading this post, you will learn how dictionaries and grammars for Tomita are written, and how to use them to extract facts from natural language texts. The same information is also available as a short video course.







A grammar is a set of rules that describe chains of words in a text. For example, the sentence “I like it that you are not sick with love for me” can be described by the chain [first-person singular pronoun], [present-tense third-person verb], [comma], [conjunction], and so on.



A grammar is written in a special formal language. Structurally, a rule is divided by the symbol -> into a left and a right part. The left part contains exactly one non-terminal, while the right part may consist of both terminals and non-terminals. A terminal in this context is an object with a specific, immutable meaning. The set of terminals is the alphabet of the Tomita language, from which everything else is built. The terminals in Tomita are lemmas (words in their initial form, written in single quotes), parts of speech (Noun, Verb, Adj, ...), punctuation marks (Comma, Punct, Hyphen, ...) and some other special symbols (Percent, Dollar, ...). There are about twenty terminals in total; the complete list is given in our documentation. Non-terminals are built up from terminals; if we draw an analogy with natural languages, they are something like words assembled from the letters of the alphabet. For example, a non-terminal NounPhrase consisting of the two terminals Adj and Noun denotes a chain of two words: first an adjective, then a noun.
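Written as a rule, this looks as follows (NounPhrase is an illustrative name of our own, not a built-in terminal; a minimal sketch):

 NounPhrase -> Adj Noun;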



To write our first grammar, create a file with the .cxx extension; let's call it first_grammar.cxx. You can save it in the same place where the parser binary itself is located. In the first line of the grammar file you need to specify the encoding:



 #encoding "utf8" 

Then you can write the rules. In our first grammar there will be two:



 PP -> Prep Noun;
 S -> Verb PP;

The first rule describes the non-terminal PP, a prepositional phrase consisting of a preposition and a noun (Prep Noun). The second describes a verb with a prepositional phrase (Verb PP). Here the non-terminal S is the root, because it is never mentioned on the right side of any rule. The root non-terminal describes the entire chain that we want to extract from the text.
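To make the structure concrete, here is a hand-traced derivation of one chain these two rules match (the words are an illustrative example of our own):

 // S -> Verb PP -> Verb Prep Noun
 // e.g. "go" (Verb) + "to" (Prep) + "mountains" (Noun)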



Our first grammar is ready, but before running the parser a few more steps are needed. The point is that a grammar interacts with the parser not directly but through the root dictionary, an entity that gathers information about all the grammars, dictionaries, auxiliary files, and so on created in the project. In other words, the root dictionary is an aggregator of everything created within the project. Dictionaries for Tomita-parser are written in a syntax similar to Google Protobuf (using a modified version of the Protobuf compiler, with support for inheritance); the files are usually given the .gzt extension. Create the root dictionary dic.gzt and, at the beginning, specify the encoding here as well:



 encoding "utf8"; 

After that, we import into the root dictionary the files containing the base types used in dictionaries and grammars. For convenience, these files are built into the parser binary, so we can import them directly, without specifying a path:



 import "base.proto"; import "article_base.proto"; 

Next we create an article, a dictionary entry that describes how to select a chain of words in the text. A grammar is one of the possible ways. A chain can also be selected with a list of keywords or with algorithms built into the parser (chains of names and dates); other methods can be added at the level of the parser's source code (for example, a statistical named entity recognizer). An article consists of a type, a title and content. I will discuss what types of articles exist and what they are for below, when we talk about dictionaries in more detail. For now we will use the basic type TAuxDicArticle. The title of the article must be unique; it is written in quotes after the type. Then, in curly braces, the keys are listed: the content of the article. In our case the only key contains a reference to the grammar we wrote. First we specify the syntax of the file we refer to (for a grammar file it is always tomita) and the path to that file, then, in the type field, the type of the key (it must be specified when the key contains a link to a grammar).



 TAuxDicArticle "_" { key = {"tomita:first_grammar.cxx" type=CUSTOM} } 

To tell the parser where to take the source text from, where to write the result, which grammars to run, which facts to extract, and other necessary information, we create a single configuration file with the .proto extension. Create a config.proto file in the parser folder. As usual, we specify the encoding at the beginning and proceed to describe our configuration.



The only required parameter of the configuration file is the path to the root dictionary, written in the Dictionary field. All other parameters are optional. Information about the input is given in the Input field: besides text files, Tomita can also take a folder, an archive or stdin as input. The Output field records where and in what format (text, xml or protobuf) the extracted facts should be saved. We will use the file input.txt as input. The Articles field lists the grammars we want to run. Note that here we specify not the grammar file itself but the title of the dictionary article that contains the link to that file: as mentioned above, the parser interacts with all project files indirectly, through the root dictionary.



 encoding "utf8"; TTextMinerConfig { Dictionary = "dic.gzt"; Input = {File = "input.txt"} Output = {File = "output.txt" Format = text} Articles = [ { Name = "_" } ] } 

Now that the configuration file is ready, all that remains is to put a text file for analysis next to the binary (you can use our test file or take your own), and we can launch the grammar. In the terminal, go to the folder where the parser is located. The parser is started with a single argument, the name of the configuration file. On *NIX systems the launch command looks like this:



 ./tomitaparser config.proto 

The results can be found in the file output.txt. However, we will not see any extracted facts there, because our grammar contains only rules for selecting chains; for the selected chains to turn into structured facts, we have to add an interpretation procedure, which we will discuss below. We can, however, look at the selected chains already at this stage; for that we add one more parameter to the configuration file, the debug output:



 PrettyOutput = "pretty.html" 

Thanks to this parameter, the parser's results will also be written to an HTML file with a more visual representation. Now, if we rerun the grammar and open the pretty.html file that appears in the folder, we will see that we have extracted all the chains described in the grammar, that is, verbs followed by a noun with a preposition:



Result
to go on the slats
stop near dukhan
stay overnight
walk up to him
go to Stavropol
move with
take on vodka
come on line
do against the highlanders
reckon in the third
follow the day
to be in the south
go to the mountain
look back on the valley
demand vodka
look at the captain
bump into a cow
shelter by the fire
be in Chechnya
step back
pull out of the suitcase
regret
go out in front
put on trial
to be visiting
stand in a fortress
settle in a fortress
walk on a boar
burst out laughing
be in it
be on money
clap
look at this
run after the owner
to become in the hut
go on the air
go to the mountains
Wade along fence
to be behind the Terek
ride with abreks
jump over stumps
follow the tracks
hang on the front
fly into the ravine
kill to death
drag on the steppe
run along the shore
fly from under the hooves
shine in the dark
ring against chain mail
hit the fence
rush into the stable
grab your guns
spin among the crowd
put in a stranger
talk about something else
come from love
jump in the village
leave the fortress
change in the face
jump over a gun
gallop on dashing
snatch from cover
fall to the ground
come to the fortress
ride on it
go to the village
go to him
stand at a dead end
stand at a dead end
sit in the corner
wither in captivity
look out the window
sit on a stove bench
go to him
hit hands
to be able to crack
dream in a dream
wait by the road
be at dusk
dive from the bush
see from the hillock
chomp in the snow
get out of the hut
go out as
to set off
to get exhausted
lead to heaven
disappear into the cloud
rest on top
crunch under your feet
surging into the head
fall away from the heart
climb the Good Mountain
get off the shelf
descend from Hood Mountain
come from the word
fall under your feet
turn into ice
hide in the fog
beat the bars
stop in the weather
give vodka
play out on the cheeks
announce death
wash for the boars
go beyond the serf
walk around the room
sit on the bed
drag in the mountains
fall on the bed
be in September
walk along the serf
sit on the sod
to be from the shaft
sit on the corner
stand still
stand on stirrups
come back from hunting
to be behind the river
bet
change to this
carry on the hunt
yearn for home
put in that
leave custody
go to America
die on the road
to be in the capital
to come from drinking
to be a wonder
dart through the reeds
go to the reeds
get together
specify in the field
tear from the saddle
compare with Pechorin
stick with a gun
fall on your knees
to hold on hands
climb a cliff
jump off the horses
pour out of the wound
be memoryless
put him
send for a doctor
get out of the fortress
sit on a rock
drag in the bushes
jump on horse
sit by the bed
turn to the wall
want to go to the mountains
meet soul
will be in paradise
come to the idea
die in that
kneel down
go to serf
die with grief
sit on the ground
run over the skin
to bury behind the fortress
go to Georgia
return to Russia
part with maxim


The parser tries to normalize the extracted chains, reducing the head word of the chain (by default, the first one) to its initial form.



The next step is to introduce an interpretation procedure, i.e. the conversion of extracted chains into facts.



First we need to describe the structure of the fact we want to extract, i.e. which fields it consists of. To do this, create a new file fact_types.proto. Again we import the files with the base types and then proceed to describe the fact. After the word message come the name of the fact, a colon, and the base fact type from which our fact type is inherited. Then, in curly braces, we list the fields of our fact. In our case there is a single field: it is required, has the string type, is called Field1, and is assigned the identifier 1.



 import "base.proto"; import "facttypes_base.proto"; message Fact: NFactType.TFact { required string Field1 = 1; } 

Now we need to import the file we created into the root dictionary (dic.gzt):



 import "fact_types.proto"; 

Let us return to the grammar, where the interpretation procedure itself takes place. Suppose we want to extract the following fact from the text: verbs that govern nouns with a preposition. To do this, we write interp in the rule after the Verb terminal, and then in parentheses the name of the fact and, separated by a dot, the field into which we want to put the extracted chain.



  S -> Verb interp (Fact.Field1) PP; 

Interpretation can occur anywhere in the grammar, but the fact is extracted only if the interpreted symbol ends up inside the root non-terminal.
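For example, interp could just as well stand inside the PP rule; the fact would still be extracted, because PP ends up inside the root non-terminal S (a sketch; here Field1 would receive the noun instead of the verb):

 PP -> Prep Noun interp (Fact.Field1);
 S -> Verb PP;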



The last thing needed before launching is to specify in the configuration file which facts we want to extract when the parser runs. The syntax here is the same as for listing the grammars to run: all required facts are listed in square brackets in the Facts field. In our case there is only one fact so far:



 Facts = [ { Name = "Fact" } ] 
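Putting it together, the configuration file now looks roughly like this (assembled from the fragments above):

 encoding "utf8";
 TTextMinerConfig {
     Dictionary = "dic.gzt";
     Input = { File = "input.txt" }
     Output = { File = "output.txt" Format = text }
     Articles = [ { Name = "_" } ]
     Facts = [ { Name = "Fact" } ]
 }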

Now you can run the parser again.



Result
go
stay
stay
come up
go
move
to take
come
put it on
be considered
follow
to happen
go
look back
require
look
to stumble
take shelter
be
move away
pull out
regret
go out
give away
be
stand
settle
walk
to tear
be
be
clap
to look
run
become
go out
lie down
Wade through
be
ride
jump
to run
hang
fly
to be killed
to reach out
run
fly
shine
to ring
hit
rush
grapple
spin
put it on
to speak
to happen
to jump
move out
to change
jump over
to ride
snatch
to tumble down
to come
ride off
go
to go
become
become
sit
to fade away
look in
sit
come in
bump
be able to
to dream
to wait
be
dive
see
chill out
go out
go out
to move
get out
to lead
disappear
relax
crunch
surging
fall away
climb
get off
to go down
take place
fall through
turn into
to hide
to fight
stop
to give
play out
to announce
undermine
go out
walk
sit
to drag off
to fall
be
walk around
sit down
be
sit
stand
get up
return
be
to fight
to change
spend
to yearn
put it on
go out
to go
die
to happen
take place
be
dart
get away
get together
indicate
tear
compare
attach themselves
to fall
Keep
climb
jump off
pour
be
to plant
send
go out
sit down
to drag
jump
sit
turn away
want
to meet
will be
to come
die
become
to go
die
sit down
run through
bury
to leave
come back
break down

Additional grammar features


Now let's set ourselves a harder task: we will try to write a grammar that extracts street names from text. We will search for text descriptors (the words street, highway, avenue, etc.) and analyze the chains that stand next to them. The chains must begin with a capital letter and be located to the left or right of the descriptor. Create a new grammar file address.cxx and save it in the project folder. Right away, add an article with the new grammar to the root dictionary:

 TAuxDicArticle "" { key = {"tomita:address.cxx" type=CUSTOM} } 

Now we add to the fact_types.proto file the new Street fact that we want to extract. It will consist of two fields: a required one (the street name) and an optional one (the descriptor).



 message Street: NFactType.TFact {
     required string StreetName = 1;
     optional string Descr = 2;
 }

Before moving on to writing the grammar itself, we need to introduce a few new concepts that we have not touched on before.



The first concept is operators. They allow a more convenient, abbreviated notation for grammar rules: | separates alternatives, + means the preceding element repeats one or more times, * means zero or more times, and parentheses mark an optional element; see the sketch below.
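A minimal sketch of these operators in use (the non-terminal names are illustrative, not part of our project):

 NP -> Adj* Noun;       // zero or more adjectives before a noun
 NounChain -> Noun+;    // one or more nouns in a row
 VP -> Verb (Adv);      // parentheses: an optional adverb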


Let's move on to writing the grammar. In the address.cxx file we will write two rules: the first describes the StreetW non-terminal, which contains the names of several street descriptors (the Russian words for street, highway, avenue and lane), and the second the StreetSokr non-terminal with their abbreviations.

 #encoding "utf8" StreetW -> '' | '' | '' | ''; StreetSokr -> '' | '' | '-' | '' | ''; 

Next, we add a StreetDescr non-terminal that combines the two previous ones:



 StreetDescr -> StreetW | StreetSokr; 

Now we need to describe the chains that, when they stand next to a descriptor, can be street names. For this we introduce two more concepts: labels (restrictions) and agreement labels.



Labels refine the properties of terminals and non-terminals, i.e. they impose restrictions on the set of chains that a terminal or non-terminal describes. They are written in angle brackets after the terminal or non-terminal and, in the case of non-terminals, apply to the syntactically main word of the group. Labels vary in structure: some are unary operators, others have a field that can be filled with different values. Among the labels we will use later are gram (requires given grammatical features), kwtype (requires the word to match a dictionary article), rt (marks the syntactic head) and h-reg1 (requires a capitalized first letter); a full list can be found in our documentation.
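To give a feel for the notation, here is a sketch of a rule with labels (the label names come from the Tomita documentation; combining them this way is our illustration):

 StreetName -> Adj<gnc-agr[1]> Word<gnc-agr[1], rt, h-reg1>;
 // gnc-agr[1] : the adjective and the word must agree in gender, number and case
 // rt         : marks this word as the syntactic head of the chain
 // h-reg1     : the word must begin with a capital letter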



Source: https://habr.com/ru/post/225723/

