
Splitting text into sentences using Tomita Parser

To prepare Russian-language texts for further analysis, I once needed to split them into sentences. Automatically, of course. What comes to mind first if you need to divide a text into sentences? Splitting on periods, right?

If you try, you will quickly discover that the period is not always a sentence separator (abbreviations such as "i.e.", "etc.", "e.g.", or names like "S.T.A.L.K.E.R."). Moreover, these tokens are not always exceptions when splitting text into sentences. For example, "etc." may occur in the middle of a sentence, but also at the end.

Question and exclamation marks also do not always divide text into sentences, for example "Yahoo!". Other characters can also separate sentences, for example a colon (when a list of separate statements follows).
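To see why the naive approach breaks down, here is a small illustrative sketch (my own example, not part of any toolchain discussed here) that splits on sentence-final punctuation followed by whitespace:

```php
<?php

// Naive sentence splitter: break after ".", "!" or "?" followed by whitespace.
// This is exactly the approach that fails on abbreviations and dotted names.
function naiveSplit(string $text): array
{
    $parts = preg_split('/(?<=[.!?])\s+/u', $text);
    return array_map('trim', $parts);
}

$text = 'He lived on Baker St. in London. Yahoo! was founded in 1994.';
print_r(naiveSplit($text));
// Two real sentences, but "Baker St." and "Yahoo!" each trigger a bogus
// break, so the naive splitter produces four fragments instead of two.
```

This is the kind of case a rule-based segmenter with an abbreviation dictionary handles for you.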
So, rather than reinvent the wheel, I spent quite a while looking for a ready-made tool and settled on Tomita-parser from Yandex. This article is about it.


In general, Tomita-parser is a powerful tool for extracting facts from text; the segmenter (which splits text into sentences) is only one part of the project. Tomita-parser can be downloaded as a prebuilt binary and run from the command line. I liked this system because it is rule-based, undemanding on resources, and lets you customize the segmentation process. In my experience, it also handles the task very well in most cases.

I also liked that if you have questions, you can ask them on GitHub and sometimes even get an answer.

Launch


Tomita-parser is launched like this:

$ echo "p,  ... ,   ..  .   .    STALKER ." | ./tomita-linux64 config.proto 

That is, input is read from stdin and the output goes to stdout.
The result looks something like this:

[10:01:17 17:06:37] - Start. (Processing files.)
 ,   . . .  ,   . .   .    .    STALKER  .
[10:01:17 17:06:37] - End. (Processing files.)


One line per sentence. This example shows that the split was correct.

Special features


A few things to pay attention to: in the output, punctuation marks are separated from words by spaces.

These features can be either pluses or minuses, depending on what you will do next with the resulting text. I, for example, go on to build syntax trees with SyntaxNet, where punctuation marks must be separated by spaces, so for me this is a plus.

Settings


I ran into the fact that when analyzing sentences containing addresses, the system splits them incorrectly. Example:

$ echo "   .       ." | ./tomita-linux64 config.proto
[10:01:17 18:00:38] - Start. (Processing files.)
    .        .
[10:01:17 18:00:38] - End. (Processing files.)

As you can see, the split was incorrect. Fortunately, such things can be configured. To do this, add the following to the gzt file:

TAbbreviation "." {
    key = { "abbreviation_." type = CUSTOM }
    text = "."
    type = NewerEOS
}

That is, we tell the parser to assume that after "st." the sentence always continues. Let's try:

$ echo "   .       ." | ./tomita-linux64 config.proto
[10:01:17 18:20:59] - Start. (Processing files.)
   .        .
[10:01:17 18:20:59] - End. (Processing files.)

Now everything is fine. I published an example of the settings on GitHub.

What are the cons


Some peculiarities I already mentioned above. A few words about the drawbacks of the tool at the moment.

The first is the documentation. It exists, but not everything is described in it. I just tried to find the setting I described above and could not.

The second is the lack of an easy way to run the parser in daemon mode. Processing one text takes 0.3–0.4 seconds, including loading the whole system into memory. For me this is not critical, since all processing runs in background jobs, and there are much heavier tasks among them. For some, though, this may be a bottleneck.
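One possible way to amortize the startup cost is to batch many texts into a single invocation, separated by a sentinel "sentence". This is my own workaround sketch, not a Tomita feature: the sentinel token, the helper names, and the assumption that the sentinel survives segmentation as its own output line all need to be verified against your configuration.

```php
<?php

// Hypothetical batching sketch: pack N texts into one stdin payload with a
// sentinel sentence between them, run the parser once, then split the
// line-per-sentence output back into one batch of sentences per text.
// SENTINEL is an invented marker; it is assumed to come out as its own line.
const SENTINEL = 'TOMITASEP .';

/** Join several texts into a single input payload. */
function packTexts(array $texts): string
{
    return implode(' ' . SENTINEL . ' ', $texts);
}

/** Split the parser's output lines back into one array of sentences per original text. */
function unpackResults(string $output): array
{
    $batches = array(array());
    foreach (array_filter(array_map('trim', explode("\n", $output))) as $line) {
        if (strpos($line, 'TOMITASEP') === 0) {
            $batches[] = array(); // sentinel line: start the next text's batch
        } else {
            $batches[count($batches) - 1][] = $line;
        }
    }
    return $batches;
}
```

The packed payload would then be piped to tomita-linux64 once instead of N times, paying the load time only once per batch.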

PHP call example


As mentioned above, we feed the text to stdin and read the result from stdout. The example below is based on github.com/makhov/php-tomita :
<?php

class TomitaParser
{
    /**
     * @var string Path to Yandex's Tomita-parser binary
     */
    protected $execPath;

    /**
     * @var string Path to Yandex's Tomita-parser configuration file
     */
    protected $configPath;

    /**
     * @param string $execPath   Path to Yandex's Tomita-parser binary
     * @param string $configPath Path to Yandex's Tomita-parser configuration file
     */
    public function __construct($execPath, $configPath)
    {
        $this->execPath = $execPath;
        $this->configPath = $configPath;
    }

    public function run($text)
    {
        $descriptors = array(
            0 => array('pipe', 'r'), // stdin
            1 => array('pipe', 'w'), // stdout
            2 => array('pipe', 'w'), // stderr
        );

        $cmd = sprintf('%s %s', $this->execPath, $this->configPath);
        $process = proc_open($cmd, $descriptors, $pipes, dirname($this->configPath));

        if (is_resource($process)) {
            fwrite($pipes[0], $text);
            fclose($pipes[0]);

            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            fclose($pipes[2]);
            proc_close($process);

            return $this->processTextResult($output);
        }

        throw new \Exception('proc_open fails');
    }

    /**
     * Split the parser output into sentences (one per line).
     *
     * @param string $text
     * @return string[]
     */
    public function processTextResult($text)
    {
        return array_filter(explode("\n", $text));
    }
}

$parser = new TomitaParser('/home/mnv/tmp/tomita/tomita-linux64', '/home/mnv/tmp/tomita/config.proto');
var_dump($parser->run(' .  .'));

Checking:

$ php example.php
/home/mnv/tmp/tomita/example.php:66:
array(2) {
  [0] => string(32) "  . "
  [1] => string(32) "  . "
}


In conclusion


While working with text, I regularly come across projects whose authors write a segmenter on their own. Perhaps because at first glance the task seems a bit simpler than it really is. I hope this article will be useful to those who are about to write yet another segmenter for their project and will save them time by pointing at a ready-made option.

I would be glad to hear in the comments: which tool do you use for splitting text into sentences?

Source: https://habr.com/ru/post/317726/

