Probabilistic morphological analyzer of Russian and Ukrainian languages in PHP

Sooner or later, every site developer faces the question of how to implement site search. Ideally the search should work on word stems, i.e. ignore word endings. Programs called stemmers are used for this: they separate the stem from the rest of the word. Many stemmers rely on a dictionary; to avoid huge dictionaries in small and medium-sized projects, a probabilistic morphological analyzer can be used instead. Its distinguishing feature is a relatively small data file and, accordingly, no load on the database, without a large loss in the quality of stem extraction.

Stemming is the process of finding the stem of a given source word. The stem does not necessarily coincide with the morphological root of the word. Stemming is a long-standing problem in computer science. It is used in search engines to generalize the user's search query.
A specific implementation of stemming is called a stemming algorithm or simply a stemmer.
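
To make the idea concrete, here is a deliberately naive sketch of a suffix-stripping stemmer. It is not the probabilistic algorithm this post is about; the function name toy_stem, the suffix list and the minimum stem length are all assumptions made up for illustration.

```php
<?php
// Toy suffix-stripping stemmer, NOT the probabilistic algorithm from this post.
// The suffix list and the minimum stem length are illustrative assumptions.

function toy_stem(string $word, array $suffixes, int $minStemBytes = 4): string
{
    // Try longer suffixes first so that a long ending wins over a short one.
    usort($suffixes, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    foreach ($suffixes as $suffix) {
        $cut = strlen($word) - strlen($suffix);
        // Byte-wise comparison is safe for UTF-8 as long as the suffix
        // itself is valid UTF-8.
        if ($cut >= $minStemBytes && substr($word, $cut) === $suffix) {
            return substr($word, 0, $cut);
        }
    }
    return $word; // no known ending found
}

$suffixes = ['ами', 'ов', 'а', 'ы', 'у'];
foreach (['стол', 'стола', 'столы', 'столами'] as $form) {
    echo $form, ' -> ', toy_stem($form, $suffixes), PHP_EOL;
}
```

All four forms reduce to the same stem, which is what lets a search for one form find the others. A real stemmer has to cope with far more endings and exceptions, which is where dictionaries or probabilistic tables come in.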

Recently I needed a stemmer of decent quality for Russian and Ukrainian. Digging around the Internet, I found a very interesting one on Andrei Kovalenko's website. Description of the stemmer.

It was implemented in C++, which upset me very much. What upset me was not the language itself, but that I could not use it because of the constraints of my project (PHP only). I did not accept this and, armed with a debugger, ported the application to PHP.

The site also offers a faster stemmer as a PHP module, but it does not matter much to me whether it processes 12 thousand words per second or 2-3 thousand; one thousand is enough for me (I did not test the speed).

Ported class code (stemka.php)

How to make it work:

Download the original library (Original library) and take the fuzzy*.inc dictionaries from its library folder.

Next, bring the dictionaries into a form convenient for PHP. I converted the data to a binary file and load it with the file_get_contents function.
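
The binary format here is simply one byte per array element. A minimal round-trip sketch, assuming made-up sample values and a temporary file in place of the real dictionaries:

```php
<?php
// Round trip: byte values -> binary file -> byte values.
// The sample values and the temporary file are assumptions for illustration;
// the real dictionaries are fuzzyru.dat and fuzzyuk.dat.

$values = [0, 17, 200, 255];            // stand-in for the $fuzzy array

// Pack: one byte per value (each value must be in the range 0..255).
$binary = implode('', array_map('chr', $values));

$path = tempnam(sys_get_temp_dir(), 'fuzzy');
file_put_contents($path, $binary);

// Load the whole file into a string in a single call.
$data = file_get_contents($path);
unlink($path);

// Individual bytes are read back with ord().
$restored = [];
for ($i = 0, $n = strlen($data); $i < $n; $i++) {
    $restored[] = ord($data[$i]);
}

var_dump($restored === $values);
```

Keeping the dictionary as one string avoids building a large PHP array on every request, which is presumably why this form is convenient.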

Before converting, you need to edit the C++ files that contain the dictionaries:
1. Add the tag "<?php" to the beginning of the file
2. Add "?>" to the end of the file
3. Replace "{" with "$fuzzy = array("
4. Replace "}" with ");"
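
The four edits above can also be sketched as a small function. This assumes the .inc file holds a single brace-enclosed list of numeric literals and no other braces; the function name cpp_inc_to_php and the sample input are made up for illustration.

```php
<?php
// Sketch of the four manual edits as a function.
// Assumption: the .inc file holds one brace-enclosed list of numeric
// literals (plus optional C++ comments) and contains no other braces.

function cpp_inc_to_php(string $cppSource): string
{
    $body = str_replace(
        ['{', '}'],
        ['$fuzzy = array(', ');'],
        $cppSource
    );
    return "<?php\n" . $body . "\n?>";
}

$sample = "{ 0x01, 0x7f, 255 }";   // made-up stand-in for a dictionary file
echo cpp_inc_to_php($sample), PHP_EOL;
```

To apply it to a real file, one could pass the result of file_get_contents('fuzzyuk.inc') through the function and write it back with file_put_contents, keeping a backup of the original first.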

After that, run the conversion script and the files will be converted.

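<?php
// Convert the edited dictionaries into binary files, one byte per array element.

include "fuzzyuk.inc";               // defines $fuzzy (Ukrainian dictionary)
$fp = fopen('fuzzyuk.dat', 'wb');    // 'b' keeps the output binary-safe on Windows
foreach ($fuzzy as $v) {
    fwrite($fp, chr($v));
}
fclose($fp);

include "fuzzyru.inc";               // redefines $fuzzy (Russian dictionary)
$fp = fopen('fuzzyru.dat', 'wb');
foreach ($fuzzy as $v) {
    fwrite($fp, chr($v));
}
fclose($fp);
?>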

If you would rather not convert them yourself, here are the already converted dictionaries: fuzzyuk.dat (243 KB), fuzzyru.dat (403 KB)

The stemmer is ready to use. Example:

<?php
include "stemka.php";
$stemka = new stemka();
$str = 'interleave';                     // the word to stem
echo $stemka->GetStemCrop($str, 'uk');   // 'uk' selects the Ukrainian dictionary
?>
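
For site search, every word of the query would be stemmed before matching. A minimal sketch of that step, where stem_query and the stand-in stemmer are hypothetical names that keep the example self-contained:

```php
<?php
// Sketch: reduce every word of a search query to its stem.
// $stem is any callable mapping a word to its stem; the stand-in used
// below is hypothetical and only keeps the example self-contained.

function stem_query(string $query, callable $stem): array
{
    // Split on whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    return array_map($stem, $words);
}

// Stand-in stemmer: strips trailing "s" characters. Not related to stemka.
$fakeStem = function (string $word): string {
    return rtrim($word, 's');
};

print_r(stem_query('  red  apples and pears ', $fakeStem));
```

With the ported class, the callable could be a closure such as function ($w) use ($stemka) { return $stemka->GetStemCrop($w, 'uk'); }, assuming GetStemCrop takes one word at a time as in the example above. The post does not state which input encoding the class expects, so check that before passing UTF-8 text.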

Or try the demo version.

I do not claim to cover the topic completely; I just decided to share the code in case it comes in handy for someone.

Criticism and downvotes are welcome.

Source: https://habr.com/ru/post/102037/
