
Integrating a Russian stemming algorithm into SQLite fts3

In this article I want to share my experience of building an extension directly into the SQLite source code. Everything was done on Ubuntu 11.10.

Problem


SQLite's fts3 includes a simple stemmer that implements Porter's stemming algorithm, but it has no implementation for Russian words. That is, MATCH cannot find entries that contain a different form of a Russian word such as 'hotel'.

Preparing to compile


What is needed



From here on, I assume that the sqlite3 sources are in $HOME/SQLite.
Stemmer code

Russian characters are expected in UTF-8 encoding.
The stemmer uses the built-in Porter stemmer for Latin words and implements a similar algorithm for Russian words.
The code was originally written in C++ and loaded as an SQLite extension. I modified it so that it compiles with a C compiler, so it is far from elegant or rigorous. Here is the result:
fts3_porter_ext.c
Put the stemmer in $HOME/SQLite/ext/fts3/fts3_porter_ext.c
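
The split between Latin and Russian words described above can be sketched roughly as follows. This is only an illustration, not code from fts3_porter_ext.c: the names stem_kind, starts_with_russian_letter and choose_stemmer are hypothetical, and the sketch assumes that checking the first UTF-8 character of a token is enough to pick a path.

```c
#include <stddef.h>

/* Which stemming path a token should take (hypothetical type). */
typedef enum { STEM_LATIN, STEM_RUSSIAN } stem_kind;

/* Hypothetical helper, not taken from fts3_porter_ext.c.
** Returns 1 if the UTF-8 token z of n bytes starts with a Russian letter:
** U+0410..U+044F, U+0401 or U+0451. All of these are encoded as two bytes
** with lead byte 0xD0 or 0xD1. */
static int starts_with_russian_letter(const char *z, size_t n){
  unsigned char b0, b1;
  if( n<2 ) return 0;
  b0 = (unsigned char)z[0];
  b1 = (unsigned char)z[1];
  if( b0==0xD0 ) return (b1>=0x90 && b1<=0xBF) || b1==0x81;
  if( b0==0xD1 ) return (b1>=0x80 && b1<=0x8F) || b1==0x91;
  return 0;
}

/* Assumption: the first character alone decides which stemmer is used. */
static stem_kind choose_stemmer(const char *z, size_t n){
  return starts_with_russian_letter(z, n) ? STEM_RUSSIAN : STEM_LATIN;
}
```

A real tokenizer would also have to deal with mixed-script tokens and case folding, which this sketch leaves out.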

Edit files

Makefile.in

Edit $HOME/SQLite/Makefile.in.

fts3.c

Edit $HOME/SQLite/ext/fts3/fts3.c.
After the line
void sqlite3Fts3PorterTokenizerModule(sqlite3_tokenizer_module const**ppModule);

add the line
void sqlite3Fts3PorterTokenizerModule1(sqlite3_tokenizer_module const**ppModule);

After the line
sqlite3Fts3PorterTokenizerModule(&pPorter);

add the initialization of our module
const sqlite3_tokenizer_module *pPorter1 = 0;
sqlite3Fts3PorterTokenizerModule1(&pPorter1);

Finally, after
|| sqlite3Fts3HashInsert(pHash, "porter", 7, (void *)pPorter)

add our module to the hash of built-in tokenizers
|| sqlite3Fts3HashInsert(pHash, "russian", 8, (void *)pPorter1)
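
A note on the third argument: judging by the existing "porter" entry (6 characters, length 7), the key length appears to include the terminating NUL byte, which is why "russian" is registered with 8. A minimal sketch of that arithmetic, using a hypothetical helper tokenizer_key_len that is not part of SQLite:

```c
#include <string.h>

/* Hypothetical helper: key length to pass when registering a tokenizer
** name, assuming the key includes the trailing NUL byte as the existing
** "porter" entry suggests. */
static int tokenizer_key_len(const char *zName){
  return (int)strlen(zName) + 1;
}
```

If the tokenizer is registered under a different name, the length has to change to match.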

mkfts3amal.tcl

Edit $HOME/SQLite/ext/fts3/mkfts3amal.tcl
After the line
fts3_tokenizer1.c

add
fts3_porter_ext.c

mksqlite3c.tcl

Edit $HOME/SQLite/tool/mksqlite3c.tcl
After the line
fts3_tokenizer1.c

add
fts3_porter_ext.c


Compilation


Run the following (it is better to replace --prefix=$HOME with something more sensible; this will be the installation path):
cd $HOME/SQLite && mkdir build && cd build && ../configure --prefix=$HOME CFLAGS='-DSQLITE_SOUNDEX -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS' && make

Now check that our stemmer made it into sqlite3.c:
grep fts3_porter_ext.c sqlite3.c

The output should look something like this:
/************** Begin file fts3_porter_ext.c *********************************/
/************** End of fts3_porter_ext.c *************************************/

Now install sqlite3 on the computer:
sudo make install


Usage


When creating fts3 tables, specify our stemmer, for example:
CREATE VIRTUAL TABLE tag_fti USING fts3(name, tokenize=russian);

Now MATCH queries on the tag_fti table will use our stemmer.

Summary


We now have two files, sqlite3.c and sqlite3.h, that can be included in our projects.
There is no need to load extension modules.
We also have a console client that correctly handles queries against the fts3 tables our applications create. The reverse holds as well: tables created by the console client can be handled by our applications.
I will be glad if this article is useful to someone.

Upd: corrected links

Source: https://habr.com/ru/post/131265/

