Fuzzy search in relational databases

To search for the necessary information on websites and in mobile applications it is often used to search by words or phrases that the user freely enters from the keyboard (rather than choosing from a list for example). Naturally, the user can make mistakes and typos. In this case, full-text search, full-text indexes that are implemented in most databases do not give the expected result and are practically useless. Such functionality is increasingly implemented on the basis of elasticsearch.

Solutions using elasticsearch have one major drawback - a very high probability of misalignment of the main database, for example PostgreSQL, MySQL, mongodb and elasticsearch, in which indexes are stored for the search.

The ideal option would be to have a “bridge” that would assume the function of data matching, in the event that the base for the search engine is unavailable during the update of the main database. But I have not yet found the realization of such a bridge. For example, in one of the mongodb and lucene bundles projects, this problem is mentioned.

Distributed to existing nodes. This allows for rolling shutdowns. Existing nodes without committing. These indexes could require rollback or repair. It is not always possible to use it. Since the documents were stored in MongoDB and not in the index, it was possible. MongoDB also provides seamless failover through replication. MongoDB's replication is data center aware backups across datacenters are possible.

How is this problem solved in practice? Yes, nothing. If the data is not very large, then the database is simply re-indexed by timer. If the base is large and it is often impossible to reindex it, then everything remains as it is, inconsistent, it is a little more difficult to identify this discrepancy.

I hope that sustainable "bridges" that will sustainably update indices in elasticsearch or lucene will appear sooner or later. Now there is a need to find a working solution.
')
One option is to use a single database for data storage and search. Regarding the use as such a single base elasticsearch, in almost all discussions on the forums there was a consensus that such a solution is not suitable. Therefore, I began to search for a database in which to create full-text indexes that support fuzzy search. Since the main engine for such indices, lucene, was developed in java, the range of databases in which I was looking for such a possibility was clearly outlined.

As it turned out, there are at least two solutions that use the lucene library and are at the poduction ready application level: these are orientdb and h2.

In orientdb, working with fuzzy full-text search is very simple:

create class russian create property russian.message string create index russian.message on russian(message) fulltext engine lucene metadata { "analyzer": "org.apache.lucene.analysis.ru.RussianAnalyzer" } select * from russian where message lucene '~0.5' limit 2

The h2 is a bit more complicated. The index is a separate table with which you need to link the main table. But a little harder it does not mean difficult.

 CREATE ALIAS IF NOT EXISTS FTL_INIT FOR "org.h2.fulltext.FullTextLucene.init"; CALL FTL_INIT(); DROP TABLE IF EXISTS TEST; CREATE TABLE TEST(ID INT PRIMARY KEY, FIRST_NAME VARCHAR, LAST_NAME VARCHAR); CALL FTL_CREATE_INDEX('PUBLIC', 'TEST', NULL); INSERT INTO TEST VALUES(1, 'John', 'Wayne'); INSERT INTO TEST VALUES(2, 'Elton', 'John'); SELECT * FROM FTL_SEARCH_DATA('John', 0, 0); SELECT * FROM FTL_SEARCH_DATA('LAST_NAME:John', 0, 0);

apapacy@gmail.com
April 22, 2018

Source: https://habr.com/ru/post/354032/

All Articles

Fuzzy search in relational databases

More articles: