Trigram index or "Search with typos"

Somehow on duty, it became necessary to add to the search on the site a well-known feature, the service “Perhaps you meant ...” or “Search with typos”. Began to think how to implement. I did not want to use third-party services and api, because the time to a foreign server and back, and in general is not very good. Just by the way came the pg_trgm module, which is looking for words close to the query based on the trigram index.

Implementation

For starters, how it is used.
To search for similar words, you need to create a list of the correct options. Create a table with a text field on which we will later hang the trigram index.

CREATE TABLE " public "."tbl_words" ( "word" VARCHAR (255) NOT NULL ) WITHOUT OIDS; * This source code was highlighted with Source Code Highlighter .

')
You can fill in the table in various ways, for ourselves we solved this question like this:
- Grammar dictionary Zaliznyak (~ 90 thousand words), a dictionary of Russian literature (~ 160 thousand words), either any other or all together. Dictionaries are easily found in the network, usually they are a line-by-line list of words, parsing such in the database is not difficult.
- We have a book online store, about 200 thousand products in the database, each has a name, author, brief summary and so on. Since people are likely to search using words from this data, we collect everything in a heap, divide by whitespace characters, and also fill unique words in the table.

Next, add the index.

CREATE INDEX "tbl_words_trgm_idx" ON " public "."tbl_words" USING gist ("word" " public "."gist_trgm_ops"); * This source code was highlighted with Source Code Highlighter .

Here in this form the scheme can already be used.

select word, similarity(word, '' ) from tbl_words * This source code was highlighted with Source Code Highlighter .

A similar request will give us similar words and a rate that indicates how much the word is similar to the proposed one. Sorting by rating "similarity" in the reverse order, we get the most similar words.

select word from tbl_words where word % '' order by similarity(word, '' ) desc , word limit 5 * This source code was highlighted with Source Code Highlighter .

Next to determine when all the same person was wrong and he must offer the right option. The most obvious option is if the user did not find anything at his request, or, for example, found few results, check for similar options. If the search for similar options yielded a certain number of results, we show a hint to the user.
At this stage, in order to achieve the correct result, it is possible to twist the values to give / not to give, how many gave the search by similar words and other properties. It is necessary to establish the threshold value of “similarity rating”, to which the words should not even be considered as similar.

This method, however, has several big drawbacks:
1) The dictionary we have consists of individual words, while the search queries are usually complex. Search for similar phrases has to be carried out separately by words, then they are combined. That is, for example, having the search phrase “Pushkin's poignant poetry,” which does not yield results, one has to look for similar words, respectively, for “Poigny,” “poetry,” “Pushkin.” If you take 2 similar words, the number of variations will be equal to 8. This creates a good load on the base, which usually does not please.
2) Even with the installation of all the necessary parameters, funny curiosities happen, when searching for a word that is not related to the query, will produce more results than the original query. Thus, you get quite funny clues, for example, when searching for the word “Tina”, a suggestion appears, “maybe you meant Putin,” or, God forbid, on the contrary)

Total:
Pros - easy implementation, a large number of options tips.
Cons - loading the database for large queries, occasionally incorrect hints.

Option 2.

If you keep statistics on search queries, you can use another option.
A table is created with unique search queries, a trigram index is added to the field with queries according to the same principle. And the match is not searched separately by words, but by the full search phrase in the saved queries.

Total:
Pluses - phrases in tips are written by users, less likelihood of “silly” tips.
Cons - the completeness of the database depends only on your statistics requests.

After repeated testing, we decided to abandon the first method in favor of the second, its results were painfully unpredictable) The results of the second method can be found here .

How it works.

To make it all clear, I will tell you how it works. It's pretty simple.
Each word is divided into three-letter combinations - trigrams.
For example:

When searching for a similar phrase, the same trigrams are searched. The more equal trigrams, the more the phrase is similar to the original.

Similar products.

The trigram index was useful in one more case. On the product page, we have various blocks of sentences: “Products by this author”, “Products from this publisher” and so on. One of them - "Similar products."
Many sites have a similar function, but from experience, it is usually either products from the same category as the main product, or a manually assigned list. There was also an implementation in which the products found using full-text search were displayed, for example, by two or one word from the name of the main product. But it also gives, often, not very predictable results. For example, the book Architecture of C ++ was issued books on architecture and construction.

The trigram index, established on the names of goods, gave quite good "relevant" results) An example is here , or on any product page on the site.

If someone is interested, next time I'll post the performance tests and source codes.

UPD: Here is another example of the search - with the permutation of letters .

Source: https://habr.com/ru/post/78566/

All Articles

Trigram index or "Search with typos"

Implementation

Option 2.

How it works.

Similar products.

More articles: