Improving search relevance in sphinxsearch

Sphinxsearch is a search engine for fast full-text search. It can pull data from MySQL, Oracle and MSSQL, and can also act as a data store itself (real-time indexes). Sphinx can be used through its API or through SphinxQL, an analogue of the SQL protocol (with some restrictions), which lets you hook Sphinx search into a site with minimal code changes. It is one of the few great, large, open projects developed in Russia. I have seen Sphinx handle about 100-200 search queries against 2 million records from MySQL while the server breathed freely and did not feel sick; MySQL starts to die at 10 requests per second on a similar configuration.

The main problem with the Sphinx documentation, in my opinion, is the small number of examples for the most interesting settings. Today I will try to explain them through examples. The options I touch on relate mainly to search algorithms and search variations. Anyone who works closely with Sphinx will not learn anything new, but newcomers will hopefully be able to improve the quality of search on their sites.

Sphinx consists of two independent programs, indexer and searchd. The first builds indexes from data taken from the database; the second searches the built index. Now let's move on to the search settings in Sphinx.

morphology

Sets the morphology processing for words; I use only stemming. A stemming algorithm applies a set of language-specific rules to cut off endings and suffixes. Stemming does not use a ready-made dictionary of words; it relies on truncation rules for the language. That makes it small and fast, but it is also its weakness, because it can make mistakes.

An example of normalizing a word by stemming in Russian:
The words “яблоко”, “яблока”, “яблоку” (case forms of “apple”) are all cut down to “яблок”. Any search query containing a form of “яблоко” is normalized the same way and finds records containing any of the words above.

For English, the words “dogs” and “dog” are both normalized to “dog”.
For example, if you put the Russian word “кудрявый” (curly) into Sphinx, the index stores “кудряв”, and searches will match variations such as “кудрявые”, “кудрявая”, etc.
Stemming can be enabled for Russian, English, or both languages.

morphology = stem_en
morphology = stem_ru
morphology = stem_enru
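
To make the idea of rule-based suffix stripping concrete, here is a minimal Python sketch. It is not the stemmer that Sphinx uses; the suffix list and the `toy_stem` name are made up for illustration and only cover the example words from this section.

```python
# Toy suffix-stripping stemmer. The suffix list is an assumption chosen to
# cover the examples in the text; real stemmers use far richer rule sets.
SUFFIXES = ("ые", "ая", "ый", "о", "а", "у", "s")


def toy_stem(word):
    """Cut the longest known suffix, keeping at least two letters of stem."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("яблоко", "яблока", "яблоку", "кудрявый", "кудрявая", "dogs"):
        print(w, "->", toy_stem(w))
```

Because both the indexed text and the query pass through the same function, any form of the word ends up matching the same stored stem. The same simplicity is what produces mistakes on words the rules were not written for.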

You can also use the soundex and metaphone options, which take the sound of words into account for English. I do not use these morphology algorithms in my work, so if someone knows a lot about them, I will be happy to read about it. For Russian, such algorithms would make it possible to reduce “солнце” (sun) and the misspelled “сонце” to the same normalized form, “сонце”, derived from how the words sound and are pronounced.

morphology = stem_enru, soundex, metaphone
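
For readers who have not met sound-based normalization, the sketch below implements the classic American Soundex code in Python. It is only an illustration of the general idea; the Sphinx implementation may differ in details, and the `soundex` function here is not part of any Sphinx API.

```python
# Classic American Soundex: first letter plus three digits.
# Illustration only; not taken from the Sphinx source code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))
```

Words that sound alike, such as "Robert" and "Rupert", get the same code, so a query for one would match the other.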

You can also plug in external morphology engines or write your own.

wordforms

Lets you plug in your own word form dictionaries. It works well on sites with a specialized subject, and the documentation has a good example.

core 2 duo > c2d
e6600 > c2d
core 2duo > c2d

This lets users find an article about Core 2 Duo with any search query, from the model number to variations of the name.

hemp > weed
dope > weed
my precious > weed
grass of freedom > weed
anything to smoke > weed
got anything > weed

And this dictionary will let your users easily find information about weed on the site.

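The word forms file itself is plain text, with one “source > destination” mapping per line, as in the examples above. Dictionaries in ispell or MySpell format (such as those bundled with OpenOffice) can be converted to this format with the spelldump utility that ships with Sphinx.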

wordforms = /usr/local/sphinx/data/wordforms.txt
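
As a rough illustration of what such a dictionary does to a query, the Python sketch below parses “source > destination” lines and replaces matching token runs, preferring the longest match. The function names are hypothetical, and real Sphinx tokenization is more involved than a whitespace split.

```python
# Sketch of word form substitution. parse_wordforms and normalize are
# hypothetical helper names, not part of Sphinx.
WORDFORMS = """\
core 2 duo > c2d
e6600 > c2d
core 2duo > c2d
"""


def parse_wordforms(text):
    """Map each source phrase (as a tuple of tokens) to its destination."""
    mapping = {}
    for line in text.splitlines():
        if ">" not in line:
            continue
        source, target = line.split(">", 1)
        mapping[tuple(source.split())] = target.strip()
    return mapping


def normalize(tokens, mapping):
    """Replace the longest matching source phrase at each position."""
    longest = max((len(key) for key in mapping), default=1)
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in mapping:
                out.append(mapping[key])
                i += size
                break
        else:
            out.append(tokens[i])  # no phrase starts here, keep the token
            i += 1
    return out


if __name__ == "__main__":
    forms = parse_wordforms(WORDFORMS)
    print(normalize("buy core 2 duo cpu".split(), forms))
```

Since the same substitution is applied at indexing time and at query time, "e6600" and "core 2 duo" end up as the same keyword.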

enable_star

Allows asterisks (wildcards) in queries. For example, the query *пр* will find “проспект” (avenue), “привет” (hello), “аппроксимация” (approximation), etc.

enable_star = 1
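
The effect of a star query can be imitated in Python with the standard fnmatch module. This is only an approximation of the matching semantics: `star_search` is a made-up helper, and fnmatch also gives special meaning to ? and [], which Sphinx star syntax does not.

```python
from fnmatch import fnmatchcase


def star_search(pattern, words):
    """Return the words matching a pattern where * stands for any characters."""
    return [word for word in words if fnmatchcase(word, pattern)]


if __name__ == "__main__":
    words = ["prospect", "approximation", "hello"]
    print(star_search("*pr*", words))
    print(star_search("pr*", words))
```

The first query matches "pr" anywhere inside a word, the second only at the beginning.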

expand_keywords

Automatically expands each keyword in the search query into three variants.

running -> ( running | *running* | =running )

That is, the word with morphology applied, the word with asterisks, and the exact form of the word. This option did not exist before, and to search with asterisks I had to issue an additional query manually; now a single option turns it all on. In addition, an exact match automatically ranks higher in the results than matches found through asterisks or morphology.

expand_keywords = 1
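
Before this option existed, the expansion had to be done by hand on the application side. A minimal sketch of such a manual rewrite in Python could look like this; `expand_keywords` here is just a local function name, and the sketch does not handle operators or phrases inside the query.

```python
def expand_keywords(query):
    """Rewrite each plain keyword as ( word | *word* | =word )."""
    return " ".join(f"( {word} | *{word}* | ={word} )" for word in query.split())


if __name__ == "__main__":
    print(expand_keywords("running"))
```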

index_exact_words

Stores the original word in the index alongside its morphologically normalized form. This greatly increases the index size, but together with the previous option it lets you return more relevant results.

For example, take three words: “дыня”, “дыни”, “дыне” (forms of “melon”). Without this option all three are stored in the index as “дын”, and the query “дыни” returns them in the order they were added to the index, that is, “дыня”, “дыни”, “дыне”.
If you turn on both expand_keywords and index_exact_words, the query “дыни” returns the more relevant ordering “дыни”, “дыня”, “дыне”.

index_exact_words = 1
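
The ordering effect can be shown with a toy ranking in Python. The scoring here (one point for a stem match, one more for an exact match) is invented for illustration and is not how the Sphinx rankers actually compute weights; `strip_ending` and `search` are hypothetical names.

```python
def strip_ending(word):
    """Very rough stand-in for a stemmer: drop one trailing vowel ending."""
    return word[:-1] if word[-1:] in ("я", "и", "е") else word


def search(query, docs, exact_words=False):
    """Return matching docs, best first; ties keep insertion order."""
    hits = []
    for doc in docs:
        if strip_ending(doc) != strip_ending(query):
            continue
        score = 1  # matched through the normalized form
        if exact_words and doc == query:
            score += 1  # the original form is stored too, so reward it
        hits.append((score, doc))
    hits.sort(key=lambda hit: -hit[0])  # stable sort preserves index order
    return [doc for _, doc in hits]


if __name__ == "__main__":
    docs = ["дыня", "дыни", "дыне"]
    print(search("дыни", docs))
    print(search("дыни", docs, exact_words=True))
```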

min_infix_len

Indexes parts of words (infixes) and lets you search on them with *, as in search*, *search and *search*.
For example, with min_infix_len = 2, when the word “test” goes into the index, the infixes “te”, “es”, “st”, “tes”, “est” and “test” are stored, and the query “es” will find this word.

I usually use

min_infix_len = 3

A lower value generates too much garbage. Remember that this option greatly increases the index size.
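
To get a feel for how many extra entries infix indexing produces, the Python sketch below enumerates every substring of a word that is at least min_len characters long. It only models the idea; the `infixes` helper is hypothetical and says nothing about how Sphinx stores the data internally.

```python
def infixes(word, min_len):
    """List distinct substrings of word with length >= min_len, shortest first."""
    found = []
    for size in range(min_len, len(word) + 1):
        for start in range(len(word) - size + 1):
            part = word[start:start + size]
            if part not in found:
                found.append(part)
    return found


if __name__ == "__main__":
    print(infixes("test", 2))
    print(len(infixes("approximation", 2)), len(infixes("approximation", 3)))
```

The number of substrings grows roughly with the square of the word length, which is why a small minimum length inflates the index so quickly.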

min_prefix_len

A narrower variant of min_infix_len: it does almost the same thing but stores only the beginnings of words (prefixes).
For example, with min_prefix_len = 2, when the word “test” goes into the index, the prefixes “te”, “tes” and “test” are stored, and the query “te” will find this word.

min_prefix_len = 3
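
The prefix case is easy to model in Python: only substrings that start at the first character are kept. As before, `prefixes` is a hypothetical helper used purely for illustration.

```python
def prefixes(word, min_len):
    """List the prefixes of word with length >= min_len, shortest first."""
    return [word[:size] for size in range(min_len, len(word) + 1)]


if __name__ == "__main__":
    print(prefixes("test", 2))
```

A word of length n yields at most n prefixes, so the index grows much less than with infixes, at the cost of not matching the middle or end of a word.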

min_word_len

The minimum word length for indexing. It defaults to 1, which indexes all words.

I usually use

min_word_len = 3

Shorter words usually carry no meaning.
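
In Python terms the effect is a simple length filter applied while tokenizing. The `index_tokens` helper below is hypothetical and splits on whitespace only, which is cruder than the real tokenizer.

```python
def index_tokens(text, min_word_len=1):
    """Lowercase the text and keep only tokens of at least min_word_len characters."""
    return [word for word in text.lower().split() if len(word) >= min_word_len]


if __name__ == "__main__":
    print(index_tokens("I am at the cinema", min_word_len=3))
```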

html_strip

Strips all HTML tags and HTML comments. This option is relevant if you are building your own Google or Yandex on top of sphinxsearch: you run a spider, parse the site, load it into the database, set the indexer on it, and this option gets rid of the junk in the form of HTML tags so that the search covers only the site content.

Unfortunately I have not used it myself, but the documentation says it can stumble on all sorts of XML and non-standard HTML (for example, tags opened and closed in arbitrary places).

html_strip = 1
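
If you want to see roughly what the indexer ends up with, tag stripping can be approximated in Python with the standard html.parser module. This is a sketch of the general idea, not a reimplementation of the Sphinx stripper; the `TextOnly` class and `strip_html` function are made-up names.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; tags and comments are simply not recorded."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b><!-- hidden --></p>"))
```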

I will be glad to answer any questions and hear any clarifications.
Official site: sphinxsearch.com.
If you found the article interesting, don't be lazy: give it a plus.

Source: https://habr.com/ru/post/147745/
