
Fuzzy search implementation

We all make mistakes, and search queries are no exception. The number of sites selling goods and services keeps growing along with users' needs, yet users do not always find what they are looking for, simply because they mistype the name of the product they need. The solution is fuzzy search: an algorithm that finds the closest matching values while accounting for possible user errors and typos. The applications of such a search are quite broad; we had the chance to build one for a large online store in the food retail segment.



Initial Search State



The online store was built on the 1C-Bitrix: Site Management platform. The search combined the standard Bitrix search module with the Sphinx full-text search engine. In Sphinx we used a Real-Time (RT) index, which does not require a static data source and can be populated at any moment in real time. This makes it possible to update the search index flexibly, without full reindexing. Since an RT index accepts queries only over SphinxQL, the integration was done through that protocol. SphinxQL is a MySQL-like protocol: it lets you connect with standard MySQL clients and offers SQL syntax, albeit with some limitations and peculiarities. The search module uses SELECT / INSERT / REPLACE / DELETE queries.



On the Bitrix side, products, categories, and brands were indexed. Information about these entities was pushed to Sphinx, which in turn updated the RT index. Whenever data changed in the online store, an event fired that updated the corresponding data in Sphinx. Consistency was maintained through the entity identifier shared with the online store.



When a user searches the store, the system sends the search phrase to Sphinx and receives entity identifiers; it then fetches the corresponding records from the database and renders the results page.

When we began working on fuzzy search, the overall search architecture of the project looked like this:






Choice of technology



Our task was to guess what the user meant by a query that may contain typos. To do this, we need to analyze each word in the query and decide whether the user spelled it correctly. If not, we need to pick the most likely intended variants. Detecting misspellings is impossible without a base of words and word forms of the target language. Put simply, such a base is a dictionary, and that is exactly what we needed.



To find replacement candidates for misspelled words, we chose the well-known Damerau–Levenshtein distance. It is an algorithm for comparing two words; the result of the comparison is the number of edit operations needed to transform one word into the other. The classic Levenshtein distance allows three operations:

  1. insertion of a character;
  2. deletion of a character;
  3. substitution of one character for another.

The Damerau–Levenshtein distance is thus an extended version of the Levenshtein distance that adds one more operation: transposition, that is, swapping two adjacent characters.



The number of operations thus becomes the number of errors the user made while typing the word. We capped it at two errors, since a larger limit made no sense: it produces too many replacement candidates, which increases the chance of picking the wrong one.
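To make the metric concrete, here is a minimal Python sketch of the restricted Damerau–Levenshtein distance (insert, delete, substitute, plus adjacent transposition). In the project itself the distance ran as a MySQL UDF, not in application code; this is purely illustrative.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: insertion, deletion,
    substitution, and transposition of two adjacent characters."""
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

With the two-error limit described above, a dictionary word is accepted as a candidate when `damerau_levenshtein(query_word, dict_word) <= 2`.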



To find more relevant variants among similar-sounding words, the metaphone function is used: it converts a word into its phonetic form. Unfortunately, metaphone works only with the letters of the English alphabet, so before computing the phonetic form we transliterate the word. The phonetic form is stored in the dictionary and also computed for the user's query; the resulting values are then compared with the Damerau–Levenshtein distance function.
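The idea can be sketched as follows. Note the assumptions: the article does not specify the transliteration scheme, so a common one is used here, and `phonetic_key` is a deliberately crude stand-in for the real metaphone algorithm (first letter kept, later vowels dropped, duplicates collapsed), just to show how two differently misspelled words can collapse to the same key.

```python
# Transliteration table (an assumption: the article does not say
# which scheme was used in production).
TRANSLIT = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'ё': 'e', 'ж': 'zh', 'з': 'z', 'и': 'i', 'й': 'j', 'к': 'k',
    'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
    'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'h', 'ц': 'c',
    'ч': 'ch', 'ш': 'sh', 'щ': 'sch', 'ъ': '', 'ы': 'y', 'ь': '',
    'э': 'e', 'ю': 'yu', 'я': 'ya',
}

def transliterate(word: str) -> str:
    """Map Cyrillic letters to Latin; leave other characters as-is."""
    return ''.join(TRANSLIT.get(ch, ch) for ch in word.lower())

def phonetic_key(word: str) -> str:
    """Crude stand-in for metaphone: transliterate, then drop vowels
    after the first character and collapse repeated characters."""
    latin = transliterate(word)
    if not latin:
        return ''
    key = latin[0]
    for ch in latin[1:]:
        if ch in 'aeiouy':
            continue
        if ch != key[-1]:
            key += ch
    return key
```

For example, the correct "молоко" and the misspelled "малако" both yield the key "mlk", so the dictionary entry can be found even though the raw strings differ in two positions.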



The dictionary is stored in a MySQL database. To avoid loading the dictionary into application memory, we decided to compute the Damerau–Levenshtein distance on the database side. A user-defined function for this distance, written by Diego Torres on the basis of a C function by Linus Torvalds, fully met our requirements.



After computing the Damerau–Levenshtein distance, the results had to be sorted by degree of similarity so that the best candidate could be chosen. For this we used Oliver's algorithm for computing the similarity of two strings, available in PHP as the similar_text function. Its first two parameters are the strings to compare; the order matters, because the function can return different values depending on which string is passed first. The third parameter is a variable that receives the result of the comparison: a number from 0 to 100, the percentage similarity of the two strings. After the calculation, the results are sorted by descending similarity percentage and the best options are selected.
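In the spirit of PHP's similar_text, Oliver's algorithm counts matching characters by finding the longest common substring and recursing on the fragments to its left and right; the percentage is then derived from the combined string lengths. A Python sketch (an approximation of similar_text, not a byte-for-byte port):

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters the way Oliver's algorithm does:
    find the longest common substring, then recurse on the parts
    to its left and to its right."""
    best_len = best_a = best_b = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_a, best_b = k, i, j
    if best_len == 0:
        return 0
    return (best_len
            + similar_chars(a[:best_a], b[:best_b])
            + similar_chars(a[best_a + best_len:], b[best_b + best_len:]))

def similarity_percent(a: str, b: str) -> float:
    """Similarity percentage, like similar_text's third out-parameter."""
    if not a and not b:
        return 100.0
    return similar_chars(a, b) * 200.0 / (len(a) + len(b))
```

Because the first longest common substring found depends on scan order, swapping the arguments can change the result, which matches the order sensitivity noted above.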



Since the Damerau–Levenshtein distance was computed on word transcriptions, some not-quite-relevant candidates made it into the results. We therefore kept only candidates with a similarity above 70%.



During development we noticed that the algorithm could also guess words typed in the wrong keyboard layout. So we extended processing by mapping each word to the opposite layout; the requirements covered only two layouts, Russian and English. Each word of the user's query was duplicated in the opposite layout and run through the same Damerau–Levenshtein calculation. Candidates for the direct and the opposite layout are processed independently of each other, and the ones with the highest similarity percentage are selected. For opposite-layout candidates only, the corrected query uses the word converted back to the direct layout.
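The layout mapping itself is a simple key-for-key substitution between QWERTY and ЙЦУКЕН. A sketch (the auto-detection heuristic here, "any Cyrillic letter means Russian layout", is our assumption):

```python
# QWERTY <-> ЙЦУКЕН mapping: Latin letters plus the punctuation keys
# that carry Russian letters, position for position.
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_TO_RU = str.maketrans(EN, RU)
RU_TO_EN = str.maketrans(RU, EN)

def switch_layout(word: str) -> str:
    """Map a word typed in one layout to the other layout.
    If the word contains Russian letters, map it to the English
    layout; otherwise map it to the Russian layout."""
    if any(ch in RU for ch in word):
        return word.translate(RU_TO_EN)
    return word.translate(EN_TO_RU)
```

So a user who types "cjr" with the wrong layout is matched against "сок" ("juice"), and it is that direct-layout form that goes into the corrected query.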



Fuzzy search algorithm



This produced an algorithm of six main steps. During testing we found that not all words in user queries should be processed as-is, or at all. To handle such cases we added an extra step, which we called the converter, or filter. The final algorithm consists of seven steps:



  1. Split the query into individual words;
  2. Pass each word through the converter (more on it below);
  3. For each word, build its form in the opposite layout;
  4. Build the transcription;
  5. Query the dictionary table, comparing each entry by Damerau–Levenshtein distance;
  6. Keep only entries with a distance of two or less;
  7. Using Oliver's algorithm, keep only words with a similarity above 70%.


Schematically, this algorithm is as follows:







Word conversion and filtering



When we tested the first prototype, which had no converter, it became obvious that there is no point in trying to guess every word in the user's query. The converter was created for exactly these restrictions: it transforms words into the form we need and weeds out those we decided not to try to guess.



First, we decided that a word must be at least two characters long to be passed through the algorithm: if the user enters a one-character preposition or conjunction, the chance of guessing it correctly is close to zero. The next step was splitting the query into words. We chose the set of characters that words may contain: letters, digits, a dot, and a hyphen (dash). Spaces and all other characters act as word separators.



Users very often enter numbers to indicate volume or quantity, and "correcting" such a query would be wrong. For example, if the user enters "water 1.1 liters" and we rewrite 1.1 as 1.5 or 1.0, that is a mistake. We therefore decided to skip words containing digits: they bypass our algorithm and go straight to the Sphinx full-text search.



Another transformation deals with dots in brand names, for example Dr. Pepper or Mr. Proper. When a dot appears in the middle of a word, we split it into two words by inserting a space. The other dot case is brand abbreviations, such as R.O.C.S.: here we cut out the dots and get a single word. This conversion applies when there is exactly one letter between the dots.



If a word contains a hyphen (dash), we split it into several words, try to guess each part separately, and then glue the best candidates back together.
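Putting the rules above together, the converter can be sketched like this. The function name, the exact regular expressions, and the return shape are our assumptions; the rules themselves (minimum length two, allowed character set, digit words bypassing fuzzy matching, dot and hyphen handling) follow the text.

```python
import re

# Words may contain letters, digits, a dot, and a hyphen;
# everything else separates words.
WORD_RE = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ.\-]+")

def convert(query: str) -> list:
    """Split a query into words and apply the converter rules.
    Returns (word, try_to_guess) pairs; words containing digits are
    kept but flagged to bypass fuzzy matching."""
    result = []
    for token in WORD_RE.findall(query):
        token = token.strip(".-")
        if any(ch.isdigit() for ch in token):
            result.append((token, False))  # pass to full-text as-is
            continue
        # Abbreviation like "r.o.c.s": one letter between dots -> drop dots.
        if re.fullmatch(r"(?:\w\.)+\w?", token):
            token = token.replace(".", "")
        # Otherwise a dot splits the word in two; hyphens split as well.
        for part in re.split(r"[.\-]", token):
            if len(part) >= 2:             # drop one-character words
                result.append((part, True))
    return result
```

For instance, "water 1.1 liters" keeps "1.1" intact but excludes it from guessing, "dr.pepper" becomes two words, and "r.o.c.s" collapses to "rocs".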



As a result, the converter became the key to accurate query recognition: most of the work of tuning the whole feature went into this component, and it is largely what keeps the fuzzy search well adjusted. To recap, the rules by which words are filtered and transformed:

  1. words shorter than two characters are discarded;
  2. words may contain only letters, digits, a dot, and a hyphen; everything else is a separator;
  3. words containing digits are not corrected and go straight to full-text search;
  4. a dot in the middle of a word splits it in two, except for abbreviations with one letter between dots, where the dots are removed;
  5. hyphenated words are split, guessed part by part, and glued back together.

Compiling a dictionary



As mentioned earlier, to correct a user's query we need to determine which words are misspelled and which are not. For this, a dictionary of correctly spelled words was created in the system.



The first question was how to populate the dictionary; in the end we decided to build it from the catalog content. Since product information was periodically imported from an external system and indexed for the Sphinx full-text search, we added the dictionary-building functionality at that stage. The logic was simple: if a word does not occur in the product content, why try to guess it?

Product information was concatenated into a single text and passed through the converter. The converter's behavior was slightly modified for this mode: when splitting hyphenated words, each part was saved to the dictionary separately, and when handling dots, both the original and the modified forms were added. And since word comparison uses the transcription for the Damerau–Levenshtein distance, the transcription is stored in the dictionary as well.



The product content contained many typos, so the dictionary has a flag that, when set, excludes the word from search. The content of 35 thousand products yielded a dictionary of 100 thousand unique words, which in the end proved insufficient for some user queries. We therefore needed a way to enrich it: a console command was created for loading external dictionaries. The data files must be in CSV format, each record containing a single field with the word itself. To distinguish uploaded words from those generated from product content, another flag was added.
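The loading step itself is straightforward. A sketch of the parsing logic (the function name and row shape are our assumptions; the transcription function is injected, since its details were covered earlier):

```python
import csv

def load_dictionary(lines, translit=lambda w: w):
    """Parse a one-column CSV of dictionary words into rows ready for
    insertion: (word, transcription, added_manually, do_not_use).
    `lines` is any iterable of CSV lines; `translit` computes the
    transcription (identity by default, for illustration)."""
    rows = []
    for record in csv.reader(lines):
        if not record:          # skip blank lines
            continue
        word = record[0].strip().lower()
        if word:
            # added_manually=True marks uploaded words so they can be
            # told apart from words generated from product content.
            rows.append((word, translit(word), True, False))
    return rows
```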

As a result, the dictionary table has the following schema:



  Field name       Field type
  Word             string
  Transcription    string
  Added manually   bool
  Do not use       bool


Before the fuzzy search feature was developed, some product fields contained sets of deliberately misspelled words. On the first run of dictionary generation these got into its content, producing a dictionary full of typos that was unusable for the feature. So another console command was added that does the reverse: using the content of a specified product field, it finds those words in the dictionary and removes them. After the cleanup, such fields were excluded from indexing.



Bitrix integration



To implement the minimum required functionality, three classes were created:





For integration into Bitrix, changes had to be made to two components:





Before the query is executed, a processing method is called that detects errors and picks suitable replacements:







To build the dictionary, a handler was registered for the search module's indexing event on the information block elements containing products (search:BeforeIndex).











Future plans



This approach is not ideal in terms of performance. If the dictionary grows to 1M+ words, database response time may increase significantly. While the dictionary stays small, the performance suits us. Later it may become necessary to reimplement the algorithm on a Levenshtein automaton or a prefix tree (trie).
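To give a flavor of the prefix-tree direction, here is a sketch of bounded-distance search over a trie: one Levenshtein DP row is computed per trie edge, and a whole subtree is pruned as soon as every cell in the row exceeds the limit. For brevity this uses plain Levenshtein distance rather than Damerau–Levenshtein (adding the transposition case to the row update is possible but omitted), and all names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None          # set at the node that ends a word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, max_dist=2):
    """Return (word, distance) pairs within max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, max_dist, results):
    # Extend the DP table by one row for the character on this edge.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1,           # insertion
                       prev_row[j] + 1,          # deletion
                       prev_row[j - 1] + cost))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:                     # otherwise prune subtree
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)
```

Unlike the per-row MySQL UDF, this visits only the dictionary prefixes that can still stay within the distance limit, which is what makes the approach attractive for a much larger dictionary.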



Conclusion



No search engine is immune to queries that break the accepted rules of spelling, whether through a random typo or genuine uncertainty about how a word is written. So even without resorting to the classic fuzzy search of Google or Yandex, you can build your own, and both the user and the site owner will get the result they want.



The code for our implementation can be viewed in the repository: github.com/qsoft-dev/damlev-bitrix

Source: https://habr.com/ru/post/432892/


