We did you mean part two

Hello

Not so long ago I wrote about the correct did you mean .
Despite all my improvements, guess-ing still often was wrong, and gave out strange results.

However, recently, I managed to significantly improve the quality of guess-inga, and I decided that it would be nice to write a “patch” to my previous article :)

')

Pre-intro

Anticipating in advance the accusations of doing unnecessary work in our age (“why do we need all this if there is aspell / lucene / substitute to taste”),
I will also answer in advance that:

Lucene / SpellChecker is far from perfect, and in some cases loses to my solution
Aspell is good, but Aspellov suggest-ing, being written in php, inhibits godlessly

Ingredients

Everything is designed for PHP4 / PHP5 + MySQL

Bug work

Problem number one is the selection of the initial array of words. In the previous version, I used soundex for this purpose, but the result did not suit me at all, because along with the words
having a chance to be the right result, the words were chosen completely far from the result.

The second problem is the very function of determining the distance between words. The function used by me in the previous version works, of course, better than the editing distance (like
Levenshtein distance), but still for some cases works just the same magic way.

Selection of the initial array of words

Unlike the first time, when it was necessary to quickly give out something working, the second time was more time, so I decided to smoke a materiel in the form of a wiki and various
open source programs with suggesting function. As a result, a search based on n-gram was chosen.

Those who wish to enrich themselves with knowledge, I suggest to proceed to Wikipedia , but I still tell you very briefly:
n-gram is a n-character substring. For example, for the stock line you can select 2 x 4-gram: stoc and tock (as well as 3 x 3-gram-a and 4 x 2-gram-a).

Ideally, each n-gram should be represented as a vector in n-dimensional space, and the distance between them should be calculated using the scalar product of these vectors,
but in order to do this, it is necessary to do it (I have not found ready-made solutions). And do laziness. As a result, I stopped at the solution applied in SpellChecker (plugin for Lucene).

Create a table for the indices (suppose it is called guess_index ), with the following fields:

id - what about without it
word - the word itself
gram1 - 1- grames (n-grames inside fields are separated by spaces)
gram2 - 2-gram
gram3 - 3-gram
gram4 - 4-gram

To search for possible options, we divide the search word into n-grames, and find all the records with matching n-grames in the corresponding fields.
Example for the word bakc :

1-gram: bakc
2-gram: ba ak kc
SQL: WHERE gram1 LIKE '% b%' OR gram1 LIKE '% a%' OR gram1 LIKE '% k%' OR gram1 LIKE '% c%' OR gram2 LIKE '% ba%' OR gram2 LIKE '% ak%' OR gram2 LIKE '% kc%'

Further, for all the found words, we calculate their rating: each n-gram-a match adds a point to the result, the starting n-gram-a match - 3 points, the final one - 2 points.
Then we normalize the result by dividing it by the minimum of the number of n-gram-s in the desired word and the number of n-gram-s in the word from the base
(it may be worth dividing by the maximum, but for my case the minimum gives the best results).

All this can be seen in the function of the function of the file search_guess.php

We get the result

To begin with, sort the resulting array of words by rating and, sparing nothing, we’ll erase half (the words with a low rating are unlikely to be the desired ones, but
in the result due to the magic in words_dist).

Ideally, the function words_dist should consider:

The number of extra / missing letters, and non-linear (the more mistakes there are, the more likely it is that another word was meant).
Change letters in places. Again, the greater the distance between the exchange letters, the more likely it is that another word was meant.
Do not confuse clerks (incorrect letters in the same place) with changing letters in places.
For all the preceding paragraphs, consider the possibility of such an error when typing on the keyboard (the distance from the typed letter to the desired letter on the keyboard).
The word rating obtained by n-gram search.

But as a result, I just modified the previous version :) (see the function words_dist from the file search_guess.php ),
but he became better honestly :)

Exceptions

However sad it may sound, some words are simply impossible to find with the help of such an implementation of n-gram search, for example, ma gr in - margin.
Currently, the best way out of the situation (the result / amount of time spent on implementation) I consider the exception table, which will be listed
pairs of the wrong word / correct word.

Demo

I invite you to try out the work "guessing" :)

Conclusion

The year will soon be 2009, and on many sites and forums (especially forums) there is still a poorly understood search (or nonexistent). I hope my articles
will help to bring the quality of search to the search giants a bit :) (well, or, if it’s really lazy to do, Google Custom Search is a good way out)

Link to the topic: Yandex-like search do it yourself .

Source: https://habr.com/ru/post/26966/

All Articles