📜 ⬆️ ⬇️

Make the right did you mean

Zatakt: this is my first post, and the first post as always pancake :).

Recently, a task was received to modernize the search on the site, and it so happened that it was necessary to make the functionality “Did You Mean”.

By the way, thank you very much comrade alexblack alexblack for his article Yandex-like do-it- yourself search , without it I would be like without hands :)
')
Now I will start listing how I did all this. PHP, MySQL database, site language - English.
(the right decision is at the end :)

0) Creating indexes

I wrote a small script that runs around the entire content of the site (since it’s all in the database) and creates a guess_word table with the word fields — the word itself, search_index — the word index (soundex or metaphone, depending on the settings) and count — the word frequency.

All words were previously converted to lowercase and everything except letters was cut out of them.

1) soudex + meeting frequency

If the word was not found, then we take its soudex index and look in the database back sorting by frequency of encounters.
It works, but badly. Not always the most common word is the correct clue.

2) soundex + levenshtein

We find all the words with the same soundex index, then select from them the one with the lowest Levenshteyn distance.
Much better, but still there are problems. On the standard typo, teh finds tea (and not the, for the distance between teh and the is greater than between teh and tea).

3) metaphone

Refused the metaphone, as it gives the worst results. (Although strange, of course).

4) Calculate the correct distance between words.

Levenshtein distance takes into account letter replacements, deletions and insertions, but does not take into account the exchange of places.

Idea A: we count how many letters there are in a word, and add the result to the distance. at the same time, in the distance of the Levenshtein we make the price for the removal / insertion lower than for the replacement, because the removal / insertion is already calculated.
Result A: not that. now te is closest to teh

Idea B: play around with the prices in levenshte.
Result B: a little bit not what I expected. For example, when raising the price of a replacement, the algorithm decides that it is more profitable to convert teh through the removal and insertion. Well, of course, but not suitable.

Idea C: abandon levenshteyn in favor of self-writing function.
function words_dist($str_a, $str_b)
{
if ($str_a == $str_b) return 0;

// (begin of) . btlfsa
// ( php.net/levenshtein)
// btlfsa teh:tehh = 0, words_dist teh:tehh = 1

$arr_a = array();
for ($i = 0; $i < strlen($str_a); $i++) {
$arr_a[$str_a{$i}] = (array_key_exists($str_a{$i}, $arr_a) ? $arr_a[$str_a{$i}] : 0) + 1;
}

$arr_b = array();
for ($i = 0; $i < strlen($str_b); $i++) {
$arr_b[$str_b{$i}] = (array_key_exists($str_b{$i}, $arr_b) ? $arr_b[$str_b{$i}] : 0) + 1;
}

foreach($arr_a as $k=>$v) {
if (!array_key_exists($k, $arr_b)) $arr_b[$k] = 0;
}

$dst = 0;
foreach ($arr_b as $k=>$v) $dst += abs((array_key_exists($k, $arr_a) ? $arr_a[$k] : 0) - $v);

// (end of)

// /
$dst *= 2;

if (strlen($str_a) < strlen($str_b))
{
$l = strlen($str_b)-strlen($str_a);
for ($i = 0; $i < $l; $i++) $str_a .= ' ';
}
elseif (strlen($str_b) < strlen($str_a))
{
$l = strlen($str_a)-strlen($str_b);
for ($i = 0; $i < $l; $i++) $str_b .= ' ';
}

//
for ($i = 0; $i < strlen($str_a); $i++) { if ($str_a{$i} != $str_b{$i}) $dst++; }

return $dst;
}

Link to the code without comments (just in case)

Result C: works great. (for now :), maybe later there will be various inaccuracies)

5) Correct the search by index

soudex index does not always give the same index for similar words. For example, stcok and stages have the same indexes, while stcok and stock are different (still funny theaters and tetrachloride).

Idea A: Divide the index into 2 parts (letter and number) and search from [letter] [number-2] to [letter] [number + 2]
Result A: I did not do it, because I thought that they could be mistaken in the first letter, and then soudex is powerless.

Idea B: Add the same first letter to the words, for example, 'L'. since now the first letter is the same, putting it in the index does not make sense. $ index = substr (soundex ('L'. $ word), 1);
Result B: exceeded my expectations. I did not even have to do a search by range (-2 ... +2).

*) Results

a) index soundex
b) before indexing, add the same first letter to the words, $ index = substr (soundex ('L'. $ word), 1)
c) from the list of possible words choose a samopina function, and not use the Levenshtein distance

And also as a bonus: we will make a coloring from the guessed word (we will highlight the wrong letters in red). The code is quite obvious and voluminous, so I’ll just give you a link to it: zame-dev.org/pub/highlight.html

Demonstration

upd: probably it is necessary to search all the same by the range, because soundex ('Lteom') is not the same as soundex ('Lterm') (but still much closer than soundex ('teom') and soundex ('term'))

Source: https://habr.com/ru/post/24504/


All Articles