
I think I'm beginning to understand what you meant!

Making a typo is easy; making one in a search query, doubly so. All the big web search engines today can, one, correct misspelled keywords and, two, suggest queries; smaller search engines want to do the same after them. Both features can be neatly implemented with an open-source search engine named Sphinx; in this post I will tell you exactly how.

Well then, here's to did you mean ("what did you mean?") and all other query completion ("aren't you looking for Vasya?").


Better listen to your favorite song "Valenki" instead!


Let's start with the simple one: query completion. The user begins typing into the search box, and as they type you want to show suggestions right away (so that they don't actually search for anything, but listen to "Valenki" like everybody else). Perhaps even with result counts attached. How do you do it, and where do you get the data?

There are roughly two options: you can suggest users' actual queries directly, or you can suggest individual keywords.

Suggesting keywords is especially easy. The indexing program, with the unexpected name indexer, has two equally unexpected switches: --buildstops, which makes indexer build a list of the N most frequent words (i.e. a dictionary) instead of doing its usual indexing, and, on top of that, --buildfreqs, which makes it count the frequencies as well (i.e. a frequency dictionary). We take indexer and build the 10 thousand (or 100 thousand) most frequent words over our document collection:

 $ indexer myindex --buildstops dict.txt 10000 --buildfreqs


We get a file that looks something like this:

 i 9533843
 and 5427048
 to 5418872
 the 5371581
 a 4282225
 you 2877338
 ...


The rest is a matter of technique. We create a small SQL table, write a ten-line import script, use a plain LIKE for the suggestions, and sort the results by word frequency. LIKE over an index turns out to be reasonably fast, since it matches against the beginning of the key. For 1-letter queries it is worth precomputing and caching the results, so as not to shovel through thousands of rows at every keystroke. Although the MySQL query cache should, in theory, save you there, if it is enabled.

 CREATE TABLE keywords
 (
    keyword VARCHAR(255) NOT NULL,
    freq INTEGER NOT NULL,
    INDEX (keyword, freq)
 );

 SELECT * FROM keywords WHERE keyword LIKE 'valen%' ORDER BY freq DESC LIMIT 10
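
For completeness, the lookup on the web side might look something like this. A minimal sketch assuming PDO and the keywords table above; the DSN, credentials, and the prefix parameter are placeholders, and escaping of LIKE wildcards in user input is left out:

 // hypothetical autocomplete endpoint; $_GET['prefix'] is what the user typed
 $db = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
 $stmt = $db->prepare(
    "SELECT keyword, freq FROM keywords
     WHERE keyword LIKE CONCAT(?, '%')
     ORDER BY freq DESC
     LIMIT 10");
 $stmt->execute(array($_GET['prefix']));
 foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row)
    print $row['keyword'] . " (" . $row['freq'] . ")\n";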


Queries work almost the same way. Neither the table nor the LIKE goes anywhere; only instead of individual words we now want full query strings. You can and should take them from your query log, and count the frequencies from there as well. The log will first need a little work with a file: from the search engine's point of view, the line "vasya pupkin" matches "Vasya! Pupkin!", so counting them as different queries is not quite comme il faut. The Sphinx API's BuildKeywords() method covers exactly this: take an arbitrary query string, build keywords from it (with case folding and all that), and reassemble a normalized query string. It is best to do all this over a persistent connection established with the Open() method, otherwise it will run many times slower. From there it is the well-trodden path: log, file, sort, uniq -c, import script, SQL table, LIKE, profit. The filing should look something like this:

 $cl = new SphinxClient();
 $cl->Open();
 foreach ($log as $entry)
 {
    $keywords = $cl->BuildKeywords($entry, "myindex", false);
    foreach ($keywords as $keyword)
       print $keyword["tokenized"] . " ";
    print "\n";
 }


The import script that loads a two-field text file into the SQL database is left as homework for the readers.
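
(For the impatient, here is a minimal sketch of such a script anyway, assuming the count-then-string lines that sort | uniq -c produces in queries.txt, and a queries table shaped like keywords above; every name here is illustrative:)

 // hypothetical importer for "   <count> <query>" lines from uniq -c
 $db = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
 $stmt = $db->prepare("INSERT INTO queries (query, freq) VALUES (?, ?)");
 foreach (file('queries.txt') as $line)
    if (preg_match('/^\s*(\d+)\s+(.+)$/', rtrim($line), $m))
       $stmt->execute(array($m[2], (int)$m[1]));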

Hint taken!



Having dealt with suggestions, let's move on to correcting typos. As is well known, a typoed Britney can be entered in slightly less than 600 different ways. And she also has a surname. And people may not be searching for her at all! A query with typos will simply find nothing; the page will be empty; AdSense/YDirect/Unameit will show bad ads; nobody will click; your startup will go bust; there will be nobody left to buy commercial Sphinx services, and that project will die too. This is unacceptable; the keywords urgently need correcting!

Granted, there is always the option of bolting on ispell, aspell, hunspell, or whatever-spell-is-trendy-today. Granted, that always runs into either the quality of the xxxspell dictionary, or simply its absence for the language you need. Granted, it does not help with neologisms (helper), specialist terms (acidium acetylosalicylium), geographic names, and so on. For all that, we still want more. Besides, this blog is not about ispell; one has to live up to it.

Again we need a frequency dictionary, though this time rather bigger than 10K keywords: "correcting" a rare but valid word into a close frequent one is not worth it. A dictionary of 1 million words should usually suffice, and 10 million absolutely should, i.e. the command becomes indexer --buildstops dict.txt 10000000 --buildfreqs MYINDEXNAME (which runs at over 20 MB/sec on a C2D E8500, by the way). Unlike with suggestions, SQL search over such a dictionary no longer helps: there is more data, and the query type is different. But Sphinx itself helps.

The main idea is as follows. For each word in the dictionary we generate a set of trigrams, i.e. runs of 3 consecutive characters. We index the trigrams with Sphinx. To search for replacement candidates, we build trigrams for the misspelled word too, and look them up in that index. This yields several candidates. The more trigrams match, the less the word length differs, and the more frequent the found variant is, the better. Now let's walk through all of this in more detail, with a live example.
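
To make the trigram format concrete, here is a minimal sketch of a helper that builds them; the two-underscore padding matches the rows shown below, and the function itself is my illustration, not part of the Sphinx API:

 // pad with two underscores on each side, then slide a 3-character window:
 // "light" -> "__l _li lig igh ght ht_ t__"
 // NB: for UTF-8 data you would need mb_substr/mb_strlen instead
 function BuildTrigrams($word)
 {
    $t = "__" . $word . "__";
    $trigrams = array();
    for ($i = 0; $i <= strlen($t) - 3; $i++)
       $trigrams[] = substr($t, $i, 3);
    return implode(" ", $trigrams);
 }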

The dictionary created by indexer --buildstops still looks the same (I picked a different chunk so that the words are longer and the example clearer):

 ...
 deal 32431
 created 32429
 light 32275
 needed 32252
 mood 32185
 death 32140
 behind 32136
 usually 32113
 action 32053
 line 32052
 pissed 32043
 ...


For each word we need to create a unique ID, store the word itself and its frequency, build the trigrams, and save all of this to the database. If the indexed collection contains typos, it makes sense to drop the rarest words, since with good probability they are typos themselves.

 CREATE TABLE suggest (
    id INTEGER PRIMARY KEY AUTO_INCREMENT NOT NULL,
    keyword VARCHAR(255) NOT NULL,
    trigrams VARCHAR(255) NOT NULL,
    freq INTEGER NOT NULL
 );

 INSERT INTO suggest VALUES
 ...
 (735, 'deal', '__d _de dea eal al_ l__', 32431),
 (736, 'created', '__c _cr cre rea eat ate ted ed_ d__', 32429),
 (737, 'light', '__l _li lig igh ght ht_ t__', 32275),
 (738, 'needed', '__n _ne nee eed ede ded ed_ d__', 32252),
 (739, 'mood', '__m _mo moo ood od_ d__', 32185),
 (740, 'death', '__d _de dea eat ath th_ h__', 32140),
 (741, 'behind', '__b _be beh ehi hin ind nd_ d__', 32136),
 (742, 'usually', '__u _us usu sua ual all lly ly_ y__', 32113),
 (743, 'action', '__a _ac act cti tio ion on_ n__', 32053),
 (744, 'line', '__l _li lin ine ne_ e__', 32052),
 (745, 'pissed', '__p _pi pis iss sse sed ed_ d__', 32043),
 (746, 'bye', '__b _by bye ye_ e__', 32012),
 ...
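
The import pass that produces such rows might be sketched like this, reusing the BuildTrigrams() helper from above; dict.txt is in the word-frequency format shown earlier, and the threshold value and connection details are placeholders:

 // drop the rarest words: with good probability they are typos themselves
 define('FREQ_THRESHOLD', 10);
 $db = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
 $stmt = $db->prepare(
    "INSERT INTO suggest (keyword, trigrams, freq) VALUES (?, ?, ?)");
 foreach (file('dict.txt') as $line)
 {
    list($word, $freq) = explode(' ', trim($line));
    if ((int)$freq >= FREQ_THRESHOLD)
       $stmt->execute(array($word, BuildTrigrams($word), (int)$freq));
 }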


Only the field with the trigrams needs to be full-text indexed, but to rank the candidates (i.e. to choose the best correction) you will also need the word length and the word's frequency in the collection.

 sql_query = SELECT id, trigrams, freq, LENGTH(keyword) AS len FROM suggest
 sql_attr_uint = freq
 sql_attr_uint = len


We identify suspicious words from the results of the search query: if there are too few results (or none at all), we analyze the $result["words"] section of the response and look at the document count for each word. If a word matches few documents, we try to correct it. For example, for the query "green liight" in my test index, the number of occurrences is 34421 for "green" and only 2 for "liight". Which of the two to send off to correctional work is immediately clear. The specific thresholds for "few" are very individual for different collections of documents and queries. Look at your own dictionary and query log and pick the magic constants.
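
In code, this detection step might look roughly as follows; a sketch, with the threshold of 100 documents as a made-up magic constant:

 // run the user's query, then inspect the per-keyword document counts
 $result = $cl->Query($query, 'myindex');
 $suspicious = array();
 if ($result && !empty($result['words']))
    foreach ($result['words'] as $word => $info)
       if ($info['docs'] < 100) // magic constant, collection-dependent
          $suspicious[] = $word;
 // $suspicious now holds the words to send off to correctional work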

We build the trigrams and run a query against the trigram index. Since the word was typed with errors, it is unlikely that all trigrams will match. On the other hand, a candidate with only one matching trigram is of little interest either: that can happen only if three letters in the middle of the word coincide (and nothing else), or one letter at the beginning does (and nothing else). So we use the quorum operator, which searches for exactly that: it returns all documents where at least 2 trigrams matched. We also add a length restriction: we assume that the length of the correct variant differs by no more than 2 characters.

 $ len = strlen ("liight");
 $ cl-> SetFilterRange ("len", $ len-2, $ len + 2);
 $ cl-> Query ('"__l _li iig igh ght ht_ ht __" / 2', 'suggest');


The pile of found candidates must now be ranked, and the best one picked out of it. Recall the factors we have:
  1. the more trigrams matched, the better;
  2. the smaller the difference in word length, the better;
  3. the more frequent the found variant, the better.


A fresh version of Sphinx can compute all these factors and sort by them entirely on the server side. The number of matched trigrams is computed by the SPH_RANK_WORDCOUNT ranker (in our special search, each trigram acts as a separate keyword). The difference in word length equals abs(len-$len), and the frequency is stored in the freq attribute. We compute the factors, add them up with certain weights, and pick the best candidate:

 $cl->SetMatchMode(SPH_MATCH_EXTENDED2);
 $cl->SetRankingMode(SPH_RANK_WORDCOUNT);
 $cl->SetSelect("*, @weight + 2 - abs(len-$len) AS myrank");
 $cl->SetSortMode(SPH_SORT_EXTENDED, "myrank DESC, freq DESC");


Hooray, it works! For the word liight, the correction light is successfully found. (More precisely, Sphinx finds the ID, by which the string "light" is then fetched from the database.)

This is exactly how the demo bundled with Sphinx 0.9.9-rc2 is built (see the misc/suggest directory inside the archive); you can try it on your own data right away, without writing any extra code :-)

The demo is, understandably, imperfect and subject to further work with a file. (Sorry, I could not resist.) There is a danger that out of the box PHP will not work with Russian, since UTF-8 is expected and substr is used, so without mbstring.overload=7 you will not get far. You will almost certainly have to tune FREQ_THRESHOLD, i.e. the minimum number of occurrences below which a word is considered a typo and is not put into the special index: lower it for small data collections, raise it for large ones. For the same reason (so as not to rank rare garbage above frequent non-garbage), you may need to tweak the myrank formula. For example, add an extra point to the number of matched trigrams for every thousandfold difference in frequency:

 $cl->SetSelect("*, @weight + 2 - abs(len-$len) + ln(freq)/ln(1000) AS myrank");


Besides, the trigram trick, effective as it is, is very simple and ignores a lot. It ignores the order of the trigrams, though for words of reasonable length this is not a problem. More interestingly, it ignores how people actually make mistakes: by trigram count, transposing 2 adjacent letters is indistinguishable from replacing those letters with any (!) other ones (for example, light/lihgt/limit); it ignores the proximity of letters on the keyboard; it ignores phonetic proximity (actually/akshully). A fairly obvious idea for improving correction quality with little blood: instead of the 1 best variant, fetch 10-20 of them, compute the Levenshtein distance on the client, and adjust the rankings, as sketched below. With big blood, you can outright re-score a dozen or two candidates with any other hand-rolled algorithm.
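
The little-blood variant might be sketched like this, using PHP's built-in levenshtein(); LookupKeyword() stands in for a hypothetical SELECT of the keyword by ID from the suggest table:

 // fetch 20 candidates instead of 1, then re-rank by edit distance
 $cl->SetLimits(0, 20);
 $result = $cl->Query($trigramQuery, 'suggest');
 $cands = array();
 foreach ($result['matches'] as $id => $match)
 {
    $word = LookupKeyword($db, $id); // hypothetical DB lookup by document ID
    $cands[] = array('word' => $word,
       'lev' => levenshtein($typo, $word),
       'freq' => $match['attrs']['freq']);
 }
 // fewer edits wins; among equals, the more frequent word wins
 function CmpCandidates($a, $b)
 {
    if ($a['lev'] != $b['lev'])
       return $a['lev'] - $b['lev'];
    return $b['freq'] - $a['freq'];
 }
 usort($cands, 'CmpCandidates');
 $best = $cands ? $cands[0]['word'] : null;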

In general, the demo will work out of the box. But it is just a demo, and nobody has cancelled the vast room for further creative work. Create, invent, write blogs, send patches!

Source: https://habr.com/ru/post/61807/

