In exactly three days we will reveal a lot of secrets to everyone: about tuning, optimization, search quality, and scaling Sphinx (which, for the record, is a full-text search engine and more) in various directions. Details at the very end of the post.
But I will start revealing one of the secrets about search quality right here and now. It is a new thing called the expression ranker (a proper Russian translation has yet to be invented), added in version 2.0.2-beta, and I will tell you about it in more detail under the cut. In short, it lets you set your own ranking formula on the fly, even a separate one for each query. Essentially, a construction kit that gives everyone a chance to try building their own MatrixNet, complete with four-dimensional chess and opera singers.
Right off the bat
Emulating the default ranking mode (for the extended query mode, the one with the syntax) in SphinxQL looks, for example, like this:
SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')
Via SphinxAPI, the same thing looks like this:
$client->SetRankingMode(SPH_RANK_EXPR, "sum(lcs*user_weight)*1000+bm25");
How does it work
In the ranking formula you can use any document attributes and mathematical functions, just as in "regular" expressions. But in addition to those, the ranking formula, and only it, has access to several more values and functions that are specific to ranking: document-level factors, field-level factors, and functions that aggregate over a set of fields. All of these additional factors are textual factors, i.e. numbers that depend on the text of the document and the query and are computed from them on the fly. For example, the number of unique words that matched in the current field, or in the document as a whole, is a textual factor. Science also distinguishes extra-textual factors, i.e. anything that does not depend on the texts: things like the number of page views, the price of a product, and so on. But those can simply be stored in an attribute and then used both inside and outside the ranking formula. For reference, factors are also called signals; the two terms mean the same thing.
What are the factors
The document level factors are as follows:
- bm25, a rough (!) quick estimate of the statistical BM25 function for the entire document. The pedant in me cannot stay silent and not report that, after all the coarsening done for optimization, this is in fact closer to BM15 than to canonical BM25. A clearer explanation for everyone else: it is a kind of magic integer in the range 0 to 999 that grows when the document contains many rare words and drops when it contains many frequent ones.
- max_lcs, the maximum possible value of sum(lcs*user_weight). Used to emulate MATCH_ANY and, in general, useful for any kind of normalization.
- field_mask, a 32-bit mask of the matched fields.
- query_word_count, the number of unique "included" keywords in the query; in other words, the total number of unique keywords adjusted for "excluded" words. For example, in the query (one !two) it will be 1, since the word (two) is excluded and never matches. For the query (one one one !two) it is also 1, since there is still only one unique included word. And for the query (one two three) the value of the factor will accordingly be 3.
- doc_word_count, the number of unique query words that matched the current document. Obviously, it can never exceed query_word_count.
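The two word-count factors can be sketched in a few lines of Python. This is a deliberately simplistic toy illustration, not Sphinx's actual implementation: tokenization is plain whitespace splitting, and the only bit of query syntax modeled is a leading ! marking an excluded word.

```python
# Toy sketch of two document-level factors; NOT Sphinx internals.
# Query syntax is simplified: plain words are "included", "!word" is excluded.

def query_word_count(query: str) -> int:
    """Number of unique included keywords in the query."""
    words = query.lower().split()
    included = {w for w in words if not w.startswith("!")}
    excluded = {w.lstrip("!") for w in words if w.startswith("!")}
    return len(included - excluded)

def doc_word_count(query: str, document: str) -> int:
    """Number of unique included query words that occur in the document."""
    words = query.lower().split()
    included = {w for w in words if not w.startswith("!")}
    excluded = {w.lstrip("!") for w in words if w.startswith("!")}
    doc_words = set(document.lower().split())
    return len((included - excluded) & doc_words)

print(query_word_count("one !two"))          # 1
print(query_word_count("one one one !two"))  # 1
print(query_word_count("one two three"))     # 3
print(doc_word_count("one two three", "one three four"))  # 2
```

The last call illustrates the invariant from the list above: only two of the three included query words occur in the document, so doc_word_count stays at or below query_word_count.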
The field level factors are as follows:
- lcs, the same magic factor that measures the degree of "phrase match" between the query and the field. Formally, it is the length of the longest common subsequence of words between the query and the field. It equals 0 if nothing in the field matched at all; 1 if at least something matched; and, in the limit, the number of (included) query words if the query matched the field perfectly.
- user_weight, the user-assigned field weight, set via SetFieldWeights() or OPTION field_weights, respectively.
- hit_count, the number of matched keyword occurrences. One keyword can produce several occurrences. For example, if the query (hello world) matched a field in which hello occurs 3 times and world 5 times, hit_count will be 8. If, however, the phrase hello world occurs exactly once in the field (even though the individual words occur 3 and 5 times) and, in addition, the query was "hello world" in quotes, hit_count will be 2. In total there are still 8 word occurrences, but in the second case only 2 of them match the phrase.
- word_count, the number of unique matched words in the field (NOT occurrences). In both of the previous examples it equals 2.
- tf_idf, the sum of TF*IDF over all matched keywords. TF is simply the number of occurrences (Term Frequency), while IDF is another magic metric that accounts for the "rarity" of a word: for ultra-frequent words (found in every document) it equals 0, and for a unique word (1 occurrence in 1 document across the entire collection) it equals 1.
- min_hit_pos, the position of the very first matched keyword occurrence in the field. Numbering starts at 1. Useful, for example, for ranking matches near the beginning of a field higher.
- min_best_span_pos, the first position of the "best" (largest-lcs) span of keywords. Numbering again starts at 1.
- exact_hit, a boolean flag that is raised when the query matches the field exactly and completely. Its values are, accordingly, 0 and 1.
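Several of these field-level factors are easy to sketch on toy data. The following Python is an illustration of the definitions above, not Sphinx internals: it tokenizes by whitespace, treats the query as a plain bag of words (no operators or phrase syntax), and computes lcs with the classic dynamic-programming algorithm for longest common subsequence.

```python
# Toy sketch of several field-level factors; NOT Sphinx internals.
# Whitespace tokenization, lowercase words, query treated as a bag of words.

def field_factors(query: str, field: str) -> dict:
    q_words = query.lower().split()
    f_words = field.lower().split()
    matched = set(q_words) & set(f_words)

    # hit_count: total occurrences of matched keywords in the field
    hit_count = sum(1 for w in f_words if w in matched)
    # word_count: unique matched words (NOT occurrences)
    word_count = len(matched)
    # min_hit_pos: 1-based position of the first matched occurrence (0 if none)
    min_hit_pos = next((i + 1 for i, w in enumerate(f_words) if w in matched), 0)
    # exact_hit: the field coincides with the query word-for-word
    exact_hit = int(f_words == q_words)

    # lcs: longest common subsequence of words, classic O(n*m) DP
    n, m = len(q_words), len(f_words)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if q_words[i] == f_words[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[n][m]

    return {"hit_count": hit_count, "word_count": word_count,
            "min_hit_pos": min_hit_pos, "exact_hit": exact_hit, "lcs": lcs}

f = field_factors("hello world", "say hello world hello")
print(f)  # {'hit_count': 3, 'word_count': 2, 'min_hit_pos': 2, 'exact_hit': 0, 'lcs': 2}
```

Note how hit_count counts all three occurrences (hello twice, world once) while word_count counts only the two unique matched words, exactly the distinction drawn in the list above.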
There is currently exactly one aggregation function, SUM. The expression inside it is evaluated for every field, and the results are then summed. There can be several of these SUMs, each with a different expression inside. For obvious reasons, field-level factors may occur strictly inside an aggregation function: in the end we need to compute exactly one number, and without being "bound" to a specific field such factors have no physical meaning. Document-level factors and attributes can, of course, be used anywhere in the expression.
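Conceptually, a formula like sum(lcs*user_weight)*1000+bm25 is therefore evaluated per field inside the SUM and once per document outside it. A toy Python sketch with made-up factor values (an illustration of the evaluation order, not Sphinx internals):

```python
# Toy sketch of SUM() aggregation over fields; NOT Sphinx internals.
# Assumed per-field factor values for a hypothetical 2-field (title, body) document:
fields = [
    {"lcs": 2, "user_weight": 5},  # title
    {"lcs": 1, "user_weight": 1},  # body
]
bm25 = 420  # assumed document-level factor value

# ranker=expr('sum(lcs*user_weight)*1000+bm25') evaluates roughly like this:
# the inner expression runs once per field, the rest once per document.
weight = sum(f["lcs"] * f["user_weight"] for f in fields) * 1000 + bm25
print(weight)  # 11420
```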
How to emulate existing rankers
All the previously existing rankers are, in fact, extremely simple when expressed as the fancy new formulas: at most two or three factors each. Here is the full list:
- SPH_RANK_PROXIMITY_BM25 = sum(lcs*user_weight)*1000+bm25
- SPH_RANK_BM25 = bm25
- SPH_RANK_NONE = 1
- SPH_RANK_WORDCOUNT = sum(hit_count*user_weight)
- SPH_RANK_PROXIMITY = sum(lcs*user_weight)
- SPH_RANK_MATCHANY = sum((word_count+(lcs-1)*max_lcs)*user_weight)
- SPH_RANK_FIELDMASK = field_mask
- SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
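As a sanity check of how these formulas read, here is a toy Python evaluation of each one over made-up per-field and document-level factor values. Again, this is an illustration of the formulas as written above, not Sphinx internals.

```python
# Toy evaluation of the legacy-ranker formulas; NOT Sphinx internals.
# All factor values below are assumed, for a hypothetical 2-field document.
fields = [
    {"lcs": 2, "user_weight": 5, "hit_count": 3, "word_count": 2,
     "min_hit_pos": 1, "exact_hit": 0},
    {"lcs": 1, "user_weight": 1, "hit_count": 7, "word_count": 1,
     "min_hit_pos": 4, "exact_hit": 0},
]
doc = {"bm25": 420, "max_lcs": 3, "field_mask": 0b11}

rankers = {
    "PROXIMITY_BM25": lambda: sum(f["lcs"] * f["user_weight"]
                                  for f in fields) * 1000 + doc["bm25"],
    "BM25":           lambda: doc["bm25"],
    "NONE":           lambda: 1,
    "WORDCOUNT":      lambda: sum(f["hit_count"] * f["user_weight"]
                                  for f in fields),
    "PROXIMITY":      lambda: sum(f["lcs"] * f["user_weight"] for f in fields),
    "MATCHANY":       lambda: sum((f["word_count"] + (f["lcs"] - 1) * doc["max_lcs"])
                                  * f["user_weight"] for f in fields),
    "FIELDMASK":      lambda: doc["field_mask"],
    "SPH04":          lambda: sum((4 * f["lcs"] + 2 * (f["min_hit_pos"] == 1)
                                   + f["exact_hit"]) * f["user_weight"]
                                  for f in fields) * 1000 + doc["bm25"],
}
for name, formula in rankers.items():
    print(name, formula())  # each formula yields exactly one number per document
```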
Emulation will, of course, run slower than the corresponding built-in ranker: compiled code is still faster than our expression interpreter! However, the slowdown, to my continuing surprise, is often insignificant. Even when the search matched and had to rank hundreds of thousands or millions of documents, my "micro" benchmarks showed differences of about 30-50% (literally, around 0.5-0.6 seconds with emulation instead of 0.4 seconds with the built-in ranker). I suspect that if fewer than 1-10K documents match, the difference will be impossible to notice at all.
What's next!?
What can you do with all of this? From the standpoint of improving search quality, quite a lot has become possible. In effect, you can now tweak the ranking however you like: a bunch of new factors that were never taken into account before has appeared, and the formula can be tuned right on the fly. It has also become technically quick and easy to add new factors on request; indeed, a number of factors were added precisely at the request of commercial customers.
Clearly, this is only the tip of the iceberg, and a lot of questions immediately arise: how to measure that very "quality", how exactly to tune the formulas, and so on. But I write slowly and talk quickly, so if you want to hear this post live and in more detail, learn twice as much about search relevance and quality, and also catch a few more talks about everything else announced at the beginning of the post, then welcome to the conference. (St. Petersburg, December 4, Sunday. It is free, but registration is required. There are still a few seats left, but by now you must really hurry.)
That's all for now, everyone, and good luck with your search quality :)