
On the fight for quality

In exactly three days we will reveal a lot of secrets to everyone: about tuning, optimization, search quality, and scaling Sphinx in various directions (Sphinx is still a full-text search engine, and more besides). Details are at the very end of the post.

But I will start revealing one of the secrets about search quality right here and now. It is a new feature called the expression ranker, added in version 2.0.2-beta (a proper Russian translation of the name has not been invented yet), and I will describe it in more detail below the cut. In short, it lets you specify your own ranking formula on the fly, even a separate one for each query. Essentially, it is a construction kit that gives everyone the chance to try building their own MatrixNet, complete with four-dimensional chess and opera singers.

Right off the bat


Emulating the default ranking mode (for the extended query mode, the one with query syntax) looks, for example, like this in SphinxQL:
SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')

Via SphinxAPI, respectively:
$client->SetRankingMode(SPH_RANK_EXPR, "sum(lcs*user_weight)*1000+bm25");


How it works


In the ranking formula you can use any document attributes and mathematical functions, just as in “regular” expressions. In addition, the ranking formula (and only the ranking formula) has access to several more values and functions that are specific to ranking: document-level factors, field-level factors, and functions that aggregate over a set of fields. All these additional factors are textual factors, i.e. numbers that depend on the text of the document and of the query and are computed from them on the fly. For example, the number of unique words matched in the current field, or in the whole document, is a textual factor. The science also recognizes extra-textual factors, i.e. anything that does not depend on the texts, such as the number of page views, the price of a product, and the like. Those can simply be stored in an attribute and then used both inside and outside of ranking. For reference, factors are also called signals. It is the same thing.
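
To make the distinction concrete, here is a toy Python sketch. It is not Sphinx code: the tokenizer, the function name unique_matched_words, and the sample document are all assumptions made up for illustration. It computes one textual factor (the number of unique query words matched in a field, or in the whole document) and keeps one extra-textual value (page views) that does not depend on the query at all.

```python
import re


def unique_matched_words(query, text):
    """Count distinct query words that also occur in the text (toy tokenizer)."""
    def tokenize(s):
        return set(re.findall(r"\w+", s.lower()))
    return len(tokenize(query) & tokenize(text))


# Hypothetical document: two text fields plus one attribute.
doc = {
    "title": "Hello world",
    "body": "A small hello from the search engine",
    "views": 1200,  # extra-textual: does not depend on the query text
}

query = "hello world"

# Textual factor, computed per field and for the document as a whole.
per_field = {name: unique_matched_words(query, doc[name]) for name in ("title", "body")}
whole_doc = unique_matched_words(query, doc["title"] + " " + doc["body"])

print(per_field, whole_doc, doc["views"])
```

The per-field and whole-document values change when the query changes, while the views attribute stays the same for any query.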

What factors there are


The document level factors are as follows:

The field level factors are as follows:

There is currently exactly one aggregation function, SUM. The expression inside the function is evaluated for every field, and the results are then summed. You can use several SUM calls, each with a different expression inside.
For obvious reasons, field-level factors may occur only inside an aggregation function. In the end we need to compute exactly one number, so without being “bound” to a specific field such factors have no meaning. Document-level factors and attributes can, of course, be used anywhere in the expression.
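
The way SUM collapses field-level factors into a single number can be sketched in a few lines of Python. This is only a toy model of the formula sum(lcs*user_weight)*1000+bm25 from the beginning of the post, not how Sphinx evaluates it internally; the helper field_sum and all the factor values are invented for illustration.

```python
def field_sum(fields, expr):
    """Toy SUM(): evaluate expr for every field and add up the results."""
    return sum(expr(factors) for factors in fields.values())


# Hypothetical per-field factor values for one matched document.
fields = {
    "title": {"lcs": 2, "user_weight": 3},
    "body": {"lcs": 1, "user_weight": 1},
}
bm25 = 540  # document-level factor, usable outside of SUM

# sum(lcs*user_weight)*1000+bm25
weight = field_sum(fields, lambda f: f["lcs"] * f["user_weight"]) * 1000 + bm25

# Several SUMs with different expressions can be combined in one formula.
weight2 = (field_sum(fields, lambda f: f["lcs"]) * 10
           + field_sum(fields, lambda f: f["user_weight"]))

print(weight, weight2)
```

Note that lcs and user_weight appear only inside the lambda passed to field_sum: outside of it there is no “current field” to read them from, which mirrors the rule that field-level factors must sit inside the aggregation function.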

How to emulate existing rankers


All the previously existing rankers turn out to be extremely simple when written as these shiny new formulas: two or three factors per ranker at most. Here is the list:

Emulation will, of course, run slower than the built-in ranker. Compiled code is still faster than our expression evaluator! However, the slowdown, somewhat to my continued surprise, is often insignificant. Even when the search matched and had to rank hundreds of thousands or millions of documents, my “micro” benchmarks showed differences of about 30-50% (literally, about 0.5-0.6 seconds with emulation instead of 0.4 seconds with the built-in ranker). I suspect that if fewer than 1-10k documents match, the difference will not be noticeable at all.

What's next!?


What can you do with all this? In terms of improving search quality, quite a lot has become possible. In fact, you can now tweak the ranking however you please. A bunch of new factors that were never computed before have appeared, and you can now tune the formula right on the fly. It has also become technically possible to add new factors on request fairly quickly and easily; feel free to get in touch, a number of the factors were added precisely at the request of commercial customers.
Clearly this is only the tip of the iceberg, and a lot of questions arise immediately: how to measure this “quality”, how exactly to tune the formulas, and so on. But I write slowly and speak quickly, so if you want to hear this post live and in more detail, learn twice as much about relevance and search quality, and also listen to several more talks about everything else announced at the beginning of the post, then welcome to the conference. (St. Petersburg, Sunday, December 4. It is free, but registration is required. There are still a few places left, but you need to hurry right now.)

Cheers, everyone, and good luck with search quality :)

Source: https://habr.com/ru/post/133790/

