Over time, Sphinx has accumulated quite a few search and ranking modes. Questions about them come up regularly (anything from "how do I force a document into 1st place" to "how do I display 1 to 5 stars depending on match quality"), and these are really questions about how those modes work internally. In this post I'll cover everything I can remember: how search modes and ranking modes are organized, what the ranking factors are, how exactly the factors are computed, how the final weight is built, all of that. And, of course, the stars!
About search and ranking modes
First, let us see what those modes actually do. The API currently exposes them through two different methods, SetMatchMode() and SetRankingMode(). They look different, but internally they are now the same thing. Up to and including version 0.9.8, only search modes were available, i.e. SetMatchMode(). Each of them was implemented by a separate code branch, and each branch did both the document matching and the ranking itself. And, of course, each did it differently. In version 0.9.8, development of a new unified document matching engine began. To avoid breaking compatibility, that engine was exposed only through the SPH_MATCH_EXTENDED2 mode. By 0.9.9 it became clear that the new engine was generally stable and fast enough (and for any overlooked special case there was always 0.9.8 to fall back on). That allowed us to remove a pile of legacy code, and starting from 0.9.9 all search modes are handled by the new engine. For compatibility, the legacy search modes use a simplified query parser (which ignores the operators of the full-text query language) and automatically select the ranker corresponding to that mode, but that is the only difference. So, in effect, the document weight ( weight ) depends only on the ranking mode (the ranker). Therefore, the following two queries produce identical weights:
// 1
$cl->SetMatchMode ( SPH_MATCH_ALL );
$cl->Query ( "hello world" );

// 2
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$cl->SetRankingMode ( SPH_RANK_PROXIMITY );
$cl->Query ( "hello world" );
In the second variant you can additionally write @title hello (see the query language); in the first you cannot (see compatibility).
Rankers calculate the weight of a document based on several internal factors available to them (and only to them). One could say that different rankers simply combine the factors into the final weight in different ways. The two most important factors are 1) the classical statistical factor BM25, used by most (if not all) search engines since the 80s, and 2) the phrase proximity factor specific to Sphinx.
About BM25
BM25 is a real number in the range from 0 to 1 that depends on the frequencies of the keywords in the current document on the one hand and in the whole document collection on the other, and on those frequencies only. The current BM25 implementation in Sphinx is computed from the total frequency of the words in the document, not just the frequency of the occurrences that actually match the query. For example, for the query @title hello (which matches a single copy of the word hello, in the title), the BM25 factor will be computed exactly the same as for the query @(title,content) hello. This simplified implementation was intentional: in the default ranking mode the BM25 factor is secondary, the ranking differences were insignificant in testing, but the speed difference was quite measurable. The exact calculation algorithm looks like this:
BM25 = 0
foreach ( keyword in matching_keywords )
{
    n = total_matching_documents ( keyword )
    N = total_documents_in_collection
    k1 = 1.2
    TF = current_document_occurrence_count ( keyword )
    IDF = log ( (N-n+1)/n ) / log ( 1+N )
    BM25 = BM25 + TF*IDF/(TF+k1)
}
BM25 = 0.5 + BM25 / ( 2*num_keywords ( query ) )
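For illustration, here is a minimal self-contained sketch of the same computation in Python; the function and parameter names are mine, not Sphinx's:

```python
import math

def sphinx_bm25(matched_tf, doc_freqs, total_docs, num_query_keywords, k1=1.2):
    """Sphinx-style simplified BM25, normalized into the 0..1 range.

    matched_tf -- dict: keyword -> occurrence count in the current document
    doc_freqs  -- dict: keyword -> number of documents containing the keyword
    total_docs -- N, total documents in the collection
    """
    bm25 = 0.0
    for keyword, tf in matched_tf.items():
        n = doc_freqs[keyword]
        idf = math.log((total_docs - n + 1) / n) / math.log(1 + total_docs)
        bm25 += tf * idf / (tf + k1)
    # squeeze the sum into 0..1
    return 0.5 + bm25 / (2 * num_query_keywords)
```

A quick sanity check: a word found in only one document out of 1000 pushes the result towards 1, while a word found in every document pushes it towards 0, exactly as described below.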
TF stands for Term Frequency (the frequency of the keyword in the document being ranked). IDF stands for Inverse Document Frequency. IDF is smaller for keywords that are frequent in the collection, and larger for rare ones. Its peak values are IDF = 1 for a word that occurs in exactly one document, and IDF ~= -1 for a word that occurs in every document. The TF-dependent term TF/(TF+k1) theoretically varies from 0 to 1, depending on k1; with the chosen k1 = 1.2 it actually varies from 0.4545... to 1.
BM25 grows when keywords are rare and occur in the document many times, and falls when keywords are frequent. It is highest when all the query keywords match the document, all of them are super-rare (found in this one document out of the entire collection), and each also occurs in it many times. It is lowest when only one super-frequent word matches the document, a word that occurs in literally every document, and... also occurs in this one many times.
Note that words which are too frequent (occurring in more than half of the documents) actually reduce BM25! Indeed, if a word occurs in every document except one, then that one document without the super-frequent word is special and deserves more weight.
About phrase weight
The phrase weight (also known as query proximity) is computed quite differently. This factor ignores frequencies entirely; instead it accounts for the relative positions of the keywords in the query and in the document. To compute it, Sphinx analyzes the positions of the keywords in every field of the document, finds the longest contiguous match between field and query, and measures that match's length in keywords. Formally speaking, it computes the Longest Common Subsequence (LCS) of keywords between the query and the field being processed, and sets the field's phrase weight to the length of that LCS. In plain language, the phrase (sub)weight is the number of keywords that occurred in the field in exactly the same order as in the query. A few examples:
1) query = one two three, field = one and two three
field_phrase_weight = 2 (the longest match is "two three", 2 keywords long)
2) query = one two three, field = one and two and three
field_phrase_weight = 1 (individual keywords match, but no two of them are adjacent and in query order)
3) query = one two three, field = nothing matches at all
field_phrase_weight = 0
To get the final phrase weight for the document, the per-field phrase weights are multiplied by the user-defined field weights (see the SetFieldWeights() method) and the products are summed. (By the way, field weights default to 1 and cannot be set below 1.) In pseudocode:
doc_phrase_weight = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
doc_phrase_weight += user_weight ( field ) * field_phrase_weight
}
For example:
doc_title = hello world
doc_body = the world is a wonderful place
query = hello world
query_title_weight = 5
query_body_weight = 3
title_phrase_weight = 2
body_phrase_weight = 1
doc_phrase_weight = 2*5+3*1 = 13
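The same calculation can be sketched in Python; a longest-common-substring search over keyword lists stands in for Sphinx's internal position analysis, and all names here are mine:

```python
def field_phrase_weight(query, field):
    """Length of the longest run of query keywords that appears
    contiguously, in the same order, in the field."""
    q, f = query.split(), field.split()
    best = 0
    prev = [0] * (len(f) + 1)
    # dynamic programming over the longest common substring of token lists
    for qt in q:
        cur = [0] * (len(f) + 1)
        for i, ft in enumerate(f, 1):
            if qt == ft:
                cur[i] = prev[i - 1] + 1
                best = max(best, cur[i])
        prev = cur
    return best

def doc_phrase_weight(query, fields, weights):
    """fields and weights are dicts keyed by field name."""
    return sum(weights[name] * field_phrase_weight(query, text)
               for name, text in fields.items())
```

Running it on the example above (title weight 5, body weight 3) reproduces the final value of 13.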
About a bright future
The two factors described above are the major ones, but by no means the only possible ones. Technically, other textual factors could be considered too. For example, computing the "proper" BM25 based on actual matches only; weighting subphrases more cleverly, taking the frequencies of their constituent words into account; additionally accounting for the number of words in the field; and so on. It is also possible to use extra-textual factors at the level of the ranker itself; in other words, to use document attributes during the weight calculation, rather than after it.
About rankers
Finally, about ranking modes, or rankers for short. They combine all the different factors into the final weight. The weights rankers output are integers. The default ranker (in the extended/extended2 modes) is called SPH_RANK_PROXIMITY_BM25 and combines the phrase weight with BM25. The BM25 factor occupies the bottom 3 decimal digits; the phrase weight starts from the 4th digit up. There are two related rankers, SPH_RANK_PROXIMITY and SPH_RANK_BM25. The first returns the phrase weight alone as the document weight. The second honestly computes only BM25; instead of the lengthy calculation, the phrase weight of every matched field is simply taken to be one.
// SPH_RANK_PROXIMITY_BM25 (the default)
rank_proximity_bm25 = doc_phrase_weight*1000 + doc_bm25*999

// SPH_RANK_PROXIMITY (emulates SPH_MATCH_ALL)
rank_proximity = doc_phrase_weight

// SPH_RANK_BM25 (fast, phrase weight ignored)
rank_bm25 = sum ( matching_field_weight )*1000 + doc_bm25*999
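Given the two factors, these final integer weights can be sketched in Python as follows (doc_bm25 is assumed to be the 0..1 value from the BM25 section; all names are mine):

```python
def rank_proximity_bm25(doc_phrase_weight, doc_bm25):
    # phrase weight occupies decimal digits 4 and up, BM25 the bottom three
    return doc_phrase_weight * 1000 + int(doc_bm25 * 999)

def rank_proximity(doc_phrase_weight):
    # the phrase weight alone is the document weight
    return doc_phrase_weight

def rank_bm25(matched_field_weights, doc_bm25):
    # every matched field contributes a phrase weight of exactly one
    return sum(matched_field_weights) * 1000 + int(doc_bm25 * 999)
```

For example, a document with phrase weight 13 and BM25 of 0.5 gets weight 13499 under SPH_RANK_PROXIMITY_BM25, which is how values like "weight = 13499" in real result sets decompose.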
SPH_RANK_PROXIMITY_BM25 is selected by default because there is an opinion that, of the available rankers, it gives the most relevant results. This may change in the future; plans for smarter and generally better rankers do exist. The SPH_RANK_PROXIMITY ranker emulates the SPH_MATCH_ALL mode (incidentally, the very first mode, the one it all started with back in 2001). SPH_RANK_BM25 should be used when the phrase weight is unimportant for whatever reason, or simply when you want to speed up queries. And indeed, the choice of ranker significantly affects query speed! Typically the most expensive part of the calculation is analyzing the keyword positions inside the document. So SPH_RANK_PROXIMITY_BM25 will always be slower than SPH_RANK_BM25, which in turn is always slower than SPH_RANK_NONE (which computes nothing at all).
Another ranker, used to service the legacy search modes, is SPH_RANK_MATCHANY. Like the two proximity rankers above, it computes the subphrase weight, but it additionally counts the number of matched keywords in each field, and combines the two so that a) a longer subphrase in any field ranks the whole document higher; b) given equal subphrase lengths, the document with more matched keywords ranks higher. Roughly speaking: if document A contains a more exact (longer) subphrase of the query than document B, then document A must rank higher. Otherwise (equal subphrase lengths), we simply compare keyword counts. The algorithm is as follows:
k = 0
foreach ( field in all_fields )
k += user_weight ( field ) * num_keywords ( query )
rank = 0
foreach ( field in matching_fields )
{
field_phrase_weight = max_common_subsequence_length ( query, field )
field_rank = ( field_phrase_weight * k + num_matching_keywords ( field ) )
rank += user_weight ( field ) * field_rank
}
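A sketch of the same logic in Python, using the longest-run matching described earlier for the subphrase length (all names are mine, not Sphinx's):

```python
def longest_query_run(q_tokens, f_tokens):
    # longest contiguous run of query keywords appearing in order in the field
    best, prev = 0, [0] * (len(f_tokens) + 1)
    for qt in q_tokens:
        cur = [0] * (len(f_tokens) + 1)
        for i, ft in enumerate(f_tokens, 1):
            if qt == ft:
                cur[i] = prev[i - 1] + 1
                best = max(best, cur[i])
        prev = cur
    return best

def rank_matchany(query, fields, weights):
    q = query.split()
    # k guarantees that a longer subphrase always beats any keyword count
    k = sum(weights[name] * len(q) for name in fields)
    rank = 0
    for name, text in fields.items():
        f = text.split()
        matched = len(set(q) & set(f))
        if matched == 0:
            continue  # only matching fields contribute
        field_rank = longest_query_run(q, f) * k + matched
        rank += weights[name] * field_rank
    return rank
```

With the query "one two three", a field containing the adjacent pair "one two" outranks a field with the same two keywords scattered apart, which in turn outranks a field with a single keyword.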
The SPH_RANK_WORDCOUNT ranker simply sums the number of keyword occurrences in each field, multiplied by the field weight. The only thing simpler is SPH_RANK_NONE, which computes nothing at all.
rank = 0
foreach ( field in matching_fields )
rank += user_weight ( field ) * num_matching_occurrences ( field )
Finally, SPH_RANK_FIELDMASK returns the bit mask of the fields that matched the query. (Admittedly very simple, yes...)
rank = 0
foreach ( field in matching_fields )
rank |= ( 1<< index_of ( field ) )
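Both of these trivial rankers fit in a few lines of Python (again, all names are mine):

```python
def rank_wordcount(query, fields, weights):
    # sum of keyword occurrences per field, times the field weight
    q = set(query.split())
    return sum(weights[name] * sum(tok in q for tok in text.split())
               for name, text in fields.items())

def rank_fieldmask(query, field_order, fields):
    # bit i is set when field number i matched the query
    q = set(query.split())
    mask = 0
    for i, name in enumerate(field_order):
        if q & set(fields[name].split()):
            mask |= 1 << i
    return mask
```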
About the stars
The question regularly comes up of what the maximum possible weight is, and how to correctly convert weights into stars (usually 5 of them, though variants are possible), percentages, or points on an intuitive scale from 1 to 17. As you can see, the maximum weight depends heavily on both the ranker and the query. For example, the absolute maximum weight at the output of SPH_RANK_PROXIMITY_BM25 depends on the number of keywords and the field weights:
max_proximity_bm25 = num_keywords * sum ( field_weights ) * 1000 + 999
But this absolute maximum will only be reached when, firstly, every field contains an exact copy of the query and, secondly, the query searches all the fields. What if the query is restricted to one specific field (for example, @title hello world)? The absolute maximum is then unattainable in principle; for this particular type of query the practical maximum weight will be:
max_title_query_weight = num_keywords * title_field_weight * 1000 + 999
So computing the "correct" maximum weight analytically is rather difficult. If you simply cannot live without stars, you can either fairly easily compute the "absolute" maximum (which will almost never be reached), or better, take the maximum weight within each particular result set and normalize everything against it. The multi-query mechanism lets you do the latter almost for free, but that is a topic for a separate article.
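For illustration, here is a sketch of both normalization options in Python; the 5-star scale and all names are my assumptions, not anything Sphinx provides:

```python
def stars_absolute(weight, num_keywords, field_weights):
    # normalize against the theoretical SPH_RANK_PROXIMITY_BM25 maximum;
    # top ratings will almost never be reached in practice
    max_weight = num_keywords * sum(field_weights) * 1000 + 999
    return max(1, round(5 * weight / max_weight))

def stars_relative(weights):
    # normalize each weight against the best match in this result set;
    # the top document always gets 5 stars
    top = max(weights)
    return [max(1, round(5 * w / top)) for w in weights]
```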
About full matches (update)
A good question came up in the comments: why aren't complete matches (field exactly equals query) ranked higher, and how can that be achieved? The answer lies in the rankers' logic. When the query occurs in a field in full, the default proximity and proximity_bm25 rankers assign that field the maximum phrase weight (and that is their main ranking factor). The length of the field itself is not taken into account, and neither is the fact that the field coincides with the query completely. Such is the historically established behavior; apparently the situation where the query exactly equals a field simply did not come up often enough before.
In the current trunk (version 0.9.10), work is already underway: an experimental ranker SPH_RANK_SPH04 has been added, which ranks full field matches higher. The technical possibility only appeared in 0.9.9, because the 0.9.8 index format does not store the necessary data (for the curious: the stored positions there lack the "this was the end of the field" flag).
Something can be done without the new ranker, too. For example, you can try to boost the weight manually on a full match. To do this, strip all punctuation and upper case from the field itself (at indexing time) and from the query, compute crc32 of the field, and store it as an index attribute. Then, at search time, compute the expression weight + IF(fieldcrc==querycrc,1000,0) and sort the results by that expression. Rather crude, but it can help in some cases.
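The normalize-and-crc32 trick can be sketched in Python like this; the function names and the boost value are mine, and the same normalization must be applied both at indexing time and at query time:

```python
import re
import zlib

def norm_crc(text):
    # strip punctuation and case, collapse whitespace, then crc32
    cleaned = " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())
    return zlib.crc32(cleaned.encode("utf-8"))

def boosted_weight(weight, field_crc, query_crc, boost=1000):
    # emulates the weight + IF(fieldcrc==querycrc,1000,0) expression
    return weight + (boost if field_crc == query_crc else 0)
```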
You can also restate the task slightly, and account not for the fact of a complete match but for the length of the document (or of an individual field). To do this, store LENGTH(myfield) in an attribute at indexing time, and at search time rank by an expression of the form (this is just an example!)
weight + ln(max(len,1))*1000
In some cases (like the song-title indexing example from the comments), it may make sense to index fields not separately but glued together, so that the phrase "artist - song" yields a larger phrase weight when matched against the combined field. Otherwise all fields are treated separately, and a match "across a field boundary" will not rank higher.
About the file (as in, filing things down)
Is there room to hack on all this machinery? Absolutely. Once you understand how the existing rankers are built, which factors they compute and how they combine them, you can immediately and consciously adjust weight (more precisely, specify a new arithmetic expression involving weight and sort the output by it). More interestingly, it is technically possible to add new specialized rankers (a minute of prime-time advertising: the SPH_RANK_WORDCOUNT and SPH_RANK_FIELDMASK rankers were not my idea, they were requested by commercial users). From the C++ code of a ranker, there is immediate access to the query, the keywords, the document being ranked, and (most importantly) the list of all keyword occurrences matching the query along with their positions in the document... yes, out of these locomotive parts it is quite possible to assemble a fighter jet; the main thing is to skillfully apply the magic of the file.