
Searching three times faster: multi-queries and faceted search

In today's article I'll tell you about a Sphinx feature called multi-queries: the optimizations built into it, implementing so-called faceted search with it, and, in general, how it can sometimes make search three times faster.

But first, 15 seconds of self-promotion (if you don't praise yourself, nobody else will). This year Sphinx made it to the second round of the SourceForge Awards 2009 in the SysAdmins and Enterprise categories (rumor has it we fell just short in the Developers category). Voting runs for another week (until the 20th), and nothing is needed besides a working email address. Thanks in advance to everyone who doesn't let us sink!

But back to business. What are multi-queries, anyway, and where does the promised "three times faster" come from?

Multi-queries are a mechanism that lets you send several search queries to the server in a single batch.
The API methods that implement the multi-query mechanism are called AddQuery() and RunQueries(). (By the way, the "regular" Query() method uses them internally: it calls AddQuery() once and then immediately RunQueries().) AddQuery() captures the current state of all query settings made by previous API calls and memorizes the query. The settings of an already stored query cannot be changed any more, no further API calls affect them, so for subsequent queries you can use any other settings (a different sorting mode, different filters, and so on). RunQueries() actually sends all the stored queries in one packet and returns several result sets. There are no restrictions on the queries taking part. Just in case, their number is capped by the max_batch_queries directive (added in 0.9.10; before that the limit was hard-coded at 32), but this is really just a sanity check against broken packets.

Why use multi-queries? Generally speaking, it all comes down to performance. First, by sending queries to searchd in a single packet, we always save a little time and resources on network round trips. Second, and much more importantly, searchd gets a chance to perform certain optimizations over the batch as a whole. New optimizations get added over time, so it makes sense to send queries in batches whenever you can: after upgrading Sphinx, new batch optimizations kick in fully automatically. When no batch optimization can be applied, the queries are simply processed one by one, with no visible difference for the application.

Why (or rather, when) should you NOT use multi-queries? All queries in a batch must be independent, and sometimes they are not: query B may depend on the results of query A. For example, we may want to show results from an additional index only when nothing was found in the main index. Or simply pick a different offset into the second result set depending on the number of matches in the first. In such cases you have to use separate queries (or separate batches).
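As a sketch of that dependent case (assuming the usual PHP API and hypothetical index names "main" and "extra"), the fallback logic has to stay sequential, because we must see query A's result before deciding to run query B:

 require("sphinxapi.php");

 $cl = new SphinxClient();
 $cl->SetMatchMode(SPH_MATCH_EXTENDED2);

 // Query B depends on the outcome of query A, so they cannot
 // go into one batch: run A, inspect it, then maybe run B.
 $res = $cl->Query("ipod nano", "main");
 if ($res !== false && $res["total_found"] == 0) {
     // nothing in the main index -- only now fall back to the extra one
     $res = $cl->Query("ipod nano", "extra");
 }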

There are two important batch optimizations you should know about: common query optimization (available since version 0.9.8) and common subtree optimization (available since the development version 0.9.10).

Common query optimization works like this: searchd picks out of the batch all queries that differ only in their sorting and grouping settings, while the full-text part, the filters, and so on are identical, and runs the search only once. For example, if a batch holds 3 queries whose text part is "ipod nano" in every case, but the 1st query selects the 10 cheapest results, the 2nd groups the results by store ID and sorts the stores by rating, and the 3rd simply selects the maximum price, the search for "ipod nano" runs only once, and 3 differently sorted and grouped responses are built from its results.

So-called faceted search is a special case this optimization applies to. It can in fact be implemented by running several search queries with different settings: one for the main search results, plus a few more with the same text query but different grouping settings (top-3 authors, top-5 stores, etc.). As long as everything except sorting and grouping is identical, the optimization kicks in and the speed improves nicely (example below).
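A faceted-search batch might look like this (a sketch; the index name "products" and the attributes author_id and store_id are made up for illustration). Only the limits and grouping settings differ between the three queries, so the common query optimization should apply:

 require("sphinxapi.php");

 $cl = new SphinxClient();
 $cl->SetMatchMode(SPH_MATCH_EXTENDED2);

 // 1) main results: the 10 best matches
 $cl->SetLimits(0, 10);
 $cl->AddQuery("ipod nano", "products");

 // 2) facet: top-3 authors by match count
 $cl->SetLimits(0, 3);
 $cl->SetGroupBy("author_id", SPH_GROUPBY_ATTR, "@count desc");
 $cl->AddQuery("ipod nano", "products");

 // 3) facet: top-5 stores by match count
 $cl->SetLimits(0, 5);
 $cl->SetGroupBy("store_id", SPH_GROUPBY_ATTR, "@count desc");
 $cl->AddQuery("ipod nano", "products");

 // one round trip, one full-text search, three result sets
 $results = $cl->RunQueries();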

Common subtree optimization is even more interesting. It lets searchd exploit similarities between different queries within a batch. Common parts are detected within all the separate (different!) full-text queries, and if there are any, the intermediate results of their computation are cached and shared between the queries. For example, in this batch of 3 queries

 barack obama president
 barack obama john mccain
 barack obama speech


there is a common two-word part ("barack obama") that can be computed exactly once for all three queries and cached. That is the common subtree optimization. The maximum cache size per batch is strictly limited by the subtree_docs_cache and subtree_hits_cache directives, so that if the common part "i am" occurs in a hundred million documents, the server does not suddenly run out of memory.
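These limits go into the searchd section of sphinx.conf; a sketch (the sizes here are made up, tune them to your data):

 searchd
 {
     # ... listen, log, pid_file, etc. ...

     # per-batch caps on the common subtree caches (0.9.10+)
     subtree_docs_cache = 8M
     subtree_hits_cache = 16M

     # sanity limit on batch size (0.9.10+; hard-coded to 32 before)
     max_batch_queries = 32
 }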

Let's get back to the common query optimization. Here is an example of code that runs the same query with three different sorting modes:

 require("sphinxapi.php");
 $cl = new SphinxClient();
 $cl->SetMatchMode(SPH_MATCH_EXTENDED2);

 $cl->SetSortMode(SPH_SORT_RELEVANCE);
 $cl->AddQuery("the", "lj");
 $cl->SetSortMode(SPH_SORT_EXTENDED, "published desc");
 $cl->AddQuery("the", "lj");
 $cl->SetSortMode(SPH_SORT_EXTENDED, "published asc");
 $cl->AddQuery("the", "lj");
 $res = $cl->RunQueries();


How do you know whether the optimization kicked in? If it did, the corresponding query log lines will contain a "multiplier" field showing how many queries were processed together:

 [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/rel 747541 (0,20)] [lj] the
 [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the
 [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the


Note the "x3": that's it. It means the query was optimized and processed as part of a batch of 3 queries (this one included). For comparison, here is what the log looks like when the same queries are sent one at a time:

 [Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext2/0/rel 747541 (0,20)] [lj] the
 [Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext2/0/ext 747541 (0,20)] [lj] the
 [Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext2/0/ext 747541 (0,20)] [lj] the


You can see that with multi-queries the search time of each query improved by a factor of 1.5 to 2.3, depending on the sorting mode. And that is not the limit. For both optimizations there are cases where the speed improved 3 or more times, not on synthetic tests but in real production. Common query optimization fits vertical product search and online stores particularly well, and the common subtree cache fits data-mining queries; but applicability is by no means limited to these areas. For example, you can run a search with no full-text part at all and compute several different reports (with different sorting, grouping, etc.) over the same data in a single batch.
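That last trick, several reports over the same data, might be sketched like this (assuming a hypothetical index "sales" with price and store_id attributes; an empty query string in SPH_MATCH_FULLSCAN mode matches all documents, so there is no full-text part at all):

 require("sphinxapi.php");

 $cl = new SphinxClient();
 // empty query + fullscan mode: no full-text search, just attribute work
 $cl->SetMatchMode(SPH_MATCH_FULLSCAN);

 // report 1: the 10 cheapest items
 $cl->SetSortMode(SPH_SORT_EXTENDED, "price asc");
 $cl->SetLimits(0, 10);
 $cl->AddQuery("", "sales");

 // report 2: item counts per store, biggest stores first
 $cl->SetGroupBy("store_id", SPH_GROUPBY_ATTR, "@count desc");
 $cl->AddQuery("", "sales");

 // both reports come back from a single batch
 $reports = $cl->RunQueries();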

What other optimizations can you expect in the future? That depends on you. So far, the long-term plan includes an obvious optimization for identical queries with different sets of filters. Do you know another frequent pattern that could be cleverly optimized? Let us know!

Source: https://habr.com/ru/post/64318/
