
Sphinx distributed: speeding up search

I recently had to revisit our setup of the Sphinx full-text search engine, because some frequent queries took seconds and a few took more than ten. After looking for bottlenecks and ways to optimize, I found a simple way to improve performance: parallelizing the load across multiple threads. The result was a substantial reduction in query time.

One of the unpleasant things about Sphinx is how little information about it is available in Russian. Surprised that the topic of load distribution had not been covered, I decided to share this solution on Habr.

Objective: improve Sphinx performance by splitting the load across multiple threads.

Solution: split the indexes and specify the number of threads in the configuration.

Execution threads


Let's start with the simpler part: specifying the number of execution threads. Suppose our server has a quad-core processor, so a reasonable choice is four threads. To set this, use the dist_threads directive in the searchd section of the configuration file.

searchd
{
    ...
    dist_threads = 4
    ...
}

The directive sets the maximum number of threads used to process a single query. The default value is 0, which means multi-threading is not used.

Index Separation


Next, we split the index so that each thread processes its own subset of records. For example, suppose our table has 1,000,000 rows and we have four threads. Each thread should then process 1,000,000 / 4 = 250,000 rows, after which Sphinx collects the results from all threads and returns the most relevant matches. It stands to reason that four threads each processing 250,000 records will finish almost four times faster than one thread processing 1,000,000 records.
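To make the idea concrete, here is a toy sketch in Python of searching several parts separately and then merging the per-part results into one ranked list. It is only a conceptual illustration, not how Sphinx is implemented: the scoring (counting occurrences of the query word) and the names search_shard and search_distributed are assumptions made up for this example.

```python
import heapq


def search_shard(shard, query):
    """Return (score, doc_id) pairs for one part of the data, best first.

    The score is a toy relevance measure: how many times the query occurs.
    """
    hits = [(text.count(query), doc_id)
            for doc_id, text in shard.items() if query in text]
    return sorted(hits, reverse=True)


def search_distributed(shards, query, limit):
    """Search every part separately, then merge the ranked lists."""
    per_shard = [search_shard(shard, query)[:limit] for shard in shards]
    # Each list is already sorted best-first, so a k-way merge is enough.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(merged)[:limit]


# Hypothetical documents, split into four parts by id % 4.
docs = {
    1: "sphinx search engine",
    2: "search search search",
    3: "mysql database",
    4: "full-text search with sphinx search",
    5: "sphinx",
    6: "search",
}
shards = [{i: t for i, t in docs.items() if i % 4 == r} for r in range(4)]

top = search_distributed(shards, "search", limit=3)
print(top)
```

In this sketch the parts are searched one after another; the point is only that the merged result matches what a search over the whole data set would return.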

Suppose we have some source and index:

source books
{
    type = mysql
    sql_query = SELECT id, name FROM tb_books
}

index books
{
    source = books
    min_infix_len = 3
}

The min_infix_len directive is left in purely as an example.
To divide the index into four parts, create four sources, each limited to its own subset of records, and assign each one to its own index:

source books_base
{
    type = mysql
}

source books0: books_base
{
    sql_query = SELECT id, name FROM tb_books WHERE id % 4 = 0
}

source books1: books_base
{
    sql_query = SELECT id, name FROM tb_books WHERE id % 4 = 1
}

source books2: books_base
{
    sql_query = SELECT id, name FROM tb_books WHERE id % 4 = 2
}

source books3: books_base
{
    sql_query = SELECT id, name FROM tb_books WHERE id % 4 = 3
}

index ind_books_base
{
    min_infix_len = 3
}

index ind_books0: ind_books_base
{
    source = books0
}

index ind_books1: ind_books_base
{
    source = books1
}

index ind_books2: ind_books_base
{
    source = books2
}

index ind_books3: ind_books_base
{
    source = books3
}

index ind_books
{
    type = distributed
    local = ind_books0
    local = ind_books1
    local = ind_books2
    local = ind_books3
}

The easiest way to divide the table into approximately equal parts is to filter on the remainder of id divided by the number of parts, as in the queries above, but this is not the only way. Alternatively, you can use the sql_query_range directive, but in my case that approach did not work because id values are distributed non-uniformly across the table.
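The effect of a non-uniform id distribution is easy to check on a small example. The Python sketch below compares part sizes for a split by id % parts against a split into equal-width id intervals (used here as a simplified stand-in for range-based splitting). The id set is hypothetical: a dense block of ids followed by a sparse tail.

```python
from collections import Counter


def shard_sizes_modulo(ids, parts):
    """Number of ids that land in each part when splitting by id % parts."""
    counts = Counter(i % parts for i in ids)
    return [counts[r] for r in range(parts)]


def shard_sizes_by_range(ids, parts):
    """Number of ids in each part when the id interval is cut into equal widths."""
    lo, hi = min(ids), max(ids)
    step = (hi - lo) // parts + 1  # width of one id interval
    counts = Counter((i - lo) // step for i in ids)
    return [counts[r] for r in range(parts)]


# Hypothetical ids: 900 consecutive values, then 90 values spread far apart.
ids = list(range(1, 901)) + list(range(10_000, 100_000, 1_001))

print("modulo:", shard_sizes_modulo(ids, 4))
print("ranges:", shard_sizes_by_range(ids, 4))
```

With ids like these, the modulo split stays almost perfectly balanced, while the equal-width intervals put nearly all records into the first part.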

It is good practice to define parent sources and indexes, inherit from them, and move the shared directives into the parents. Here I moved the type and min_infix_len directives there.

To have a single index that can be queried for results, we created the ind_books index of type distributed; its local directives list the names of the indexes whose results are to be combined.
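Writing these blocks by hand gets tedious as the number of parts grows. As a rough sketch, the configuration from the listing above can be generated with a short Python script. The function name build_sharded_config and its parameters are made up for this example, and, as in the listing, connection settings and index paths are left out.

```python
def build_sharded_config(name, table, columns, parts, index_options):
    """Return config text: `parts` modulo-split sources and indexes
    plus one distributed index that combines them."""
    lines = [f"source {name}_base", "{", "    type = mysql", "}", ""]
    for r in range(parts):
        lines += [
            f"source {name}{r}: {name}_base",
            "{",
            f"    sql_query = SELECT {columns} FROM {table} WHERE id % {parts} = {r}",
            "}",
            "",
        ]
    # Shared index directives live in the parent index.
    lines += [f"index ind_{name}_base", "{"]
    lines += [f"    {key} = {value}" for key, value in index_options.items()]
    lines += ["}", ""]
    for r in range(parts):
        lines += [
            f"index ind_{name}{r}: ind_{name}_base",
            "{",
            f"    source = {name}{r}",
            "}",
            "",
        ]
    # The distributed index lists every part as a local index.
    lines += [f"index ind_{name}", "{", "    type = distributed"]
    lines += [f"    local = ind_{name}{r}" for r in range(parts)]
    lines += ["}"]
    return "\n".join(lines)


config = build_sharded_config(
    "books", "tb_books", "id, name", 4, {"min_infix_len": 3}
)
print(config)
```

Changing the number of parts then only requires changing one argument, and the id % N conditions stay consistent across all sources.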

Delta


If you use a delta index, the simplest solution is to merge it into one of the resulting indexes.

Keep in mind, however, that if the delta grows too large, the index it is merged into becomes noticeably bigger than the others, which can hurt performance. To prevent this, it is best to merge the delta into each of the indexes in turn.
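One way to merge "in turn" is to cycle through the parts, choosing the next one for each merge. The Python sketch below only builds the indexer --merge command lines and does not run them. The delta index name ind_books_delta is hypothetical, and remembering which part comes next between runs is left out of the sketch.

```python
import itertools


def merge_commands(shards, delta="ind_books_delta"):
    """Yield one `indexer --merge` command per merge, cycling through the parts.

    `delta` is a hypothetical name for the delta index.
    """
    for shard in itertools.cycle(shards):
        # indexer --merge <destination index> <source index> --rotate
        yield ["indexer", "--merge", shard, delta, "--rotate"]


shards = ["ind_books0", "ind_books1", "ind_books2", "ind_books3"]

# Show the commands for the next five merges.
for command in itertools.islice(merge_commands(shards), 5):
    print(" ".join(command))
```

Each scheduled merge would take the next command from the generator, so the growth from the delta is spread evenly over all four parts.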

Ultimately, this approach reduced my query times by a factor of 2 to 10.

Source: https://habr.com/ru/post/263849/

