📜 ⬆️ ⬇️

Sphinx: delta indexes and multiple search servers

More or less big project sooner or later comes to the need for full-text content search.
For this purpose, the search engine Sphinx was invented.

When the base becomes large or the indexes a lot of reindexing begins to take quite a long time, which can have various negative consequences for the project. At this point, it is worth thinking about using delta indexes .
The author was faced with this need at the moment when the reindexing began to take more than an hour.

But this is all described in detail in the documentation and is done quite simply:
source src_mysql { type = mysql sql_host = localhost sql_user = sphinx sql_pass = secret sql_range_step = 1000 } source src_news : src_mysql { sql_db = project sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_pre = REPLACE INTO sph_counter SELECT 'src_news', MAX(id) FROM news sql_query_range = SELECT MIN(id), MAX(id) FROM news sql_query = SELECT id as news_id, title, content FROM news WHERE id>=$start AND id<=$end } source src_news_delta : src_news { sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_range = sql_query = SELECT id as news_id title, content FROM news \ WHERE id > ( SELECT max_value FROM sph_counter WHERE source = 'src_news') sql_query_post = REPLACE INTO sph_counter SELECT 'src_news', MAX(id) FROM news } 


Now consider the case when there are several indexing servers. There may be several reasons for this, but a problem appears: one server, after indexing, rewrites the value of the last ID in the database. When starting indexing on the next server, the records indexed by the previous server will not be included in the selection. There are "gaps" in the indices and search results will be constantly different, and random results will be missed.
')
What to do?

It is necessary in the sph_counter table to additionally store the ID of the indexing server.

Add a hostname column.

As a result, the table will have the following form:
 CREATE TABLE IF NOT EXISTS `sph_counter` ( `source` varchar(100) NOT NULL DEFAULT '', `max_value` bigint(20) NOT NULL, `hostname` varchar(50) NOT NULL DEFAULT '', PRIMARY KEY (`source`,`hostname`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

* Pay attention to the index, it is necessary for the correct operation of the REPLACE INTO

Further, depending on your configuration option 2:

1. The Sphinx process runs on the same server as the database, master-master replication is configured

In this situation, the problem is solved quite simply: use the global variable MySQL - 'hostname'
Our configuration takes the following form:
 source src_news : src_mysql { sql_db = project sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_pre = REPLACE INTO sph_counter SELECT 'src_news', MAX(id), @@hostname FROM news sql_query_range = SELECT MIN(id), MAX(id) FROM news sql_query = SELECT id as news_id, title, content FROM news WHERE id>=$start AND id<=$end } source src_news_delta : src_news { sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_range = sql_query = SELECT id as news_id, title, content FROM news \ WHERE id > ( SELECT max_value FROM sph_counter WHERE source = 'src_news' AND hostname = @@hostname) sql_query_post = REPLACE INTO sph_counter SELECT 'src_news', MAX(id), @@hostname FROM news } 


2. MySQL and Sphinx servers are different, requests go to master

In such a situation, using the global hostname variable no longer works - we will always get the name of the server on which the database is running.

Output: we will use the connection information and user variables.

 source src_news : src_mysql { sql_db = project sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_pre = SELECT @sphinx_instance:=IF(STRCMP(@sphinx_host:=SUBSTRING_INDEX(host,':',1),'localhost'),@sphinx_host,@@hostname) AS sphinx_instance \ FROM information_schema.processlist WHERE ID=connection_id(); sql_query_pre = REPLACE INTO sph_counter SELECT 'src_news', MAX(id), @sphinx_instance FROM news sql_query_range = SELECT MIN(id), MAX(id) FROM news sql_query = SELECT id as news_id, title, content FROM news WHERE id>=$start AND id<=$end } source src_news_delta : src_news { sql_query_pre = SET NAMES utf8 sql_query_pre = SET CHARACTER SET utf8 sql_query_pre = SELECT @sphinx_instance:=IF(STRCMP(@sphinx_host:=SUBSTRING_INDEX(host,':',1),'localhost'),@sphinx_host,@@hostname) AS sphinx_instance \ FROM information_schema.processlist WHERE ID=connection_id(); sql_query_pre = REPLACE INTO sph_counter SELECT 'src_news', MAX(id), @sphinx_instance FROM news sql_query_range = sql_query = SELECT id as news_id, title, content FROM news \ WHERE id > ( SELECT max_value FROM sph_counter WHERE source = 'src_news' AND hostname = @sphinx_instance) sql_query_post = REPLACE INTO sph_counter SELECT 'src_news', MAX(id), @sphinx_instance FROM news } 


This configuration will work in both cases.

Source: https://habr.com/ru/post/146999/


All Articles