InnoDB full text search

Hi, Habrachitatel!
InnoDB full-text data retrieval is the well-known headache of many MySQL / InnoDB developers. For those who are not up to date, I will explain. In the MyISAM table type, there is a full-featured full-text data search, but the table itself historically has limitations that are fundamental in individual projects. In the more advanced InnoDB table type, there is no full-text search. So poor developers have to put up with either MyISAM limitations or the lack of InnoDB search. I want to talk about how there are ways to organize a full-fledged search in InnoDB without magic and exclusively by regular means. It will also be interesting to compare the speed characteristics of each method.

For example, take a small table with 10,000 entries.

CREATE TABLE users (
id INT ( 11 ) NOT NULL AUTO_INCREMENT ,
login VARCHAR ( 255 ) DEFAULT NULL ,
`password` VARCHAR ( 255 ) DEFAULT NULL ,
name VARCHAR ( 255 ) DEFAULT NULL ,
surname VARCHAR ( 255 ) DEFAULT NULL ,
email VARCHAR ( 255 ) NOT NULL ,
country VARCHAR ( 255 ) DEFAULT NULL ,
city VARCHAR ( 255 ) DEFAULT NULL ,
PRIMARY KEY ( id )
)
ENGINE = INNODB

In this table we store the data of users of the site. On the site itself there is a user search form in which you can enter an arbitrary query like "Tolstoy Yasnaya Polyana". To process such a request, the search must be carried out on several fields at once. We need a search for the fields login , name , surname , city , country . A query can be either a single word (name or city) or in the form of a set of words, separated by a space. The problem is that we need to search for this set of words in several fields at once, which is difficult to do in InnoDB without using additional functions.
')
There are several relatively simple ways to search full-text data in InnoDB:

Consider each of them in more detail.

Using the mirror table in MyISAM

The first proposed method is to create an additional table in MyISAM. As you know MyISAM quite well supports full-text search and it can be used. All data from the main table ( users ) will be copied to this additional table. Synchronization will be provided by triggers. In the new table, add the fields login , name , surname , city , country . Thus, we will create a “mirror” of the main table, and we will work with it. To enable full-text search, add the FULLTEXT index to all 5 fields together:

CREATE TABLE search (
id INT ( 11 ) DEFAULT NULL ,
login VARCHAR ( 255 ) DEFAULT NULL ,
name VARCHAR ( 255 ) DEFAULT NULL ,
surname VARCHAR ( 255 ) DEFAULT NULL ,
country VARCHAR ( 255 ) DEFAULT NULL ,
city VARCHAR ( 255 ) DEFAULT NULL ,
FULLTEXT INDEX IX_search ( city , country , login , name , surname )
)
ENGINE = MYISAM

To synchronize data between the main table and the “mirror” table on users, we set the triggers for writing, changing and reading:

Write trigger:

CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
INSERT INTO search ( `id` , ` login` , `name` , ` surname` , `country` , ` city` ) VALUES (
NEW . `id` ,
NEW . `login` ,
NEW . `name` ,
NEW . `surname` ,
NEW . `country` ,
NEW . `city`
) ;
END

Trigger on change:

CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
DELETE FROM `search` WHERE` id` = NEW . `id` ;
INSERT INTO `search` ( ` id` , `login` , ` name` , `surname` , ` country` , `city` ) VALUES (
NEW . `id` ,
NEW . `login` ,
NEW . `name` ,
NEW . `surname` ,
NEW . `country` ,
NEW . `city`
) ;
END

And a simple delete trigger:

CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
DELETE FROM `search` WHERE` id` = OLD . `id` ;
END

Search by using the following query:

SELECT `users` . * FROM` users`
INNER JOIN `search`
ON `search` . `id` = ` users` . `id`
WHERE
MATCH ( `search` . City , ` search` . Country , `search` . Login , ` search` . Name , `search` . Surname ) AGAINST ( 'Vladimir Tupin St. Petersburg' IN BOOLEAN MODE ) > 0
ORDER BY MATCH ( `search` . City , ` search` . Country , `search` . Login , ` search` . Name , `search` . Surname ) AGAINST ( 'Vladimir Tupin St. Petersburg' IN BOOLEAN MODE ) DESC

Here, data is searched in the search table, the result is sorted by relevance, and at the output we get the corresponding records from the users table.

The main advantage of this approach is the flexibility of the search by adding additional indexes and compiling new search combinations (country + city or login + first name + last name). Thus, we are free to form new search sets and relevance rules.
The disadvantages of this method (as well as all methods with the creation of a “mirror”) are redundant data storage. Therefore, it is advisable to use it with small amounts of data as in our example.

Using a mirrored table in MyISAM with cached data

The second way is also to create a data mirror, but here we will store data in only one field. In the task, the search is carried out immediately by a group of fields and we will try to combine them into one text field, separated by spaces. Thus, the entire data set in the users table will correspond to a single field. Create a search table with two fields id and text . Id will correspond to the id of the main table ( users ), text is our “cached” data.

CREATE TABLE search (
id INT ( 11 ) DEFAULT NULL ,
`text` TEXT DEFAULT NULL ,
FULLTEXT INDEX IX_search_text ( `text` )
)
ENGINE = MYISAM

Synchronization is also performed using triggers:

Addition:

CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
INSERT INTO search ( `id` , ` text` ) VALUES ( NEW . `Id` ,
LOWER (
CONCAT_WS ( ' ,
NEW . `name` ,
NEW . `surname` ,
NEW . `login` ,
NEW . `country` ,
NEW . `city`
)
)
) ;
END

Change:

CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
DELETE FROM search WHERE `id` = NEW . `id` ;
INSERT INTO search ( `id` , ` text` ) VALUES ( NEW . `Id` ,
LOWER (
CONCAT_WS ( ' ,
NEW . `name` ,
NEW . `surname` ,
NEW . `login` ,
NEW . `country` ,
NEW . `city`
)
)
) ;
END CREATE

Uninstall:

TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
DELETE FROM search WHERE `id` = OLD . `id` ;
END

The search query looks like this:

SELECT `users` . * FROM` users`
INNER JOIN `search`
ON `search` . `id` = ` users` . `id`
WHERE
MATCH ( `search`.` Text` ) AGAINST ( 'Vladimir Tupin St. Petersburg' IN BOOLEAN MODE ) > 0
ORDER BY MATCH ( `search`.` Text` ) AGAINST ( 'Vladimir Tupin St. Petersburg' IN BOOLEAN MODE ) DESC

This method is not as flexible as the previous one, however, as we will see later, it wins in speed with a large number of different requests.

Using the keyword table in MyISAM

The third method is based on the creation of a list of "keywords" - search tags. Keywords are fields in the users table. For example, for a user with fields (id=2144; login= leo; name=;surname=;city=' ';country=;email=leo@tolstoy.ru;password=;) keywords will be («leo»; «»; «»; « »; «») . All these words we will write in a separate table MyISAM, in which there will be two fields id and text . Id corresponds to the id of the main table ( users ). And text is the field in which keywords tags will be written. Each user from the users table will correspond to 5 entries in the new search table. Thus, we received the table of tags of each user.

CREATE TABLE search (
id INT ( 11 ) DEFAULT NULL ,
`text` VARCHAR ( 255 ) DEFAULT NULL ,
FULLTEXT INDEX IX_search_text ( `text` )
)
ENGINE = MYISAM

Synchronization of data is also carried out by triggers:

Creature:

CREATE
TRIGGER `insert`
AFTER INSERT
ON users
FOR EACH ROW
BEGIN
INSERT INTO search ( `id` , ` text` ) VALUES
( NEW . `Id` , NEW.` Login` ) ,
( NEW . `Id` , NEW.` Name` ) ,
( NEW . `Id` , NEW.` Surname` ) ,
( NEW . `Id` , NEW.` Country` ) ,
( NEW . `Id` , NEW.` City` ) ;
END

Change:

CREATE
TRIGGER `update`
AFTER UPDATE
ON users
FOR EACH ROW
BEGIN
DELETE FROM search WHERE `id` = NEW . `id` ;
INSERT INTO search ( `id` , ` text` ) VALUES
( NEW . `Id` , NEW.` Login` ) ,
( NEW . `Id` , NEW.` Name` ) ,
( NEW . `Id` , NEW.` Surname` ) ,
( NEW . `Id` , NEW.` Country` ) ,
( NEW . `Id` , NEW.` City` ) ;
END

Uninstall:

CREATE
TRIGGER `delete`
AFTER DELETE
ON users
FOR EACH ROW
BEGIN
DELETE FROM search WHERE `id` = OLD . `id` ;
END

Search query:

SELECT `users` . * FROM` users`
INNER JOIN `search`
ON `search` . `id` = ` users` . `id`
WHERE
MATCH ( `search`.` Text` ) AGAINST ( 'Vladimir Tupin St. Petersburg' IN BOOLEAN MODE ) > 0
GROUP BY `search` . `id`
ORDER BY COUNT ( * ) DESC

Please note that if earlier relevance was determined by the built-in search engine MyISAM, then in this case we define it ourselves. As a result of the search, we received only those tags that match the query. And the more tags one user, the higher it is in the sample.
The given example has a drawback: with an equal number of tags, several records have a natural sorting, which is not always true in terms of relevance.
However, this method has a high potential for further development. First, we can add to the ORDER BY sorting the sum of relevance scores from the MATCH AGAINST query. Thus, the above disadvantage will be eliminated. Secondly, we can add to this table an additional weight field of the weight tag, and each field of the main table should be assigned a value of this weight. Thus, we can add sorting based on the importance (weight) of individual fields. This gives us the opportunity to focus on some fields without compromising the quality of the search.

Parsing the request and direct search in InnoDB

The fourth method is harsh and does not use MyISAM as previous ones. It also does not have additional tables and triggers. We will simply search for the existing table. First we need to index all the fields in which the search will be performed.

CREATE TABLE users (
id int ( 11 ) NOT NULL ,
login VARCHAR ( 255 ) DEFAULT NULL ,
`password` VARCHAR ( 255 ) DEFAULT NULL ,
name VARCHAR ( 255 ) DEFAULT NULL ,
surname VARCHAR ( 255 ) DEFAULT NULL ,
email VARCHAR ( 255 ) NOT NULL ,
country VARCHAR ( 255 ) DEFAULT NULL ,
city VARCHAR ( 255 ) DEFAULT NULL ,
PRIMARY KEY ( id )
INDEX city ( city ) ,
INDEX country ( country )
INDEX email ( email )
INDEX login ( login ) ,
INDEX name ( name ) ,
INDEX password ( password ) ,
INDEX surname ( surname )
)
ENGINE = INNODB

In InnoDB, we can perform a search only with the help of the LIKE operator, but for its effective work it is necessary to split the word query, otherwise queries consisting of several words will remain without result. To break up words and compose a query, we will write a function:

CREATE
FUNCTION search ( str VARCHAR ( 255 ) )
RETURNS varchar ( 255 ) CHARSET cp1251
BEGIN
DECLARE output VARCHAR ( 255 ) DEFAULT '' ;
DECLARE temp_str varchar ( 255 ) ;
DECLARE first_part VARCHAR ( 255 ) DEFAULT "CONCAT_WS (", `name`,` surname`, `login`,` country`, `city`) LIKE '%" ;
DECLARE last_part VARCHAR ( 255 ) DEFAULT "% '" ;

WHILE LENGTH ( str ) ! = 0 DO
SET temp_str = SUBSTRING_INDEX ( str , '' , 1 ) ;
IF temp_str = str
THEN
SET str = '' ;
ELSE
SET str = SUBSTRING ( str , LENGTH ( temp_str ) + 2 ) ;
END IF ;

IF output! = ""
THEN
SET output = CONCAT ( output , 'OR' ) ;
END IF ;

SET output = CONCAT ( output , first_part , temp_str , last_part ) ;

END WHILE;
RETURN output;
END

The function returns us a fragment of the generated search query, which just needs to be substituted and executed:

SET @ WHERE = CONCAT ( 'SELECT * FROM `users` WHERE' , search ( 'Habra Habrovich' ) ) ;
PREPARE prepared FROM @ WHERE ;
EXECUTE prepared;

You can also use temporary tables, they will give a tangible convenience when processing the results of the query.

Using third-party solutions

There are a number of third-party full-text search solutions. The most popular platforms are Sphinx and Apache Lucene- based projects. Their use is meaningless for small amounts of data (such as in our example), and sometimes it is simply impossible due to restrictions (hoster, evil admin, curved hands, etc.).

Comparison

Let's compare the shown methods of full-text search (except third-party solutions) for the speed of execution of typical queries. We will compare on the example of the execution of 50 queries of varying complexity. To do this, we write a PHP script that will objectively calculate the average search speed for each of the above methods. In order to bring measurements closer to real conditions, we will conduct a second control measurement in which the same search queries will be used. Here it will be possible to assess how well MySQL caching mechanisms are used in each method.

Comparison of the speed of execution of search queries in the MySQL database in the InnoDB table using various methods:

More details:

Method	The average speed of a single request (sec.)	Average speed of execution of one repeated request (sec.)
Using the mirror table in MyISAM	0.029738	0.011974
Using a mirrored table in MyISAM with cached data	0.025652	0.012027
Using the keyword table in MyISAM	0.027876	0.008866
Parsing the request and direct search in InnoDB	0.136091	0.09541

As expected, the direct LIKE search in InnoDB turned out to be the slowest and significantly loses everything else. Of course, this method can still be optimized, but this is unlikely to give a significant gain in speed.
The three remaining search methods showed themselves at about the same level. As practice has shown, with a large number of identical queries, the use of keywords (tags) in MyISAM gives a tangible advantage. With a large number of various search queries, the win gives the second method - creating a cached mirror. If some fields are very different in size from others (the content of the article, the text of the news), then the first method shows itself more effectively - creating a mirror table.

Creating MyISAM mirrors should be used for small tables (10-50 thousand records in a table), if there are more records in the table, and technical capabilities allow using third-party mechanisms (Sphinx, Apache Lucene).

Source: https://habr.com/ru/post/114572/

All Articles