Full-text search and its capabilities

Many DBMS support full-text search methods (Fulltext search), which allow you to very quickly find the necessary information in large amounts of text.

Unlike the LIKE operator, this type of search provides for the creation of the corresponding full-text index, which is a peculiar dictionary of references to words in fields. A word is usually understood as a collection of at least 3 non-whitespace characters (but this can be changed). Depending on the data of the dictionary, relevance can be calculated - a comparative measure of the query matching the found information.

The article describes how to work with full-text search on the example of MySQL databases, as well as give examples of the "non-standard" use of this mechanism.
')
In MySQL, full-text search capabilities (only for MyISAM-tables) are supported starting from version 3.23.23. In subsequent versions, the mechanism suffered significant improvements and extensions, in turn, becoming a powerful tool for creating search engines for web applications. The main feature is a quick search for words in very large volumes of textual information.

FULLTEXT Index

So, to work with full-text search, we first need to create the corresponding index. It is called FULLTEXT , and can be superimposed on the CHAR, VARCHAR and TEXT fields. And, as in the case of a regular index - if a search is performed on 2 fields, then a combined index of 2 fields is needed, use a search on one field - only an index of this field is needed. For example:

CREATE TABLE `articles` (
`id` int(10) unsigned NOT NULL auto_increment,
`title` varchar(200) default NULL,
`body` text,
PRIMARY KEY (`id`),
FULLTEXT KEY `ft1` (`title`,`body`),
FULLTEXT KEY `ft2` (`body`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

This example creates a table with 2 full-text indexes: ft1 and ft2 , which can be used to search in the title and body fields, or only in the body . Only in the title field will not work.

MATCH-AGAINST construction

Actually for the most full-text search in MySQL, the MATCH (filelds) ... AGAINST (words) construct is used. It can work in various modes, which are quite different from each other. For all, the following rule applies: this construction returns conditional relevance, but the method of calculating which may be different depending on the mode. Another thing to add is that in all modes, the search is always case-sensitive. Further details on each of them.

MATCH-AGAINST IN NATURAL LANGUAGE MODE

- this is the main type of search, which is used by default, i.e. if mode is not specified:

SELECT * FROM `articles` WHERE MATCH (title,body) AGAINST ('database');

In this example, we search for the word database in the title and body fields of the articles table based on the index ft1 (see the example of creating the table above). The sample will be automatically sorted by relevance - this happens if the MATCH-AGAINST clause is specified inside the WHERE block and the ORDER BY sort condition is not specified.

By the way, despite the possibilities of aliases, when querying, the structure has to be repeated in different places, which complicates the queries. For example, you cannot write like this:

SELECT *, MATCH (title,body) AGAINST ('database') as REL
FROM `articles`
WHERE REL > 0;

- this request will generate an error: the Rel field is undefined . What would work would have to duplicate this design:

SELECT *, MATCH (title,body) AGAINST ('database') as REL
FROM `articles`
WHERE MATCH (title,body) AGAINST ('database') > 0;

However, no matter how much you use the same construction (of course, with the same parameters), it will be calculated only once .

In the example above, in the variable REL , relevance will be calculated. This value depends primarily on the number of words in the tilte and body fields, how close the word is to the beginning of the text, the ratio of the number of words encountered to the number of all words in the field, etc.

For example, the relevance will not be null if the word database is found either in the title or body , but if it meets both there and there, the value of relevance will be higher than if it occurs twice in body .

Relevance alone does not determine anything. This is only a comparative characteristic by which you can sort the result of the sample, nothing more.

It should also be noted that the so-called “50% threshold” is valid for IN NATURAL LANGUAGE MODE. This means that if a word is found in more than 50% of all the fields viewed, then it will not be taken into account, and a search for this word will not produce results.

MATCH-AGAINST IN BOOLEAN MODE

In binary mode, unlike other modes, relevance is calculated somewhat differently - as a conditional measure of the coincidence of a given template. The position of the desired template in the text, the number of encountered options do not matter.

The most important feature of the binary mode is the ability to specify logical operators . I will not give the operators themselves; they are well described in the original MySQL documentation .

Another feature of the binary mode is the lack of automatic sorting in case the WHERE clause is specified, however, an alias can be used for sorting:

SELECT *,
MATCH (title,body) AGAINST ('+database MySQL' IN BOOLEAN MODE) as REL
FROM `articles`
WHERE MATCH (title,body) AGAINST ('+database MySQL' IN BOOLEAN MODE)
ORDER BY REL;

The example will display all records containing the word database , but if the word MySQL is in the record, its relevance will be higher. Entries will be sorted by relevance.

In binary mode, there is no limit of “50% threshold” . Binary mode can be used without creating a full-text index, but this will work very slowly.

MATCH-AGAINST IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION

Or simply "WITH QUERY EXPANSION" . It works in approximately the same way as NATURAL LANGUAGE MODE, with the only difference, then the search result includes not only matches with the template, but also possible logical matches. It works like this:

First, MySQL performs a query similar to NATURAL LANGUAGE MODE and generates a result. According to this result, an attempt is made to calculate words that also have a high relevance for the resulting sample. If these words are present, a search is also performed for them too, but their significance for relevance will be significantly lower. A mixed sample is offered - first, those results where the word is present, and then those that were obtained as a result of a “repeated” search.

WITH QUERY EXPANSION is not recommended for large volumes of information, as the result can get a lot of excess.

Using FULLTEXT SEARCH

A couple of words about the search algorithms

Well, of course, full-text search can be used primarily to write search algorithms. :-) I will not focus on them, I’ll just say that when indexing text information, you may need a complex processing algorithm, like this:

remove all HTML tags
remove all non-printable characters, punctuation and the like
remove all words less than 3 characters long
translate all words to lower case

- this is only in the simplest case, without taking into account the morphology, highlighting, keyword accounting and encoding.

Accordingly, with the search query must be done the same. The search mode is used any way it is more convenient ... In general, the search is a separate topic, about which a separate article is needed.

Much-to-many disclosure

In some cases - not all - with the help of full-text search, it is possible to reveal many-to-many relationships without invoking the third table.

Suppose we have two large tables: with users and user groups . Moreover, each user is related to a large number of different groups, in turn the groups include a large number of users. With a normal ratio (i.e., disclosure through the 3rd table), in order to select all the groups that belong to a certain user, you will need to make a query that combines 2 or 3 tables, which even with the presence of indices is very expensive.

However, you can denormalize as follows:

Now, to select the groups belonging to user 2 you can do:

SELECT *
FROM `groups`
WHERE MATCH (groups) AGAINST ('+user2' IN BOOLEAN MODE);

This will work much faster than the original version (with the 3rd table). Similarly with groups, but if we don’t need such samples in principle, then we can do without the corresponding field in the group table. Then something will turn out like a “one-sided” M: N connection. That is, you can calculate all M, which belong to N, you can not do the opposite.

In this case, as a rule, IN BOOLEAN MODE is used.

- By the way, tagging information is very good for this scheme, but everything is not so simple there, and this is again a separate topic.

Using relevance as a measure of the relationship of one object to another

One of the algorithms for calculating articles that are “similar” to this article . Everything is simple: the tags of this article are taken, and a full-text query is made across the field with the tags of all other articles, sorted by relevance (if needed). Naturally, first come out those that contain the maximum match on the tags.

You can and without tags. If articles are indexed for full-text search, a dozen of the most frequently used words are selected from the index, after which a search is made for them.

Or another example - the interests of users. Using the exact same scheme, you can easily find other users whose interests best match yours.

And something in conclusion:

Despite all the possibilities of full-text indexes, it should be borne in mind that the indexes themselves occupy a very significant space on the disk, and changes to the tables with them will take much longer.

Source: https://habr.com/ru/post/40218/

All Articles