I often see mistakes in how indexes are created in MySQL. Many developers (and not only MySQL beginners) create a separate index on every column that will appear in query conditions and consider this the optimal strategy. For example, for a query with the condition AGE = 18 AND STATE = 'CA', many people will simply create two separate indexes, one on AGE and one on STATE.
A much better strategy (translator's note: and usually the only correct one) is to create a combined index on (AGE, STATE). Let's look at why this is so.
Usually (but not always) indexes in MySQL are BTREE indexes. This type of index can quickly look up entries by a prefix of the index and scan ranges of sorted values. For example, for the condition AGE = 18 with a BTREE index on the AGE column, MySQL finds the first matching entry in the index and keeps reading in index order until it meets the first non-matching entry. Then it stops, because the entries are sorted and nothing further can match. Ranges such as BETWEEN 18 AND 20 work the same way: MySQL simply stops at a different value.
Conditions such as AGE IN (18,20,30) are somewhat more complicated, because MySQL actually has to go through the index several times.
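To make the seek-then-scan behaviour described above concrete, here is a minimal Python sketch. It is not how MySQL is implemented: the "index" is just a sorted list of (age, row pointer) pairs, and the ages, pointers, and function names are all made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy "index" on AGE: (age, row_pointer) pairs kept in sorted order.
# The values are hypothetical and only serve the illustration.
age_index = sorted([(17, 4), (18, 0), (18, 7), (19, 2),
                    (20, 5), (25, 1), (30, 3), (30, 6)])


def range_scan(index, low, high):
    """Return row pointers for low <= key <= high: one seek, then one scan."""
    start = bisect_left(index, (low,))                # first entry with key >= low
    end = bisect_right(index, (high, float("inf")))   # first entry with key > high
    return [ptr for _, ptr in index[start:end]]


def in_list_scan(index, values):
    """AGE IN (...) modelled as one separate range scan per listed value."""
    pointers = []
    for value in values:
        pointers.extend(range_scan(index, value, value))
    return pointers


print(range_scan(age_index, 18, 18))          # AGE = 18
print(range_scan(age_index, 18, 20))          # AGE BETWEEN 18 AND 20
print(in_list_scan(age_index, [18, 20, 30]))  # AGE IN (18,20,30)
```

The equality and BETWEEN cases each need a single seek, while the IN list repeats the seek once per value, which is the "several passes" mentioned above.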
So, we have discussed how MySQL searches an index, but not what the search returns. Usually (unless we are talking about covering indexes) we get a “row pointer”, which can be the primary key value (for the InnoDB engine), a physical file offset (for MyISAM), or something similar. What matters is that the storage engine can use this pointer to find the complete row with all the data that corresponds to the given index entry.
So what options does MySQL have if you created two separate indexes? It can either use only one of them to find candidate rows (and then filter the fetched rows by the rest of the WHERE clause, without using an index), or it can collect row pointers from all the suitable indexes, compute their intersection, and only then fetch the data.
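The two options can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MySQL's code: the table is a dictionary from row pointer to (age, state), the two indexes are dictionaries from key value to pointer lists, and all data and function names are hypothetical.

```python
# Toy table: row pointer -> (age, state). All values are made up.
rows = {
    0: (18, "CA"), 1: (18, "NY"), 2: (25, "CA"),
    3: (18, "CA"), 4: (30, "TX"), 5: (18, "TX"),
}

# Two single-column "indexes": key value -> list of row pointers.
age_index = {}
state_index = {}
for ptr, (age, state) in rows.items():
    age_index.setdefault(age, []).append(ptr)
    state_index.setdefault(state, []).append(ptr)


def single_index_then_filter(age, state):
    """Use only the AGE index, fetch every candidate row, then apply the rest of WHERE."""
    fetched = [(ptr, rows[ptr]) for ptr in age_index.get(age, [])]
    matches = sorted(ptr for ptr, (_, s) in fetched if s == state)
    return matches, len(fetched)


def index_intersection(age, state):
    """Intersect the pointer lists of both indexes, then fetch only the survivors."""
    common = set(age_index.get(age, [])) & set(state_index.get(state, []))
    fetched = [(ptr, rows[ptr]) for ptr in common]
    return sorted(ptr for ptr, _ in fetched), len(fetched)


# Each call returns (matching row pointers, number of rows fetched from the table).
print(single_index_then_filter(18, "CA"))
print(index_intersection(18, "CA"))
```

Both functions return the same rows. The difference is how many full rows have to be fetched, and the intersection pays for its saving by reading a second index.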
Which method is better depends on the selectivity and correlation of the indexes. If the condition on the first column selects 5% of the rows, and adding the condition on the second column narrows the result down to 1% of the rows, then using the intersection certainly makes sense. But if the second condition only narrows the result down to 4.5%, it is usually much cheaper to use the first index alone and filter out the unneeded rows after fetching the data.
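The arithmetic behind these two cases is simple enough to check. The sketch below only counts row fetches. It deliberately ignores the extra cost of reading the second index and merging the pointer lists, so it is an illustration of the percentages above rather than a cost model. The function name and the table size of 10 million rows are assumptions made for the example.

```python
TOTAL_ROWS = 10_000_000  # assumed table size, for illustration only


def fetches_saved_percent(first_only_pct, both_pct, total_rows=TOTAL_ROWS):
    """Compare row fetches: first index alone versus intersection of both indexes."""
    single_index_fetches = round(total_rows * first_only_pct / 100)
    intersection_fetches = round(total_rows * both_pct / 100)
    saved = single_index_fetches - intersection_fetches
    saved_pct = round(100 * saved / single_index_fetches)
    return single_index_fetches, intersection_fetches, saved_pct


print(fetches_saved_percent(5, 1))    # intersection skips most row fetches
print(fetches_saved_percent(5, 4.5))  # intersection skips only a small share
```

In the first case the intersection avoids 80% of the row fetches, in the second only 10%, which is unlikely to pay for scanning a second index.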
Let's look at a few examples:
CREATE TABLE `idxtest` (
  `i1` int(10) UNSIGNED NOT NULL,
  `i2` int(10) UNSIGNED NOT NULL,
  `val` varchar(40) DEFAULT NULL,
  KEY `i1` (`i1`),
  KEY `i2` (`i2`),
  KEY `combined` (`i1`,`i2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
I created columns i1 and i2 independent of each other, each of which selects about 1% of the rows in the table, which contains a total of 10 million records.
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;
+----+-------------+---------+------+----------------+----------+---------+-------------+------+-------+
| id | select_type | table   | type | possible_keys  | key      | key_len | ref         | rows | Extra |
+----+-------------+---------+------+----------------+----------+---------+-------------+------+-------+
|  1 | SIMPLE      | idxtest | ref  | i1,i2,combined | combined | 8       | const,const |  665 |       |
+----+-------------+---------+------+----------------+----------+---------+-------------+------+-------+
1 row in set (0.00 sec)
As you can see, MySQL chose the combined index, and the query ran in less than 10 ms!
Now suppose that we only have indexes on the individual columns (to simulate this, we tell the optimizer to ignore the combined index):
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX (combined) WHERE i1=50 AND i2=50;
+----+-------------+---------+-------------+---------------+-------+---------+------+------+-------------------------------------+
| id | select_type | table   | type        | possible_keys | key   | key_len | ref  | rows | Extra                               |
+----+-------------+---------+-------------+---------------+-------+---------+------+------+-------------------------------------+
|  1 | SIMPLE      | idxtest | index_merge | i1,i2         | i1,i2 | 4,4     | NULL | 1032 | Using intersect(i1,i2); Using where |
+----+-------------+---------+-------------+---------------+-------+---------+------+------+-------------------------------------+
1 row in set (0.00 sec)
As you can see, in this case MySQL used an index intersection, and the query took 70 ms, 7 times longer!
Now let's see what happens if only one index is used and the rows are filtered afterwards:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX (combined,i2) WHERE i1=50 AND i2=50;
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
|  1 | SIMPLE      | idxtest | ref  | i1            | i1   | 4       | const | 106222 | Using where |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
1 row in set (0.00 sec)
This time MySQL had to go through significantly more rows, and the query took 290 ms. So an index intersection is much better than a single index, but a combined index is better still.
However, the problems with index intersection do not end there. MySQL's ability to use this procedure is currently quite limited, so it does not always use it:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX (combined) WHERE i1=50 AND i2 IN (49,50);
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
|  1 | SIMPLE      | idxtest | ref  | i1,i2         | i1   | 4       | const | 106222 | Using where |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
1 row in set (0.00 sec)
As soon as the condition on one of the columns becomes a list of values rather than an equality comparison, MySQL can no longer use the index intersection, even though for i2 IN (49,50) it would be more than reasonable, because the query remains quite selective.
Now let's do another test. I changed the data in the table so that the values in i1 and i2 are strongly correlated. In fact, they are now simply equal:
mysql [localhost] {msandbox} (test)> UPDATE idxtest SET i2=i1;
Query OK, 10900996 rows affected (6 min 47.87 sec)
Rows matched: 11010048 Changed: 10900996 Warnings: 0
Let's see what happens in this case:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;
+----+-------------+---------+-------------+----------------+-------+---------+------+------+-------------------------------------+
| id | select_type | table   | type        | possible_keys  | key   | key_len | ref  | rows | Extra                               |
+----+-------------+---------+-------------+----------------+-------+---------+------+------+-------------------------------------+
|  1 | SIMPLE      | idxtest | index_merge | i1,i2,combined | i2,i1 | 4,4     | NULL |  959 | Using intersect(i2,i1); Using where |
+----+-------------+---------+-------------+----------------+-------+---------+------+------+-------------------------------------+
1 row in set (0.00 sec)
The optimizer decided to use the index intersection, although this was probably the worst possible choice! The query took 360 ms. Also note the large error in the estimated number of rows.
This happened because MySQL assumes that the values in columns i1 and i2 are independent, and therefore chooses the index intersection. In fact, it cannot assume anything else, because it has no statistics on the correlation between the values in the columns.
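A back-of-the-envelope calculation shows how far off an independence assumption can be here. The sketch below is not MySQL's estimation code. It simply multiplies the two selectivities, using the table size reported by the UPDATE above and the per-value row estimate of 106222 that EXPLAIN reports for the single-column index. The function name is made up.

```python
TOTAL_ROWS = 11_010_048    # "Rows matched" reported by the UPDATE above
ROWS_PER_VALUE = 106_222   # EXPLAIN's row estimate for a single-column lookup


def estimate_assuming_independence(total_rows, rows_a, rows_b):
    """Expected number of rows matching both conditions if the columns were unrelated."""
    return total_rows * (rows_a / total_rows) * (rows_b / total_rows)


independent = estimate_assuming_independence(TOTAL_ROWS, ROWS_PER_VALUE, ROWS_PER_VALUE)

# With i2 = i1, every row that matches i1=50 also matches i2=50,
# so the real result is as large as the single-column estimate.
correlated = ROWS_PER_VALUE

print(round(independent), correlated)
```

Under independence the intersection looks like it will return about a thousand rows, the same order of magnitude as the 959 in the plan, while with fully correlated columns it returns roughly a hundred times more.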
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX(i2) WHERE i1=50 AND i2=50;
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
|  1 | SIMPLE      | idxtest | ref  | i1,combined   | i1   | 4       | const | 106222 | Using where |
+----+-------------+---------+------+---------------+------+---------+-------+--------+-------------+
1 row in set (0.00 sec)
Now that we have forbidden MySQL to use the index on column i2 (which means it cannot use an index intersection either), it uses the single-column index rather than the combined one. This happens because MySQL has statistics on the approximate number of matching rows, and when the estimates for two indexes are the same, it picks the smaller index. The query again took 290 ms, exactly the same as last time.
Now let's force MySQL to use only the combined index:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX(i1,i2) WHERE i1=50 AND i2=50;
+----+-------------+---------+------+---------------+----------+---------+-------------+--------+-------+
| id | select_type | table   | type | possible_keys | key      | key_len | ref         | rows   | Extra |
+----+-------------+---------+------+---------------+----------+---------+-------------+--------+-------+
|  1 | SIMPLE      | idxtest | ref  | combined      | combined | 8       | const,const | 121137 |       |
+----+-------------+---------+------+---------------+----------+---------+-------------+--------+-------+
1 row in set (0.00 sec)
You can see that MySQL now expects to go through about 20% more rows, which is of course wrong: the number cannot be larger than when only the index prefix is used, that is, the index on column i1 alone. MySQL does not know this, because it looks at the statistics of each index separately and does not try to reconcile them.
Because the combined index is larger than the single-column index, the query took 300 ms.
Thus we see that MySQL may decide to use an index intersection even when it is the worst option, although technically, given the statistics it has available, this does look like the best plan.
There are simple ways to force MySQL not to use an index intersection, but unfortunately I do not know of a way to force it to use one when it considers that plan non-optimal. I hope this ability will be added in the future.
Finally, let's consider a situation where index merge works much better than a combined multi-column index. This is the case when the conditions on the columns are joined with OR. Here the combined index becomes completely useless, and MySQL has to choose between a full table scan and a union (rather than an intersection) of the row sets that it obtains from the individual indexes.
I again changed the values in columns i1 and i2 so that they contain independent data (the typical situation for tables).
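Why the combined index cannot help with OR can be seen in a small Python model. This is a sketch under assumptions, not MySQL internals: each index is a sorted list of tuples ending in a row pointer, and the data and function names are hypothetical. The key point is that a sorted (i1, i2) list can only be searched by its leading column.

```python
from bisect import bisect_left, bisect_right

# Toy table: row pointer -> (i1, i2). Values are made up.
rows = {0: (50, 7), 1: (3, 50), 2: (50, 50), 3: (8, 9), 4: (12, 50)}

# Two single-column indexes and one combined (i1, i2) index, all sorted lists.
i1_index = sorted((i1, ptr) for ptr, (i1, _) in rows.items())
i2_index = sorted((i2, ptr) for ptr, (_, i2) in rows.items())
combined_index = sorted((i1, i2, ptr) for ptr, (i1, i2) in rows.items())


def prefix_lookup(index, value):
    """Seek to the first entry whose leading column equals value, scan the run."""
    start = bisect_left(index, (value,))
    end = bisect_right(index, (value, float("inf")))
    return {entry[-1] for entry in index[start:end]}


def union_merge(value1, value2):
    """i1 = value1 OR i2 = value2, answered from the two single-column indexes."""
    return prefix_lookup(i1_index, value1) | prefix_lookup(i2_index, value2)


def full_scan(value1, value2):
    """The alternative when no index covers i2: check every row."""
    return {ptr for ptr, (i1, i2) in rows.items() if i1 == value1 or i2 == value2}


print(sorted(union_merge(50, 50)))                # all rows matching the OR
print(sorted(prefix_lookup(combined_index, 50)))  # only the i1 = 50 half
```

The combined index finds the rows with i1 = 50 but has no sorted order to exploit for i2 = 50 on its own, so rows 1 and 4 would be missed unless the whole table is scanned.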
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 OR i2=50;
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+---------------------------------+
| id | select_type | table   | type        | possible_keys  | key   | key_len | ref  | rows   | Extra                           |
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+---------------------------------+
|  1 | SIMPLE      | idxtest | index_merge | i1,i2,combined | i1,i2 | 4,4     | NULL | 203803 | Using union(i1,i2); Using where |
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+---------------------------------+
1 row in set (0.00 sec)
This query takes 660 ms. If we disable the index on the second column, we get a full table scan:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest IGNORE INDEX(i2) WHERE i1=50 OR i2=50;
+----+-------------+---------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+---------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | idxtest | ALL  | i1,combined   | NULL | NULL    | NULL | 11010048 | Using where |
+----+-------------+---------+------+---------------+------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Note that MySQL lists the i1 and combined keys as possible keys, but in fact it cannot use them for this query. The query takes 3370 ms!
Also note that the query took only 5 times longer even though the full scan went through about 50 times more rows. This shows the very large performance difference between a full table scan and access by key: access by key costs about 10 times more per row, even when everything is done in memory.
For a union, the optimizer is more advanced and can handle ranges as well:
mysql [localhost] {msandbox} (test)> EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 OR i2 IN (49,50);
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+--------------------------------------+
| id | select_type | table   | type        | possible_keys  | key   | key_len | ref  | rows   | Extra                                |
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+--------------------------------------+
|  1 | SIMPLE      | idxtest | index_merge | i1,i2,combined | i1,i2 | 4,4     | NULL | 299364 | Using sort_union(i1,i2); Using where |
+----+-------------+---------+-------------+----------------+-------+---------+------+--------+--------------------------------------+
1 row in set (0.00 sec)
Summing up
In most cases, a combined multi-column index is the best solution if the conditions on the columns in WHERE are joined with AND. Index intersection does improve performance compared to a single index, but it is still significantly worse than a combined key. If the conditions on the columns are joined with OR, you need an index on each of the columns so that MySQL can compute their union. Combined indexes cannot be used for such queries.