Why putting the most selective columns in the composite index prefix is not always good

tl;dr: This article looks at a case where it is better to move the most selective attribute from the prefix of a composite index to the suffix.

It also covers what a pipeline is and how to use it to select data that is already sorted.

Domain Description

System X has an event logger. We need to build an application for viewing the logs of this system.

It is assumed that the system has its own error codes and corresponding messages. About 10k of them are unique.

There are three types of messages:

notice - a notification
warning - a warning
error - an error.

Log table:

 create table `log` (
     `message` text not null,
     `datetime` datetime not null,
     `type` enum('notice', 'warning', 'error') not null default 'notice'
 );
 create index `datetime_message` on `log`(`datetime`, `message`(150));

There are 10 million entries in the table. datetime is always unique, but the message field has only 10k unique values.

To generate the records, I wrote a procedure that fills the table with generated log entries. datetime advances as if a log entry were written once per second.

The procedure itself

 delimiter //
 create procedure `generate_logs`(`amount` int, `amountOfUniqueMessages` int)
 not deterministic
 modifies sql data
 sql security invoker
 main: begin
     declare i int default 1;
     -- the generated log starts nine years before today
     set @datetime = cast(current_date as datetime) - interval 9 year;

     -- validate the input parameters; leave the procedure if they are invalid
     if (amount <= 0 or amountOfUniqueMessages <= 0) then
         leave main;
     end if;

     start transaction;
     -- cycle through [amountOfUniqueMessages] distinct messages;
     -- each row's datetime is one second later than the previous one
     while i < amount do
         set @message = concat('message ', i % amountOfUniqueMessages);
         insert into `log`(`message`, `datetime`)
         values (@message, @datetime + interval i second);
         set i = i + 1; -- without this increment the loop never ends
     end while;
     commit;
 end //
 delimiter ;

Implementation

We assume that the most frequent user operation in the application will be displaying the entire log for a specific date.
This is best done with a query like this:

 select `message`, `datetime` from `log` where `datetime` >= '2017-04-01 00:00:00' and `datetime` < '2017-04-02 00:00:00' order by `datetime`;

upd: thanks to VolCh for the advice on dates; <= '2017-04-01 23:59:59' has been replaced with < '2017-04-02 00:00:00'. See the comments to the post for details.

That is, we select all records for a given date, sorted by it. If the date comes first in the composite index, no sorting is needed at all: the rows come back already sorted.

The explain for this query looks good:

 id: 1
 select_type: SIMPLE
 table: log
 partitions: NULL
 type: range
 possible_keys: datetime_message
 key: datetime_message
 key_len: 5
 ref: NULL
 rows: 172242
 filtered: 100.00
 Extra: Using index condition

The estimate is about 172k rows. This is quite an expected result, given that the data was generated as if the logger wrote something to the database every second.
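Why the range query needs no sort can be sketched with a toy model in Python. This is not how MySQL stores its index internally; it only assumes that a composite index behaves like a list of (datetime, message) tuples kept in sorted order. The data and the `range_scan` helper are made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Toy model (assumption): the (datetime, message) index is a sorted list of tuples.
start = datetime(2009, 3, 23)
index = sorted(
    (start + timedelta(hours=h), f"message {h % 5}") for h in range(72)
)

def range_scan(index, low, high):
    """Return entries with low <= datetime < high using two binary searches."""
    # A 1-tuple (low,) sorts before every (low, message) entry,
    # so bisect_left lands on the first entry with datetime >= low.
    lo = bisect_left(index, (low,))
    hi = bisect_left(index, (high,))
    return index[lo:hi]

day = range_scan(index, datetime(2009, 3, 24), datetime(2009, 3, 25))

# The slice is contiguous, so it is already ordered by datetime.
print(len(day), day == sorted(day))
```

The half-open interval (>= low, < high) mirrors the date condition in the query above.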

Order by asc / desc

Note that even with descending order the data is fetched already sorted, so it does not have to go through a filesort:

Query with order by ... desc:

 select `message`, `datetime` from `log` where `datetime` >= '2009-03-24 00:00:00' and `datetime` < '2009-03-25 00:00:00' order by `datetime` desc;

Its explain:

 id: 1
 select_type: SIMPLE
 table: log
 partitions: NULL
 type: range
 possible_keys: datetime_message
 key: datetime_message
 key_len: 5
 ref: NULL
 rows: 172242
 filtered: 100.00
 Extra: Using index condition

No filesort and no temporary table. Everything is exactly the same as in the first case.
This behavior is called a pipeline, because the data is stored as if chained together, one entry after another. You can pull the values starting from the first link (order by asc) or from the last link (desc).
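The chain idea can be illustrated with a small Python sketch. It assumes, for illustration only, that the index is a sorted list and that timestamps are plain integers; the `scan` function is hypothetical and simply walks the same range forward or backward.

```python
from bisect import bisect_left

# Toy index on (datetime, message); integer timestamps stand in for datetimes.
index = [(t, f"message {t % 3}") for t in range(10)]  # already sorted by t

def scan(index, low, high, descending=False):
    """Walk the entries with low <= t < high from either end of the range."""
    lo = bisect_left(index, (low,))
    hi = bisect_left(index, (high,))
    order = range(hi - 1, lo - 1, -1) if descending else range(lo, hi)
    return [index[i] for i in order]

asc = scan(index, 3, 7)
desc = scan(index, 3, 7, descending=True)

# Both directions read the same links of the chain; nothing is sorted.
print(desc == asc[::-1])
```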

To understand how message is sorted inside the composite index, imagine school classes. The pupils in each class are sorted from a to z:

1 "a"	1 "b"
Ivanov	Kuznetsov
Petrov	Novikov
Sidorov	Popov

If you select all pupils from 1 "a", they come back already sorted, without filesort or temporary, regardless of whether ascending or descending order is used:

 select `surname` from `schoolkids` where `class` = '1' and `liter` = 'a';

will return

surname
Ivanov
Petrov
Sidorov

However, as soon as we take the pupils of both classes and sort them, explain immediately reports an ominous Using filesort or Using temporary:

 select `surname` from `schoolkids` where `class` = '1' and `liter` in ('a', 'b') order by `surname`;

surname
Ivanov
Kuznetsov
Novikov
Petrov
Popov
Sidorov

This happens because the values can no longer be read straight off the pipeline, so the DBMS has to sort them itself.
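The school example can be replayed in a few lines of Python. The sketch assumes an index on (class, liter, surname) modeled as a sorted list of tuples; the letters 'a' and 'b' and the `surnames` helper are illustrative assumptions, not part of the original schema.

```python
# Toy index on (class, liter, surname), kept in sorted order.
index = sorted([
    ("1", "a", "Ivanov"), ("1", "a", "Petrov"), ("1", "a", "Sidorov"),
    ("1", "b", "Kuznetsov"), ("1", "b", "Novikov"), ("1", "b", "Popov"),
])

def surnames(index, liters):
    """Read surnames in index order for class '1' and the given letters."""
    return [s for (c, l, s) in index if c == "1" and l in liters]

one_class = surnames(index, {"a"})
both = surnames(index, {"a", "b"})

# One class: index order is already surname order.
print(one_class == sorted(one_class))
# Two classes: two sorted runs back to back, so an extra sort is needed.
print(both == sorted(both))
```

Within a single (class, liter) prefix the suffix is ordered; across several prefixes it is only ordered run by run, which is the situation where the DBMS falls back to its own sort.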

Let's look at another example: sorting the previous query by message. The attribute is already sorted in the index, but only relative to the index prefix, i.e. within each datetime value.

 select `message`, `datetime` from `log` where `datetime` >= '2009-03-24 00:00:00' and `datetime` < '2009-03-25 00:00:00' order by `message` desc;

Explain:

 id: 1
 select_type: SIMPLE
 table: log
 partitions: NULL
 type: range
 possible_keys: datetime_message
 key: datetime_message
 key_len: 5
 ref: NULL
 rows: 172242
 filtered: 100.00
 Extra: Using index condition; Using filesort

Why the filesort? Recall the schoolchildren example: if thirty pupils (the index suffix) study in one class (the index prefix), the pipeline returns them already sorted; when several classes are selected, however, the sorting has to be done by hand (take the class registers and write out a new sorted list of all first-grade pupils on a fresh sheet of paper). The same principle applies here, with the caveat that the datetime attribute is completely unique (as if each class had exactly one pupil). So the DBMS has to do the sorting itself, and in this query the filesort is normal and unavoidable.

Everything works, but suddenly ...

However, after analyzing the most frequent SQL queries made against the log table, it turned out that the most frequent operation in the application is searching for logs with a specific message and type, without any time interval.
For example, searching for all errors with the message "message 183".
Such a query is far from optimal and takes about 30 seconds:

 select `datetime`, `message` from `log` where `message` = 'message 183' and `type` = 'error';

The explain for this query gave the following picture:

 id: 1
 select_type: SIMPLE
 table: log
 partitions: NULL
 type: ALL
 possible_keys: NULL
 key: NULL
 key_len: NULL
 ref: NULL
 rows: 10010745
 filtered: 3.33
 Extra: Using where

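Now it is clear that the index is not used at all. This is understandable: the condition is on message, which is only the index suffix, so matching entries are scattered across the whole index and cannot be located without reading all of it.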
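A rough Python illustration of why the suffix does not help, under the assumption that the (datetime, message) index is a list sorted by its first element. Integer timestamps and the four message values are made up.

```python
# Toy index on (datetime, message), ordered by datetime.
index = [(t, f"message {t % 4}") for t in range(12)]

# Searching by message alone: the index order says nothing about where
# a given message lives, so every entry has to be checked.
examined = 0
positions = []
for i, (_, message) in enumerate(index):
    examined += 1
    if message == "message 3":
        positions.append(i)

# The matches are spread evenly through the index, not grouped together.
print(positions, examined)
```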

We conclude that we need to change the structure of the index so that the message comes first:

 drop index `datetime_message` on `log`;
 create index `message_datetime` on `log`(`message`(150), `datetime`);

Now the query that would have brought the database down with the old index looks quite optimal:

 id: 1
 select_type: SIMPLE
 table: log
 partitions: NULL
 type: ref
 possible_keys: message_datetime
 key: message_datetime
 key_len: 452
 ref: const
 rows: 1000
 filtered: 100.00
 Extra: Using where
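With message as the prefix, the same lookup becomes a pair of binary searches. The Python sketch below is a toy model that assumes the (message, datetime) index is a sorted list of tuples; integer timestamps and the `float("inf")` upper bound are conveniences of the model, not anything MySQL does.

```python
from bisect import bisect_left

# Table rows as (datetime, message); integer timestamps stand in for datetimes.
rows = [(t, f"message {t % 4}") for t in range(12)]

# Toy index on (message, datetime), kept in sorted order.
index = sorted((message, t) for t, message in rows)

# All entries for one message form a single contiguous run.
lo = bisect_left(index, ("message 3",))
hi = bisect_left(index, ("message 3", float("inf")))
matches = index[lo:hi]

# Only the run itself is read, and inside it datetime is already ascending.
print(hi - lo, [t for _, t in matches])
```

The filter on type is not part of the index, so in this model it would still be checked row by row for the entries in the run, which is consistent with the Using where in the explain output above.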

However, the old query that fetches messages for a specific date is now non-optimal.
But if it is rarely executed, it can be left that way, since the main optimization goal has been met: all frequent queries to the database are optimized.

Let's sum up

The most selective column does not always belong in the prefix of a composite index.
There are situations where the attribute queried most often is one with many repeated values in the table. It makes no sense to put it on the right, because searching by it would mean scanning the entire index tree.

Some people consider putting the most selective column on the left a myth.
It is hard to call it a myth, because in practice the most selective column usually gives a greater search advantage than the others.

Besides selectivity, pay attention to the subject area itself and start from its requirements first, not just from the raw data.

Useful links

Why putting the most selective column to the left is a myth
Use the index, Luke! This is probably a must-read for anyone doing database development. It explains the optimal use of indexes very well, with many query examples and tricks for various DBMSs.
Once again on the topic of why the most selective column on the left is a myth

Source: https://habr.com/ru/post/351942/
