📜 ⬆️ ⬇️

Determine the order of the columns in the composite index

I want to share a simple empirical method that I use to determine in which order the columns in the composite index should go. This method is suitable not only for MySQL, it is also applicable to any DBMS in which b-tree indices are used.

Let's start with a query that returns an empty result, but it does a full table scan. EXPLAIN will show on it that there are no indexes available (i.e. possible_keys = NULL)

SELECT * FROM tbl
WHERE
status= 'waiting' AND
source= 'twitter' AND
no_send_before <= '2009-05-28 03:17:50' AND
tries <= 20
ORDER BY date ASC LIMIT 1;

Do not try to understand the meaning of this request, it is given only as an example. In the simplest case, we want to place the most selective column in the index first, so that the possible number of matching rows is minimal, so we will find the necessary rows as quickly as possible. Assuming that all columns have some distribution of values, we can simply count the number of matches for each condition.
')
SELECT
sum (status= 'waiting' ),
sum (source= 'twitter' ),
sum (no_send_before <= '2009-05-28 03:17:50' ),
sum (tries <= 20),
count (*)
FROM tbl\G

*************************** 1. row ***************************
sum (status = 'waiting' ): 550
sum (source= 'twitter' ): 37271
sum (no_send_before <= '2009-05-28 03:17:50' ): 36975
sum (tries <= 20): 36569
count (*): 37271


It's simple - I wrapped each condition with the SUM () function, which for MySQL is equivalent to COUNT (number_time_when_tru). As you can see, the most selective condition is “status = waiting”. Let's put this column in the index first, then move the condition from the SELECT to WHERE and run the query again to count the matches in the remaining set.

SELECT
sum (source= 'twitter' ),
sum (no_send_before <= '2009-05-28 03:17:50' ),
sum (tries <= 20),
count (*)
FROM tbl
WHERE
status= 'waiting' \G
*************************** 1. row ***************************
sum (source= 'twitter' ): 549
sum (no_send_before <= '2009-05-28 03:17:50' ): 255
sum (tries <= 20): 294
count (*): 549


Now we have dropped to an acceptable number of rows. It can be seen that “source” does not have selectivity at all, i.e. using it, nothing will be filtered out, and adding it to the index will not do any good. You can filter the remaining set either using 'no_send_before' or 'tries'. Running a query with any of them in where will reduce the number of matches for the other condition to zero.

SELECT
sum (source= 'twitter' ),
sum (no_send_before <= '2009-05-28 03:17:50' ),
sum (tries <= 20),
count (*)
FROM tbl
WHERE
status= 'waiting' AND
no_send_before <= '2009-05-28 03:17:50' \G
*************************** 1. row ***************************
sum (source= 'twitter' ): 255
sum (no_send_before <= '2009-05-28 03:17:50' ): 255
sum (tries <= 20): 0
count (*): 255

***************************************************************

SELECT
sum (source= 'twitter' ),
sum (no_send_before <= '2009-05-28 03:17:50' ),
sum (tries <= 20),
count (*)
FROM tbl
WHERE
status= 'waiting' AND
tries <= 20\G
*************************** 1. row ***************************
sum (source= 'twitter' ): 294
sum (no_send_before <= '2009-05-28 03:17:50' ): 0
sum (tries <= 20): 294
count (*): 294
* This source code was highlighted with Source Code Highlighter .


This means that we can do an index with any of them - (status, tries) or (status, no_send_before) and we will find our zero lines very effectively. Which one is better depends on what this table is really used for (as well as on the availability and structure of other queries to this table - note of the translation) .

Source: https://habr.com/ru/post/61845/


All Articles