Sooner or later, many large projects run into performance problems when paging through records. Some of them solve the problem by limiting the number of records available for browsing (say, to no more than 1000). That is an acceptable solution, but it can cause trouble with third-party search engines indexing the site, and they are the heaviest consumers of deep pages. In this article I will abandon the familiar "1..2..3..4.." navigation panel in favor of a simple "forward / backward" pair (it is easier to explain that way), though the same approach works for the classic panel as well.
It is hard to say exactly how many records count as "large enough" for the slowdown to appear: the figure is different for everyone and depends heavily on how fast your disks are, how much memory you have, how much of your data is cached in it, and so on. But if you and your servers feel that rendering the n-th page is harder than rendering the first, and you don't know what to do about it, this article is for you. To start, let me explain in simple terms why it is slow.
By the way, the tests were run on a virtual machine, working with the DBMS as root; the MySQL version is 5.0.32.
1 Let's start with the data.
For testing, we will create a small table and fill it with something.
CREATE TABLE items (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
height INT UNSIGNED NOT NULL DEFAULT 0,
width INT UNSIGNED NOT NULL DEFAULT 0,
price DECIMAL(10,2) NOT NULL DEFAULT 0.0,
title VARCHAR(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
A small PHP script generated an INSERT with 100,000 records, shaped like this:
- field order in the INSERT:
height, width, price, title
- template for the fields:
$val_tmpl = "\t(%d, %d, %f, 'Item %d')";
- test values ($i = 1..100000):
sprintf($val_tmpl, rand(0, 120), rand(0, 220), 10 * rand(0, $i) / $i, $i);
Load all of it into the database, and we can begin.
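For reference, here is a minimal sketch of such a generator in Python (the original script was PHP; the function names here are illustrative, not from the original):

```python
import random

def generate_insert_values(n):
    """Generate n value tuples matching the template above:
    (height, width, price, title)."""
    rows = []
    for i in range(1, n + 1):
        rows.append((
            random.randint(0, 120),         # height
            random.randint(0, 220),         # width
            10 * random.randint(0, i) / i,  # price, drifts toward 0..10
            "Item %d" % i,                  # title
        ))
    return rows

def to_insert_statement(rows):
    """Render the rows as a single multi-row INSERT."""
    values = ",\n".join("\t(%d, %d, %f, '%s')" % row for row in rows)
    return "INSERT INTO items (height, width, price, title) VALUES\n%s;" % values
```

Piping `to_insert_statement(generate_insert_values(100000))` into the mysql client reproduces a data set of the same shape.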
2 The usual pagination method
Anyone who already knows why COUNT(*) and LIMIT ... OFFSET are a bad idea can skip this part.
Before drawing the navigator, we run SELECT COUNT(*) ... WHERE (selection conditions). Many people, for some reason, are sure that even with millions of records, such a query will be very fast as long as the selection conditions can use an index. Let's run an experiment: count the records whose height is greater than 100. First, let's see what happens when there is no index on the height field.
FLUSH STATUS;
SELECT SQL_NO_CACHE COUNT(*) FROM items WHERE height > 100;
+----------+
| count(*) |
+----------+
|    16405 |
+----------+
1 row in set (0.09 sec)
SHOW STATUS LIKE 'handler%';
The last command shows how many operations of each kind the DBMS had to perform to execute our query. Since there is no index, MySQL had to read the data straight from the table, so the line of interest is:
...
| Handler_read_rnd_next | 100001 |
...
That is, MySQL had to perform 100,001 "move to the next record" operations to find everything matching the query.
Everywhere below, assume that FLUSH STATUS is executed before each SELECT and SHOW STATUS LIKE 'handler%' after it.
How can the index help us?
ALTER TABLE items ADD INDEX height_idx (height);
SELECT SQL_NO_CACHE COUNT(*) FROM items FORCE INDEX (height_idx) WHERE height > 100;
+----------+
| count(*) |
+----------+
|    16405 |
+----------+
1 row in set (0.04 sec)
This time the index was used, so Handler_read_rnd_next will be 0, but
...
| Handler_read_next | 16405 |
...
That is, the index lets MySQL touch only the records that are actually needed, but it still has to walk through all of them. There is no magic here: MySQL does not store the number of records matching a query anywhere. So if millions of records match your conditions, COUNT will be very slow.
The second issue: LIMIT ... OFFSET. Same experiment — let's request 5 records.
SELECT SQL_NO_CACHE * FROM items FORCE INDEX (height_idx) WHERE height > 100 LIMIT 5;
...
5 rows in set (0.00 sec)
...
| Handler_read_next | 4 |
...
Everything looks logical. Now let's ask for another 5 records, starting from record 16401.
SELECT SQL_NO_CACHE * FROM items FORCE INDEX (height_idx) WHERE height > 100 LIMIT 16400, 5;
...
5 rows in set (0.13 sec)
The query time has increased significantly. Look at the status:
...
| Handler_read_next | 16404 |
...
That is, MySQL read all 16405 records and only then threw away the ones we did not ask for.
What can we do?
3 What can we do
So: we need to display 10 records and draw a navigation menu. We have seen that just to reach the record from which the results should start, MySQL performs a lot of unnecessary work. The only way to avoid that is to jump directly to the desired record by changing the selection conditions.
Consider a simple example: let the records be output sorted by id. In that case we need to pass, along with the link, the id of the record where we stopped. Say we stopped at the record with id = 10; then the link to the next page must carry 10 in its parameters. Accordingly, the query for the second page looks like this:
SELECT SQL_NO_CACHE id FROM items WHERE id > 10 ORDER BY id LIMIT 10;
By the way, in both cases Handler_read_next equals 9: we jumped straight to the first record matching the condition (thanks to the index) and made 9 moves to the next one. Most importantly, whatever number we put in the condition instead of 10, SHOW STATUS will always show the same picture, and the execution time of such a query no longer depends on how deep into the result set we are — only on how many records we select and which ones.
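To see the idea end to end, here is a small self-contained sketch in Python (using SQLite instead of MySQL purely for portability — the query shape is identical):

```python
import sqlite3

# In-memory table with 100 sequential ids standing in for `items`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

def next_page(last_id, page_size=10):
    """Keyset pagination: jump straight past last_id via the index
    instead of scanning and discarding OFFSET rows."""
    cur = conn.execute(
        "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size))
    return [row[0] for row in cur]

page1 = next_page(0)          # first page: ids 1..10
page2 = next_page(page1[-1])  # second page continues after the last shown id
```

The cost of `next_page` here is the same on page 1 and page 1000, which is exactly the property we wanted.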
I hope the idea is clear. Let's decide what to do with the navigation menu, and then move on to a harder case. We will use the keywords next, previous, and last in the URL. When do we show the "forward", "backward", and "last" links?
Every time we receive next (a request for the next page), we select not 10 records but 11, starting from the id passed in the request parameters. If 11 records came back, we show the "forward" link with the id of the 10th record and drop the 11th. If fewer than 11 records came back, the "forward" link is not shown. At the same time, whenever next came, we always show the "backward" (previous) link with the id of the first record in the selection. The "to the beginning" and "last" links are always shown together with "backward" and "forward" respectively: if we decide to show "backward", we must also show "to the beginning".
Every time we receive previous (a request for the previous page), we select 11 records whose id is less than the one given in the request, sorted in reverse order. Same logic: if 11 records came back, the "backward" link is shown. The "forward" link is always shown.
I hope that was clear enough...
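The link logic for a `next` request can be sketched like this (a hypothetical helper — it only decides which links to show, given the rows that the "fetch 11 instead of 10" query returned):

```python
def page_links(rows_fetched, page_size=10):
    """Decide which navigation links to show for a `next` request.

    rows_fetched: list of ids returned by selecting page_size + 1
    records starting after the cursor id. Returns the page to render
    and a dict of link cursors/flags."""
    links = {}
    page = rows_fetched[:page_size]           # the 11th row is dropped
    if len(rows_fetched) > page_size:
        # An extra row came back, so a next page exists;
        # its cursor is the last id actually shown.
        links["next"] = page[-1]
        links["last"] = True                  # "last" accompanies "forward"
    if page:
        # A `next` request implies we came from somewhere, so the
        # "previous" cursor is the first shown id, and "to the
        # beginning" accompanies "backward".
        links["previous"] = page[0]
        links["first"] = True
    return page, links
```

The `previous` handler is symmetric: select in reverse order, and the presence of an 11th row controls the "backward" link instead.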
What if we received a last request? Just show 10 records starting from the most recent, i.e.:
SELECT id FROM items ORDER BY id DESC LIMIT 10;
Do you think any user has the stamina to page back through hundreds or even thousands of pages just to accuse you of lying after discovering fewer than 10 records on the first page? Even if someone does, you can always answer that they took too long scrolling...
The previous example was simple because id is unique. What if we want to sort by a field whose values can repeat — for example, height in our case? A quick query shows that each height value occurs roughly 800 times in the table, so simply passing the last displayed height in the request parameters is no longer enough. The same id will help us again: we are asked to sort records by height, but nothing stops us from additionally sorting them by id.
For this we need a new index:
ALTER TABLE items ADD KEY height_id_idx (height, id);
The request for the first page will be:
SELECT SQL_NO_CACHE id, height FROM items ORDER BY height, id LIMIT 10;
In my results, the last record of the first page has height = 0, id = 1174, so that is what should be passed to the next page — for example as next_0_1174 or next/0/1174, whichever you prefer.
Now we need to select records for which either height is greater than 0, or height = 0 and id > 1174 (this is where the additional sort by id pays off).
I.e:
SELECT SQL_NO_CACHE * FROM items WHERE (height > 0) OR (height = 0 AND id > 1174) ORDER BY height, id LIMIT 10;
I hope there is no need to explain why this works. The status again shows only 9 steps forward.
In the same way we can add further sort columns. For example, if we want to output all records sorted by price and then height, the query becomes:
SELECT SQL_NO_CACHE * FROM items WHERE (price > 5) OR (price = 5 AND height > 0) OR (price = 5 AND height = 0 AND id > 1174) ORDER BY price, height, id LIMIT 10;
It only remains to pass all the necessary data along, process it correctly, and substitute it into the query. And don't forget about the index.
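The pattern generalizes: for a sort over columns (c1, ..., cn, id), the condition is a disjunction where each term fixes a longer prefix of columns by equality and compares the next one strictly. A sketch of such a builder (it returns SQL text for clarity; in real code you would use parameterized queries rather than interpolating values):

```python
def keyset_condition(columns, values):
    """Build the WHERE clause for keyset pagination over `columns`
    (ending with a unique column such as id), continuing after the
    row whose column values are `values`."""
    terms = []
    for i, col in enumerate(columns):
        # Equality on all preceding sort columns...
        prefix = " AND ".join(
            "%s = %s" % (c, v) for c, v in zip(columns[:i], values[:i]))
        # ...and a strict comparison on the current one.
        term = "%s > %s" % (col, values[i])
        terms.append("(%s)" % (prefix + " AND " + term if prefix else term))
    return " OR ".join(terms)
```

For example, `keyset_condition(["price", "height", "id"], [5, 0, 1174])` produces exactly the condition from the query above.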
4 What to do with the number of results
What if we want to show users how many results were found? Since we are dealing with large numbers, it is unlikely anyone will check us. Google itself may claim it found 1,000,000 matching pages, yet you will never see more than 1000 of them. We, too, can report the number of results only approximately. Where do we get the estimate? Recall that we ran the query:
SELECT SQL_NO_CACHE COUNT(*) FROM items FORCE INDEX (height_idx) WHERE height > 100;
And let's do this:
EXPLAIN SELECT SQL_NO_CACHE * FROM items FORCE INDEX (height_idx) WHERE height > 100;
As a result, we get something like this:
+----+-------------+-------+-------+---------------+------------+---------+------+-------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows  | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+-------+-------------+
|  1 | SIMPLE      | items | range | height_idx    | height_idx | 4       | NULL | 22616 | Using where |
+----+-------------+-------+-------+---------------+------------+---------+------+-------+-------------+
The rows column shows the estimated number of records to be examined: 22616 versus the real 16405 — not a big difference at all. Round it to ~20,000 and it will do. Just remember that if you use subqueries and/or joins, EXPLAIN will return several rows; read them all and multiply their rows values together.
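That estimate is trivial to compute once the EXPLAIN result has been fetched; a sketch, assuming the rows arrive as dicts keyed by the column names shown above:

```python
def estimate_result_count(explain_rows):
    """Multiply the `rows` estimates from all EXPLAIN output rows.

    For a single-table range scan this is just that row's estimate;
    with joins or subqueries each EXPLAIN row contributes a factor,
    since the plan nests the scans."""
    estimate = 1
    for row in explain_rows:
        estimate *= row["rows"]
    return estimate
```

Round the result for display ("about 20,000 results") rather than presenting it as exact.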
Conclusion
This problem has already been touched on in passing on Habr, but not in this much detail.
The article came out longer than expected, even though the text is diluted with query listings and results. At the moment I have nowhere to host the data-generation script and the SQL files. And in general... I have no strength left for a proper conclusion :)