
Speeding up random record selection in MySQL

Recently, the question of selecting a random sample from a table has come up again. There are plenty of optimization approaches, and I probably won't show you anything new here; I just want to remind you of the basic optimization techniques: simplifying the query and indexing. No preambles about freelancers, straight to the point ;)

Create a table:
CREATE TABLE `a` (
  `id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `md5` char(32) NOT NULL,
  PRIMARY KEY (`id`)
);
INSERT INTO `a` (`id`) VALUES (null),(null),(null),(null)... 163712 ;)
UPDATE `a` SET md5 = MD5(`id`);

A table like this is enough to test effectiveness on my ancient machine.
Here is a naive ORDER BY RAND() sample:
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.3345 sec)
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.2538 sec)
SELECT * FROM `a` ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.2299 sec)

I add an indexed column so that the query does not have to full-scan the entire table:
ALTER TABLE `a` ADD `f` SMALLINT(3) UNSIGNED NOT NULL, ADD INDEX (`f`);
UPDATE `a` SET `f` = RAND()*1000;
The number 1000 is the main accelerating factor: it is how many times smaller the slice of the table is that an ordinary ORDER BY RAND() now has to scan. With a table of 163,712 rows, you should get about 164 rows per value of f. Checking:
SELECT COUNT(1) FROM `a` WHERE `f` = 123; -> 169
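As a sanity check outside MySQL, the expected bucket size can be simulated. This is a hypothetical Python sketch (the table size and factor are taken from the article; the integer truncation is a simplification of how the float lands in the column), not part of the original setup:

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

N_ROWS = 163712   # table size from the article
FACTOR = 1000     # the accelerating factor

# Mimic UPDATE `a` SET `f` = RAND()*1000, taking the integer
# part of the float for simplicity.
buckets = [int(random.random() * FACTOR) for _ in range(N_ROWS)]

count_123 = buckets.count(123)   # like SELECT COUNT(1) ... WHERE `f` = 123
expected = N_ROWS / FACTOR       # ~163.7 rows per value of f
print(count_123, round(expected))
```

Any single bucket count fluctuates around the expected ~164 (the spread is roughly the square root of the mean, about ±13), which matches the 169 seen above.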

Random is random; a uniform distribution would be nice, but that is wishful thinking (although, you know, you could take the first characters of MD5(`id`) and convert them to INT — it does not get more uniform than that). In my case, a single value of f matched anywhere from 100 to 200 rows. If this factor becomes ineffective over time, you can always increase it and get, say, 25-75 rows per index value. The main thing is that each value covers at least as many rows as we need to pull randomly. The f column can be regenerated periodically, or after every 1000 queries against the table. Newly inserted rows get f = 0, which will not hurt sample quality, or you can assign random values on insert too.
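The MD5 idea mentioned above can be sketched like this. A hypothetical Python illustration (the helper name and the 8-hex-digit prefix width are my choices): take a prefix of MD5(id), convert it to an integer, and reduce it modulo the factor. The result is deterministic per id, so the column never needs regenerating, and the hash spreads rows evenly across buckets:

```python
import hashlib

def md5_bucket(row_id, factor=1000):
    """Derive a deterministic, roughly uniform bucket from MD5(id)."""
    # First 8 hex chars of the MD5 digest -> 32-bit int -> modulo factor.
    prefix = hashlib.md5(str(row_id).encode()).hexdigest()[:8]
    return int(prefix, 16) % factor

# Distribute the article's 163,712 ids and look at the spread.
counts = {}
for row_id in range(1, 163713):
    b = md5_bucket(row_id)
    counts[b] = counts.get(b, 0) + 1

print(min(counts.values()), max(counts.values()))  # spread around ~164
```

In MySQL the same thing could be computed with CONV() over a SUBSTRING of the md5 column, but the Python version shows the shape of the distribution.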

I do a test sample of 10 rows using what I have indexed:
SELECT * FROM `a` WHERE `f` = 231 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0801 sec)
SELECT * FROM `a` WHERE `f` = 231 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0017 sec)
Aha, the second time we got a re-sorted result out of the MySQL query cache. If repeated results are not a concern, this snappier version will do, but it is better to change f on every query.
I repeat the test by changing f:
SELECT * FROM `a` WHERE `f` = 396 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0147 sec)
SELECT * FROM `a` WHERE `f` = 753 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0020 sec)
SELECT * FROM `a` WHERE `f` = 189 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0019 sec)
SELECT * FROM `a` WHERE `f` = 945 ORDER BY RAND() LIMIT 10; -> (10 rows, Query took 0.0235 sec)

Overall, sampling performance went up about 120 times, and that without any real complication. I see plenty of conveniences in this solution. If one interval of ~160 rows is not enough, you can include as many intervals as you like:
SELECT * FROM `a` WHERE `f` IN (100,500) ORDER BY RAND() LIMIT 10;
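On the application side, the interval list for such an IN query could be generated like this. A hypothetical Python sketch (the function name is mine; the bucket values are picked at random rather than hard-coded):

```python
import random

def random_bucket_query(n_buckets=2, factor=1000, limit=10):
    """Build the IN-style query with n distinct random values of f."""
    picks = random.sample(range(factor), n_buckets)  # distinct buckets
    in_list = ",".join(str(p) for p in picks)
    return f"SELECT * FROM `a` WHERE `f` IN ({in_list}) ORDER BY RAND() LIMIT {limit}"

print(random_bucket_query())
```

In a real application the values would of course go through the driver's parameter binding rather than string interpolation; the sketch only shows how the interval count scales.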

A more realistic example


For this example, take the top comment from a neighboring post, which can be solved as follows. We emulate an RSS feeds table by adding a feed column containing the feed number:
ALTER TABLE `a` ADD `feed` TINYINT(1) UNSIGNED NOT NULL, ADD INDEX (`feed`);
UPDATE `a` SET feed = RAND()*9;
And now for the actual trick:
(SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 0 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 1 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 2 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 3 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 4 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 5 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 6 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 7 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 8 ORDER BY RAND() LIMIT 10)
UNION (SELECT * FROM `a` WHERE `f` = 283 AND `feed` = 9 ORDER BY RAND() LIMIT 10)
ORDER BY feed; -> (97 rows, Query took 0.7973 sec)
with another f -> (99 rows, Query took 0.0093 sec)
with another f -> (98 rows, Query took 0.0197 sec)
Here it is wise to select more than 10 rows per feed, just in case ;) and filter out the extras in PHP.
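Rather than writing the ten UNION branches by hand, the query can be assembled in a loop. A hypothetical sketch (the article uses PHP; Python here for brevity, and the function name is mine):

```python
import random

def union_feed_query(n_feeds=10, factor=1000, limit=10):
    """Assemble the UNION of per-feed random samples sharing one f."""
    f = random.randrange(factor)  # one random bucket for all branches
    branches = [
        f"(SELECT * FROM `a` WHERE `f` = {f} AND `feed` = {feed} "
        f"ORDER BY RAND() LIMIT {limit})"
        for feed in range(n_feeds)
    ]
    return " UNION ".join(branches) + " ORDER BY feed"

sql = union_feed_query()
print(sql.count("UNION"))  # 9 joins between 10 branches
```

Because every branch shares the same f, the whole query still touches only one index slice of the table.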

It is also advisable to generate the number f in PHP, so as not to make two round trips to the MySQL server, though this is not critical. The following also works very quickly (note that f is an integer, so the random value has to be scaled by the factor):
SET @rnd = FLOOR(RAND()*1000); SELECT * FROM `a` WHERE `f` = @rnd ORDER BY RAND() LIMIT 10;
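Generating f in the application avoids even the SET round trip. A hypothetical Python sketch (the function name and the %s placeholder style are assumptions; the article itself uses PHP):

```python
import random

FACTOR = 1000  # must match the factor used when populating `f`

def random_sample_query():
    """Return a parameterized query plus the randomly chosen f."""
    f = random.randrange(FACTOR)  # same range as UPDATE ... RAND()*1000
    sql = "SELECT * FROM `a` WHERE `f` = %s ORDER BY RAND() LIMIT 10"
    return sql, (f,)

query, params = random_sample_query()
print(query, params)
```

The query string and parameter tuple would then go to the driver's execute() call, so the random bucket is chosen client-side in a single round trip.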

As you can see, optimization does not have to mean complication (in this article I optimized for speed). Now here is a question for you to think about, which I will take up in a future article: how can you optimize the quality of random samples, and first of all get rid of repetitions? ;)

Respectfully,
Mayam

Source: https://habr.com/ru/post/55864/

