📜 ⬆️ ⬇️

Quick selection of random values ​​from large MySQL tables by condition

The task of choosing random lines from a table quite often arises before developers.
If MySQL is used, it is usually solved in the following way:

SELECT *
FROM users
WHERE role_id=5
ORDER BY rand()
LIMIT 10


Such code works extremely slowly for large tables.
If the query does not need to use WHERE or the table is small, there are effective solutions, for example habrahabr.ru/post/54176 or habrahabr.ru/post/55864 .
But I did not find any ready-made solutions for a large table and the need to filter by condition, getting new values ​​for each request, so my method description is under the cut.


As it turned out, MySQL does not know how to efficiently select random rows using ORDER BY rand () LIMIT N, where you need to filter rows by condition (although the same MSSQL copes with the selection of random rows from a table with a large number of records).
')
So, solving the problem "in the forehead," the query (in the table of 5 million records):

SELECT *
FROM users
WHERE role_id=5
ORDER BY rand()
LIMIT 10


The request took 41.3544 seconds, which is unacceptable for a long time. It is impossible to find the maximum and minimum id, and then select random id from the gap in this case: because of the WHERE condition, id is no longer in order and discharged.

My solution is as follows: the random_seed table is added, containing the id and random_seed fields, filled with random numbers, an index is added to this column, and the index is also added to the column that will be sampled.
Now, in order to select random lines by condition, the query needs to be changed as follows (in table 5 million records):

SELECT
u1.*
FROM
users u1,
random_seed rs
WHERE
u1.role_id=5 AND u1.id=(rs.id+random_from_php)
ORDER BY
rs.random_seed
LIMIT 10


The request took 0.0460 seconds, which is already more than an acceptable result. The variable random_from_php is generated by the code calling the request, which provides a random set of values ​​for each request, this number will provide a sample of new random numbers. In the table, random_seed should be as many values ​​as in the table, from which you need to take random rows + N records, where N is the maximum possible value of random_from_php.

A real example from my work is the selection of random categories of products from different combinations (total 4,000,000 entries):

“Normal”: request"Accelerated" request:
SELECT
oc1. *
FROM
object_category oc1
WHERE oc1.region_id = 6
ORDER BY RAND ()
LIMIT 10
SELECT
oc1. *
FROM
object_category oc1, random_seed rs
WHERE
oc1.id = (rs.id + 564756) AND oc1.region_id = 6
ORDER BY
rs.random_seed
LIMIT 10
Lead time:
1.7260.007s
1.851s0.010s
1.803s0.006s
1.7840.008s


The advantages of the method described above:
+ The fastest possible way to select random rows from a table by condition
+ No need to re-generate random numbers for each of the rows in the table.
+ The request of all necessary values ​​does not occur iteratively, in one request

Minuses:
- The need to enter an additional table
- The need to change the usual requests

Source: https://habr.com/ru/post/207096/


All Articles