MySQL Select random strings in one query

What do we have?

There is a weak laptop, a table of several million lines and you need to choose a different number of random lines in one query. Further samples do not interest us.

The table (test) has the following structure:

- pk_id (primary key)
- id (field filled with different numbers)
- value (field filled with rand ())

The primary key has no holes and starts with 1 .

Ways to get

ORDER BY rand + LIMIT

Getting one line:
```
SELECT pk_id FROM test ORDER BY rand() LIMIT 1 
```
The average execution time in MySQL is 6.150 seconds.
')
Let's try to take 100 entries.
```
 SELECT pk_id FROM test ORDER BY rand() LIMIT 100 
```
Average Execution Time 6.170-6.180 seconds
That is, the time difference between getting 1 and 100 random lines is not significant.
COUNT * rand ()

Getting one line:
```
 SELECT t.pk_id FROM test as t, (SELECT ROUND(COUNT(pk_id)*rand()) as rnd FROM test LIMIT 1) t WHERE t.pk_id = rnd 
```
Via
```
 ROUND(COUNT(pk_id)*rand()) 
```
we get a random number from 0 to the number of rows in the table.
Next, we assign an alias “rnd” to our random number and use it in WHERE for an equivalent comparison with pk_id.
Average execution time - 1.04 seconds
Next, you need to slightly modify this query, so that you can pull out a few lines.
Add a few more fields to our subquery and change the WHERE check from "=" to IN
```
 SELECT t.pk_id FROM test as t, (SELECT ROUND(COUNT(pk_id)*rand()) as rnd, ROUND(COUNT(pk_id)*rand()) as rnd2, ROUND(COUNT(pk_id)*rand()) as rnd3 FROM test LIMIT 1) t WHERE t.pk_id IN (rnd,rnd2,rnd3) 
```
The average execution time is 1.163 seconds.
As the number of rows received increases, the query execution time increases noticeably.
It’s even scary to think about 100 lines :)
INFORMATION_SCHEMA + LIMIT

Getting one line:
```
 SELECT t.pk_id FROM test as t, (SELECT ROUND((SELECT table_rows as tr FROM information_schema.tables WHERE table_name = 'test') *rand()) as rnd FROM test LIMIT 1) tmp WHERE t.pk_id = rnd 
```
Using the subquery, we get the number of rows in the 'test' table, without using the aggregate function COUNT and further comparison occurs as in method 2.
Average execution time - 0.042 seconds
The minimum observed execution time is 0.003 seconds.
Let's try to get 100 lines:
```
 SELECT t.pk_id FROM test as t, (SELECT ROUND((SELECT table_rows as tr FROM information_schema.tables WHERE table_name = 'test') *rand()) as rnd FROM test LIMIT 100) tmp WHERE t.pk_id in (rnd) ORDER BY pk_id 
```
Change in WHERE "=" to IN and change the limit of returned rows by the subquery to 100.
Average execution time - 0.047 seconds
Time to receive 1000 entries - 0.053 seconds
Time to receive 10,000 records ~ 0.21 seconds
And finally, we take 100,000 records in 1.9 seconds.
The disadvantage of this approach is that the resulting number of rows from INFORMATION_SCHEMA is slightly larger than COUNT (*) and therefore, 7-8 rows are lost when returning 100,000 rows. At 1-100, there is almost no such thing (the larger the table, the less chance). But you can always take 1-2 lines more for reinsurance :)

MAX * rand ()

Getting one line:

 SELECT t.pk_id FROM test as t, (SELECT ROUND((SELECT MAX(pk_id) FROM test) *rand()) as rnd FROM test LIMIT 1) tmp WHERE t.pk_id = rnd

Average execution time - 0.001 seconds
Getting 100 lines:

 SELECT t.pk_id FROM test as t, (SELECT ROUND((SELECT MAX(pk_id) FROM test) *rand()) as rnd FROM test LIMIT 100) tmp WHERE t.pk_id in (rnd) ORDER BY pk_id

Average execution time - 0.003 seconds

The fastest way, relative to the previous ones.

findings

The first way is good because in any case it will return you a random string, regardless of the holes in the fields and their initial value, but the slowest
The second method is much better suited to tables where there are no holes. Works 6 times faster than the first method (on returning one line).
The third method can be used at your own risk (which is very minor), because you can lose a line (s) when the rand () value is as close as possible to 1. The return rate per line differs 150 times from the first method.
If we’re not returning 100 lines, but 99, then you can once again send a request to the server.
The fourth method is the fastest and 6000 times faster ORDER BY rand ()

UPD: In the case of holes in the table, when returning one line in the second and third way, you can do a non-equivalent check a> = and add LIMIT 1. Then the value will be returned, even if it falls into the “hole”
Thanks xel for this remark.
UPD2: Added 4 way to get. Thank you smagen for the idea.

Source: https://habr.com/ru/post/154905/

All Articles

MySQL Select random strings in one query

What do we have?

Ways to get

ORDER BY rand + LIMIT

COUNT * rand ()

INFORMATION_SCHEMA + LIMIT

MAX * rand ()

findings

More articles: