
Postgres: sampling N random records

While working on a project, I needed to build a kind of testing system. Stripped of formal language, the task boiled down to this: from a table, you need to pick 3-5 random records, twice in a row; there must be no duplicates between the picks, and the sampling must be random.

The first thing that comes to mind:

 SELECT * FROM data_set
 WHERE id NOT IN (1, 2, 3, 4, 5)
 ORDER BY random()
 LIMIT 5;

And it will even work. The trouble is what this solution costs...
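The cost is easy to check for yourself: ORDER BY random() has to invoke random() for every remaining row and then sort the entire result just to keep five rows. A quick sketch (same hypothetical data_set table as above):

 EXPLAIN ANALYZE
 SELECT * FROM data_set
 WHERE id NOT IN (1, 2, 3, 4, 5)
 ORDER BY random()
 LIMIT 5;

The plan shows a sequential scan feeding a sort over the whole table, which is exactly what hurts as the table grows.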

Therefore, we had to appeal to the "higher mind", which hinted at a solution.

 WITH RECURSIVE r AS (
     WITH b AS (SELECT min(id), max(id) FROM table1)
     (
         SELECT id, min, max, array[]::integer[] AS a, 0 AS n
         FROM table1, b
         WHERE id > min + (max - min) * random()
         LIMIT 1
     )
     UNION ALL (
         SELECT t.id, min, max, a || t.id, r.n + 1 AS n
         FROM table1 AS t, r
         WHERE t.id > min + (max - min) * random()
           AND t.id <> all(a)
           AND r.n + 1 < 10
         LIMIT 1
     )
 )
 SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The only problem: the solution is... "slightly" incomprehensible. And nobody is eager to drag incomprehensible code into a project. So it was decided to follow a proven path and find a proper explanation of how recursive queries actually work.

So, the logic has become more or less clear: N times in a row we select a single row at a random offset from the minimum value of the id (primary key). The query thus touches at most N rows instead of brute-forcing the entire table.
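In isolation, the core step looks something like this (a sketch, assuming the same table1 with an integer primary key id):

 -- pick one quasi-random row by jumping to a random point in the id range
 SELECT id
 FROM table1, (SELECT min(id) AS mn, max(id) AS mx FROM table1) AS b
 WHERE id > mn + (mx - mn) * random()
 LIMIT 1;

The recursive query simply repeats this step N times, carrying along the array of already-selected ids.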

But "it was smooth on paper." When trying to use it "as is" (adjusted for table / field names), "surprises" surfaced:

  1. The array "a" starts out empty. Because of this, duplicates can appear in the final sample. Say the anchor part of the query (the first SELECT) returns id == 5. The recursive part after UNION ALL may then return the same id == 5 at some step, since the previously selected id never made it into the array "a" and the check "t.id <> all(a)" passes;
  2. The number of values at the "output" never exceeded the requested limit (the "r.n + 1 < 10" condition). But it could easily be less, even when the table was guaranteed to hold enough rows. Sometimes an empty sample came back (zero rows).

The first "feature" turned out to be relatively easy to deal with: it was enough to change the line with the array initialization, like this:

 SELECT id, min, max, array[]::integer[] || id AS a, 0 AS n 
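As a tiny illustration of what changed: the "||" operator appends an element to an array, so the very first selected id now lands in "a" immediately:

 SELECT array[]::integer[] || 5 AS a;  -- yields {5}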

The second point, though, required some serious head-scratching. The catch turned out to be in the very heart of the algorithm: selecting a record from a range. Indeed, the recursive query tries to select a row satisfying the condition:

 id > min + (max - min) * random() 

But when random() returns "1" (or a value close enough to it), this condition degenerates into:

 id > max 

Naturally, in this case the query finds nothing and the recursion stops. And if this happens on the very first iteration, the "output" is empty, even though the table obviously contains suitable records.
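The degenerate case is easy to reproduce by substituting the worst-case value of random() by hand (same hypothetical table1 as before):

 -- with random() at its ceiling the anchor part matches nothing:
 SELECT id
 FROM table1, (SELECT min(id) AS mn, max(id) AS mx FROM table1) AS b
 WHERE id > mn + (mx - mn) * 1.0   -- i.e. id > mx
 LIMIT 1;                          -- 0 rows, so the whole CTE yields nothing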

The first impulse was to slightly change the condition and bring it to something like this:

 id >= min + (max - min) * random() 

This "solution", so to speak, guaranteed at least one row at the output. But it still did not guarantee the requested number of rows: if, for example, a near-1 draw leaves only ids that are already in the array "a", the recursive step returns nothing and the recursion ends early. I had to think further.

After much deliberation, many attempts, and liters of stimulants, the following variant came to light:

The query:
 WITH RECURSIVE r AS (
     WITH b AS (
         SELECT
             min(t.id),
             (
                 SELECT t.id
                 FROM table1 AS t
                 ORDER BY t.id DESC
                 LIMIT 1 OFFSET 9
             ) max
         FROM table1 AS t
     )
     (
         SELECT id, min, max, array[]::integer[] || id AS a, 0 AS n
         FROM table1 AS t, b
         WHERE id >= min + ((max - min) * random())::int
         LIMIT 1
     )
     UNION ALL (
         SELECT t.id, min, max, a || t.id, r.n + 1 AS n
         FROM table1 AS t, r
         WHERE t.id >= min + ((max - min) * random())::int
           AND t.id <> all(a)
           AND r.n + 1 < 10
         LIMIT 1
     )
 )
 SELECT t.* FROM table1 AS t, r WHERE r.id = t.id;

All the salt is in the subquery that computes max. To guarantee N elements at the "output", one has to plan for the worst possible scenario: that random() returns 1 N times in a row. Therefore we need to know/have N-1 values before max. How to get them inside the query? One option: sort all the records by id in descending order and offset "down" by N rows. And since ">=" is used in both the anchor and the recursive part, the offset can be one less (N-1).
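In other words, max here is not the true maximum at all, but the N-th id from the top. A sketch for N = 10, matching the OFFSET 9 above:

 -- the 10th-largest id: even ten worst-case draws in a row still have
 -- ten distinct ids with id >= threshold available to them
 SELECT t.id
 FROM table1 AS t
 ORDER BY t.id DESC
 LIMIT 1 OFFSET 10 - 1;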

Good: the main difficulties are resolved, and only "little things" remain: in its current form the query is of little use. After all, the sample usually has to honor certain conditions. In the simplest case, the IDs of records selected in the previous step must be excluded. Besides, one cannot rule out additional conditions imposed on the table rows (is_active = true, is_deleted = false, ...). After some reflection it becomes clear that the conditions have to be pushed into all the essential parts of the query (nearly every subquery). Roughly as in the following template:

The template:
 WITH RECURSIVE r AS (
     WITH b AS (
         SELECT
             min(t.{pk}),
             (
                 SELECT t.{pk}
                 FROM {table} AS t
                 WHERE {exclude} t.is_active
                 ORDER BY t.{pk} DESC
                 LIMIT 1 OFFSET {limit} - 1
             ) max
         FROM {table} AS t
         WHERE {exclude} t.is_active
     )
     (
         SELECT t.{pk}, min, max, array[]::integer[] || t.{pk} AS a, 0 AS n
         FROM {table} AS t, b
         WHERE t.{pk} >= min + ((max - min) * random())::int
           AND {exclude} t.is_active
         LIMIT 1
     )
     UNION ALL (
         SELECT t.{pk}, min, max, a || t.{pk}, r.n + 1 AS n
         FROM {table} AS t, r
         WHERE t.{pk} >= min + ((max - min) * random())::int
           AND t.{pk} <> all(a)
           AND r.n + 1 < {limit}
           AND {exclude} t.is_active
         LIMIT 1
     )
 )
 SELECT {fields} FROM {table} AS t, r WHERE r.{pk} = t.{pk}


Here, the named parameters in curly brackets are to be substituted:

  {pk} - the name of the primary key (ID) column;
  {table} - the table name;
  {exclude} - an extra exclusion condition, e.g. "t.id NOT IN (...) AND" (note the trailing AND, since it is prepended to "t.is_active");
  {limit} - the number of records to sample;
  {fields} - the list of fields to return, e.g. "t.*".

And finally, last on the list but not least in importance: is the game worth the candle? What is the "profit" from this contraption? Let's check.

To begin, let's create a table to experiment on.

 DROP TABLE IF EXISTS ttbl;

 CREATE TABLE IF NOT EXISTS ttbl (
     id serial NOT NULL,
     is_active BOOL NOT NULL DEFAULT true,
     CONSTRAINT ttbl_pkey PRIMARY KEY (id)
 ) WITH (OIDS=FALSE);

Now let's fill it with data. The id values must not run strictly sequentially but must contain "holes", i.e. not "1, 2, 3, 4, 5..." but rather "1, 4, 5, 8...". For that, we sketch a simple script.
The script:
 import random

 import psycopg2

 DB_TABLE = 'ttbl'
 PK_NAME = 'id'
 DATABASES = {
     'NAME': 'test_db',
     'USER': 'user_test',
     'PASSWORD': 'R#Q9Jw*ZEKWOhBX+EP|3/xGkQn3',
     'HOST': 'localhost',
     'PORT': '',
 }

 TOTAL = 10000000
 current = 0
 step = 10000
 dev = 8

 # the connection/cursor setup was missing from the original listing
 # (PORT is left empty and therefore omitted from the connect call)
 conn = psycopg2.connect(
     dbname=DATABASES['NAME'],
     user=DATABASES['USER'],
     password=DATABASES['PASSWORD'],
     host=DATABASES['HOST'],
 )
 cur = conn.cursor()

 while current <= TOTAL:
     data = set()
     # one id per bucket of 10, shifted by a random 0..dev-1 offset
     for x in range(current, current + step, 10):
         data.add(x + int(random.random() * dev))
     cur.execute(
         "INSERT INTO {t_table} VALUES {t_items};".format(
             t_table=DB_TABLE,
             t_items='(' + '), ('.join(map(str, data)) + ')'
         )
     )
     current += step
     print(current)

 cur.execute('COMMIT;')


As you can see from the code, each INSERT adds about a thousand records. The values grow in steps of roughly "10". Roughly, because each value can be shifted upward by a random amount in the range 0-7 (int(random.random() * dev) with dev = 8). I.e., for two consecutive values of x, "340" and "350", ids from the ranges 340-347 and 350-357 can end up in the table (342, 347, 340...; 351, 355, 357...).
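As an aside, the same "holey" sequence could be generated without Python at all; a pure-SQL sketch with generate_series (equivalent in spirit, not the script actually used):

 INSERT INTO ttbl (id)
 SELECT x + (random() * 7)::int   -- random shift of 0..7 within each bucket of 10
 FROM generate_series(0, 10000000, 10) AS x;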

In total, the table ended up with:

 select count(id) from ttbl;

1,001,000 records.
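The ids are also suitably sparse: roughly a million rows spread over an id range about ten million wide, which is easy to verify:

 SELECT count(id), min(id), max(id) FROM ttbl;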

A pretty decent amount. Let's try sampling. Clearly a single run proves nothing, so we do a series of consecutive runs: for definiteness, 5 runs of each type of query. We tabulate the results and compute the average.
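For reference, timings like those below can be taken straight from psql by enabling client-side timing:

 \timing on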

First, the simple query:

 SELECT t.*
 FROM ttbl t
 WHERE t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
   AND t.is_active
 ORDER BY random()
 LIMIT 5;

Results:

 Run    Time, ms
 1      697
 2      605
 3      624
 4      593
 5      611

As you can see, the average query execution time is about 600 ms*.

And now, the "monster":

 WITH RECURSIVE r AS (
     WITH b AS (
         SELECT
             min(t.id),
             (
                 SELECT t.id
                 FROM ttbl AS t
                 WHERE t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
                   AND t.is_active
                 ORDER BY t.id DESC
                 LIMIT 1 OFFSET 5 - 1
             ) max
         FROM ttbl AS t
         WHERE t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
           AND t.is_active
     )
     (
         SELECT id, min, max, array[]::integer[] || id AS a, 0 AS n
         FROM ttbl AS t, b
         WHERE id >= min + ((max - min) * random())::int
           AND t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
           AND t.is_active
         LIMIT 1
     )
     UNION ALL (
         SELECT t.id, min, max, a || t.id, r.n + 1 AS n
         FROM ttbl AS t, r
         WHERE t.id >= min + ((max - min) * random())::int
           AND t.id <> all(a)
           AND r.n + 1 < 5
           AND t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
           AND t.is_active
         LIMIT 1
     )
 )
 SELECT t.* FROM ttbl AS t, r WHERE r.id = t.id;


Results:

 Run    Time, ms
 1      15
 2      17
 3      8
 4      12
 5      12

The average time is about 15 ms*.

The total difference is about one and a half orders of magnitude (40-50 times). It was worth it.

The queries were also checked with a cold start (after restarting the machine/daemon). And although the execution times changed in absolute terms, the ratio stayed roughly constant: the recursive query was consistently a few dozen times faster than the head-on solution.

Notes


* "About", because the exact values play no real role here: they drift with the Postgres cache, the influence of the operating system, hardware, other running software, etc.

Source: https://habr.com/ru/post/242999/

