
PostgreSQL 9.5: what's new? Part 2. TABLESAMPLE

We continue the review of innovations in PostgreSQL 9.5.
Part 1. INSERT ... ON CONFLICT DO NOTHING / UPDATE and ROW LEVEL SECURITY.
Part 3. GROUPING SETS, CUBE, ROLLUP
From the author
I apologize for the delay in releasing the second part. I had originally planned to publish it a week after the first, but a heavy workload got in the way. So I have decided not to publish large articles, but to publish smaller pieces more often.

Sometimes you need to select a certain number of random rows from a table; people used to write rather elaborate queries for this (getting truly random data takes real effort). With the release of PostgreSQL 9.5, this task becomes easier.
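For comparison, the classic pre-9.5 approach looks something like this (a sketch against the transactions table created below); it forces a full scan and a sort just to pick a handful of rows, which is expensive on large tables:

-- The old way: scan and sort the entire table to pick 100 random rows.
SELECT * FROM transactions ORDER BY random() LIMIT 100;
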
With the TABLESAMPLE keyword, instead of reading all the data from a table you can read only part of it, i.e. take a sample.
The syntax looks like this:

SELECT ... FROM TABLE_NAME ... TABLESAMPLE sampling_method ( argument [ , ... ] ) [ REPEATABLE ( seed ) ]

sampling_method is the sampling method; PostgreSQL 9.5 ships with two out of the box: SYSTEM and BERNOULLI. As an argument they take a floating-point number (or any valid expression that evaluates to a number), which is interpreted as the percentage of the table to sample: from 0 to 100.
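For instance, a request for roughly 5% of a table with a fixed seed would be written like this (the table name is illustrative):

SELECT * FROM big_table TABLESAMPLE BERNOULLI (5) REPEATABLE (42);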

Let's look at examples of how sampling in PostgreSQL 9.5 works.
Suppose we have a transactions table that stores the transaction id, the amount, and the timestamp of when the transaction completed. Let's add 100,000 rows to it.
CREATE TABLE transactions (
    id SERIAL PRIMARY KEY,
    amount NUMERIC(15, 2),
    ending_time TIMESTAMP
);

INSERT INTO transactions (amount, ending_time)
SELECT
    round(CAST(random() * 100000 AS NUMERIC), 2),
    now() - random() * CAST('1 day' AS INTERVAL)
FROM generate_series(1, 100000);

Let's try to take a sample of 0.1% of the source table (100 rows):
SELECT * FROM transactions TABLESAMPLE SYSTEM (0.1);

Total query runtime: 213 ms.
157 rows retrieved.

Why did we get 157 rows instead of 100? The reason is that PostgreSQL stores table data as an array of 8 kB pages (the default; it can be changed when building the server from source), and the SYSTEM sampling method simply takes the number of random pages corresponding to the requested percentage and returns them "as is". In this case, 157 rows fit on one page. If we ask for a sample twice as large, data from 2 pages is taken:
SELECT * FROM transactions TABLESAMPLE SYSTEM (0.2);

Total query runtime: 21 ms.
314 rows retrieved.

Keep in mind that different pages can hold different numbers of rows, so the number of rows returned can vary from query to query.
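You can cross-check this page arithmetic through the planner statistics (a sketch; it assumes the default 8 kB page size and figures refreshed by a recent ANALYZE):

-- Pages occupied by the table and a rough estimate of rows per page.
SELECT relpages,
       reltuples::bigint AS row_estimate,
       round(reltuples / NULLIF(relpages, 0)) AS rows_per_page
FROM pg_class
WHERE relname = 'transactions';
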
To get an exact number of rows you can add a LIMIT clause (a sketch follows after the example below), but bear in mind that the rows will still come from a single page. So if the values in the rows depend on insertion order, or the values themselves are chronological (as in our case with the ending_time field), sampling will most likely give meaningless results. For example, if we try to determine the maximum transaction timestamp through a sample, running the same query several times gives completely different answers:
SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM (0.1);

max
2014-11-08 22:30:32.720855

SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM (0.1);

max
2014-12-02 11:42:32.720855

SELECT MAX(ending_time) FROM transactions TABLESAMPLE SYSTEM (0.1);

max
2014-10-21 09:40:32.720855

Whereas the real value will be:
SELECT MAX(ending_time) FROM transactions;

max
2014-12-07 04:04:32.720855
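As mentioned above, if you need exactly a fixed number of rows you can cap the sample with LIMIT (a minimal sketch); with SYSTEM the rows will still come from the same page or pages, so the clustering caveat applies:

SELECT * FROM transactions TABLESAMPLE SYSTEM (0.1) LIMIT 100;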

To get a more evenly distributed sample, you can use the BERNOULLI sampling method, which scans the whole table (essentially "flipping a coin" for each row) and picks random rows:
SELECT MAX(ending_time) FROM transactions TABLESAMPLE BERNOULLI (0.1);

max
2014-12-07 00:06:32.720855

Now let's look at performance. Let's analyze the computation of the average transaction amount in three ways, and also get the average itself:

1) No sample:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions;
"Aggregate (cost=1887.00..1887.01 rows=1 width=8) (actual time=25.795..25.795 rows=1 loops=1)"
"-> Seq Scan on transactions (cost=0.00..1637.00 rows=100000 width=8) (actual time=0.005..12.438 rows=100000 loops=1)"
"Planning time: 0.055 ms"
"Execution time: 25.816 ms"

SELECT AVG(amount) FROM transactions;
50028.8742828

2) SYSTEM sampling method:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1);
"Aggregate (cost=1.25..1.26 rows=1 width=8) (actual time=0.088..0.088 rows=1 loops=1)"
"-> Sample Scan (system) on transactions (cost=0.00..1.00 rows=100 width=8) (actual time=0.017..0.048 rows=157 loops=1)"
"Planning time: 0.068 ms"
"Execution time: 0.120 ms"

SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1);
53628.223694267516

3) BERNOULLI sampling method:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1);
"Aggregate (cost=638.25..638.26 rows=1 width=8) (actual time=2.847..2.847 rows=1 loops=1)"
"-> Sample Scan (bernoulli) on transactions (cost=0.00..638.00 rows=100 width=8) (actual time=0.020..2.780 rows=104 loops=1)"
"Planning time: 0.145 ms"
"Execution time: 2.872 ms"

SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1);
50285.863240740741

We can see that SYSTEM sampling is faster but less accurate, while BERNOULLI sampling is slower but more accurate. You can choose your own trade-off between speed and accuracy. Note also that a new scan type is used for sampling: Sample Scan.
Let's add more rows to the table, bringing the total to 20 million:
INSERT INTO transactions (amount, ending_time)
SELECT
    round(CAST(random() * 100000 AS DECIMAL), 2),
    now() - INTERVAL '1 year' + (i * INTERVAL '1 minute')
FROM generate_series(100001, 20000000) i;

EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions;

"Aggregate (cost = 377372.70 ..377372.71 rows = 1 width = 8) (actual time = 4604.297..4604.297 rows = 1 loops = 1)"
"-> Seq Scan on transactions (cost = 0.00 ..327375.96 rows = 19998696 width = 8) (actual time = 0.027..2043.846 rows = 20000000 loops = 1)"
"Planning time: 0.063 ms"
"Execution time: 4604.325 ms"

SELECT AVG(amount) FROM transactions;
50002.888681451

EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1);
"Aggregate (cost=757.99..758.00 rows=1 width=8) (actual time=7.309..7.309 rows=1 loops=1)"
"-> Sample Scan (system) on transactions (cost=0.00..707.99 rows=19999 width=8) (actual time=0.057..4.588 rows=20096 loops=1)"
"Planning time: 0.073 ms"
"Execution time: 7.340 ms"

SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (0.1);
50323.198322551752

EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1);
"Aggregate (cost=127638.99..127639.00 rows=1 width=8) (actual time=751.831..751.832 rows=1 loops=1)"
"-> Sample Scan (bernoulli) on transactions (cost=0.00..127588.99 rows=19999 width=8) (actual time=0.260..747.682 rows=19899 loops=1)"
"Planning time: 0.055 ms"
"Execution time: 751.879 ms"

SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1);
50043.386386377336

We can see that as the number of rows grows, the BERNOULLI method loses more performance. This is because it effectively performs a full scan of the table, while SYSTEM simply returns a few pages.

Now let's try increasing the sampling percentage:
EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE SYSTEM (1);
"Aggregate (cost=7591.84..7591.85 rows=1 width=8) (actual time=65.055..65.055 rows=1 loops=1)"
"-> Sample Scan (system) on transactions (cost=0.00..7091.87 rows=199987 width=8) (actual time=0.043..37.939 rows=200018 loops=1)"
"Planning time: 0.053 ms"
"Execution time: 65.083 ms"

EXPLAIN ANALYZE SELECT AVG(amount) FROM transactions TABLESAMPLE BERNOULLI (1);
"Aggregate (cost=129888.84..129888.85 rows=1 width=8) (actual time=799.826..799.826 rows=1 loops=1)"
"-> Sample Scan (bernoulli) on transactions (cost=0.00..129388.87 rows=199987 width=8) (actual time=0.035..769.899 rows=199682 loops=1)"
"Planning time: 0.063 ms"
"Execution time: 799.859 ms"

As you can see, it is the SYSTEM method that loses more performance as the sampling percentage grows. That makes sense: BERNOULLI was doing a full scan before and still is, whereas SYSTEM now has to return 10 times as many pages.
To sum up: with a small sampling percentage the SYSTEM method is much faster than BERNOULLI, but gives a less random selection of rows; as the percentage grows, this advantage disappears.

With the optional REPEATABLE keyword we can set the seed for the random number generator. If two queries use the same sampling method, sampling percentage and seed, the same sample will be selected for both:
SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (50);
99997.91

SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (300);
99999.15

SELECT MAX(amount) FROM transactions TABLESAMPLE BERNOULLI (0.1) REPEATABLE (50);
99995.9

SELECT MAX(amount) FROM transactions TABLESAMPLE SYSTEM (0.1) REPEATABLE (50);
99997.91

As we saw above, if the REPEATABLE keyword is not specified, each sample will differ from the previous one.

Separately, it is worth noting that sampling is performed BEFORE the WHERE condition, so you cannot sample within a condition. The sample is taken first, and only then is the WHERE condition applied to it; since the probability of hitting rows with id < 100 in a table of 20,000,000 rows is tiny, the result ends up empty:
SELECT * FROM transactions TABLESAMPLE SYSTEM (1) WHERE id < 100;
Total query runtime: 31 ms.
0 rows retrieved.
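If you genuinely need to sample within a condition, one workaround (a sketch; the table name is illustrative) is to materialize the filtered rows first and sample those, since TABLESAMPLE can only be applied to a real table or materialized view:

-- Materialize the filtered subset, then sample roughly half of it.
CREATE TEMP TABLE small_transactions AS
    SELECT * FROM transactions WHERE id < 100;

SELECT * FROM small_transactions TABLESAMPLE BERNOULLI (50);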

SYSTEM and BERNOULLI are not the only possible sampling options; if you wish, you can write your own sampling method (see the PostgreSQL documentation chapter on writing a table sampling method). Custom sampling methods may take more than one argument, or none at all, and may also ignore the REPEATABLE keyword.

This concludes my brief overview of sampling in PostgreSQL 9.5. Thanks for your attention!

P.S. You can roughly estimate the number of rows in the table using a sample :)
SELECT COUNT(*) * 100.0 FROM transactions TABLESAMPLE SYSTEM (1);
20001800

P.P.S. Don't actually do that; it is a joke and does not always work as it should.
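P.P.P.S. If all you need is a rough row count, the planner statistics are a steadier source than a sample (the estimate is only as fresh as the last ANALYZE or VACUUM):

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'transactions';
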
In the next part: GROUPING SETS, ROLLUP, CUBE.

Source: https://habr.com/ru/post/266759/

