Often, when developing application software, you run into a problem of the following kind: for some intermediate data, several result sets need to be built. For example, for a set of products you may need both their presence in current orders and the discounts previously issued for them; or, for a set of users, a list of their friends and those users' messages in a social network, and so on.
The usual solution looks straightforward: first we get the list of, say, users, then we build the first result set for them; then we get the list of users again and build the second set. That would be fine if building the list were cheap, but if it is a costly operation and several results have to be built on top of it, the list ends up being computed several times, with all the associated overhead. Temporary tables seem to be the obvious answer, and indeed they are; unfortunately, they come with a number of unpleasant properties: a file has to be created for each temporary table (and deleted when the table is dropped). Moreover, temporary tables are not visible to autovacuum, so they are neither cleaned up automatically nor have statistics collected for them. Worse still, long-running transactions can lead to unbounded growth of the system catalog, and the OS cache fills up with entries for the files backing temporary tables, which degrades overall performance.
It should also be noted that, since a table name must be known when a query is planned, working with varying tables can be quite awkward and forces you to resort to dynamically built queries, with all the ensuing consequences; and if you recall that plpgsql does not cache plans for dynamic queries, this can become a significant problem for complex queries.
Another obvious option is arrays, but they have their own drawbacks. First, arrays are immutable: for small arrays this is no big deal, but for large ones it can be quite painful. Second, arrays exist only for the duration of the query, not for the whole session. Third, arrays do not support lookup by key.
Besides that, a session often needs temporary data available to all functions for the duration of a request: the current user, permissions, and so on.
Finally, when executing read-only queries on a replica, there is often a need to store temporary data somewhere.
The pg_variables extension is intended to solve exactly these problems.
What does it do?
First, it lets you define scalar session variables, which is valuable in itself: for example, you can store the identifier of the user on whose behalf the request is being executed, along with that user's various attributes. For example:
select pgv_set_int('package','usr_id',1)
and then:
select pgv_get_int('package','usr_id')
Here (and below) package is the name of the package and usr_id is the name of a variable within that package. Of course, there can be many such variables:
select pgv_set_int('package','#'||n,n), n from generate_series(1,1000000) as gs(n)
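Any of them can then be read back individually, and pgv_list() (which also appears in a function later in this article) shows what variables exist; a quick check:

select pgv_get_int('package', '#'||42);

select count(*) from pgv_list() where package = 'package';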
Besides integer, other types are supported: text, numeric, timestamp, timestamptz, date and jsonb. All of these variables exist for the lifetime of the session that set them and are not visible to other sessions.
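The setters and getters for these types follow the same naming pattern as the integer functions shown above; a brief illustration (check the extension's documentation for the exact signatures):

select pgv_set_text('package', 'usr_email', 'someone@mail.ru');
select pgv_set_jsonb('package', 'usr_attrs', '{"role": "admin"}'::jsonb);

select pgv_get_text('package', 'usr_email');
select pgv_get_jsonb('package', 'usr_attrs');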
In addition to scalar variables, this extension also supports sets.
Hereinafter, a test database of three tables is used, representing hypothetical users (ord.usr), products (ord.goods) and per-user discounts on goods (ord.discount). The database was created with the datafiller utility (https://www.cri.ensmp.fr/people/coelho/datafiller.html) using the following schema file:
CREATE TABLE ord.goods( -- df: mult=1000.0
  id SERIAL primary key,
  name TEXT NOT NULL, -- df: lenmin=3 lenmax=30 chars='af ' sub=uniform
  price numeric, -- df: float=gauss alpha=100.0 beta=30
  in_stock_qty int -- df: size=1000
);

create table ord.usr( -- df: mult=100
  id serial primary key,
  email text -- df: pattern='[az]{3,16}\.[az]{3,8}@((gmail|yahoo|mail)\.com|(mail|yandex|inbox)\.ru)'
);

create table ord.discount( -- df: mult=100
  goods_id int not null references ord.goods(id),
  usr_id int not null references ord.usr(id),
  pct numeric not null, -- df: alpha=0.01 beta=0.07
  from_date date not null, -- df: start=2010-01-01 end=2016-04-01
  duration integer not null -- df: offset=1 size=361 step=30
);
Let's look at sets in more detail. First, let's build a list of all mail.ru users:
select pgv_insert('package', 'set', row(u.id, u.email)) from ord.usr u where u.email like '%@mail.ru'
How expensive is it?
QUERY PLAN
Seq Scan on usr u  (cost=0.00..2041.96 rows=23984 width=30) (actual time=0.022..24.893 rows=16426 loops=1)
  Filter: (email ~~ '%@mail.ru'::text)
  Rows Removed by Filter: 83574
Planning time: 0.070 ms
Execution time: 25.404 ms
(5 rows)
The saved rows can be retrieved like this:
explain analyze select * from pgv_select('package','set') as usr(id int, email text)
What is the performance of this operation? Let's take a look:
Function Scan on pgv_select usr  (cost=0.00..10.00 rows=1000 width=36) (actual time=4.692..5.503 rows=16426 loops=1)
Planning time: 0.026 ms
Execution time: 10.733 ms
(3 rows)
Compare with an ordinary selection into a temporary table:
create temporary table usr_id_email(
  id int primary key,
  email text
);
and
explain analyze insert into usr_id_email select u.id, u.email from ord.usr u where u.email like '%@mail.ru'

Insert on usr_id_email  (cost=0.00..1982.00 rows=23984 width=30) (actual time=31.244..31.244 rows=0 loops=1)
  ->  Seq Scan on usr u  (cost=0.00..1982.00 rows=23984 width=30) (actual time=0.007..16.220 rows=16426 loops=1)
        Filter: (email ~~ '%@mail.ru'::text)
        Rows Removed by Filter: 83574
Planning time: 0.069 ms
Execution time: 31.285 ms
As you can see, the pgv_insert execution time is noticeably lower than that of the temporary-table variant. In addition, the execution time of the temporary-table variant depends heavily on the state of the OS cache since, as noted above, a file is created for each temporary table (and deleted when the table is dropped).
If you look closely at the code above, you can raise a fair objection: in the temporary-table variant, the id column is the primary key; is the comparison with pgv_insert fair? It is, actually: from the result set built by pgv_insert you can also fetch a row by user id:
select * from pgv_select('package','set',9545) as t(id int, email text)
When a result set is saved, the first column of the inserted rows serves as the primary key. Accordingly, it must be unique, and it can be used to find rows quickly.
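Since the first column acts as a primary key, inserting a second row with an already used key value is expected to fail (a small check of my own, not from the article; the exact behavior and error message depend on the extension version):

-- id 9545 is already present in 'set' (see the lookup above), so this should raise an error
select pgv_insert('package', 'set', row(9545, 'duplicate@mail.ru'));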
How fast is scanning the whole set compared to the temporary table? Compare:
explain analyze select * from generate_series(1,1000) as gs(n) left outer join pgv_select('package','set') as t(id int, email text) on true

Nested Loop Left Join  (cost=0.01..20010.01 rows=1000000 width=40) (actual time=10.282..2495.984 rows=16426000 loops=1)
  ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.171..0.279 rows=1000 loops=1)
  ->  Function Scan on pgv_select t  (cost=0.00..10.00 rows=1000 width=36) (actual time=0.010..0.817 rows=16426 loops=1000)
Planning time: 0.061 ms
Execution time: 2991.351 ms
and
explain analyze select * from generate_series(1,1000) as gs(n) left outer join usr_id_email on true

Nested Loop Left Join  (cost=0.00..189230.42 rows=15113000 width=40) (actual time=0.172..2390.766 rows=16426000 loops=1)
  ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.159..0.288 rows=1000 loops=1)
  ->  Materialize  (cost=0.00..345.69 rows=15113 width=36) (actual time=0.000..0.738 rows=16426 loops=1000)
        ->  Seq Scan on usr_id_email  (cost=0.00..270.13 rows=15113 width=36) (actual time=0.010..2.660 rows=16426 loops=1)
Planning time: 0.076 ms
Execution time: 2874.250 ms
As you can see, the timings are quite comparable: the pg_variables scan is slower than the temporary table by roughly 0.1 ms per iteration.
What about access to an individual row compared to the temporary table? Let's see. pgv_select can take a third argument, the primary key of the row we need:
explain analyze select * from usr_id_email uie where exists (select * from pgv_select('package','set',uie.id) as t(id int, email text))

Seq Scan on usr_id_email uie  (cost=0.00..459.04 rows=7556 width=36) (actual time=0.021..19.947 rows=16426 loops=1)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Function Scan on pgv_select t  (cost=0.00..1000.00 rows=100000 width=0) (actual time=0.001..0.001 rows=1 loops=16426)
Planning time: 0.047 ms
Execution time: 20.704 ms
Compare with our temporary table:
set enable_hashjoin=false;
set enable_mergejoin=false;

explain analyze select * from usr_id_email uie where exists (select * from usr_id_email uie2 where uie.id=uie2.id)

Nested Loop Semi Join  (cost=0.29..5620.94 rows=15113 width=36) (actual time=0.016..17.227 rows=16426 loops=1)
  ->  Seq Scan on usr_id_email uie  (cost=0.00..270.13 rows=15113 width=36) (actual time=0.007..1.130 rows=16426 loops=1)
  ->  Index Only Scan using usr_id_email_pkey on usr_id_email uie2  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=16426)
        Index Cond: (id = uie.id)
        Heap Fetches: 16426
Planning time: 0.082 ms
Execution time: 17.976 ms
As we can see, the times are quite comparable; however, I had to disable some join methods. Incidentally, if you re-enable them, the planner chooses another option, and in our case its result is noticeably better:
Hash Semi Join  (cost=459.04..936.98 rows=15113 width=36) (actual time=5.171..12.703 rows=16426 loops=1)
  Hash Cond: (uie.id = uie2.id)
  ->  Seq Scan on usr_id_email uie  (cost=0.00..270.13 rows=15113 width=36) (actual time=0.008..1.857 rows=16426 loops=1)
  ->  Hash  (cost=270.13..270.13 rows=15113 width=4) (actual time=5.150..5.150 rows=16426 loops=1)
        Buckets: 32768 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 513kB
        ->  Seq Scan on usr_id_email uie2  (cost=0.00..270.13 rows=15113 width=4) (actual time=0.003..2.417 rows=16426 loops=1)
Planning time: 0.107 ms
Execution time: 13.603 ms
But these are largely theoretical considerations; it is worth looking at what happens under a more or less realistic workload. We will use the pgbench utility for that, but first let's create the following functions. A plain query:
CREATE OR REPLACE FUNCTION public.get_mailru_discounts_plain()
RETURNS TABLE(usr_cnt integer, discounts_cnt integer) AS
$BODY$
begin
  select count(*) into usr_cnt from ord.usr u where u.email like 'ab%@mail.ru';
  select count(*) into discounts_cnt from ord.discount d, ord.usr u where u.email like 'ab%@mail.ru' and d.usr_id=u.id;
  return next;
end;
$BODY$
LANGUAGE plpgsql;
Using an array:
CREATE OR REPLACE FUNCTION public.get_mailru_discounts_array()
RETURNS TABLE(usr_cnt integer, discounts_cnt integer) AS
$BODY$
declare
  ids int[];
begin
  select array_agg(id) into ids from ord.usr u where u.email like 'ab%@mail.ru';
  get diagnostics usr_cnt = row_count;
  select count(*) into discounts_cnt from ord.discount d where d.usr_id=any(ids);
  return next;
end;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100 ROWS 1000;
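The benchmark below also calls get_mailru_discounts_temptable, whose definition did not make it into the text; here is a minimal sketch of what such a temporary-table variant might look like (my reconstruction, not the author's original code):

CREATE OR REPLACE FUNCTION public.get_mailru_discounts_temptable()
RETURNS TABLE(usr_cnt integer, discounts_cnt integer) AS
$BODY$
begin
  -- a temporary table is created on every call and dropped at commit,
  -- which is exactly the per-call file creation and removal discussed above
  create temporary table tmp_mailru_ids on commit drop as
    select u.id from ord.usr u where u.email like 'ab%@mail.ru';
  select count(*) into usr_cnt from tmp_mailru_ids;
  select count(*) into discounts_cnt
    from ord.discount d, tmp_mailru_ids t
   where d.usr_id = t.id;
  return next;
end;
$BODY$
LANGUAGE plpgsql;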
Using the extension:
create or replace function get_mailru_discounts_pgvariables()
returns table(usr_cnt int, discounts_cnt int) as
$code$
begin
  if exists(select * from pgv_list() where package='package' and name='set') then
    perform pgv_remove('package','set');
  end if;
  perform pgv_insert('package', 'set', row(id)) from ord.usr u where u.email like '%@mail.ru';
  get diagnostics usr_cnt = row_count;
  select count(*) into discounts_cnt from ord.discount d, pgv_select('package','set') u(id int) where d.usr_id=u.id;
  return next;
end;
$code$ language plpgsql;
The query files are simple:
select * from get_mailru_discounts_plain();
select * from get_mailru_discounts_temptable();
select * from get_mailru_discounts_pgvariables();
Starting pgbench:
pgbench -h localhost -p 5433 -U postgres -M prepared -c 32 -j 2 -n -f /tmp/test.pgb work
The results (in transactions per second) are presented in the table:
| Option \ transactions | 100 | 1000 | 5000 | 10000 | 20000 |
|---|---|---|---|---|---|
| plain | 10170 | 11349 | 11537 | 11560 | 11639 |
| temptable | 3364 | 3380 | 561 | 678 | 378 |
| pg_variables | 11852 | 15944 | 16634 | 16748 | 16719 |
As you can see, after a while the already modest performance of the temporary-table variant drops even further; as noted above, this is caused by the OS cache filling up with entries for temporary-table files that are no longer needed. It is also instructive to watch the pgbench output (a run of 5000 transactions):
...
progress: 1.0 s, 2205.8 tps, lat 11.907 ms stddev 13.122
progress: 2.0 s, 2497.0 tps, lat 12.237 ms stddev 14.372
progress: 3.0 s, 1945.0 tps, lat 15.882 ms stddev 22.674
progress: 4.0 s, 2746.1 tps, lat 12.569 ms stddev 16.776
progress: 5.0 s, 1814.2 tps, lat 16.601 ms stddev 27.144
progress: 6.0 s, 2067.4 tps, lat 15.629 ms stddev 24.284
progress: 7.0 s, 1535.0 tps, lat 20.828 ms stddev 30.302
progress: 8.0 s, 862.0 tps, lat 37.671 ms stddev 45.891
progress: 9.0 s, 1312.8 tps, lat 25.218 ms stddev 35.340
progress: 10.0 s, 1213.1 tps, lat 25.686 ms stddev 37.921
progress: 11.0 s, 962.0 tps, lat 33.685 ms stddev 37.641
progress: 12.0 s, 1455.0 tps, lat 22.055 ms stddev 27.562
progress: 13.0 s, 1146.0 tps, lat 28.127 ms stddev 33.272
progress: 14.0 s, 791.0 tps, lat 37.760 ms stddev 41.861
progress: 15.0 s, 659.9 tps, lat 42.713 ms stddev 51.816
...
You can clearly see how the performance degrades: starting at a relatively brisk 2205 tps, it quickly slides down to about half a thousand and then keeps falling. Running the command
/bin/echo 3 >/proc/sys/vm/drop_caches
improves the situation somewhat, but not for long.
The tests above show that, as a store for temporary data, the pg_variables module is much more efficient than temporary tables, and also more convenient: a data set is identified by a package-variable pair, which can be passed around as parameters, returned from functions, and so on.
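As a small sketch of that convenience (my own example, not from the article): a function can take the package and variable names as ordinary text parameters and work with whatever set they point to:

create or replace function count_saved_users(p_package text, p_variable text)
returns bigint as
$code$
  -- the column definition list still has to match the structure of the saved rows
  select count(*) from pgv_select(p_package, p_variable) as t(id int, email text);
$code$ language sql;

select count_saved_users('package', 'set');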
Unfortunately, the data saved by the extension lives outside of transactions: it is kept both when the transaction commits and when it rolls back. Moreover, even a single statement that fails partway through can leave partial data behind:
work=# select pgv_insert('package', 'errs',row(n))
work-# from generate_series(1,5) as gs(n) where 1.0/(n-3)<>0;
ERROR:  division by zero
work=# select * from pgv_select('package','errs') as r(i int);
 i
---
 1
 2
(2 rows)
On the one hand, this is not very convenient: in some cases you have to clean up incorrectly written data yourself. On the other hand, it can be quite useful, for example to preserve some data even when a transaction is rolled back.
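For the inconvenient case, one possible pattern (my own sketch, not from the article) is to remove the partially written set explicitly when a statement fails:

do $code$
begin
  perform pgv_insert('package', 'tmp_set', row(n))
     from generate_series(1,5) as gs(n)
    where 1.0/(n-3) <> 0;
exception
  when others then
    -- rows written before the error survive the failed statement, so drop them
    if exists (select * from pgv_list() where package = 'package' and name = 'tmp_set') then
      perform pgv_remove('package', 'tmp_set');
    end if;
    raise;
end;
$code$;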
Of course, the module has its drawbacks. For example, it is not the best choice if you need to perform a complex search over a large amount of temporary data; in that case it is more convenient and faster to create a temporary table, insert the data into it, build indexes (of any kind, not just a hash over a single column), collect statistics, and run the required query against it.
At the same time, it is worth noting that in some cases temporary tables can still be a perfectly reasonable choice, namely when the transaction rate is not too high. With a load of about 10 tps, the temporary-table variant behaves quite acceptably, provided you keep in mind that long-running transactions in other sessions can still cause the system catalog to grow.
Full documentation of the module is available at: https://github.com/postgrespro/pg_variables
Source: https://habr.com/ru/post/302200/