
How to make PostgreSQL count faster




Everyone can count, but not everyone can count quickly. In this article we will take a closer look at methods of optimizing count in PostgreSQL. There are techniques that can speed up counting rows by orders of magnitude.


If you approach the issue seriously, you need to distinguish several variants of count, each with its own methods. What you need to decide:

  1. Do you need an exact row count, or is an estimate good enough?
  2. Are you counting all rows, duplicates included, or only distinct values?



We will analyze solutions for each situation and compare their speed and resource consumption. After covering the centralized-database case, we will use Citus to demonstrate parallel execution of count in a distributed database.





DB Preparation


For the tests we will use a database called count, initialized with pgbench:


[user@comp ~]$ pgbench -i count 

Create a test table:


 -- create a test table with a million rows of random data
 CREATE TABLE items AS
   SELECT (random()*1000000)::integer AS n,
          md5(random()::text) AS s
   FROM generate_series(1,1000000);

 -- refresh the planner statistics
 VACUUM ANALYZE;

Count with duplicates


Accurate count

So, let's start from the beginning: getting the exact number of rows of an entire table, or of the part matching a condition, duplicates included: the good old count(*). The execution time of this command will serve as a baseline for estimating the speed of the other counting methods.


Pgbench is a handy tool for repeatedly running a query and collecting performance statistics.


 # PostgreSQL 9.5.4
 echo "SELECT count(*) FROM items;" | pgbench -d count -t 50 -P 1 -f -

 # average 84.915 ms
 # stddev   5.251 ms

A note about count(1) vs count(*). You might think that count(1) is faster, since count(*) seemingly has to process the values of all columns in the current row. In fact, the opposite is true. Unlike the SELECT * construct, the asterisk in count(*) means nothing. PostgreSQL treats count(*) as a special case of count with no arguments (it would arguably be more accurate to write it as count()). count(1), on the other hand, takes one argument, and PostgreSQL must check for every row that this argument (1) is indeed not NULL.
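A quick way to convince yourself of the NULL-skipping behavior of count with an argument:

 -- count(*) counts rows; count(expr) counts rows where expr is not NULL,
 -- so a constant NULL argument yields zero
 SELECT count(*) AS all_rows, count(NULL::int) AS null_arg FROM items;
 -- all_rows = 1000000, null_arg = 0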


Running the same test with count(1) produced the following results:


 # average 98.896 ms
 # stddev   7.280 ms

In any case, both count(1) and count(*) are fundamentally slow. To keep concurrent transactions consistent, PostgreSQL uses multiversion concurrency control (MVCC). This means that different transactions can see different rows, and even a different number of rows, in the same table. Therefore there is no single correct row count that the DBMS could cache, and the system has to scan all rows to determine which of them are visible to a given transaction. The execution time of an exact count grows linearly with table size.


 EXPLAIN SELECT count(*) FROM items;

 Aggregate  (cost=20834.00..20834.01 rows=1 width=0)
   ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=0)

The scan accounts for 88% of the total query cost. If you double the size of the table, the execution time roughly doubles, with a proportional increase in the cost of both the scan and the aggregate.


Number of rows    Average time
1 million         85 ms
2 million         161 ms
4 million         343 ms

How can we speed this up? There are two options: decide that an estimated value is good enough for us, or cache the row count ourselves. In the second case we have to store a separate value for each table and for each WHERE clause for which we want a fast count.


Let's look at an example of manually caching the value of count(*) for the entire items table. The following trigger-based solution is an adaptation of the method proposed by A. Elein Mustain. PostgreSQL's MVCC machinery will keep items and the table holding its row count consistent.


 BEGIN;

 CREATE TABLE row_counts (
   relname   text PRIMARY KEY,
   reltuples bigint
 );

 -- seed the cache with the current row count
 INSERT INTO row_counts (relname, reltuples)
   VALUES ('items', (SELECT count(*) FROM items));

 CREATE OR REPLACE FUNCTION adjust_count() RETURNS TRIGGER AS
 $$
 DECLARE
 BEGIN
   IF TG_OP = 'INSERT' THEN
     EXECUTE 'UPDATE row_counts SET reltuples = reltuples + 1 WHERE relname = ''' || TG_RELNAME || '''';
     RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
     EXECUTE 'UPDATE row_counts SET reltuples = reltuples - 1 WHERE relname = ''' || TG_RELNAME || '''';
     RETURN OLD;
   END IF;
 END;
 $$
 LANGUAGE 'plpgsql';

 CREATE TRIGGER items_count
   BEFORE INSERT OR DELETE ON items
   FOR EACH ROW EXECUTE PROCEDURE adjust_count();

 COMMIT;
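With the trigger in place, reading the cached value is an ordinary primary-key lookup; a minimal example:

 -- constant-time read of the cached row count
 SELECT reltuples FROM row_counts WHERE relname = 'items';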

The speed of reading and updating the cached value does not depend on the size of the table, and getting the row count is very fast. However, this technique increases the overhead of inserts and deletes. Without the trigger, the following command runs in 4.7 seconds, while with the trigger the same insert is fifty times slower:


 INSERT INTO items (n, s)
   SELECT (random()*1000000)::integer AS n, md5(random()::text) AS s
   FROM generate_series(1,1000000);

Estimated count

Estimate for the entire table

The approach of caching the table row count slows down inserts. If we are ready to accept an estimate instead of an exact number, we can get fast reads without degrading insert time. To do this we can rely on service data collected by PostgreSQL. Its sources are the stats collector and the autovacuum daemon.


Options for obtaining estimated values:


 -- estimate from the stats collector
 SELECT n_live_tup
   FROM pg_stat_all_tables
  WHERE relname = 'items';

 -- estimate maintained by VACUUM and ANALYZE
 SELECT reltuples
   FROM pg_class
  WHERE relname = 'items';

But there is a more reliable source whose data is updated more often. Andrew Gierth (RhodiumToad) recommends:


Remember: the planner does not actually use reltuples directly; it multiplies the reltuples / relpages ratio by the current number of pages.

The logic is as follows: as the amount of data in the table grows, the average number of rows that fit on a physical page generally changes much less than their total number. To get a more accurate estimate of the current row count, we can multiply the average number of rows per page by up-to-date information about the number of pages the table currently occupies.


 -- pg_relation_size and block_size are always up to date,
 -- so combining them with the reltuples/relpages ratio
 -- gives a better estimate of the current row count
 SELECT
   (reltuples / relpages) *
   (pg_relation_size('items') / current_setting('block_size')::integer)
 FROM pg_class
 WHERE relname = 'items';

Estimate for a filtered query

In the previous section we looked at getting an estimated row count for an entire table, but can we do the same for only the rows matching a WHERE condition? Michael Fuhr came up with an interesting way: run EXPLAIN for the query and parse its output.


 CREATE FUNCTION count_estimate(query text) RETURNS integer AS
 $$
 DECLARE
   rec  record;
   rows integer;
 BEGIN
   FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
     rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
     EXIT WHEN rows IS NOT NULL;
   END LOOP;
   RETURN rows;
 END;
 $$ LANGUAGE plpgsql VOLATILE STRICT;

This function can be used as follows:


 SELECT count_estimate('SELECT 1 FROM items WHERE n < 1000'); 

The accuracy of this method depends on the planner, which uses several techniques to estimate the selectivity of the WHERE clause and, from it, the number of rows the query will return.
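To get a sense of the accuracy, you can compare the estimate with the exact count for the same predicate, for example:

 -- planner estimate vs. exact count for the same condition
 SELECT count_estimate('SELECT 1 FROM items WHERE n < 1000') AS estimated,
        (SELECT count(*) FROM items WHERE n < 1000)          AS exact;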


Distinct count (no duplicates)


Accurate count

Default behavior when there is insufficient memory

Counting with duplicates can be slow, but count distinct is far worse. With limited working memory and no indexes, PostgreSQL has little room for optimization. In its default configuration, the DBMS puts a hard limit on the memory available to each concurrent query for sorting and hashing (work_mem). On the computer I use for development, this default was 4 megabytes.
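You can check this limit and raise it for a single session; the 128MB below is only an illustrative value, pick one that fits your hardware:

 -- inspect the per-operation memory limit
 SHOW work_mem;
 -- raise it for the current session only
 SET work_mem = '128MB';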


Let's measure the performance of working with a million rows using the factory work_mem setting.


 echo "SELECT count(DISTINCT n) FROM items;" | pgbench -d count -t 50 -P 1 -f - # average 742.855 ms # stddev 21.907 ms echo "SELECT count(DISTINCT s) FROM items;" | pgbench -d count -t 5 -P 1 -f - # average 31747.337 ms # stddev 267.183 ms 

Running EXPLAIN shows that most of the execution time is spent on aggregation. Also note that counting distinct values in a text column is much slower than in an integer one:





 -- the integer column, n
 Aggregate  (cost=20834.00..20834.01 rows=1 width=4) (actual time=860.620..860.620 rows=1 loops=1)
   Output: count(DISTINCT n)
   Buffers: shared hit=3904 read=4430, temp read=1467 written=1467
   ->  Seq Scan on public.items  (cost=0.00..18334.00 rows=1000000 width=4) (actual time=0.005..107.702 rows=1000000 loops=1)
         Output: n, s
         Buffers: shared hit=3904 read=4430

 -- the text column, s
 Aggregate  (cost=20834.00..20834.01 rows=1 width=33) (actual time=31172.340..31172.340 rows=1 loops=1)
   Output: count(DISTINCT s)
   Buffers: shared hit=3936 read=4398, temp read=5111 written=5111
   ->  Seq Scan on public.items  (cost=0.00..18334.00 rows=1000000 width=33) (actual time=0.005..142.276 rows=1000000 loops=1)
         Output: n, s
         Buffers: shared hit=3936 read=4398

What happens inside the "aggregate" node? Its description in the EXPLAIN output is opaque. We can get an idea by analyzing a similar query: replace count distinct with select distinct.





 EXPLAIN (ANALYZE, VERBOSE) SELECT DISTINCT n FROM items;

 Unique  (cost=131666.34..136666.34 rows=498824 width=4) (actual time=766.775..1229.040 rows=631846 loops=1)
   Output: n
   ->  Sort  (cost=131666.34..134166.34 rows=1000000 width=4) (actual time=766.774..1075.712 rows=1000000 loops=1)
         Output: n
         Sort Key: items.n
         Sort Method: external merge  Disk: 13632kB
         ->  Seq Scan on public.items  (cost=0.00..18334.00 rows=1000000 width=4) (actual time=0.006..178.153 rows=1000000 loops=1)
               Output: n

With insufficient work_mem and no external data structures (such as indexes), PostgreSQL merge-sorts the table between memory and disk and then walks through the result removing duplicates, much like the classic Unix combination sort | uniq.


Most of the query time is spent sorting, especially when we use the string column s rather than the integer column n. The duplicate removal (the Unique node) runs at roughly the same speed in both cases.


Specialized aggregation

To count unique values, Thomas Vondra created a specialized aggregate that works with fixed-length types of at most 64 bits. Even without extra working memory or indexes, it is faster than the default sort-based method. To install it, follow these steps:


  1. Clone the tvondra/count_distinct project.
  2. Switch to the stable branch: git checkout REL2_0_STABLE .
  3. Run make install .
  4. In your database run: CREATE EXTENSION count_distinct; .

Thomas explains how the aggregate works in his blog post. In short, it builds a sorted array of unique elements in memory, compacting it as it goes.


 echo "SELECT COUNT_DISTINCT(n) FROM items;" | pgbench -d count -t 50 -P 1 -f - # average 434.726 ms # stddev 19.955 ms 

This is faster than the standard count distinct, which takes 742 ms on average on our test data. Note that extensions written in C, such as count_distinct, are not constrained by work_mem, so the array built by the process may consume more memory per connection than you planned for.


HashAggregate

If all of the values being counted fit into work_mem, PostgreSQL will use a hash table to obtain the unique values:





 SET work_mem='1GB';
 EXPLAIN SELECT DISTINCT n FROM items;

 HashAggregate  (cost=20834.00..25822.24 rows=498824 width=4)
   Group Key: n
   ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=4)

This is the fastest method reviewed so far. It takes 372 ms on average for n and 23 seconds for s. The queries select distinct n and select count(distinct n) run in about the same time, provided the count distinct aggregation also uses HashAggregate.


Be careful: setting a high memory limit can have unpleasant consequences, since work_mem applies to every concurrent query. Besides, we can do better.


Index-Only Scan

This feature appeared in PostgreSQL 9.2. If an index contains all the data needed by a query, the system can use the index alone, without touching the table itself (the "heap"). The index type must support index-only scans (btree, for example). GiST and SP-GiST indexes support index-only scans only for some operator classes.


Create btree indexes on columns n and s:


 CREATE INDEX items_n_idx ON items USING btree (n);
 CREATE INDEX items_s_idx ON items USING btree (s);

A different strategy is now used to select the unique values of these columns:





 EXPLAIN SELECT DISTINCT n FROM items;

 Unique  (cost=0.42..28480.42 rows=491891 width=4)
   ->  Index Only Scan using items_n_idx on items  (cost=0.42..25980.42 rows=1000000 width=4)

But here we run into a strange problem: SELECT COUNT(DISTINCT n) FROM items will not use the index, even though SELECT DISTINCT n does so by default. Following the advice found on blogs ("The trick that will speed up your postgres 50x times!"), you can give the planner a hint by rewriting the count distinct as a count over a subquery:


 -- SELECT COUNT(DISTINCT n) FROM items;
 -- needs to be rewritten as
 EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT n FROM items) t;

 Aggregate  (cost=34629.06..34629.07 rows=1 width=0)
   ->  Unique  (cost=0.42..28480.42 rows=491891 width=4)
         ->  Index Only Scan using items_n_idx on items  (cost=0.42..25980.42 rows=1000000 width=4)

An in-order traversal of a B-tree is fast. This query takes 177 ms on average (270 ms for column s).


A remark: if work_mem is large enough to hold the entire table, PostgreSQL will choose HashAggregate even when an index exists. A paradox arises: giving the system more memory can lead to a worse query plan. You can force an index-only scan with SET enable_hashagg=false;, but do not forget to set it back to true afterwards, so as not to spoil the plans of other queries.
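A minimal sketch of that workaround, scoped to the current session:

 -- steer the planner away from HashAggregate, run the query, then restore the default
 SET enable_hashagg = false;
 EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT n FROM items) t;
 RESET enable_hashagg;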


Estimated count

HyperLogLog

The methods discussed so far rely on indexes, hash tables, sorted in-memory arrays, or the statistics tables of a centralized database. When there is really a lot of data, and/or it is spread across several nodes of a distributed database, these methods stop working well.


In this case, probabilistic data structures come to the rescue: they give fast approximate answers and parallelize well. Let's try one of them for count distinct. Consider a cardinality estimator called HyperLogLog (HLL). It uses a small amount of memory to represent a set of elements. Its union operation is lossless, which means arbitrary HLL values can be combined without degrading the accuracy of the estimate.


HLL relies on the properties of "good" hash functions, in particular the distances between hashed values. A function that distributes values evenly tends to spread them as far apart as possible. As new hashes are added, free space shrinks and the elements begin to crowd together. By analyzing the smallest distances between hashed values, the algorithm can estimate the most likely number of source elements.


Let's measure the speed. First install the extension for PostgreSQL.


  1. Clone the postgresql-hll project.
  2. Run make install .
  3. Create the hll extension in your database: CREATE EXTENSION hll; .

HLL performs fast data aggregation with sequential table scans:


 EXPLAIN SELECT #hll_add_agg(hll_hash_integer(n)) FROM items;

 Aggregate  (cost=23334.00..23334.01 rows=1 width=4)
   ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=4)

The average speed of count distinct with HLL was 239 ms for column n and 284 ms for s. That is slightly slower than an index-only scan on a million records. The real power of HLL comes from its union operation, which is associative, commutative, and lossless: partial results can be computed in parallel and then combined into the final answer.
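For instance, sketches can be stored and merged later. The example below is only a sketch; the rollup table daily_uniques is hypothetical, while the hll type and the hll_add_agg, hll_union_agg and hll_cardinality functions come from the postgresql-hll extension:

 -- store one HLL sketch per day (hypothetical rollup table)
 CREATE TABLE daily_uniques (
   day date,
   uniques hll
 );

 INSERT INTO daily_uniques
   SELECT current_date, hll_add_agg(hll_hash_integer(n)) FROM items;

 -- merge the sketches losslessly and extract the cardinality estimate
 SELECT hll_cardinality(hll_union_agg(uniques)) FROM daily_uniques;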


Parallel execution


Applications that collect analytics in real time, such as Google Analytics, use count heavily, and this operation parallelizes well. In this section we will measure the performance of several row-counting methods on a small Citus cluster deployed in Citus Cloud.


The idea is to spread the nodes of a distributed database across multiple machines. The nodes share the same schema, and each holds a part of the overall data set (a shard). Counting rows is then performed in parallel, that is, simultaneously on different machines.


Cluster Setup

For the test, we will make only a small cluster, since our goal is to evaluate comparative performance, and not to get the maximum speed.


In Citus Cloud, I made a cluster of eight machines, choosing the weakest possible configuration for each of them. If you want to reproduce this example, register yourself here .


After creating the cluster, I connect to the coordinating node to execute SQL queries. First create a table.


 CREATE TABLE items ( n integer, s text ); 

At the moment the table exists only in the coordinator's database. We need to shard it and place the pieces on the worker nodes. Citus assigns each row to a shard based on the value of a chosen distribution column. In the example below, we tell it to distribute future rows of the items table by hashing the values of column n to determine which shard they belong to.


 SELECT master_create_distributed_table('items', 'n', 'hash');
 SELECT master_create_worker_shards('items', 32, 1);

Through the coordinator node we load random data into the shards. (Citus also supports MX, a masterless mode used for fast data loading, but it is not of interest here.)


After obtaining the URL of the cluster coordinator database, execute the following code on a computer with a fast network connection. (All generated data will be transmitted over the network from this machine, so you need good speed.)


 cat << EOF > randgen.sql
 COPY (
   SELECT (random()*100000000)::integer AS n,
          md5(random()::text) AS s
   FROM generate_series(1,100000000)
 ) TO STDOUT;
 EOF

 psql $CITUS_URL -q -f randgen.sql | \
   psql $CITUS_URL -c "COPY items (n, s) FROM STDIN"

In the centralized-database example we used a million rows. This time, let's take a hundred million.


Accurate count

With duplicates

An ordinary count (with duplicates) causes no problems. The coordinator runs the query on all nodes and then sums the results. The EXPLAIN output shows the plan chosen on one of the worker nodes ("Distributed Query") and the plan chosen on the coordinator ("Master Query").


 EXPLAIN VERBOSE SELECT count(*) FROM items;

 Distributed Query into pg_merge_job_0003
   Executor: Real-Time
   Task Count: 32
   Tasks Shown: One of 32
   ->  Task
         Node: host=*** port=5432 dbname=citus
         ->  Aggregate  (cost=65159.34..65159.35 rows=1 width=0)
               Output: count(*)
               ->  Seq Scan on public.items_102009 items  (cost=0.00..57340.27 rows=3127627 width=0)
                     Output: n, s
 Master Query
   ->  Aggregate  (cost=0.00..0.02 rows=1 width=0)
         Output: (sum(intermediate_column_3_0))::bigint
         ->  Seq Scan on pg_temp_2.pg_merge_job_0003  (cost=0.00..0.00 rows=0 width=0)
               Output: intermediate_column_3_0

For reference: on our cluster, this query takes 1.2 seconds. Distinct count is a more serious problem when working with a distributed database.


Distinct (no duplicates)

The difficulty in counting the unique values of a column in a distributed database is that duplicates have to be looked for across different nodes. However, this is not a problem when counting the distribution column itself: rows with the same value in that column end up in the same shard, so cross-shard duplication cannot occur.


Citus knows that to count the unique values of the distribution column it is enough to run count distinct on each node and add up the results. Our cluster performs this task in 3.4 seconds.
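No special syntax is needed; the ordinary query is sent through the coordinator:

 -- distinct count on the distribution column: pushed down to every shard,
 -- then the per-shard results are summed by the coordinator
 SELECT count(DISTINCT n) FROM items;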


Counting the unique values of an ordinary (non-distribution) column is harder. Logically, there are two possibilities:


  1. Copy all the rows to the coordinator node and count them there.
  2. Repartition the rows among the nodes so that equal values cannot end up on different nodes, then run a local count distinct on every node and sum the results.

The first option clearly does not scale: it means shipping the entire data set over the network to a single node. Let's look at the second.


The second approach is called repartitioning. The idea is to temporarily re-shard the data by the column in question, so that rows with equal values land on the same node; each node then counts its distinct values locally, and the coordinator sums them up. This works, but shuffling large amounts of data between nodes is expensive. Citus can execute such queries, but there is a faster option when an approximate answer is acceptable.



That option is HLL, which we met above. It handles a non-distribution column quickly: each worker builds an HLL sketch of its shard and sends it to the coordinator. The sketches are tiny, so the network overhead is negligible, and the coordinator merges them losslessly and extracts the cardinality estimate.


Citus integrates with postgresql-hll out of the box. If the extension is installed and the citus.count_distinct_error_rate setting is non-zero, Citus executes count distinct through HLL automatically. For example:


 SET citus.count_distinct_error_rate = 0.005;
 EXPLAIN VERBOSE SELECT count(DISTINCT n) FROM items;

 Distributed Query into pg_merge_job_0090
   Executor: Real-Time
   Task Count: 32
   Tasks Shown: One of 32
   ->  Task
         Node: host=*** port=5432 dbname=citus
         ->  Aggregate  (cost=72978.41..72978.42 rows=1 width=4)
               Output: hll_add_agg(hll_hash_integer(n, 0), 15)
               ->  Seq Scan on public.items_102009 items  (cost=0.00..57340.27 rows=3127627 width=4)
                     Output: n, s
 Master Query
   ->  Aggregate  (cost=0.00..0.02 rows=1 width=0)
         Output: (hll_cardinality(hll_union_agg(intermediate_column_90_0)))::bigint
         ->  Seq Scan on pg_temp_2.pg_merge_job_0090  (cost=0.00..0.00 rows=0 width=0)
               Output: intermediate_column_90_0

The result: 3.2 seconds for column n and 3.8 seconds for s. For a hundred million rows in a non-distribution column, that is a remarkable result. HLL is a great fit for distributed count distinct.


Results


Method            Time, ms (1M rows)    Exact   Filter   Unique
PG Stats          0.3                   -       -        -
EXPLAIN           0.3                   -       +        -
Manual caching    2                     +       -        -
count(*)          85                    +       +        -
count(1)          99                    +       +        -
Index Only Scan   177                   +       +        +
HLL               239                   -       +        +
HashAgg           372                   +       +        +
Custom Agg        435 (64-bit types)    +       +        +
Mergesort         742                   +       +        +

To sum up: for exact counts the index-only scan is the winner, while on really large data sets (over 100 million rows) spread across nodes, HyperLogLog (HLL) takes over. With a suitable method, a distinct count does not have to be a painfully slow operation.



Source: https://habr.com/ru/post/316304/

