Graphs, reports and analytics - all of this is present in one form or another in the back office of any company, even a very small one. When conventional spreadsheets in Excel / Numbers / LibreOffice become too cramped but the data is still not really big, traditional solutions for internal company needs are often built on relational databases such as PostgreSQL, MySQL or MariaDB.
These databases are free and, thanks to SQL, integrate conveniently with the other components of the system; they are popular, and most developers and analysts know how to work with them. They can digest enough load (both traffic and data volume) to calmly hold out until the company can afford more sophisticated (and expensive) solutions for analytics and reporting.
Starting position
However, even in a thoroughly studied technology there are always nuances that can suddenly give engineers something to worry about. Besides reliability, the most frequently mentioned database problem is performance. Obviously, as the amount of data grows, the response speed of the database drops, but if this happens predictably and in proportion to the growth of the load, it is not so bad: you can always see in advance when the database will start demanding attention and plan an upgrade or a move to a fundamentally different database. It is much worse when database performance degrades unpredictably.
The topic of improving database performance is as old as the world and very broad, and in this article I would like to focus on just one direction: evaluating the efficiency of query plans in the PostgreSQL database and how that efficiency changes over time, in order to make the behavior of the database planner more predictable.
Although much of what we discuss applies to all recent versions of this database, the examples below assume version 11.2, the latest at the time of writing.
Before we dive into the details, it makes sense to digress and say a few words about where performance problems in relational databases come from in the first place. What exactly is the database doing when it “slows down”? A shortage of memory (causing a large number of disk or network accesses) or a weak processor are obvious problems with obvious solutions, but what else can affect the speed of query execution?
Refreshing our memory
In order to answer an SQL query, the database needs to build a query plan (which tables and columns to look at, which indexes will be needed, what to fetch from them, what to compare with, how much memory is required, and so on). This plan is formed as a tree whose nodes are drawn from a small set of typical operations of varying computational complexity. Here are a few of them, for example (N is the number of rows the operation has to process):
Operation | What is being done | Cost |
---|---|---|
Data fetching operations (SELECT ... WHERE ...) | | |
Seq Scan | Load each row of the table and check it against the condition. | O(N) |
Index Scan (b-tree index) | Look up the entries matching the condition in the index and take the data from there. | O(log(N)): finding an item in a sorted tree. |
Index Scan (hash index) | Look up the entries matching the condition in the index and take the data from there. | O(1): finding an item in a hash table, not counting the cost of computing the hashes. |
Bitmap Heap Scan | Select the numbers of the required rows by index, then load only those rows and run additional checks on them. | Index Scan + Seq Scan(M), where M is the number of rows found by the Index Scan. It is assumed that M << N, i.e. the index is more useful than a Seq Scan. |
Join operations (JOIN, SELECT from multiple tables) | | |
Nested Loop | For each row of the left table, look for a matching row in the right table. | O(N²). But if one of the tables is much smaller than the other (a dictionary table) and practically does not grow over time, the actual cost can drop to O(N). |
Hash Join | Compute a hash for each row of the left and right tables, thereby reducing the number of join combinations to iterate over. | O(N), but with a very inefficient hash function or a large number of identical join keys it can also become O(N²). |
Merge Join | Sort the left and right tables by the join condition, then merge the two sorted lists. | O(N*log(N)): the cost of sorting plus a pass through the lists. |
Aggregation operations (GROUP BY, DISTINCT) | | |
Group Aggregate | Sort the table by the aggregation condition, then group adjacent rows in the sorted list. | O(N*log(N)) |
Hash Aggregate | Compute a hash of the aggregation key for each row; rows with the same hash are aggregated together. | O(N) |
As you can see, the cost of a query depends very much on how the data is laid out in the tables and how well that layout matches the operations used. Nested Loop, despite its O(N²) cost, may be more advantageous than Hash Join or Merge Join when one of the joined tables degenerates to one or a few rows.
In addition to CPU resources, the cost includes memory use. Both are limited resources, so the query planner has to find a compromise. If it is mathematically cheaper to join two tables with a Hash Join but there is simply not enough memory for such a large hash table, the database may be forced to use a Merge Join instead. And the “slow” Nested Loop requires no additional memory at all and is ready to produce its first results immediately after it starts.
The relative cost of these operations is easier to see on a chart. The numbers there are not absolute, just an approximate ratio between the different operations.

The Nested Loop curve “starts” lower because it needs no preliminary computation, memory allocation or copying of intermediate data, but its cost is O(N²). Merge Join and Hash Join have a higher initial cost, but beyond certain values of N they start to beat Nested Loop. The planner tries to choose the plan with the lowest cost, and on the chart above it switches between operations as N grows (the green dotted arrow): up to N1 it is more profitable to use Nested Loop, from N1 to N2 Merge Join, and after N2 Hash Join becomes cheaper. However, Hash Join needs memory for its hash tables, and when N3 is reached that memory runs out, forcing the use of Merge Join.
When choosing a plan, the planner estimates the cost of each operation in it using a set of relative costs of certain “atomic” database operations: a comparison, processing a row, loading a page into memory, and so on. Here is a list of some of these parameters from the default configuration; there are not that many of them:
Relative cost constant | Default value |
---|---|
seq_page_cost | 1.0 |
random_page_cost | 4.0 |
cpu_tuple_cost | 0.01 |
cpu_index_tuple_cost | 0.005 |
cpu_operator_cost | 0.0025 |
parallel_tuple_cost | 0.1 |
parallel_setup_cost | 1000.0 |
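These constants can also be inspected on a running server without opening the config file; one quick way, for example, is to query the pg_settings system view:

SELECT name, setting FROM pg_settings WHERE name LIKE '%_cost' ORDER BY name;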
True, these constants alone are not enough: you also need to know “N”, that is, how many rows from the previous results each such operation will have to process. The upper bound is obvious here: the database “knows” how much data there is in any table and can always assume the maximum. For example, if you have two tables of 100 rows each, joining them can produce anywhere from 0 to 10,000 rows, so the next operation may receive up to 10,000 rows as input.
But if you know at least a little about the nature of the data in the tables, this number of rows can be predicted much more accurately. For example, for the two 100-row tables from the example above, if you know in advance that the join will produce not 10,000 rows but roughly the same 100, the estimated cost of the next operation drops sharply. In that case this plan may turn out to be more efficient than the others.
Out-of-the-box optimization
In order for the planner to predict the size of intermediate results more accurately, PostgreSQL collects statistics on tables, accumulated in pg_statistic or in its more readable form, pg_stats. These statistics are updated automatically when vacuum runs, or explicitly with the ANALYZE command. They contain a variety of information about what kind of data the tables hold: in particular, histograms of values, the fraction of empty (NULL) fields, and more. The planner uses all of this to predict the amount of data for each operation in the plan tree more precisely, and thus to calculate the cost of each operation and of the plan as a whole more accurately.
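For example, one way to see what the planner knows about a particular table is to refresh and read these statistics directly via the pg_stats view (t1 here is the same example table used in the query below):

ANALYZE t1;
SELECT attname, null_frac, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 't1';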
Take, for example, the query:
SELECT t1.important_value FROM t1 WHERE t1.a > 100
Suppose that a histogram of the values in the column “t1.a” showed that values greater than 100 occur in approximately 1% of the rows of the table. Then we can predict that such a selection will return about one hundredth of all rows from the table “t1”.
The database lets you look at the predicted cost of a plan with the EXPLAIN command, and at its actual execution time with EXPLAIN ANALYZE.
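For the query above, that comparison looks like this (a sketch; the actual output depends on your data and indexes):

EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;
EXPLAIN ANALYZE SELECT t1.important_value FROM t1 WHERE t1.a > 100;

EXPLAIN prints the estimated row counts and costs for each node of the plan; EXPLAIN ANALYZE additionally executes the query and shows the actual row counts and timings next to the estimates.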
It might seem that with automatic statistics everything should now be fine, but there can still be difficulties. There is a good article about this from Citus Data, with an example of automatic statistics being ineffective and of collecting additional statistics with CREATE STATISTICS (available since PostgreSQL 10.0).
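A minimal sketch of that mechanism, assuming hypothetical correlated columns a and b in the table t1:

CREATE STATISTICS t1_a_b_stats (dependencies) ON a, b FROM t1;
ANALYZE t1;  -- the extended statistics are filled in on the next ANALYZE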
So, for the planner there are two sources of error in cost estimation:
- The relative costs of primitive operations (seq_page_cost, cpu_operator_cost, and so on) may not match reality. By default, processing one row costs 0.01, loading a sequential page 1, and loading a random page 4, and it is far from certain that 100 row operations really cost the same as loading one page.
- Errors in predicting the number of rows in intermediate operations. In that case the actual cost of an operation can differ greatly from the forecast.
For complex queries, building and evaluating all possible plans can itself take a lot of time. What is the use of getting the data back in 1 second if the database spent a minute just planning the query? For this situation PostgreSQL has the genetic query optimizer (GEQO): a planner that does not build every possible plan, but starts from a few random ones and improves on them, looking for ways to reduce the cost. None of this improves the accuracy of the forecast, but it does speed up finding at least some more or less optimal plan.
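GEQO is controlled by a handful of settings; in particular, it only kicks in for queries that join many tables:

SHOW geqo;            -- 'on' by default
SHOW geqo_threshold;  -- default 12: queries with at least this many FROM items use the genetic optimizer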
Competing plans appear suddenly
If everything goes well, your query runs as fast as possible. As the amount of data grows, query execution time gradually increases, and by watching it you can roughly predict when you will need to add memory or CPU cores, expand the cluster, and so on.
But we have to keep in mind that the optimal plan has competitors with similar estimated costs that we simply do not see, and if the database suddenly switches the query to another plan, it comes as a surprise. It is fine if the database "jumps" to a more efficient plan. And if not? Look, for example, at the picture below. It shows the predicted cost and the real execution time of two plans (red and green):

Here one plan is shown in green and its closest competitor in red. The dotted lines show the predicted cost, the solid lines the real execution time. The gray dotted arrow marks the planner's choice.
Suppose that one Friday evening the predicted number of rows in some intermediate operation reaches N1 and the “red” forecast starts to beat the “green” one. The planner switches to the red plan. The actual query time immediately jumps up (moving from the green solid line to the red one), that is, the database's degradation graph takes the form of a step (or maybe a “wall”). In practice, such a “wall” can increase query execution time by an order of magnitude or more.
It is worth noting that this situation is probably more typical of back office and analytics than of the frontend, since the latter is usually tuned for many simultaneous requests and therefore uses simpler queries, for which the error in plan prediction is smaller. If the database is used for building reports or analytics, queries can be arbitrarily complex.
How to live with it?
The question arises: could such hidden, invisible plans have been foreseen somehow? After all, the problem is not even that they are suboptimal, but that the switch to another plan can happen unpredictably and, as luck would have it, at the most unfortunate moment.
Unfortunately, they cannot be seen directly, but you can look for alternative plans by changing the very weights by which plans are chosen. The idea of this approach is to remove from view the current plan, which the planner considers optimal, so that one of its closest competitors becomes optimal instead and can thus be seen with the EXPLAIN command. By periodically checking how the costs of such "competitors" and of the main plan change, we can estimate the likelihood that the database will soon "jump" to another plan.
In addition to collecting cost forecasts for alternative plans, you can actually run them and measure their execution time, which also gives an idea of the internal "well-being" of the database.
Let's see what tools we have for such experiments.
First, you can directly "forbid" specific operations using session variables. Conveniently, there is no need to change them in the config file and reload the database: their values change only in the current open session and do not affect other sessions, so you can experiment directly on real data. Here is the list of these variables with their default values; almost every operation is covered:
Planner operation toggle | Default value |
---|---|
enable_bitmapscan, enable_hashagg, enable_hashjoin, enable_indexscan, enable_indexonlyscan, enable_material, enable_mergejoin, enable_nestloop, enable_parallel_append, enable_seqscan, enable_sort, enable_tidscan, enable_parallel_hash, enable_partition_pruning | on |
enable_partitionwise_join, enable_partitionwise_aggregate | off |
By forbidding or allowing individual operations, we force the planner to choose other plans, which we can then inspect with the same EXPLAIN command. In fact, “forbidding” an operation does not prevent its use; it simply increases its cost enormously: in PostgreSQL every “forbidden” operation automatically gets a cost of 10 billion conventional units. In EXPLAIN the total cost of such a plan may look prohibitively high, but against the background of those tens of billions the weight of the remaining operations is clearly visible, since it usually fits into much smaller orders of magnitude.
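A minimal sketch of this workflow, reusing the earlier example query (substitute your own slow query):

SET enable_nestloop = 'off';  -- the "forbidden" operation now costs 10 billion units
EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;
RESET enable_nestloop;        -- restore the default for this session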
Two operations are of particular interest here:
- Hash Join. Its complexity is O(N), but if the forecast of the result size is wrong, the hash table may not fit in memory and the database will have to fall back to a Merge Join, with a cost of O(N*log(N)).
- Nested Loop. Its complexity is O(N²), so an error in the size prediction affects the speed of such a join quadratically.
For example, take some real numbers from queries that we were optimizing in our company.
Plan 1. With all permitted operations, the total cost of the most optimal plan was 274962.09 units.
Plan 2. With Nested Loop “forbidden”, the cost grew to 40000534153.85. The 40 billion that makes up the bulk of this cost is the Nested Loop operation, used 4 times despite the ban. The remaining 534153.85 is the cost forecast for all the other operations in the plan. As we can see, it is about 2 times higher than the cost of the optimal plan, that is, quite close to it.
Plan 3. With Hash Join “banned”, the cost was 383253.77. The plan was indeed built without the Hash Join operation, since we do not see any billions in it. Its cost, however, is about 40% higher than that of the optimal plan, which is also quite close.
In reality, the query execution times were as follows:
Plan 1 (all operations are allowed) was completed in ~ 9 minutes.
Plan 2 (with the “forbidden” nested loop) was completed in 1.5 seconds.
Plan 3 (with the “forbidden” hash join) was completed in ~ 5 minutes.
The reason, as you can see, is the erroneous prediction of the cost of the Nested Loop. Indeed, comparing EXPLAIN with EXPLAIN ANALYZE reveals an error in estimating that ill-fated N in an intermediate operation: instead of the predicted single row, the Nested Loop ran into several thousand rows, which increased the query execution time by a couple of orders of magnitude.
The savings with Hash Join “forbidden” come from replacing hashing with sorting and a Merge Join, which in this case worked faster than the Hash Join. Note also that plan 3 is in reality almost two times faster than the “optimal” plan 1, although it was predicted to be slower.
In practice, if your query suddenly (after a DB upgrade or just on its own) starts running much longer than before, try forbidding Hash Join or Nested Loop first and see how this affects query speed. If you are lucky, you will at least manage to ban the new suboptimal plan and return to the previous, fast one.
To do this, you do not need to change the PostgreSQL configuration files and restart the database; it is enough to change the value of the desired variable in any console for the duration of the open DB session. Other sessions will not be affected; the configuration changes only for your current session. For example:
SET enable_hashjoin = 'on';
SET enable_nestloop = 'off';
SELECT … FROM … ;
The second way to influence the choice of plan is to change the actual weights of the low-level operations. There is no universal recipe here, but, for example, if you have a database with a “warmed-up” cache and the data fits entirely in memory, it is likely that loading a sequential page costs no more than loading a random one, whereas in the default configuration a random page is 4 times more expensive than a sequential one.
Or, another example: the notional cost of starting parallel processing is 1000 by default, while the cost of loading a page is 1.0. It makes sense to start by changing only one parameter at a time, in order to determine whether it influences the choice of plan. The easiest way is to begin by setting the parameter to 0 or to some very high value (say, 1 million).
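For example, a sketch of such an experiment in one session (the exact values here are assumptions to be checked against your own workload):

SET random_page_cost = 1.0;   -- data fully cached: random reads no more expensive than sequential ones
EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;
RESET random_page_cost;
SET parallel_setup_cost = 0;  -- make launching parallel workers look "free"
EXPLAIN SELECT t1.important_value FROM t1 WHERE t1.a > 100;
RESET parallel_setup_cost;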
However, remember that by improving performance for one query you can degrade it for another. In general, this is a wide field for experiments; it is better to change the parameters one at a time.
Alternative treatment options
A story about the planner would be incomplete without mentioning at least two PostgreSQL extensions.
The first is SR_PLAN, which saves a computed plan and forces it to be used later. This helps make the DB's behavior more predictable in terms of plan choice.
The second is Adaptive Query Optimizer, which implements feedback to the planner from real query execution: the planner measures the actual results of executed queries and adjusts its future plans accordingly. The database thus “self-tunes” to specific data and queries.
What else does the database do when it “slows down”?
Now that we have more or less sorted out query planning, let's see what else can be improved, both in the database itself and in the applications that use it, to squeeze the maximum performance out of it.
Suppose the query plan is already optimal. If we exclude the most obvious problems (too little memory, a slow disk or network), there are still the costs of computing hashes. There are probably great opportunities here for future improvements to PostgreSQL (using the GPU or even SSE2/SSE3/AVX CPU instructions), but for now there is nothing of the kind and hash computation makes almost no use of hardware features. This is where we can help the database a little.
If you have noticed, indexes in PostgreSQL are created as b-tree by default. Their advantage is versatility: such an index can be used both with equality conditions and with comparison conditions (greater or less than). Looking up an item in such an index has logarithmic cost. But if your query contains only equality conditions, indexes can also be created as hash indexes, where the lookup cost is constant.
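A minimal sketch, assuming a hypothetical column b of t1 that is only ever filtered by equality:

CREATE INDEX t1_b_hash_idx ON t1 USING hash (b);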
Next, you can try to modify the query so that it can use parallel execution. To understand exactly how to rewrite it, it is best to get familiar with the list of cases where parallelism is automatically disabled by the planner and avoid such situations.
The manual on this topic briefly describes all situations, so there is no point in repeating them here.
What if the query still cannot really be parallelized? It is sad to watch how, in your powerful multi-core database where you are the only client, one core is busy at 100% while all the other cores just look on. In this case you have to help the database from the application side. Since each session gets its own core, you can open several of them and split the big query into parts, making shorter and faster selections and combining them into a common result already in the application. This lets you use the maximum available CPU resources of the PostgreSQL database.
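As a sketch of the idea, with a hypothetical reporting table and an id range split chosen by the application:

-- Session 1:
SELECT count(*) FROM report_data WHERE id BETWEEN 1 AND 1000000;
-- Session 2, running at the same time:
SELECT count(*) FROM report_data WHERE id BETWEEN 1000001 AND 2000000;
-- The application then combines the partial results (here, it sums the two counts).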
In conclusion, I would like to note that the diagnostic and optimization capabilities listed here are just the tip of the iceberg, but they are quite simple to use and can help you quickly localize a problem directly on live data, without the risk of breaking the config or disrupting other applications.
May your queries be successful, with accurate and short plans.