
Explaining the inexplicable. Part 5

We continue preparing for PG Day'16 by introducing you to interesting features of PostgreSQL.

In previous posts in this series, I talked about how to read EXPLAIN output and what each line (operation / node) means.

In the final post I will try to explain why Postgres chooses “Operation X” rather than “Operation Y”.


You may have heard that the PostgreSQL planner chooses operations based on statistics. Which statistics?

Let's imagine the simplest scenario possible:

SELECT * FROM table WHERE column = some_value; 

If all the rows in the table contain the same value, some_value, then using a (potentially existing) index on that column makes no sense.

On the other hand, if the values in the column are unique (or nearly unique), using an index is a great idea.

Let's see what happens:

 create table test ( all_the_same int4, almost_unique int4 );
 CREATE TABLE

 insert into test ( all_the_same, almost_unique )
     select 123, random() * 1000000 from generate_series(1,100000);
 INSERT 0 100000

So, I have a table with 100,000 rows, in which the column all_the_same always contains the same value (123), and the column almost_unique is, as its name implies, almost unique:

 select count(*), count(distinct almost_unique) from test;
  count  | count
 --------+-------
  100000 | 95142
 (1 row)

Now, to give both columns equal chances, I'll create two simple indexes:

 create index i1 on test (all_the_same);
 CREATE INDEX
 create index i2 on test (almost_unique);
 CREATE INDEX

OK, the test setup is ready. What about the plans?

 explain select * from test where all_the_same = 123;
                          QUERY PLAN
 ------------------------------------------------------------
  Seq Scan on test  (cost=0.00..1693.00 rows=100000 width=8)
    Filter: (all_the_same = 123)
 (2 rows)

 explain select * from test where almost_unique = 123;
                           QUERY PLAN
 ---------------------------------------------------------------
  Index Scan using i2 on test  (cost=0.29..8.31 rows=1 width=8)
    Index Cond: (almost_unique = 123)
 (2 rows)

As you can see, Postgres made a wise choice. But the estimated "rows=" value is the interesting part. How does it know how many rows a query will return?
The answer lies in the ANALYZE (or VACUUM ANALYZE) command.

When you run ANALYZE on a table, Postgres takes a kind of "random sample" (I'll say more about this in a moment) and gathers some statistics from it. What are these statistics, where do they live, and can we see them? Of course we can:

 select * from pg_statistic where starelid = 'test'::regclass;
 -[ RECORD 1 ]-----------------------------------------------------------------------------
 starelid    | 16882
 staattnum   | 1
 stainherit  | f
 stanullfrac | 0
 stawidth    | 4
 stadistinct | 1
 stakind1    | 1
 stakind2    | 3
 stakind3    | 0
 stakind4    | 0
 stakind5    | 0
 staop1      | 96
 staop2      | 97
 staop3      | 0
 staop4      | 0
 staop5      | 0
 stanumbers1 | {1}
 stanumbers2 | {1}
 stanumbers3 | [null]
 stanumbers4 | [null]
 stanumbers5 | [null]
 stavalues1  | {123}
 stavalues2  | [null]
 stavalues3  | [null]
 stavalues4  | [null]
 stavalues5  | [null]
 -[ RECORD 2 ]-----------------------------------------------------------------------------
 starelid    | 16882
 staattnum   | 2
 stainherit  | f
 stanullfrac | 0
 stawidth    | 4
 stadistinct | -0.92146
 stakind1    | 1
 stakind2    | 2
 stakind3    | 3
 stakind4    | 0
 stakind5    | 0
 staop1      | 96
 staop2      | 97
 staop3      | 97
 staop4      | 0
 staop5      | 0
 stanumbers1 | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
 stanumbers2 | [null]
 stanumbers3 | {-0.000468686}
 stanumbers4 | [null]
 stanumbers5 | [null]
 stavalues1  | {21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,84764,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,162437,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}
 stavalues2  | {2,10560,20266,31061,40804,50080,59234,69240,79094,89371,99470,109557,119578,130454,140809,152052,162656,173855,183914,194263,204593,214876,224596,233758,243246,253552,264145,273855,283780,294475,303972,314544,324929,335008,346169,356505,367395,376639,387302,397004,407093,416615,426646,436146,445701,455588,466463,475910,485228,495434,505425,515853,525374,534824,545387,554794,563591,573721,584021,593368,602935,613238,623317,633947,643431,653397,664177,673976,684042,694791,703922,714113,724602,735848,745596,754477,764171,772535,781924,791652,801703,812487,822196,831618,841665,850722,861532,872067,881570,891654,901595,910975,921698,931785,940716,950623,960551,970261,979855,989540,999993}
 stavalues3  | [null]
 stavalues4  | [null]
 stavalues5  | [null]

This table (pg_statistic) is, of course, described in the documentation, but it is still rather cryptic. You can, of course, find a very precise explanation in the source code, but that is (usually) not the most practical approach.

Fortunately, there is a view over this table that presents the same data in a more "readable" form:

 select * from pg_stats where tablename = 'test';
 -[ RECORD 1 ]----------+------------------------------------------------------------------
 schemaname             | public
 tablename              | test
 attname                | all_the_same
 inherited              | f
 null_frac              | 0
 avg_width              | 4
 n_distinct             | 1
 most_common_vals       | {123}
 most_common_freqs      | {1}
 histogram_bounds       | [null]
 correlation            | 1
 most_common_elems      | [null]
 most_common_elem_freqs | [null]
 elem_count_histogram   | [null]
 -[ RECORD 2 ]----------+------------------------------------------------------------------
 schemaname             | public
 tablename              | test
 attname                | almost_unique
 inherited              | f
 null_frac              | 0
 avg_width              | 4
 n_distinct             | -0.92146
 most_common_vals       | {21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,84764,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,162437,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}
 most_common_freqs      | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
 histogram_bounds       | {2,10560,20266,31061,40804,50080,59234,69240,79094,89371,99470,109557,119578,130454,140809,152052,162656,173855,183914,194263,204593,214876,224596,233758,243246,253552,264145,273855,283780,294475,303972,314544,324929,335008,346169,356505,367395,376639,387302,397004,407093,416615,426646,436146,445701,455588,466463,475910,485228,495434,505425,515853,525374,534824,545387,554794,563591,573721,584021,593368,602935,613238,623317,633947,643431,653397,664177,673976,684042,694791,703922,714113,724602,735848,745596,754477,764171,772535,781924,791652,801703,812487,822196,831618,841665,850722,861532,872067,881570,891654,901595,910975,921698,931785,940716,950623,960551,970261,979855,989540,999993}
 correlation            | -0.000468686
 most_common_elems      | [null]
 most_common_elem_freqs | [null]
 elem_count_histogram   | [null]

Fine. So what can we learn from all this?

The columns schemaname, tablename and attname seem obvious. inherited simply reports whether the statistics also include values from tables that inherit from this one.

So, if I created a table:

 create table z () inherits (test); 

and then added some data to this table z, the statistics for the test table would include entries with inherited = true.
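
A minimal sketch of how this looks (reusing the z table from above; the inserted values are arbitrary): after adding rows to z and re-analyzing, pg_stats carries two sets of statistics per column of test, one for test alone (inherited = f) and one covering test together with its children (inherited = t).

 insert into z (all_the_same, almost_unique)
     select 123, random() * 1000000 from generate_series(1,1000);
 analyze test;
 select attname, inherited, n_distinct from pg_stats where tablename = 'test';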

The remaining columns mean the following:

- null_frac - the fraction of the column's entries that are null;
- avg_width - the average width of the column's values, in bytes;
- n_distinct - the number of distinct values in the column; if negative, its absolute value is the fraction of distinct values relative to the total row count (our -0.92146 predicts roughly 92,146 distinct values per 100,000 rows, close to the 95,142 we counted earlier);
- most_common_vals / most_common_freqs - the most common values and the frequency of each of them;
- histogram_bounds - values that divide the remaining rows into buckets of (roughly) equal population;
- correlation - the correlation between the physical order of rows on disk and the logical order of the column's values (1 means perfectly ordered, values near 0 mean essentially random placement; compare all_the_same and almost_unique above);
- most_common_elems, most_common_elem_freqs and elem_count_histogram - analogous statistics for the elements of array columns.

Based on this data, PostgreSQL can estimate how many rows any given part of a query will return, and use that to decide what will be better: a seq scan, an index scan, or a bitmap index scan. And, when joining tables, which operation should be faster: a Hash Join, a Merge Join, or perhaps a Nested Loop.
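
You can watch this estimation happen. A minimal sketch (21606 is simply the first entry in most_common_vals above; with random data your value will differ): for a value whose most_common_freqs entry is 0.0001, the planner should estimate about 0.0001 * 100000 = 10 rows.

 -- 21606 is listed in most_common_vals with frequency 0.0001,
 -- so the planner's estimate should be about 0.0001 * 100000 = 10 rows.
 explain select * from test where almost_unique = 21606;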

If you have studied the data above carefully, you might wonder: this is quite a lot of output, and the most_common_vals / most_common_freqs / histogram_bounds arrays hold a lot of values. Why so many?

The reason is simple: it's all in the settings. In postgresql.conf there is a variable default_statistics_target, which tells Postgres how many values to keep in these arrays. In my case (the default) it is 100. But you can easily change it, either in postgresql.conf or even per column, like this:

 alter table test alter column almost_unique set statistics 5; 

After applying the ALTER (and running ANALYZE), the data in pg_stats is much shorter:

 select * from pg_stats where tablename = 'test' and not inherited and attname = 'almost_unique';
 -[ RECORD 1 ]----------+---------------------------------------------------------
 schemaname             | public
 tablename              | test
 attname                | almost_unique
 inherited              | f
 null_frac              | 0
 avg_width              | 4
 n_distinct             | -0.92112
 most_common_vals       | {114832,3185,3774,6642,11984}
 most_common_freqs      | {0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
 histogram_bounds       | {2,199470,401018,596414,798994,999964}
 correlation            | 1
 most_common_elems      | [null]
 most_common_elem_freqs | [null]
 elem_count_histogram   | [null]
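
A side note, in case you later need to find which columns carry a per-column override: it is stored in the attstattarget column of pg_attribute (a sketch; -1 there means the column falls back to default_statistics_target):

 select attname, attstattarget
   from pg_attribute
  where attrelid = 'test'::regclass and attnum > 0;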

Changing the statistics target also has another effect.

Let me show you. First, I’ll roll back the changes in statistics that I made with ALTER TABLE:

 alter table test alter column almost_unique set statistics -1; 

Now do the following:

 $ analyze verbose test;
 INFO:  analyzing "public.test"
 INFO:  "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total rows
 ANALYZE

 $ alter table test alter column almost_unique set statistics 10;
 ALTER TABLE
 $ alter table test alter column all_the_same set statistics 10;
 ALTER TABLE

 $ analyze verbose test;
 INFO:  analyzing "public.test"
 INFO:  "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 3000 rows in sample, 100000 estimated total rows
 ANALYZE

Notice that the second ANALYZE sampled only 3,000 rows, not 30,000 like the first.

This is the "random pattern."

Analyzing every row would be prohibitively expensive for any medium or large table.

So Postgres does something smarter.

First, it reads a random subset of the table's pages (reminder: each page is 8 kB of data). How many? 300 * statistics target.

This means that in my case, with default_statistics_target = 100, it will read 30,000 pages (my table doesn't have that many, so Postgres reads them all).

From these pages, ANALYZE gathers information about live and dead rows only. It then takes a random sample of rows (again 300 * statistics target) and computes the column statistics from that sample.

In my case the table had 100,000 rows, and with default_statistics_target = 100 only about a third of them were sampled. With the statistics target lowered to 10, the number of sampled rows became even smaller: just 3,000.
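
You can cross-check the numbers ANALYZE printed against what it stored in pg_class; the relpages and reltuples below match the 443 pages and 100,000 rows reported in the INFO lines above:

 select relpages, reltuples from pg_class where relname = 'test';
  relpages | reltuples
 ----------+-----------
       443 |    100000
 (1 row)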

You might say: OK, but then the statistics are inaccurate. It may happen that some very common value doesn't appear in any of the sampled rows. Of course you're right, that is possible. But it is not very likely: you are taking a random chunk of the data, and the chance of drawing x% of a table without hitting a single row containing a value that is present throughout the rest of the table is negligible.

It also means that, once in a while, a run of ANALYZE can "break" your queries. For example, the sample lands on different pages and some values get missed (or, the other way around, not-so-common values end up in most_common_vals simply because Postgres happened to pick the pages/rows where they occur). And based on such statistics, Pg will generate suboptimal plans.

If you run into such a situation, the fix is simple: increase the statistics target. ANALYZE will then do more work and scan more rows, so the chance of it happening again becomes even smaller.
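
In practice the fix is a one-off pair of statements (a sketch; 500 is an arbitrary value, not a recommendation):

 alter table test alter column almost_unique set statistics 500;
 analyze test;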

But setting large statistics target values has drawbacks. First, ANALYZE has to do more work; that is an operational matter, though, so it (usually) doesn't worry us much. The main problem is that the more data there is in pg_statistic, the more data the Pg planner has to consider for every query. So, tempting as it may be to set default_statistics_target to its maximum of 10,000, in practice I have never seen a database with a value that high.

The current default of 100 has been in place since version 8.4. In earlier versions the default was 10, and on IRC the advice to increase it came up constantly. With 100, things are more or less fine out of the box.

The last thing I should tell you about, although I'd rather not, is the settings that force the Postgres planner to use (or avoid) particular operations.

First, let me explain why I'd rather not talk about them: I know for a fact that they are easy to abuse. So remember: these settings are for diagnosing problems, not for fixing them. An application that uses them in production is, at the very least, suspect of being broken. And yes, I know that sometimes you have to. But that "sometimes" is extremely rare.

Now that I have warned you, let's see what can be done.

In postgresql.conf you have several settings:

 enable_bitmapscan = on
 enable_hashagg = on
 enable_hashjoin = on
 enable_indexscan = on
 enable_indexonlyscan = on
 enable_material = on
 enable_mergejoin = on
 enable_nestloop = on
 enable_seqscan = on
 enable_sort = on
 enable_tidscan = on

These settings let you disable the corresponding operations.

For example, setting enable_seqscan to false (you can do this with the SET command in an SQL session; there is no need to edit postgresql.conf) makes the planner do everything it can to avoid sequential scans.

And since a sequential scan sometimes cannot be avoided (for example, when the table has no indexes), these settings do not actually turn operations off; they merely attach a huge cost to using them.

Let's see an example. For our test table, we know that searching for "all_the_same = 123" will use a sequential scan, because that is the cheapest option:

 explain select * from test where all_the_same = 123;
                          QUERY PLAN
 ------------------------------------------------------------
  Seq Scan on test  (cost=0.00..1693.00 rows=100000 width=8)
    Filter: (all_the_same = 123)
 (2 rows)

But if we disable seq scan:

 set enable_seqscan = false;
 SET
 explain select * from test where all_the_same = 123;
                               QUERY PLAN
 -----------------------------------------------------------------------
  Index Scan using i1 on test  (cost=0.29..3300.29 rows=100000 width=8)
    Index Cond: (all_the_same = 123)
 (2 rows)

We see that the estimated cost of fetching the same data via an index scan is roughly twice as high (3300.29 versus 1693).

If I remove the i1 index:

 drop index i1;
 DROP INDEX
 set enable_seqscan = false;
 SET
 explain select * from test where all_the_same = 123;
                                  QUERY PLAN
 -----------------------------------------------------------------------------
  Seq Scan on test  (cost=10000000000.00..10000001693.00 rows=100000 width=8)
    Filter: (all_the_same = 123)
 (2 rows)

And we see that when nothing but a sequential scan is possible (interestingly, Postgres did not choose to scan i2, even though that index points to every row in the table), the cost jumps by 10,000,000,000. That is exactly what enable_* = false does.
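
One last bit of housekeeping if you were following along: a SET like this only affects the current session, but it is still good hygiene to undo the experiment and recreate the dropped index:

 reset enable_seqscan;
 create index i1 on test (all_the_same);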

I think that's all. If you have read the entire series, you should now have enough knowledge to understand what is happening and, more importantly, why.

Source: https://habr.com/ru/post/282011/

