
Explaining the inexplicable. Part 3

In preparation for the PG Day'16 conference, we continue to acquaint you with interesting aspects of PostgreSQL. Today we offer a translation of the third article in the explain series.

In previous posts in this series, I wrote about how to interpret a single line of explain analyze output, about its structure, and described the basic data retrieval operations (the nodes of the explain tree).

Today we will move to more complex operations.


Function scan


Example:

 $ explain analyze select * from generate_Series(1,10) i;
                                                       QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.012..0.013 rows=10 loops=1)
  Total runtime: 0.034 ms
 (2 rows)

By and large, this node is so simple that it hardly needs explanation. But since it will be used in the following examples, I will still write a little about it.

Function Scan is a very simple node. It runs a function that returns a record set, and that's all. It is not used for functions like lower(), only for those that potentially return multiple rows. When the function returns, its rows are passed to the node one level above Function Scan in the plan tree, or to the client if Function Scan is the root node.

The only additional logic that may appear here is the ability to filter the resulting rows, like this:

 $ explain analyze select * from generate_Series(1,10) i where i < 3;
                                                      QUERY PLAN
 -------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series i  (cost=0.00..12.50 rows=333 width=4) (actual time=0.012..0.014 rows=2 loops=1)
    Filter: (i < 3)
    Rows Removed by Filter: 8
  Total runtime: 0.030 ms
 (4 rows)

Sort


I think this one is pretty easy to understand: Sort takes the given records and returns them sorted in a certain way.

Example:

 $ explain analyze select * from pg_class order by relname;
                                                    QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------
  Sort  (cost=22.88..23.61 rows=292 width=203) (actual time=0.230..0.253 rows=295 loops=1)
    Sort Key: relname
    Sort Method: quicksort  Memory: 103kB
    ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.048 rows=295 loops=1)
  Total runtime: 0.326 ms
 (5 rows)

Although it is simple, there is interesting logic inside. To begin with, if the memory required for sorting exceeds work_mem, Postgres switches to sorting on disk:

 $ explain analyze select random() as x from generate_series(1,14000) i order by x;
                                                            QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=62.33..64.83 rows=1000 width=0) (actual time=16.713..18.090 rows=14000 loops=1)
    Sort Key: (random())
    Sort Method: quicksort  Memory: 998kB
    ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=2.036..4.533 rows=14000 loops=1)
  Total runtime: 18.942 ms
 (5 rows)

 $ explain analyze select random() as x from generate_series(1,15000) i order by x;
                                                            QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=62.33..64.83 rows=1000 width=0) (actual time=27.052..28.780 rows=15000 loops=1)
    Sort Key: (random())
    Sort Method: external merge  Disk: 264kB
    ->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=2.171..4.894 rows=15000 loops=1)
  Total runtime: 29.767 ms
 (5 rows)

Notice the Sort Method change in the example above.

In such cases, Postgres uses temporary files stored in the $PGDATA/base/pgsql_tmp/ directory. Of course, they are removed as soon as they are no longer needed.

Another additional property is that Sort can change its method of operation if it is called by the Limit operation, like this:

 $ explain analyze select * from pg_class order by relfilenode limit 5;
                                                          QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------
  Limit  (cost=15.77..15.78 rows=5 width=203) (actual time=0.119..0.120 rows=5 loops=1)
    ->  Sort  (cost=15.77..16.50 rows=292 width=203) (actual time=0.118..0.118 rows=5 loops=1)
          Sort Key: relfilenode
          Sort Method: top-N heapsort  Memory: 26kB
          ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.005..0.047 rows=295 loops=1)
  Total runtime: 0.161 ms
 (6 rows)

Usually, to sort a data set, you need to process it entirely. But Postgres knows that if you need only a small number of rows, it does not have to sort the whole data set; it is enough to pick out only the top values.

In Big O notation, a full sort has O(m * log(m)) complexity, while top-N has O(m * log(n)) complexity, where m is the number of rows in the table and n is the number of rows returned. It is important to know that this sorting method also uses much less memory (after all, it does not need to hold the entire sorted data set, just the top few rows), so it is less likely to spill to slow temporary files on disk.
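The top-N idea can be sketched in Python (an illustration of the technique, not Postgres's actual code): keep a bounded heap of the n best rows seen so far, so memory use stays at O(n) rather than O(m).

```python
import heapq

def top_n(rows, n, key=lambda r: r):
    """Return the n smallest rows without sorting the whole input."""
    # Min-heap of negated keys, so the *worst* current candidate is
    # always at heap[0]; the heap never grows past n entries.
    heap = []
    for i, row in enumerate(rows):
        entry = (-key(row), i, row)   # i breaks ties between equal keys
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:         # new row beats the current worst
            heapq.heapreplace(heap, entry)
    # Sorting by negated key descending yields ascending key order.
    return [row for _, _, row in sorted(heap, reverse=True)]

print(top_n([5, 1, 4, 2, 3], 2))      # the two smallest values
```

Only n rows are ever held in memory, which is exactly why "Sort Method: top-N heapsort" reports far less memory than a full quicksort of the same table would.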

Limit


I have used limit repeatedly in the examples because it is very simple, but let's still discuss it in detail. The Limit operation starts its sub-operation and returns only the first N rows of what the sub-operation returns. Usually it then stops the sub-operation early, but in some cases (for example, a PL/pgSQL function call), the sub-operation has already completed its work by the time it returns its first row.

A simple example:

 $ explain analyze select * from pg_class;
                                                 QUERY PLAN
 ---------------------------------------------------------------------------------------------------------
  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.047 rows=295 loops=1)
  Total runtime: 0.096 ms
 (2 rows)

 $ explain analyze select * from pg_class limit 2;
                                                    QUERY PLAN
 -------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..0.07 rows=2 width=203) (actual time=0.009..0.010 rows=2 loops=1)
    ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.009 rows=2 loops=1)
  Total runtime: 0.045 ms
 (3 rows)

As you can see, using limit in the second case caused the nested Seq Scan operation to finish its work immediately after finding two rows.
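This early-stop behavior can be sketched in Python (a toy model; seq_scan and the row counter are illustrative names, not Postgres internals): the scan yields rows lazily, and the limit simply stops pulling once it has enough.

```python
from itertools import islice

rows_read = 0

def seq_scan(table):
    """A stand-in Seq Scan: yields rows one at a time, counting how
    many rows were actually pulled before the consumer stopped."""
    global rows_read
    for row in table:
        rows_read += 1
        yield row

table = list(range(295))                    # stand-in for pg_class
limited = list(islice(seq_scan(table), 2))  # the LIMIT 2 node pulls 2 rows
print(limited, rows_read)                   # only 2 rows were ever scanned
```

The generator is never asked for a third row, mirroring how the Seq Scan under a Limit node stops after producing two rows despite the table holding 295.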

HashAggregate


This operation is mainly used when you have GROUP BY together with aggregates like sum(), avg(), min(), max() and others.

Example:

 $ explain analyze select relkind, count(*) from pg_Class group by relkind;
                                                   QUERY PLAN
 -------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=12.38..12.42 rows=4 width=1) (actual time=0.223..0.224 rows=5 loops=1)
    ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=1) (actual time=0.008..0.053 rows=295 loops=1)
  Total runtime: 0.273 ms
 (3 rows)

HashAggregate does the following: for each row it receives, it computes the GROUP BY key (relkind in this case). Then it puts the row into the bucket of a hash (associative array, dictionary) identified by that key.

After all rows have been processed, it scans the hash and returns one row for each key value, making relevant calculations as necessary (sum, min, avg, and so on).

It is important to understand that HashAggregate must scan all rows before it can return at least one.
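The algorithm described above can be sketched in Python (illustrative only; the rows are made-up stand-ins for pg_class, and the function names are not a Postgres API):

```python
from collections import defaultdict

def hash_aggregate(rows, key, agg_init, agg_step, agg_final=lambda s: s):
    buckets = defaultdict(agg_init)   # one aggregate state per key
    for row in rows:                  # must consume *all* input rows...
        k = key(row)
        buckets[k] = agg_step(buckets[k], row)
    # ...and only then can any output row be produced
    return {k: agg_final(s) for k, s in buckets.items()}

# count(*) ... GROUP BY relkind, on stand-in rows
rows = [{"relkind": "r"}, {"relkind": "i"}, {"relkind": "r"}]
counts = hash_aggregate(rows,
                        key=lambda r: r["relkind"],
                        agg_init=int,               # count starts at 0
                        agg_step=lambda s, r: s + 1)
print(counts)
```

Swapping agg_init/agg_step/agg_final gives you sum, min, avg and so on; the important structural point is the single full pass over the input before the first output row.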

If you understand this, you probably already see a potential problem: what happens when there are millions of rows? The hash would be too large to fit in memory. And here work_mem comes into play again: if the generated hash is too large, it spills to disk (again in $PGDATA/base/pgsql_tmp).

This means that if there are both HashAggregate and Sort in the plan, we can use up to 2 * work_mem. Such a plan is easy to obtain:

 $ explain analyze select relkind, count(*) from pg_Class group by relkind order by relkind;
                                                         QUERY PLAN
 -------------------------------------------------------------------------------------------------------------------
  Sort  (cost=12.46..12.47 rows=4 width=1) (actual time=0.260..0.261 rows=5 loops=1)
    Sort Key: relkind
    Sort Method: quicksort  Memory: 25kB
    ->  HashAggregate  (cost=12.38..12.42 rows=4 width=1) (actual time=0.221..0.222 rows=5 loops=1)
          ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=1) (actual time=0.006..0.044 rows=295 loops=1)
  Total runtime: 0.312 ms
 (6 rows)

In reality, a single query can use work_mem many times over, because work_mem is a limit per operation. So if a query uses 1000 HashAggregates and Sorts (and other operations that use work_mem), total memory consumption can be very high.

Hash Join / Hash


Since we have just discussed HashAggregate, it will be logical to go to Hash Join.

This operation, unlike the previous ones, has two sub-operations. One of them is always "Hash", and the other is something else.

As the name implies, Hash Join is used to join two record sets. For example, like here:

 $ explain analyze select * from pg_class c join pg_namespace n on c.relnamespace = n.oid;
                                                         QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=1.14..16.07 rows=292 width=316) (actual time=0.036..0.343 rows=295 loops=1)
    Hash Cond: (c.relnamespace = n.oid)
    ->  Seq Scan on pg_class c  (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.044 rows=295 loops=1)
    ->  Hash  (cost=1.06..1.06 rows=6 width=117) (actual time=0.012..0.012 rows=6 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 1kB
          ->  Seq Scan on pg_namespace n  (cost=0.00..1.06 rows=6 width=117) (actual time=0.004..0.005 rows=6 loops=1)
  Total runtime: 0.462 ms
 (7 rows)

This works as follows: first, Hash Join calls "Hash", which in turn calls something else (in our case, a Seq Scan on pg_namespace). Hash then builds an in-memory (or on-disk, depending on size) hash/associative array/dictionary of the rows from that source, keyed by the join key (in our case, the OID column of pg_namespace).

Of course, you can have many rows for a given join key (not in this case, since I am joining on a primary key, but in general you may well have multiple rows for one hash key).

In Perl notation, the output of Hash will be something like this:

 {
     '123' => [
         { data for row with OID = 123 },
     ],
     '256' => [
         { data for row with OID = 256 },
     ],
     ...
 }

Then Hash Join starts the second sub-operation (Seq Scan on pg_class in our case) and, for each of its rows, does the following:
  1. Checks whether the join key (pg_class.relnamespace in this case) exists in the hash built by the Hash operation.
  2. If not, the row from the sub-operation is ignored (it will not be returned).
  3. If the key exists, Hash Join takes the matching rows from the hash and generates output rows based on the current row on one side and all the matching rows from the hash on the other.

It is important to note that both sides of the join are executed only once (in our case, both are seq scans), but first the side wrapped in the Hash operation must be read in full and stored in the hash, and only then is the other side processed row by row, skipping rows whose keys do not exist in the hash (I hope this sentence is understandable despite the abundance of hashes).
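The two phases described above can be sketched in Python (an illustration with made-up stand-in rows for pg_class and pg_namespace; hash_join is not a Postgres API):

```python
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    # Build phase: the Hash node reads the inner side in full and
    # buckets its rows by join key (a key may hold several rows).
    buckets = defaultdict(list)
    for row in inner:
        buckets[inner_key(row)].append(row)
    # Probe phase: the outer side is read row by row; a row whose
    # key is absent from the hash is simply skipped.
    out = []
    for row in outer:
        for match in buckets.get(outer_key(row), []):
            out.append({**row, **match})
    return out

classes = [{"relname": "t1", "relnamespace": 11},
           {"relname": "t2", "relnamespace": 99}]
namespaces = [{"oid": 11, "nspname": "pg_catalog"}]
joined = hash_join(classes, namespaces,
                   outer_key=lambda r: r["relnamespace"],
                   inner_key=lambda r: r["oid"])
print(joined)   # only t1 survives: 99 has no bucket in the hash
```

Each side is scanned exactly once, and the memory cost is proportional to the hashed (inner) side, which is why that side counts against work_mem.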

Since both sub-scans can be any type of operation, they can be filters, index scans, or whatever else you can imagine.

The last thing worth mentioning about Hash Join / Hash is that the Hash operation, just like Sort and HashAggregate, will use up to work_mem of memory.

Nested Loop


Since we are talking about joins, it is worth discussing Nested Loop. Example:

 $ explain analyze select a.* from pg_class c join pg_attribute a on c.oid = a.attrelid
   where c.relname in ( 'pg_class', 'pg_namespace' );
                                                                          QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.28..52.32 rows=16 width=203) (actual time=0.057..0.134 rows=46 loops=1)
    ->  Seq Scan on pg_class c  (cost=0.00..11.65 rows=2 width=4) (actual time=0.043..0.080 rows=2 loops=1)
          Filter: (relname = ANY ('{pg_class,pg_namespace}'::name[]))
          Rows Removed by Filter: 291
    ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..20.25 rows=8 width=203) (actual time=0.007..0.015 rows=23 loops=2)
          Index Cond: (attrelid = c.oid)
  Total runtime: 0.182 ms

This is a very interesting plan, because it can execute a given operation multiple times.

Like Hash Join, Nested Loop has two "children". First it runs the Seq Scan on pg_class (the first node in our example), and then, for each row it returns (only 2 rows in our example), it runs the second operation (the Index Scan on pg_attribute in our case).

You may have noticed that the Index Scan shows "loops=2" in its actual metadata. This means the operation was run twice, and the other values (rows, time) are averages across all runs.

Let's look at this plan from explain.depesz.com. Note that the actual execution time of each index scan on categories is only 0.002 to 0.003 ms, yet the total time spent in that node is 78.852 ms, because the index scan was executed more than 26,000 times.

So the processing goes like this:
  1. Nested Loop runs one side of the join once. Let's call it "A".
  2. For each row from "A", it runs the second operation (let's call it "B").
  3. If "B" returns no rows, the current row from "A" is ignored.
  4. If "B" returns rows, then for each of them Nested Loop returns a new row based on the current rows from A and B.
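The four steps above can be sketched in Python (a toy model; the attrs dictionary plays the role of the index on pg_attribute, and the names are illustrative):

```python
def nested_loop(outer_rows, inner_plan):
    # Side "A" is executed once; inner_plan is re-executed for every
    # A-row, which is exactly what loops=N in explain output reports.
    out = []
    for a in outer_rows:
        matches = inner_plan(a)      # e.g. an index scan with attrelid = a
        for b in matches:            # no matches -> the A-row is dropped
            out.append((a, b))
    return out

# stand-in "index": attrelid -> attribute rows
attrs = {1: ["a", "b"], 2: ["c"]}
result = nested_loop([1, 2, 3], lambda oid: attrs.get(oid, []))
print(result)   # oid 3 has no attributes, so it produces nothing
```

The cost model follows directly: the inner side's per-loop time gets multiplied by the number of outer rows, which is how a 0.003 ms index scan can add up to 78 ms.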

Merge join


Another way to join data is the Merge Join. It is used when the data sets being joined are sorted (or can be sorted at low cost) by the join key.

I do not have a ready-made visual example at hand, so I will construct one artificially using subqueries that sort the data before joining:

 $ explain analyze select * from ( select oid, * from pg_class order by oid) as c
   join ( select * from pg_attribute a order by attrelid) as a on c.oid = a.attrelid;
                                                                               QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
    Merge Cond: (pg_class.oid = a.attrelid)
    ->  Sort  (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
          Sort Key: pg_class.oid
          Sort Method: quicksort  Memory: 102kB
          ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
    ->  Materialize  (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
          ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
  Total runtime: 4.009 ms
 (9 rows)

Merge Join, like the other joins, runs two sub-operations (Sort and Materialize in this case). Since both return their data in sorted order, and that order matches the join condition, Pg can scan both data sets in parallel and simply check whether the identifiers match.

The procedure is as follows:
  1. Compare the current rows of both sides on the join key.
  2. If the keys differ, advance the side with the smaller key; its current row cannot have a match.
  3. If the keys are equal, return the joined rows and advance, re-reading a group of equal-key rows on one side when the other side contains duplicates of that key.
  4. Repeat until either side runs out of rows.

This is a very cool way to join data sets, but it works only for sorted sources.
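Under the assumption of already-sorted inputs, the procedure can be sketched in Python (illustrative, not Postgres code; note how the right-side cursor stays put on a group of equal keys so that group can be re-read, which is why Merge Join needs sources it can re-scan):

```python
def merge_join(left, right, key):
    """Both inputs must already be sorted on the join key."""
    li = ri = 0
    out = []
    while li < len(left) and ri < len(right):
        lk, rk = key(left[li]), key(right[ri])
        if lk < rk:
            li += 1                  # advance the side with the smaller key
        elif lk > rk:
            ri += 1
        else:
            # emit every right row of this key group for the current
            # left row; ri itself does not move, so the group can be
            # re-read when the left side has duplicate keys
            r = ri
            while r < len(right) and key(right[r]) == lk:
                out.append((left[li], right[r]))
                r += 1
            li += 1
    return out

print(merge_join([1, 1, 2, 3], [1, 2, 2, 4], key=lambda x: x))
```

Each side is walked forward essentially once, giving linear-time joining once sorting is paid for.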

Hash Join / Nested Loop / Merge Join Modifiers


In all the examples above, I showed that a join operation returns a row only when it receives matching rows from both sides of the join.

But this is not always the case. There are left, right, and full outer joins (LEFT / RIGHT / FULL OUTER JOIN), as well as so-called anti-joins.

In the case of left/right joins, the operation names become Hash Left Join, Hash Right Join, Merge Left Join, Merge Right Join, and Nested Loop Left Join.

There is no Nested Loop Right Join, because Nested Loop always starts with the left side and uses it as the basis for the loop. So a RIGHT JOIN that is to be executed with a Nested Loop is internally transformed into a LEFT JOIN so that the Nested Loop operation can work.

In all these cases, the logic is simple: we have two sides of a join, left and right. And when a side is mentioned in the operation name, the join returns a new row even if the other side has no matching rows.

This happens with queries like this:

 select * from a left join b on ... 

(or right join).

Everything else said about Hash Join / Merge Join and Nested Loop still applies; there is only a small change in the logic of when an output row is generated.

There is also a Full Join version, with the operation names Hash Full Join and Merge Full Join.

In this case, the join generates a new output row regardless of whether matching data is missing on either side (as long as there is data on at least one side). This happens in the case of:

 select * from a full join b ... 

All processing is the same as in the previous examples.

In addition, there are so-called anti-joins, with the operation names Hash Anti Join, Merge Anti Join, and Nested Loop Anti Join.

In these cases, the join returns a row only if the right side yields no matching rows. This is useful when you do something like "WHERE NOT EXISTS ()" or "LEFT JOIN ... WHERE right_table.column IS NULL".

As in this example:

 $ explain analyze select * from pg_class c where not exists
   (select * from pg_attribute a where a.attrelid = c.oid and a.attnum = 10);
                                                                              QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Hash Anti Join  (cost=62.27..78.66 rows=250 width=203) (actual time=0.145..0.448 rows=251 loops=1)
    Hash Cond: (c.oid = a.attrelid)
    ->  Seq Scan on pg_class c  (cost=0.00..10.92 rows=292 width=207) (actual time=0.009..0.195 rows=293 loops=1)
    ->  Hash  (cost=61.75..61.75 rows=42 width=4) (actual time=0.123..0.123 rows=42 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 2kB
          ->  Index Only Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..61.75 rows=42 width=4) (actual time=0.021..0.109 rows=42 loops=1)
                Index Cond: (attnum = 10)
                Heap Fetches: 0
  Total runtime: 0.521 ms
 (9 rows)

Here Pg executed the right side (the Index Only Scan on pg_attribute), hashed it, then executed the left side (the Seq Scan on pg_class) and returned only those rows for which the Hash had no entry for the given pg_class.oid.
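The anti-join logic can be sketched in Python (illustrative stand-ins for pg_class and pg_attribute; the function name is made up):

```python
def hash_anti_join(outer, inner, outer_key, inner_key):
    # Hash the right side, then keep only the left rows whose key is
    # *absent* from the hash -- the NOT EXISTS semantics.
    present = {inner_key(r) for r in inner}
    return [r for r in outer if outer_key(r) not in present]

classes = [{"oid": 1}, {"oid": 2}, {"oid": 3}]
attrs_with_attnum_10 = [{"attrelid": 2}]   # only table 2 has a 10th column
result = hash_anti_join(classes, attrs_with_attnum_10,
                        outer_key=lambda r: r["oid"],
                        inner_key=lambda r: r["attrelid"])
print(result)   # tables 1 and 3 pass the NOT EXISTS test
```

It is the mirror image of a regular hash join's probe phase: a hash hit discards the row instead of producing output.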

Materialize


This operation was already shown in the Merge Join example, but it can be useful in other cases too.

psql has many internal commands. One of them is \dTS, which lists all system data types. Internally, \dTS runs this query:

 SELECT n.nspname as "Schema",
        pg_catalog.format_type(t.oid, NULL) AS "Name",
        pg_catalog.obj_description(t.oid, 'pg_type') as "Description"
 FROM pg_catalog.pg_type t
      LEFT JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
 WHERE (t.typrelid = 0 OR (SELECT c.relkind = 'c' FROM pg_catalog.pg_class c WHERE c.oid = t.typrelid))
   AND NOT EXISTS(SELECT 1 FROM pg_catalog.pg_type el WHERE el.oid = t.typelem AND el.typarray = t.oid)
   AND pg_catalog.pg_type_is_visible(t.oid)
 ORDER BY 1, 2;

Its plan is:

                                                                         QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=2783.00..2783.16 rows=65 width=68) (actual time=3.883..3.888 rows=87 loops=1)
    Sort Key: n.nspname, (format_type(t.oid, NULL::integer))
    Sort Method: quicksort  Memory: 39kB
    ->  Nested Loop Left Join  (cost=16.32..2781.04 rows=65 width=68) (actual time=0.601..3.657 rows=87 loops=1)
          Join Filter: (n.oid = t.typnamespace)
          Rows Removed by Join Filter: 435
          ->  Hash Anti Join  (cost=16.32..2757.70 rows=65 width=8) (actual time=0.264..0.981 rows=87 loops=1)
                Hash Cond: ((t.typelem = el.oid) AND (t.oid = el.typarray))
                ->  Seq Scan on pg_type t  (cost=0.00..2740.26 rows=81 width=12) (actual time=0.012..0.662 rows=157 loops=1)
                      Filter: (pg_type_is_visible(oid) AND ((typrelid = 0::oid) OR (SubPlan 1)))
                      Rows Removed by Filter: 185
                      SubPlan 1
                        ->  Index Scan using pg_class_oid_index on pg_class c  (cost=0.15..8.17 rows=1 width=1) (actual time=0.002..0.002 rows=1 loops=98)
                              Index Cond: (oid = t.typrelid)
                ->  Hash  (cost=11.33..11.33 rows=333 width=8) (actual time=0.241..0.241 rows=342 loops=1)
                      Buckets: 1024  Batches: 1  Memory Usage: 14kB
                      ->  Seq Scan on pg_type el  (cost=0.00..11.33 rows=333 width=8) (actual time=0.002..0.130 rows=342 loops=1)
          ->  Materialize  (cost=0.00..1.09 rows=6 width=68) (actual time=0.000..0.001 rows=6 loops=87)
                ->  Seq Scan on pg_namespace n  (cost=0.00..1.06 rows=6 width=68) (actual time=0.002..0.003 rows=6 loops=1)
  Total runtime: 3.959 ms

For viewing convenience, I also uploaded this plan to explain.depesz.com.

Note that operation #9 is Materialize. Why?

Materialize is called by the Nested Loop Left Join, operation #2. We know that Nested Loop executes the chosen inner operation multiple times, in this case 87 times.

The right side of the join is a Seq Scan on pg_namespace. So, in theory, Postgres should have performed 87 sequential scans of pg_namespace. Given that a single sequential scan of this table takes 0.003 ms, we would expect a total time of ~0.25 ms.

But Postgres is smarter than that. It realized it would be cheaper to scan the table once and build an in-memory image of all its rows. Then, on each subsequent pass, it does not need to scan the table, check visibility information, or parse data pages; it just takes the data from memory.

Thanks to this, the total time for everything (reading the table once, preparing the in-memory image of the data, and scanning that image 87 times) was only 0.087 ms.
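The caching behavior can be sketched in Python (a toy model; the class and function names are made up, not Postgres internals):

```python
class Materialize:
    """Executes the underlying plan once and serves the cached rows
    on every subsequent rescan."""
    def __init__(self, plan):
        self.plan = plan
        self.cache = None
    def rows(self):
        if self.cache is None:
            self.cache = list(self.plan())   # the only real execution
        return self.cache

executions = 0

def seq_scan_namespace():            # stand-in for the pg_namespace scan
    global executions
    executions += 1
    return [("pg_catalog",), ("public",)]

mat = Materialize(seq_scan_namespace)
for _ in range(87):                  # the nested loop's 87 iterations
    rows = mat.rows()
print(executions)                    # the table was scanned only once
```

The trade-off is memory for the cached rows in exchange for skipping 86 repeated scans, which is exactly what the plan's actual times show.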

You might say: "OK, but why did the Merge Join earlier use Materialize? It performed only one scan!" Let's recall the plan:

 $ explain analyze select * from ( select oid, * from pg_class order by oid) as c
   join ( select * from pg_attribute a order by attrelid) as a on c.oid = a.attrelid;
                                                                               QUERY PLAN
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
    Merge Cond: (pg_class.oid = a.attrelid)
    ->  Sort  (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
          Sort Key: pg_class.oid
          Sort Method: quicksort  Memory: 102kB
          ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
    ->  Materialize  (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
          ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
  Total runtime: 4.009 ms
 (9 rows)

Yes, it was run only once. The problem is that a data source for Merge Join must meet several criteria. Some are obvious (the data must be sorted), while others are less obvious because they are more technical (the data must be scannable back and forth).

Because of these less obvious criteria, Postgres sometimes has to apply Materialize to data coming from a source (in our case, from the Index Scan) so that it has all the necessary capabilities by the time the data is used.

In short, Materialize reads data from its underlying operation and stores it in memory (or partly in memory) so that it can be used faster, or adds to it additional properties that the underlying operation does not provide.

That's all for today. I thought I would finish here, but there are still many operations to describe, so there will be at least two more posts in this series (on the remaining operations and on statistics information).

Source: https://habr.com/ru/post/279255/

