Hint glider in PostgreSQL

It is known that SQL is a declarative language that indicates “what” we want to choose from the base, and “how” to do this - the DBMS itself decides. The task of selecting a specific query for the SQL query (plan) is solved by the query planner, which is available in almost any DBMS. But sometimes he chooses not the best plan. Many commercial DBMSs provide “hints” for this case, which allow you to manually tell the database how best to execute the query. There was no such mechanism in the Open Source DBMS PostgreSQL.

And finally, something happened that many dreamed of and were already tired of waiting, while others were afraid. Japanese NTT developers implemented PostgreSQL hints. Moreover, they managed to do this without changing the kernel, as a separate pg_hint_plan module that supports PostgreSQL versions 9.1 and 9.2. The module implements hints, allowing you to set methods for scanning and joining tables, setting GUC values. For details of the installation and use, welcome under cat.

From the site you can download the source archives separately for versions 9.1 and 9.2, which, however, do not differ absolutely nothing and are equally collected under both versions. Anyway. Assembling and installing the module does not cause problems: make && make install. The build will require a PostgreSQL dev package from your favorite distribution. In order for PostgreSQL to pick up the module, no SQL is needed, just add pg_hint_plan to the shared_preload_libraries variable in the postgresql.conf file (instead, you can load the module into each session, where necessary, with the LOAD command). After restarting the server, three new GUC variables will be available: pg_hint_plan.enable_hint, pg_hint_plan.debug_print, pg_hint_plan.parse_messages. The first of them is responsible for the availability of hints (enabled by default), the remaining two are for logging.
')
Hints are specified in the comments to the query, decorated with / * and * /. In order for a comment to be interpreted as a hint, it must have a + sign at the beginning, for example, / * + SeqScan (t1) * /. Hints are of the following types.

Hints that are responsible for the table scan method

SeqScan (table name)
TidScan (table name)
IndexScan (table name [index name])
IndexOnlyScan (table name [index name])
BitmapScan (table name [index name])
NoSeqScan (table name)
NoTidScan (table name)
NoIndexScan (table name)
NoIndexOnlyScan (table name)
NoBitmapScan (table name)

As the name of the table, you can specify both the name of the table itself and its alias in the query.

Hints that are responsible for the method of joining tables

NestLoop (list of table names)
HashJoin (list of table names)
MergeJoin (list of table names)
NoNestLoop (list of table names)
NoHashJoin (list of table names)
NoMergeJoin (list of table names)

The list of table names is separated by a space. It is sensitive to order, that is, the connection is made exactly in the order in which the tables are listed.

Also, two hints are highlighted:

Leading (list of table names) - establishes the order of joining tables without specifying a specific joining method
Set (GUC value) - sets the GUC value of the variable for the duration of the query. In general, it does not carry any new functionality, GUC and so it could be installed, just hint allows you to do it more succinctly (and quickly?).

It is time to try all this in action. Let's create test tables, indexes, collect statistics.

CREATE TABLE test1 AS (SELECT id, (random()*1000)::int AS id_2, random() AS value1, random() AS value2 FROM generate_series(1,1000000) id); CREATE TABLE test2 AS (SELECT id, random() AS value FROM generate_series(1,1000) id); CREATE INDEX test1_id_idx ON test1 (id); CREATE INDEX test1_id_2_idx ON test1 (id_2); CREATE INDEX test1_value1_idx ON test1 (value1); CREATE INDEX test1_value2_idx ON test1 (value2); CREATE INDEX test2_id_idx ON test2 (id); CREATE INDEX test2_value_idx ON test2 (value); VACUUM ANALYZE;

Suppose we have a query that filters data by the values of two fields.

 SELECT * FROM test1 WHERE value1 BETWEEN 0.5 and 0.505 AND value2 BETWEEN 0.6 and 0.61;

The glider decides to combine the index scan results for each of the fields using Bitmap Scan.

  QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test1 (cost=319.82..514.76 rows=52 width=24) (actual time=9.575..9.736 rows=59 loops=1) Recheck Cond: ((value1 >= 0.5::double precision) AND (value1 <= 0.505::double precision) AND (value2 >= 0.6::double precision) AND (value2 <= 0.61::double precision)) -> BitmapAnd (cost=319.82..319.82 rows=52 width=0) (actual time=9.529..9.529 rows=0 loops=1) -> Bitmap Index Scan on test1_value1_idx (cost=0.00..113.54 rows=5318 width=0) (actual time=2.839..2.839 rows=5072 loops=1) Index Cond: ((value1 >= 0.5::double precision) AND (value1 <= 0.505::double precision)) -> Bitmap Index Scan on test1_value2_idx (cost=0.00..206.00 rows=9764 width=0) (actual time=5.385..5.385 rows=10070 loops=1) Index Cond: ((value2 >= 0.6::double precision) AND (value2 <= 0.61::double precision)) Total runtime: 9.805 ms

However, we can force it to use the usual Index Scan.

 /*+ IndexScan(test1) */ SELECT * FROM test1 WHERE value1 BETWEEN 0.5 and 0.505 AND value2 BETWEEN 0.6 and 0.61;

  QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Index Scan using test1_value1_idx on test1 (cost=0.00..15198.71 rows=52 width=24) (actual time=0.124..10.704 rows=59 loops=1) Index Cond: ((value1 >= 0.5::double precision) AND (value1 <= 0.505::double precision)) Filter: ((value2 >= 0.6::double precision) AND (value2 <= 0.61::double precision)) Total runtime: 10.776 ms

And even force it to use another index.

 /*+ IndexScan(test1 test1_value2_idx) */ SELECT * FROM test1 WHERE value1 BETWEEN 0.5 and 0.505 AND value2 BETWEEN 0.6 and 0.61;

  QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Index Scan using test1_value2_idx on test1 (cost=0.00..22463.60 rows=52 width=24) (actual time=0.787..15.757 rows=59 loops=1) Index Cond: ((value2 >= 0.6::double precision) AND (value2 <= 0.61::double precision)) Filter: ((value1 >= 0.5::double precision) AND (value1 <= 0.505::double precision)) Total runtime: 15.816 ms (4 rows)

The example is more complicated. Connection of two tables with filtering by field of one table, sorting by field of another and LIMIT.

 SELECT * FROM test1 t1 JOIN test2 t2 ON t1.id_2 = t2.id WHERE t2.value BETWEEN 0.5 AND 0.51 ORDER BY t1.value1 LIMIT 100;

The glider selects a plan with an Index Scan for test1_value1_idx and Nested Loop.

  QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=4.33..2149.77 rows=100 width=36) (actual time=0.274..34.784 rows=100 loops=1) -> Nested Loop (cost=4.33..171467.82 rows=7992 width=36) (actual time=0.271..34.753 rows=100 loops=1) Join Filter: (t1.id_2 = t2.id) -> Index Scan using test1_value1_idx on test1 t1 (cost=0.00..51457.05 rows=1000000 width=24) (actual time=0.022..10.338 rows=11873 loops=1) -> Materialize (cost=4.33..10.80 rows=8 width=12) (actual time=0.000..0.001 rows=8 loops=11873) -> Bitmap Heap Scan on test2 t2 (cost=4.33..10.76 rows=8 width=12) (actual time=0.035..0.046 rows=8 loops=1) Recheck Cond: ((value >= 0.5::double precision) AND (value <= 0.51::double precision)) -> Bitmap Index Scan on test2_value_idx (cost=0.00..4.33 rows=8 width=0) (actual time=0.026..0.026 rows=8 loops=1) Index Cond: ((value >= 0.5::double precision) AND (value <= 0.51::double precision)) Total runtime: 34.870 ms

Suppose we want to use a different type of table join: HashJoin.

 /*+ HashJoin(t1 t2) */ EXPLAIN ANALYZE SELECT * FROM test1 t1 JOIN test2 t2 ON t1.id_2 = t2.id WHERE t2.value BETWEEN 0.5 AND 0.51 ORDER BY t1.value1 LIMIT 100;

Glider to submit, adding inside Bitmap Index Scan on test2, and outside - sorting with Limit.

  QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=20516.23..20516.48 rows=100 width=36) (actual time=156.219..156.230 rows=100 loops=1) -> Sort (cost=20516.23..20536.21 rows=7992 width=36) (actual time=156.217..156.225 rows=100 loops=1) Sort Key: t1.value1 Sort Method: top-N heapsort Memory: 32kB -> Hash Join (cost=10.86..20210.78 rows=7992 width=36) (actual time=0.248..154.286 rows=7889 loops=1) Hash Cond: (t1.id_2 = t2.id) -> Seq Scan on test1 t1 (cost=0.00..16370.00 rows=1000000 width=24) (actual time=0.013..63.210 rows=1000000 loops=1) -> Hash (cost=10.76..10.76 rows=8 width=12) (actual time=0.066..0.066 rows=8 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> Bitmap Heap Scan on test2 t2 (cost=4.33..10.76 rows=8 width=12) (actual time=0.044..0.057 rows=8 loops=1) Recheck Cond: ((value >= 0.5::double precision) AND (value <= 0.51::double precision)) -> Bitmap Index Scan on test2_value_idx (cost=0.00..4.33 rows=8 width=0) (actual time=0.034..0.034 rows=8 loops=1) Index Cond: ((value >= 0.5::double precision) AND (value <= 0.51::double precision)) Total runtime: 156.335 ms

If, for example, you specify the MergeJoin and IndexScan connection type by the test2_value_idx index, the glider, again, adds the necessary sorts and Limit.

 /*+ MergeJoin(t1 t2) IndexScan (t2 test2_value_idx) */ EXPLAIN ANALYZE SELECT * FROM test1 t1 JOIN test2 t2 ON t1.id_2 = t2.id WHERE t2.value BETWEEN 0.5 AND 0.51 ORDER BY t1.value1 LIMIT 100;

  QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=54410.09..54410.34 rows=100 width=36) (actual time=446.031..446.041 rows=100 loops=1) -> Sort (cost=54410.09..54430.07 rows=7992 width=36) (actual time=446.029..446.032 rows=100 loops=1) Sort Key: t1.value1 Sort Method: top-N heapsort Memory: 32kB -> Merge Join (cost=71.79..54104.65 rows=7992 width=36) (actual time=12.501..444.501 rows=7889 loops=1) Merge Cond: (t1.id_2 = t2.id) -> Index Scan using test1_id_2_idx on test1 t1 (cost=0.00..51460.24 rows=1000000 width=24) (actual time=0.033..377.392 rows=900401 loops=1) -> Sort (cost=24.52..24.54 rows=8 width=12) (actual time=0.074..0.545 rows=6927 loops=1) Sort Key: t2.id Sort Method: quicksort Memory: 25kB -> Index Scan using test2_value_idx on test2 t2 (cost=0.00..24.40 rows=8 width=12) (actual time=0.026..0.047 rows=8 loops=1) Index Cond: ((value >= 0.5::double precision) AND (value <= 0.51::double precision)) Total runtime: 446.182 ms

It can be noted that in all the examples cited, the situation only worsened from the use of hints. With this I wanted to hint at what is worth thinking twice before using hints in real projects. Even if you have a plan that is faster in this particular case, ask yourself the following questions:

Did you configure the airframe parameters * _cost, effective_cache_size, geqo *, etc. in accordance with the available server resources?
What data do you get a plan that runs faster? Is there a production distribution like that? Are you ready to rewrite hints when data distribution changes?
The plan was executed faster when all that was needed was cached? And at the time of the execution of this request on the production, everything will also be in the cache?

Still, hints are very useful in at least two situations:

I would like to understand more deeply the work of the airframe / executer, to get answers to the questions “What would happen if?”.
Sometimes the glider is still very wrong. For example, when there is a strong correlation between the fields of the table, which he does not know how to take into account. Because of this, an incorrect assessment of the selectivity of the condition is obtained, and there may be a bad plan.

PS Thanks to Oleg Bartunov (aka zen ) for the tip-off on the module!

Source: https://habr.com/ru/post/169751/

All Articles

Hint glider in PostgreSQL

More articles: