Query optimization. Basics of EXPLAIN in PostgreSQL (Part 2)

I keep publishing the author's rework Understanding EXPLAIN from Guillaume Lelarge.
Once again I will note that some of the information is omitted for brevity, so I strongly recommend that you familiarize yourself with the original.

Previous parts:

Part 1

Cache

What happens at the physical level when we execute our query? We will understand. My server is up on Ubuntu 13.10. OS-level disk caches are used.
I stop PostgreSQL, forcibly fix the changes in the file system, clear the caches, start PostgreSQL:

> sudo service postgresql-9.3 stop > sudo sync > sudo su - # echo 3 > /proc/sys/vm/drop_caches # exit > sudo service postgresql-9.3 start

Now the caches are cleared, we try to execute the query with the option BUFFERS

 EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.10 rows = 1000010 width = 37) (actual time = 0.525..734.754 rows = 1000010 loops = 1)
Buffers: shared read = 8334
Total runtime: 1253.177 ms
(3 rows)

The table is read in parts - in blocks. The cache is empty. The table is completely read from the disk. For this, we had to count 8334 blocks.
Buffers: shared read - the number of blocks read from the disk.

Repeat last request

 EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.10 rows = 1000010 width = 37) (actual time = 0.173..693.000 rows = 1000010 loops = 1)
Buffers: shared hit = 32 read = 8302
Total runtime: 1208.433 ms
(3 rows)

Buffers: shared hit - the number of blocks read from the PostgreSQL cache.
If you repeat this query several times, you will see how PostgreSQL takes more and more data from the cache each time. With each request, PostgreSQL fills its cache.
Cache reads are faster than disk reads. You can notice this trend by tracking the Total runtime value.
The cache size is determined by the shared_buffers constant in the postgresql.conf file.

WHERE

Add a condition to the request

 EXPLAIN SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..20834.12 rows = 999522 width = 37)
Filter: (c1> 500)
(2 rows)

There are no indices on the table. When executing the query, each record in the table is sequentially read ( Seq Scan ). Each record is compared with the condition c1 > 500 . If the condition is met, the entry is entered into the result. Otherwise, it is discarded. Filter means exactly this behavior.
The value of cost , which is logical, has increased.
The expected number of rows of the result - rows - decreased.
The original explains why cost takes just such a value, and how the expected number of rows is calculated.
')
It's time to create indexes.

 CREATE INDEX ON foo(c1); EXPLAIN SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..20834.12 rows = 999519 width = 37)
Filter: (c1> 500)
(2 rows)

The expected number of rows has changed. Clarified. Otherwise, nothing new. What about the index?

 EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..20834.12 rows = 999519 width = 37) (actual time = 0.572..848.895 rows = 999500 loops = 1)
Filter: (c1> 500)
Rows Removed by Filter: 510
Total runtime: 1330.788 ms
(4 rows)

Only 510 lines out of more than one million have been filtered. I had to count more than 99.9% of the table.

Forcibly force the use of the index, prohibiting Seq Scan:

 SET enable_seqscan TO off; EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
- Index Scan using foo_c1_idx on foo (cost = 0.42..34623.01 rows = 999519 width = 37) (actual time = 0.178..1018.045 rows = 999500 loops = 1)
Index Cond: (c1> 500)
Total runtime: 1434.429 ms
(3 rows)

Index Scan , Index Cond instead of Filter - the index foo_c1_idx is used.
When sampling almost the entire table, using the index only increases the cost and run time of the query. Planner is not stupid!

Do not forget to cancel the ban on the use of Seq Scan:

 SET enable_seqscan TO on;

Change the query:

 EXPLAIN SELECT * FROM foo WHERE c1 < 500;

QUERY PLAN
- Index Scan using foo_c1_idx on foo (cost = 0.42..25.78 rows = 491 width = 37)
Index Cond: (c1 <500)
(2 rows)

Here the planner decided to use the index.

Complicate the condition. Use the text field.

 EXPLAIN SELECT * FROM foo WHERE c1 < 500 AND c2 LIKE 'abcd%';

QUERY PLAN
- Index Scan using foo_c1_idx on foo (cost = 0.42..27.00 rows = 1 width = 37)
Index Cond: (c1 <500)
Filter: (c2 ~ abcd% ':: text)
(3 rows)

As you can see, the index foo_c1_idx for the condition c1 < 500 . For c2 ~~ 'abcd%'::text , a filter is used.
Note that the POSIX format of the LIKE operator is used in the output of the results.

If only a text field is provided:

 EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c2 LIKE 'abcd%';

QUERY PLAN
- Seq Scan on foo (cost = 0.00..20834.12 rows = 100 width = 37) (actual time = 14.497 ..412.030 rows = 10 loops = 1)
Filter: (c2 ~ abcd% ':: text)
Rows Removed by Filter: 1000000
Total runtime: 412.120 ms
(4 rows)

Expected, Seq Scan .

Build an index on c2 :

 CREATE INDEX ON foo(c2); EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c2 LIKE 'abcd%';

QUERY PLAN
- Seq Scan on foo (cost = 0.00..20834.12 rows = 100 width = 37) (actual time = 20.992..424.946 rows = 10 loops = 1)
Filter: (c2 ~ abcd% ':: text)
Rows Removed by Filter: 1000000
Total runtime: 425.039 ms
(4 rows)

Again Seq Scan ? The index is not used because I have a base for text fields using the UTF-8 format.
When creating an index in such cases, use the text_pattern_ops operator text_pattern_ops :

 CREATE INDEX ON foo(c2 text_pattern_ops); EXPLAIN SELECT * FROM foo WHERE c2 LIKE 'abcd%';

QUERY PLAN
- Bitmap Heap Scan on foo (cost = 4.58..55.20 rows = 100 width = 37)
Filter: (c2 ~ abcd% ':: text)
-> Bitmap Index Scan on foo_c2_idx1 (cost = 0.00 ..4.55 rows = 13 width = 0)
Index Cond: ((c2 ~> = ~ 'abcd' :: text) AND (c2 ~ <~ 'abce' :: text))
(4 rows)

Hooray! Happened!
Bitmap Index Scan - the foo_c2_idx1 index is used to determine the records we need, and then PostgreSQL crawls into the table itself: ( Bitmap Heap Scan ) - to make sure that these records actually exist. This behavior is related to PostgreSQL versioning.

If you choose not the entire line, but only the field on which the index is built

 EXPLAIN SELECT c1 FROM foo WHERE c1 < 500;

QUERY PLAN
- Index Only Scan using foo_c1_idx on foo (cost = 0.42..25.78 rows = 491 width = 4)
Index Cond: (c1 <500)
(2 rows)

Index Only Scan runs faster than Index Scan due to the fact that you do not need to read the entire table row: width=4 .

Summary

Seq Scan - reads the entire table.
Index Scan - uses an index for WHERE conditions, reads a table when selecting rows.
Bitmap Index Scan - first Index Scan, then sample control by the table. Effective for a large number of lines.
Index Only Scan - the fastest. Only the index is read.

Part 3

Source: https://habr.com/ru/post/203386/

All Articles

Query optimization. Basics of EXPLAIN in PostgreSQL (Part 2)

Previous parts:

Cache

WHERE

Summary

More articles: