Query optimization. Basics of EXPLAIN in PostgreSQL

Why does the query take so long? Why not use indexes?
Probably everyone has heard of EXPLAIN in PostgreSQL. But not many who understand how to use it. For a long time I could not find a textbook that I could understand (did I look badly?).
I hope this article will help those who wish to deal with this wonderful tool.

This is not a translation, but rather the author's recycling of materials Understanding EXPLAIN from Guillaume Lelarge. Some of the information is omitted, so I strongly recommend that you read the original.

Not so bad damn

To optimize queries, it is very important to understand the logic of the PostgreSQL core.
I will try to explain. In fact, everything is not so difficult. EXPLAIN displays the information necessary to understand what the kernel does on each particular request. We will examine the output of the EXPLAIN command, at the same time figuring out what is going on inside PostgreSQL. This applies to PostgreSQL 9.2 and higher.

Our tasks:

learn to read and understand the output of the EXPLAIN command
understand what happens in PostgreSQL when executing a query

The first steps

We will train on ~~cats~~ test table in a million rows

CREATE TABLE foo (c1 integer, c2 text); INSERT INTO foo SELECT i, md5(random()::text) FROM generate_series(1, 1000000) AS i;

Let's try to read the data

 EXPLAIN SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.00 rows = 1000000 width = 37)
(1 row)

Reading data from a table can be done in several ways. In our case, EXPLAIN reports that Seq Scan is being used - sequential, block by block, reading data from the foo table.
What is the cost ? This is not a time, but a certain spherical notion in a vacuum, designed to estimate the cost of the operation. The first value of 0.00 is the cost of getting the first row. The second - 18334.00 - the cost of obtaining all the lines.
rows - the approximate number of rows returned when performing a Seq Scan operation. This value is returned by the scheduler. In my case, it coincides with the actual number of rows in the table.
width - the average size of a single line in bytes.
Let's try to add 10 lines.

 INSERT INTO foo SELECT i, md5(random()::text) FROM generate_series(1, 10) AS i; EXPLAIN SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.00 rows = 1000000 width = 37)
(1 row)

The rows value has not changed. Statistics on the table is old. To update the statistics, call the ANALYZE command.

 ANALYZE foo; EXPLAIN SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.10 rows = 1000010 width = 37)
(1 row)

Now rows displays the correct number of rows.

What happens when doing ANALYZE?

Reads a certain number of randomly selected rows in a table.
The statistics of values for each of the columns of the table is collected:

How many lines ANALYZE will read - depends on the default_statistics_target parameter.

Real data

All that we saw above in the output of the EXPLAIN command is just the scheduler's expectations. Let's try to compare them with the results on real data. Use EXPLAIN (ANALYZE) .

 EXPLAIN (ANALYZE) SELECT * FROM foo;

QUERY PLAN
- Seq Scan on foo (cost = 0.00..18334.10 rows = 1000010 width = 37) (actual time = 0.012..61.524 rows = 1000010 loops = 1)
Total runtime: 90.944 ms
(2 rows)

Such a request will be executed realistically. So if you execute EXPLAIN (ANALYZE) for INSERT, DELETE or UPDATE, your data will change. Be careful! In such cases, use the ROLLBACK command.
In the output of the command information added.
actual time - real time in milliseconds, spent to get the first row and all rows, respectively.
rows - the actual number of rows received by Seq Scan .
loops - how many times you had to perform the Seq Scan operation.
Total runtime - the total time of the request.

Further...

For the first part, probably enough.
Part 2
Part 3

Source: https://habr.com/ru/post/203320/

All Articles