About the usefulness of indexes by expressions

In the PostgreSQL training sessions, both in advanced and basic courses, I often come across the fact that learners know almost nothing about how powerful expressions indexes can be (if they are generally aware of their existence). So let me do a little review for you.

Suppose we have a table with a range of timestamps (yes, we have a generate_series function that can generate dates):

CREATE TABLE t AS SELECT d, repeat(md5(d::text), 10) AS padding FROM generate_series(timestamp '1900-01-01', timestamp '2100-01-01', interval '1 day') s(d); VACUUM ANALYZE t;

The table also contains a padding column so that it is a bit larger. Now, let's perform a simple range query, returning only one month from the approximately 200 years that are in the table. If you execute this query with explain, you get something like this:

 EXPLAIN SELECT * FROM t WHERE d BETWEEN '2001-01-01' AND '2001-02-01'; QUERY PLAN ------------------------------------------------------------------------ Seq Scan on t (cost=0.00..4416.75 rows=32 width=332) Filter: ((d >= '2001-01-01 00:00:00'::timestamp without time zone) AND (d <= '2001-02-01 00:00:00'::timestamp without time zone)) (2 rows)

and on my computer, the request is about 20 milliseconds. Not bad, considering the fact that he needs to go through the entire table consisting of 75 thousand rows.
')
But let's create an index on a column with a timestamp (all indexes in this text are basic, that is, btree, if not explicitly specified):

 CREATE INDEX idx_t_d ON t (d);

And now let's try the query again:

  QUERY PLAN ------------------------------------------------------------------------ Index Scan using idx_t_d on t (cost=0.29..9.97 rows=34 width=332) Index Cond: ((d >= '2001-01-01 00:00:00'::timestamp without time zone) AND (d <= '2001-02-01 00:00:00'::timestamp without time zone)) (2 rows)

Now it runs in 0.5 milliseconds, which is roughly 40 times faster. But it was, of course, a simple index created directly on the column, not an index by expression. So let's assume that instead of the last request, we will request data for each first day of each month, performing the following request:

 SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

which, however, cannot use the index, since it needs to execute the expression on the column, while the index is built on the column itself, as shown by EXPLAIN ANALYZE:

  QUERY PLAN ------------------------------------------------------------------------ Seq Scan on t (cost=0.00..4416.75 rows=365 width=332) (actual time=0.045..40.601 rows=2401 loops=1) Filter: (date_part('day'::text, d) = '1'::double precision) Rows Removed by Filter: 70649 Planning time: 0.209 ms Execution time: 43.018 ms (5 rows)

So he not only has to perform a sequential scan, but also count the number, increasing the request time to 43 milliseconds.

The database cannot use indexes for several reasons. Indices (at least btree) are based on queries to the sorted data provided by the tree structure, and if the first query on the range benefits from this, then the second one (with the extract call) does not.

Note: Another problem is that the set of operators supported by indices (that is, which can be executed directly on the index itself) is very limited. And the extract function is not supported, so the query cannot bypass the sorting problem using Bitmap Index Scan.

Theoretically, the database can try to convert a condition into a set of conditions, but it is extremely difficult and specific for each expression. In this case, we will have to generate an infinite number of such ranges per day, because the scheduler does not really know the minimum / maximum timestamp in the table. So the base won't even try.

But while the database does not know how to transform the condition, the developers usually know. For example, the condition of the form:

 (column + 1) >= 1000

it is not difficult to rewrite as follows:

 column >= (1000 - 1)

and it will already work fine with indexes.

But what if such a conversion is not possible, as is the case with our query:

 SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

In this case, the developer will have to face the same problem of unknown minimum / maximum values for column d, and even then he will have to generate multiple ranges.

Well, this article is about indexes on expressions, and we used only ordinary indexes up to this point, built directly on the column. Let's create the first index by expression:

 CREATE INDEX idx_t_expr ON t ((extract(day FROM d))); ANALYZE t;

as a result of which we will get the following plan:

  QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on t (cost=47.35..3305.25 rows=2459 width=332) (actual time=2.400..12.539 rows=2401 loops=1) Recheck Cond: (date_part('day'::text, d) = '1'::double precision) Heap Blocks: exact=2401 -> Bitmap Index Scan on idx_t_expr (cost=0.00..46.73 rows=2459 width=0) (actual time=1.243..1.243 rows=2401 loops=1) Index Cond: (date_part('day'::text, d) = '1'::double precision) Planning time: 0.374 ms Execution time: 17.136 ms (7 rows)

So far, he does not give the same increase in speed by 40 times, as the index from the first example, this is expected, since this query returns much more tuples (2401 vs 32). Moreover, they are spread over the entire table and are not as localized as in the first example. So this is a good acceleration of 2 times, and in many situations from real life, you will see a much larger increase.

But the ability to use indexes for conditions with complex expressions at the base is not the most interesting information in this article, this is the reason why people create indexes on expressions. But this is not the only advantage.

If you look at the two plans for executing queries below (with and without an index by expression), you will notice the following:

  QUERY PLAN ------------------------------------------------------------------------ Seq Scan on t (cost=0.00..4416.75 rows=365 width=332) (actual time=0.045..40.601 rows=2401 loops=1) ...

  QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on t (cost=47.35..3305.25 rows=2459 width=332) (actual time=2.400..12.539 rows=2401 loops=1) ...

True - creating an index by expression has significantly improved the scores. Without an index, we only have statistics (MCV + histogram) for the coarse columns of the table, and the base does not know how to evaluate the expression

 EXTRACT(day FROM d) = 1

Accordingly, it applies a comparison to the default equality, which returns approximately 0.5% of all rows — since the table has 73050 rows, we get a score of only 365 rows. Often you can see much larger evaluation errors in real-world applications.

With the index, however, the database also collected statistics on the index columns and, in this case, the column contains the result of the expression. In the planning process, the optimizer pays attention to this and produces a much better estimate.

This is a great advantage that can help deal with the problems of poor query plans caused by inaccurate estimates. However, most people do not know about such a handy tool.

And the utility of this tool has only increased with the introduction of the JSONB data type in version 9.4, because it is practically the only way to collect statistics about the contents of JSONB documents.

When using JSONB documents, there are two basic indexing strategies. You can create a GIN / GIST index for the entire document, for example as follows:

 CREATE INDEX ON t USING GIN (jsonb_column);

which allows you to query arbitrary parts of the JSONB field, use the content operator to compare subdocuments, and so on. This is fine, but you still have only basic statistics on the column, which is not very convenient, since the documents are served as scalar values (and neither matches the whole document, or does not use a range of documents).

Expression indices, for example, created as follows:

 CREATE INDEX ON t ((jsonb_column->'id'));

will be useful only for a specific expression, for this particular example:

 SELECT * FROM t WHERE jsonb_column ->> 'id' = 123;

but not for requests directed to other JSON keys, for example, value:

 SELECT * FROM t WHERE jsonb_column ->> 'value' = 'xxxx';

This does not mean that GIN / GIST indexes are useless on the whole document, but you have to choose. Alternatively, you will create an index targeted to a specific expression, useful when the request goes to a specific key and with the added advantage of expression statistics. Or you create a GIN / GIST index on the entire document, able to cope with requests for arbitrary keys, but without statistics.

However, you can kill both birds with one stone, in this case, because you can create both indices at the same time, and the database itself will choose which one to use for which queries. And you will have accurate statistics, thanks to the index on expressions.

Unfortunately, expression indexes and GIN / GIST indexes use different conditions:

 -- expression (btree) SELECT * FROM t WHERE jsonb_column ->> 'id' = 123; -- GIN/GiST SELECT * FROM t WHERE jsonb_column @> '{"id" : 123}';

so the scheduler cannot use them at the same time - expression indexes for evaluation and GIN / GIST indexes for execution.

Source: https://habr.com/ru/post/282932/

All Articles

About the usefulness of indexes by expressions

More articles: