
Making full use of index features in PostgreSQL


In the Postgres world, indexes are essential for efficiently navigating a table's data storage (known as the heap). Postgres does not maintain clustered tables, and its MVCC architecture leads to many versions of the same tuple accumulating in the heap. Being able to create and maintain efficient indexes is therefore critical for application performance.

Here are some tips on optimizing and improving your use of indexes.

Note: the queries shown below run against an unmodified pagila sample database.

Using Covering Indexes


Let's consider a query that retrieves email addresses of inactive customers. The customer table has an active column, and the query is simple:

 pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                         QUERY PLAN
 -----------------------------------------------------------
  Seq Scan on customer  (cost=0.00..16.49 rows=15 width=32)
    Filter: (active = 0)
 (2 rows)

The query plan shows a full sequential scan of the customer table. Let's create an index on the active column:

 pagila=# CREATE INDEX idx_cust1 ON customer(active);
 CREATE INDEX
 pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                  QUERY PLAN
 -----------------------------------------------------------------------------
  Index Scan using idx_cust1 on customer  (cost=0.28..12.29 rows=15 width=32)
    Index Cond: (active = 0)
 (2 rows)

That helped: the sequential scan became an "index scan". This means Postgres scans the idx_cust1 index and then reads the table heap to fetch the values of the other columns the query needs (in this case, the email column).

PostgreSQL 11 introduced covering indexes. They let you include one or more additional columns in the index itself; their values are stored in the index's own data storage.

If we used this feature to store the email value inside the index, Postgres would not need to look up the email value in the table heap at all. Let's see whether this works:

 pagila=# CREATE INDEX idx_cust2 ON customer(active) INCLUDE (email);
 CREATE INDEX
 pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                    QUERY PLAN
 ----------------------------------------------------------------------------------
  Index Only Scan using idx_cust2 on customer  (cost=0.28..12.29 rows=15 width=32)
    Index Cond: (active = 0)
 (2 rows)

"Index Only Scan" tells us that the query is now satisfied by the index alone, avoiding all the disk I/O needed to read the table heap.

Today, covering indexes are available only for B-trees. Note that the extra stored columns make the index larger, so its maintenance cost is also higher.

Using Partial Indexes


Partial indexes index only a subset of a table's rows. This keeps the index small and makes scanning it faster.

Suppose we need to get a list of email addresses of our customers from California. The query is as follows:

 SELECT c.email
 FROM customer c
 JOIN address a ON c.address_id = a.address_id
 WHERE a.district = 'California';

Its query plan involves scanning both of the joined tables:

 pagila=# EXPLAIN SELECT c.email FROM customer c
 pagila-# JOIN address a ON c.address_id = a.address_id
 pagila-# WHERE a.district = 'California';
                               QUERY PLAN
 ----------------------------------------------------------------------
  Hash Join  (cost=15.65..32.22 rows=9 width=32)
    Hash Cond: (c.address_id = a.address_id)
    ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
    ->  Hash  (cost=15.54..15.54 rows=9 width=4)
          ->  Seq Scan on address a  (cost=0.00..15.54 rows=9 width=4)
                Filter: (district = 'California'::text)
 (6 rows)

Let's see what a regular index gives us:

 pagila=# CREATE INDEX idx_address1 ON address(district);
 CREATE INDEX
 pagila=# EXPLAIN SELECT c.email FROM customer c
 pagila-# JOIN address a ON c.address_id = a.address_id
 pagila-# WHERE a.district = 'California';
                                       QUERY PLAN
 ---------------------------------------------------------------------------------------
  Hash Join  (cost=12.98..29.55 rows=9 width=32)
    Hash Cond: (c.address_id = a.address_id)
    ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
    ->  Hash  (cost=12.87..12.87 rows=9 width=4)
          ->  Bitmap Heap Scan on address a  (cost=4.34..12.87 rows=9 width=4)
                Recheck Cond: (district = 'California'::text)
                ->  Bitmap Index Scan on idx_address1  (cost=0.00..4.34 rows=9 width=0)
                      Index Cond: (district = 'California'::text)
 (8 rows)

The sequential scan of address was replaced by a scan of the idx_address1 index, followed by a scan of the address heap.

Since this is a frequent query that needs to be optimized, we can use a partial index, which indexes only those address rows where the district is 'California':

 pagila=# CREATE INDEX idx_address2 ON address(address_id) WHERE district='California';
 CREATE INDEX
 pagila=# EXPLAIN SELECT c.email FROM customer c
 pagila-# JOIN address a ON c.address_id = a.address_id
 pagila-# WHERE a.district = 'California';
                                            QUERY PLAN
 ------------------------------------------------------------------------------------------------
  Hash Join  (cost=12.38..28.96 rows=9 width=32)
    Hash Cond: (c.address_id = a.address_id)
    ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
    ->  Hash  (cost=12.27..12.27 rows=9 width=4)
          ->  Index Only Scan using idx_address2 on address a  (cost=0.14..12.27 rows=9 width=4)
 (5 rows)

Now the query reads only idx_address2 and does not touch the address table.

Using Multi-Value Indexes


A column that needs indexing may not hold a scalar data type. Types such as jsonb, arrays, and tsvector are composite: they contain multiple component values. To index such columns, you usually need to be able to search through the individual values within them.

Let's try to find the titles of all films that include behind-the-scenes footage. The film table has a text array column called special_features; if a film has this feature, the array contains the element Behind The Scenes. To find all such films, we need to select all rows where "Behind The Scenes" occurs among the values of the special_features array:

 SELECT title FROM film WHERE special_features @> '{"Behind The Scenes"}'; 

The containment operator @> checks whether the right-hand side is a subset of the left-hand side.
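To get a feel for @>, you can try it on array literals directly, independently of the pagila data:

```sql
-- The containment operator returns true when every element of the
-- right-hand array is present in the left-hand array.
SELECT ARRAY['Trailers','Behind The Scenes'] @> ARRAY['Behind The Scenes'];  -- true
SELECT ARRAY['Trailers'] @> ARRAY['Behind The Scenes'];                      -- false
```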

The query plan:

 pagila=# EXPLAIN SELECT title FROM film
 pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                            QUERY PLAN
 -----------------------------------------------------------------
  Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
    Filter: (special_features @> '{"Behind The Scenes"}'::text[])
 (2 rows)

It performs a full heap scan, at a cost of 67.50.

Let's see if a regular B-tree index will help us:

 pagila=# CREATE INDEX idx_film1 ON film(special_features);
 CREATE INDEX
 pagila=# EXPLAIN SELECT title FROM film
 pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                            QUERY PLAN
 -----------------------------------------------------------------
  Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
    Filter: (special_features @> '{"Behind The Scenes"}'::text[])
 (2 rows)

The index was not even considered: a B-tree index is unaware of the individual elements inside the composite values it indexes.

We need a GIN index:

 pagila=# CREATE INDEX idx_film2 ON film USING GIN(special_features);
 CREATE INDEX
 pagila=# EXPLAIN SELECT title FROM film
 pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                                 QUERY PLAN
 ---------------------------------------------------------------------------
  Bitmap Heap Scan on film  (cost=8.04..23.58 rows=5 width=15)
    Recheck Cond: (special_features @> '{"Behind The Scenes"}'::text[])
    ->  Bitmap Index Scan on idx_film2  (cost=0.00..8.04 rows=5 width=0)
          Index Cond: (special_features @> '{"Behind The Scenes"}'::text[])
 (4 rows)

A GIN index supports matching individual elements against the indexed composite values; as a result, the cost of the query plan dropped to less than half of the sequential scan's.

Eliminating Duplicate Indexes


Indexes accumulate over time, and sometimes a newly added index has the same definition as an existing one. You can use the pg_indexes catalog view to get human-readable SQL definitions of indexes, which also makes it easy to find identical definitions:

 SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
 FROM pg_indexes
 GROUP BY defn
 HAVING count(*) > 1;

And here's the result when run on the stock pagila database:

 pagila=# SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
 pagila-# FROM pg_indexes
 pagila-# GROUP BY defn
 pagila-# HAVING count(*) > 1;
                                 indexes                                 |                               defn
 ------------------------------------------------------------------------+------------------------------------------------------------------
  {payment_p2017_01_customer_id_idx,idx_fk_payment_p2017_01_customer_id} | CREATE INDEX  ON public.payment_p2017_01 USING btree (customer_id
  {payment_p2017_02_customer_id_idx,idx_fk_payment_p2017_02_customer_id} | CREATE INDEX  ON public.payment_p2017_02 USING btree (customer_id
  {payment_p2017_03_customer_id_idx,idx_fk_payment_p2017_03_customer_id} | CREATE INDEX  ON public.payment_p2017_03 USING btree (customer_id
  {idx_fk_payment_p2017_04_customer_id,payment_p2017_04_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_04 USING btree (customer_id
  {payment_p2017_05_customer_id_idx,idx_fk_payment_p2017_05_customer_id} | CREATE INDEX  ON public.payment_p2017_05 USING btree (customer_id
  {idx_fk_payment_p2017_06_customer_id,payment_p2017_06_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_06 USING btree (customer_id
 (6 rows)

Superset Indexes


It may happen that you accumulate many indexes, one of which covers a superset of the columns indexed by others. This may or may not be desirable: the superset index can enable index-only scans, which is good, but it may take up too much space, or the query it was meant to optimize may no longer be in use.

If you need to automate the detection of such indexes, the pg_index table in pg_catalog is a good starting point.
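As a rough sketch (not production-ready: it ignores expression indexes, partial-index predicates, and operator classes), you can compare the indexed column lists in pg_index and flag indexes whose columns form a leading prefix of another index on the same table:

```sql
-- Find pairs of indexes on the same table where one index's column list
-- is a leading prefix of the other's. indkey is an int2vector; cast to
-- text it becomes a space-separated list of column numbers, so a prefix
-- match on that string approximates a column-prefix match.
-- Exact duplicates will be reported in both directions.
SELECT a.indexrelid::regclass AS contained,
       b.indexrelid::regclass AS superset
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid            -- same table
 AND a.indexrelid <> b.indexrelid       -- different indexes
WHERE (b.indkey::text || ' ') LIKE (a.indkey::text || ' %');
```

The trailing-space trick prevents column number 1 from spuriously matching column number 12; treat any hits as candidates for manual review rather than automatic deletion.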

Unused indexes


As the applications that use a database evolve, so do the queries they run. Indexes added earlier may no longer be used by any query. Each time an index is scanned, the statistics collector notes it, and the pg_stat_user_indexes system view exposes the idx_scan value, a cumulative counter of these scans. Tracking this value over a period of time (say, a month) gives a good idea of which indexes are unused and can be dropped.

Here is a query to get the current scan counters for all indexes in the 'public' schema:

 SELECT relname, indexrelname, idx_scan
 FROM pg_catalog.pg_stat_user_indexes
 WHERE schemaname = 'public';

with output like this:

 pagila=# SELECT relname, indexrelname, idx_scan
 pagila-# FROM pg_catalog.pg_stat_user_indexes
 pagila-# WHERE schemaname = 'public'
 pagila-# LIMIT 10;
     relname    |    indexrelname    | idx_scan
 ---------------+--------------------+----------
  customer      | customer_pkey      |    32093
  actor         | actor_pkey         |     5462
  address       | address_pkey       |      660
  category      | category_pkey      |     1000
  city          | city_pkey          |      609
  country       | country_pkey       |      604
  film_actor    | film_actor_pkey    |        0
  film_category | film_category_pkey |        0
  film          | film_pkey          |    11043
  inventory     | inventory_pkey     |    16048
 (10 rows)

Rebuilding indexes with fewer locks


Indexes often have to be re-created, for example when they become bloated, and a rebuilt index can speed up scans. Indexes can also become corrupted. Changing an index's parameters may likewise require re-creating it.
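One low-lock rebuild pattern (a sketch using the idx_cust1 index from earlier; adapt the names to your schema) is to build a fresh replacement concurrently, then swap it in:

```sql
-- Build a fresh copy of the bloated index without blocking writes.
CREATE INDEX CONCURRENTLY idx_cust1_new ON customer(active);

-- Swap it in: DROP INDEX CONCURRENTLY avoids blocking concurrent
-- queries, and the rename needs only a brief lock.
DROP INDEX CONCURRENTLY idx_cust1;
ALTER INDEX idx_cust1_new RENAME TO idx_cust1;
```

Note that indexes backing UNIQUE or PRIMARY KEY constraints need more care than a plain drop-and-rename, and PostgreSQL 12 adds REINDEX INDEX CONCURRENTLY, which performs this whole dance in a single command.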

Enable parallel index creation


In PostgreSQL 11, B-tree index builds can be parallelized: several parallel workers can be used to speed up index creation. However, make sure these configuration parameters are set appropriately:

 SET max_parallel_workers = 32;
 SET max_parallel_maintenance_workers = 16;

The default values are quite small. Ideally, these numbers should increase along with the number of CPU cores. Read more in the documentation.

Background index creation


You can create an index in the background using the CONCURRENTLY option of the CREATE INDEX command:

 pagila=# CREATE INDEX CONCURRENTLY idx_address1 ON address(district);
 CREATE INDEX


This way of building an index differs from the regular one in that it does not take a lock that blocks writes, so the table remains available for updates during the build. On the other hand, it takes longer and consumes more resources. Be aware that if a concurrent build fails, it leaves behind an invalid index that must be dropped manually.

Postgres offers many flexible options for creating indexes and handling particular cases, along with ways to manage the database through explosive growth of your application. We hope these tips help you keep your queries fast and your database ready to scale.

Source: https://habr.com/ru/post/453046/
