PostgreSQL Indexes - 8

We have already reviewed the PostgreSQL indexing mechanism , the access method interface and all the main access methods, such as: hash indices , B-trees , GiST , SP-GiST and GIN . And in this part we will look at the transformation of gin into rum.

RUM

Although the authors claim that gin is a powerful spirit, but the topic of drinks still won: the next-generation GIN was called RUM.

This access method develops the idea embedded in the GIN and allows you to perform full-text search even faster. This is the only method in this series of articles that is not included in the standard PostgreSQL distribution and is a third-party extension. There are several options for installing it:
')

Take the yum or apt package from the PGDG repository . For example, if you installed PostgreSQL from the postgresql-10 package, then add another postgresql-10-rum.
Independently build and install from source codes on github (instruction there).
Use as part of Postgres Pro Enterprise (or at least read the documentation from there).

GIN restrictions

What are the limitations of the GIN index to overcome RUM?

First, the tsvector data type, in addition to the tokens themselves, contains information about their positions within the document. In the GIN-index, as we saw last time , this information is not saved. Because of this, the phrasal search operations , which appeared in version 9.6, are serviced by the GIN-index inefficiently and are forced to refer to the initial data for rechecking.

Secondly, search engines typically return results in order of relevance (whatever that means). To do this, you can use the ranking functions ts_rank and ts_rank_cd, but they have to be calculated for each row of the result, which, of course, is slow.

As a first approximation, the RUM access method can be viewed as a GIN, to which positional information has been added, and which supports outputting the result in the desired order (similar to how GiST can produce the nearest neighbors). Let's go in order.

Phrase Search

A query in a full-text search may contain special constructions that take into account the distance between the tokens. For example, you can find documents in which the grandmother from Dedki separates another word:

postgres=# select to_tsvector(' , ...') @@
to_tsquery(' <2> ');
?column?
----------
t
(1 row)

Or indicate that words should stand next to each other:

postgres=# select to_tsvector(' , ...') @@
to_tsquery(' <-> ');
?column?
----------
t
(1 row)

A regular GIN index can produce documents that have both tokens, but you can check the distance between them only by looking at the tsvector:

postgres=# select to_tsvector(' , ...');
to_tsvector
------------------------------
'':1 '':3,4 '':6
(1 row)

In the RUM index, each token does not simply refer to the rows of the table: along with each TID, there is a list of positions in which the token is found in the document. Here's how to imagine an index created on a table already familiar to us with a white birch tree (the default for the tsvector is the rum_tsvector_ops operator class):

postgres=# create extension rum;
CREATE EXTENSION
postgres=# create index on ts using rum(doc_tsv);
CREATE INDEX

Gray squares in the figure - added positional information:

postgres=# select ctid, doc, doc_tsv from ts;
ctid | doc | doc_tsv
--------+-------------------------+--------------------------------
(0,1) | | '':3 '':2 '':4
(0,2) | | '':3 '':2 '':4
(0,3) | , , | '':1,2 '':3
(0,4) | , , | '':1,2 '':3
(1,1) | | '':2 '':3 '':1
(1,2) | | '':3 '':2 '':1
(1,3) | , , | '':3 '':1,2
(1,4) | , , | '':3 '':1,2
(2,1) | | '':3 '':2
(2,2) | | '':1 '':2 '':3
(2,3) | , , | '':3 '':1,2
(2,4) | , , | '':3 '':1,2
(12 rows)

The GIN still has a delayed insert when you specify the fastupdate parameter; RUM has removed this functionality.

To see how the index works on real data, we use the pgsql-hackers mailing list known to us.

fts=# alter table mail_messages add column tsv tsvector;
ALTER TABLE
fts=# set default_text_search_config = default;
SET
fts=# update mail_messages
set tsv = to_tsvector(body_plain);
...
UPDATE 356125

Here is how a query is executed using a phrase search with a GIN index:

fts=# create index tsv_gin on mail_messages using gin(tsv);
CREATE INDEX
fts=# explain (costs off, analyze)
select * from mail_messages where tsv @@ to_tsquery('hello <-> hackers');
QUERY PLAN
---------------------------------------------------------------------------------
Bitmap Heap Scan on mail_messages (actual time=2.490..18.088 rows=259 loops=1)
Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
Rows Removed by Index Recheck: 1517
Heap Blocks: exact=1503
-> Bitmap Index Scan on tsv_gin (actual time=2.204..2.204 rows=1776 loops=1)
Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
Planning time: 0.266 ms
Execution time: 18.151 ms
(8 rows)

As can be seen from the plan, the GIN-index is used, but returns 1776 potential matches, of which 259 remain, and 1517 are discarded during the recheck phase.

Now remove the GIN-index and build RUM.

fts=# drop index tsv_gin;
DROP INDEX
fts=# create index tsv_rum on mail_messages using rum(tsv);
CREATE INDEX

Now the index has all the necessary information and the search is performed exactly:

fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello <-> hackers');
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on mail_messages (actual time=2.798..3.015 rows=259 loops=1)
Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
Heap Blocks: exact=250
-> Bitmap Index Scan on tsv_rum (actual time=2.768..2.768 rows=259 loops=1)
Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
Planning time: 0.245 ms
Execution time: 3.053 ms
(7 rows)

Sort by relevance

In order to issue documents at once in the right order, the RUM index supports the ordering operators, which we discussed in the GiST part. The rum extension defines such an operator <=> that returns some distance between the document (tsvector) and the query (tsquery). For example:

fts=# select to_tsvector(' , ...') <=> to_tsquery('');
?column?
----------
16.4493
(1 row)

fts=# select to_tsvector(' , ...') <=> to_tsquery('');
?column?
----------
13.1595
(1 row)

The document turned out to be more relevant to the first request than to the second: the more often the word appears in the document, the less “valuable” it is.

Again, try to compare GIN and RUM on a relatively large amount of data: select the ten most relevant documents containing “hello” and “hackers”.

fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello & hackers')
order by ts_rank(tsv,to_tsquery('hello & hackers'))
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------
Limit (actual time=27.076..27.078 rows=10 loops=1)
-> Sort (actual time=27.075..27.076 rows=10 loops=1)
Sort Key: (ts_rank(tsv, to_tsquery('hello & hackers'::text)))
Sort Method: top-N heapsort Memory: 29kB
-> Bitmap Heap Scan on mail_messages (actual ... rows=1776 loops=1)
Recheck Cond: (tsv @@ to_tsquery('hello & hackers'::text))
Heap Blocks: exact=1503
-> Bitmap Index Scan on tsv_gin (actual ... rows=1776 loops=1)
Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))
Planning time: 0.276 ms
Execution time: 27.121 ms
(11 rows)

GIN-index returns 1776 matches, which are then separately sorted for a sample of the ten most appropriate.

With the RUM index, the query is performed by a simple index scan: no unnecessary documents are scanned, no separate sorting is required:

fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello & hackers')
order by tsv <=> to_tsquery('hello & hackers')
limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit (actual time=5.083..5.171 rows=10 loops=1)
-> Index Scan using tsv_rum on mail_messages (actual ... rows=10 loops=1)
Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))
Order By: (tsv <=> to_tsquery('hello & hackers'::text))
Planning time: 0.244 ms
Execution time: 5.207 ms
(6 rows)

Additional Information

The RUM index, like the GIN, can be built in several fields. But if in the GIN, the tokens of different columns are stored independently of each other, then RUM allows you to "link" the main field (tsvector in our case) with an additional one. To do this, use the special class of rum_tsvector_addon_ops operators:

fts=# create index on mail_messages using rum(tsv rum_tsvector_addon_ops, sent)
with (attach='sent', to='tsv');
CREATE INDEX

Such an index can be used to produce results in the sort order by the additional field:

fts=# select id, sent, sent <=> '2017-01-01 15:00:00'
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
id | sent | ?column?
---------+---------------------+----------
2298548 | 2017-01-01 15:03:22 | 202
2298547 | 2017-01-01 14:53:13 | 407
2298545 | 2017-01-01 13:28:12 | 5508
2298554 | 2017-01-01 18:30:45 | 12645
2298530 | 2016-12-31 20:28:48 | 66672
2298587 | 2017-01-02 12:39:26 | 77966
2298588 | 2017-01-02 12:43:22 | 78202
2298597 | 2017-01-02 13:48:02 | 82082
2298606 | 2017-01-02 15:50:50 | 89450
2298628 | 2017-01-02 18:55:49 | 100549
(10 rows)

Here we are looking for suitable lines, located as close as possible to the specified date, no matter, sooner or later. To get the results strictly preceding the date (or following it), you need to use the operation <=| (or |=> ).

The query, as we expect, is performed by a simple index scan:

ts=# explain (costs off)
select id, sent, sent <=> '2017-01-01 15:00:00'
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------
Limit
-> Index Scan using mail_messages_tsv_sent_idx on mail_messages
Index Cond: (tsv @@ to_tsquery('hello'::text))
Order By: (sent <=> '2017-01-01 15:00:00'::timestamp without time zone)
(4 rows)

If we created an index without additional information about the connection of fields, then for a similar query we would have to sort all the results obtained from the index.

Of course, in addition to the date, you can add fields and other data types to the RUM index — almost all base types are supported. For example, an online store can quickly display products by newness (date), price (numeric), popularity, or discount size (whole or floating point).

Other classes of operators

For completeness, it is worth mentioning about other available classes of operators.

Let's start with rum_tsvector_hash_ops and rum_tsvector_hash_addon_ops. They are all similar to those already discussed above rum_tsvector_ops and rum_tsvector_addon_ops, but the lexeme itself is not stored in the index, but its hash code. This may reduce the size of the index, but, of course, makes the search less accurate and requires rechecking. In addition, the index no longer supports the search for partial matches.

The class of rum_tsquery_ops operators is curious. It allows you to solve the "reverse" task: to find requests that match the document. Why this may be needed? For example, to sign a user for new products by his filter. Or automatically classify new documents. Here is a simple example:

fts=# create table categories(query tsquery, category text);
CREATE TABLE
fts=# insert into categories values
(to_tsquery('vacuum | autovacuum | freeze'), 'vacuum'),
(to_tsquery('xmin | xmax | snapshot | isolation'), 'mvcc'),
(to_tsquery('wal | (write & ahead & log) | durability'), 'wal');
INSERT 0 3
fts=# create index on categories using rum(query);
CREATE INDEX

fts=# select array_agg(category)
from categories
where to_tsvector(
'Hello hackers, the attached patch greatly improves performance of tuple
freezing and also reduces size of generated write-ahead logs.'
) @@ query;
array_agg
--------------
{vacuum,wal}
(1 row)

There are classes of operators rum_anyarray_ops and rum_anyarray_addon_ops - they are designed to work not with tsvector, but with arrays. For GIN, this has already been considered last time , so there is no reason to repeat.

Index size and prerecord log

It is clear that, since RUM contains more information than the GIN, then it will take up more space. Last time we compared the sizes of different indices; add to this table and RUM:

rum | gin | gist | btree
--------+--------+--------+--------
457 MB | 179 MB | 125 MB | 546 MB

As you can see, the volume has grown quite significantly - this is the fee for a quick search.

Another non-obvious point that is worth paying attention to is that RUM is an extension, that is, it can be installed without making any changes to the core of the system. This was made possible in version 9.6 thanks to the patch that was made by Alexander Korotkov . One of the tasks that had to be solved was the generation of journal entries. The journaling mechanism must be absolutely reliable, so the extension should not be allowed into this kitchen. Instead of allowing the extension to create its own types of journal entries, it’s done like this: the extension code reports the intention to change the page, makes any changes to it and signals completion, and the kernel of the system compares the old and new versions of the page and generates the necessary unified journal entries records

The current generation algorithm compares pages byte-by-by, finds changed fragments and logs each such fragment along with the offset from the beginning of the page. This works well when changing only a few bytes, and when the page has changed completely. But if you add some fragment inside the page, sliding the rest of the content down (or, on the contrary, removing the fragment, sliding the content up), many more bytes will be formally changed than was actually added or deleted.

Because of this, the actively changing RUM-index can generate log entries that are significantly larger than the GIN (which, being not an extension, but a part of the kernel, manages the journal itself). The degree of this unpleasant effect strongly depends on the real load, but in order to somehow feel the problem, let's try to remove and add a few lines several times, alternating these actions with cleaning (vacuum). You can estimate the size of the log entries as follows: at the beginning and at the end, you can remember the position in the log by the function pg_current_wal_location (up to the tenth version - pg_current_xlog_location) and then look at their difference.

Here, of course, many factors must be kept in mind. You need to make sure that only one user is working in the system, otherwise “extra” entries will be included in the calculation. Even in this case, we consider not only RUM, but also changes to the table itself and the index supporting the primary key. The values of the configuration parameters also affect (here the replica log level was used, without compression). But still try.

fts=# select pg_current_wal_location() as start_lsn \gset

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv
from mail_messages where id % 100 = 0;
INSERT 0 3576
fts=# delete from mail_messages where id % 100 = 99;
DELETE 3590
fts=# vacuum mail_messages;
VACUUM

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv
from mail_messages where id % 100 = 1;
INSERT 0 3605
fts=# delete from mail_messages where id % 100 = 98;
DELETE 3637
fts=# vacuum mail_messages;
VACUUM

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv from mail_messages
where id % 100 = 2;
INSERT 0 3625
fts=# delete from mail_messages where id % 100 = 97;
DELETE 3668
fts=# vacuum mail_messages;
VACUUM

fts=# select pg_current_wal_location() as end_lsn \gset
fts=# select pg_size_pretty(:'end_lsn'::pg_lsn - :'start_lsn'::pg_lsn);
pg_size_pretty
----------------
3114 MB
(1 row)

So, it turned out about 3 GB. And if the same experiment is repeated with the GIN index, there will be only about 700 MB.

Therefore, I would like to have another algorithm that finds the minimum number of insert and delete operations, with which one page state can be brought to another - in the same way as the diff utility works. This algorithm has already been implemented by Oleg Ivanov , his patch is being discussed. In the above example, this patch, at the cost of a slight slowdown, allows reducing the volume of journal entries by one and a half times, to 1900 MB.

Properties

Traditionally, we look at the properties of the rum access method (requests were cited earlier ), drawing attention to the differences from gin.

Method properties:

amname | name | pg_indexam_has_property
--------+---------------+-------------------------
rum | can_order | f
rum | can_unique | f
rum | can_multi_col | t
rum | can_exclude | t -- f gin

Index properties:

name | pg_index_has_property
---------------+-----------------------
clusterable | f
index_scan | t -- f gin
bitmap_scan | t
backward_scan | f

Note that RUM, in contrast to GIN, supports index scanning - otherwise it would be impossible to get exactly the required number of results in queries with the phrase limit. Accordingly, there is no need for an analogue of the gin_fuzzy_search_limit parameter. Well, as a result, the index can be used to support exception constraints.

Column level properties:

name | pg_index_column_has_property
--------------------+------------------------------
asc | f
desc | f
nulls_first | f
nulls_last | f
orderable | f
distance_orderable | t -- f gin
returnable | f
search_array | f
search_nulls | f

Here the difference is that RUM supports ordering operators. Although not for all classes of operators: for example, for tsquery_ops it will be false.

Continued .

Source: https://habr.com/ru/post/343488/

All Articles