
Preparing full-text search in Postgres. Part 2

In the previous article, we optimized search in PostgreSQL using the standard tools. In this article, we will continue optimizing with the RUM index and weigh its pros and cons compared to GIN.


Introduction


RUM is a Postgres extension providing a new index type for full-text search. Its key feature is that an index scan can return results already sorted by relevance. I won't cover installation here; it is described in the README in the repository.


Using the index


The index is created much like a GIN index, just with a few extra parameters. The full list of parameters can be found in the documentation.


CREATE INDEX idx_rum_document ON documents_documentvector USING rum ("text" rum_tsvector_ops); 
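Besides the basic rum_tsvector_ops, the extension provides operator classes that store an additional column inside the index, so that results can also be ordered by it straight from the index. A sketch based on the extension's README, assuming a hypothetical created_at timestamp column:

```sql
-- Sketch, assuming a hypothetical "created_at" timestamp column.
-- rum_tsvector_addon_ops stores the attached column in the index,
-- letting ORDER BY use it without a separate sort step.
CREATE INDEX idx_rum_document_date ON documents_documentvector
    USING rum ("text" rum_tsvector_addon_ops, created_at)
    WITH (attach = 'created_at', to = 'text');
```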

Search query for RUM:


 SELECT document_id, "text" <=> plainto_tsquery('') AS rank
 FROM documents_documentvector
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank;

Query for GIN:
 SELECT document_id, ts_rank("text", plainto_tsquery('')) AS rank
 FROM documents_documentvector
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank DESC;

The difference from GIN is that relevance is computed not with the ts_rank function but with a query using the <=> operator: "text" <=> plainto_tsquery('') . This returns a distance between the search vector and the search query: the smaller it is, the better the vector matches the query.
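For illustration, the operator can be tried directly on ad-hoc values (assuming the rum extension is installed; the exact distance value depends on the ranking internals):

```sql
-- Returns a float distance: the smaller the value, the more relevant
-- the vector is to the query.
SELECT to_tsvector('english', 'postgres full text search')
       <=> plainto_tsquery('english', 'postgres search') AS distance;
```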


Comparison with GIN


We will compare them on a test database of ~500 thousand documents, large enough to make the differences in search behavior visible.


Query speed


Let's see what EXPLAIN produces for GIN on this database:


  Gather Merge (actual time=563.840..611.844 rows=119553 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    ->  Sort (actual time=553.427..557.857 rows=39851 loops=3)
          Sort Key: (ts_rank(text, plainto_tsquery(''::text)))
          Sort Method: external sort  Disk: 1248kB
          ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=13.402..538.879 rows=39851 loops=3)
                Recheck Cond: (text @@ plainto_tsquery(''::text))
                Heap Blocks: exact=5616
                ->  Bitmap Index Scan on idx_gin_document (actual time=12.144..12.144 rows=119553 loops=1)
                      Index Cond: (text @@ plainto_tsquery(''::text))
  Planning time: 4.573 ms
  Execution time: 617.534 ms

And for RUM?


  Sort (actual time=1668.573..1676.168 rows=119553 loops=1)
    Sort Key: ((text <=> plainto_tsquery(''::text)))
    Sort Method: external merge  Disk: 3520kB
    ->  Bitmap Heap Scan on documents_documentvector (actual time=16.706..1605.382 rows=119553 loops=1)
          Recheck Cond: (text @@ plainto_tsquery(''::text))
          Heap Blocks: exact=15599
          ->  Bitmap Index Scan on idx_rum_document (actual time=14.548..14.548 rows=119553 loops=1)
                Index Cond: (text @@ plainto_tsquery(''::text))
  Planning time: 0.650 ms
  Execution time: 1679.315 ms

What's this? What good is this much-praised RUM, you ask, if it runs three times slower than GIN? And where is that famous sorting inside the index?


Easy: let's try adding LIMIT 1000 to the query.


EXPLAIN for RUM
  Limit (actual time=115.568..137.313 rows=1000 loops=1)
    ->  Index Scan using idx_rum_document on documents_documentvector (actual time=115.567..137.239 rows=1000 loops=1)
          Index Cond: (text @@ plainto_tsquery('query'::text))
          Order By: (text <=> plainto_tsquery('query'::text))
  Planning time: 0.481 ms
  Execution time: 137.678 ms

EXPLAIN for GIN
  Limit (actual time=579.905..585.650 rows=1000 loops=1)
    ->  Gather Merge (actual time=579.904..585.604 rows=1000 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Sort (actual time=574.061..574.171 rows=992 loops=3)
                Sort Key: (ts_rank(text, plainto_tsquery('query'::text))) DESC
                Sort Method: external merge  Disk: 1224kB
                ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=8.920..555.571 rows=39851 loops=3)
                      Recheck Cond: (text @@ plainto_tsquery('query'::text))
                      Heap Blocks: exact=5422
                      ->  Bitmap Index Scan on idx_gin_document (actual time=8.945..8.945 rows=119553 loops=1)
                            Index Cond: (text @@ plainto_tsquery('query'::text))
  Planning time: 0.223 ms
  Execution time: 585.948 ms

~150 ms vs ~600 ms! Not in GIN's favor anymore, is it? And the sorting has moved inside the index!


And with LIMIT 100?


EXPLAIN for RUM
  Limit (actual time=105.863..108.530 rows=100 loops=1)
    ->  Index Scan using idx_rum_document on documents_documentvector (actual time=105.862..108.517 rows=100 loops=1)
          Index Cond: (text @@ plainto_tsquery('query'::text))
          Order By: (text <=> plainto_tsquery('query'::text))
  Planning time: 0.199 ms
  Execution time: 108.958 ms

EXPLAIN for GIN
  Limit (actual time=582.924..588.351 rows=100 loops=1)
    ->  Gather Merge (actual time=582.923..588.344 rows=100 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Sort (actual time=573.809..573.889 rows=806 loops=3)
                Sort Key: (ts_rank(text, plainto_tsquery('query'::text))) DESC
                Sort Method: external merge  Disk: 1224kB
                ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=18.038..552.827 rows=39851 loops=3)
                      Recheck Cond: (text @@ plainto_tsquery('query'::text))
                      Heap Blocks: exact=5275
                      ->  Bitmap Index Scan on idx_gin_document (actual time=16.541..16.541 rows=119553 loops=1)
                            Index Cond: (text @@ plainto_tsquery('query'::text))
  Planning time: 0.487 ms
  Execution time: 588.583 ms

The difference is even more noticeable.


The point is that GIN, no matter how many rows you ultimately fetch, has to walk through every row that matches the query and rank each one. RUM does this only for the rows we actually need. If we need many rows, GIN wins: its ts_rank is computed more efficiently than the <=> operator. But on small result sets, RUM's advantage is undeniable.


Most of the time, the user does not need all 50 thousand matching documents pulled from the database at once. They only need 10 results on the first page, then the second, the third, and so on. This is exactly the case this index is tuned for, and it will give a good performance boost for search over a large database.
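A typical pagination query of this kind might look as follows (the search term 'query' stands in for a real one):

```sql
-- Page 3 with 10 results per page: RUM only has to produce the
-- first OFFSET + LIMIT rows in relevance order, not rank everything.
SELECT document_id,
       "text" <=> plainto_tsquery('query') AS rank
FROM documents_documentvector
WHERE "text" @@ plainto_tsquery('query')
ORDER BY rank
LIMIT 10 OFFSET 20;
```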


Tolerance to joins


What if the search query needs to join one or more other tables? For example, to show the document's type or owner in the results, or, as in my case, to filter by the names of related entities?


Compare:


Query with two joins for GIN
 SELECT document_id,
        ts_rank("text", plainto_tsquery('')) AS rank,
        case_number
 FROM documents_documentvector
 RIGHT JOIN documents_document
     ON documents_documentvector.document_id = documents_document.id
 LEFT JOIN documents_case
     ON documents_document.case_id = documents_case.id
 WHERE "text" @@ plainto_tsquery('')
 ORDER BY rank DESC
 LIMIT 10;

Result:


  Limit (actual time=1637.902..1643.483 rows=10 loops=1)
    ->  Gather Merge (actual time=1637.901..1643.479 rows=10 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Sort (actual time=1070.614..1070.687 rows=652 loops=3)
                Sort Key: (ts_rank(documents_documentvector.text, plainto_tsquery('query'::text))) DESC
                Sort Method: external merge  Disk: 2968kB
                ->  Hash Left Join (actual time=323.386..1049.092 rows=39851 loops=3)
                      Hash Cond: (documents_document.case_id = documents_case.id)
                      ->  Hash Join (actual time=239.312..324.797 rows=39851 loops=3)
                            Hash Cond: (documents_documentvector.document_id = documents_document.id)
                            ->  Parallel Bitmap Heap Scan on documents_documentvector (actual time=11.022..37.073 rows=39851 loops=3)
                                  Recheck Cond: (text @@ plainto_tsquery('query'::text))
                                  Heap Blocks: exact=9362
                                  ->  Bitmap Index Scan on idx_gin_document (actual time=12.094..12.094 rows=119553 loops=1)
                                        Index Cond: (text @@ plainto_tsquery('query'::text))
                            ->  Hash (actual time=227.856..227.856 rows=472089 loops=3)
                                  Buckets: 65536  Batches: 16  Memory Usage: 2264kB
                                  ->  Seq Scan on documents_document (actual time=0.009..147.104 rows=472089 loops=3)
                      ->  Hash (actual time=83.338..83.338 rows=273695 loops=3)
                            Buckets: 65536  Batches: 8  Memory Usage: 2602kB
                            ->  Seq Scan on documents_case (actual time=0.009..39.082 rows=273695 loops=3)
  Planning time: 0.857 ms
  Execution time: 1644.028 ms

With three or more joins, query time reaches 2-3 seconds and grows with the number of joins.


And what about RUM? Let's go straight to a query with five joins.


Query with five joins for RUM
 SELECT document_id,
        "text" <=> plainto_tsquery('') AS rank,
        case_number,
        classifier_procedure.title,
        classifier_division.title,
        classifier_category.title
 FROM documents_documentvector
 RIGHT JOIN documents_document
     ON documents_documentvector.document_id = documents_document.id
 LEFT JOIN documents_case
     ON documents_document.case_id = documents_case.id
 LEFT JOIN classifier_procedure
     ON documents_case.procedure_id = classifier_procedure.id
 LEFT JOIN classifier_division
     ON documents_case.division_id = classifier_division.id
 LEFT JOIN classifier_category
     ON documents_document.category_id = classifier_category.id
 WHERE "text" @@ plainto_tsquery('')
   AND documents_document.is_active IS TRUE
 ORDER BY rank
 LIMIT 10;

Result:


  Limit (actual time=70.524..72.292 rows=10 loops=1)
    ->  Nested Loop Left Join (actual time=70.521..72.279 rows=10 loops=1)
          ->  Nested Loop Left Join (actual time=70.104..70.406 rows=10 loops=1)
                ->  Nested Loop Left Join (actual time=70.089..70.351 rows=10 loops=1)
                      ->  Nested Loop Left Join (actual time=70.073..70.302 rows=10 loops=1)
                            ->  Nested Loop (actual time=70.052..70.201 rows=10 loops=1)
                                  ->  Index Scan using document_vector_rum_index on documents_documentvector (actual time=70.001..70.035 rows=10 loops=1)
                                        Index Cond: (text @@ plainto_tsquery('query'::text))
                                        Order By: (text <=> plainto_tsquery('query'::text))
                                  ->  Index Scan using documents_document_pkey on documents_document (actual time=0.013..0.013 rows=1 loops=10)
                                        Index Cond: (id = documents_documentvector.document_id)
                                        Filter: (is_active IS TRUE)
                            ->  Index Scan using documents_case_pkey on documents_case (actual time=0.009..0.009 rows=1 loops=10)
                                  Index Cond: (documents_document.case_id = id)
                      ->  Index Scan using classifier_procedure_pkey on classifier_procedure (actual time=0.003..0.003 rows=1 loops=10)
                            Index Cond: (documents_case.procedure_id = id)
                ->  Index Scan using classifier_division_pkey on classifier_division (actual time=0.004..0.004 rows=1 loops=10)
                      Index Cond: (documents_case.division_id = id)
          ->  Index Scan using classifier_category_pkey on classifier_category (actual time=0.003..0.003 rows=1 loops=10)
                Index Cond: (documents_document.category_id = id)
  Planning time: 2.861 ms
  Execution time: 72.865 ms

If your search can't do without joins, RUM is clearly the right choice.


Disk space


On a test database of ~500 thousand documents totaling 3.6 GB, the two indexes occupy very different amounts of space:


  idx_rum_document |  1950 MB
  idx_gin_document |  418 MB
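Sizes like these can be checked with the standard size functions:

```sql
-- pg_relation_size reports the on-disk size of each index;
-- pg_size_pretty formats it in human-readable units.
SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('idx_rum_document', 'idx_gin_document');
```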

Yes, disk space is cheap, but 2 GB instead of 400 MB is hard to be happy about. Half the size of the database is a bit much for an index. Here GIN wins unconditionally.


Conclusions


You need RUM if:

- you sort search results by relevance and fetch them page by page (small LIMITs);
- your search queries involve joins with other tables.

You are completely satisfied with GIN if:

- you fetch all matching rows at once anyway;
- index size on disk matters to you.
I hope this article removes a lot of the WTF?! moments that come up when working with and configuring search in Postgres. I'd be glad to hear advice from those who know how to set it all up even better!


In the next parts, I plan to cover RUM in my project in more detail: using additional RUM options and working with Django + PostgreSQL.



Source: https://habr.com/ru/post/443368/

