
Searching for misspelled names in PostgreSQL

It all started when I needed to develop a patient search for an internal medical system. The logic was that if a person was not found in the system, they had to be created (and duplicate patients must not be produced). One of the subtasks was therefore a people search that tolerates typos in names. And since I love PostgreSQL (and when all you have is a hammer, everything looks like a nail), it is not hard to guess how I decided to approach typo-tolerant search...



Usually this kind of problem is solved in one of two ways: with fuzzy search or with phonetic algorithms. Being a lazy person who believes that everything has long since been invented before us, I turned to the PostgreSQL documentation. PostgreSQL currently ships two modules that can help with typo-tolerant search: pg_trgm and fuzzystrmatch.

  1. pg_trgm works with trigrams; it can do substring search and fuzzy search, and it supports gist and gin indexes.
  2. fuzzystrmatch can compute the Levenshtein distance between words and provides three phonetic algorithms: Soundex, Metaphone and Double Metaphone. The pitfalls are, first, that the levenshtein function in this module does not let you build an index usable for an arbitrary search query, and second, that all the phonetic algorithms are implemented only for the Latin alphabet.

So I started looking for a solution where the light is brighter, namely with the pg_trgm module.
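Before diving in, here is a quick feel for what the two modules provide out of the box. A minimal sketch; the sample strings are purely illustrative and not from the original article:

create extension if not exists pg_trgm;
create extension if not exists fuzzystrmatch;

-- pg_trgm: decompose a string into trigrams and measure similarity
select show_trgm('smith');                -- {"  s"," sm","smi",...}
select similarity('smith', 'smyth');      -- a real number between 0 and 1

-- fuzzystrmatch: edit distance and phonetic codes (Latin alphabet only)
select levenshtein('smith', 'smyth');     -- 1
select soundex('smith'), metaphone('smith', 10), dmetaphone('smith');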

Trigrams


For simplicity of the model, consider an info table with a patient ID and the patient's last name, first name and patronymic. Since we want gist / gin indexes, we first need to know an important but unpleasant fact: one gist / gin index covers one column of the table; it cannot be created, for example, over a concatenation of several columns. That is why the spoiler below creates a helper function to index on:


Boring SQL code
create extension pg_trgm;

create table info (
    patid integer,
    fullname jsonb,
    constraint info_pk primary key (patid),
    constraint fullname_exists check (
        fullname ? 'lname'::text
        and fullname ? 'fname'::text
        and fullname ? 'sname'::text
    ),
    constraint fullname_notnull check (
        (fullname ->> 'lname'::text) is not null
        and (fullname ->> 'fname'::text) is not null
    )
);

create function fullname(in_fullname jsonb) returns text
language plpgsql
immutable
as $$
begin
    return regexp_replace(
        lower(
            trim(
                coalesce(in_fullname->>'lname', '') || ' ' ||
                coalesce(in_fullname->>'fname', '') || ' ' ||
                coalesce(in_fullname->>'sname', '')
            )
        ),
        '', '', 'g'
    );
exception
    when others then
        raise exception '%', sqlerrm;
end;
$$;


We insert about 300 thousand full-name records into the table and proceed.
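If you want to reproduce the setup, the table can be filled with synthetic data. The sketch below uses made-up transliterated names (the original post used real Russian full names), so index sizes and timings will not match the article exactly:

-- generate ~300k synthetic patients; the name lists are hypothetical placeholders
insert into info (patid, fullname)
select
    g as patid,
    jsonb_build_object(
        'lname', (array['smirnov','ivanov','kuznetsov','popov','vasiliev'])[1 + g % 5],
        'fname', (array['denis','ivan','petr','anna','olga'])[1 + (g / 5) % 5],
        'sname', (array['anatolievich','petrovich','ivanovna','sergeevna','olegovich'])[1 + (g / 25) % 5]
    )
from generate_series(1, 300000) as g;

analyze info;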

Trigrams and GIST


So, let's first check the gist index for fuzzy trigram search.

Even more boring SQL code
create index info_gist_idx on info using gist (fullname(fullname) gist_trgm_ops);
CREATE INDEX
Time: 15054,102 ms

explain (analyze, buffers)
select patid, fullname(fullname) <-> '  ' as dist
from info
order by dist
limit 10;

 Limit  (cost=0.28..4.35 rows=10 width=8) (actual time=157.378..157.688 rows=10 loops=1)
   Buffers: shared hit=5743
   ->  Index Scan using info_gist_idx on info  (cost=0.28..126822.96 rows=312084 width=8) (actual time=157.371..157.655 rows=10 loops=1)
         Order By: (fullname(fullname) <-> '  '::text)
         Buffers: shared hit=5743
 Planning time: 0.225 ms
 Execution time: 158.223 ms
(7 rows)


Index creation took 15 seconds, the index size is 45 MB, and searching for an incomplete full name with typos takes 158 ms.
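A note on the <-> operator used above: for pg_trgm it is the trigram distance, which equals one minus the similarity, and that is what lets the gist index serve the ORDER BY ... LIMIT as an index-ordered nearest-neighbour scan instead of sorting the whole table. A tiny illustration with made-up strings:

-- distance is 1 - similarity, so ordering by it returns the closest strings first
select similarity('smith', 'smyth')      as sml,
       'smith' <-> 'smyth'               as dist,
       1 - similarity('smith', 'smyth')  as same_as_dist;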

Trigrams and GIN


Next, consider the gin index for a fuzzy trigram search.

Do you think the previous SQL spoilers were boring?
create index info_trgm_idx on info using gin (fullname(fullname) gin_trgm_ops);
CREATE INDEX
Time: 10163,401 ms

explain (analyze, buffers)
select patid, similarity(fullname(fullname), '  ') as sml
from info
where true
  and fullname(fullname) % '  '
order by sml desc
limit 10;

 Limit  (cost=1180.22..1180.25 rows=10 width=8) (actual time=133.086..133.117 rows=8 loops=1)
   Buffers: shared hit=5741
   ->  Sort  (cost=1180.22..1181.00 rows=312 width=8) (actual time=133.080..133.090 rows=8 loops=1)
         Sort Key: (similarity(fullname(fullname), '  '::text)) DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=5741
         ->  Bitmap Heap Scan on info  (cost=26.70..1173.48 rows=312 width=8) (actual time=132.828..133.048 rows=8 loops=1)
               Recheck Cond: (fullname(fullname) % '  '::text)
               Heap Blocks: exact=7
               Buffers: shared hit=5741
               ->  Bitmap Index Scan on info_gist_idx  (cost=0.00..26.62 rows=312 width=0) (actual time=132.699..132.699 rows=8 loops=1)
                     Index Cond: (fullname(fullname) % '  '::text)
                     Buffers: shared hit=5734
 Planning time: 0.573 ms
 Execution time: 133.225 ms
(15 rows)


Index creation took 10 seconds, the index size is 18 MB, and searching for an incomplete full name with typos takes 133 ms.
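The % operator used in the WHERE clause returns true when the similarity of two strings exceeds the current threshold (0.3 by default), so the threshold directly controls how many candidates the trigram index hands to the bitmap heap scan. If the search returns too much noise or misses obvious matches, it can be tuned per session:

-- pg_trgm.similarity_threshold is the GUC behind the % operator (PostgreSQL 9.6+)
show pg_trgm.similarity_threshold;          -- 0.3 by default
set pg_trgm.similarity_threshold = 0.45;    -- stricter matching, fewer candidates
-- the older function-based interface does the same thing
select set_limit(0.45), show_limit();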

To be honest, the results are so-so, though in fairness my laptop only has a Chinese SSD from the glorious city of Shenzhen. So let's try to speed things up by combining fuzzy search with full-text search.

Trigrams and full text search


The idea is extremely simple: collect all spellings of last names, first names and patronymics into a separate dictionary table. First we split the input search string into lexemes, look each lexeme up in the dictionary table via fuzzy search, pull out every possible spelling of each lexeme, assemble them into a tsquery and run a full-text search against a tsvector index on the info table. The payoff is that the speed of fuzzy trigram search depends on the width of the strings and on how many of them there are in the text column. Obviously the full-name dictionary will be far more compact than the original column in the info table, which means the search will be faster. The method has only one drawback: whenever a new patient is added, the dictionary has to be updated if it does not yet contain the lexemes of their full name (see the trigger sketch below). To test this, we will need to build the rum extension from source and use it for the tsvector index on the name in the info table. Rum is a modified version of the gin index that stores extra information in its leaves. In our case we use the rum_tsvector_ops operator class, which stores positional information about lexemes in the index. Unlike gin, this lets us answer a tsquery of the form
 (…|…) <-> (…) <-> (…)
straight from the index, without visiting the table for extra information about the order of lexemes within the tuple. Moreover, with gin the recommendation is to have the tsvector physically stored as a column, since all the tuple pointers found in the index have to be rechecked against the table; if there is no physical tsvector column (you built it with a function just for the index), the tsvector has to be recomputed for every candidate tuple. All in all, rum should be noticeably more productive in this story.
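As for keeping the dictionary up to date, a trigger on the info table is one obvious way to do it. This is only a sketch of my own and not from the original article; the trigger and function names are made up, and it relies on the unique constraint on patname.lex defined in the code below:

-- hypothetical maintenance trigger: on insert/update of a patient, add any new
-- full-name lexemes to the patname dictionary; duplicates are skipped via ON CONFLICT
create or replace function patname_sync() returns trigger
language plpgsql
as $$
begin
    insert into patname (lex)
    select lexeme
    from unnest(to_tsvector('simple', fullname(new.fullname)))
    on conflict (lex) do nothing;
    return new;
end;
$$;

create trigger info_patname_sync
    after insert or update of fullname on info
    for each row execute procedure patname_sync();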

Everest in the world of boring SQL
create extension rum;

create index info_rum_idx on info using rum (
    to_tsvector('simple'::regconfig, fullname(fullname)) rum_tsvector_ops
);
CREATE INDEX
Time: 7.545s (7 seconds)

create table patname (
    lex text,
    constraint patname_uniq_idx unique (lex)
);

create index patname_fuzzy_idx on patname using gin (lex gin_trgm_ops);
CREATE INDEX
Time: 0.596s

insert into patname (lex)
select word
from ts_stat($$ select to_tsvector('simple', fullname(fullname)) from info $$);

explain (analyze, buffers)
with fio as (
    select lexeme as lex, positions[1] as pos
    from unnest(to_tsvector('simple', '  '))
), query as (
    select to_tsquery('simple', string_agg(q.tq, '&')) as q
    from (
        select f.pos, '(' || string_agg(p.lex, '|') || ')' as tq
        from fio as f
        join patname as p on p.lex % f.lex
        group by f.pos
    ) as q
)
select
    to_tsvector('simple'::regconfig, fullname(fullname)) <=> (select q from query) as rank,
    *
from info
where to_tsvector('simple'::regconfig, fullname(fullname)) @@ (select q from query)
order by to_tsvector('simple'::regconfig, fullname(fullname)) <=> (select q from query);

 Sort  (cost=6453.71..6457.61 rows=1560 width=100) (actual time=68.201..68.202 rows=1 loops=1)
   Sort Key: ((to_tsvector('simple'::regconfig, fullname(info.fullname)) <=> $3))
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=536
   CTE fio
     ->  Function Scan on unnest  (cost=0.00..0.10 rows=10 width=34) (actual time=0.023..0.034 rows=3 loops=1)
   CTE query
     ->  Aggregate  (cost=1484.60..1484.86 rows=1 width=32) (actual time=11.829..11.830 rows=1 loops=1)
           Buffers: shared hit=325
           ->  HashAggregate  (cost=1484.30..1484.48 rows=10 width=34) (actual time=11.640..11.644 rows=2 loops=1)
                 Group Key: f.pos
                 Buffers: shared hit=325
                 ->  Nested Loop  (cost=16.58..1480.53 rows=755 width=19) (actual time=2.940..11.442 rows=62 loops=1)
                       Buffers: shared hit=325
                       ->  CTE Scan on fio f  (cost=0.00..0.20 rows=10 width=34) (actual time=0.028..0.053 rows=3 loops=1)
                       ->  Bitmap Heap Scan on patname p  (cost=16.58..147.28 rows=75 width=17) (actual time=1.905..3.717 rows=21 loops=3)
                             Recheck Cond: (lex % f.lex)
                             Rows Removed by Index Recheck: 321
                             Heap Blocks: exact=275
                             Buffers: shared hit=325
                             ->  Bitmap Index Scan on patname_fuzzy_idx  (cost=0.00..16.57 rows=75 width=0) (actual time=1.277..1.277 rows=342 loops=3)
                                   Index Cond: (lex % f.lex)
                                   Buffers: shared hit=50
   InitPlan 3 (returns $3)
     ->  CTE Scan on query  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
   InitPlan 4 (returns $4)
     ->  CTE Scan on query query_1  (cost=0.00..0.02 rows=1 width=32) (actual time=11.834..11.839 rows=1 loops=1)
           Buffers: shared hit=325
   ->  Bitmap Heap Scan on info  (cost=31.99..4885.97 rows=1560 width=100) (actual time=68.184..68.187 rows=1 loops=1)
         Recheck Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
         Heap Blocks: exact=1
         Buffers: shared hit=536
         ->  Bitmap Index Scan on info_rum_idx  (cost=0.00..31.60 rows=1560 width=0) (actual time=67.847..67.847 rows=1 loops=1)
               Index Cond: (to_tsvector('simple'::regconfig, fullname(fullname)) @@ $4)
               Buffers: shared hit=517
 Planning time: 5.012 ms
 Execution time: 68.583 ms
(37 rows)


In total, the full-text search index was created in 7 seconds (size 13 MB), the lexeme dictionary index in 0.6 seconds (size 5.8 MB), and the search took 68 ms. Among the disadvantages: the selectivity is worse than in the previous options.

Phonetic algorithms


Having tried the fuzzy-search options from the pg_trgm module, I decided to take another look at fuzzystrmatch. I never figured out how to index the Levenshtein function, but the phonetic algorithms interested me greatly. As mentioned above, the phonetic functions that PostgreSQL ships out of the box are implemented only for the Latin alphabet and are tuned for English names. Searching the internet for Russian implementations led me to a wonderful article on Habr describing a working Metaphone algorithm for Russian names (written in Russian letters). The only sadness was that, simple as the algorithm is, implementing it in plpgsql looked rather dreary, whereas in Python it is a matter of a few lines... And then I remembered that plpython3u is an untrusted (functions written in it can access the file system with the privileges of the postgres process) but perfectly working language in PostgreSQL. It would be a sin not to use it. So I wrote two immutable functions:


Metaphone and btree


Next, let's try to create an ordinary btree index over the metaphone function and estimate the speed.
Pass by, there is nothing interesting
create or replace function phoneme (in_lexeme text) returns text
language plpython3u
immutable
as $$
import re


class Lexeme:
    def __init__(self, body):
        """
        :type body: str
        """
        self.body = body.lower().strip()
        # vowel patterns and their canonical replacements
        self._vowels = {"(?:|||)": "", "[]": "", "[]": "", "[]": ""}
        # voiced consonants mapped to their voiceless counterparts
        self._consonants = {"": "", "": "", "": "", "": ""}
        # characters that trigger devoicing of a preceding voiced consonant
        self._deafening_chars = ["", "", "", "", "", "", ""]
        # characters that are removed outright
        self._removable_chars = {"[]": ""}

    def _remove_double_chars(self):
        return Lexeme("".join(
            char for num, char in enumerate(self.body)
            if char != self.body[num - 1]
        ))

    def _deafen_consonants(self):
        modified_body = ""
        for num, char in enumerate(self.body):
            if char in self._consonants and (
                num < len(self.body) - 1 and self.body[num + 1] in self._deafening_chars
                or num == len(self.body) - 1
            ):
                modified_body += self._consonants[char]
            else:
                modified_body += char
        return Lexeme(modified_body)

    @staticmethod
    def _regexp_replace(text, char_dict):
        modified_body = text
        for item in char_dict:
            modified_body = re.sub(item, char_dict[item], modified_body)
        return Lexeme(modified_body)

    def _replace_vowels(self):
        return self._regexp_replace(self.body, self._vowels)

    def _remove_chars(self):
        return self._regexp_replace(self.body, self._removable_chars)

    def metaphone(self):
        return self._remove_chars()._replace_vowels()._deafen_consonants()._remove_double_chars().body


return Lexeme(in_lexeme).metaphone()
$$;

create or replace function metaphone (in_phonemes text) returns text
language plpgsql
immutable
as $$
begin
    return (
        select string_agg(q.lex, ' ')
        from (
            select phoneme(lexeme) as lex
            from unnest(to_tsvector('simple', in_phonemes))
            order by positions
        ) as q
    );
exception
    when others then
        raise '%', SQLERRM using errcode = SQLSTATE;
end;
$$;

create index info_metaphone_idx on info (
    metaphone(fullname(fullname)) text_pattern_ops
);
CREATE INDEX
Time: 114.757s (a minute)

explain (analyze, buffers)
select patid, fullname
from info
where metaphone(fullname(fullname)) like regexp_replace(metaphone('  '), '\s', '%', 'g') || '%'
limit 10;

 Limit  (cost=76.03..1388.96 rows=10 width=96) (actual time=22.452..129.944 rows=3 loops=1)
   Buffers: shared hit=239
   ->  Bitmap Heap Scan on info  (cost=76.03..4146.10 rows=31 width=96) (actual time=22.447..129.927 rows=3 loops=1)
         Filter: (metaphone(fullname(fullname)) ~~ '%%%'::text)
         Rows Removed by Filter: 244
         Heap Blocks: exact=234
         Buffers: shared hit=239
         ->  Bitmap Index Scan on info_metaphone_idx  (cost=0.00..76.02 rows=1560 width=0) (actual time=0.061..0.061 rows=247 loops=1)
               Index Cond: ((metaphone(fullname(fullname)) ~>=~ ''::text) AND (metaphone(fullname(fullname)) ~<~ ''::text))
               Buffers: shared hit=5
 Planning time: 1.012 ms
 Execution time: 129.977 ms
(12 rows)

Time: 131,802 ms


The index took 114 seconds to create and is 22 MB in size (it seems my Python function is not the most performance-optimal), and the query takes 131 ms. The index is used only for a small leading part of the pattern, and the rest is handled by a filter because of the '%' wildcards. Poor.
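The reason is that a btree index with text_pattern_ops can only drive LIKE patterns anchored at the left edge: the prefix before the first '%' becomes a range condition on the index, and everything after it has to be rechecked row by row. A hedged illustration; the metaphone codes below are hypothetical placeholders, since the real ones are Cyrillic:

-- left-anchored prefix: 'smirnaf' becomes an index range condition
select patid, fullname
from info
where metaphone(fullname(fullname)) like 'smirnaf%';

-- wildcard in the middle: the index narrows rows by 'smirnaf', the rest is a filter
select patid, fullname
from info
where metaphone(fullname(fullname)) like 'smirnaf%dinis%';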

Metaphone and trigrams


Let's try to build a trigram index on top of the metaphone function written in plpython3u. This time, however, we will use it not for fuzzy search but for substring search.

How to reduce query time with folk remedies ...
create index info_metaphone_trgm_idx on info using gin (metaphone(fullname(fullname)) gin_trgm_ops);
CREATE INDEX
Time: 124.713s (2 minutes)

explain (analyze, buffers)
select patid, fullname
from info
where metaphone(fullname(fullname)) like '%' || regexp_replace(metaphone('  '), '\s', '%', 'g') || '%'
limit 10;

 Limit  (cost=92.24..134.98 rows=10 width=96) (actual time=9.562..10.638 rows=3 loops=1)
   Buffers: shared hit=103
   ->  Bitmap Heap Scan on info  (cost=92.24..224.74 rows=31 width=96) (actual time=9.554..10.617 rows=3 loops=1)
         Recheck Cond: (metaphone(fullname(fullname)) ~~ '%%%%'::text)
         Heap Blocks: exact=2
         Buffers: shared hit=103
         ->  Bitmap Index Scan on info_metaphone_trgm_idx  (cost=0.00..92.23 rows=31 width=0) (actual time=8.354..8.354 rows=3 loops=1)
               Index Cond: (metaphone(fullname(fullname)) ~~ '%%%%'::text)
               Buffers: shared hit=101
 Planning time: 2.029 ms
 Execution time: 10.726 ms
(11 rows)

Time: 14,480 ms


Index creation time: 124 seconds, size 15 MB (hello to my clumsy hands and plpython3u), search: 14 ms.

Results


Let's put the various typo-tolerant search options into a summary table.
UPDATE 1: added movEAX's implementation of the metaphone.
UPDATE 2: added the Metaphone implementation in plpgsql from Ivan Milovanov (telegram - milovanov)
Metaphone on plpgsql
create or replace function phoneme (in_lexeme text) returns text
language plpgsql
immutable
as $$
declare
    res varchar(100) DEFAULT '';
begin
    res := lower(in_lexeme);
    res := regexp_replace(res, '[]', '', 'g');
    res := regexp_replace(res, '(|||)', '', 'g');
    res := regexp_replace(res, '[]', '', 'g');
    res := regexp_replace(res, '[]', '', 'g');
    res := regexp_replace(res, '', '', 'g');
    res := regexp_replace(res, '([]|$)', '\1', 'g');
    res := regexp_replace(res, '([]|$)', '\1', 'g');
    res := regexp_replace(res, '([]|$)', '\1', 'g');
    res := regexp_replace(res, '([]|$)', '\1', 'g');
    res := regexp_replace(res, '([]|$)', '\1', 'g');
    res := regexp_replace(res, '', '', 'g');
    res := regexp_replace(res, '', '', 'g');
    res := regexp_replace(res, '(.)\1', '\1', 'g');
    return res;
exception
    when others then
        raise exception '%', sqlerrm;
end;
$$;


| Search type | Index creation time | Index size | Typo search speed | Remarks |
| --- | --- | --- | --- | --- |
| Gist trigrams | 15 s | 45 MB | 158 ms | |
| Gin trigrams | 10 s | 18 MB | 133 ms | |
| Trigrams and full-text search | 7.6 s | 18.8 MB | 68 ms | Worse selectivity; a lexeme dictionary has to be maintained |
| Metaphone, btree | 114 s | 22 MB | 131 ms | plpython3u is an untrusted language |
| Metaphone, trigrams | 124 s | 15 MB | 14 ms | plpython3u is an untrusted language |
| Metaphone, trigrams (movEAX's implementation) | 77.8 s | 16 MB | 14 ms | plpython3u is an untrusted language |
| Metaphone, trigrams (Ivan Milovanov's plpgsql implementation) | 72.0 s | 16 MB | 14 ms | |

UPDATE 3: when the index contains “Smirnaf Dinis Anatalievich”, the letter “v” in the patronymic is not devoiced (since it is followed by a vowel). But if you search for the substring metaphone('anatoliev'), that same “v” is no longer followed by a vowel; it ends up at the end of the word and gets devoiced. To work around this problem, the mquery function below was written, and the search is performed with the expression
 select metaphone('  ') similar to mquery('  '); 

The mquery function
create or replace function mquery(in_fullname text) returns text
language plpgsql
immutable
as $$
declare
    res text;
begin
    res := metaphone(in_fullname);
    res := regexp_replace(res, '(|)', '(|)', 'g');
    res := regexp_replace(res, '(|)', '(|)', 'g');
    res := regexp_replace(res, '(|)', '(|)', 'g');
    res := regexp_replace(res, '(|)', '(|)', 'g');
    res := regexp_replace(res, '(|)', '(|)', 'g');
    res := regexp_replace(res, '\s', '%', 'g');
    return '%' || res || '%';
exception
    when others then
        raise exception '%', sqlerrm;
end;
$$;



Since in my case the system will be read-heavy rather than write-heavy (a couple of new patients added per minute at most), my choice is Metaphone with trigrams. If someone has ideas on how to optimize the Python function for speed, write in the comments and I will add the results to the tests.

Source: https://habr.com/ru/post/341142/

