📜 ⬆️ ⬇️

Postgres in Chinese or setting Full Text Search in Postgres for Chinese

image

A client approached us with a request to update PostgreSQL to the latest version, and at the same time teach him Chinese.
More precisely, to optimize the process of full-text search in Chinese, because it braked the whole thing mercilessly.

The following describes how we did it.
Immediately get down to business.
')

The first step is to build and install Simple Chinese Word Segmentation (SCWS) and the zhparser extension itself.
SCWS
wget -q -O - http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2 | tar xf - cd scws-1.2.2 ; ./configure ; make install 

Now zhparser
github.com/amutu/zhparser - there is also a full instruction here, and we are doing it.
 git clone https://github.com/amutu/zhparser.git SCWS_HOME=/usr/local make && make install 

If gathered successfully, we include expansion in the necessary base
 psql -U postgres -d test_dbname -c 'CREATE EXTENSION zhparser' 

Next, go to the desired database and create a configuration.
 test_dbname=# CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser); test_dbname=# ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple; 

Suppose we already have a table with data.
 CREATE TABLE messages ( title text, body text ); 

And we need to search the columns title and body. Let there be such a record
 INSERT INTO messages VALUES('批发新', '款新婴幼'); 

To do this, we need to add a tsv column with a tsvector type for the tokens.
 test_dbname=# ALTER TABLE messages ADD COLUMN tsv tsvector; 

We hang on it the gin index (https://ru.wikipedia.org/wiki/GIN)
 test_dbname=# CREATE INDEX tsv_idx ON messages USING gin(tsv); 

Now, we create a trigger to automatically update the tokens in the tsv column using our public.testzhcfg configuration, which we created earlier.
 test_dbname=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( tsv, 'public.testzhcfg', title, body ); 

And the last step.
We need to update the tsv column with our new trigger. (and as we can see it is activated when insert and update), for this we use the UPDATE to rewrite the current values ​​of the title and body fields, the same values.
 test_dbname=# UPDATE messages SET title=title, body=body; 

You can try
SELECT title, body FROM messages WHERE tsv @@ to_tsquery ('批 & 款');

Questions, suggestions and suggestions write in the comments.
Thanks for attention!

Source: https://habr.com/ru/post/301820/


All Articles