Using Tsearch2 in PostgreSQL 8.3

Many in the implementation of site search have to make a query to the database using the operator LIKE or regular expressions.
With the release of PostgreSQL 8.3 , the full-text search module tsearch2 was built into the kernel of the system (this has already been written on Habré), and with its help you can implement database search, which will be more functional. In principle, the article mentioned above gives an example of using this module, but I would like to share the experience of implementing full-text search in a real project .

So, there is a table “news” containing the fields: “title”, “metaKeywords”, “metaDescription” and “content”.
It is necessary to implement a full-text search that will search for a word across all fields of the table, and each field has its own absolute priority.

Create a database in UTF8 encoding, and create a test table in it:

CREATE DATABASE "tsearch2"
WITH ENCODING = 'UTF8' ;
CREATE TABLE "news"
(
"NewsId" Serial NOT NULL ,
"Title" Varchar (1024) NOT NULL ,
"MetaKeywords" Varchar (2048),
"MetaDescription" Varchar (1024),
"Content" Text NOT NULL ,
primary key ("newsId")
);

Next, add some entry to this table:

INSERT INTO "news" ("title", "metaKeywords", "metaDescription", "content")
VALUES ( 'Test news' , 'news, article' , 'Test news for search' , 'Hello world' );

Now you need to create a full-text search configuration:

CREATE TEXT SEARCH DICTIONARY mydict_russian_ispell (
TEMPLATE = ispell,
DictFile = russian,
AffFile = russian,
StopWords = russian
);

CREATE TEXT SEARCH CONFIGURATION public. Mydict_ru (PARSER = default );
COMMENT ON TEXT SEARCH CONFIGURATION public .mydict IS 'conf. for mydict ru ' ;
')
ALTER TEXT SEARCH CONFIGURATION mydict ADD MAPPING
FOR email, url, url_path, host , file , version,
sfloat, float , int , uint,
numword, hword_numpart, numhword
WITH simple;

ALTER TEXT SEARCH CONFIGURATION mydict ADD MAPPING
FOR word, hword_part, hword
WITH mydict_russian_ispell;

In this script we have created our ispell dictionary mydict_russian_ispell . Then they created their own configuration, mydict_ru , which indicated the rules for using dictionaries.
It is worth saying that the first request uses links to files in postgres in the default directory (\ share \ tsearch_data \).
To create our own dictionary, we needed 3 files: russian.affix, russian.stop and russian.dict. The first contains a description of the endings of words in a given language, the second - a list of stop words, the last - the words themselves. It is important to remember that the files must be in the same encoding as the base itself, i.e. in our case, UTF8.

When implementing a search, we will refer not to the database fields, but to a special index that will contain information about the contents of the table.

ALTER TABLE "news" ADD COLUMN fts_news tsvector;
UPDATE "news" SET fts_news =
setweight ( coalesce (to_tsvector ( 'mydict_ru' , "title"), '' ), 'A' ) ||
setweight ( coalesce (to_tsvector ( 'mydict_ru' , "metaKeywords"), '' ), 'B' ) ||
setweight ( coalesce (to_tsvector ( 'mydict_ru' , "metaDescription"), '' ), 'C' ) ||
setweight ( coalesce (to_tsvector ( 'mydict_ru' , "content"), '' ), 'D' );

CREATE INDEX news_fts_idx ON "news" USING gist (fts_news);

vacuum analyze "news";

With this script, we created another tsvector type field in the table, which contains information about the four fields in the table, the “title” field is taken with the highest priority A , and the “content” field with the lowest D. Then we created a GIST index and made an index update.
As a result of our test record of the table, the following tsvector - 'search': 8C 'article': 4B 'news': 2A, 3B, 6C 'test': 1A, 5C .

Well, now it's time to test our search. Perform the following query:

SELECT
"NewsId",
ts_headline ( 'mydict_ru' , "title", q) as "title",
rank
FROM (
SELECT
"NewsId",
"Title",
q,
ts_rank (fts_news, q) as rank
FROM "news", plainto_tsquery ( 'mydict_ru' , 'news' ) q
WHERE fts_news @@ q
ORDER BY rank DESC
) AS foo;

As a result, we get the string Test News . The highlighting of the found word is implemented by the ts_headline function; query ranking is the ts_rank function, by the value of which we sort the results.

Well, of course, you need to remember to make a trigger that will update the “fts_news” field on the “news” table:

CREATE OR REPLACE FUNCTION "updateNewsFTS" () RETURNS " trigger " AS '
DECLARE bUpdate boolean;
BEGIN
bUpdate = false;
IF (TG_OP = ' ' INSERT ' ') THEN
bUpdate: = true;
ELSEIF (TG_OP = ' ' UPDATE ' ') THEN
IF NEW.title! = OLD.title OR NEW.content! = OLD.content OR NEW. "MetaKeywords"! = OLD. "MetaKeywords" OR NEW. "MetaDescription"! = OLD. "MetaDescription" THEN
bUpdate: = true;
END IF;
END IF;

IF bUpdate = TRUE THEN
RAISE NOTICE ' ' UPDATE ' ';
new.fts_news: = setweight (coalesce (to_tsvector (' ' mydict_ru ' ', new.title), ' ' ' '), ' ' A ' ') ||
setweight (coalesce (to_tsvector (' ' mydict_ru ' ', new. "metaKeywords"), ' ' ' ', ' ' B ' ') ||
setweight (coalesce (to_tsvector (' ' mydict_ru ' ', new. "metaDescription"), ' ' ' ', ' ' C ' ') ||
setweight (coalesce (to_tsvector (' ' mydict_ru ' ', new.content), ' ' ' '), ' ' D ' ');
END IF;
RETURN NEW;
END;
' LANGUAGE ' plpgsql ' VOLATILE;

CREATE TRIGGER "newsFTSTrigger"
BEFORE INSERT OR UPDATE ON "news"
FOR EACH ROW EXECUTE PROCEDURE “updateNewsFTS” ();

UPD

Performance

Since I was asked to provide information about the performance in the comments, I decided to compare two types of search - tsearch and regular expression search.
The search is conducted by VIEW, which connects data from four tables (about 5400 records in total).

Query using tsearch:

SELECT
"Id"
"Type",
ts_headline ( 'mydict_ru' , "title", q) as "title",
( CASE WHEN trim ("foreword") = '' THEN ts_headline ( 'mydict_ru' , "content", q)
ELSE ts_headline ( 'mydict_ru' , "foreword", q) END ) as "body",
"ResourceTypes",
rank
FROM (
SELECT
"Id"
"Type",
"Title",
Foreword
"Content",
"ResourceTypes",
q,
ts_rank (fts_vector, q) as rank
FROM "getSearchItems", plainto_tsquery ( 'mydict_ru' , 'physics' ) q
WHERE fts_vector @@ q
ORDER BY rank DESC
) AS foo;

The average time to complete such a request on my laptop is 2.35 seconds. The result contains 821 entries.

A query that uses regular expressions and implements a sort of ranking of results:

SELECT *
, ( SELECT position ( lower ( 'physics' ) in lower (" search ". "Title")) as "pos"
EXCEPT SELECT 0 as "pos") as "titlePosition"
, ( select position ( lower ( 'physics' ) in lower (" search ". "metaKeywords")) as "pos"
EXCEPT SELECT 0 as "pos") as "metaKeywordsPosition"
, ( SELECT position ( lower ( 'physics' ) in lower (" search ". "MetaDescription")) as "pos"
EXCEPT SELECT 0 as "pos") as "metaDescriptionPosition"
, ( SELECT position ( lower ( 'physics' ) in lower (" search ". "Foreword")) as "pos"
EXCEPT SELECT 0 as "pos") as "forewordPosition"
, ( SELECT position ( lower ( 'physics' ) in lower (" search ". "Content")) as "pos"
EXCEPT SELECT 0 as "pos") as "contentPosition"
FROM (
SELECT
"Id"
"Type",
"Title",
"MetaKeywords",
"MetaDescription",
Foreword
"Content"
FROM "getSearchItems"
WHERE ( lower ("title") ~ lower ( '(. *)' || physics' || '(. *)' )
or lower ("metaKeywords") ~ lower ( '(. *)' || physics' || '(. *)' )
or lower ("metaDescription") ~ lower ( '(. *)' || physics' || '(. *)' )
or lower ("foreword") ~ lower ( '(. *)' || 'physics' || '(. *)' )
or lower ("content") ~ lower ( '(. *)' || physics' || '(. *)' ))
) as " search "
ORDER BY "type" ASC
, "TitlePosition" ASC
, "MetaKeywordsPosition" ASC
, "MetaDescriptionPosition" ASC
, "ForewordPosition" ASC
, "ContentPosition" ASC ;

The average time to complete such a request on my laptop is 1.5 seconds. The result contains 567 records.
Thus, when using tsearch, I got the work time 1.5 times longer than the “simple” search, but I got about 1.5 times more records that contain different forms of the word physics and are already ready for output to the pattern .

PS

This search implementation provides full-text search for Russian words in the database.
With this configuration, the English words were not indexed. To fix this, you need to create a different search configuration, I just provided the one I used myself.
So far, the simplest and most accessible to me is this kind of configuration, adding an English ispell dictionary and indexing of English words:

CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = english,
AffFile = english,
StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION mydict ADD MAPPING
FOR asciiword
WITH english_ispell;

Source: https://habr.com/ru/post/30533/

All Articles

Using Tsearch2 in PostgreSQL 8.3

Performance

PS

More articles: