AdMe.ru: Tags, full-text search and all that ...

In this article I will reveal the secrets of how adme.ru works, namely, I will tell you how storage of articles and tags is arranged, how full-text search works, search by tags, and so on ...

The article provides an example of the structure of the tables, the working Sphinx config and some php code with sample samples.

A bit of history

Once upon a time, tags on adme.ru were stored in the manner described in this post (method number two), and the search was carried out in the manner described in this post (match against).
It was implemented in the following way: In the table where the articles were stored there was a text field tags - in this field the ID tags were stored in this form: "3a10 s3 1zog s7 4eus s8 vx rt 4lcd 4gv4 3c4i 3c3q 3a09 39za" is nothing else id tags in the 36-digit number system (was chosen because of the compactness of the representation of large numbers) listed through a space. Just this method of storage allowed to receive articles from a table by a specific tag or tags. Of course, for the operation of such a scheme, it was necessary to keep the table in MyISAM, across the tags field to make a fulltext index, and in my.cnf specify ft_min_word_len = 0 so that full-text search works and for tags whose ID in the 36th system will be less than 3 characters ( or 2, I do not remember). Now, about the search: Full-text search did not work on the source table but on its own, which was replenished and changed during those moments when the editor added, deleted or changed articles. In this table, for each article there was a text field that contained word forms. The word forms themselves were built not only from the content, title and announcement of the article, but also from the tags belonging to this article. And again, our not beloved MyISAM, fulltext index and match against.
')
Unfortunately, such tag storage and such a search mechanism have several important disadvantages:

When adding or changing a record in a table (MyISAM) with articles, the entire table lochitsya, the process of adding / changing a record itself takes a long time and as a result all other customers who want to get something from the table get in the queue. Articles are edited and added constantly, which leads to permanent site brakes.
Full-text search using match against is very, very deceptive, and gives a significant load on the database.
It works slowly.
It works very slowly.
It works like Kick as slow.

All these shortcomings made us think and make the right decision.

What is the solution

Links tags and materials stored in a separate table. Selection of materials by tags, by time range and full-text search to give Sphinx'u.
As a result, Sphinx saved us from all the flaws of the previous scheme and did not add any new ones. In this article I will not dwell on the moments associated with the installation of Sphinx and I will not talk about sphinx-api - we will assume that all this is already familiar to the reader.

Keeping tags and articles

The requirements for tags are:

It should be possible to set the tags synonyms, for example: commercials, TV commercials and so on. You also need the ability to designate the main of these synonyms.
It should be possible to set tags for their values in different languages, for example, in Russian, the tag will be presented as "Movies", and in English as "TV-Spots" or "Reels", for other languages, the rule from p.1 should also be observed
Tags must be organized into multi-level trees, so that objects can be classified using several attributes. For example, the tag "Russia" is a subsidiary to the tag "Geography".

Of these requirements, approximately the following table structure emerges:
tags
id
path - URL of the tag, a string of Latin letters, for example for the tag "Creative Review" it will be "kreativnyj_obzor"
left_key - tree
right_key - tree
gr_id - tag group (tree) ID. It is necessary to store several trees in one table and to determine if a particular tag belongs to a specific tree.
tags_names
tree_id - tag identifier from tags table
name - natural language tag representation, for example Russia, Russia
primary - the main name value flag for the current tree_id
lang - language code
The structure of article tables (simplified version):
articles
id
header - header
anounce - announcement
published_at - publication date
modified_at - last modified date
published - flag of publication
articles_content
id - id of articles from articles
content - a large piece of text - article content
articles_sphinx - this table always contains only one entry.
ts - last full indexing time
Link tags and articles:
articles_tags
id - article identifier
tag_id - tag identifier
order - tag order (this field is important for us, since we build the full URL of the article for its first tag, here is an example )

How to index real-time

For real-time indexing, we use two indexes - the main index contains all articles and is updated at night, and the delta index contains articles added or modified today.

Accordingly, every time we change or add an article, we run a reindexing of the delta index.
Since the delta index itself does not contain a lot of materials, the indexing time is very short. We reindex the main index at night.

In this case, the search is always carried out simultaneously on two indices - on the main and delta.

Sphinx working config for real-time indexing and tag search

  source articles
 {	
	 type = mysql
	 sql_host = *********
	 sql_user = *********
	 sql_pass = *********
	 sql_db = *********
	 sql_port = 3306  

	 sql_query_pre = SET NAMES utf8
	 sql_query = SELECT article.id, article.id AS idt, article.header, article.anounce, article_content.content, article.published_at, 0 AS deleted, \
					 (SELECT concat (concat ('_ tag', replace (group_concat (tags_names.tree_id), ',', '_tag'), ''), replace (group_concat (tags_names.name), ',', '')) \
					 FROM tags_names AS tags_names \
					 JOIN articles_tags AS articles_tags ON (tags_names.tree_id = articles_tags.tag_id) \
					 WHERE articles_tags.id = article.id \
					 ) AS `tags_names` \
				   FROM articles AS article \
				   LEFT JOIN articles_content AS article_content on (article.id = article_content.id) \
				   WHERE article.published = 1;
	 sql_query_post_index = UPDATE articles_sphinx set ts = (select max (modyfied_at) from articles)
	 sql_attr_timestamp = published_at
	 sql_attr_uint = idt
	 sql_attr_uint = deleted
	 sql_attr_multi = uint tags from query;  SELECT id, tag_id FROM articles_tags order by `order` asc
	 sql_ranged_throttle = 0
 }

 source articles_delta: articles
 {
	 sql_query = SELECT article.id, article.id AS idt, article.header, article.anounce, article_content.content, article.published_at, 0 AS deleted, \
					 (SELECT concat (concat ('_ tag', replace (group_concat (tags_names.tree_id), ',', '_tag'), ''), replace (group_concat (tags_names.name), ',', '')) \
					 FROM tags_names AS tags_names \
					 JOIN articles_tags AS articles_tags ON (tags_names.tree_id = articles_tags.tag_id) \
					 WHERE articles_tags.id = article.id \
					 ) AS `tags_names` \
				   FROM articles AS article \
				   LEFT JOIN articles_content AS article_content on (article.id = article_content.id) \
				   WHERE article.published = 1 AND modyfied_at> (SELECT ts FROM articles_sphinx limit 1)

	 sql_query_post_index = select 1; 

	 sql_attr_multi = uint tags from query;  SELECT id, tag_id FROM articles_tags \
							 WHERE id> = (SELECT min (id) FROM articles WHERE modyfied_at> (SELECT ts FROM articles_sphinx limit 1)) \
							 AND id <= (SELECT max (id) FROM articles WHERE modyfied_at> (SELECT ts FROM articles_sphinx limit 1))
 }

 index articles_index
 {
	 source = articles
	 path = / usr / local / sphinx / var / data / articles
	 docinfo = extern
	 mlock = 0
	 morphology = stem_enru
	 min_word_len = 1
	 charset_type = utf-8
	 html_strip = 1
 }

 index articles_index_delta: articles_index
 {
	 source = articles_delta
	 path = / usr / local / sphinx / var / data / articles_delta
 }

Some lines of the Sphinx config with comments

  source articles
 {	
	 sql_query_post_index = UPDATE articles_sphinx set ts = (select max (modyfied_at) from articles)
 }

Every time after full indexation of articles, we set the time in the articles_sphix table as the maximum time for modifying articles - this is our time.
used in source articles_delta in order to index only the materials that have changed since the last full indexation.

  source articles
 {	
	 sql_attr_multi = uint tags from query;  SELECT id, tag_id FROM articles_tags order by `order` asc
 }

This is just a multi-value attribute - in fact, for each article we get a field with a list of ID tags by ID values, we can filter the search results.

Part of the main query from both sources:

  (SELECT concat (concat ('_ tag', replace (group_concat (tags_names.tree_id), ',', '_tag'), ''), replace (group_concat (tags_names.name), ',', '')) \
  FROM tags_names AS tags_names \
  JOIN articles_tags AS articles_tags ON (tags_names.tree_id = articles_tags.tag_id) \
  WHERE articles_tags.id = article.id \
 ) AS `tags_names`

This query for each article collects a string of the form: "_tag1020 _tag1342 _tag1243 Russia France" - this line contains the submission of tags in natural language and in the form of _tagN lines, where N is the tag tag. As you already understood, tags in the index are presented in three forms: in the form of a multi-value attribute, in the form of strings "_tag1020" and in natural language, just below the text it is told why this is done.

  source articles_delta: articles
 {
	 sql_query_post_index = select 1; 
 }

Everything is simple here - we need to block the request from the source from which we inherit.

Why we send to Sphinx tags in three forms:

In the attributes:

Multi-value tag list field for implementing samples by one or multiple tags with AND and OR conditions if there is an additional search string. An example can be found here . This example using the condition I. Multi-value attribute cannot be used to search for similar materials, since it returns weight = 1 for all materials - that is, we cannot determine relevance, but we need to sort by it.

In full-text index:

The presentation of the tag in natural languages, that is, all the submission of the tag: "Russia Russia Russia" for full-text search is added to the indexed text.
The string value in the form of "_tagN", where N is the tag id, for searching similar materials - this method partially covers the Multi-value fields, but is not suitable for tag samples by the condition OR if there is an additional search substring, since lines are already used by those _tagN.

Indexing

'/ usr / local / sphinx / bin / indexer articles_index_delta - rotate' - this is run every time you save an article
'/ usr / local / sphinx / bin / indexer articles_index --rotate' - and this is run on cron at night, or several times a day - as convenient.

Sample by one or more tags

When selecting materials by tags, there are two tasks:

Condition I: Select all materials for a specific tag / tag so that all requested tags are in the material
Condition OR: Select all materials for a specific tag / tag so that at least one of the requested tags is in the material

With the use of the Sphinx, this is solved quite simply.
AND and OR conditions using filters
AND:

  $ tags_ids = array (1,2,3,4,5,6);
 foreach ($ tags_ids as $ tag_id) {
	 // Adds new filters in each iteration, in each filter 
	 // one tag each - as a result we get the condition AND
	 $ sphinx_obj-> SetFilter ('tags', array ($ tag_id)); 
 }
 $ search_res = $ sphinx_obj-> Query ('', 'articles_index articles_index_delta');

OR:

  $ tags_ids = array (1,2,3,4,5,6);
 $ sphinx_obj-> SetFilter ('tags', $ tags_ids); 
 $ search_res = $ sphinx_obj-> Query ('', 'articles_index articles_index_delta');

As you can see, the only difference is that in the first case we will get several filters, each of which must correspond to the desired material, and in the second case we will get one filter. Quote from the Sphinx documentation: "SetFilter ($ attribute, $ values, $ exclude = false) If you are not logged exclude is true). ”- fully reveals the essence of the approach.

AND and OR conditions without using filters (works faster than filters)
AND:
To do this, all the IDs of the required tags should be brought to their pseudo-strings.

  $ tags_query_string = '_tag1 _tag2 _tag3';
 $ sphinx_obj-> SetMatchMode (SPH_MATCH_ALL);
 $ search_res = $ sphinx_obj-> Query ($ tags_query_string, 'articles_index articles_index_delta');

OR:
From the previous example, you need to replace SPH_MATCH_ALL with SPH_MATCH_ANY

A selection of similar materials

In our case, the similarity of materials is determined by the similarity of their tags.

  // we throw out the current material from the sample
 $ sphinx_obj-> SetFilter ('idt', array ($ current_article_id), true); 
 $ sphinx_obj-> SetMatchMode (SPH_MATCH_ANY);
 // sort by weight and by publication date so that it is fresh on top
 $ sphinx_obj-> SetSortMode (SPH_SORT_EXTENDED, '@weight DESC, published_at DESC'); 
 $ search_res = $ sphinx_obj-> Query ('_ tag1 _tag2 _tag3 _tag4', 'articles_index articles_index_delta');

Full text search

It's simple

  $ sphinx_obj-> SetRankingMode (SPH_RANK_PROXIMITY_BM25);
 $ sphinx_obj-> SetMatchMode (SPH_MATCH_ALL);
 $ sphinx_obj-> Query ('I want an elephant', 'articles_index articles_index_delta');

Results

The scheme described in this article has been working stably on adme.ru for quite a long time under the following conditions:
Tags: 87 402 pieces.
Articles: 22,069, Tagging: 132,935.
Images: 103 353, Tagging: 751 950.
Video: 68 316, Tagging: 682 415.
Of course, the given code and the structure of the tables are considerably simplified for clarity, but nevertheless can be used in real projects.

btw

In the sphinx config, there is a searchd section, in which, with a large number of returned results, it is recommended to tweak the max_matches parameter. We have it worth 20,000, as the materials on the usa tag are more than 15,000, but this is not in the articles, but in the encyclopedia of advertising. By default, this parameter is 1000.

Questions, advice, criticism - all welcome.

Source: https://habr.com/ru/post/40443/

All Articles