
Sphinx Text Processing Pipeline

Text processing in a search engine looks quite simple from the outside, but it is in fact a complex process. During indexing, document text has to go through an HTML stripper, a tokenizer, a stopword filter, a word form filter, and a morphology processor. Along the way you also need to remember about exceptions, blended characters, N-grams, and sentence boundaries. At search time everything gets even more complicated, because on top of all of the above you also have to parse the query syntax, which adds all sorts of special characters (operators and masks). Here is how it all works in Sphinx.

The big picture


The simplified text processing pipeline (in the 2.x engine) looks like this:


It looks quite simple, but the devil is in the details. There are several very different filters (applied in a specific order); the tokenizer does more than just split text into words; and finally, the "etc." in the morphology block actually hides at least three different variants.

Therefore, the following picture will be more accurate:
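Roughly, the full chain described in the rest of this article lines up like this (a text sketch of the stages in the order they are applied):

 fields and queries
   -> regexp filters     (regexp_filter)
   -> HTML stripper      (html_strip, zone and paragraph boundaries)
   -> tokenizer          (charset_table, blend_chars, ignore_chars, ngram_chars,
                          exceptions, sentence boundaries)
   -> multiforms         (M:N word forms)
   -> regular word forms (1:1), then morphology (stemmers, lemmatizers, ...),
      then morphological word forms (~stem => form)
   -> stopwords          (before or after morphology, see stopwords_unstemmed)
   -> final keywords and positions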





Regular expression filters



This is an optional step. Essentially, it is just a set of regular expressions applied to the documents and queries that come into Sphinx, and nothing more! So it is merely "syntactic sugar", but a rather convenient one: with regexps, Sphinx handles everything itself, whereas without them you would have to write one script to preprocess the data loaded into Sphinx, another one to rewrite the queries, and then keep the two in sync. Inside Sphinx we simply run all these filters over the fields and queries before any further processing. That's it! A more detailed description can be found in the regexp_filter section of the documentation.
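For example, a couple of filters in the index config might look like this (a sketch; the first rule is the well-known inch-normalization example, the second is just a made-up illustration, and regexp_filter requires Sphinx built with RE2 support):

 regexp_filter = \b(\d+)\" => \1 inch    # turn 6" into 6 inch
 regexp_filter = (blue|red) => color     # collapse colors into a single keyword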

HTML stripper


This is also an optional step. The handler is enabled only if the source contains the html_strip directive, and it runs right after the regular expression filters. The stripper removes all HTML tags from the incoming text. In addition, it can extract and index individual attributes of specified tags (see html_index_attrs), as well as remove the text between certain tags (see html_remove_elements). Finally, since zones and paragraphs in documents use the same SGML markup, the stripper also detects zone and paragraph boundaries (see index_sp and index_zones). (Otherwise you would need one more identical pass over the document just for that. Inefficient!)
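A typical stripper setup might look like this (a sketch based on the directives mentioned above; the tag lists are only examples):

 html_strip           = 1
 html_index_attrs     = img=alt,title; a=title
 html_remove_elements = style, script
 index_sp             = 1               # detect sentence and paragraph boundaries
 index_zones          = h*, th, title   # treat these tags as zones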

Tokenization


This step is required. One way or another, we need to split the phrase "Mary had a little lamb" into separate keywords. That is the essence of tokenization: turn a text field into a set of keywords. What could be simpler?

Nothing, except that simply splitting on spaces and punctuation does not always work, which is why there is a whole set of parameters controlling tokenization.

First, there are tricky characters that are both "characters" and "not characters" at the same time; even worse, a single character can simultaneously act as a "letter", a "space", and a "punctuation mark" (which at first glance could also be treated as a space, but actually cannot). To deal with all of this, the charset_table, blend_chars, ignore_chars and ngram_chars settings are used.

By default, the Sphinx tokenizer treats all unknown characters as spaces. So no matter what crazy Unicode pseudo-graphics you stuff into your document, they will simply be indexed as whitespace. All characters listed in charset_table are treated as regular characters. charset_table also lets you map one character to another: this is usually used to fold characters to a single case, to strip accents, or both at once. In most cases this is already enough: we fold the known characters according to charset_table, replace everything unknown (including punctuation) with spaces, and that's it, tokenization is done.
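For instance, a minimal charset_table for English plus Russian text usually looks something like this (digits and underscore kept as-is, Latin and Cyrillic letters folded to lowercase; everything else becomes a separator):

 charset_table = 0..9, A..Z->a..z, _, a..z, \
     U+410..U+42F->U+430..U+44F, U+430..U+44F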

However, there are three notable exceptions:
  1. blend_chars - characters treated as a regular character and as a separator at the same time (more on this below).
  2. ignore_chars - characters that are silently removed instead of being replaced with a space (a soft hyphen inside a word, for example).
  3. ngram_chars - characters (typically CJK) whose text has no spaces at all, so each such character is indexed as a separate token.



And all this happens at the level of the most basic elements of the text - individual characters! Scared yet?
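Sketched as config, those three exception classes might look like this (the values here are only illustrative):

 blend_chars  = &, -              # both a regular character and a separator
 ignore_chars = U+AD              # soft hyphens vanish instead of splitting a word
 ngram_chars  = U+3000..U+2FA1F   # index CJK characters one by one
 ngram_len    = 1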

In addition, the tokenizer (oddly enough) also handles exceptions (such as C++ or C#, where the special characters only make sense within those specific keywords and can be safely ignored everywhere else), and it can also detect sentence boundaries (if the index_sp directive is set). This task cannot be solved later, because after tokenization we no longer have any special characters or punctuation. Nor is it worth doing at an earlier stage, because, again, three passes over the same text to perform four operations on it is worse than a single pass that puts everything in its place at once.

Internally, the tokenizer is designed so that exceptions fire before anything else. In this sense they are very similar to regular expression filters (moreover, they could easily be emulated with regexps; we say "could" because we never tried it - in practice it is much easier and faster to use exceptions). Add one more regexp? Fine, but that means one more pass over the field text. All exceptions, on the other hand, are applied at once in a single pass of the tokenizer and take 15-20% of tokenization time (which in turn is about 2-5% of total indexing time).
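Configuring exceptions might look roughly like this (a sketch; the file path and entries are made up, and note that exception matching is case sensitive):

 exceptions = /usr/local/sphinx/data/exceptions.txt

 # exceptions.txt, one "map-from => map-to" rule per line:
 C++ => c++
 c++ => c++
 C# => c#
 AT&T => at&t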

Sentence boundary detection is hard-coded in the tokenizer; there is nothing to configure (and no need to). Just turn it on and hope that it works (it usually does, although who knows, there may be some odd regional corner cases).

So if you take a relatively innocuous character such as the period, list it in one of the exceptions as well as in blend_chars, and also set index_sp = 1, you risk stirring up a whole hornets' nest (fortunately, one that stays within the tokenizer). From the outside, again, everything "just works" (although if you turn on ALL of the above options and then try to index some strange text that triggers all of those conditions at once and thereby awakens Cthulhu - well, you have only yourself to blame!)

From this point on, we have tokens! All subsequent processing stages deal with individual tokens. Onward!

Word Forms and Morphology


Both steps are optional and both are disabled by default. More interestingly, word forms and morphology processors (stemmers and lemmatizers) are to some extent interchangeable, which is why we consider them together.

Each word produced by the tokenizer is processed separately. Several different handlers are available: from the rather crude, but still popular in places, Soundex and Metaphone, through the classic Porter stemmers (including the libstemmer library), up to full-fledged dictionary lemmatizers. All of them generally take one word and replace it with its normalized form. So far so good.

And now the details: morphology handlers are applied in exactly the order listed in the config file, until the word gets processed. That is, as soon as one handler changes the word, the chain stops and the remaining handlers are not even called. For example, in the chain morphology = stem_en, stem_fr the English stemmer takes precedence; in the chain morphology = stem_fr, stem_en - the French one. And in the chain morphology = soundex, stem_en the mention of the English stemmer is essentially useless, because soundex converts every English word before the stemmer gets to it. An important side effect of this behavior: if a word is already in its normal form and one of the stemmers detects that (without, of course, changing anything), it will still be passed to the subsequent stemmers.
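As a sketch, the effect of ordering can be seen in settings like these (hypothetical combinations, only meant to illustrate the "first handler that changes the word wins" rule):

 morphology = stem_en, stem_fr   # the English stemmer gets the first shot
 # morphology = stem_fr, stem_en # the French one would win instead
 # morphology = soundex, stem_en # stem_en is effectively dead weight here:
 #                               # soundex rewrites English words before it runs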

Further: regular word forms act as an implicit morphology processor of the highest priority. If word forms are specified, words go through them first and reach the morphology handlers only if no transformation happened. Thus, any unfortunate mistake of a stemmer or lemmatizer can be fixed with word forms. For example, the English stemmer reduces the words "business" and "busy" to the same stem "busi". This is easily corrected by adding a single line "business => business" to the word forms. (And yes, note that in this sense word forms outrank morphology: the mere fact that a replacement rule matched is enough, even though the word itself did not actually change.)
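In config terms, that fix might look like this (a sketch; the file name is made up, the rule itself is the one from the paragraph above):

 morphology = stem_en
 wordforms  = fixups.txt

 # fixups.txt:
 business => business    # matched by word forms, so stem_en never sees it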

Above, "ordinary word forms" were mentioned. And here's why: there are three different types of word forms .
  1. Regular word forms . They map tokens 1:1 and to some extent replace morphology (as we just discussed).
  2. Morphological word forms . A single line "~run => walk" lets you replace every inflection of "run" with "walk", instead of a whole set of rules for "runs", "running", "ran", and so on. And while English may not have that many such variants, in some other languages, like Russian, a single stem can have dozens or even hundreds of different inflections. Morphological word forms are applied after the morphology handlers. They still map words 1:1.
  3. Multiforms . They map words M:N. They essentially work as a plain substitution and are applied at the earliest possible stage; the easiest way to think of multiforms is as a kind of early replacement. In that sense they resemble a regexp or an exception, but they are applied at a different stage and therefore ignore punctuation. Note that after multiforms are applied, the resulting tokens go through all the other morphological processing, including regular 1:1 word forms!


Consider an example:

  morphology = stem_en
  wordforms = myforms.txt

  myforms.txt:
  walking => crawling
  running shoes => walking pants
  ~pant => shoes


Suppose we index the document "my running shoes" with these strange settings. What ends up in the index?

  1. "my" is left as is and lands in position 1.
  2. "running shoes" matches the multiform and is replaced with "walking pants"; "walking" is then turned into "crawling" by the regular word form and lands in position 2.
  3. "pants" is stemmed to "pant" by stem_en and then mapped to "shoes" by the morphological word form, landing in position 3.

Sounds reasonable. But how is a mere mortal, who does not develop Sphinx and is not in the habit of debugging C++ code, supposed to guess all of this? Very simple: there is a special command for that:

 mysql> call keywords ('my running shoes', 'test1');
 +------+---------------+------------+
 | qpos | tokenized     | normalized |
 +------+---------------+------------+
 |    1 | my            | my         |
 |    2 | running shoes | crawling   |
 |    3 | running shoes | shoes      |
 +------+---------------+------------+
 3 rows in set (0.00 sec)


And to conclude this section, here is an illustration of how morphology and the three different types of word forms interact together:



Words and Positions


After all this processing, tokens end up with certain positions. Usually they are simply numbered sequentially, starting from one. However, a single position in a document can be occupied by several tokens at once! This usually happens when one "raw" token produces several versions of the final word - via blended characters, lemmatization, or a few other mechanisms.

Magic characters


For example, with "&" declared as a blended character, "AT&T" will produce "at" at position 1, "t" at position 2, and additionally "at&t" at position 1.
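A sketch of such a setup; with "&" in blend_chars, "AT&T" is indexed both as the whole blended token and as its separate parts:

 blend_chars = &
 # blend_mode = trim_none   # default; controls how the blended variant is trimmed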

Lemmatization


This is more interesting. Say we have the document "White dove flew away. I dove into the pool." The first occurrence of "dove" is a noun. The second is the past tense of the verb "dive". Looking at these words as isolated tokens, there is no way to tell which is which (and even looking at several tokens at once, making the right call can be quite hard). In this case morphology = lemmatize_en_all will index all possible variants: in this example, two different tokens are indexed at positions 2 and 6, so that both "dove" and "dive" are preserved.
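Enabling that looks roughly like this (the dictionary path is an assumption, shown only for illustration):

 morphology      = lemmatize_en_all
 lemmatizer_base = /usr/local/share/sphinx   # directory holding the en.pak dictionary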

Positions affect phrase searches and proximity searches; they also affect ranking. As a result, any of the four queries - "white dove", "white dive", "dove into", "dive into" - will find the document in phrase mode.
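For instance, in SphinxQL (reusing the test1 index name from the example above, purely hypothetically):

 mysql> SELECT id FROM test1 WHERE MATCH('"white dive"');
 mysql> SELECT id FROM test1 WHERE MATCH('"dove into"');

Both phrase queries should match, since the lemmatizer has indexed both variants at the same positions.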

Stopwords


Removing stopwords is a very simple step: we just throw them out of the text. Still, a couple of things are worth keeping in mind (see the example config after this list):
1. Whether to ignore stopwords completely (rather than just replacing them with spaces). Even though stopwords are thrown away, the positions of the remaining words are kept by default. This means that "microsoft office" and "microsoft in the office", with "in" and "the" ignored as stopwords, produce different indexes: in the first document the word "office" is at position 2, in the second at position 4. If you want to remove stopwords completely, use the stopword_step directive and set it to 0. This affects phrase searching and ranking.
2. Whether stopwords are matched as specific forms or as whole lemmas. This is controlled by stopwords_unstemmed, which determines whether stopword removal is applied before or after morphology.
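A sketch of those settings (the stopwords file path is hypothetical):

 stopwords           = /usr/local/sphinx/data/stopwords.txt
 stopword_step       = 0   # do not advance positions across removed stopwords
 stopwords_unstemmed = 1   # filter stopwords before morphology is applied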

What is left?


Well, we have covered nearly all the typical everyday text processing tasks. By now it should be clear what happens inside, how it all fits together, and how to configure Sphinx to get the result you want. Hooray!

But there is more. Briefly: there is also the index_exact_words option, which tells Sphinx to index the original token (before morphology is applied) in addition to the normalized one. There is also the bigram_index option, which makes Sphinx index word pairs ("a brown fox" becomes the tokens "a brown" and "brown fox") and then use them for ultra-fast phrase searching. You can also use indexing and query plugins, which let you implement almost any token processing you like.
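Sketched as config (the values shown are just the straightforward ones; the documentation lists the full set of bigram_index modes):

 index_exact_words = 1     # keep the pre-morphology form, searchable via =word
 bigram_index      = all   # additionally index every adjacent word pair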

And finally, the upcoming Sphinx 3.0 release plans to unify all of these settings, so that instead of global directives applied to the whole document you will be able to build separate filter chains for individual fields - for example, first remove some stopwords, then apply word forms, then morphology, then another word form filter, and so on.

Source: https://habr.com/ru/post/246679/

