Making a really smart search: step by step guide

Search in the corporate information system - already from the phrase itself gets bogged down in the mouth. Well, if it is at all, you can not even think about the positive user experience. How to turn the attitude of users spoiled by search engines, and create a fast, accurate, half-word product? It is necessary to take a good piece of Elasticsearch, a handful of intellectual services and mix them in this guide.

There are already plenty of articles on how to add an Elasticsearch-based full-text search to the existing database. But the articles on how to do a really smart search are clearly not enough.

At the same time, the phrase “Smart Search” has already turned into a buzzword and is used to the site and not. So what should a search engine do to be considered smart? Ultimately, this can be described as the output of a result that the user actually needs, even if this result does not quite match the query text. Popular search engines like Google and Yandex go further and not only find the necessary information, but directly answer user questions.

Okay, we will not aim at the ultimatum at once, but what can be done to bring an ordinary full-text search closer to an intelligent one ?

Elements of intellectuality

Smart search is just the case when quantity can turn into quality and a lot of small and fairly simple features can form a sense of magic.

Correction of user errors - whether it is a typo, a wrong layout, or maybe a query with a suspiciously small number of results, but similar to a query on which there is much more information.
Behind th NLP chat (natural language processing, not what you thought) - if the user entered commercial offers for the last year , did he really want to search for these words in the text of all documents or did he really only need commercial offers and only for the last year ?
Input prediction based on previous queries or popular documents.
Presentation of the result - the usual highlighting of the found fragment, additional information depending on what you were looking for. Since in the previous paragraph, commercial proposals were needed, then maybe it makes sense to immediately show the subject of the proposal and from which organization did it come?
Easy drilldown - the ability to refine the search query using additional filters, facets.

Introductory

There is an ECM DIRECTUM with a lot of documents in it. The document consists of a card with meta information and a body, which can have several versions.

The goal is to quickly and conveniently search for information in these documents in a manner familiar to search engine users.

Indexing

In order to search for something well, you need to index it in the beginning.

Documents in ECM are not static, users modify text, create new versions, change data in cards; New documents are constantly being created and sometimes old ones are deleted.
To maintain up-to-date information in Elasticsearch, documents must be constantly re-indexed. Fortunately, ECM already has its own asynchronous event queue, so when changing a document, it’s enough to add it to the queue for indexing.

Displaying ECM documents on Elasticsearch documents

The body of a document in ECM can have several versions. In Elasticsearch, this could be represented as an array of nested objects, but then it becomes inconvenient to work with them - writing queries is complicated, if you change one of the versions, you need to reindex everything, different versions of the same document cannot be stored in different indexes (why is this necessary? in the next section). Therefore, we will denormalize one document from ECM into several Elasticsearch documents with the same card but different bodies.

In addition to the card and the body, different service information is added to the Elasticsearch document, in which it is worth mentioning separately:

list of IDs of groups and users who have rights to the document - for search with consideration of rights;
the number of references to the document - for tuning relevance;
last indexing time.

The composition of the indices

Yes, indexes are in the plural. Usually, several indexes for storing information similar in meaning to Elasticsearch are used only if this information is immutable and is tied to a certain time interval, for example logs. Then the indices are created every month / day or more often depending on the intensity of the load. In our case, any document can be changed, and it would be possible to store everything in a single index.

But - documents in the system can be in different languages, and storing multilingual data in Elasticsearch has 2 problems:

Wrong stemming. For some words, the base will be found correctly, for some - incorrectly (there will be another word in the index), for some - it will not be found at all (the index will be clogged with word forms). For some words from different languages and having different meanings, the basis will coincide, and then the meaning of the word will be lost. The use of several stemmers in a row can lead to the computation of the basis of an already computed one.

Stemming - finding the basis of the word. The stem does not have to be the root of the word or its normal form. Usually it is enough that the related words are projected into one base.
Lemmatization is a kind of stemming in which the normal (dictionary) form of a word is considered the basis.

Incorrect word frequency. Some of the mechanisms for determining relevance in ES take into account the frequency of the searched words in the document (the more often, the higher the relevance) and the frequency of the searched words in the index (the more often, the lower the relevance). So, a small blotch of Russian speech in an English document, when the index is mainly English documents, will have a high weight, but it is worth mixing English and Russian documents in the index, and the weight will go down.

The first problem can be solved for the case when different languages use different character sets (Russian-English documents use Cyrillic and Latin) - language stemmers will only process "their" characters.

Just to solve the second problem, we used an approach with a separate index for each language.

Combining both approaches, we obtain language indices, which nevertheless contain analyzers for several non-intersecting language symbol sets at once: Russian-English (and separately English-Russian), Polish-Russian, German-Russian, Ukrainian-English, etc. .

In order not to create all possible indexes in advance, we used index templates - Elasticsearch allows you to specify a template that contains settings and mappings, and specify the index name pattern. If you try to index a document into a nonexistent index, whose name matches one of the template patterns, not only will a new index be created, but the settings and mappings from the corresponding template will be applied to it.

Index structure

For indexing, we use two analyzers at once (via multi-fields): default to search for the exact phrase and custom for everything else:

"ru_en_analyzer": { "filter": [ "lowercase", "russian_morphology", "english_morphology", "word_delimiter", "ru_en_stopwords" ], "char_filter": [ "yo_filter" ], "type": "custom", "tokenizer": "standard"}

With the lowercase filter, everything is clear, I'll tell you about the rest.

The russian_morphology and english_morphology filters are designed for morphological analysis of Russian and English text, respectively. They are not part of Elasticsearch and are put as part of a separate analysis-morphology plugin. These are lemmatizers using a vocabulary approach in combination with some heuristics and working significantly, SIGNIFICANTLY, better than the built-in filters for the respective languages.

 POST _analyze { "analyzer": "russian", "text": "   " } >>

AND:

 POST _analyze { "analyzer": "ru_en_analyzer", "text": "   " } >>

Very curious word_delimiter filter. He, for example, helps to eliminate typos, when there is no space after the dot. We use the following configuration:

 "word_delimiter": { "catenate_all": "true", "type": "word_delimiter", "preserve_original": "true" }

yo_filter allows you to ignore the difference between E and E:

 "yo_filter": { "type": "mapping", "mappings": [ " => ", " => " ] }

ru_en_stopwords filter with type stop - our dictionary of stop words.

Indexing process

The bodies of documents in ECM are, as a rule, office files: .docx, .pdf, etc. To extract the text, use the ingest-attachment plugin with the following pipeline:

 { "document_version": { "processors": [ { "attachment": { "field": "content", "target_field": "attachment", "properties": [ "content", "content_length", "content_type", "language" ], "indexed_chars": -1, "ignore_failure": true } }, { "remove": { "field": "content", "ignore_failure": true } }, { "script": { "lang": "painless", "params": { "languages": ["ru", "en" ], "language_delimeter": "_" }, "source": "..." } }, { "remove": { "field": "attachment", "ignore_failure": true } } ] } }

From the unusual to the pipeline - ignoring the absence of body errors (this happens for encrypted documents) and determining the target index, based on the text language. The latter is done in a painless script, the body of which I will give separately, because due to JSON limitations, it has to be written in one line. Coupled with the difficulties of debugging (the recommended way is to generate exceptions here and there), it altogether turns into painful.

 if (ctx.attachment != null) { if (params.languages.contains(ctx.attachment.language)) ctx._index = ctx._index + params.language_delimeter + ctx.attachment.language; if (ctx.attachment.content != null) ctx.content = ctx.attachment.content; if (ctx.attachment.content_length != null) ctx.content_length = ctx.attachment.content_length; if (ctx.attachment.content_type != null) ctx.content_type = ctx.attachment.content_type; if (ctx.attachment.language != null) ctx.language = ctx.attachment.language; }

Thus, we always send the document to index_name . If the language is not defined or not supported, then the document is deposited in this index, otherwise it falls into index_name_language .

We do not store the original file body itself, but the _source field is enabled, since it is required to partially update the document and highlight the found.

If only the card has changed since the last indexing, then to update it we use the Update By Query API without a pipeline. This allows, firstly, not to pull potentially heavy document bodies from ECM, and secondly, it significantly speeds up the update on the Elasticsearch side - it is not necessary to extract the text of documents from office formats, which is very resource-intensive.

As such, there is no update of the document in Elasticsearch; technically, when updating from an index, the old document gets, changes and is again fully indexed.

But if the body was changed, then the old document is deleted and indexed from scratch. This allows documents to move from one language index to another.

Search

For ease of description, here is a screenshot of the final result.

Fulltext

The main type of query we use is Simple Query String Query :

 "simple_query_string": { "fields": [ "card.d*.*_text", "card.d*.*_text.exact", "card.name^2", "card.name.exact^2", "content", "content.exact" ], "query": " ", "default_operator": "or", "analyze_wildcard": true, "minimum_should_match": "-35%", "quote_field_suffix": ".exact" }

where .exact are fields indexed by the default analyzer. The importance of the name of the document is two times higher than the other fields. The combination of "default_operator": "or" and "minimum_should_match": "-35%" allows you to find documents that do not have up to 35% of the searched words.

Synonyms

In general, different analyzers are used for indexing and searching, but the only difference in them is the addition of a filter to add synonyms to the search query:

 "search_analyzer": { "filter": [ "lowercase", "russian_morphology", "english_morphology", "synonym_filter", "word_delimiter", "ru_en_stopwords" ], "char_filter": [ "yo_filter" ], "tokenizer": "standard" }

 "synonym_filter": { "type": "synonym_graph", "synonyms_path": "synonyms.txt" }

Accounting rights

To search with rights, the main query is embedded in Bool Query , with the addition of a filter:

 "bool": { "must": [ { "simple_query_string": {...} } ], "filter": [ { "terms": { "rights": [           ] } } ] }

As we remember from the section on indexing, there is a field in the index with the IDs of users and groups that have rights to the document. If there is an intersection of this field with the transferred array, then there are rights.

Tuning Relevance

By default, Elasticsearch evaluates the relevance of the results using the BM25 algorithm using the query and the text of the document. We decided that three more factors should influence the conformity assessment of the desired and actual results:

the last time the document was edited - the farther it was in the past, the less likely it is that this document is needed;
the number of references to the document - the more, the more likely that this document is needed;
body versions in ECM have several possible states: developed, active, and obsolete. It is logical that acting is more important than the others.

This effect can be achieved with the help of Function Score Query :
```
 "function_score": { "functions": [ { "gauss": { "modified_date": { "origin": "now", "scale": "1095d", "offset": "31d", "decay": 0.5 } } }, { "field_value_factor": { "field": "access_count", "missing": 1, "modifier": "log2p" } }, { "filter": { "term": { "life_stage_value_id": { "value": "" } } }, "weight": 1.1 } ], "query": { "bool": {...} } } 
```
As a result, other things being equal, it turns out that the result rating modifier is approximately as dependent on the date of its last change X and the number of hits Y:

External intelligence

For a part of the smart search functionality, we need to extract various facts from a search query: dates with an indication of their application (creation, modification, approval, etc.), names of organizations, types of documents sought, etc.

It is also desirable to classify the request into a specific category, for example, documents by organization, by employee, regulatory, etc.

These two operations are performed by the intelligent ECM module - DIRECTUM Ario .

Smart search process

It is time to take a closer look at what mechanisms are implemented elements of intelligence.

Correction of user errors

Determining the correctness of the layout occurs on the basis of the trigram language model - for a string it is calculated how likely it is to meet its three-character sequences in texts in English and Russian. If the current layout is considered less likely, then, firstly, a tooltip with the corrected layout is shown:

and secondly, the further stages of the search are performed with the corrected layout:

And if nothing is found with the corrected layout, the search starts with the original string.

Errata corrected using Phrase Suggester . There is a problem with it - if you execute a query on several indexes at the same time, then suggest may not return anything, while performing only on one index results. This is treated by setting confidence = 0, but then suggest suggests replacing the words with their normal form. Agree, it will be strange when searching for “letter a ” to get an answer in the spirit: Perhaps you were looking for a letter about ?

This can be bypassed by using two suggesters at once in the request:

 "suggest": { "content_suggest": { "text": " ", "phrase": { "collate": { "query": {         {{suggestion}} } }, } }, "check_suggest": { "text": "", "phrase": { "collate": { "query": {         {{suggestion}} - ({{source_query}}) }, "params": { "source_query": " " } }, } } }

Of the common parameters are used

 "confidence": 0.0, "max_errors": 3.0, "size": 1

If the first suggester returns the result, and the second does not, then this result is the original line itself, perhaps with words in other forms, and the hint should not be shown. In case the prompt is still required, the original search phrase merges with the prompt. This occurs by replacing only the corrected words and only those that the spell checker (used by Hunspell) deems incorrect.

If a search on the source line returns 0 results, it is replaced with the merged string and the search is performed again:

Otherwise, the resulting string with hints is returned only as a hint for the search:

Query classification and fact extraction

As I already mentioned, we use DIRECTUM Ario, namely the text classification service and the fact extraction service. To do this, we gave analysts anonymous search queries and a list of facts that interest us. On the basis of inquiries and knowledge of what documents are in the system, analysts identified several categories and trained the classification service to determine the category according to the query text. Based on the resulting categories and a list of facts, we formulated the rules for using these facts. For example, the phrase last year in the category Everything is considered the date the document was created, and in the category By organization , the date of registration. At the same time, created in the last year should fall into the creation date in any category.

From the search side, a config was made, in which categories were assigned, which facts were applied to which facet filters.

Auto-complete input

In addition to the aforementioned layout corrections, past user search queries and publicly available documents fall into auto completion.

They are implemented using a different type of Suggester, the Completion Suggester , but each has its own nuances.

Autocompletion: Search History

There are far fewer users in ECM than search engines and to allocate for them a sufficient number of common queries. ~~why is Lenin a mushroom~~ does not seem possible. Show everything in a row is also not worth it for reasons of privacy. Regular Completion Suggester can search only for the entire set of documents in the index, but Context Suggester comes to the rescue - a way to set a certain context for each hint and filter it by these contexts during the search. If we use user names as contexts, then everyone can only be shown his history.

You also need to give the user the opportunity to remove the hint for which he is ashamed. As a key to delete, we used the username and the text of the tooltip. As a result, the index with hints turned out to be a slightly duplicated mapping:

 "mappings": { "document": { "properties": { "input": { "type": "keyword" }, "suggest": { "type": "completion", "analyzer": "simple", "preserve_separators": true, "preserve_position_increments": true, "max_input_length": 50, "contexts": [ { "name": "user", "type": "CATEGORY" } ] }, "user": { "type": "keyword" } } } }

The weight for each new hint is set to one and increases with each repeated input using the Update By Query API with a very simple ctx._source.suggest.weight++ script.

Autocompletion: Documents

But there are a lot of documents and possible combinations of rights. Therefore, here we, on the contrary, decided not to do filtering by rights for auto-completion, but to index only publicly available documents. Yes, and delete individual tips from this index is not necessary. It would seem that implementation in everything is easier than the previous, if not for two points:

The first one - Completion Suggester supports only prefix search, and clients so much like to assign nomenclature numbers to everything, and some .01.01 as you type in the query Abbreviation rules are not found. Here, together with the full name, you can index n-grams derived from it:

 { "extension": "pdf", "name": ".01.01   ", "suggest": [ { "input": "", "weight": 70 }, { "input": " ", "weight": 80 }, { "input": "  ", "weight": 90 }, { "input": ".01.01   ", "weight": 100 } ] }

With the story it was not so critical, yet the same user enters about the same line if he searches for something again. Probably

The second is that by default all the hints are equal, but we would like to make some of them more equal and preferably so that it is consistent with the ranking of the search results. To do this, approximately repeat the functions gauss and field_value_factor, used in Function Score Query .

It turns out such a pipeline:

 { "dir_public_documents_pipeline": { "processors": [ ... { "set": { "field": "terms_array", "value": "{{name}}" } }, { "split": { "field": "terms_array", "separator": "\\s+|$" } }, { "script": { "source": "..." } } ] } }

with the following script:

 Date modified = new Date(0); if (ctx.modified_date != null) modified = new SimpleDateFormat('dd.MM.yyyy').parse(ctx.modified_date); long dayCount = (System.currentTimeMillis() - modified.getTime())/(1000*60*60*24); double score = Math.exp((-0.7*Math.max(0, dayCount - 31))/1095) * Math.log10(ctx.access_count + 2); int count = ctx.terms_array.length; ctx.suggest = new ArrayList(); ctx.suggest.add([ 'input': ctx.terms_array[count - 1], 'weight': Math.round(score * (255 - count + 1)) ]); for (int i = count - 2; i >= 0 ; --i) { if (ctx.terms_array[i].trim() != "") { ctx.suggest.add([ "input": ctx.terms_array[i] + " " + ctx.suggest[ctx.suggest.length - 1].input, "weight": Math.round(score * (255 - i))]); } } ctx.remove('terms_array'); ctx.remove('access_count'); ctx.remove('modified_date');

Why bother with a painless pipeline instead of writing it in a more convenient language? Because now with the help of the Reindex API into the index for prompts, you can overtake the contents of search indexes (indicating only the required fields, of course) literally into one command.

The composition of really necessary publicly available documents is not often updated, so this command can be left on manual start.

Results display

Facets

Facets are such an intuitive thing for everyone, whose behavior, however, is described by very non-trivial rules. Here are a few of them:

The values of the facets depend on the search results, BUT and the search results depend on the selected facets. How to avoid recursion?
Choosing values within one facet does not affect other values of this facet, but affects values in other facets:

The facet values selected by the user should not disappear, even if a choice in another facet annihilates them to 0 or they are no longer in the top:

In elastica, facets are implemented through an aggregation mechanism, but in order to comply with the described rules, these aggregations have to be invested into each other and filtered by each other.

Consider the query fragments responsible for this:

Too big piece of code

 { ... "post_filter": { "bool": { "must": [ { "terms": { "card.author_value_id": [ "1951063" ] } }, { "terms": { "editor_value_id": [ "2337706", "300643" ] } } ] } }, "query": {...} "aggs": { "card.author_value_id": { "filter": { "terms": { "editor_value_id": [ "2337706", "300643" ] } }, "aggs": { "card.author_value_id": { "terms": { "field": "card.author_value_id", "size": 11, "exclude": [ "1951063" ], "missing": "" } }, "card.author_value_id_selected": { "terms": { "field": "card.author_value_id", "size": 1, "include": [ "1951063" ], "missing": "" } } } }, ... "editor_value_id": { "filter": { "terms": { "card.author_value_id": [ "1951063" ] } }, "aggs": { "editor_value_id": { "terms": { "field": "editor_value_id", "size": 11, "exclude": [ "2337706", "300643" ], "missing": "" } }, "editor_value_id_selected": { "terms": { "field": "editor_value_id", "size": 2, "include": [ "2337706", "300643" ], "missing": "" } } } }, ... } }

What is there that:

post_filter allows you to impose an additional condition on the results of an already executed query and does not affect the results of the aggregations. The same recursion gap. Includes all selected values of all facets.
top-level aggregations, in the example card.author_value_id and editor_value_id . Each has:
- filter by the values of all other facets except its own;
- nested aggregation for selected facet values - protection against annihilation ;
- nested aggregation for the remaining facet values. Showing the top 10, and requesting the top 11 - to determine whether to display the button Show All .

Snipples

Depending on the category chosen, the snippet may look different, for example, the same document when searching in a category

All :

and Employees :

Or remember, we wanted to see the subject of a commercial offer and from whom did it come?

In order not to drag the entire card from the elastic (this slows down the search), Source filtering is used :

 { ... "_source": { "includes": [ "id", "card.name", "card.card_type_value_id", "card.life_stage_value_id", "extension", ... ] }, "query": {...} ... }

For highlighting the found words in the text of the document, the Fast Vector highlighter is used - as it generates the most adequate snippets for large texts, and for the name - Unified highlighter - as the least demanding of resources and index structure:

 "highlight": { "pre_tags": [ "<strong>" ], "post_tags": [ "</strong>" ], "encoder": "html", "fields": { "card.name": { "number_of_fragments": 0 }, "content": { "fragment_size": 300, "number_of_fragments": 3, "type": "fvh" } } },

In this case, the name is highlighted in its entirety, and from the text we obtain up to 3 fragments with a length of 300 characters. The text returned by the Fast Vector highlighter is additionally compressed with a homemade algorithm to get the minimized state of the snippet.

Collapse

Historically, users of this ECM have become used to having documents returned to them, but in fact Elasticsearch is searching among document versions . It may happen that several almost identical versions will be found for the same query. This will clutter up the results and confuse the user. Fortunately, this behavior can be avoided with the help of the Field Collapsing mechanism - some lightweight variant of aggregation, which works already on the finished results (in this it resembles post_filter, two crutches are a pair ). The result of the collapse will be the most relevant of the collapsing objects.

 { ... "query": {...} ... "collapse": { "field": "id" } }

Unfortunately, the collapse has a number of unpleasant effects, for example, various numerical characteristics of the search result continue to return as if there was no collapse. That is, the number of results, the number of facet values - all will be slightly wrong, but the user usually does not notice this, as well as the tired reader, who has hardly read this sentence.

The end.

Source: https://habr.com/ru/post/460263/

All Articles