Highlighting large text fields in ElasticSearch

In December 2016, my friend and I started a new project - a collection-indexing-search system for documents. The system is built around ElasticSearch (hereinafter - ES), which we use as the main engine for full-text search.

We would like to share the valuable data acquired during the work on the project with our readers in a series of articles about ES. Let's start with the basis of any search engine - highlighting the search results (hereinafter - highlight).

Proper highlighting of search results is perhaps the most important criterion for the effectiveness of a search engine for a user. Firstly, the logic of including the document in the search results is visible, and secondly, the highlighting of the block of the found text makes it possible to quickly assess the context of the hit found.

One of the key requirements for our search engine was the ability to quickly and efficiently work with large files (over 100 MB). In this article, we will explain how to achieve high performance from ES when highlighting large fields of a document.

The screenshot below shows how the highlighting of search results in our project.

Sample Search Results with Highlight

The first step or the crux of the problem

So, we use ES for storing and searching in metadata and parsed file contents. An example of a document that we store in ES:

{ sha256: "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c", meta: [ { id: "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0", short_name: "quarter.pdf", full_name: "//winserver/store/reports/quarter.pdf", source_id: "crReports", extension: ".pdf", created_datetime: "2017-01-14 14:49:36.788", updated_datetime: "2017-01-14 14:49:37.140", extra: [], indexed_datetime: "2017-01-16 18:32:03.712" } ], content: { size: 112387192, /*   100 Mb */ indexed_datetime: "2017-01-16 18:32:33.321", author: "John Smith", processed_datetime: "2017-01-16 18:32:33.321", length: "", language: "", state: "processed", title: "Quarter Report (Q4Y2016)", type: "application/pdf", text: "....     ...." } }

As you may have guessed, this is a jarring content of a pdf file with a financial report of just over 100 MB in size. I deliberately shortened the field content.text , it is obvious that its length is approximately equal to 100 MB.

Let's make a simple experiment: take 1000 of these documents and index them with ES without using any special settings of the index or the ES itself. Let's see how fast the search and highlight will work on these documents.

Results:

Search match_phrase in the content.text field: from 5 to 30 seconds.
Formation of a highlight for the content.text field for each of the documents: more than 10 seconds.

Such performance is no good. The user expects to see results instantly (<200 ms), and not in tens of seconds. Let's figure out how to solve the problem of slow formation of a highlight. The problem of fast search in large files will be discussed in the next article of the cycle.

Selecting the highlight algorithm

In ES, it is possible to use three types of highlighters. See the official manual .
For those who are too lazy to read, on the fingers:

Plain is the default option, the slowest but the highest quality (according to ES, almost 100% reflects the Lucene search algorithm, and this is true), unloads the entire document into memory and re-analyzes it to form a highlight.
Postings - a faster highlighter, hits the field for sentences and pulls out for the highlight not the entire document, but the sentences where the token was found, ranking them using the BM25 algorithm. Requires an enrichment of the index positions of these very proposals.
Fast Vector Highlighting (FVH) - positioned as the fastest highlighter, especially for large documents. It requires the index to be enriched with the data on the positions of all tokens in the source document, thanks to which it forms a highlight in almost constant time, regardless of the size of the document.

As described above, the default ES in Plain highlighter is used. Thus, each time, to form ES highlights, it unloads all 100 megabytes of text into memory and, because of this, it responds to the request very, very slowly. We abandoned the Plain highlighter and decided to test the Postings and FVH. As a result, our choice fell on FVH for several reasons:

A document with a size of 100 MB FVH, on average, a highlight is about 10-20 ms, Postings spends about a second on this
Postings do not always correctly break the text into sentences, so the size of the resulting highlight jumps quite often (it can return 50 words, and maybe 300). With FVH, there was no such problem. It returns the specified number of tokens to both sides of the hit.
Postings of highlight tokens regardless of their position, so phrase highlighting in this case does not work correctly. For example, simple_string_query "Ivan Ivan Ivan" ~ 5 zhaylaytit not only cases when two tokens "Ivan" and "Ivan" will be at a distance of no more than 5 tokens from each other, but all the other tokens "Ivan" or "Ivan" in a given field of the document, as if it was just a bool request for a match "ivanov" and "ivan"

Pitfalls Fast Vector Highlighter

In the process of working with FVH, we noticed the following problem: the match_phrase search query "Ivan Ivan Ivan" finds the occurrences "Ivan Ivan Ivan" and "Ivan Ivan Ivanov", but FVH highlights only hits in the order specified in the query. This nuance is not mentioned in any manual on ES, in our opinion this error occurs as a result of the fact that FVH takes into account the positions of the tokens for the match_phrase request. We solved the problem in a detour way - we add a highlight_query field to the query in which all the possible positions of the tokens in the phrase are searched. This is the only way that allowed you to get all the highlights while maintaining performance at the proper level.

Total

Highlight large ES documents can indeed, with fast. It is important to properly configure the index, and take into account the features of the highlighter. If you solved a similar problem and found, as you think, a more elegant solution, tell us about it in the comments.

Source: https://habr.com/ru/post/320390/

All Articles

Highlighting large text fields in ElasticSearch

The first step or the crux of the problem

Selecting the highlight algorithm

Pitfalls Fast Vector Highlighter

Total

More articles: