📜 ⬆️ ⬇️

Highlighting large text fields in ElasticSearch

In December 2016, my friend and I started a new project - a collection-indexing-search system for documents. The system is built around ElasticSearch (hereinafter - ES), which we use as the main engine for full-text search.


We would like to share the valuable data acquired during the work on the project with our readers in a series of articles about ES. Let's start with the basis of any search engine - highlighting the search results (hereinafter - highlight).


Proper highlighting of search results is perhaps the most important criterion for the effectiveness of a search engine for a user. Firstly, the logic of including the document in the search results is visible, and secondly, the highlighting of the block of the found text makes it possible to quickly assess the context of the hit found.


One of the key requirements for our search engine was the ability to quickly and efficiently work with large files (over 100 MB). In this article, we will explain how to achieve high performance from ES when highlighting large fields of a document.


The screenshot below shows how the highlighting of search results in our project.


Sample Search Results with Highlight


The first step or the crux of the problem


So, we use ES for storing and searching in metadata and parsed file contents. An example of a document that we store in ES:


{ sha256: "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c", meta: [ { id: "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0", short_name: "quarter.pdf", full_name: "//winserver/store/reports/quarter.pdf", source_id: "crReports", extension: ".pdf", created_datetime: "2017-01-14 14:49:36.788", updated_datetime: "2017-01-14 14:49:37.140", extra: [], indexed_datetime: "2017-01-16 18:32:03.712" } ], content: { size: 112387192, /*   100 Mb */ indexed_datetime: "2017-01-16 18:32:33.321", author: "John Smith", processed_datetime: "2017-01-16 18:32:33.321", length: "", language: "", state: "processed", title: "Quarter Report (Q4Y2016)", type: "application/pdf", text: "....     ...." } } 

As you may have guessed, this is a jarring content of a pdf file with a financial report of just over 100 MB in size. I deliberately shortened the field content.text , it is obvious that its length is approximately equal to 100 MB.


Let's make a simple experiment: take 1000 of these documents and index them with ES without using any special settings of the index or the ES itself. Let's see how fast the search and highlight will work on these documents.


Results:



Such performance is no good. The user expects to see results instantly (<200 ms), and not in tens of seconds. Let's figure out how to solve the problem of slow formation of a highlight. The problem of fast search in large files will be discussed in the next article of the cycle.


Selecting the highlight algorithm


In ES, it is possible to use three types of highlighters. See the official manual .
For those who are too lazy to read, on the fingers:



As described above, the default ES in Plain highlighter is used. Thus, each time, to form ES highlights, it unloads all 100 megabytes of text into memory and, because of this, it responds to the request very, very slowly. We abandoned the Plain highlighter and decided to test the Postings and FVH. As a result, our choice fell on FVH for several reasons:



Pitfalls Fast Vector Highlighter


In the process of working with FVH, we noticed the following problem: the match_phrase search query "Ivan Ivan Ivan" finds the occurrences "Ivan Ivan Ivan" and "Ivan Ivan Ivanov", but FVH highlights only hits in the order specified in the query. This nuance is not mentioned in any manual on ES, in our opinion this error occurs as a result of the fact that FVH takes into account the positions of the tokens for the match_phrase request. We solved the problem in a detour way - we add a highlight_query field to the query in which all the possible positions of the tokens in the phrase are searched. This is the only way that allowed you to get all the highlights while maintaining performance at the proper level.


Total


Highlight large ES documents can indeed, with fast. It is important to properly configure the index, and take into account the features of the highlighter. If you solved a similar problem and found, as you think, a more elegant solution, tell us about it in the comments.


')

Source: https://habr.com/ru/post/320390/


All Articles