We continue our series of articles about how we got to grips with ES while building Ambar. The first article of the series covered highlighting large text fields in ElasticSearch.
In this article we will talk about how to make ES search quickly through documents larger than 100 MB. With the naive, head-on approach, a search in such documents takes tens of seconds. We managed to bring it down to 6 ms.
Interested readers, welcome under the cut.
As you know, all search in ES is built around the _source field: the original document that was sent to ES and then indexed by Lucene.
Recall an example of a document that we store in ES:
{
  "sha256": "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c",
  "meta": [
    {
      "id": "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0",
      "short_name": "quarter.pdf",
      "full_name": "//winserver/store/reports/quarter.pdf",
      "source_id": "crReports",
      "extension": ".pdf",
      "created_datetime": "2017-01-14 14:49:36.788",
      "updated_datetime": "2017-01-14 14:49:37.140",
      "extra": [],
      "indexed_datetime": "2017-01-16 18:32:03.712"
    }
  ],
  "content": {
    "size": 112387192,
    "indexed_datetime": "2017-01-16 18:32:33.321",
    "author": "John Smith",
    "processed_datetime": "2017-01-16 18:32:33.321",
    "length": "",
    "language": "",
    "state": "processed",
    "title": "Quarter Report (Q4Y2016)",
    "type": "application/pdf",
    "text": ".... ...."
  }
}
For Lucene, the _source is an atomic unit that by default contains all the fields of a document. A Lucene index is a sequence of tokens from all fields of all documents.
So, the index contains N documents. Each document has about two dozen fields, all of them quite short, mostly of the keyword and date types, with the exception of the long text field content.text.
Now let's try, to a first approximation, to understand what happens when you search on any field of such documents. For example, we want to find documents created after January 14, 2017. To do this, we run the following query:
curl -X POST -H "Content-Type: application/json" -d '{
  "query": {
    "range": {
      "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" }
    }
  }
}' "http://ambar:9200/ambar_file_data/_search"
You will not see the result of this query any time soon, for several reasons:
Firstly, all fields of all documents will be involved in the search, although, one would think, we do not need them when we filter only by creation date. This happens because the atomic unit for Lucene is the _source, and by default the index consists of a sequence of tokens from all fields of all documents.
Secondly, while building the search results, ES will load whole documents from the index, together with the huge content.text field we do not need.
Thirdly, having collected these huge documents in memory, ES will try to send them to us in a single response.
OK, the third problem is easy to solve by adding source filtering to the request, as in the sketch below. How to deal with the rest?
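For example, a minimal sketch of the same date-range query with source filtering added, so that the bulky content.text field is not returned with the hits (the index and field names follow the examples above):

curl -X POST -H "Content-Type: application/json" -d '{
  "_source": { "excludes": ["content.text"] },
  "query": {
    "range": {
      "meta.created_datetime": { "gt": "2017-01-14 00:00:00.000" }
    }
  }
}' "http://ambar:9200/ambar_file_data/_search"

Note that this only trims what comes back in the response; Lucene still stores and reads the full _source, which is what the next steps address.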
Obviously, searching, loading into memory, and serializing results together with the huge content.text field is a bad idea. To avoid this, we need to make Lucene store and process large fields separately from the rest of the document's fields. The necessary steps are described below.
First, in the mapping for the large field you must specify the parameter store: true. This tells Lucene to store the field separately from the _source, i.e. from the rest of the document. It is important to understand that logically this field is not excluded from the _source! Lucene simply assembles the document in two steps when it is requested: it takes the _source and adds the stored content.text field to it.
Second, we need to tell Lucene that the "heavy" field should no longer be included in the _source. Then, when searching, we will no longer load 100 MB documents into memory. To do this, add the following lines to the mapping:
_source: { excludes: [ "content.text" ] }
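Putting both steps together, creating the index might look roughly like this. This is only a sketch: the field types are assumptions based on the document example above, and on the ES versions of that time (2.x/5.x) the same mapping block sits under the document type:

curl -X PUT -H "Content-Type: application/json" -d '{
  "mappings": {
    "_source": { "excludes": ["content.text"] },
    "properties": {
      "meta": {
        "properties": {
          "short_name": { "type": "keyword" },
          "created_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" }
        }
      },
      "content": {
        "properties": {
          "size": { "type": "long" },
          "text": { "type": "text", "store": true }
        }
      }
    }
  }
}' "http://ambar:9200/ambar_file_data"

With such a mapping, content.text is still analyzed and searchable; it is simply kept as a separate stored value instead of living inside the _source.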
So, what we get in the end: when a document is added to the index, the _source is indexed without the "heavy" content.text field, which is indexed separately. A search on any "light" field does not involve content.text at all, so for such queries Lucene works with trimmed documents of a couple of hundred bytes instead of 100 MB, and the search is very fast. A search on the "heavy" field is also possible and efficient: it is now performed over an array of fields of the same type. Searching the "heavy" and "light" fields of one document at the same time is possible and efficient as well. It is done in three stages:
a search on the "light" fields (the _source)
a search on the "heavy" field (content.text)
merging the results and loading the separately stored content.text field (see the stored_fields sketch below)
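Since content.text no longer lives in the _source, it has to be requested explicitly when you want it back in the hits. A minimal sketch of such a request, assuming the index and field names used above (stored_fields is the parameter name in ES 5.x and later; older versions call it fields):

curl -X POST -H "Content-Type: application/json" -d '{
  "stored_fields": ["content.text"],
  "_source": true,
  "query": {
    "match_phrase": { "content.text": "Ivan Ivanov" }
  }
}' "http://ambar:9200/ambar_file_data/_search"

The stored field is returned in the fields section of each hit rather than inside _source.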
To assess the speed, we will search for the phrase "Ivan Ivanov" in the content.text field, filtering by the content.size field so that only documents larger than 100 MB are considered. An example request is shown below:
curl -X POST -H "Content-Type: application/json" -d '{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "range": { "content.size": { "gte": 100000000 } } },
        { "match_phrase": { "content.text": "Ivan Ivanov" } }
      ]
    }
  }
}' "http://ambar:9200/ambar_file_data/_search"
Our test index contains about 3.5 million documents, and all of this runs on a single modest machine (16 GB of RAM, ordinary RAID 10 storage on SATA drives). The total performance gain is about 1,100 times compared with the naive approach. Agree, for the sake of such a result it was worth spending a few evenings studying how Lucene and ElasticSearch work, and a few more days writing this article. But our approach has one pitfall.
If you store a field separately and exclude it from the _source, a rather unpleasant pitfall awaits you, about which there is practically no information publicly available or in the ES manuals.
The problem is this: you cannot partially update a field of the _source with an update script without losing the separately stored field! If, for example, you add a new object to the meta array with a script, ES has to reindex the whole document (which is natural), but the separately stored content.text field is lost in the process. As a result you get an updated document whose stored_fields contain nothing but the _source. So if you need to update any of the _source fields, you will have to rewrite the stored field along with them.
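For illustration, a script update of the kind that triggers this behavior might look roughly like this (a sketch assuming an ES 5.x-style URL with a hypothetical document type ambar_file and a shortened document id; the new meta object is made up):

# Hypothetical partial update: appends an object to the meta array.
# After it runs, the separately stored content.text is lost, so the full
# document (including content.text) has to be reindexed to restore it.
curl -X POST -H "Content-Type: application/json" -d '{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.meta.add(params.newMeta)",
    "params": {
      "newMeta": { "id": "...", "short_name": "quarter_v2.pdf" }
    }
  }
}' "http://ambar:9200/ambar_file_data/ambar_file/1a4ad2c5.../_update"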
For us, this is the second time we have used ES in a large project, and once again we were able to solve all our problems while keeping search fast and efficient. ES is really very good, you just need to be patient and know how to set it up properly.
Source: https://habr.com/ru/post/321352/