
ElasticSearch - mapping and search without surprises

This article looks at how and why to use mapping: whether you need it at all, and in which cases. I will give examples of how to set it up and share some useful tricks that can help improve search on your site.

If you are interested in the modern search engine ElasticSearch, read on.


This topic was chosen in the last general vote. I am posting a vote again in this article, so please take part. If readers are interested, I will try to write the most complete series of articles on ES that I can.

Why do you need mapping?


Mapping is similar to a table definition in SQL databases. We explicitly specify the type of each field along with additional parameters such as the analyzer, the default value, whether to keep the source, and so on. More details below.

We can specify the mapping when creating an index, thereby defining the mappings for all types in the index with a single request:
 $ curl -XPOST 'http://localhost:9200/test' -d '{
     "settings" : { "number_of_shards" : 1 },
     "mappings" : {
         "type1" : {
             "_source" : { "enabled" : false },
             "properties" : {
                 "field1" : { "type" : "string", "index" : "not_analyzed" }
             }
         }
     }
 }'
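The same request body can also be assembled in code before it is sent. Below is a minimal Python sketch, assuming a hypothetical helper build_index_body (not part of any ES client); it only builds and prints the JSON and does not talk to a server.

```python
import json

def build_index_body(type_name, properties, shards=1, source_enabled=True):
    """Build the JSON body for creating an index with one mapped type.

    Hypothetical helper for illustration only; the keys mirror the
    curl example above.
    """
    return {
        "settings": {"number_of_shards": shards},
        "mappings": {
            type_name: {
                "_source": {"enabled": source_enabled},
                "properties": properties,
            }
        },
    }

body = build_index_body(
    "type1",
    {"field1": {"type": "string", "index": "not_analyzed"}},
    source_enabled=False,
)

# This string is what would be passed to curl via -d.
print(json.dumps(body, indent=2))
```

Keeping the mapping as a plain dictionary makes it easy to reuse the same field definitions across several indices or types.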


We can also specify mapping directly for a certain type in the index:
 $ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{ "tweet" : { "properties" : { "message" : {"type" : "string", "store" : true } } } }' 


And we can specify mapping for several indices at once:
 $ curl -XPUT 'http://localhost:9200/kimchy,elasticsearch/tweet/_mapping' -d '{ ... }' 
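The URL pattern used in the two requests above is simple enough to generate. A small sketch, assuming a hypothetical helper mapping_url and the same local host as in the examples:

```python
def mapping_url(indices, type_name, host="http://localhost:9200"):
    """Return the _mapping URL for one or several indices and a type.

    Hypothetical helper: several index names are joined with commas,
    as in the curl example above.
    """
    if isinstance(indices, str):
        indices = [indices]
    return "{}/{}/{}/_mapping".format(host, ",".join(indices), type_name)

print(mapping_url("twitter", "tweet"))
print(mapping_url(["kimchy", "elasticsearch"], "tweet"))
```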


Is it necessary?


ES does not require an explicit definition of the data types in a document. In most simple cases it detects the data type correctly.
So why define it at all?
First, it keeps the code clean and gives you confidence about what is actually stored in the index.
Second, and more importantly, mapping lets you fine-tune the data and how it is processed: we can specify whether a field should be analyzed and whether the source should be stored. Let's look at most of these possibilities by example.
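To see why automatic detection is right only "in most simple cases", here is a rough Python sketch of how a type might be guessed from a JSON value. It is a deliberately simplified imitation, not the actual ES algorithm: the function guess_type and the single date format it checks are assumptions made for the illustration.

```python
from datetime import datetime

def guess_type(value):
    """Very rough imitation of dynamic type detection for a JSON value."""
    if isinstance(value, bool):       # check bool first: True is also an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only one date format is recognized in this sketch.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return guess_type(value[0]) if value else None
    return None                       # null carries no type information

doc = {"user": "kimchy", "priority": 3, "rank": 4.5,
       "postDate": "2014-06-25", "zip": "00123"}
for field, value in doc.items():
    print(field, "->", guess_type(value))
```

The guess depends entirely on the first value seen: a number sent as the string "3" would be treated as a string, which is exactly the kind of surprise an explicit mapping prevents.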

Base data types


The names speak for themselves. There are 7 basic types: string, integer / long, float / double, boolean, null. The example below also uses the date type.

Example:
 $ curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
     "tweet" : {
         "_source" : {"enabled" : false},
         "properties" : {
             "user" : {"type" : "string", "index" : "not_analyzed"},
             "message" : {"type" : "string", "null_value" : "na", "store": true},
             "postDate" : {"type" : "date"},
             "priority" : {"type" : "integer"},
             "rank" : {"type" : "float", "index_name" : "rating"}
         }
     }
 }'


Here we have specified additional parameters:
  1. "_source" : {"enabled" : false} - tells ES not to store the source document for this type. When might you need this? For example, when you have a very heavy document full of information that only needs to be indexed and never has to be returned in the response
  2. "store": true for the message field - indicates that the original value of this field must be stored in the index
  3. "index" : "not_analyzed" - indicates that this field should not be analyzed, i.e. it is indexed as is (analyzers are described in the ES documentation)
  4. "null_value" : "na" - the default value for the field, used when the field is null
  5. "index_name" : "rating" - specifies an alias for the field. Now we can refer to it both as "rank" and as "rating"
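The effect of "index" : "not_analyzed" and "null_value" from the list above can be illustrated with a small Python sketch. The function index_terms and its crude tokenizer are assumptions made for the illustration; real analyzers are configurable and considerably smarter.

```python
import re

def index_terms(value, analyzed=True, null_value=None):
    """Return the terms that would end up in the index for one field value."""
    if value is None:
        value = null_value           # "null_value" replaces an explicit null
    if value is None:
        return []                    # nothing is indexed for this field
    if not analyzed:
        return [value]               # "not_analyzed": a single term, kept as is
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

print(index_terms("John Smith"))                   # ['john', 'smith']
print(index_terms("John Smith", analyzed=False))   # ['John Smith']
print(index_terms(None, null_value="na"))          # ['na']
```

With analysis on, a search for "smith" finds the document; with "not_analyzed", only the exact value "John Smith" matches.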


Note: By default _source is enabled, so the entire document is stored in the index in its original form and returned on request. This works faster than storing separate fields in the index, provided that your document is not huge; if it is, storing only the required fields can pay off. Therefore, I do not recommend touching this setting without a good reason.

Types of array / object / nested

We can describe not only the array field itself, but also the type of each field inside its elements. Here is an example:
 # source
 {
     "tweet" : {
         "message" : "some arrays in this tweet...",
         "lists" : [
             { "name" : "prog_list", "description" : "programming list" },
             { "name" : "cool_list", "description" : "cool stuff list" }
         ]
     }
 }

 # mapping
 {
     "tweet" : {
         "properties" : {
             "message" : {"type" : "string"},
             "lists" : {
                 "properties" : {
                     "name" : {"type" : "string"},
                     "description" : {"type" : "string"}
                 }
             }
         }
     }
 }

For objects, everything is the same, except that an object can be dynamic (and is by default).
That is, you can add a new field to an object at any time and it will be added without errors.
You can disable this with "dynamic" : false. More details are in the ES documentation on the object type.
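The difference between a dynamic and a non-dynamic object can be sketched as follows. This is a simplified Python model, assuming a hypothetical function index_fields; it only tracks which fields get into the mapping and uses a placeholder type for newly added fields.

```python
def index_fields(mapping, doc):
    """Return the names of the document fields that get indexed.

    With dynamic mapping (the default) unknown fields are added to the
    mapping; with "dynamic": false they are skipped.
    """
    properties = mapping.setdefault("properties", {})
    dynamic = mapping.get("dynamic", True)
    indexed = []
    for name in doc:
        if name not in properties:
            if not dynamic:
                continue                           # unknown field: not indexed
            properties[name] = {"type": "string"}  # placeholder type for the sketch
        indexed.append(name)
    return indexed

open_mapping = {"properties": {"name": {"type": "string"}}}
closed_mapping = {"dynamic": False, "properties": {"name": {"type": "string"}}}
doc = {"name": "prog_list", "extra": "new field"}

print(index_fields(open_mapping, doc))     # ['name', 'extra']
print(index_fields(closed_mapping, doc))   # ['name']
```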

Nested type

Essentially, we define a document within a document. Why do you need it? A great example from the documentation:
 { "obj1" : [ { "name" : "blue", "count" : 4 }, { "name" : "green", "count" : 6 } ] } 


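If we search for name = blue && count > 5, this document will be found, even though no single object satisfies both conditions: the inner objects are flattened into separate lists of values for obj1.name and obj1.count, so the link between a name and its count is lost. To avoid this scenario, use the nested type.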
Example:
 {
     "type1" : {
         "properties" : {
             "obj1" : {
                 "type" : "nested",
                 "properties": {
                     "name" : {"type": "string", "index": "not_analyzed"},
                     "count" : {"type": "integer"}
                 }
             }
         }
     }
 }


It is not necessary to specify properties for the elements of the object; ES will add them automatically.
To search within a nested type, use the nested query or the nested filter.
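The difference between a plain object array and the nested type described in this section can be reproduced in a few lines of Python. This is only a model of the matching logic on the document from the example, not how ES executes queries; both function names are made up for the sketch.

```python
doc = {"obj1": [{"name": "blue", "count": 4}, {"name": "green", "count": 6}]}

def matches_flattened(doc):
    """Object type: inner fields become independent lists of values."""
    names = [o["name"] for o in doc["obj1"]]     # obj1.name  -> ["blue", "green"]
    counts = [o["count"] for o in doc["obj1"]]   # obj1.count -> [4, 6]
    return "blue" in names and any(c > 5 for c in counts)

def matches_nested(doc):
    """Nested type: both conditions must hold inside the same inner object."""
    return any(o["name"] == "blue" and o["count"] > 5 for o in doc["obj1"])

print(matches_flattened(doc))  # True: "blue" and 6 come from different objects
print(matches_nested(doc))     # False: no single object satisfies both conditions
```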

Multi-fields


Starting with version 1.0, the handy "fields" parameter is available for all base types (but not for nested and object).
What does it do? It lets you specify several different mapping settings for a single field.
Why might you need it? Suppose you have a field that you want both to search on and to group by. If you disable analysis, search will not work to its full potential; if you enable it, grouping will use the processed data rather than the raw values. For example, after analysis "St. Petersburg" becomes "St." and "Petersburg" (the exact tokens may differ, but this will do for the example). If we group by this field, we will not get what we wanted.

Example:
 "title": { "type": "string", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } } 

Now we can use "title" for searching and "title.raw" for grouping and any kind of sorting.
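The grouping problem that multi-fields solve is easy to reproduce. In the Python sketch below, the analyze function is a crude assumed stand-in for an analyzer (the real tokens depend on the analyzer you configure), and Counter plays the role of grouping.

```python
from collections import Counter

titles = ["St. Petersburg", "St. Petersburg", "Moscow"]

def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on spaces and dots."""
    return [t for t in text.lower().replace(".", " ").split() if t]

# Grouping by the analyzed field counts individual terms...
by_analyzed = Counter(term for title in titles for term in analyze(title))
# ...while grouping by the raw sub-field counts the original values.
by_raw = Counter(titles)

print(by_analyzed)
print(by_raw)
```

Grouping by the analyzed field yields buckets for "st" and "petersburg" instead of one bucket for "St. Petersburg", which is why the raw sub-field is used for grouping.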

The remaining types

ES supports 4 more data types:
  1. ip type - stores IP addresses as numbers
  2. geo point type - stores coordinates (convenient when searching for the objects nearest to a given coordinate)
  3. geo shape type - a rather specific type for storing polygons
  4. attachment type - stores files in the index in base64-encoded form. Usually used together with its own analyzer. (Although, in my opinion, it is a dubious pleasure)

I have not covered these types in detail, because they are either quite specific or do not differ fundamentally from the ones discussed above (the ip type, for example).
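As a small illustration of what storing an IP address "as a number" means, here is a Python sketch using the standard ipaddress module. The helper names are made up, and the sketch covers IPv4 only; it shows why range checks on such a field reduce to ordinary numeric comparisons.

```python
import ipaddress

def ip_to_number(ip):
    """Convert a dotted IPv4 address to its integer representation."""
    return int(ipaddress.IPv4Address(ip))

def in_range(ip, low, high):
    """Range check on addresses becomes a plain numeric comparison."""
    return ip_to_number(low) <= ip_to_number(ip) <= ip_to_number(high)

print(ip_to_number("192.168.0.1"))                                # 3232235521
print(in_range("192.168.0.17", "192.168.0.0", "192.168.0.255"))   # True
```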

I hope I have managed to explain the main features of mapping in ES clearly. If you have questions, I will be glad to answer them.

Other articles on ES:
ElasticSearch - data aggregation
ElasticSearch and search vice versa. Percolate API



Source: https://habr.com/ru/post/227531/

