ElasticSearch - data aggregation

In this article we will look at how to implement data aggregation correctly and why you might need it, and back it all up with a handful of working examples.

If you want to make your ES queries more interesting and see ordinary search from a different angle, read on.

In the poll under the previous article, readers split evenly between an article on a simpler topic and one on a more complex topic. So I picked a topic that is not very complicated but fairly fresh: it was added to ES relatively recently (v1.0) and offers some rather interesting functionality.

Aggregation module

This module replaced Facets in ES, and for good: Facets are now considered deprecated and will be removed in upcoming releases. Aggregations were added only in v1.0.0RC1 and ES is now past 1.2, but I still do not recommend using Facets.
Why was it necessary to replace a working tool?
Probably the main feature of aggregations is that they can be nested. Here is the general syntax of the query:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

As the structure shows, a query can contain any number of aggregations, and each one can contain nested aggregations with no limit on depth.
Nesting lets us collect some very interesting statistics (see the example at the end of the article).
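To make the nesting rule concrete, here is a minimal Python sketch that assembles a request body following the general syntax above. The helper `aggregation` and the labels `by_sport` and `rating_sum` are hypothetical names chosen for this illustration, and the fields `sport` and `rating` come from the sample data used later in the article.

```python
import json


def aggregation(agg_type, body, **sub_aggregations):
    """Build one aggregation node: its type, its body, and optional nested aggregations."""
    node = {agg_type: body}
    if sub_aggregations:
        # Nested aggregations live under their own "aggregations" key, next to the type.
        node["aggregations"] = sub_aggregations
    return node


# "by_sport" and "rating_sum" are labels invented for this sketch.
request_body = {
    "aggregations": {
        "by_sport": aggregation(
            "terms",
            {"field": "sport"},
            rating_sum=aggregation("sum", {"field": "rating"}),
        )
    }
}

print(json.dumps(request_body, indent=2))
```

The printed JSON can be used as the body of a _search request. Further levels are added by passing more keyword arguments to the inner `aggregation` calls.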

Types of aggregations

There are many types of aggregations, but they all fall into 2 main families:

- Bucketing
For ease of understanding, this can be compared with the familiar GROUP BY. This is a rather simplified comparison, but the principle is similar. A bucketing aggregation sorts documents into groups (buckets), where each bucket has a criterion, in effect a filter, that decides whether a document belongs to it. A good example is the terms aggregation.

- Metric
These aggregations compute a value over a specific set of documents. An example is the sum aggregation.

I think that is enough theory to start with. Anyone interested in more fundamental information about this module can find it at this link.
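The difference between the two families can be illustrated without ES at all. The following Python sketch is only an emulation under simplifying assumptions: `terms_buckets` and `sum_metric` are hypothetical helpers, not ES APIs, and the three documents are shaped like the sample data used later in this article, with ratings written as integers.

```python
from collections import defaultdict

# Three documents shaped like the article's sample data.
docs = [
    {"name": "Michael", "sport": "Baseball", "rating": [5, 4]},
    {"name": "Bob", "sport": "Baseball", "rating": [3, 4]},
    {"name": "Bank", "sport": "Golf", "rating": [6, 4]},
]


def terms_buckets(documents, field):
    """Bucketing: split documents into groups by the value of one field."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def sum_metric(documents, field):
    """Metric: compute a single number over a set of documents."""
    return sum(value for doc in documents for value in doc[field])


buckets = terms_buckets(docs, "sport")
doc_counts = {key: len(group) for key, group in buckets.items()}
rating_sums = {key: sum_metric(group, "rating") for key, group in buckets.items()}

print(doc_counts)   # number of documents per bucket
print(rating_sums)  # metric computed inside each bucket
```

Applying the metric inside each bucket is exactly what nesting a metric aggregation under a bucketing aggregation expresses.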

Simple example

If you want to try everything hands-on, I suggest using this dump.

Structure and data for the test

The dump is shamelessly borrowed from this wonderful article.

curl -XPUT "http://localhost:9200/sports/" -d'
{
  "mappings": {
    "athlete": {
      "properties": {
        "birthdate": { "type": "date", "format": "dateOptionalTime" },
        "location": { "type": "geo_point" },
        "name": { "type": "string" },
        "rating": { "type": "integer" },
        "sport": { "type": "string" }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/sports/_bulk" -d'
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"], "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"], "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"], "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"], "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"], "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"], "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"], "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"], "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"], "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"], "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"], "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"], "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"athlete"}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
'

Let's group athletes by their sport and find out how many are in each sport:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
  "size": 0,
  "aggregations": {
    "the_name": {
      "terms": { "field": "sport" }
    }
  }
}'

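Here we use the terms aggregation, which groups documents by the value of the sport field.
The top-level "size": 0 tells ES to return no search hits, only the aggregation results. It should not be confused with the "size" parameter inside the terms aggregation itself, where 0 is automatically replaced by Integer.MAX_VALUE and means that we need all the buckets without exception. In our case speed is not important, but keep in mind that a more accurate result takes more time.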

Answer:

{
  ...
  "aggregations" : {
    "the_name" : {
      "buckets" : [
        { "key" : "baseball", "doc_count" : 16 },
        { "key" : "golf", "doc_count" : 2 },
        { "key" : "basketball", "doc_count" : 1 },
        { "key" : "football", "doc_count" : 1 },
        { "key" : "hockey", "doc_count" : 1 }
      ]
    }
  }
}

Great, baseball players are in the majority.
Now let's sort the athletes by their average rating, from highest to lowest:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
  "size": 0,
  "aggregations": {
    "the_name": {
      "terms": {
        "field": "name",
        "order": { "rating_avg": "desc" }
      },
      "aggregations": {
        "rating_avg": {
          "avg": { "field": "rating" }
        }
      }
    }
  }
}'

Here you can clearly see what a nested aggregation is and how it helps us slice the documents as flexibly as possible.
First we say that athletes should be grouped by name, and then that the buckets should be ordered by rating_avg, which is computed by the avg aggregation over the rating field. Notice how gracefully ES handles arrays ( "rating" : [10, 9] ) and easily computes the average.

Answer:

{
  ...
  "aggregations" : {
    "the_name" : {
      "buckets" : [
        { "key" : "brady", "doc_count" : 1, "rating_avg" : { "value" : 10.0 } },
        { "key" : "wayne", "doc_count" : 1, "rating_avg" : { "value" : 10.0 } },
        { "key" : "james", "doc_count" : 1, "rating_avg" : { "value" : 9.0 } },
        { "key" : "bingo", "doc_count" : 1, "rating_avg" : { "value" : 8.5 } },
        ... {} ...
        { "key" : "duke", "doc_count" : 1, "rating_avg" : { "value" : 3.5 } },
        { "key" : "bob", "doc_count" : 1, "rating_avg" : { "value" : 3.5 } }
      ]
    }
  }
}
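The way the average is taken over an array field and then used for ordering can be reproduced in a few lines of Python. This is a rough emulation, not how ES computes it internally: `terms_ordered_by_avg` is a hypothetical helper, only four documents from the dump are used, and the keys are lowercased simply to resemble the keys in the response above.

```python
from statistics import mean

# Four documents from the article's dump, ratings written as integers.
docs = [
    {"name": "Bob", "rating": [3, 4]},
    {"name": "Wayne", "rating": [10, 10]},
    {"name": "Bingo", "rating": [10, 7]},
    {"name": "James", "rating": [10, 8]},
]


def terms_ordered_by_avg(documents, bucket_field, value_field):
    """Group by bucket_field, average every array element, order buckets descending."""
    groups = {}
    for doc in documents:
        # Lowercasing is an assumption made to mimic the keys seen in the response.
        key = doc[bucket_field].lower()
        groups.setdefault(key, []).extend(doc[value_field])
    buckets = [
        {"key": key, "rating_avg": mean(values)} for key, values in groups.items()
    ]
    return sorted(buckets, key=lambda bucket: bucket["rating_avg"], reverse=True)


buckets = terms_ordered_by_avg(docs, "name", "rating")
for bucket in buckets:
    print(bucket["key"], bucket["rating_avg"])
```

For these four documents the averages agree with the corresponding buckets in the response: 10.0 for wayne, 9.0 for james, 8.5 for bingo and 3.5 for bob.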

Another great feature of aggregations is support for "script". For example:

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
  "size": 0,
  "aggregations": {
    "age_ranges": {
      "range": {
        "script": "DateTime.now().year - doc[\"birthdate\"].date.year",
        "ranges": [
          { "from": 22, "to": 25 }
        ]
      }
    }
  }
}'

Starting with version 1.2.0, dynamic script execution is disabled by default. You can turn it on, provided that users have no direct access to ES (I hope that is the case; otherwise I advise you to close that access immediately for the sake of your data).
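What the range script above computes can be sketched in plain Python. Two assumptions are made here: a fixed date in 2014 stands in for DateTime.now() so that the result is reproducible, and the helpers `age_by_year` and `in_range` are hypothetical names. Like the script, the sketch subtracts years only and ignores month and day. A range bucket includes its "from" value and excludes its "to" value.

```python
from datetime import date


def age_by_year(birthdate, today):
    """Mirror of the script: current year minus birth year (month and day ignored)."""
    return today.year - birthdate.year


def in_range(value, lower, upper):
    """A range bucket includes 'from' and excludes 'to'."""
    return lower <= value < upper


# Assumption: a fixed "now" so that the output does not depend on the run date.
today = date(2014, 6, 20)

# Birthdates of three athletes from the dump.
birthdates = {
    "Michael": date(1989, 10, 1),
    "Joe": date(1992, 5, 20),
    "Will": date(1988, 3, 1),
}

ages = {name: age_by_year(born, today) for name, born in birthdates.items()}
in_bucket = sorted(name for name, age in ages.items() if in_range(age, 22, 25))

print(ages)
print(in_bucket)
```

With this fixed date Michael is 25 by the script's arithmetic and therefore falls outside the 22 to 25 bucket, which is easy to overlook when choosing range boundaries.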

Aggregation in all its glory, or something more complicated

Let's find all athletes within a 20-mile radius of the point "46.12,-68.55", group them by sport, and display detailed rating statistics for each sport.
Sounds good, so here is the example.

curl -XPOST "http://localhost:9200/sports/athlete/_search?pretty" -d'
{
  "size": 0,
  "aggregations": {
    "baseball_player_ring": {
      "geo_distance": {
        "field": "location",
        "origin": "46.12,-68.55",
        "unit": "mi",
        "ranges": [
          { "from": 0, "to": 20 }
        ]
      },
      "aggregations": {
        "sport": {
          "terms": { "field": "sport" },
          "aggregations": {
            "rating_stats": {
              "stats": { "field": "rating" }
            }
          }
        }
      }
    }
  }
}'

Answer:

{
  ...
  "aggregations" : {
    "baseball_player_ring" : {
      "buckets" : [
        {
          "key" : "*-20.0",
          "from" : 0.0,
          "to" : 20.0,
          "doc_count" : 13,
          "sport" : {
            "buckets" : [
              {
                "key" : "baseball",
                "doc_count" : 8,
                "rating_stats" : { "count" : 14, "min" : 2.0, "max" : 5.0, "avg" : 3.357142857142857, "sum" : 47.0 }
              },
              {
                "key" : "golf",
                "doc_count" : 2,
                "rating_stats" : { "count" : 4, "min" : 4.0, "max" : 10.0, "avg" : 6.75, "sum" : 27.0 }
              },
              {
                "key" : "basketball",
                "doc_count" : 1,
                "rating_stats" : { "count" : 2, "min" : 8.0, "max" : 10.0, "avg" : 9.0, "sum" : 18.0 }
              },
              {
                "key" : "football",
                "doc_count" : 1,
                "rating_stats" : { "count" : 1, "min" : 10.0, "max" : 10.0, "avg" : 10.0, "sum" : 10.0 }
              },
              {
                "key" : "hockey",
                "doc_count" : 1,
                "rating_stats" : { "count" : 1, "min" : 10.0, "max" : 10.0, "avg" : 10.0, "sum" : 10.0 }
              }
            ]
          }
        }
      ]
    }
  }
}
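The three nested levels of this query (distance ring, then sport, then rating statistics) can be emulated in Python to see what each level contributes. This is only a sketch under stated assumptions: it uses the haversine formula with a mean Earth radius of 3958.8 miles, which may differ slightly from the distance calculation ES uses, it takes six documents from the dump, and `haversine_miles`, `stats` and `ring_stats` are hypothetical helpers.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Six documents from the article's dump, ratings written as integers.
docs = [
    {"name": "Michael", "sport": "Baseball", "rating": [5, 4], "location": (46.22, -68.45)},
    {"name": "Bob", "sport": "Baseball", "rating": [3, 4], "location": (45.21, -68.35)},
    {"name": "Jim", "sport": "Baseball", "rating": [3, 2], "location": (45.16, -63.58)},
    {"name": "Tim", "sport": "Baseball", "rating": [3, 3], "location": (46.22, -68.85)},
    {"name": "Bank", "sport": "Golf", "rating": [6, 4], "location": (46.25, -68.55)},
    {"name": "Bingo", "sport": "Golf", "rating": [10, 7], "location": (46.25, -68.55)},
]


def stats(values):
    """The same five numbers that the stats aggregation reports."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def ring_stats(documents, origin, max_miles):
    """Level 1: keep documents inside the ring. Level 2: group by sport. Level 3: stats."""
    by_sport = {}
    for doc in documents:
        if haversine_miles(*origin, *doc["location"]) < max_miles:
            by_sport.setdefault(doc["sport"], []).extend(doc["rating"])
    return {sport: stats(values) for sport, values in by_sport.items()}


result = ring_stats(docs, (46.12, -68.55), 20)
for sport, sport_stats in result.items():
    print(sport, sport_stats)
```

Bob and Jim are more than 20 miles from the origin and drop out at the first level. The golf numbers agree with the response above because both golfers are in the sample. Note that this sketch counts every array element, which may differ from how ES counts repeated values inside one array: the response reports only 14 baseball values for 8 documents.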

Conclusion

I hope I managed to convey the general capabilities of this wonderful module. If the topic interests you, I advise you to read through the entire list of filters at this link.
I would be glad to see any useful comments and additions on the topic.

You can also read my previous article on ES: "ElasticSearch and reverse search. Percolate API".
And please take part in the poll at the bottom of the article.


Source: https://habr.com/ru/post/227131/
