
Monitoring Elasticsearch without pain and suffering

"And it does magic there"
Someone from those whom I remotely advised on Elastic.

I always say that I believe in three things: monitoring, logs and backups.

The topic of how we collect and store logs was fully covered in previous articles, and backups in Elasticsearch are a completely separate story, so in this, perhaps final, article of the series I will tell you how my favorite cluster is monitored. This is not particularly difficult (and requires no additional plugins or third-party services), because the REST API that Elasticsearch itself provides is simple, clear, and easy to use. All you have to do is dig a little into its internals and understand what all those metrics mean, the thread pools, the weighting of shard distribution across nodes, the queue settings, and there will be no more questions about what kind of "magic" the elastic is up to right now.


At the recent Highload++ 2017 conference I talked about how I built the cluster of my dreams, and I said that it is not enough just to build a service. It is critical to know its state at any moment, and the monitoring must be multi-level. Wake me up in the middle of the night (hello to the monitoring department!) and within two minutes I will know the state of the cluster. And one of those two minutes will be spent connecting to the corporate VPN and logging in to Zabbix.

Disclaimer: the setup described is an Elasticsearch 5.x cluster with dedicated master, data, and client nodes. The cluster is divided into "hot" and "warm" segments (hot/warm cluster architecture). The screenshots are from the Zabbix monitoring system, but the general principles of ES monitoring apply to a cluster of almost any configuration and version.
So, the levels of monitoring from the bottom up:

Server level


The lowest level of monitoring is hardware and the basic metrics collected from any server, but in the case of an Elasticsearch cluster some of them deserve special attention. Namely:


I have specifically singled out network errors: elastic is extremely sensitive to network problems. By default the currently active master pings the other nodes once a second, and the nodes, in turn, check the availability of the master node. If at some point a node fails to answer pings for 30 seconds, it is marked as unreachable, with all the consequences: a cluster rebalance and lengthy integrity checks. And that is not at all what a properly designed cluster should be busy with in the middle of a working day.

Service level


One level up, but the monitoring here is still completely standard:


If any of these metrics drops to zero, it means the application has crashed or hung, and a disaster-level alert is raised immediately.

Application level


This is where the interesting part begins. Since Elasticsearch is written in Java, the service runs as an ordinary JVM application, and at first I monitored it with the Jolokia agent, which has become the de facto standard for monitoring Java applications in our company.

Jolokia, as its developers' website puts it, is "JMX on capsaicin": in essence, a JMX-to-HTTP gateway. It runs alongside the application as a javaagent, listens on a port, accepts HTTP requests, and returns JMX metrics wrapped in JSON, and does it quickly, unlike Zabbix's own slow JMX agent. It supports authorization, is firewall-friendly, and allows access to be restricted down to individual MBeans. In short, a great tool for monitoring Java applications, which in the case of Elasticsearch turned out to be... completely unnecessary.

The thing is, Elasticsearch provides an excellent API of its own that covers every monitoring need, including the same JVM metrics that Jolokia exposes. It accepts HTTP requests and returns JSON; what more could you ask for?
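For example, here is a minimal sketch of talking to that API from Python with nothing but the standard library; it assumes the cluster answers at localhost:9200 without authentication, and the es_get() helper defined here is reused in the sketches further down.

    import json
    import urllib.request

    # Every monitoring endpoint is just an HTTP GET returning JSON.
    # Assumes the cluster is reachable at localhost:9200 without authentication.
    ES_URL = "http://localhost:9200"

    def es_get(path):
        """Fetch an Elasticsearch API endpoint and decode the JSON response."""
        with urllib.request.urlopen(ES_URL + path) as resp:
            return json.loads(resp.read().decode("utf-8"))

    if __name__ == "__main__":
        health = es_get("/_cluster/health")
        print(health["status"], health["number_of_nodes"])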

Let's go through the API in more detail, from simple to complex:

Cluster level


_cluster/health


General cluster state metrics. The most important ones are:


Based on these metrics, the following graph is built:


In an ideal state all of these metrics should be zero. Alarms on them are raised if:


The rest of the metrics do not critically affect the operation of the cluster, so no alarms are set on them, but they are collected and graphed for observation and periodic analysis.
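As a rough sketch, this is how a few of the "should stay at zero" fields can be pulled from _cluster/health (reusing the es_get() helper from the first sketch); the particular field set chosen here is my assumption, not necessarily the exact list used in the original setup.

    # Sketch: fields from _cluster/health that are expected to stay at zero.
    # The chosen set is an assumption; adjust it to your own alerting rules.
    # Reuses es_get() from the first sketch.
    health = es_get("/_cluster/health")

    should_be_zero = [
        "relocating_shards",
        "initializing_shards",
        "unassigned_shards",
        "number_of_pending_tasks",
    ]

    for field in should_be_zero:
        value = health.get(field, 0)
        print("{0} = {1} ({2})".format(field, value, "OK" if value == 0 else "ALARM"))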

_cluster/stats


Extended statistics on the cluster, from which we collect only a few metrics:
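Which few exactly is installation-specific; as a sketch, here are some aggregate numbers from _cluster/stats that are commonly worth collecting (the selection of fields is my assumption, and es_get() is reused from the first sketch).

    # Sketch: a handful of aggregate numbers from _cluster/stats.
    # The selection of fields is an assumption, not a definitive list.
    # Reuses es_get() from the first sketch.
    stats = es_get("/_cluster/stats")

    print("indices:", stats["indices"]["count"])
    print("documents:", stats["indices"]["docs"]["count"])
    print("store size, bytes:", stats["indices"]["store"]["size_in_bytes"])
    print("data nodes:", stats["nodes"]["count"]["data"])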


Node level


_nodes/stats


Although the nodes perform different tasks, the statistics API is called the same way and returns the same set of metrics regardless of a node's role in the cluster. It is then our job, when building the Zabbix template, to pick the metrics that matter for each role. This API exposes several hundred metrics, most of which are useful only for in-depth analysis of cluster behavior; the following are the ones we monitor and alert on:

Metrics of the Elasticsearch process memory


jvm.mem.heap_max_in_bytes - allocated heap memory, in bytes;
jvm.mem.heap_used_in_bytes - used heap memory, in bytes;
jvm.mem.heap_used_percent - used heap memory, as a percentage of the allocated Xmx.

A very important parameter, and here is why: to serve searches, Elasticsearch keeps in the RAM of each data node the index structures of every shard that lives on that node. The garbage collector regularly comes around and clears unused memory pools. Over time, if the node holds a lot of data and it no longer fits into memory, each collection takes longer and longer as the collector tries to find something it can free, up to a full stop-the-world pause.

And since an Elasticsearch cluster runs at the speed of its slowest data node, the whole cluster starts to stall. There is another reason to watch memory: in my experience, after 4-5 hours in a state of jvm.mem.heap_used_percent > 95%, a crash of the process becomes inevitable.
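A sketch of how this can be watched per node through _nodes/stats (reusing es_get() from the first sketch); the 95% threshold comes from the observation above, while the 75% warning level is just an assumption for illustration.

    # Sketch: per-node JVM heap usage. The 95% disaster threshold follows the
    # text above; the 75% warning level is an illustrative assumption.
    # Reuses es_get() from the first sketch.
    nodes = es_get("/_nodes/stats/jvm")["nodes"]

    for node in nodes.values():
        used_pct = node["jvm"]["mem"]["heap_used_percent"]
        if used_pct > 95:
            level = "DISASTER"
        elif used_pct > 75:
            level = "WARNING"
        else:
            level = "OK"
        print("{0}: heap {1}% ({2})".format(node["name"], used_pct, level))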

The graph of memory usage over a day usually looks something like this, and it is a perfectly normal picture. The hot zone:



The warm zone; here the picture is calmer, although less balanced:



Bringing the http.total_rps metric from all nodes together on one graph lets us judge how evenly the load is balanced across the nodes and spot skew as soon as it appears. As you can see on the chart below, it is minimal:
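http.total_rps is not a value the API returns directly; it is a per-second rate computed by the monitoring system from a counter. As a sketch, and assuming it is derived from the http.total_opened counter in _nodes/stats (which counts opened connections, so this is only an approximation of the request rate), the calculation could look like this (reusing es_get() from the first sketch).

    import time

    # Sketch: approximate per-node request rate by sampling the http.total_opened
    # counter twice. That total_rps is derived from this counter is an assumption;
    # a real agent would use its own delta/speed-per-second function instead.
    # Reuses es_get() from the first sketch.
    def sample_http_totals():
        nodes = es_get("/_nodes/stats/http")["nodes"]
        return {n["name"]: n["http"]["total_opened"] for n in nodes.values()}

    first = sample_http_totals()
    time.sleep(60)
    second = sample_http_totals()

    for name, total in second.items():
        rate = (total - first.get(name, total)) / 60.0
        print("{0}: ~{1:.1f} rps".format(name, rate))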



throttle_time_in_millis is a very interesting metric that appears in several sections of each node's stats; which section to collect it from depends on the node's role. The metric shows how long operations were throttled, in milliseconds, and ideally it should not differ from zero:


There is no specific range of acceptable upper values that throttle_time should fit into; it depends on the particular installation and is determined empirically. Ideally, of course, it should be zero.
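A sketch of collecting it (reusing es_get() from the first sketch); the sections polled here, store, indexing and recovery inside the indices block, are where I would expect a 5.x node stats response to carry throttle_time_in_millis, so treat that list as an assumption rather than a definitive one.

    # Sketch: gather throttle_time_in_millis from several sections of each node's
    # indices stats. The list of sections is an assumption for a 5.x cluster;
    # which of them matters depends on the node's role.
    # Reuses es_get() from the first sketch.
    nodes = es_get("/_nodes/stats/indices")["nodes"]

    sections = ["store", "indexing", "recovery"]
    for node in nodes.values():
        for section in sections:
            throttle = node["indices"].get(section, {}).get("throttle_time_in_millis")
            if throttle is not None:
                print("{0} {1}.throttle_time_in_millis = {2}".format(
                    node["name"], section, throttle))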

search is a section containing metrics for the queries executed on a node. It has nonzero values only on data nodes. Of the ten metrics in this section, two are particularly useful:

From these two metrics, a graph of the average time spent on a single search query is plotted for each data node of the cluster:



As you can see, an ideal graph is flat throughout the day, with no sharp spikes. The stark difference in search speed between the "hot" zone (servers with SSDs) and the "warm" zone is also clearly visible.

It should be noted that a search is executed on all the nodes that hold the index's shards (both primary and replica). Accordingly, the resulting search time is roughly equal to the time spent searching on the slowest node plus the time needed to aggregate what was found. So if your users suddenly start complaining that search has become slow, analyzing this graph will let you find the node that is responsible.
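A sketch of that average-time calculation, assuming the two useful metrics are query_total and query_time_in_millis from the indices.search section (my guess at which two are meant); dividing the deltas of the two counters over a sampling window gives the average time per query (reusing es_get() from the first sketch).

    import time

    # Sketch: average time per search query, per node, over a sampling window.
    # Assumes the two useful metrics are query_total and query_time_in_millis.
    # Reuses es_get() from the first sketch.
    def sample_search():
        nodes = es_get("/_nodes/stats/indices")["nodes"]
        return {n["name"]: (n["indices"]["search"]["query_total"],
                            n["indices"]["search"]["query_time_in_millis"])
                for n in nodes.values()}

    before = sample_search()
    time.sleep(60)
    after = sample_search()

    for name, (queries2, millis2) in after.items():
        queries1, millis1 = before.get(name, (queries2, millis2))
        delta = queries2 - queries1
        if delta > 0:
            print("{0}: {1:.1f} ms per query".format(name, (millis2 - millis1) / delta))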

thread_pool is a group of metrics describing the thread pools and the queues in which operations wait to be executed. There are many of them; a detailed description of their purpose can be found in the official documentation, but the set of metrics returned for each pool is the same: threads, queue, active, rejected, largest, completed. The main pools to monitor are generic, bulk, refresh and search.

But the following metrics, in my opinion, require special attention:


Growth of this counter is a very bad sign: it shows that elastic does not have enough resources to accept new data. Fighting it without adding hardware to the cluster is difficult. You can, of course, raise the number of worker threads and increase the size of the bulk queue, but the developers do not recommend it. The right solution is to add hardware capacity, add new nodes to the cluster, and spread the load. The other unpleasant consequence is that whatever service delivers data to elastic must be able to handle such failures and re-send the data later, to avoid losing it.

The rejected metric should be monitored for every pool, with an alarm attached according to the node's purpose; there is no point in watching the bulk queue on nodes that take no part in ingesting data. Pool load information is also very useful for optimizing cluster operation. For example, by comparing the load on the search pool in different zones of the cluster, you can work out a reasonable retention period for data on the hot nodes before moving it to the slower nodes for long-term storage. However, analytics and fine-tuning of the cluster is a sizable topic of its own, and probably not for this article.
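A sketch of watching those rejected counters across pools (reusing es_get() from the first sketch); which pools to watch on which nodes is a deployment-specific choice, so the list below is only an example.

    # Sketch: rejected counters for a few thread pools on every node.
    # Which pools to alert on depends on the node's role (there is no point
    # watching bulk on nodes that do not ingest data). The pool list below is
    # an example. Reuses es_get() from the first sketch.
    nodes = es_get("/_nodes/stats/thread_pool")["nodes"]

    watched_pools = ["bulk", "search", "refresh", "generic"]
    for node in nodes.values():
        for pool in watched_pools:
            stats = node["thread_pool"].get(pool)
            if stats:
                print("{0} thread_pool.{1}.rejected = {2}".format(
                    node["name"], pool, stats["rejected"]))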

Data acquisition


So, the metrics are more or less clear; the last question left is how to collect them and ship them to the monitoring system. Whether it is Zabbix or anything else does not matter, the principle is the same.

As always, there are several options. You can make curl requests straight to the API, down to the exact metric you need: the upside is simplicity, the downsides are that it is slow and wasteful. You can request an entire API endpoint, get one big JSON with all the metrics, and parse it. I did exactly that at first, but later switched to the elasticsearch module for Python. In essence, it is a wrapper around the urllib library that turns function calls into the same requests to the Elasticsearch API and gives them a more convenient interface. The source code of the script and a description of how it works can, as usual, be found on GitHub. The Zabbix templates and screenshots, unfortunately, I cannot share, for obvious reasons.
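As a minimal sketch of the same idea (not the actual script from GitHub) using the elasticsearch module: the host, port, and item key names below are illustrative assumptions, and the output is printed as plain "key value" pairs that could, for example, be fed to zabbix_sender.

    from elasticsearch import Elasticsearch

    # Sketch only, not the script referenced on GitHub. Assumes the python
    # elasticsearch module is installed and a client node answers at localhost:9200.
    # The item key names are illustrative, not a real Zabbix template.
    es = Elasticsearch(["localhost:9200"])

    health = es.cluster.health()
    print("es.cluster.status", health["status"])
    print("es.cluster.unassigned_shards", health["unassigned_shards"])

    for node in es.nodes.stats(metric="jvm")["nodes"].values():
        key = "es.node[{0},heap_used_percent]".format(node["name"])
        print(key, node["jvm"]["mem"]["heap_used_percent"])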



Conclusion


There are many metrics in the Elasticsearch API (several hundred), and listing them all is beyond the scope of one article, so I have described only the most significant of the ones we collect and the things that deserve close attention when monitoring them. I hope that after reading this article you no longer suspect that elastic does any "magic" inside: it is just the normal work of a good, reliable service. Quirky, not without its cockroaches, but beautiful nonetheless.

If anything remains unclear, if you have questions, or if some part deserves a more detailed look, write a comment and we will discuss it.

Source: https://habr.com/ru/post/358550/

