My name is Anton Baderin. I work at the Center for High Technologies and do system administration. A month ago our corporate conference took place, where we shared our accumulated experience with the IT community of our city. I gave a talk on monitoring web applications. The material was aimed at junior and middle-level specialists who have not built this process from scratch.
The cornerstone of any monitoring system is solving business problems. Monitoring for the sake of monitoring is of no interest to anyone. What does the business want? For everything to work quickly and without errors. The business wants proactivity, so that we identify problems in the service and eliminate them as quickly as possible. These are exactly the tasks I spent all of last year solving on a project for one of our customers.
The project is one of the largest loyalty programs in the country. We help retailers increase the frequency of sales with various marketing tools like bonus cards. In total, the project includes 14 applications that run on ten servers.
During interviews I have repeatedly noticed that admins far from always approach monitoring web applications the right way: to this day many stop at operating system metrics and only occasionally monitor individual services.
In my case, the basis of the customer's monitoring system was Icinga. It did not solve the problems described above. Often the client himself told us about problems, and just as often we simply did not have enough data to get to the bottom of the cause.
In addition, it was clear that developing it further was pointless. I think those who are familiar with Icinga will understand me. So we decided to completely rework the web application monitoring system on the project.
We chose Prometheus based on three main criteria:
Previously, we did not collect logs and did not process them. The disadvantages are clear to everyone. We chose ELK because we already had experience with this system. We only store application logs there. The main selection criteria were full-text search and its speed.
Initially, the choice fell on InfluxDB. We needed to collect Nginx logs, statistics from pg_stat_statements, and store Prometheus historical data. We did not like Influx: it periodically started consuming large amounts of memory and crashed. In addition, I wanted to group requests by remote_addr, and this DBMS only groups by tags. Tags are expensive (memory), and their number is effectively limited.
We started the search again. We needed an analytical database with minimal resource consumption, preferably with data compression on the disk.
ClickHouse satisfies all these criteria, and we have never regretted our choice. We do not write any remarkable volumes of data to it (the insert rate is only about five thousand per minute).
NewRelic has historically been with us, since it was the choice of the customer. We use it as an APM.
We use Zabbix exclusively for black-box monitoring of various APIs.
We wanted to decompose the task and thereby systematize the approach to monitoring.
For this, I divided our system into the following levels:
Why this approach is convenient:
Since our task is to detect violations in the system, at each level we must select a certain set of metrics worth paying attention to when writing alert rules. Next, let's go through the levels "Virtual Machine", "Operating System" and "System Services, Software Stack".
Hosting gives us a processor, disk, memory and network. And we had problems with the first two. So, the metrics:
CPU stolen time - when you buy a virtual machine on Amazon (a t2.micro, for example), you should understand that you are not allocated an entire processor core, only a quota of its time. When you exhaust it, the processor is taken away from you.
This metric allows you to track such moments and make decisions: for example, whether to move to a bigger plan or to spread background task processing and API requests across different servers (a small collection sketch follows after this list).
IOPS + CPU iowait time - for some reason, many cloud hosters sin by not delivering enough IOPS. Moreover, a graph with low IOPS is not an argument for them. Therefore, it is worth collecting CPU iowait as well. With this pair of graphs (low IOPS and high I/O wait) you can already talk to the hoster and get the problem solved.
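Both steal and iowait come from the same kernel counters. In practice an exporter such as node_exporter collects them for Prometheus; the sketch below is only meant to show where the numbers originate (Linux only, purely illustrative):

```python
# Minimal sketch: derive CPU steal and iowait percentages from two
# readings of /proc/stat. Real setups use an exporter instead of this.
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu  user nice system idle iowait irq softirq steal ..."
        fields = f.readline().split()[1:]
    values = list(map(int, fields))
    values += [0] * (10 - len(values))  # pad if the kernel exposes fewer columns
    return values

def steal_and_iowait_percent(interval=5):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta) or 1
    iowait, steal = delta[4], delta[7]  # column positions defined by the kernel
    return 100.0 * steal / total, 100.0 * iowait / total

if __name__ == "__main__":
    steal, iowait = steal_and_iowait_percent()
    print(f"steal: {steal:.1f}%  iowait: {iowait:.1f}%")
```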
Operating system metrics:
Also, at the operating system level, we have such an entity as processes. It is important to single out the set of processes that play an important role in the system's operation. If, for example, you have several pgpool instances, you need to collect information on each of them.
The set of metrics is as follows:
All our monitoring is deployed in Docker; we use cAdvisor to collect metric data. On the other machines, we use process-exporter.
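To illustrate the kind of per-process data this gives us (this is not how process-exporter or cAdvisor work internally, just a sketch using the third-party psutil package; the watch list is an example):

```python
# Rough sketch: collect basic metrics for every process on a watch list.
# It mimics the data we get from process-exporter, nothing more.
import psutil

WATCHED = {"pgpool", "nginx", "postgres"}  # example process names

def collect():
    for proc in psutil.process_iter(["name", "pid"]):
        if proc.info["name"] not in WATCHED:
            continue
        try:
            with proc.oneshot():
                yield {
                    "name": proc.info["name"],
                    "pid": proc.info["pid"],
                    "cpu_percent": proc.cpu_percent(interval=None),
                    "rss_bytes": proc.memory_info().rss,
                    "open_fds": proc.num_fds(),   # Unix only
                    "threads": proc.num_threads(),
                }
        except psutil.NoSuchProcess:
            continue  # the process exited while we were reading it

if __name__ == "__main__":
    for metrics in collect():
        print(metrics)
```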
Each application has its own specifics, and it is hard to single out a single set of metrics.
A universal set would be:
The most striking examples of monitoring this level are Nginx and PostgreSQL.
The most loaded service in our system is the database. Previously, we often had trouble figuring out what the database was actually doing.
We saw a high load on the disks, but the slow query logs did not really show anything. We solved this problem with pg_stat_statements, a view that collects query statistics.
That's all the admin needs.
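For example, a query along the following lines is enough to pull the heaviest statements. Column names differ between PostgreSQL versions (total_time was renamed total_exec_time in PostgreSQL 13), and the connection string here is a placeholder:

```python
# Sketch: pull the most expensive statements from pg_stat_statements.
# Assumes the extension is installed and preloaded; uses psycopg2.
import psycopg2

QUERY = """
SELECT queryid,
       left(query, 80)  AS query,
       calls,
       total_time       AS total_ms,   -- total_exec_time on PostgreSQL 13+
       rows,
       shared_blks_read,
       shared_blks_written
FROM   pg_stat_statements
ORDER  BY total_time DESC
LIMIT  20;
"""

def top_statements(dsn="dbname=app user=monitoring"):  # placeholder DSN
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

if __name__ == "__main__":
    for row in top_statements():
        print(row)
```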
We build graphs of read and write query activity:
Everything is simple and clear; each query has its own color.
No less vivid an example is Nginx logs. It is not surprising that few people parse them or mention them among the must-haves: the standard format is not very informative and needs to be extended.
Personally, I added request_time, upstream_response_time, body_bytes_sent, request_length, and request_id.
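As a rough sketch of the parsing side: the log_format in the comment is only an assumption about how such an extended format could be declared, and the regular expression simply mirrors it. In reality this parsing happens on the way into the log storage rather than in a one-off script, but the idea is the same:

```python
# Sketch: parse an extended Nginx access log and aggregate the two numbers
# the business cares about: response time and error count.
# Assumed log_format (declared in nginx.conf) for this example:
#   log_format extended '$remote_addr [$time_local] "$request" $status '
#                       '$body_bytes_sent $request_length $request_time '
#                       '$upstream_response_time $request_id';
import re
import sys

LINE = re.compile(
    r'(?P<remote_addr>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) (?P<request_length>\d+) '
    r'(?P<request_time>[\d.]+) (?P<upstream_response_time>[\d.-]+) '
    r'(?P<request_id>\S+)'
)

def summarize(lines):
    times, errors, total = [], 0, 0
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        total += 1
        times.append(float(m["request_time"]))
        if m["status"].startswith("5"):
            errors += 1
    times.sort()
    p95 = times[int(len(times) * 0.95)] if times else 0.0
    return {"requests": total, "errors_5xx": errors, "p95_request_time": p95}

if __name__ == "__main__":
    print(summarize(sys.stdin))
```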
From these fields we build graphs of response time and the number of errors. Remember when I talked about the business requirements, fast and without errors? These two charts already cover those questions, and based on them it is already possible to call the on-duty administrators.
But one more problem remained: ensuring that the causes of an incident are eliminated promptly.
The whole process from identifying to solving a problem can be divided into a number of steps:
It is important that we do all of this as quickly as possible. At the stages of identifying a problem and sending a notification we cannot win much time anyway (around two minutes will be spent on them in any case), but the subsequent steps are simply an unplowed field for improvements.
Let's just imagine that the on-duty engineer's phone rings. What will he do? Look for answers to the questions: what is broken, where is it broken, how to react? This is how we answer those questions:
We simply include all this information in the notification text, together with a link to a wiki page that describes how to respond to the problem, how to solve it and how to escalate it.
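To make the idea concrete, here is a hedged sketch of assembling such a notification body; every field name and the wiki URL scheme are invented for illustration:

```python
# Sketch: assemble a notification that already answers "what broke",
# "where it broke" and "how to react". All names here are invented.
def build_notification(alert: dict) -> str:
    runbook = f"https://wiki.example.com/runbooks/{alert['alertname']}"
    return (
        f"[{alert['severity'].upper()}] {alert['alertname']}\n"
        f"What:   {alert['summary']}\n"
        f"Where:  host={alert['host']} service={alert['service']}\n"
        f"Value:  {alert['value']} (threshold {alert['threshold']})\n"
        f"React:  {runbook}\n"
    )

if __name__ == "__main__":
    print(build_notification({
        "alertname": "NginxHigh5xxRate",
        "severity": "critical",
        "summary": "5xx share above 5% for 5 minutes",
        "host": "web-01",
        "service": "nginx",
        "value": "7.3%",
        "threshold": "5%",
    }))
```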
I still have not said anything about the application and business logic levels. Unfortunately, our applications do not yet implement metric collection. The only source of any information from those levels is the logs.
A couple of points.
First, write structured logs. There is no need to include the context in the message text; that makes grouping and analysis difficult, and Logstash spends a long time normalizing it all.
Secondly, use severity levels correctly. Each language has its own standard, but personally I distinguish four levels.
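A minimal standard-library sketch of both points: the context travels as separate fields next to a stable message instead of being baked into it, and the four levels shown (DEBUG, INFO, WARNING, ERROR) are simply the conventional ones in most frameworks, used here as an assumption:

```python
# Sketch: structured JSON logs with context as separate fields.
# Standard library only; field names are examples.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` ends up as attributes on the record
        for key in ("order_id", "user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

# Bad: context baked into the message text, hard to group in Kibana
log.error("Failed to apply bonus for order 42 of user 1337")

# Better: a stable message plus structured fields
log.error("failed to apply bonus", extra={"order_id": 42, "user_id": 1337})

# The conventional four levels
log.debug("cache miss, recalculating")        # developer-level detail
log.info("bonus applied")                     # normal business event
log.warning("retrying payment gateway call")  # degraded but self-healing
log.error("payment gateway unavailable")      # needs human attention
```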
To summarize: try to build monitoring around the business logic. Try to monitor the application itself and operate with metrics such as the number of sales, the number of new user registrations, the number of currently active users, and so on.
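As a sketch of what such application-level metrics could look like when exposed for Prometheus (using the prometheus_client library; all metric names and the port are hypothetical):

```python
# Sketch: exposing business-level metrics from the application itself.
# Metric names and the port are hypothetical; requires prometheus_client.
import time
from prometheus_client import Counter, Gauge, start_http_server

SALES = Counter("sales_total", "Completed sales")
REGISTRATIONS = Counter("user_registrations_total", "New user registrations")
ACTIVE_USERS = Gauge("active_users", "Users currently active")

def on_sale_completed():
    SALES.inc()

def on_user_registered():
    REGISTRATIONS.inc()

def on_session_opened():
    ACTIVE_USERS.inc()

def on_session_closed():
    ACTIVE_USERS.dec()

if __name__ == "__main__":
    start_http_server(8000)  # metrics become available at :8000/metrics
    while True:
        time.sleep(1)
```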
If your entire business is one button in the browser, you need to monitor whether it can be pressed and whether it works properly. Everything else does not matter.
If you do not have that, you can try to catch up on it in the application logs, the Nginx logs, and so on, as we did. You should be as close to the application as possible.
Operating system metrics are, of course, important, but the business is not interested in them, and we are not paid for them.
Source: https://habr.com/ru/post/449352/