My name is Anton Baderin. I work at the Center for High Technologies and do system administration. A month ago our corporate conference took place, where we shared our accumulated experience with the IT community of our city. I gave a talk on monitoring web applications. The material was aimed at junior and middle-level specialists who have not built this process from scratch.
The cornerstone of any monitoring system is solving business problems. Monitoring for the sake of monitoring is of no interest to anyone. What does the business want? For everything to work quickly and without errors. The business wants proactivity, so that we identify problems in the service and eliminate them as quickly as possible. These are exactly the tasks I spent all of last year solving on a project for one of our customers.
The project is one of the largest loyalty programs in the country. We help retailers increase the frequency of sales with various marketing tools like bonus cards. In total, the project includes 14 applications that run on ten servers.
During interviews I have repeatedly noticed that admins far from always approach monitoring web applications the right way: to this day many stop at operating system metrics and only occasionally monitor individual services.
In my case, the basis of the customer's monitoring system was Icinga. It did not solve the problems described above. Often the client himself told us about problems, and just as often we simply did not have enough data to get to the bottom of the cause.
In addition, it was clear that developing it further was pointless. I think those who are familiar with Icinga will understand me. So we decided to completely rework the web application monitoring system on the project.
We chose Prometheus based on three main criteria:
Previously, we did not collect logs and did not process them. The disadvantages are clear to everyone. We chose ELK because we already had experience with this system. We only store application logs there. The main selection criteria were full-text search and its speed.
Initially, the choice fell on InfluxDB. We needed to collect Nginx logs, statistics from pg_stat_statements, and store Prometheus historical data. We did not like Influx: it periodically started consuming large amounts of memory and crashed. In addition, I wanted to group requests by remote_addr, and this DBMS only groups by tags. Tags are expensive (memory), and their number is effectively limited.
We started the search again. We needed an analytical database with minimal resource consumption, preferably with data compression on the disk.
ClickHouse satisfies all these criteria, and we have never regretted our choice. We do not write any remarkable volumes of data to it (the insert rate is only about five thousand per minute).
NewRelic has historically been with us, since it was the choice of the customer. We use it as an APM.
We use Zabbix exclusively for black-box monitoring of various APIs.
We wanted to decompose the task and thereby systematize the approach to monitoring.
For this, I divided our system into the following levels:
Why this approach is convenient:
Since our task is to detect violations in the system, at each level we must select a certain set of metrics worth paying attention to when writing alert rules. Next, let's go through the levels "Virtual Machine", "Operating System" and "System Services, Software Stack".
Hosting gives us a processor, disk, memory and network. And we had problems with the first two. So, the metrics:
CPU stolen time - when you buy a virtual machine on Amazon (a t2.micro, for example), you should understand that you are not allocated an entire processor core, only a quota of its time. When you exhaust it, the processor is taken away from you.
This metric allows you to track such moments and make decisions: for example, whether to move to a bigger plan or to spread background task processing and API requests across different servers (a small collection sketch follows after this list).
IOPS + CPU iowait time - for some reason, many cloud hosters sin by not delivering enough IOPS. Moreover, a graph with low IOPS is not an argument for them. Therefore, it is worth collecting CPU iowait as well. With this pair of graphs (low IOPS and high I/O wait) you can already talk to the hoster and get the problem solved.
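Both steal and iowait come from the same kernel counters. In practice an exporter such as node_exporter collects them for Prometheus; the sketch below is only meant to show where the numbers originate (Linux only, purely illustrative):

```python
# Minimal sketch: derive CPU steal and iowait percentages from two
# readings of /proc/stat. Real setups use an exporter instead of this.
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu  user nice system idle iowait irq softirq steal ..."
        fields = f.readline().split()[1:]
    values = list(map(int, fields))
    values += [0] * (10 - len(values))  # pad if the kernel exposes fewer columns
    return values

def steal_and_iowait_percent(interval=5):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta) or 1
    iowait, steal = delta[4], delta[7]  # column positions defined by the kernel
    return 100.0 * steal / total, 100.0 * iowait / total

if __name__ == "__main__":
    steal, iowait = steal_and_iowait_percent()
    print(f"steal: {steal:.1f}%  iowait: {iowait:.1f}%")
```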
Operating system metrics:
Also, at the operating system level, we have such an entity as processes. It is important to single out the set of processes that play an important role in the system's operation. If, for example, you have several pgpool instances, you need to collect information on each of them.
The set of metrics is as follows:
All our monitoring is deployed in Docker; we use cAdvisor to collect metric data. On the other machines, we use process-exporter.
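To illustrate the kind of per-process data this gives us (this is not how process-exporter or cAdvisor work internally, just a sketch using the third-party psutil package; the watch list is an example):

```python
# Rough sketch: collect basic metrics for every process on a watch list.
# It mimics the data we get from process-exporter, nothing more.
import psutil

WATCHED = {"pgpool", "nginx", "postgres"}  # example process names

def collect():
    for proc in psutil.process_iter(["name", "pid"]):
        if proc.info["name"] not in WATCHED:
            continue
        try:
            with proc.oneshot():
                yield {
                    "name": proc.info["name"],
                    "pid": proc.info["pid"],
                    "cpu_percent": proc.cpu_percent(interval=None),
                    "rss_bytes": proc.memory_info().rss,
                    "open_fds": proc.num_fds(),   # Unix only
                    "threads": proc.num_threads(),
                }
        except psutil.NoSuchProcess:
            continue  # the process exited while we were reading it

if __name__ == "__main__":
    for metrics in collect():
        print(metrics)
```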
Each application has its own specifics, and it is hard to single out a single set of metrics.
A universal set would be:
The most striking examples of monitoring this level are Nginx and PostgreSQL.
The most loaded service in our system is the database. Previously, we often had trouble figuring out what the database was actually doing.
We saw a high load on the disks, but the slow query logs did not really show anything. We solved this problem with pg_stat_statements, a view that collects query statistics.
That's all the admin needs.
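For example, a query along the following lines is enough to pull the heaviest statements. Column names differ between PostgreSQL versions (total_time was renamed total_exec_time in PostgreSQL 13), and the connection string here is a placeholder:

```python
# Sketch: pull the most expensive statements from pg_stat_statements.
# Assumes the extension is installed and preloaded; uses psycopg2.
import psycopg2

QUERY = """
SELECT queryid,
       left(query, 80)  AS query,
       calls,
       total_time       AS total_ms,   -- total_exec_time on PostgreSQL 13+
       rows,
       shared_blks_read,
       shared_blks_written
FROM   pg_stat_statements
ORDER  BY total_time DESC
LIMIT  20;
"""

def top_statements(dsn="dbname=app user=monitoring"):  # placeholder DSN
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

if __name__ == "__main__":
    for row in top_statements():
        print(row)
```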
We build graphs of read and write query activity:
Everything is simple and clear; each query has its own color.
No less vivid an example is Nginx logs. It is not surprising that few people parse them or mention them among the must-haves: the standard format is not very informative and needs to be extended.
Personally, I added request_time, upstream_response_time, body_bytes_sent, request_length, and request_id.
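As a rough sketch of the parsing side: the log_format in the comment is only an assumption about how such an extended format could be declared, and the regular expression simply mirrors it. In reality this parsing happens on the way into the log storage rather than in a one-off script, but the idea is the same:

```python
# Sketch: parse an extended Nginx access log and aggregate the two numbers
# the business cares about: response time and error count.
# Assumed log_format (declared in nginx.conf) for this example:
#   log_format extended '$remote_addr [$time_local] "$request" $status '
#                       '$body_bytes_sent $request_length $request_time '
#                       '$upstream_response_time $request_id';
import re
import sys

LINE = re.compile(
    r'(?P<remote_addr>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) (?P<request_length>\d+) '
    r'(?P<request_time>[\d.]+) (?P<upstream_response_time>[\d.-]+) '
    r'(?P<request_id>\S+)'
)

def summarize(lines):
    times, errors, total = [], 0, 0
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        total += 1
        times.append(float(m["request_time"]))
        if m["status"].startswith("5"):
            errors += 1
    times.sort()
    p95 = times[int(len(times) * 0.95)] if times else 0.0
    return {"requests": total, "errors_5xx": errors, "p95_request_time": p95}

if __name__ == "__main__":
    print(summarize(sys.stdin))
```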
From these fields we build graphs of response time and the number of errors. Remember when I talked about the business requirements, fast and without errors? These two charts already cover those questions, and based on them it is already possible to call the on-duty administrators.
But one more problem remained: ensuring that the causes of an incident are eliminated promptly.
The whole process from identifying to solving a problem can be divided into a number of steps:
It is important that we do all of this as quickly as possible. At the stages of identifying a problem and sending a notification we cannot win much time anyway (around two minutes will be spent on them in any case), but the subsequent steps are simply an unplowed field for improvements.
Let's just imagine that the on-duty engineer's phone rings. What will he do? Look for answers to the questions: what is broken, where is it broken, how to react? This is how we answer those questions:
We simply include all this information in the notification text, together with a link to a wiki page that describes how to respond to the problem, how to solve it and how to escalate it.
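To make the idea concrete, here is a hedged sketch of assembling such a notification body; every field name and the wiki URL scheme are invented for illustration:

```python
# Sketch: assemble a notification that already answers "what broke",
# "where it broke" and "how to react". All names here are invented.
def build_notification(alert: dict) -> str:
    runbook = f"https://wiki.example.com/runbooks/{alert['alertname']}"
    return (
        f"[{alert['severity'].upper()}] {alert['alertname']}\n"
        f"What:   {alert['summary']}\n"
        f"Where:  host={alert['host']} service={alert['service']}\n"
        f"Value:  {alert['value']} (threshold {alert['threshold']})\n"
        f"React:  {runbook}\n"
    )

if __name__ == "__main__":
    print(build_notification({
        "alertname": "NginxHigh5xxRate",
        "severity": "critical",
        "summary": "5xx share above 5% for 5 minutes",
        "host": "web-01",
        "service": "nginx",
        "value": "7.3%",
        "threshold": "5%",
    }))
```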
I still have not said anything about the application and business logic levels. Unfortunately, our applications do not yet implement metric collection. The only source of any information from those levels is the logs.
A couple of points.
First, write structured logs. There is no need to include the context in the message text; that makes grouping and analysis difficult, and Logstash spends a long time normalizing it all.
Secondly, use severity levels correctly. Each language has its own standard, but personally I distinguish four levels.
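A minimal standard-library sketch of both points: the context travels as separate fields next to a stable message instead of being baked into it, and the four levels shown (DEBUG, INFO, WARNING, ERROR) are simply the conventional ones in most frameworks, used here as an assumption:

```python
# Sketch: structured JSON logs with context as separate fields.
# Standard library only; field names are examples.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` ends up as attributes on the record
        for key in ("order_id", "user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

# Bad: context baked into the message text, hard to group in Kibana
log.error("Failed to apply bonus for order 42 of user 1337")

# Better: a stable message plus structured fields
log.error("failed to apply bonus", extra={"order_id": 42, "user_id": 1337})

# The conventional four levels
log.debug("cache miss, recalculating")        # developer-level detail
log.info("bonus applied")                     # normal business event
log.warning("retrying payment gateway call")  # degraded but self-healing
log.error("payment gateway unavailable")      # needs human attention
```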
To summarize: try to build monitoring around the business logic. Try to monitor the application itself and operate with metrics such as the number of sales, the number of new user registrations, the number of currently active users, and so on.
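As a sketch of what such application-level metrics could look like when exposed for Prometheus (using the prometheus_client library; all metric names and the port are hypothetical):

```python
# Sketch: exposing business-level metrics from the application itself.
# Metric names and the port are hypothetical; requires prometheus_client.
import time
from prometheus_client import Counter, Gauge, start_http_server

SALES = Counter("sales_total", "Completed sales")
REGISTRATIONS = Counter("user_registrations_total", "New user registrations")
ACTIVE_USERS = Gauge("active_users", "Users currently active")

def on_sale_completed():
    SALES.inc()

def on_user_registered():
    REGISTRATIONS.inc()

def on_session_opened():
    ACTIVE_USERS.inc()

def on_session_closed():
    ACTIVE_USERS.dec()

if __name__ == "__main__":
    start_http_server(8000)  # metrics become available at :8000/metrics
    while True:
        time.sleep(1)
```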
If your entire business is one button in the browser, you need to monitor whether it can be pressed and whether it works properly. Everything else does not matter.
If you do not have that, you can try to catch up on it in the application logs, the Nginx logs, and so on, as we did. You should be as close to the application as possible.
Operating system metrics are, of course, important, but the business is not interested in them, and we are not paid for them.
Source: https://habr.com/ru/post/449352/