Monitoring agent: simple thing or not?

Now there are quite a few systems for storing and processing metrics (timeseries db), but the situation with agents (software that collects metrics) is more complicated. Not so long ago, telegraf appeared, but still the choice is not great.

At the same time, almost all cloud monitoring services are developing their agents, and we are no exception. The motivation is quite simple - there are many specific requirements that do not fit well with the architecture of existing solutions.

Our main specific requirements:

reliable delivery of metrics to the cloud
plugins' complicated logic: they interact with each other
diagnostics: we must be able to understand why an agent cannot collect certain metrics
the agent should consume as few resources of the client server as possible

Under the cut I will tell several aspects of the development of an agent for collecting metrics.

Delivery of metrics to the cloud at any cost

The monitoring system should always work, and especially in case of problems in the client’s infrastructure. We began by saying that all metrics are first written to the server disk on which they were collected (we call this spool). After that, the agent immediately tries to send a pack of metrics to the collector and, if successful, removes this pack from the disk. Spool is limited in size (500 megabytes by default), if it overflows we begin to delete the “oldest” metrics.

In this approach, an error was immediately found, we will not be able to work if the disk on the server is full. For this situation, we are trying to immediately send the metrics if it was not possible to write to the disk. Entirely, the logic of sending was decided not to change, since we want to minimize the time that the metrics are found only in the agent’s memory.

If the pack of metrics could not be sent, the agent retries. The problem is that the packs of metrics can be quite large and unequivocally choosing a timeout is quite difficult. If we make it big, we can get a delay in the delivery of metrics due to one "stuck" request. In the case of a small timeout, the big metric packs will stop climbing. We can not crush large packs into small ones for several reasons.

We decided not to use the general timeout for the request to the collector. We set the timeout for setting up the tcp connection and tls handshake, and the connection "liveliness" is checked by TCP keepalive . This mechanism is available on almost all modern OSs, and where there is none (we have clients with FreeBSD 8.x for example :) we have to set a large timeout for the entire request.

This mechanism has 3 settings (all time intervals are set in seconds, for services that are sensitive to delays, this is not very suitable):

Keepalive time - after what time after receiving the last packet with data in the connection, start sending samples
Keepalive probes - the number of failed samples, after which the compound is considered dead
Keepalive interval - interval between samples

The default values are not very practical:

$ sysctl -a |grep tcp_keepalive net.ipv4.tcp_keepalive_time = 7200 net.ipv4.tcp_keepalive_probes = 9 net.ipv4.tcp_keepalive_intvl = 75

These parameters can be redefined for any connection, our goal is to identify the problem connection as early as possible and close it. For example, such settings time = 1, probes = 3, interval = 1 allow you to catch a problem connection in 4 seconds.

When the connection of the agent with the collector disappears completely, we cannot do anything else. But quite often we encountered a situation where there is a connection, but the DNS server used by the server does not work. We decided in the case of a DNS error to try to otrezolvit domain collector through google public DNS .

Plugins

We pay a lot of attention to autodetecting services on client servers, it allows clients to quickly implement our monitoring and helps to not forget to configure anything. To do this, most plug-ins need a list of processes, and it is needed more than once at the start of the agent, but constantly to pick up new services. We get a list of processes once per interval, use it to get the metrics for the processes and send it to the plugins that need it for other tasks.

The plugin can also run other plugins or additional instances of itself. For example, we have the nginx plugin, it once a minute produces a list of processes, for each running nginx it:

locates its config
reads the config and all nested configs
find all log_format and access_log directives
based on log_format generates a regular expression for log parsing
for each access_log runs an instance of the logparser plugin, which starts to parse the log

If some log is added / removed or the format is changed, the logparser settings are changed, the missing instances are started, and the extra ones are stopped.

Logparser can take as input a single file, and glob. But since we want to parse the logs in parallel, glob periodically opens and runs the required number of instances of the same plugin.

Recently, another rather confused place has appeared - traffic sniffer, while only mongodb plugin interacts with it, but we plan to expand it:

according to the process list, the mongodb plugin finds the running Mongu on the server
informs the sniffer that he wants to receive packets on a specific TCP port
receives packets from the sniffer, additionally parses tcp payload and reads various metrics

As a result, we did not get quite plug-ins in the usual sense, but simply some of the modules that can interact with each other. Such scenarios would be very difficult to integrate into some kind of ready-made solution.

Agent diagnosis

The support of the first clients turned out to be hell for us, we had to correspond for a long time, ask the client to run different commands on the server and send us the output. In order not to blush in front of clients and speed up the time of searching for problems, we put an agent log in order and began to deliver it to us in the cloud in real time.

The log helps to quickly catch most of the problems, but communication with customers does not completely replace. The most common problems are related to the fact that the services are not automatically located. Most clients use fairly standard configs, log formats, etc., everything works like a clock for them. But as a rule, in companies where experienced administrators work - the possibilities of the software are used to the full and cases appear that we didn’t even suspect.

For example, recently we learned about the possibility of configuring a postgre via ALTER SYSTEM SET , which generates a separate config postgresql.auto.conf, which redefines different values of the main config.

We have the feeling that over time our agent turns into a piggy bank of knowledge about how various projects are arranged :)

Performance optimization

We constantly monitor the consumption of resources by our agent. We have several plugins that can significantly load the server: logparser, statsd, sniffer. For such cases we try to do various benchmarks , often we profile the code on our stands.

We write the agent on golang, and it has a profiler that can be included under load. We decided to use this and taught the agent to periodically spit out the cpu profile in the log in a minute, this allows us to understand how the agent behaves under the client load.

Since the agent reduces the consumption of resources by absolutely all processes on the server, customers can always see how much our agent consumes. For example, on one of the client front-end servers, the agent parses the log with approximately 3.5k rps:

By chance, nginx-amlify is tested next to our agent, which parses the same log :)

Total

monitoring agent is not as simple as it seems (it takes us about half the time to develop and support the agent)
for us, everything is complicated by the fact that all clients are different and differently configure their infrastructure
looking back, it is clear to us that we would not be able to implement our Wishlist on something ready
continue to build your bike :)

Source: https://habr.com/ru/post/312560/

All Articles