Steroids for Munin

Munin is a great tool for monitoring servers, especially if you have one or two. As the number of servers grows, though, it works worse and worse. Below the cut is the story of how I overclocked it to monitor more than 1000 virtual machines (275K RRD files in the system).

Why Munin

Munin is beautiful:
- it is undemanding on resources (while there are few servers);
- it is easy to set up (convenient defaults, a simple text config);
- plugins are easy to write (and there are plenty of ready-made ones).

Munin is terrible:
- the convenient defaults cannot be changed;
- the Nagios integration is useless;
- graphs are grouped inconveniently;
- it "does not like" long-running plugins, so you need workarounds;
- the code inside is no beauty either.

It is based on RRD, which brings both pros and cons.

Munin's plugin model turned out to be convenient: a developer can add graphs to a role without waiting for someone to register anything in a central database. This will come back to bite us in the size of the config, but it is convenient all the same.
Most importantly, Munin was already in place. Switching to another system would mean reworking existing software and retraining people.

Problems

- it started stepping on its own tail (a run did not finish within the 5-minute interval);
- it loads the disk very heavily;
- the configs are inconvenient to edit (you always forget to add a new server or remove an old one);
- aggregated graphs are hard to write;
- there was no integration with Nagios.

On top of that, aggregated graphs lie. If you sum the number of requests from 10 servers and then turn 5 of them off, the historical data drops by a factor of 2 as well! Naturally so: the total graph is recomputed every time rather than stored, so when the formula changes, the past graph changes with it.

And their solutions

Overclocking

Graph generation has to be switched to CGI / FastCGI. That speeds things up, but not by much.

The only radical way to overclock Munin is an expensive one: put all the RRD files in memory (tmpfs). Nothing else helps, alas, neither RAID nor SSD. 275K RRD files take up 14GB, which is not that much: a server with 32GB of RAM is nothing unusual (the Munin processes themselves will eat a few more GB). The disk, on the other hand, can be the most ordinary one.
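To make the sizing arithmetic concrete, here is a minimal sketch. The helper name rrd_ram_gib is made up for illustration, and the per-file size is an input you should measure on your own data; the 53 KiB used in the comment is simply 14GB divided by 275K files.

```shell
#!/bin/sh
# Sketch: estimate the tmpfs size needed for a given number of RRD files.
# rrd_ram_gib is a hypothetical helper, not part of Munin or rrdtool.
rrd_ram_gib() {
    files=$1          # number of RRD files
    kib_per_file=$2   # average size of one file in KiB (measure with du)
    awk -v n="$files" -v k="$kib_per_file" \
        'BEGIN { printf "%.1f\n", n * k / (1024 * 1024) }'
}

# With the figures from the text (roughly 53 KiB per file):
#   rrd_ram_gib 275000 53
# A tmpfs with some headroom could then be mounted as root, for example:
#   mount -t tmpfs -o size=16g tmpfs /mnt/ramdrive
```

The 16g in the mount line is an assumed headroom figure, not a number from the original setup.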

Naturally, every couple of hours the existing RRD files should be packed and dumped to disk, just in case. A packed archive writes to a SATA disk without any trouble.
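A periodic dump of this kind could look roughly like the sketch below. The function name snapshot_rrd and the archive path in the usage line are assumptions for illustration; only /mnt/ramdrive comes from this article.

```shell
#!/bin/sh
# Sketch: pack a tmpfs RRD directory into a compressed archive on disk.
# snapshot_rrd is a hypothetical helper; both paths come from the caller.
snapshot_rrd() {
    src=$1    # directory on tmpfs holding the RRD files
    dest=$2   # target archive on the ordinary disk
    tmp="$dest.tmp.$$"
    # Write under a temporary name first, so that a failed run
    # never clobbers the previous good archive.
    if tar -czf "$tmp" -C "$src" . ; then
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Possible use from cron every two hours (destination path is made up):
#   snapshot_rrd /mnt/ramdrive /var/backups/munin-rrd.tar.gz
```

One caveat: GNU tar exits with status 1 when a file changes while it is being read, which can happen while munin-update is writing. This sketch treats that as a failure and keeps the previous archive; you may prefer to accept such archives instead.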

Munin itself never removes RRD files, so unused ones have to be cleaned up:
/usr/bin/find /mnt/ramdrive/ -type f -mtime +5 -delete

A small digression on why using memory is unavoidable.

The problem with all such systems is that data arrives "crosswise" (one measurement for each metric) but is processed "lengthwise" (all measurements for one metric), and transposing that stream is very hard. You can write it "lengthwise" straight away, as RRD does, and wait on the writes. Or you can write it to a database "crosswise" (just append records one by one to a big table) and then read it back slowly.
Either way, you cannot do without a large in-memory cache or index.

Since we have plenty of memory, we run the update with no limit on the number of processes, which helps with slow plugins. I only added a timer that kills any munin-update process running longer than 3 minutes.
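The timer itself is not shown here, so the following is only one possible way to do it. It assumes a Linux ps from procps (which provides the etimes column, elapsed time in seconds) and that the workers show up under the command name munin-update. The helper name overdue_pids is made up.

```shell
#!/bin/sh
# Sketch: pick out processes that have been running longer than a limit.
# Input on stdin: one "PID ELAPSED_SECONDS COMMAND" triple per line.
# overdue_pids is a hypothetical helper name.
overdue_pids() {
    limit=$1   # maximum allowed run time, in seconds
    name=$2    # command name to match exactly
    awk -v limit="$limit" -v name="$name" \
        '$3 == name && $2 + 0 > limit + 0 { print $1 }'
}

# Possible use from cron, once a minute (3 minutes = 180 seconds):
#   for pid in $(ps -e -o pid= -o etimes= -o comm= | overdue_pids 180 munin-update); do
#       kill "$pid"
#   done
```

Keeping the filter separate from ps and kill makes it easy to dry-run: pipe the ps output through overdue_pids and look at the list before anything gets killed.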

The next thing to slow down is graph drawing. Profiling showed that the more servers there are, the bigger the config gets, and it is parsed on every start of munin-graph-cgi. My config had grown to 64 megabytes, and parsing it took up to 7 seconds at 100% CPU. The solution is obvious: do not parse it every time. But to put that crutch in place you have to edit the Munin code.

munin-update reads and writes the config as usual, and additionally saves the config object with the Storable module. munin-graph reads the Storable file if there is one.
This speeds up graph drawing but eats memory: the 64 megabytes of config turn into 375MB of virtual memory per process.

Oddly enough, this is not a big problem for munin-update, since it first loads the config into memory and only then forks, so the children share those pages. As a result, top shows 1000 processes with an RSS of 250MB each, yet only 21GB of RAM is in use (14 of them are RRD files!).
munin-graph is a bigger problem, since there every process honestly eats its own 400MB, but so far there is enough memory.

The next problem was munin-html: it did not manage to finish in time. This was cured by running it asynchronously and adding a few forks to the code. HTML is now regenerated every 10 minutes instead of 5, but it is not really needed that often anyway, except when a new server has been added.

And making it more convenient

Editing text configs by hand is inconvenient, but generating them with a script is very convenient. Fortunately, by that time we already had a server database broken down into systems (virtual machines) and data centers (physical hosts). A simple script rewrites the Munin config from this database and groups the servers by system (a domain, in Munin terms). New servers appear automatically, and old ones disappear automatically too. Beauty!
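The real script depends on the server database, which is not shown here. As a rough sketch, assume the database can be exported as plain lines of "SYSTEM HOSTNAME ADDRESS"; that export format and the helper name gen_munin_hosts are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: turn an inventory listing into Munin host sections.
# Input on stdin: "SYSTEM HOSTNAME ADDRESS" per line (assumed format);
# comment lines and incomplete lines are skipped.
# gen_munin_hosts is a hypothetical helper name.
gen_munin_hosts() {
    awk '
        /^#/ || NF < 3 { next }
        {
            # [group;host] puts the host into the given group (domain).
            printf "[%s;%s]\n", $1, $2
            printf "    address %s\n", $3
            printf "    use_node_name yes\n\n"
        }'
}

# Example:
#   printf 'shop web1.example.com 10.0.0.1\n' | gen_munin_hosts
```

It is safer to write the output to a temporary file and rename it into place, so that munin-update never reads a half-written config.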

The next problem is drawing aggregated graphs.
The plan is this: a script selects the required RRD files, takes the last value from each and, for example, adds them up.

The result is a Munin plugin like this one, which graphs the total number of requests across the servers:

#!/usr/bin/perl -w
use strict;
use warnings;

# Munin calls the plugin with "config" to get the graph description.
if ($ARGV[0] && $ARGV[0] eq 'config') {
    print <<'EOS';
host_name Aggregated
graph_title Total requests
graph_args --base 1000 -l 0
graph_vlabel requests
graph_category Nginx
graph_info Total count of requests
total.label total requests
total.info total requests
EOS
    exit(0);
}

# One RRD file per server; ls output keeps its newlines, so strip them.
my @rrd_files = `/bin/ls /mnt/ramdrive/munin/oMobile/*-nginx_log_access-total-d.rrd`;
chomp(@rrd_files);

my $sum = 0;
foreach my $filename (@rrd_files) {
    next unless (-f $filename);
    # Last non-NaN AVERAGE row from the past 15 minutes.
    my $lastline = `/usr/bin/rrdtool fetch $filename AVERAGE -s -15M -e now | fgrep -v 'nan' | tail -n 1`;
    if ($lastline =~ m/^\d+: ([-+.0-9e]+)\s*$/) {
        $sum += $1;
    }
}

print "total.value $sum\n";
exit(0);

Never do it like this! It is for demonstration only: it spawns a cloud of unnecessary exec calls. In production, read the files through the RRD library bindings instead (in Perl, the RRDs module).

Most often you want either a sum or several lines grouped on a single graph (it shows very clearly when someone stands out from the crowd). But you can also compute something more complicated. For example, given the average request processing time on each server and the number of requests to each server, you can compute the average request time for the whole system.
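The system-wide average has to be weighted by the number of requests; a plain mean of the per-server averages would overweight idle servers. A minimal sketch, where the helper name weighted_avg_time and the input layout are assumptions for illustration:

```shell
#!/bin/sh
# Sketch: request-weighted average of per-server response times.
# Input on stdin: "REQUESTS AVG_TIME" per server, one server per line.
# weighted_avg_time is a hypothetical helper name.
weighted_avg_time() {
    awk '
        { reqs += $1; busy += $1 * $2 }
        END {
            # U is how a Munin plugin reports an unknown value.
            if (reqs > 0) printf "%.3f\n", busy / reqs
            else print "U"
        }'
}

# Example: one server with 100 requests at 0.2 s, another with 300 at 0.4 s.
#   printf '100 0.2\n300 0.4\n' | weighted_avg_time
```

In that example the busier server pulls the result above the unweighted mean of 0.3, which is exactly the effect you want to see on the graph.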

Note host_name: with it you can create virtual hosts and domains in Munin and group the interesting graphs under them.

If you read more than just the last value from the RRD, you can catch sharp rises and drops in the graphs. As a rule, a sharp change (by 20 percent or so) in the number of requests, response time, load average and so on is a bad sign and needs attention. We take the data for the last hour, day and week and compare them. If the discrepancy is large, an alert can be sent (as a passive check in Nagios, for example).
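The comparison step can be sketched as follows. The helper names deviates and passive_result are made up, the 20 percent default mirrors the figure above, and the Nagios command-file path in the usage comment is an assumption. The passive-check line itself follows the standard external command format.

```shell
#!/bin/sh
# Sketch: flag a value that moved too far from a reference value.
# deviates CURRENT REFERENCE [PERCENT]
# Exit status 0 means "deviates by more than PERCENT" (default 20).
deviates() {
    awk -v cur="$1" -v ref="$2" -v pct="${3:-20}" 'BEGIN {
        # No baseline: any non-zero value counts as a change.
        if (ref == 0) exit (cur == 0 ? 1 : 0)
        diff = cur - ref; if (diff < 0) diff = -diff
        base = (ref < 0) ? -ref : ref
        exit (diff / base * 100 > pct ? 0 : 1)
    }'
}

# Format a Nagios passive service check result.
# Arguments: host, service, state (0 OK, 1 WARNING, 2 CRITICAL), message.
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Possible use (variables and the command-file path are hypothetical):
#   if deviates "$now" "$hour_ago"; then
#       passive_result Aggregated requests 1 "requests changed sharply" \
#           >> /var/lib/nagios/rw/nagios.cmd
#   fi
```

The same deviates call can be repeated against the day-old and week-old values, raising the alert only when several of them agree, which cuts down on noise from ordinary daily cycles.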

Summary

- Any RRD-based system can be overclocked by putting the RRD files into memory (tmpfs);
- Munin, convenient for small projects, can be stretched to a noticeable number of servers (I have more than 1500 munin-node instances);
- A small script is enough to draw fairly complex graphs;
- A small script is enough to analyze trends and send alerts on them.

Source: https://habr.com/ru/post/146032/
