Monitoring projects through a messenger, using Nagios and Telegram as an example, with an analysis of screw-ups from real 24x7 highload operations

Figure: Margarita Zakiyeva

What you will find under the cut:

The basic setup of Nagios together with Telegram.
The general concept our team uses for monitoring projects.
An analysis of the rakes we managed to step on while working with this system.

This article will be useful for those who:

Are dissatisfied with how informative their current monitoring is.
Suffer a daily pain in the backside caused by problem alerts.

This article is not about the "Telegram Bot API"

We started setting up the bundle discussed here a month before the public release of the API, so from the very beginning we used Telegram CLI for Linux to send alerts from the monitoring server. The article is devoted first of all to this console client. At the end we explain in detail why we did not abandon it in favor of the newer world of bots.

Who we are and what we do

We are a friendly Operations team that administers dozens of servers. They can be VPS or bare-metal machines, including colocation, and they are scattered around the world. Working with monitoring properly and effectively is our main priority.

General concept

We have no staff whose job is to stay awake at night and watch the monitoring. What we do have is one account registered on a throwaway SIM card, on whose behalf messages are sent, plus a certain number of the following:

Nagios instances. This has nothing to do with how notifications are sent; we just want to emphasize that everything works without failures with several Nagios instances at once.

Screw-up #0: not monitoring the monitoring
Sooner or later you will find that monitoring itself can break, and you want to know about it right away, not on Monday after the weekend. At the same time, it makes sense to check some services "from the inside" and others, such as the HTTP response status of your site, "from the outside". To kill two birds with one stone, set up a second Nagios with another provider and distribute the checks between the two instances, remembering to have each instance check the other with check_nagios. We hope that for you, as for us, two providers in different countries going down at the same time is an extremely unlikely scenario.
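As a rough illustration of what such a cross-check verifies, the sketch below tests whether a status file has been updated recently, which is one of the things check_nagios looks at. It is a simplified stand-in and not the plugin itself: the function name status_is_fresh, the placeholder path in the usage line and the 5-minute threshold are assumptions, and "stat -c %Y" assumes GNU coreutils on Linux.

```shell
#!/bin/bash
# Simplified stand-in for a monitoring freshness check (not check_nagios itself).
# status_is_fresh FILE MAX_AGE_SEC
# Returns 0 if FILE exists and was modified within MAX_AGE_SEC seconds, 1 otherwise.
status_is_fresh() {
  local file="$1" max_age="$2" now mtime
  [ -f "$file" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || return 1   # GNU stat: modification time in epoch seconds
  [ $(( now - mtime )) -le "$max_age" ]
}

# Hypothetical usage; the real path depends on your installation:
# status_is_fresh /path/to/status.dat 300 || echo "CRITICAL: status file is stale"
```

In practice the stock plugin is the better choice; the sketch only shows the idea of treating a stale status file as a sign that the neighbouring instance has stopped working.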

What our monitoring of the monitoring looks like

Customized notifications for services. The key point is to route only the most important notifications to the messenger; most likely these will be CRITICAL notifications for the key metrics on the most important hosts. Everything else, for example WARNING states or sandbox hosts, is sent outside the scheme described in this article, for example by mail or to a private chat with the robot in the same Telegram.

Screw-up #1: sending notifications that require immediate intervention to the same chat as alarms that can wait or may soon disappear once a service repairs itself automatically.
If you do this, everyone who reads the chat will soon stop paying attention to it completely, especially if they have to wake up at 4 am because of a false positive. The opposite mistake is to switch off monitoring of an important web server's log for the night. Don't do that either: very important information may well turn up there at night and need to be dealt with during the day. Sending such messages to a mailbox you read during business hours is sufficient. Divide and conquer.

A typical channel that the on-duty administrator responds to

System administrators, who take turns on daily monitoring duty, each shift lasting 24 hours from 23:00 to 23:00. The administrator on duty must turn on (or not turn off) notifications for the channel configured as the default destination for critical alarms from Nagios.

Screw-up #2: responding to notifications on a "whoever sees it first" basis.
If you do not appoint someone to be on duty, one night nobody will wake up, and in the morning nobody will be to blame. To avoid sleeping through notifications while on duty, we recommend setting up notifications on your mobile device as shown in the picture below.

Set up notifications on your phone or tablet

Backup channels. The idea is simple: if nobody has responded to a particular failure within half an hour, monitoring automatically switches from the regular chat to an emergency one, which, like the regular one, includes all administrators. The difference is that nobody may ignore it: everyone must keep its notifications enabled at all times. You can also create one more chat that includes not only the administrators but also, for example, the directors, in case a service has been down for an hour and nobody responsible for it is responding. How this is implemented technically is covered at the very end of the article.

Screw-up #3: relying only on the person on duty.
Bitter experience has shown us that an accident in your data center can coincide with the on-duty administrator's home Internet connection going down. Although everyone has mobile Internet, a smartphone at home is connected to Wi-Fi by default and shows full signal strength without caring whether the global network is actually reachable through it. The administrator can also be unavailable for simpler, more mundane reasons.

Backup channels, for which everyone always has notifications enabled

Thematic channels. A system administrator cannot fix every fault that monitoring detects, for example errors in application logs or specific deadlocks in the database. The idea of "waking the system administrator so that he can wake the backend developer" seems wrong to us, so separate thematic channels were created for such notifications. Responsibility for them lies not with system administrators but with the relevant specialists.

Screw-up #4: sending robot notifications to chats where work discussions take place.
It may seem that this will draw more attention to the problem and get it solved faster, but it will not; you will only annoy people with incomprehensible messages in the middle of an important discussion of the quarterly report. If necessary, simply forward the message describing the problem from the dedicated channel to the working chat yourself.
As an example, here is a screenshot with the backup channels and one thematic channel dedicated to the database.

Thematic channel dedicated to the database

A short summary: once the arrangements described above were adopted, the system administrators' work became much easier. They are distracted by smartphone notifications less often and can spend working time on improving the company's infrastructure. The admins sleep better, and top management no longer worries that a problem with a vital service will arise at night and undermine the company's reputation.

Sending Nagios notifications to Telegram

Installation and first launch of the console client

Even if you find telegram-cli in your distribution's repositories (for example, the RPM Fusion repository for CentOS) or as a ready-made package on the Internet, we strongly recommend that you build it yourself, since the procedure is described in detail for many *nix systems right on the project's GitHub page.

Note for fans of Fedora and CentOS

For Fedora 20 and CentOS 6, you must first compile libjansson yourself, since it is not included in the standard repositories.

After the binary has compiled successfully, create a system user with the login "telegramd". On the first launch the client will create the /home/telegramd/.telegram-cli directory, where, once authorization is confirmed, it stores its service files, such as the private key received from the Telegram servers.

Why the username is "telegramd"

telegramd is the default username the client uses when it is run as the superuser. We did not find this in the documentation, but spotted it in main.c.

How not to lose access to an account registered on a throwaway SIM card

It is enough to back up the ".telegram-cli" folder mentioned earlier. Once it is copied to another server, Telegram will start there immediately with the necessary authorization and settings.
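Such a backup could be made with plain tar. This is a minimal sketch, assuming the client state lives in /home/telegramd/.telegram-cli as described earlier; the function name backup_tg_state and the archive path in the usage line are hypothetical.

```shell
#!/bin/bash
# backup_tg_state HOME_DIR ARCHIVE
# Packs HOME_DIR/.telegram-cli into a gzip-compressed tar archive.
backup_tg_state() {
  local home_dir="$1" archive="$2"
  [ -d "$home_dir/.telegram-cli" ] || { echo "no .telegram-cli in $home_dir" >&2; return 1; }
  # -C keeps paths in the archive relative, so it can be unpacked into any home directory
  tar -czf "$archive" -C "$home_dir" .telegram-cli
}

# Hypothetical usage:
# backup_tg_state /home/telegramd /root/telegram-cli-backup.tar.gz
# Restore on another server:
# tar -xzf telegram-cli-backup.tar.gz -C /home/telegramd
```

The archive contains the authorization data of the account, so it should be stored as carefully as a password.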

So, you have in your hands a phone with the SIM card on which Telegram will be registered, and a console on the monitoring server is open on your computer.

adduser telegramd # --disabled-login
./bin/telegram-cli -k tg-server.pub

Follow the on-screen instructions and you will end up in the Telegram console.

Now you can add someone to the contact list by phone number. As far as we know, this is the only way to put a user into the contacts so that Nagios notifications can later be sent to them. It can be done from the console or from any other client, including the Telegram web version, provided you log in there with the same phone number you have just used. To send messages to a group chat or channel, nothing needs to be done on the robot's side, except that the robot must be an administrator if you send messages to a channel.

add_contact +79991112233 My Contact
quit

Configuring the client for sending alerts

Now we have a configured console client with one contact to send notifications to. For convenience, we wrap it in a bash script.

/usr/local/bin/telegram.sh

#!/bin/bash
# This script helps integrate Nagios instances
# with Telegram chats or channels.

sendFunc() {
  "$tgBinPath" \
    --rsa-key "$tgKeyPath" \
    --wait-dialog-list \
    --exec "$tgSendCmd $contactName $messageText" \
    --disable-link-preview \
    --logname "$mesLogFile" \
    >> "$mesLogFile"
}

# Path setup
tgSendCmd="msg"
tgDir="/usr/local/bin"
tgBinPath="$tgDir/telegram-cli"
tgKeyPath="$tgDir/tg-server.pub"
logDir="/var/log/telegram" # don't forget to set up log rotation
mesLogFile="$logDir/telegram.log"

# Parse arguments
contactName="$1"
messageText="$2"

sendFunc # send the Telegram message
exit $?

Configuring permissions in the system (tested in Debian 8 jessie)

mkdir -p /var/log/telegram
chown nagios:telegramd /var/log/telegram -R
chmod 755 /var/log/telegram -R
chown telegramd:nagios /usr/local/bin/t*
chmod +x /usr/local/bin/t*
chown telegramd:nagios /home/telegramd/ -R
chmod 770 /home/telegramd/ -R
ln -s /home/telegramd/.telegram-cli/ /var/lib/nagios/.telegram-cli

Send a “foo” message to “My Contact”

/usr/local/bin/telegram.sh My_Contact foo

Let's send “bar” to the “Monitoring” channel

 /usr/local/bin/telegram.sh Monitoring bar

Send a notification from Nagios

The command definition is based on the classic template for Jabber notifications. The message text includes #Nagios_Instance_Name, which becomes a hashtag in the message; we find this convenient.

Contact definition for Nagios config

define contact{
name telegram-contact
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,c,r,f ; everything except "Warning"
host_notification_options d,u,r,f
service_notification_commands service-notify-by-telegram
host_notification_commands host-notify-by-telegram
register 0
}

define contact{
contact_name telegramonlycrucial
use telegram-contact
alias Telegram OnlyCrucial
address1 Monitoring ; name of the chat or channel that receives the notifications
}

Defining commands for the Nagios config

define command{
command_name host-notify-by-telegram
command_line /usr/local/bin/telegram.sh $CONTACTADDRESS1$ "***** #Nagios_Instance_Name ***** Host $HOSTNAME$ is $HOSTSTATE$ - Info: $HOSTOUTPUT$"
}
define command{
command_name service-notify-by-telegram
command_line /usr/local/bin/telegram.sh $CONTACTADDRESS1$ "***** #Nagios_Instance_Name ***** $NOTIFICATIONTYPE$ $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
}

The final touch: monitoring Telegram itself

For us, monitoring is the most important and critical part of the entire infrastructure, and since notifications are one of its main components, telegram-cli itself has to be monitored as well. We use the following checks:

Every minute we launch the client, request the contact list and check the client's exit code; if everything is fine, it is always zero. (This is done by a separate bash script; we think you will have no trouble writing your own implementation of such a check.)
We check that the message log contains no lines with "FAIL"; this keyword indicates that something went wrong when sending notifications. (We use check_logfiles for this.)
We check that telegram-cli instances have not hung and that more and more copies of the process are not piling up and threatening to leave the server without RAM. (The standard check_procs is perfect for this.)
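The first of these checks could be sketched as a small Nagios-style plugin. This is an illustration under assumptions rather than a production script: the function name check_by_exit_code and the 30-second limit are made up for the example, timeout is the GNU coreutils utility, and the telegram-cli options in the usage line are the ones already used in telegram.sh, with contact_list as the executed command.

```shell
#!/bin/bash
# check_by_exit_code TIMEOUT_SEC COMMAND [ARGS...]
# Runs COMMAND and maps its exit code to Nagios plugin conventions:
# 0 -> OK (return 0); anything else, including a timeout -> CRITICAL (return 2).
check_by_exit_code() {
  local limit="$1" rc
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "TELEGRAM OK - client exited with code 0"
    return 0
  fi
  # timeout(1) itself returns 124 when the command had to be killed
  echo "TELEGRAM CRITICAL - client exited with code $rc"
  return 2
}

# Hypothetical usage as a plugin (paths as in telegram.sh):
# check_by_exit_code 30 /usr/local/bin/telegram-cli \
#   --rsa-key /usr/local/bin/tg-server.pub --wait-dialog-list --exec "contact_list"
# exit $?
```

Wrapping the client in a timeout also turns a hung client into a CRITICAL result instead of a check that never returns.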

Screw-up #5: not monitoring the local agent that sends notifications to Telegram.
Almost immediately after we started using this increasingly popular messenger on our Nagios servers, it turned out that Telegram could break too, leaving us completely without notifications for hours, and sometimes even for a couple of days. If monitoring detects any problem with sending notifications through Telegram, it is reported by email.

Why a local unofficial client instead of the official cloud API?

1. telegram-cli is updated regularly, so it works stably and has all the functionality we need.
2. The API still has to be watched: for example, failures were noticed during the Bot API 2.0 release, while the regular client kept working properly.
3. Since we do not communicate with our robot or manage monitoring through it, the current solution simply suits us. If it works, don't touch it.

Untapped Telegram capabilities for monitoring

When an alert fires on an error in a log, you often want to look at the problematic part without turning on your work computer, or to see a nice graph illustrating the scale of the problem next to the latest critical alarm, for example to forward it promptly to your colleagues.
Telegram can send images and other types of documents out of the box, so the possibilities of such monitoring are limited only by your imagination.
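For instance, telegram.sh could be adapted to send a picture by changing the command passed to --exec. The sketch below only builds that command; it assumes your build of telegram-cli provides the send_photo command described in the project README, and the function name build_photo_cmd and the file path are hypothetical.

```shell
#!/bin/bash
# build_photo_cmd PEER FILE
# Prints the client command that would be passed to --exec to send FILE to PEER.
# Fails if FILE does not exist, so a broken graph export is noticed before sending.
build_photo_cmd() {
  local peer="$1" file="$2"
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  printf 'send_photo %s %s\n' "$peer" "$file"
}

# Hypothetical usage, mirroring sendFunc from telegram.sh:
# cmd=$(build_photo_cmd Monitoring /tmp/load-graph.png) && \
#   /usr/local/bin/telegram-cli --rsa-key /usr/local/bin/tg-server.pub \
#     --wait-dialog-list --exec "$cmd"
```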
For example, this is how we implemented the mechanism of backup channels; the version of the code is simplified so that it is easier to follow.

The software part promised earlier, which is responsible for the channel escalation mechanism.
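The escalation rule for backup channels described earlier could be sketched roughly as follows. This is an illustration and not the original listing: it assumes the notification command passes the Nagios $SERVICEDURATIONSEC$ macro (seconds the service has spent in its current state) as an argument, and the channel names Monitoring_Emergency and Monitoring_Directors are hypothetical. The 30- and 60-minute thresholds are the ones given in the description of backup channels.

```shell
#!/bin/bash
# pick_channel DURATION_SEC
# Prints the chat or channel that should receive the alert, depending on
# how long the problem has been going on.
pick_channel() {
  local duration="${1:-0}"
  duration="${duration%%.*}"         # drop a fractional part, if any
  if [ "$duration" -ge 3600 ]; then
    echo "Monitoring_Directors"      # hypothetical: admins plus management
  elif [ "$duration" -ge 1800 ]; then
    echo "Monitoring_Emergency"      # hypothetical: nobody may mute this one
  else
    echo "Monitoring"                # the regular duty channel
  fi
}

# Hypothetical usage inside a notification wrapper,
# where $1 is the duration in seconds and $2 is the message text:
# /usr/local/bin/telegram.sh "$(pick_channel "$1")" "$2"
```

For this to have any effect, Nagios has to keep re-sending notifications while the problem persists, which is governed by the notification_interval directive of the host or service.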

Good luck with monitoring your projects and great uptime to you, colleagues!

Source: https://habr.com/ru/post/306272/
