
A short summary: this article describes a successful Zabbix rollout in which most processes were automated. It does not claim to be a tutorial, but if you need more details, I can provide them.
For many years our company used monit + cacti, a proven and well-established monitoring setup. But everything flows, everything changes. We grew so much that monit stopped coping: the check cycle went from one minute to 10-20 minutes, which is simply unacceptable! Since the monit developers could not help us, we decided to add a new monitoring system (you can never have too much monitoring). The principle “if it works, don't touch it” no longer applies here. To cut a long story short, the choice fell on Zabbix. Why? We read, argued, thought it over, and in the end the person doing the work decided. Every system has its pros and cons, there is more than enough information about them, and everyone chooses what is convenient for them. For example, I already knew how to monitor OracleDB in Zabbix. If this article wins someone over to Zabbix, I will be glad.
So, the main goals I pursued were reliability, speed, convenience, and no unnecessary manual work (laziness is the engine of everything). We did not fuss over the hardware: we took an old EX6 server from Hetzner and put a container on it. Specs:
- CPU: Intel Corporation Xeon E3-1200 Processor
- RAM: 16 GB
- HDD: SATA software RAID 1
Nothing impressive overall, but it will do, at least for the rollout stage.
Zabbix server was installed and configured following the official instructions and then tuned a couple of times. It runs on CentOS 6.5 with nginx + apache + mysql.
Now we need to decide what to automate (everything?). To explain that, here are the basic tools we use: a configuration management system and Redmine. So we need to take the list of hosts and their attached templates from the configuration management system (I decided not to abbreviate it) and create tasks in Redmine automatically.
Here is how we store host lists, taking a client domain.ru as an example. There is a file domain.ru.conf that lists the servers in the following format:
d1.domain.ru: nginx.domain.d1 mysql.domain.d1 zabbix.domain.d1 role4.domain.d1
etc.
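A minimal sketch of reading such a file, assuming every line has the form "host: role role ..." as in the sample above. The function name and the handling of blank lines and "#" comments are assumptions for illustration, not part of our actual tooling.

```python
def parse_hosts_conf(text):
    """Parse lines of the form 'host: role1 role2 ...' into {host: [roles]}."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments carry no data.
        if not line or line.startswith("#"):
            continue
        host, sep, roles = line.partition(":")
        if not sep:
            continue  # not a 'host: roles' line, skip it
        hosts[host.strip()] = roles.split()
    return hosts


sample = "d1.domain.ru: nginx.domain.d1 mysql.domain.d1 zabbix.domain.d1 role4.domain.d1\n"
for host, roles in parse_hosts_conf(sample).items():
    print(host, roles)
```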
Add servers to Zabbix-server.
For this we use Actions - Auto registration, a very useful thing. Through the configuration management system, on every server that has the Zabbix role, we install zabbix-agent and set HostMetadata=d.domain.ru in its config. The domain alone would have been enough, but we prefix it with d. or v. depending on whether the machine is a node or a container. We fill in the remaining settings (server, host name), restart zabbix-agent, and that is all there is to it.
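As a rough sketch, the agent settings the configuration management system has to render could be generated like this. Server, ServerActive, Hostname and HostMetadata are standard zabbix_agentd.conf parameters (auto-registration relies on active checks, hence ServerActive); the function names and the server address zabbix.example.com are hypothetical.

```python
def host_metadata(prefix, domain):
    """Build the metadata string, e.g. 'd.domain.ru' or 'v.domain.ru'."""
    return f"{prefix}.{domain}"


def render_agent_conf(server, hostname, metadata):
    """Return the zabbix_agentd.conf lines relevant to auto-registration."""
    lines = [
        f"Server={server}",          # server allowed to poll this agent
        f"ServerActive={server}",    # server the agent reports to (active checks)
        f"Hostname={hostname}",      # name the host registers under
        f"HostMetadata={metadata}",  # matched by the auto-registration rule
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
print(render_agent_conf("zabbix.example.com", "d1.domain.ru",
                        host_metadata("d", "domain.ru")))
```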
Now for the server side. Each domain needs a host group and the auto-registration rule itself. There are a lot of domains, and new ones keep arriving. This is where the Zabbix API comes to the rescue.
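A hedged sketch of the two JSON-RPC requests involved, hostgroup.create and action.create. This is not my library, only an illustration of the payload shape. It assumes a Zabbix version where action conditions live under "filter" and the session token is passed in the "auth" field; older releases put conditions directly on the action and may require extra fields, so check the API reference for your version. The group ID, template ID and token below are made up.

```python
import json


def rpc(method, params, auth, request_id=1):
    """Wrap parameters in a JSON-RPC 2.0 envelope as used by the Zabbix API."""
    return {"jsonrpc": "2.0", "method": method, "params": params,
            "auth": auth, "id": request_id}


def hostgroup_create(name, auth):
    return rpc("hostgroup.create", {"name": name}, auth)


def autoreg_action_create(name, metadata, groupid, templateid, auth):
    params = {
        "name": name,
        "eventsource": 2,  # 2 = auto-registration events
        "filter": {
            "evaltype": 0,
            "conditions": [
                # conditiontype 24 = host metadata, operator 2 = contains
                {"conditiontype": 24, "operator": 2, "value": metadata},
            ],
        },
        "operations": [
            {"operationtype": 2},                                     # add host
            {"operationtype": 4, "opgroup": [{"groupid": groupid}]},  # add to group
            {"operationtype": 6,
             "optemplate": [{"templateid": templateid}]},             # link template
        ],
    }
    return rpc("action.create", params, auth)


# Hypothetical IDs and token, for illustration only.
payload = autoreg_action_create("Autoreg d.domain.ru", "d.domain.ru",
                                "12", "10001", auth="session-token")
print(json.dumps(payload, indent=2))
```

The payloads are only built here, not sent; posting them to the API endpoint and handling errors is left out on purpose.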
The documentation is good and fairly easy to get into. There are of course a few annoying points, for example that you cannot add a template to a host without wiping the templates already linked to it... If anyone needs examples of my work with the API (I wrote a separate Python library for myself), I can publish it somewhere.
Having mastered the Zabbix API, we simply take the current state of the projects (our domain.ru.conf files) and create or delete groups and auto-registration rules to match the changes.
Here is an example of an auto-registration rule for a node:
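The comparison itself boils down to set arithmetic. A minimal sketch, assuming the domain is the file name without the .conf suffix and that the list of existing groups has already been fetched from Zabbix; both function names are hypothetical.

```python
def domain_from_filename(filename):
    """'domain.ru.conf' -> 'domain.ru'; other names are returned unchanged."""
    suffix = ".conf"
    return filename[:-len(suffix)] if filename.endswith(suffix) else filename


def plan_group_sync(wanted, existing):
    """Return (to_create, to_delete) so that existing ends up equal to wanted."""
    wanted, existing = set(wanted), set(existing)
    return sorted(wanted - existing), sorted(existing - wanted)


conf_files = ["domain.ru.conf", "shop.example.conf"]   # hypothetical files
wanted = [domain_from_filename(f) for f in conf_files]
existing = ["domain.ru", "old-client.example"]         # hypothetical groups in Zabbix
to_create, to_delete = plan_group_sync(wanted, existing)
print("create:", to_create)
print("delete:", to_delete)
```

In practice the list of existing groups should contain only the groups managed by the script, otherwise built-in groups would end up in the delete list.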

OK, servers are now being added with the standard templates. Next we need to attach additional templates according to each server's roles in the configuration management system. We write a parser that takes the current data, compares it with a saved reference copy, acts on the differences (or does nothing), and then rewrites the reference. This is where I ran into the Zabbix API problem: you cannot simply add one template, because the others get overwritten, and you cannot simply leave a template out, because its history and triggers are not deleted. In the same script we delete the hosts that are in the reference but no longer in the configuration management system. I will not overload the article with the listings of these scripts, since they are long; I will just describe the principles:
The simplest part is deleting a host: we delete it if it is in the reference but missing from the new data. Adding is the reverse case: if a host is missing from the reference but present in the new data, we ignore it, because hosts are added through the auto-registration rules. The main work is on the list of templates. If a host gets a new role, we send one request that links all the old templates plus the new one. If a role is removed, we first check whether the host had that template, and if so, we clear it and unlink it from the host.
That is all for adding hosts and templates! Now all we have to do is add a server to the configuration management system, assign it the necessary roles, and enjoy life.
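These principles can be sketched as a pure function that turns the reference and the new data into a plan. This is a simplified illustration, not the real script: it works on template names instead of IDs, and the output keys mirror the "templates" (full list to link) and "templates_clear" (unlink and clear) parameters of host.update. The function name and sample data are hypothetical.

```python
def reconcile(reference, current):
    """Compare {host: set_of_templates} snapshots.

    Returns (hosts_to_delete, updates), where updates maps a host to the
    full template list to link and the templates to unlink and clear.
    """
    hosts_to_delete = sorted(h for h in reference if h not in current)
    updates = {}
    for host, new_templates in current.items():
        if host not in reference:
            continue  # new hosts arrive through auto-registration, ignore them
        old, new = set(reference[host]), set(new_templates)
        if old == new:
            continue  # nothing changed for this host
        updates[host] = {
            "templates": sorted(new),              # old templates kept + new ones
            "templates_clear": sorted(old - new),  # removed roles: unlink and clear
        }
    return hosts_to_delete, updates


reference = {"d1": {"nginx", "mysql"}, "d2": {"nginx"}, "d3": {"nginx"}}
current = {"d1": {"nginx", "redis"}, "d3": {"nginx"}, "d4": {"mysql"}}
to_delete, updates = reconcile(reference, current)
print(to_delete)
print(updates)
```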
Now the second point. Email notifications are interesting, but we are used to working with tasks in Redmine. Besides, Redmine already sends all kinds of SMS for us, and customers can see the activity on tasks. Redmine also has an API. Here is what we do: in Zabbix we configure actions that, under certain conditions, execute a remote command on the Zabbix server. For example, for the check that a client's site returns the expected substring in its response, the command looks like this:
/srv/scripts/redmine_api_content.py {TRIGGER.DESCRIPTION} {TRIGGER.STATUS} {TRIGGER.SEVERITY} '{TRIGGER.NAME}' '{ITEM.NAME2} {ITEM.KEY2}: {ITEM.VALUE2}' >> /var/log/zabbix/redmine_api_content.log 2>&1
In the script, {TRIGGER.DESCRIPTION} is the project in which the task will be created; for ordinary checks (ping and so on) {HOST.NAME} is passed here instead, and the project identifier is derived from it. {TRIGGER.STATUS}: if it is PROBLEM and there is no task with that name, we create one; if the task already exists, we add a comment to it. If it is OK and the task exists, we add a comment; otherwise we do nothing. {TRIGGER.SEVERITY} is the severity of the trigger, which is converted into the status of the task (high, Failure!). {TRIGGER.NAME} is what is actually happening :) and becomes the name of the task. In {ITEM.NAME2} {ITEM.KEY2}: {ITEM.VALUE2} I pass the reason the web check failed (web.test.error).
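The create-or-comment rules above can be sketched without touching Redmine at all. This is an illustration of the decision logic only, not the real script: a plain dict stands in for the tracker, and the function names are hypothetical. With python-redmine the two branches would map roughly onto redmine.issue.create(...) and redmine.issue.update(issue_id, notes=...).

```python
def decide(trigger_status, issue_exists):
    """Return 'create', 'comment' or 'skip' for a trigger event."""
    if trigger_status == "PROBLEM":
        return "comment" if issue_exists else "create"
    if trigger_status == "OK" and issue_exists:
        return "comment"
    return "skip"


def handle_event(issues, project, status, severity, name, details):
    """Apply the decision to an in-memory tracker.

    issues maps (project, task name) -> list of notes; it is only a
    stand-in for Redmine in this sketch.
    """
    key = (project, name)
    action = decide(status, key in issues)
    if action == "create":
        issues[key] = [f"[{severity}] {details}"]
    elif action == "comment":
        issues[key].append(f"{status}: {details}")
    return action


issues = {}
args = ("domain-ru", "High", "Site check failed", "web.test.error: timeout")
print(handle_event(issues, args[0], "PROBLEM", *args[1:]))
print(handle_event(issues, args[0], "PROBLEM", *args[1:]))
print(handle_event(issues, args[0], "OK", *args[1:]))
```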
Here is the listing of this script; it uses the python-redmine package:
As for the load, it was not for nothing that I listed the hardware configuration. With these numbers:

The load average on the server stays around 2-3, which is quite decent. The disks suffer the most, because there is too little RAM. Of course, the system will now grow new templates and checks, the load will increase, and we will have to move to new hardware. By the way, a small tip: exclude all history* tables from the backup.
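One possible way to follow this tip, sketched under the assumption that the backup is made with mysqldump and the database is called zabbix. The helper only builds the argument list using mysqldump's --ignore-table=db.table option; the function name and the table list are illustrative.

```python
def mysqldump_args(database, tables):
    """Build a mysqldump command line that skips every history* table."""
    ignored = [f"--ignore-table={database}.{t}"
               for t in sorted(tables) if t.startswith("history")]
    return ["mysqldump", *ignored, database]


# Assumption: the table list was obtained beforehand, e.g. via SHOW TABLES.
tables = ["hosts", "items", "history", "history_uint", "history_str"]
print(" ".join(mysqldump_args("zabbix", tables)))
```

Keep in mind that an ignored table is skipped entirely, structure included, so empty history tables have to be recreated when restoring from such a dump.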
Summary: we got a working, feature-packed, and automated monitoring system. Tasks are created so actively and react to problems so sensitively that everyone is happy. The overall impression of Zabbix is more than positive. Whoever said that Zabbix is not flexible at all, I think, simply approached it from the wrong side. It seems its capabilities can be extended and refined endlessly. That is what we will do, because the best is of course still ahead... Thank you all!
Author: Roman Burnashev, Chief System Administrator, centos-admin.ru