
Your monitoring system: are you sure it works?

Our company does server maintenance. For us, monitoring is a mission-critical system: its failure can lead to large financial losses. A second monitoring system can watch whether the first one is physically up, but logical errors are another story...



I will tell the story of one mistake, the conclusions we drew from it, and how we changed our approach to monitoring. There will be little code here; this is mostly about ideology. If you're interested, welcome under the cut.





The starting point: we have several hundred clients and a Zabbix monitoring system. Each client has a separate host group containing all of that client's servers; new hosts are added automatically. Clients can see the metrics of the hosts in their group. There is one special host that checks the availability of all client sites. Every trigger creates a task in Redmine. This is what our monitoring looked like a couple of years ago.
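To give a flavor of that last step, the trigger-to-Redmine link can be implemented as a custom Zabbix alert script. Below is a minimal sketch, not our production code: the Redmine URL, API key, and project ID are placeholders, and the script relies on Redmine's standard REST API for creating issues and on the recipient/subject/message arguments Zabbix passes to custom alert scripts.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Zabbix alert script that files a Redmine task.

All constants below are placeholders; adapt to your installation.
"""
import sys

import requests

REDMINE_URL = "https://redmine.example.com"  # placeholder
API_KEY = "changeme"                         # placeholder API key
PROJECT_ID = 1                               # placeholder project id


def create_issue(subject: str, description: str) -> int:
    """Create a Redmine issue via its REST API and return its id."""
    resp = requests.post(
        f"{REDMINE_URL}/issues.json",
        headers={"X-Redmine-API-Key": API_KEY},
        json={"issue": {
            "project_id": PROJECT_ID,
            "subject": subject,
            "description": description,
        }},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["issue"]["id"]


if __name__ == "__main__":
    # Zabbix invokes alert scripts with: send-to, subject, message.
    _sendto, subject, message = sys.argv[1:4]
    print(create_issue(subject, message))
```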


It all started with a simple request from a client: give us access to the availability monitoring for our site. As you remember, we had a single availability host shared by all clients, so we could not grant access. And off it went...



Part One: Epic Fail, We Found You



We sat down, thought it over, and decided it was high time to give each client their own availability host and access to its metrics. Admins are lazy people and did not want to copy everything by hand, so we wrote a script to walk through all the projects and create and move all the necessary metrics and triggers. Tested "on cats" first (that is, in a sandbox), the script ran like a Swiss watch.



We ran the script in production and, to our amazement, it dumped out a wagonload of errors and then some. A cursory analysis showed that for some checks metrics were being collected, but no triggers existed for them! Problem checks amounted to about 10%. That may not sound like much, but the situation itself is dire. A client calls: "Our site is down!" What do we answer? "Thanks for letting us know, we had no idea..."? And what if the site had been down for 10 hours? We were lucky to find the problem before it went off.



Part Two: Who Is to Blame and What to Do



The guilty were found. The cause turned out to be trivial: when adding new checks, people took the lazy route and cloned existing ones. Sometimes they forgot to click "Clone" and instead changed the settings of the existing trigger. Yes, yes, shame on us. But if a mistake can be made, sooner or later someone will make it. First conclusion: the human factor is always at work.



What to do? Write another script, of course, one that creates triggers wherever they are missing. We could have added it to cron and called the matter closed. But that would only eliminate the consequences, not the problem. And the problem, remember, is that the human factor is always at work.
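For what it's worth, the repair step itself is short with the Zabbix API. Here is a minimal sketch of the idea, assuming the pyzabbix client, the old (pre-5.4) trigger expression syntax, and a hypothetical web-check item key pattern; the URL and credentials are placeholders:

```python
"""Sketch: find web-check items with no trigger and create one."""
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")  # placeholder
zapi.login("api-user", "secret")                # placeholder

# Fetch web-check failure items together with their hosts and triggers.
items = zapi.item.get(
    output=["itemid", "name", "key_"],
    search={"key_": "web.test.fail["},   # hypothetical key pattern
    selectHosts=["host"],
    selectTriggers=["triggerid"],
)

for item in items:
    if item["triggers"]:
        continue  # a trigger exists; nothing to repair
    host = item["hosts"][0]["host"]
    print(f"missing trigger: {host} / {item['key_']}")
    zapi.trigger.create(
        description=f"Web check failed: {item['name']}",
        expression=f"{{{host}:{item['key_']}.last()}}<>0",
        priority=4,  # High
    )
```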



We analyzed our routine and identified the tasks that could be fully or partially automated.



We automated all the work involved in adding templates. We wrote scripts that create checks; the admin only has to supply the input data. Need to add a site availability check? Just run the script with --url site.com. Need to put a host into maintenance? Run another script... well, you get the idea. We even added a check for updates in git, so that everyone is guaranteed to run the latest version of the scripts. We also wrote a script that performs every sanity check we could think of: configuration errors, disabled hosts, hosts forgotten in maintenance, and so on. And we began to live happily ever after. Happy end.
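As an illustration of what such a helper can look like, here is a hedged sketch of the "--url site.com" case. The host naming scheme and the --client flag are invented for the example; host.get and httptest.create are standard Zabbix API calls:

```python
#!/usr/bin/env python3
"""Sketch: add a site availability check.

Usage: add_web_check.py --url site.com --client acme
"""
import argparse

from pyzabbix import ZabbixAPI


def main() -> None:
    parser = argparse.ArgumentParser(description="Add a site availability check")
    parser.add_argument("--url", required=True, help="site to check, e.g. site.com")
    parser.add_argument("--client", required=True, help="client name")
    args = parser.parse_args()

    zapi = ZabbixAPI("https://zabbix.example.com")  # placeholder
    zapi.login("api-user", "secret")                # placeholder

    # Each client has a dedicated availability host; the naming
    # scheme below is an assumption for this example.
    host = zapi.host.get(
        filter={"host": f"{args.client}-sites"},
        output=["hostid"],
    )[0]

    # Create a one-step web scenario that expects HTTP 200.
    zapi.httptest.create(
        name=f"Availability of {args.url}",
        hostid=host["hostid"],
        steps=[{
            "name": "front page",
            "no": 1,
            "url": f"https://{args.url}/",
            "status_codes": "200",
        }],
    )
    print(f"Check created for {args.url}")


if __name__ == "__main__":
    main()
```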



Part Three: A New Wave of Problems



It seemed happiness had arrived. But we celebrated too soon. One fine Friday evening, as these things go, a colleague wrote: "A dozen of my hosts have no templates." Remember, the process is automated: a script looks up each host's roles in the configuration management system and links the appropriate templates (a sketch of this step follows below). On a Friday evening nobody wants to dig into such a thing, so we asked the colleague to wait half an hour; maybe it would sort itself out. An hour passed, then a second, then a third... The templates never appeared. With nowhere left to hide, we went in to look. It turned out that after the latest update some of the scripts had stopped working. We had done selective testing when making the changes, but the broken scripts happened to fall outside the sample.
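Here is a sketch of that role-to-template step, with a hypothetical ROLE_TEMPLATES mapping and template names; template.get and host.update are standard Zabbix API calls, and host.update replaces the host's set of linked templates:

```python
"""Sketch: link Zabbix templates to a host based on its roles."""
from pyzabbix import ZabbixAPI

# Hypothetical mapping from config-management roles to template names.
ROLE_TEMPLATES = {
    "webserver": "Template App Nginx",
    "database": "Template DB MySQL",
}


def link_templates(zapi: ZabbixAPI, hostid: str, roles: list) -> None:
    """Replace the host's linked templates with those its roles imply."""
    names = [ROLE_TEMPLATES[r] for r in roles if r in ROLE_TEMPLATES]
    templates = zapi.template.get(filter={"host": names}, output=["templateid"])
    zapi.host.update(
        hostid=hostid,
        templates=[{"templateid": t["templateid"]} for t in templates],
    )
```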



Since monitoring is handled by one person, the culprit was obvious. Nobody was punished; instead, we asked: "What should we do so that this never happens again?"



What to do? Write autotests, of course. Now every one of our scripts must pass autotests before being deployed to the server.
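A minimal sketch of what such a gate can look like, assuming pytest and a scripts/ directory; both are assumptions about layout, not a description of our actual suite:

```python
"""Sketch: smoke tests that gate script deployment."""
import glob
import subprocess

import pytest

SCRIPTS = sorted(glob.glob("scripts/*.py"))  # hypothetical layout


@pytest.mark.parametrize("script", SCRIPTS)
def test_script_starts(script):
    # The Friday incident was scripts crashing right after an update;
    # even a bare --help run would have caught that class of failure.
    result = subprocess.run(
        ["python3", script, "--help"],
        capture_output=True,
        timeout=30,
    )
    assert result.returncode == 0, result.stderr.decode()
```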



Epilogue



After reading this article, someone will say: "Shameful! How can you work like this?" But have you never dropped a production database yourself? Even GitLab has had that happen... We are not afraid of mistakes; every failure is an opportunity to make the company better. The main thing is to find and eliminate the causes instead of treating the symptoms.



Are you sure that your monitoring system is 100% healthy?



P.S. While working on this article, I caught myself thinking that in DevOps the work of ops looks more and more like dev. So let's not reinvent the wheel: we'll take practices and tools from dev and use them in ops.



P.P.S. We do not show our real code here, because it is tailored to our needs and our architecture. But if you're interested, write in the comments and we'll upload the code to GitHub.

Source: https://habr.com/ru/post/341294/


