
Building the right alert system - respond only to business-critical issues

A translation of an article by the infrastructure director at @Synthesio: a cry from the heart about alert fatigue and the pain of monitoring that is not cloud-ready.

Last year, my colleague Guillaume and I went through two crunch months when only the two of us were left on support. We worked more than 300 hours of overtime, four times more than usual and twice as much as in our busiest month since joining the company.

We had a lot of incidents, and each of them had its causes, but in fact about half did not require immediate intervention. Nevertheless, we received alerts, and we had to wake up in the middle of the night or interrupt our weekends to fix everything, because with such a flood of alerts we could not tell what could wait and what was critical and needed immediate attention.

Before that, my team had spent two years improving the infrastructure. We replaced outdated software and hardware, moved to operating out of two data centers, and made sure that every system was duplicated. We also improved monitoring, adding more than 150,000 metrics and 30,000 triggers, more than 5,000 of which paged us in PagerDuty.
We had excellent infrastructure and excellent monitoring, yet we were still receiving alerts constantly.

The old monitoring and alerting system had been built for our old infrastructure, which we had taken over two years earlier and which could not survive the loss of a single server. That monitoring did not fit our new infrastructure and had to be completely reworked.

Over those two years we had added a huge number of server-oriented checks, as well as functional ones. And believe me, they were excellent checks.

One fine Monday morning, I told Guillaume that we were switching from infrastructure-based monitoring to business-oriented monitoring.

> “We can afford to lose half of any cluster: search, databases, processing, or even the whole data center if the data flow remains the same.”

The idea was promising, but we did not know whether it was possible to implement it, given the specifics of our current monitoring system.

It turned out to be easy. We went back over every alert that had fired in recent months and applied a single requirement:

> "No need to notify that does not require immediate action and is not critical for business."

We started small and relaxed the alert thresholds for our Elasticsearch clusters. Each cluster consisted of more than 40 servers, so it could easily survive the loss of hardware without affecting production. Besides, these servers had been under heavy load for the past two months, and most of the alerts had come from them. We kept only the alerts that applied to the cluster as a whole, and left it that way for a week to see what would happen.
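
The article does not show the checks themselves, so here is a minimal sketch of what a cluster-level check might look like, assuming Elasticsearch's standard `_cluster/health` API; the endpoint, node threshold, and exit-code convention are illustrative, not the author's actual configuration.

```python
# Minimal sketch of a cluster-level check: page only when the cluster as a
# whole is at risk, not when a single node out of 40+ drops out.
# The endpoint and thresholds are assumptions for illustration.
import json
import sys
from urllib.request import urlopen

ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # assumed endpoint
MIN_DATA_NODES = 20  # e.g. half of a 40-node cluster

def check_cluster() -> int:
    """Return a Nagios-style exit code: 0 = OK, 1 = warning, 2 = page."""
    with urlopen(ES_HEALTH_URL, timeout=10) as resp:
        health = json.load(resp)

    status = health["status"]                  # "green", "yellow" or "red"
    data_nodes = health["number_of_data_nodes"]

    if status == "red" or data_nodes < MIN_DATA_NODES:
        print(f"CRITICAL: status={status}, data_nodes={data_nodes}")
        return 2   # the business is affected: wake someone up
    if status == "yellow":
        print(f"WARNING: status={status}, replicas still recovering")
        return 1   # can wait until morning: no page
    print(f"OK: status={status}, data_nodes={data_nodes}")
    return 0

if __name__ == "__main__":
    sys.exit(check_cluster())
```

A per-server disk or load check would fire dozens of times during a busy week; a check like this one stays quiet as long as the cluster keeps serving traffic.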

Everything went wonderfully.

A week later, we did the same across the whole platform. For each component, we asked ourselves a simple question:

> "If we lose it, can it wait until the next morning or the next Monday?"

If the answer was yes, we downgraded the alert. A "no" often revealed that a functional check was missing, and then we added one.
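
The article does not give an example of such a functional check, but the pattern is to exercise a user-visible function end to end instead of watching individual servers. A minimal sketch, with a hypothetical search endpoint and latency budget:

```python
# Minimal sketch of a functional check: call a user-facing function end to
# end and page only if it fails. The URL, query and latency budget are
# hypothetical; the pattern is the point.
import json
import sys
import time
from urllib.request import Request, urlopen

SEARCH_URL = "https://api.example.com/search?q=smoke-test"  # hypothetical
MAX_LATENCY_S = 5.0

def check_search() -> int:
    """Return 0 if the business function works, 2 if it does not."""
    started = time.monotonic()
    try:
        with urlopen(Request(SEARCH_URL), timeout=MAX_LATENCY_S) as resp:
            body = json.load(resp)
    except Exception as exc:            # timeout, HTTP error, DNS failure...
        print(f"CRITICAL: search is unavailable: {exc}")
        return 2
    elapsed = time.monotonic() - started

    if not body.get("results"):
        print("CRITICAL: search answered but returned no results")
        return 2
    print(f"OK: search answered in {elapsed:.2f}s")
    return 0

if __name__ == "__main__":
    sys.exit(check_search())
```

If a check like this fails, some customer-facing functionality is broken, and the page is justified regardless of which server caused it.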

This approach worked perfectly. We reduced the number of triggers that paged us in PagerDuty from 5,000 to 250. In the first month, the amount of overtime dropped fourfold. Three months later, it was 7.5 times lower than in our busiest months. Work became a pleasure again.

However, one question remained open: “Could we have done it before?”

That is a difficult question, and there is no definite answer. Part of the infrastructure was ready for the switch from server-based to business-oriented monitoring. But back then too many things were prioritized above reducing the number of alerts. Or, to be honest, we had so many tasks to solve that alerts took a back seat.
> “The problem of reducing the number of alerts became a priority only when everything went out of control.”

A few months later, I can say that reducing the number of alerts should have become the top priority before the situation got out of control, for several reasons:


What to do next is also clear, and we already have the first ideas: trend analysis, so that our monitoring can warn us in advance, before problems arise. We will write about that in the next article.
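
The article only names the idea, but as an illustration, one very simple form of trend analysis is to fit a line to recent samples of a metric and warn when the extrapolation crosses a limit. The metric, window, and threshold below are assumptions.

```python
# Minimal sketch of trend-based alerting: fit a straight line to recent
# samples of a metric (here, disk usage in percent) and warn if the trend
# reaches 100% within the next few days. The sample data is made up.
from datetime import timedelta
from statistics import linear_regression  # Python 3.10+

def days_until_full(hours, usage_pct):
    """Extrapolate when usage hits 100%; return None if it is not growing."""
    slope, intercept = linear_regression(hours, usage_pct)
    if slope <= 0:
        return None
    hours_left = (100.0 - (slope * hours[-1] + intercept)) / slope
    return timedelta(hours=hours_left).days

# Hourly samples over the last day, growing by about 0.5% per hour.
sample_hours = list(range(24))
sample_usage = [62.0 + 0.5 * h for h in sample_hours]

days = days_until_full(sample_hours, sample_usage)
if days is not None and days < 3:
    print(f"WARNING: disk predicted to be full in about {days} days")
else:
    print("OK: no worrying trend")
```

The same idea applies to queue depth, indexing lag, or any other metric whose slow growth eventually becomes a business problem.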

Source: https://habr.com/ru/post/329402/

