Moira: Realtime Alerting

Kontur makes several dozen products, each consisting of several dozen microservices, each of which runs on dozens of servers.

This infrastructure generates metrics at every level: hardware load, OS state, application metrics. The raw data is collected in one large Graphite cluster. We currently have a million unique metrics, which together produce 20 thousand values per second.

Clearly, nobody can watch a million metrics by eye on TVs and dashboards; you need a system that sends notifications about abnormal situations. Before writing our own system, Moira, we used Seyren for this task.

Pros and cons of Seyren

Seyren lets you enter an arbitrary expression in the Graphite function language through the web interface and subscribe to notifications. A notification is sent when the result of evaluating the expression exceeds a fixed threshold.

That seems to be exactly what is needed. But over a year of running Seyren, we ran into several problems:

Inside Kontur there are several teams that build different products. Seyren has no concept of a "user" or a "group": everyone sees everyone else's settings and events in one huge list.
Seyren relies on the Graphite API to calculate the value of a metric and does not validate the entered expressions itself. Errors end up somewhere in the logs, while in the web interface everything looks like it is working. If an expression draws a graph in the Graphite web interface, there is no guarantee that the same expression will work in Seyren. And if it does, its evaluation may give an unexpected result.
Not all Graphite queries are efficient. With a large number of metrics, an expression can take several seconds to evaluate. That does not matter when it happens once at a user's request. But Seyren requests every saved expression through the Graphite API once per minute. If there are several such expressions in Seyren, they are guaranteed to bring down a cluster of any size. We do not want to teach every monitoring user the art of writing optimal Graphite queries.
Seyren misses events. We saw points on the graph that crossed the threshold, yet Seyren stayed inexplicably silent. We were unable to reproduce this bug reliably, let alone fix it.

Seyren Alternatives

Why did we even choose Seyren? There must be other options.

Seyren was the best of what was listed in the Monitoring section of the "Tools That Work With Graphite" page in the official Graphite documentation. We tried them all and chose Seyren. Perhaps something else from that page will suit you (Moira is now listed there too).

There is also Riemann. It is probably the most flexible and performant monitoring system in the world. It accepts incoming data in Graphite format. Riemann's configuration is code that is stored in a repository. In short, this is real DevOps.

Unfortunately, Riemann has a very high barrier to entry. For Seyren this barrier is zero: if you use Graphite, copying an expression into Seyren requires no additional knowledge. A Riemann configuration is Clojure code. The "Just enough Clojure to work with Riemann" page takes a couple of hours of careful reading, even if you are familiar with functional programming. We want any developer to be able to set up notifications about problems in their area of responsibility without learning a new programming language.

And since we have started discussing what we do and do not want, let's spell out our requirements for a notification system.

Requirements for sending notifications

Must not load the Graphite cluster with frequent automated queries. Ideally, it should not depend on the main Graphite cluster at all.
Must support all Graphite functions and calculate them the same way they are calculated when plotting graphs.
Must be able to distinguish between users and allow them to customize what they see in the web interface.
Must be able to send notifications through popular channels: email, Slack, Pushover, Telegram. Adding new plugins should not be too complicated.

Moira: how we built such a system

Moira consists of modules, like Graphite itself. Nowadays this is called a microservice architecture. It lets us replace any part of the system with a faster or more capable one at any time, without throwing out the whole system. All parts of Moira are independent of each other and communicate only through a single data store, Redis.

moira-cache

We want to leave Graphite alone, so we want to receive a copy of the stream of incoming metrics. But there are a lot of incoming metrics.

Two observations make this problem manageable:

For alerts, only fresh information is needed - history can be viewed in Graphite if necessary.
Among the entire flow of incoming metrics, only those for which at least one alert rule is configured are needed.

Moira-cache is a fast Go service that accepts incoming metrics, keeps only the ones that are needed, and stores the last hour of their values in Redis.
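To make the idea concrete, here is a minimal sketch of the same filtering in Python. It is not the actual moira-cache code (which is written in Go and stores data in Redis): the class `MetricCache`, the helper names, and the example patterns are hypothetical, an in-memory dictionary stands in for Redis, and pattern matching is simplified to `*`, `?` and `[...]` within a single dot-separated node.

```python
import fnmatch
import time
from collections import defaultdict

RETENTION_SECONDS = 3600  # keep only the last hour of values


def parse_line(line):
    """Parse one Graphite plaintext line: 'name value timestamp'."""
    name, value, timestamp = line.split()
    return name, float(value), int(timestamp)


def matches(name, pattern):
    """Match a metric name against a pattern node by node, so '*' never crosses a dot."""
    name_parts = name.split(".")
    pattern_parts = pattern.split(".")
    return len(name_parts) == len(pattern_parts) and all(
        fnmatch.fnmatchcase(n, p) for n, p in zip(name_parts, pattern_parts)
    )


class MetricCache:
    """In-memory stand-in for the Redis cache (an assumption of this sketch)."""

    def __init__(self, patterns, retention=RETENTION_SECONDS):
        self.patterns = patterns          # patterns used by configured alert rules
        self.retention = retention
        self.points = defaultdict(list)   # metric name -> [(timestamp, value), ...]

    def accept(self, line, now=None):
        """Store the point if some alert rule needs it; return whether it was stored."""
        name, value, timestamp = parse_line(line)
        if not any(matches(name, p) for p in self.patterns):
            return False                  # nobody is watching this metric: drop it
        now = time.time() if now is None else now
        series = self.points[name]
        series.append((timestamp, value))
        cutoff = now - self.retention
        # Forget everything older than the retention window.
        self.points[name] = [(ts, v) for ts, v in series if ts >= cutoff]
        return True


cache = MetricCache(["servers.*.cpu.load"])
cache.accept("servers.web1.cpu.load 0.75 1000", now=1000)   # stored
cache.accept("servers.web1.disk.free 10 1000", now=1000)    # dropped
```

Because unmatched metrics are discarded immediately and matched ones expire after an hour, the amount of stored data depends on the number of alert rules rather than on the total number of metrics.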

moira-worker

We want to support all Graphite functions and have them calculated exactly as in Graphite itself. Since Graphite is open source, the surest way to achieve this is to take the Graphite sources and include them directly in our application.

Moira-worker is a Python service that checks values in the cache using the original Graphite function code and generates alert events.
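The "check the cache, generate events" step can be sketched roughly as follows. This is an illustration under stated assumptions, not moira-worker's real code: it skips the evaluation of Graphite functions entirely and works on raw cached points, and the names `Trigger`, `Event`, `state_for` and `check`, as well as the OK/WARN/ERROR thresholds, are hypothetical.

```python
from dataclasses import dataclass

OK, WARN, ERROR = "OK", "WARN", "ERROR"


@dataclass
class Trigger:
    name: str
    warn_value: float    # hypothetical threshold for the WARN state
    error_value: float   # hypothetical threshold for the ERROR state


@dataclass
class Event:
    metric: str
    old_state: str
    new_state: str
    value: float
    timestamp: int


def state_for(trigger, value):
    """Map a single value to a state using fixed thresholds."""
    if value >= trigger.error_value:
        return ERROR
    if value >= trigger.warn_value:
        return WARN
    return OK


def check(trigger, metric, points, last_state=OK):
    """Walk cached (timestamp, value) points in time order.

    An event is emitted only when the state changes, so a metric that
    stays above the threshold does not produce an event on every check.
    Returns the list of events and the final state.
    """
    events = []
    state = last_state
    for timestamp, value in sorted(points):
        new_state = state_for(trigger, value)
        if new_state != state:
            events.append(Event(metric, state, new_state, value, timestamp))
            state = new_state
    return events, state


trigger = Trigger("cpu load", warn_value=0.7, error_value=0.9)
events, state = check(
    trigger,
    "servers.web1.cpu.load",
    [(60, 0.5), (120, 0.8), (180, 0.95), (240, 0.2)],
)
```

In this example the metric goes OK, WARN, ERROR and back to OK, so `events` holds three state transitions and `state` ends as `"OK"`.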

In addition, we gave moira-worker the ability to use more versatile Python expressions for notifications, not just Graphite functions.

Screenshot: notification edit page

moira-web

We want to be able to distinguish between users and show each of them only what they want to see. These are two separate tasks.

How do we distinguish between users? A monitoring system is an intranet tool, and no single authentication solution would suit everyone. Some need LDAP, others need login through GitHub or Google. We did not invent anything: Moira simply trusts the X-Webauth-User header received from the upstream web server (for example, nginx or oauth_proxy). For Moira to tell your users apart, pass a user ID (for example, a login) as the value of this header.
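On the application side, trusting such a header takes only a few lines. The sketch below is a generic WSGI example, not moira-web's code: `current_user` and `app` are hypothetical names, and answering 401 when the header is missing is an assumption of this sketch rather than documented Moira behavior.

```python
def current_user(environ):
    """Return the login passed by the upstream proxy, or None if it is absent.

    WSGI exposes the HTTP header X-Webauth-User as HTTP_X_WEBAUTH_USER.
    """
    login = environ.get("HTTP_X_WEBAUTH_USER", "").strip()
    return login or None


def app(environ, start_response):
    """Tiny WSGI application that greets the authenticated user."""
    user = current_user(environ)
    if user is None:
        # Assumption of this sketch: reject requests without the header.
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"missing X-Webauth-User header\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("hello, %s\n" % user).encode("utf-8")]
```

Since the header is trusted blindly, an application like this must be reachable only through the authenticating proxy, and the proxy should overwrite any X-Webauth-User value supplied by the client.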

How do we show users only the settings they want to see? For this we use tags, which work both in the web interface and in notification settings.

Screenshot: list of configured notifications on the main screen

Screenshot: subscription edit page
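One plausible way to use tags for both purposes is shown below. The data layout and the functions `visible_triggers` and `subscriptions_for` are hypothetical, and the matching rule (a trigger must carry every selected tag) is an assumption of this sketch, not a statement about Moira's exact semantics.

```python
def visible_triggers(triggers, selected_tags):
    """Web interface: keep the triggers that carry every tag the user selected."""
    wanted = set(selected_tags)
    return [t for t in triggers if wanted <= set(t["tags"])]


def subscriptions_for(trigger, subscriptions):
    """Notifications: keep the subscriptions whose tags are all present on the trigger."""
    trigger_tags = set(trigger["tags"])
    return [s for s in subscriptions if set(s["tags"]) <= trigger_tags]


triggers = [
    {"name": "billing errors", "tags": ["billing", "errors"]},
    {"name": "billing cpu", "tags": ["billing", "cpu"]},
    {"name": "search cpu", "tags": ["search", "cpu"]},
]
subscriptions = [
    {"user": "alice", "tags": ["billing"]},
    {"user": "bob", "tags": ["search", "cpu"]},
]

billing_view = visible_triggers(triggers, ["billing"])
to_notify = subscriptions_for(triggers[1], subscriptions)
```

Here `billing_view` contains the two billing triggers, and only alice's subscription matches the "billing cpu" trigger, because bob also requires the `search` tag.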

moira-notifier

We want to be able to send notifications through popular channels: Slack, Telegram, email, Pushover. In addition, moira-notifier:

Follows a given schedule (for example, it does not bother you at night over trifles).
Protects the user from a flood of emails if one of the triggers goes haywire and starts sending notifications too often.
Resends the notification if delivery fails.
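The three behaviors above can be sketched independently of any particular channel. Everything in this example is an assumption made for illustration: the names `in_schedule`, `Throttle` and `send_with_retry`, the 09:00 to 21:00 window, and the limits are hypothetical, and the real moira-notifier may, for instance, delay messages rather than drop them.

```python
from datetime import datetime, time as dtime


def in_schedule(now, start=dtime(9, 0), end=dtime(21, 0)):
    """True if `now` falls inside the allowed delivery window [start, end)."""
    return start <= now.time() < end


class Throttle:
    """Allow at most `limit` notifications per trigger within `window` seconds."""

    def __init__(self, limit=3, window=3600):
        self.limit = limit
        self.window = window
        self.sent = {}  # trigger id -> timestamps of recent notifications

    def allow(self, trigger_id, now):
        # Keep only the notifications that are still inside the window.
        recent = [ts for ts in self.sent.get(trigger_id, []) if now - ts < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.sent[trigger_id] = recent
        return allowed


def send_with_retry(send, message, attempts=3):
    """Call send(message) until it succeeds or the attempts run out.

    Returns the number of the successful attempt, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            send(message)
            return attempt
        except Exception:
            continue  # a real notifier would also wait before retrying
    return None


throttle = Throttle(limit=2, window=60)
decisions = [throttle.allow("cpu load", now) for now in (0, 10, 20, 70)]
```

With a limit of two notifications per 60 seconds, the third notification at second 20 is suppressed and the one at second 70 goes through again, so `decisions` is `[True, True, False, True]`.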

How to try Moira? How to ask the developers a question?

We have been using Moira in production for half a year now. On a single eight-core virtual machine, we process an incoming stream of 20 thousand points per second with thousands of monitored metrics and hundreds of users.

At the end of 2015, we decided that Moira was stable enough to offer to the open source community.

Installation and usage instructions are in our documentation. You can also look at the code on GitHub and send a bug report or a pull request.

If your workload is higher or your resources are more limited, come talk to us in our Gitter chat. We will tell you how to scale Moira for your case, or we will try to optimize its code together.

And, of course, we will be happy to answer your questions in the comments.

Source: https://habr.com/ru/post/276403/
