Rrrring! It is 3 a.m., you are in the middle of a wonderful dream, and suddenly the phone rings. You are on call this week, and apparently something has happened. An automated system is calling so that you can figure out what is wrong. This is an important part of running modern computer systems, so let's see how to make notifications easier on people.
Meet a monitoring philosophy that grew out of several decades of my being on call in different monitoring teams. It was influenced in many ways by Rob Ewaschuk's My Philosophy on Alerting, the veritable bible on the subject, which is included in Google's SRE book, and by John Allspaw's talk Considerations for Alert Design.
Thanks to Kelly Dunn, Arijit Mukherji, and Maxime Petazzoni for their help in editing this post.
I decided to come up with a catchy acronym, like Brendan Gregg's USE method or Tom Wilkie's RED method. I call it the CASE method. It describes four things to pay attention to when working with automated alerting: a notification should be rich in Context, Actionable, Symptom-based, and Evaluated regularly.
If you use CASE, you treat notifications with healthy skepticism and do not wake people at night without need. Monitoring must be evaluated regularly for usefulness and effectiveness. And when a person does receive a notification, they will have better mental models and more confidence.
To make it easier to remember, imagine that you need to make a CASE for each alert, that is, to justify it. :sunglasses:
Being on call can be agonizing, for many reasons, and CASE will not eliminate all of them. But with it, the notifications that wake you at night will at least be better ones. The method also covers organizational processes that help with this.
The beauty of the RED and USE methods is that they not only tell us how to work but also give us a common language. I hope the CASE method will make it easier to discuss the notifications that protect our systems but give our colleagues no rest.
The bottom line is that you need to create a culture in your organization where notifications are treated with healthy skepticism. A notification may have been created for a good reason, but that does not guarantee it will keep its value over time. Why did we set up this notification? How long has it been since its criteria were reviewed? CASE answers these questions.
3 a.m. is not the best time to read messages full of jargon. To respond effectively, you need information. Ideally it is information about the specific problem, from which the context is immediately clear, and notifications must be configured to make that possible. These are the "observe" and "orient" steps of the OODA loop. Time spent on this setup is well spent, because repeatedly distracting a person costs even more. Let's respect each other.
There are many sources of problems. Especially ghosts.
How can we help the on-call engineer? The notification is the first thing the engineer sees, so every hypothesis is built on it. Next come runbooks and dashboards, but do they always contain data specific to this notification rather than general information? Allspaw advises “to think about how to interpret or respond to a notification” (slide 29) 1. A good notification is designed around the on-call engineer, not just configured around a threshold value.
So here are some ideas for improving the context of notifications:
Ideally, your incident management program produces advice on improving notification context whenever incidents are investigated. There is always something to work on!
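To make the idea of context concrete, here is a minimal sketch of a notification payload that carries its own context. Everything in it is an assumption made for illustration: the `Notification` class, its field names, and the example URLs are hypothetical and do not come from any particular alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    """A hypothetical alert payload; every field name here is illustrative."""
    title: str
    service: str
    symptom: str          # what users are experiencing right now
    current_value: float
    threshold: float
    unit: str
    runbook_url: str      # instructions specific to this alert
    dashboard_url: str    # a dashboard scoped to this alert, not a general one
    recent_changes: list[str] = field(default_factory=list)


def render(n: Notification) -> str:
    """Format the payload so the key context is readable at 3 a.m."""
    lines = [
        f"[{n.service}] {n.title}",
        f"Symptom: {n.symptom}",
        f"Value: {n.current_value}{n.unit} (threshold {n.threshold}{n.unit})",
        f"Runbook: {n.runbook_url}",
        f"Dashboard: {n.dashboard_url}",
    ]
    if n.recent_changes:
        lines.append("Recent changes: " + "; ".join(n.recent_changes))
    return "\n".join(lines)


example = Notification(
    title="Checkout error rate above SLO",
    service="checkout-api",
    symptom="Customers see failed payments",
    current_value=4.2,
    threshold=1.0,
    unit="%",
    runbook_url="https://example.com/runbooks/checkout-errors",
    dashboard_url="https://example.com/dashboards/checkout-errors",
    recent_changes=["deploy of checkout-api 40 minutes ago"],
)
print(render(example))
```

The point of the sketch is only that the person who is paged should not have to hunt for the runbook, the relevant dashboard, or the most recent change.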
Does the on-call engineer need to do something in response to the notification? If nothing needs to be done, or it is not clear what to do, why was anyone woken up? Avoid notifications that reach on-call engineers but require no action.
What to do? What's up?
Previously, when systems were simple and teams were small, we set up monitoring simply to stay informed. A notice that heap usage has increased gives us context if the service malfunctions later. At scale, such notifications only confuse, because our systems are always running in some state of degradation, more or less severe. This quickly leads to alert fatigue and, of course, to desensitization. As a result, the on-call engineer ignores or even filters such notifications and does not always respond to them properly. Don't fall into this trap! Do not set up notifications for everything and then route them to some godforsaken mail folder.
Here is what an actionable notification looks like:
I want to be clear: I am not saying that notifications should fire only for the most important API SLOs (service-level objectives). SLO monitoring subdivides again and again, and the same approach applies to every service. Of course you will track the SLOs that matter most to the customers who pay you. But infrastructure SLOs, such as those for databases, also need to be monitored. Soon enough you will be dealing with internal customers and supporting them too. And so on, ad infinitum.
Whether you like it or not, you are working in a distributed system (Cavage) 2. As a result, you use various tactics to isolate services and protect them from failures (Treynor et al.) 3. And although a long garbage-collection pause or a slow database query does point to a problem, you do not need to rush to fix it if users are not going to be affected in the near future.
These are important signals, and they may well be actionable, but if they do not affect users, they are not urgent enough to interrupt the on-call engineer. Cause-based notifications are snapshots of our mental models of how the system fails. It is better to track important symptoms than to try to enumerate every possible cause of failure.
For notifications to be actionable, focus on the performance indicators that matter to users. Ewaschuk calls this “user monitoring.” Remember that this philosophy needs to be applied throughout the organization. If a service somewhere deep in the infrastructure has urgent problems, the responsible team will deal with them. Protecting systems from such failures is a separate topic entirely (Treynor et al., section on strategies for minimizing critical dependencies) 3.
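The distinction between symptoms and causes can be sketched as a small routing rule. This is only an illustration under stated assumptions: the `Signal` class, the `user_facing` flag, the example thresholds, and the idea of sending non-urgent cause signals to a ticket queue are all hypothetical choices, not something the article prescribes.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """A hypothetical monitored signal; names and fields are illustrative."""
    name: str
    user_facing: bool   # does it measure something users actually experience?
    value: float
    threshold: float


def route(signal: Signal) -> str:
    """Return 'page', 'ticket', or 'ignore' for a signal.

    Only a breached user-facing symptom interrupts a person. A breached
    cause-level signal is kept for follow-up during working hours (assumed
    here to be a ticket) instead of waking anyone up.
    """
    if signal.value <= signal.threshold:
        return "ignore"
    return "page" if signal.user_facing else "ticket"


# Symptom: the share of failed user requests is above its objective.
error_ratio = Signal("checkout_error_ratio", True, value=0.03, threshold=0.01)
# Cause: a long garbage-collection pause that users may never notice.
gc_pause = Signal("gc_pause_ms", False, value=800.0, threshold=500.0)

print(route(error_ratio))  # page
print(route(gc_pause))     # ticket
```

A real setup would usually evaluate such conditions over a time window instead of a single value, which this sketch deliberately leaves out.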
Richard Cook reminds us that complex systems contain many flaws, defects, and latent problems 4. Trying to enumerate every possible cause is a Sisyphean task: you try to describe the problems, but they change all the time. Cindy Sridharan believes that “systems do not have to be in perfect condition every second,” and that a more humane approach is preferable (Distributed Systems Observability, 7) 5.
Typically, cause-based notifications are set up as remediation after an incident. These narrow, after-the-fact notifications create a false sense of confidence, because the system keeps finding new ways to break.
Diagnostic monitoring tools help only if you treat them as a way to get from a symptom to a solution. Without that feedback loop, you will simply be buried in belated notifications and charts of past failures, with not a word about future ones. For the organization, this is a great opportunity to move from defense to offense. Developers and product managers will share the same expectations and clear goals. The case (the CASE :wink:) for each notification is clear.
Sometimes our system leaves us almost no choice but to use cause-based notifications. And sometimes the on-call engineers know very well that a given signal will inevitably lead to failure, which makes it actionable. Or maybe you are simply not sure what is happening and set up a notification to be on the safe side. Let's hope such a measure is temporary, needed only until we change the system to resolve the underlying degradation.
Keep the other CASE components in mind when dealing with such situations. Temporary does not mean you can stop thinking.
Any change to the system (new code, new infrastructure, anything new) expands the range of possible failures (Cook, 3) 4. Is this notification still working as expected? Clear, up-to-date mental models of the system and experience in responding to notifications, both in support of a preventive approach, are key features of a learning-oriented organization. Defects in systems are constantly evolving, and we must keep up with them.
You need to continually evaluate the quality of each notification to make sure it still works as expected. Dear managers! Your teams will have a much easier time if you help them run this process! Here are some ideas for evaluation:
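As one illustration of what such an evaluation could look like (this sketch is not taken from the original list), you could measure how often each notification actually led to an action. The record format, the function names, and the 0.5 cutoff below are assumptions chosen for the example.

```python
from collections import defaultdict


def actionable_ratio(history):
    """Map each alert name to the fraction of its pages that led to an action.

    `history` is an iterable of (alert_name, action_taken) pairs, a
    hypothetical record format assumed for this sketch.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for alert_name, action_taken in history:
        fired[alert_name] += 1
        if action_taken:
            acted[alert_name] += 1
    return {name: acted[name] / fired[name] for name in fired}


def needs_review(history, min_ratio=0.5):
    """List alerts whose pages rarely led to action (the cutoff is arbitrary)."""
    ratios = actionable_ratio(history)
    return sorted(name for name, ratio in ratios.items() if ratio < min_ratio)


pages = [
    ("checkout_error_rate", True),
    ("checkout_error_rate", True),
    ("heap_usage_high", False),
    ("heap_usage_high", False),
    ("heap_usage_high", True),
]
print(needs_review(pages))  # ['heap_usage_high']
```

An alert that shows up in this list is not necessarily wrong, but it is a good candidate for the questions CASE asks: why was it set up, and do its criteria still make sense?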
I believe the CASE method helps developers and organizations discuss how to set up and send automated notifications. A single developer can start evaluating notifications with the CASE method, and then the rest of the organization (other developers, management, and the incident management program) can join in to keep notifications in good shape. No special tools or complex processes are required.
The entire industry must think about the human factors of being on call without compromising first-class customer service. All of these tools and practices can and should be improved. I hope the CASE method will help with this.
Enjoy your improved notifications!
Source: https://habr.com/ru/post/448454/