CASE method: humane monitoring

Rrrring! It is 3 a.m., you are in the middle of a wonderful dream, and suddenly the phone goes off. You are on call this week, and apparently something has happened: an automated system is calling you to figure out what is wrong. Paging is an essential part of running modern computer systems, but let's look at how to make notifications work better for people.

Meet the monitoring philosophy that grew out of several decades of on-call duty on various monitoring teams. It was heavily influenced by Rob Ewaschuk's My Philosophy on Alerting, a veritable bible on the subject that was included in Google's SRE book, and by John Allspaw's Considerations for Alert Design.

Kelly Dunn, Arijit Mukherji, and Maxime Petazzoni: thank you for your help editing this post.

What is CASE?

I decided to come up with a catchy acronym, like Brendan Gregg's USE method or Tom Wilkie's RED method. I call it the CASE method. It describes four qualities to pay attention to when working with automated alerting:

Context-heavy (rich in context)
Actionable (of practical value)
Symptom-based (focused on symptoms)
Evaluated (regularly reviewed)

If you use CASE, you treat notifications with healthy skepticism and do not wake people at night needlessly. You regularly evaluate your monitoring for usefulness and effectiveness. And when a person does receive a notification, they have a better mental model and more confidence.

To make it easy to remember, imagine that you need to make a CASE to justify each alert. :sunglasses:

Why do we need all this?

Being on call can be agonizing, for many reasons, and CASE will not eliminate them all. But with it, the notifications that wake you at night will at least be better ones. The method also touches on organizational processes that help here.

The beauty of the RED and USE methods is that they not only tell us how to work but also give us a shared language. I hope the CASE method makes it easier to discuss the notifications that protect our systems but give our colleagues no rest.

The bottom line is that you need to create a culture in your organization where notifications are treated with healthy skepticism. A notification may be created for a good reason, but that does not guarantee it will keep its value. Why did we set up this notification? When were its criteria last reviewed? CASE answers these questions.

Context-heavy - rich in context

3 a.m. is not the best time to read messages full of jargon. To respond effectively, you need information. Ideally it is information about the specific problem that makes the context clear at once, and notifications should be configured to make that possible. These are the "observe" and "orient" stages of the OODA loop. Time spent on this setup is well spent, because repeatedly interrupting a person costs even more. Let's respect each other.

There are many sources of problems. Especially ghosts.

How can we help the on-call engineer? The notification is the first thing they see, so they build all their hypotheses from it. Next they turn to runbooks and dashboards, but do those always hold data specific to this notification rather than general information? Allspaw advises us “to think about how to interpret or respond to a notification” (slide 29) 1 . A good notification is designed around the on-call engineer, not just set on a threshold value.

Here are some ideas for improving notification context:

Show the user something useful and purpose-built, not just generic runbooks or dashboards. My teams used to investigate with dashboards tailored to specific notifications. That helps when the problem is a known one and only confuses otherwise, so you have to strike a balance.
Show the notification's history: is it new? Does it fire often? Is it seasonal?
Show recent changes in system state. Has anything changed recently? (For example, a warm-up, or a feature being enabled or disabled.)
Show relationships and feed the mental model: the system's dependencies should be clearly visible, preferably with an indication of their health.
Connect the user with the team quickly: can they see current incidents or find out who else in the company received the notification? Has the incident management program been activated?

Ideally, the incident management program feeds back advice on improving notification context whenever incidents are investigated. There is always something to work on!
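The ideas above can be sketched as a small data structure that travels with the notification. This is only an illustration: the AlertContext class, its field names, and the 30-day window are assumptions made for the example, not part of any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AlertContext:
    """Hypothetical container for the context attached to one notification."""
    title: str
    fired_last_30_days: int  # history: how often has this fired recently?
    recent_changes: List[str] = field(default_factory=list)
    dependencies: Dict[str, str] = field(default_factory=dict)  # name -> health
    open_incident: Optional[str] = None  # link or ID, if one is already open


def render(ctx: AlertContext) -> str:
    """Build a plain-text notification body from the context."""
    lines = [ctx.title]
    if ctx.fired_last_30_days == 0:
        lines.append("History: first time in the last 30 days")
    else:
        lines.append(
            f"History: fired {ctx.fired_last_30_days} times in the last 30 days"
        )
    for change in ctx.recent_changes:
        lines.append(f"Recent change: {change}")
    # Sort so the output is stable regardless of insertion order.
    for name, health in sorted(ctx.dependencies.items()):
        lines.append(f"Dependency {name}: {health}")
    lines.append(f"Open incident: {ctx.open_incident or 'none'}")
    return "\n".join(lines)
```

The point of the design is that history, recent changes, dependency health, and incident status arrive together with the page, so the person woken up does not have to hunt for them.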

Actionable - practical value

Does the on-call engineer need to do something in response to the notification? If nothing needs doing, or it is unclear what to do, why were they woken up? Avoid notifications that reach the on-call engineer but require no action.

What to do? What's up?

Back when systems were simple and teams were small, we set up monitoring just to stay informed. A notice that heap usage had grown gave us context if the service malfunctioned later. At scale, such notifications only confuse, because our systems are always running in some degree of degradation. This quickly leads to alert fatigue and, inevitably, to desensitization. On-call engineers then ignore or even filter such notifications and do not always respond as they should. Don't fall into this trap! Don't set up notifications for everything and then route them to some godforsaken mail folder.

Here is what an actionable notification looks like:

It requires action rather than just reporting news.
The action is difficult or risky to automate. If it can be automated, automate it and stop pestering people!
It carries urgency, expressed as a service-level agreement (SLA) or recovery time objective (RTO), so the on-call engineer can engage the organization's incident management program.

I want to clarify: I am not saying that notifications should come only for the most important SLOs (service-level objectives) for the API. SLO monitoring subdivides again and again, and the same approach applies to every service. Of course you will track the SLOs that matter most to paying customers. But infrastructure SLOs, such as those for databases, need monitoring too. Before long you have internal customers to support as well. And so on, ad infinitum.
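The three criteria above can be read as a routing decision. The sketch below is one possible reading. The Alert fields and the destination names ("dashboard", "automate", "ticket", "page") are hypothetical, and sending non-urgent work to a ticket queue is an added assumption rather than something the criteria prescribe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical description of one alert rule."""
    name: str
    requires_action: bool
    safe_to_automate: bool
    respond_within_minutes: Optional[int]  # urgency from an SLA/RTO, if any


def route(alert: Alert) -> str:
    """Decide where an alert should go; only actionable, urgent ones page a human."""
    if not alert.requires_action:
        return "dashboard"  # news, not a call to action
    if alert.safe_to_automate:
        return "automate"   # a machine should handle it
    if alert.respond_within_minutes is None:
        return "ticket"     # assumption: no stated urgency means no page
    return "page"
```

For example, route(Alert("heap usage growing", False, False, None)) returns "dashboard", which matches the heap example earlier in this section.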

Symptom-based - emphasis on symptoms

Whether you like it or not, you work in a distributed system (Cavage) 2 . As a result, you use different tactics to isolate services and protect them from failures (Treynor et al.) 3 . And although a long garbage collection pause or a slow database query points to a problem, you need not rush to fix it if users will not be affected in the near future.

These are important signals, and they may be actionable, but if they are not hurting users, they are not urgent enough to interrupt the on-call engineer. Cause-based notifications are snapshots of our mental models of how the system fails. It is better to track the important symptoms than to try to enumerate every possible cause of failure.

For notifications to be actionable, focus on the performance indicators that matter to users. Ewaschuk calls this “user monitoring.” Remember that this philosophy applies across the whole organization. If a service deep in the infrastructure has urgent problems, the responsible team will handle them. Protecting systems from such failures is a separate topic altogether (Treynor et al., section on strategies for minimizing critical dependencies) 3 .
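To make the distinction concrete, here is a minimal sketch of a symptom-based check. It assumes you can count failed and total user-facing requests over some window. The function names and the 99.9% default target are illustrative assumptions.

```python
def error_ratio(failed: int, total: int) -> float:
    """Share of user-facing requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def should_page(failed: int, total: int, slo_target: float = 0.999) -> bool:
    """Page only when the user-visible error ratio exceeds what the SLO allows.

    Cause signals (GC pauses, slow queries) are deliberately not inputs here:
    they belong in the notification's context, not in the paging decision.
    """
    allowed = 1.0 - slo_target
    return error_ratio(failed, total) > allowed
```

With these defaults, should_page(50, 1000) is true because 5% of requests failed, while a long garbage collection pause that causes no failed requests never pages anyone.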

Symptoms are not so changeable.

Richard Cook reminds us that complex systems contain many flaws, defects, and problems 4 . Trying to list every possible cause is a Sisyphean task: you try to describe the problems, but they keep changing. Cindy Sridharan argues that “systems do not have to be in perfect condition every second,” and that a more humane approach is better ( Distributed Systems Observability , 7) 5 .

Avoid notifications added in reaction to incidents

Typically, after an incident is fixed, notifications are set up for its causes. These narrow, after-the-fact notifications create a false sense of confidence, because the system keeps finding new ways to break.

Don't be fooled by cause-based notifications. Instead, ask:

Why did a symptom-based notification fail to catch the problem?
Would better context have helped the user?
How can monitoring tools be improved to speed up diagnosis, rather than piling up notifications about what has already happened?

Diagnostic monitoring tools help only if you treat them as a way to get from a symptom to a resolution. Without that feedback loop, you will simply be buried in belated notifications and charts of past failures, with not a word about future ones. For the organization, this is a great opportunity to go from defense to offense. Developers and product managers will share the same expectations and clear goals. The case (CASE, :wink:) for every notification is clear.

Cause-based notifications are acceptable in moderation

Sometimes our system leaves us almost no choice but cause-based notifications. Sometimes the on-call engineers know full well that a particular signal will inevitably lead to failure, so the notification is actionable. Or perhaps you are just not sure what is going on and set up notifications to be on the safe side. Hopefully the measure is temporary, until we change the system to deal with the degradation.
Keep the other CASE components in mind in these situations. Temporary does not mean thoughtless.

Evaluated - regularly reviewed

Any change to the system (new code, new infrastructure, anything new) widens the range of possible failures (Cook, 3) 4 . Is this notification still working as expected? Clear, up-to-date mental models of the system and experience of responding to notifications, both in support of a preventive approach, are hallmarks of a learning-oriented organization. System defects keep evolving, and we must keep up with them.

You need to continually evaluate the quality of every notification to make sure it works as expected. Dear managers: it will be much easier for your teams if you help them run this process! Here are some ideas for evaluation:

Use chaos engineering, game days, or other ways of testing notifications. The team can do this on its own, without invoking the heavyweight incident management program!
Make collecting all incident-related notifications part of the incident management program. Mark each one as useful, harmful, irrelevant, confusing, and so on, and use that as feedback.
Good notifications fire infrequently and are thoroughly checked. Make sure all the links work, point to the right context, and so on.
If a notification never fires or fires too often, something is wrong with it. Fix it or remove it. Beware of both too little and too much activity!
Give notifications expiration timestamps. When one expires, evaluate the notification using the CASE method and update the timestamp. Check expiration dates regularly, as you would with food.
Make notifications easy to improve. Use monitoring as code and keep notification definitions in a Git repository. Pull requests bring the team in and give you a history of past notifications. And you will no longer be afraid to change notifications or have to ask their owners for permission.
Set up a feedback channel for notifications, even if it is just a Google Form, so on-call engineers can flag notifications as useless or intrusive. Embed a link or call to action in the notification itself and review the feedback regularly.
Make it a team rule that on-call engineers work on making on-call easier when things are quiet. Leave everything a little better than you found it.
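Two of the ideas above, expiration dates and checking for notifications that never fire or fire too often, are easy to automate as a periodic review. The sketch below assumes each rule records a review-by date and a firing count. The AlertRule fields, the 90-day window, and the threshold of 30 firings are arbitrary assumptions to tune for your own team.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AlertRule:
    """Hypothetical alert definition, e.g. loaded from files kept in Git."""
    name: str
    review_by: date           # the "expiration date" of the rule
    fired_last_90_days: int


def review_reasons(
    rule: AlertRule, today: date, noisy_threshold: int = 30
) -> List[str]:
    """Return the reasons, if any, why this rule needs a CASE review."""
    reasons = []
    if rule.review_by < today:
        reasons.append("expired")
    if rule.fired_last_90_days == 0:
        reasons.append("never fires")
    elif rule.fired_last_90_days > noisy_threshold:
        reasons.append("fires too often")
    return reasons
```

Run over every rule in the repository, a check like this could produce the review list for a quiet on-call shift. After a review, the owner updates review_by in the same pull request.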

Conclusion

I believe the CASE method helps developers and organizations discuss how to set up and send automated notifications. A single developer can start evaluating notifications with the CASE method, and then the rest of the organization (other developers, management, the incident management program) can join in to keep notifications healthy. No special tools or complex processes are required.

The whole industry needs to think about the human factor in on-call duty without compromising first-class customer service. All these tools and practices can and should be improved. I hope the CASE method helps.

Enjoy your improved notifications!