
Anatomy of an incident, or how to work on reducing downtime


Sooner or later, every project reaches the point where it is time to work on the stability and availability of the service. For some services at an early stage, feature development speed matters more: the team is not fully formed yet, and the technology choices are not made very carefully. For other services (most often b2b products), the need to guarantee high uptime comes with the very first public release, because that is how you earn customer trust. But suppose that moment X has finally arrived and you have started to worry about how long your service is down during the reporting period. Below, I propose to look at what downtime is made of and how best to work on reducing it.


Indicators


Obviously, before improving something you need to understand its current state. So if we have set out to reduce downtime, we should first start measuring it.
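As a minimal sketch of what "measuring it" could look like (my own illustration; the incident records and the Incident structure are hypothetical, not anything from the article), downtime and uptime over a reporting period can be computed from a simple log of incidents:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.resolved_at - self.started_at

# Hypothetical incident log for the reporting period.
incidents = [
    Incident(datetime(2018, 9, 3, 14, 0), datetime(2018, 9, 3, 14, 40)),
    Incident(datetime(2018, 9, 17, 2, 10), datetime(2018, 9, 17, 2, 25)),
]

period = timedelta(days=30)  # length of the reporting period
downtime = sum((i.duration for i in incidents), timedelta())
uptime_pct = 100 * (1 - downtime / period)

print(f"downtime: {downtime}, uptime: {uptime_pct:.3f}%")
```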


We will not discuss here in detail how exactly to do this, or the pros and cons of different approaches, but in broad strokes the process looks something like this:



When talking about uptime and downtime, another indicator is often mentioned:


MTTR (mean time to repair) is the average time from the start of an incident until its end.
The problems with it begin with the very first word of the abbreviation: since all incidents are different, averaging their durations tells us nothing useful about the system.
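To make this concrete, here is a tiny made-up example: a single long outage dominates the mean, so MTTR says little about what a typical incident looks like.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes over a quarter.
durations = [5, 7, 6, 8, 240]   # four short blips and one long outage

print(mean(durations))    # 53.2 -> MTTR suggests ~an hour per incident
print(median(durations))  # 7    -> most incidents were actually minor
```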


This time we will not average anything; instead, let's just look at what happens during an incident.


Anatomy of an incident


Let's see which significant stages can be distinguished within an incident:

- detection: from the moment the problem actually starts until an alert goes out;
- reaction: from the alert until a person starts dealing with the incident;
- investigation: figuring out what is happening and what to do about it;
- elimination: the actions that actually restore the service.




Perhaps this model is incomplete and there are other stages, but I propose adding them only once we understand how that would help us in practice. For now, let's look at each stage in more detail.


Detection


Why do we spend any time at all on detecting a problem? Why not send a notification on the very first error a user gets? In fact, I know several companies that tried to do exactly that, and the idea was abandoned within a few hours, during which a few dozen SMS messages had already arrived. I don't think there is a single reasonably large service without a constant "background" stream of errors. Not all of them mean that something is broken: there are bugs in the software, invalid data submitted through forms with insufficient validation, and so on.


As a result, the criterion for opening an incident is an error level (or other metric) that exceeds the normal daily fluctuations. This is exactly why the responsible engineers are notified some time after the problem has actually started.
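As a rough sketch of such a criterion (my own simplification, not the article's actual detection logic; the threshold factor is an assumption), an incident could be opened only when the current error rate clearly exceeds the usual background level:

```python
# A deliberately simplified detection criterion: compare the current
# error rate against the normal daily "background" level.
def should_open_incident(current_error_rate: float,
                         baseline_error_rate: float,
                         factor: float = 3.0) -> bool:
    """Open an incident only when errors clearly exceed daily fluctuations."""
    return current_error_rate > baseline_error_rate * factor

# Example: the background is ~2 errors/min, and we are now seeing 15/min.
print(should_open_incident(15.0, 2.0))  # True
```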


But back to our original task: reducing the duration of incidents. How can we shorten the detection stage? Notify faster? Invent some super-smart anomaly detection logic?


I propose doing nothing for now and looking at the following stages first, since in reality they are all interrelated.


Reaction


Here we are dealing with a purely human factor. We assume that monitoring has detected the problem and we have successfully woken up the on-call engineer (the whole escalation chain also happened at the previous stage).


Consider the "worst" case, we do not have a dedicated duty service, and an alert overtakes a peacefully sleeping admin. His actions:



In the worst case we end up with about 35 minutes of reaction time. From my observations, this estimate looks realistic.


Since at this stage we are dealing with people, we need to act very carefully and deliberately. Under no circumstances should you write a regulation prescribing how quickly a person who has just woken up must move! Instead, let's simply create the right conditions.


First, let's relieve the engineer of the doubt about whether the problem will go away on its own. This is done very simply: make the alert criterion insensitive to minor issues and notify only if the incident has already lasted a significant amount of time. Yes, we have just increased the duration of the "detection" stage, but let's look at an example:



Potentially, this approach shaves 15+ minutes off the total reaction time. If even that reaction time does not suit you, it is time to think about a dedicated on-call rotation.
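To make the idea concrete, here is a minimal sketch of such "delayed" alerting logic; the 10-minute hold time and the class name are my own assumptions, not anything from the article:

```python
from datetime import datetime, timedelta
from typing import Optional

class DebouncedAlert:
    """Fire only if the problem condition holds continuously for `hold_for`."""

    def __init__(self, hold_for: timedelta = timedelta(minutes=10)):
        self.hold_for = hold_for
        self.problem_since: Optional[datetime] = None

    def update(self, problem: bool, now: datetime) -> bool:
        if not problem:
            self.problem_since = None          # condition cleared, reset
            return False
        if self.problem_since is None:
            self.problem_since = now           # problem just started
        return now - self.problem_since >= self.hold_for

# Usage: feed the check result in every minute; a notification is sent
# only when the problem has lasted 10 minutes, filtering out short blips.
```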


Investigation


This is perhaps the most difficult stage of an incident: you need to understand what is happening and what to do about it. In reality, this stage very often overlaps with taking action, since the process usually goes like this:



This stage usually contributes the most to the total duration of an incident. How do we reduce it?
Here things are less clear-cut; there are several directions to work in:



Elimination


As I said above, this stage often merges with the previous one. But it also happens that the cause is immediately clear, yet recovery will take a long time. For example, the server with the primary database has died (I still can't get used to saying "primary" instead of "master" :), but you have never promoted a replica before, so you end up reading the documentation on the fly, rolling out a new application config, and so on.


Naturally, after every significant incident you need to figure out how to keep it from happening again, or at least how to recover much faster next time. But let's see in which areas we can try to work proactively:



MTBF (Mean Time Between Failures)


This is another common indicator mentioned when discussing uptime. Again, I propose not to average anything and simply talk about the number of incidents that occur over a given time interval.


This is where the question of how you take care of the resiliency of your service comes to the fore:



Sometimes, to estimate and predict all of this, a "risk map" is drawn up: each scenario (naturally, only the ones you can think of; there are always some you do not yet know about) is assigned a likelihood and an impact (short or long downtime, data loss, reputational damage, and so on). You then work through this map systematically, closing the most probable and most serious scenarios first.
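As an illustration only (the scenarios, likelihoods, and impact scores below are invented), such a risk map can be kept as a simple scored list and sorted by likelihood times impact:

```python
# Hypothetical risk map: each scenario gets a likelihood and an impact
# score (1..5); work through the list starting from the highest product.
risk_map = [
    {"scenario": "primary DB server dies",      "likelihood": 2, "impact": 5},
    {"scenario": "bad release breaks checkout", "likelihood": 4, "impact": 4},
    {"scenario": "datacenter network outage",   "likelihood": 1, "impact": 5},
    {"scenario": "disk full on log host",       "likelihood": 3, "impact": 2},
]

for item in sorted(risk_map, key=lambda r: r["likelihood"] * r["impact"],
                   reverse=True):
    print(f'{item["likelihood"] * item["impact"]:>2}  {item["scenario"]}')
```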


Another useful technique is classifying past incidents. There is a lot of talk these days about how valuable it is to write incident "post-mortems" that analyze the causes of the problem and the actions people took, and outline follow-up actions. But to get a quick picture of the causes of all incidents over the past period, it is convenient to sum up their durations grouped by "problem class" and then act on the classes that account for the most downtime (a small sketch of this grouping follows below):



In other words, even if you do not try to predict possible failure scenarios, it is definitely worth working through the incidents that have already happened.
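As a hedged illustration of that grouping (the problem classes and durations here are invented), summing downtime per class over the incident log might look like this:

```python
from collections import defaultdict

# Hypothetical post-mortem records: (problem class, downtime in minutes).
incidents = [
    ("bad release",  25), ("database", 90), ("bad release", 15),
    ("external API", 10), ("database", 40), ("human error", 30),
]

downtime_by_class = defaultdict(int)
for problem_class, minutes in incidents:
    downtime_by_class[problem_class] += minutes

# Classes that account for the most downtime come first.
for problem_class, minutes in sorted(downtime_by_class.items(),
                                     key=lambda kv: kv[1], reverse=True):
    print(f"{minutes:>4} min  {problem_class}")
```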


Summary


All incidents are different:



The algorithm for working on increasing uptime is very similar to any other optimization:


measure -> analyze the causes -> fix the biggest contributors -> repeat

From my own experience, I can say that to significantly improve uptime it is often enough simply to start tracking it and analyzing the causes of incidents. It usually turns out that the simplest changes bring the biggest effect.


Our monitoring service helps not only with the "detection" stage but also greatly shortens the "investigation" stage (our customers will confirm this).



Source: https://habr.com/ru/post/422973/

