
ITSM literacy program: 7 ways to diagnose the causes of IT incidents and problems

Translation of an interesting article by Stuart Rance that gives an overview of several approaches and techniques for finding the causes of incidents and problems. The overview is brief, but even this level of detail is enough to spark interest in the subject.

Posted by Stuart Rance
Posted on 10/31/2017 in the ITSM section of the SysAid blog
Original article: 7 Ways to Diagnose IT Incidents and Problems

Support staff and other IT staff need to be trained in techniques for diagnosing incidents and problems, and supported while they apply them. Technical knowledge and skill in ITSM processes are not enough for effective diagnosis without these techniques.

Diagnose IT incidents and problems.


Every IT organization has processes for managing incidents and problems. They are often based on ideas from ITIL, the most widely used description of best practices for IT service management. According to ITIL, an incident is “an unplanned interruption of an IT service or a deterioration in its quality ...”, and a problem is “any reason causing one or more incidents ...”. The goal of incident management is to restore the service to its planned state, while problem management helps reduce the impact of future incidents.
Incident and problem management processes define the steps employees follow to plan and carry out a resolution. Almost always, one of these steps is called “Investigation and Diagnosis” (or something very similar), and this is where the magic of finding the cause happens.

For people whose job is to fix things when something goes wrong, the most important task is to identify what caused the failure and work out how to eliminate it. Of course, the process involves many other activities, such as keeping the ticket record up to date and informing the user when a solution is available, but most of the time is spent on “Investigation and Diagnosis”.

When we train IT support staff and other IT staff, we often send them on technical courses so that they understand the technologies they work with. Then we send them on ITIL courses (or courses on other industry best practices) so that they understand how the processes work and how they fit in with the rest of IT. But we very rarely teach people how to investigate and diagnose incidents and problems. Often we do not even provide a mentor who could pass on the skills of finding the causes of faults. We assume that people already know how to do it. The regrettable fact is that inexperienced staff often have no idea how to approach investigation and diagnosis, and we have practically no idea what they actually do.

So, if you diagnose incidents and problems yourself, or manage people who do, read on: I will describe several approaches that can help. Study them so that you can use them when needed. I cover the most useful practices, but the list is not exhaustive.

Approaches for Diagnosing Incidents and Problems


Some of the approaches described cover only diagnosis, while others address a wider range of tasks. Understanding the characteristics of each will help you decide which approach best suits a particular situation.

1. Approach by Richard Feynman


The famous physicist Richard Feynman proposed a process for solving physics problems, which looks like this:

  1. Write down the problem
  2. Think very hard
  3. Write down the answer

This method is beautiful in its simplicity, but it may not work for those who are not smart enough to win a Nobel Prize. I am sure it can be used if you are VERY smart, or if you are working on a simple problem and have access to all the knowledge and information you might need. It is best combined with the other approaches discussed below, but thinking hard before drawing conclusions is always good practice.

2. Analysis of the history of observations


This way of investigating an incident or problem is so simple that it hardly seems worth describing. You place everything that happened to the object under analysis on a timeline and examine the resulting list. It is important that every record, regardless of its source, carries the date and time at which the event occurred, and that the records are sorted by that timestamp. Your timeline may contain data from system logs, emails, records in the user access database, and many other sources. This approach is surprisingly effective for building an overall picture of what happened.


Figure 1 - An example of the analysis of the history of observations

I almost always start an investigation with an analysis of the history of observations. It often shows exactly what happened, and it also gathers the information needed to apply more sophisticated approaches if they turn out to be necessary.
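The merging and sorting step described above can be sketched in a few lines of Python. This is only a minimal illustration: the `Event` type, the `build_timeline` helper, and all of the sample records are hypothetical and are not taken from the original article.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """One timestamped record; the fields are an assumption for this sketch."""
    timestamp: datetime
    source: str
    message: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from any number of sources into one list sorted by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Hypothetical sample data; real records would come from logs, emails, etc.
system_log = [
    Event(datetime(2017, 10, 31, 9, 15), "system log", "Disk usage at 95%"),
    Event(datetime(2017, 10, 31, 9, 42), "system log", "Mail service stopped"),
]
change_records = [
    Event(datetime(2017, 10, 31, 9, 0), "change record", "Log rotation disabled"),
]
user_reports = [
    Event(datetime(2017, 10, 31, 9, 50), "service desk", "User cannot send email"),
]

timeline = build_timeline(system_log, change_records, user_reports)
for event in timeline:
    print(f"{event.timestamp:%Y-%m-%d %H:%M}  [{event.source}]  {event.message}")
```

Keeping the source name on every record makes it easy to see which system reported what once the sources are interleaved.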

3. Problem solving by the Kepner-Tregoe method


I sincerely believe that this approach is extremely effective, but because I use this proprietary approach in training under a licensing agreement, I am obliged to declare my interest in it.

This is a structured approach to problem solving in which the problem is defined along several dimensions (what, where, when, and to what extent) and compared with things that could have failed but did not. Examining the differences between what failed and what did not narrows down the possible causes.


Figure 2 - A simplified example of using Kepner-Tregoe problem solving
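As a rough illustration of comparing what failed with what did not, the sketch below tests candidate causes against a problem specification. It is not the proprietary Kepner-Tregoe method itself: the data layout, the `is_consistent` helper, and every fact and cause listed are assumptions made up for this example.

```python
# Hypothetical problem specification: for each dimension, what IS affected
# and what could have been affected but IS NOT.
problem_spec = {
    "what": {"is": "email", "is_not": "web access"},
    "where": {"is": "London office", "is_not": "Paris office"},
    "when": {"is": "since Monday", "is_not": "before Monday"},
    "extent": {"is": "all users", "is_not": "only some users"},
}

# Hypothetical candidate causes with the effects each one would produce.
candidate_causes = {
    "Internet link down": {
        "what": {"email", "web access"},
        "where": {"London office"},
        "when": {"since Monday"},
        "extent": {"all users"},
    },
    "Central mail server crashed": {
        "what": {"email"},
        "where": {"London office", "Paris office"},
        "when": {"since Monday"},
        "extent": {"all users"},
    },
    "London mail relay misconfigured": {
        "what": {"email"},
        "where": {"London office"},
        "when": {"since Monday"},
        "extent": {"all users"},
    },
}


def is_consistent(predicted: dict[str, set[str]], spec: dict[str, dict[str, str]]) -> bool:
    """A cause fits if it explains every IS fact and predicts no IS NOT fact."""
    return all(
        facts["is"] in predicted.get(dimension, set())
        and facts["is_not"] not in predicted.get(dimension, set())
        for dimension, facts in spec.items()
    )


plausible = [
    name for name, predicted in candidate_causes.items()
    if is_consistent(predicted, problem_spec)
]
print(plausible)
```

In this made-up data, a failed internet link would also break web access, and a crashed central server would also affect Paris, so only the London relay remains as a candidate worth verifying.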

4. Ishikawa diagram, or fishbone diagram


The Ishikawa diagram is a way to work through the potential causes of a problem step by step. Causes are grouped into categories, which helps you understand and visualize how they relate to one another. You can create such a diagram during diagnosis to make it easier to identify all potential causes of a problem. Diagrams can also be prepared in advance as part of the product documentation, so that they are immediately available when an issue arises.


Figure 3 - A simplified example of an Ishikawa diagram for an email service
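The structure of such a diagram can be sketched as categories that each hold a list of potential causes. The example below is a hypothetical one for an email service; the categories, the causes, and the helper names `render` and `open_causes` are illustrative assumptions and do not reproduce the figure from the original article.

```python
# Hypothetical fishbone for an email service: effect plus causes by category.
fishbone = {
    "effect": "Users cannot send email",
    "categories": {
        "Network": ["DNS failure", "Firewall rule blocks SMTP"],
        "Server": ["Disk full", "Mail service stopped"],
        "Client": ["Wrong account settings", "Outdated mail client"],
        "People and process": ["Unannounced change", "Certificate not renewed"],
    },
}


def render(diagram: dict) -> str:
    """Render the diagram as an indented text outline."""
    lines = [f"Effect: {diagram['effect']}"]
    for category, causes in diagram["categories"].items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


def open_causes(diagram: dict, ruled_out: set[str]) -> list[tuple[str, str]]:
    """Return (category, cause) pairs that have not been ruled out yet."""
    return [
        (category, cause)
        for category, causes in diagram["categories"].items()
        for cause in causes
        if cause not in ruled_out
    ]


print(render(fishbone))
remaining = open_causes(fishbone, ruled_out={"DNS failure", "Disk full"})
print(f"{len(remaining)} potential causes still to check")
```

Recording which causes have been ruled out keeps the remaining candidates visible as the diagnosis progresses.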

5. Knowledge-Centered Support


This is primarily a methodology for collecting and managing the information that IT staff and Service Desk employees need. If the right information is available to the person who needs it at the moment they need it, they can understand what happened sooner and resolve incidents and problems faster. And people who have access to the right knowledge are much more likely to be able to use Richard Feynman’s problem-solving method!

6. “Anthill” (Swarming)


This is a collaborative approach that differs from classical incident management not only in the diagnostic phase but in many other respects as well. There is no escalation to higher levels of support. Instead of being passed to one specific person who might help, the issue is taken on by the “swarm”: many people from different parts of the organization, with a wide range of relevant knowledge and skills, who work together to resolve it. A swarm can also use some of the approaches described in this blog, but its key feature is collaboration between many people with diverse skills, which leads to faster and more accurate diagnosis and resolution of incidents and problems.

You can read more about swarming in this blog by John Hall.

7. Standard + Case


This is another approach that replaces many familiar aspects of incident management. It was developed by Rob England and is described in this article and in other publications that can be found by searching for the name of the method. The main idea is that typical activities should be managed through well-defined processes, while rarer and more complex activities require case management, using techniques developed in fields such as health care, social services, law, and law enforcement. This makes the handling of typical incidents highly efficient while leaving room for a flexible approach to complex ones.
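The split between typical and complex work can be sketched as a simple routing decision. This is a deliberately simplified illustration and not Rob England's definition of the method: the `STANDARD_PROCEDURES` table, the `route` function, and the incident types are all hypothetical.

```python
# Hypothetical catalog of incident types that have a well-defined procedure.
STANDARD_PROCEDURES = {
    "password reset": "Run the password reset procedure",
    "mailbox full": "Run the mailbox cleanup procedure",
}


def route(incident_type: str) -> tuple[str, str]:
    """Send known incident types to a standard procedure, the rest to a case."""
    procedure = STANDARD_PROCEDURES.get(incident_type.strip().lower())
    if procedure is not None:
        return ("standard", procedure)
    # No predefined procedure exists: handle it as a case with an owner.
    return ("case", "Assign a case owner to investigate")


for incident in ["Password reset", "Intermittent data corruption"]:
    track, action = route(incident)
    print(f"{incident}: {track} -> {action}")
```

In practice, a case that keeps recurring would be a natural candidate for a new entry in the standard catalog.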

Conclusion


It is necessary not only to train support staff and other IT personnel in how to diagnose incidents and problems, but also to support them in applying what they have learned. Diagnosis will not be effective merely because the people doing it have enough technical knowledge and skill to work within ITSM processes.

There are many techniques and methodologies you can use, and your task is to get to know the range of different approaches. Some of them will simply not be applicable to your environment, but the more approaches you know, the more likely you are to choose the best one when it is needed.

Source: https://habr.com/ru/post/353066/

