How we built the “Fault Tree”

Hello. I have worked for a fairly long time at a company that builds corporate information systems. In this article I want to share a partly negative experience; perhaps someone will find it useful to learn from the “rakes” that other people have stepped on.
One of the more interesting tasks our team of design engineers faced was building a single “fault tree” for a large corporate equipment-monitoring system.

Problem statement

The information system into which we embedded this classifier centralizes information about failures on a very wide range of equipment, and also collects data on various problem situations from completely dissimilar systems, databases and devices. Naturally, the primary alarm messages in such an architecture are extremely diverse. For example, three different external systems sent us a fault called “break”, but in one case it meant a break of the carrying cable in the contact network suspension, in another an interruption of the power supply, and in the third a loss of communication between subscribers. This did not suit us, because we were required to have a clear classification for later use in reporting and analytics.
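The “break” example can be pictured as a lookup keyed by both the sending system and the raw text. This is only a minimal sketch: the source identifiers (`contact_network_monitor` and the others) and the canonical labels are hypothetical names chosen for illustration, not the ones used in the real system.

```python
# Hypothetical mapping: (source system, raw fault text) -> canonical fault class.
# All identifiers and labels here are assumptions made for illustration.
CANONICAL_BY_SOURCE = {
    ("contact_network_monitor", "break"): "Break of the carrying cable",
    ("power_supply_monitor", "break"): "Power supply interruption",
    ("telecom_monitor", "break"): "Loss of communication between subscribers",
}


def classify(source: str, raw_name: str) -> str:
    """Resolve a raw fault name in the context of the system that sent it."""
    key = (source, raw_name.strip().lower())
    # Anything we cannot resolve yet is parked in a catch-all group.
    return CANONICAL_BY_SOURCE.get(key, "Others")


print(classify("power_supply_monitor", " Break "))
print(classify("unknown_system", "break"))
```

The point of the sketch is that the raw text alone is not a usable key: the same word only becomes unambiguous together with its origin.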

Whenever our fault handler encountered a new type of incoming event, it simply added it to the directory. By the time we started systematizing, more than 1,000 message types alone had accumulated.

We set ourselves the following goals:

- remove synonyms, that is, combine records that have exactly the same meaning but a different spelling;
- develop a unified hierarchical classification structure into which we could “put” faults of any nature;
- simplify the further development (refinement) of this classification by defining clear and consistent principles for its construction.
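For the first goal, a cheap first pass is to group names that differ only in case, punctuation or spacing. The sketch below assumes nothing about the real directory; the sample names are made up, and true synonyms with different wording would still need a manually maintained mapping.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def group_spelling_variants(names):
    """Return only the groups in which several spellings share one normalized form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Made-up sample entries, for illustration only.
sample = ["Power failure", "power  failure.", "POWER-FAILURE", "Cable break"]
print(group_spelling_variants(sample))
```

Each resulting group is a candidate for merging into a single directory record, to be confirmed by a person who knows the subject area.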

We wanted our classification to have the following characteristics:

- general applicability, so that it could be reused in our other projects and products;
- simplicity and intuitive clarity, making it easy to extend the structure and find the “right place” for new faults;
- persuasiveness, meaning the ability to prove to the current customer and to potential automation customers that we were right. The point is that our automation system covered the activities of several large organizational units, each of which had adopted its own approach to this issue.

Progress and our mistakes

Shortly after the work began, it became clear that developing the classification principle was the key task of the whole effort. We could not take any of the classifications coming from external systems as a basis, for the following reasons:
- a narrow focus, caused by the specific problems each system was built to solve;
- in many cases, no hierarchy of problems at all (flat fault lists not organized into tree structures);
- tangled wording, with causes and consequences mixed together in some entries.
We built the first version by grouping faults by the infrastructure objects on which they manifested themselves. This was in fact the easiest way, since it amounted to simply merging the separate “foreign” fault lists on top of a single (our own) infrastructure model.

In general, it turned out like this:

…. Around 1,600 lines, of which about 600 could not be tied to specific objects. Not every problem had a clear object binding, and not every object mentioned had been entered into our resource base. Although this approach untangled the situation a little, it did not let us introduce a common hierarchy, identify synonyms, or reduce the total number of entries, which was one of our goals.

Later on, the “applicability” of faults to objects did remain in our system, but as a separate reference directory, independent of the general fault hierarchy.

Result

So, at some point it became clear that we could not create a single structure either on the basis of the previously deployed databases and systems or on the basis of the regulatory documents adopted by the organization.

As a result, we worked out the following principles:

- at the root of the tree being created, separate violations (deviations from norms) from manifestations of natural processes classified by the natural sciences;
- for natural phenomena and processes, preserve the classification established by modern science;
- separate consequences from primary manifestations, and assign wording that contains both a phenomenon and its consequence to the branch the consequence belongs to (for example, “Blackout” belongs to power failures, while “Data loss as a result of blackout” belongs to violations in information systems);
- place everything that cannot yet be classified into the “Others” group, and organize systematic work on “parsing” this group according to the classification principles adopted above;
- when determining the place of each new record in the overall structure, be guided by a single question: “Which manifestation already present in the tree is this a particular case of?”, and therefore look for the record's place starting from the root of the tree;
- add missing “generalizations” to the tree ourselves (when no such original alarm message exists).
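The placement rules above can be sketched as a small tree structure. This is an illustration under assumptions, not the structure we actually deployed: the class `FaultNode`, the functions `place` and `find_path`, the root name “Faults”, and the record “Unrecognized alarm 17” are all hypothetical; only “Blackout”, “Data loss as a result of blackout” and the “Others” group come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FaultNode:
    name: str
    children: Dict[str, "FaultNode"] = field(default_factory=dict)

    def child(self, name: str) -> "FaultNode":
        """Return the child with this name, creating it if it is missing."""
        if name not in self.children:
            self.children[name] = FaultNode(name)
        return self.children[name]


def place(root: FaultNode, path: Optional[List[str]], fault: str) -> FaultNode:
    """Attach a fault under its chain of generalizations, starting from the root.

    Missing generalizations are added along the way. A fault whose place
    is not yet known goes into the "Others" group.
    """
    node = root
    for generalization in (path or ["Others"]):
        node = node.child(generalization)
    return node.child(fault)


def find_path(node: FaultNode, fault: str, trail: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Depth-first search returning the chain of names from the root to the fault."""
    trail = trail + (node.name,)
    if node.name == fault:
        return list(trail)
    for child in node.children.values():
        found = find_path(child, fault, trail)
        if found is not None:
            return found
    return None


root = FaultNode("Faults")
place(root, ["Violations", "Power failures"], "Blackout")
place(root, ["Violations", "Violations in information systems"],
      "Data loss as a result of blackout")
place(root, None, "Unrecognized alarm 17")  # not classified yet

print(find_path(root, "Data loss as a result of blackout"))
```

Keeping “Others” as an ordinary branch means the backlog of unclassified records stays visible in the same structure and can be worked through record by record.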

Acting this way, we arrived at roughly the following set of first-level branches of the tree:

What is the result?

Unfortunately, this work was never completed, and the result we stopped at was extremely “raw”.
I believe the reasons for this failure were as follows:
- this work should have been organized and continued by the owner of the infrastructure, but there simply were no experts there ready to take it on;
- the experts “on the ground” were quite comfortable with the names and classifications familiar to them, and our attempts at generalizing and distinguishing subgroups met with their resistance;
- the global analytical reporting for which this work was carried out was never launched.
In general, the customer was not ready for such changes, and we did not have enough administrative leverage to influence its employees.

Of course, one can say that the time was not wasted. We gained considerable experience in conducting this kind of work, part of which is captured in the principles described above. Personally, I concluded that it is important to split such projects into small stages, to keep showing intermediate results to the customer, and to secure the customer's active support for the changes.

Why did it turn out this way after all? Why was an intermediate result that even we considered far from perfect accepted?
As it turned out during implementation, users are basically ready to accept (and forgive us for) any classification on one simple condition: add a text search to the form!
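What such a text search does can be sketched with a tiny inverted index over fault names. This is a minimal illustration, not the search we actually shipped; the functions `tokenize`, `build_index` and `search` are hypothetical, and the sample names are taken from the examples earlier in the article.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(fault_names):
    """Map each token to the positions of the fault names that contain it."""
    index = defaultdict(set)
    for position, name in enumerate(fault_names):
        for token in tokenize(name):
            index[token].add(position)
    return index


def search(index, fault_names, query):
    """Return the fault names that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return [fault_names[i] for i in sorted(matches)]


names = [
    "Blackout",
    "Data loss as a result of blackout",
    "Break of the carrying cable",
]
index = build_index(names)
print(search(index, names, "blackout"))
```

The index is rebuilt from whatever records exist, so it needs no agreement on where a record “belongs”; that is exactly why users found it more forgiving than the hierarchy.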

Classification is a product of systematizing experience. Naturally, each person, guided by unique personal experience, sees it in their own way. For example, in an email client some people (myself included) set up an elaborate system for sorting incoming mail, while others do not sort mail at all, keep everything in one folder, and still find their way around perfectly. And they find the letter I need quickly. Maybe these people have Yandex in their heads?

Besides, any predefined classification can be 100% accurate only after it has been revised to take into account the latest data received by the system. In other words, a classification requires constant upkeep, while the user wants to use the system, not work on it. Search relies on indexing, and it always works effectively on the current data. So is classification necessary at all?

Source: https://habr.com/ru/post/246765/
