NOC: Introduction to Fault Management

Events and faults are an integral part of network operation. Thousands of events are recorded every second, the maintenance team is constantly busy fixing several faults, and several more probably exist somewhere in the network but have not yet been detected or diagnosed. Detecting and diagnosing faults quickly is a very difficult task that can be solved only by a combination of organizational and technical measures, and automated tools for detecting and handling faults play an important part in it.

There are many monitoring systems that actively test the network and network services via ICMP and SNMP. The quick and wrong answer suggests itself: set up a magic monitoring system and complete happiness will follow. How deceptive this is becomes clear only with time. First, it turns out that faults are detected only in the services that have been put under monitoring. You are doing well if you have managed to cover at least the basic services; the rest, alas, will be added as a result of bitter experience and at the cost of a late reaction. A little later the mysticism begins. Something is clearly not working properly and there are complaints, but the monitoring system says that everything is in order. What is the reason?

The reason may be that services are polled at discrete moments, at a certain interval, while the service itself is provided continuously. Problems also come and go continuously. If we are lucky, we notice the problem, but while we are trying to figure it out, it goes away by itself. The result is strange: the client complains, and we seem to see nothing. The polling interval could be shortened, but that increases the load on the monitored services, and, in the best traditions, the system designed to detect faults begins to generate them itself.
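The trade-off between polling interval and missed problems can be estimated with simple arithmetic. The sketch below is only an illustration under stated assumptions: probes are instantaneous, they fire every `interval_s` seconds, and the outage starts at a uniformly random moment between two probes. The function names are hypothetical and do not belong to any particular monitoring system.

```python
def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance that at least one probe lands inside an outage of the given length.

    Assumes instantaneous probes every interval_s seconds and an outage that
    starts at a uniformly random moment between two probes.
    """
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    return min(1.0, outage_s / interval_s)


def probes_per_hour(interval_s: float) -> float:
    """Load added by polling: probes sent to one service per hour."""
    return 3600.0 / interval_s


if __name__ == "__main__":
    # A 30-second problem observed with progressively shorter polling intervals.
    for interval in (300, 60, 10):
        p = detection_probability(outage_s=30, interval_s=interval)
        print(f"interval={interval:>3}s  P(detect)={p:.2f}  "
              f"probes/hour={probes_per_hour(interval):.0f}")
```

Under these assumptions, a short problem is reliably caught only when the interval is no longer than the problem itself, and the number of probes grows in inverse proportion to the interval.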
Somewhere along the way comes the realization that during serious faults an active monitoring system does more harm than good, because it produces too much information and makes the real problem harder to find. Of course, active monitoring systems are useful and do their job. But they are obviously not a silver bullet, and relying on them entirely is short-sighted.

This is where Fault Management systems (FMS) come to the rescue. As a first approximation, these are systems for analyzing and processing events. Somewhere in the overall stream there are probably a few events that carry information about a fault. The first task of the FMS is to filter out the excess and leave only what requires immediate attention. Usually too much information still remains, and the operators do not have time to process it. Therefore, the FMS should prioritize the detected faults so that the most significant ones stand out. Next, it turns out that events are not independent and some of them are interconnected. By correlating events, the FMS establishes connections between faults and hides redundant information.

Then the most difficult question arises: many faults of varying severity have been detected, so which of them is the cause? For example, a single dropped link may break a hundred MPLS LSPs and dozens of BGP sessions, cause a topology recalculation in the IGP and make a farm of a thousand servers unavailable. Some FMS perform root-cause analysis. As a result, the operator is shown the true cause of all the induced faults, and troubleshooting can begin immediately, without wasted time.
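One simple way to picture root-cause analysis is a dependency map between managed objects: an alarm is treated as a consequence if something it depends on is also in alarm. The sketch below is an assumption-laden illustration, not the algorithm of NOC or of any specific FMS; the `Alarm` class, the `DEPENDS_ON` map and the object names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    obj: str       # managed object the alarm was raised on (hypothetical naming)
    kind: str      # e.g. "Link Down", "BGP Session Down"
    severity: int  # higher means more urgent


# Hypothetical dependency map: object -> objects it depends on.
DEPENDS_ON = {
    "bgp:peer-10.0.0.2": ["link:xe-0/0/1"],
    "lsp:to-pe2": ["link:xe-0/0/1"],
    "server:web-17": ["link:xe-0/0/1"],
}


def root_causes(alarms, depends_on):
    """Split alarms into (root causes, consequences).

    An alarm is a consequence when any object it depends on, directly or
    transitively, also has an active alarm.
    """
    alarmed = {a.obj for a in alarms}

    def has_alarmed_upstream(obj, seen):
        for parent in depends_on.get(obj, []):
            if parent in seen:  # guards against dependency cycles
                continue
            seen.add(parent)
            if parent in alarmed or has_alarmed_upstream(parent, seen):
                return True
        return False

    roots, consequences = [], []
    for alarm in alarms:
        if has_alarmed_upstream(alarm.obj, {alarm.obj}):
            consequences.append(alarm)
        else:
            roots.append(alarm)
    # Most severe root causes first, so the operator sees them on top.
    roots.sort(key=lambda a: a.severity, reverse=True)
    return roots, consequences


if __name__ == "__main__":
    active = [
        Alarm("bgp:peer-10.0.0.2", "BGP Session Down", 3),
        Alarm("link:xe-0/0/1", "Link Down", 4),
        Alarm("server:web-17", "Host Unreachable", 2),
        Alarm("psu:sw3/1", "Power Supply Failed", 3),
    ]
    roots, hidden = root_causes(active, DEPENDS_ON)
    for a in roots:
        print("ROOT  ", a.kind, a.obj)
    for a in hidden:
        print("caused", a.kind, a.obj)
```

A real system has to build the dependency map from topology and configuration data and has to cope with events that arrive late or not at all, which is a large part of why this step is the hardest one.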

Let us define the terminology:

Event: a message reporting that something happened at some point of the network at some moment and caused a change in the state of the network
Alarm: a significant incident (fault) that leads to partial or complete degradation of a service and requires an immediate response from the operator

The main advantage of FMS over active monitoring systems is that they are for the most part passive: they put no extra load on network devices and simply listen to the logs and SNMP traps that the equipment already emits. They also have another remarkable quality: emergency situations start being tracked as soon as the flow of events begins, and there is no need to tell the system what it should look at. The result at first is simply shocking: the system finds long-dead power supplies and failed transceivers, points at flapping links and pushes you in every possible way to fix the blunders.
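As an illustration of what "listening to the logs" involves at the lowest level, the sketch below decodes the priority header that syslog messages carry on the wire. The rule that the PRI value equals facility * 8 + severity comes from the syslog RFCs (3164 and 5424); the function name and the returned dictionary layout are assumptions made for this example.

```python
import re

# Severity names in numeric order 0..7, as defined by the syslog RFCs.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

_PRI = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_syslog(line: str) -> dict:
    """Split a raw syslog line into facility, severity and message text."""
    match = _PRI.match(line)
    if not match:
        raise ValueError("no syslog PRI header")
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid PRI value
        raise ValueError("PRI value out of range")
    return {
        "facility": pri // 8,
        "severity": SEVERITIES[pri % 8],
        "message": match.group(2).strip(),
    }


if __name__ == "__main__":
    raw = "<187>%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down"
    print(parse_syslog(raw))
```

The device already sends such lines on its own, so collecting them costs it nothing extra; all the work of making sense of them happens on the FMS side.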

Technically, FMS is a real-time expert system. The heart of the system is the knowledge base, which contains both the generalized experience of others and new knowledge gained from the study of a particular network. This is the main difference between FMS and monitoring systems. The monitoring system does only what it was taught during installation, while the FMS not only uses someone else’s many years of experience in operating networks, but is also able to learn and adapt to the network itself.

Depending on the organization of the knowledge base, the FMS can be divided into 4 main categories:

Rule-based: the most straightforward type. The knowledge base consists of a set of rules that are used to draw conclusions.
Codebook: these systems operate on sets of events that can lead to a fault.
Neural networks: the knowledge base is built around a pre-trained neural network.
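To make the rule-based category more concrete, here is a minimal sketch of a rule set that turns raw log messages into classified events. The rule names, event class names and the `classify` function are hypothetical and only illustrate the idea; they are not taken from the NOC knowledge base. The message text follows the common Cisco IOS link up/down log format.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    name: str             # human-readable rule name
    pattern: re.Pattern   # regular expression applied to the raw message
    event_class: str      # class assigned when the pattern matches


# Illustrative rules only; a real knowledge base holds many more.
RULES = [
    Rule("cisco link down",
         re.compile(r"%LINK-3-UPDOWN: Interface (?P<interface>\S+), "
                    r"changed state to down"),
         "Link Down"),
    Rule("cisco link up",
         re.compile(r"%LINK-3-UPDOWN: Interface (?P<interface>\S+), "
                    r"changed state to up"),
         "Link Up"),
]


def classify(message: str, rules=RULES) -> Optional[Tuple[str, dict]]:
    """Return (event_class, extracted variables) for the first matching rule."""
    for rule in rules:
        match = rule.pattern.search(message)
        if match:
            return rule.event_class, match.groupdict()
    return None  # unknown message: a candidate for a new rule


if __name__ == "__main__":
    msg = "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down"
    print(classify(msg))
    print(classify("something the rules have never seen"))
```

The predictability of this approach is visible in the code: for any conclusion, the matching rule can be named. The price is that every unknown message stays unclassified until somebody writes a rule for it.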

Each type of system has its advantages and disadvantages. Rule-based systems tend toward a somewhat blunt, drill-sergeant straightforwardness; on the other hand, they are as predictable as possible, and you can always understand on what basis the system reached its conclusion. Codebook systems sometimes show better results because they are less rigid, but sometimes they make mistakes. Neural networks are able to learn and work better in noisy conditions or when part of the events is lost. On the other hand, it is sometimes hard to understand what flash of inspiration led the system to a particular conclusion. In general, this creates some wariness and mutual distrust between the operator and the system. In addition, like all neural networks, FMS of this type tend to suddenly become self-aware and try to take over the world.
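The reduced rigidity of the codebook approach can be illustrated with a toy decoder. In this simplified sketch, which is an assumption about how such systems work in general rather than a description of any product, each known problem has a signature (the set of event classes it normally produces), and the observed set of events is matched to the closest signature by the number of differing elements. All problem and event names are hypothetical.

```python
# Hypothetical codebook: problem -> set of event classes it normally produces.
CODEBOOK = {
    "link failure": {"Link Down", "BGP Session Down", "LSP Down"},
    "bgp peer misconfigured": {"BGP Session Down"},
    "power supply failure": {"PSU Failed"},
}


def decode(observed, codebook=CODEBOOK):
    """Return (problem, distance) for the signature closest to the observed events.

    Distance is the size of the symmetric difference: events that were expected
    but missing plus events that were seen but not expected.
    """
    observed = set(observed)

    def distance(problem):
        return len(codebook[problem] ^ observed)

    # Ties are broken by problem name to keep the result deterministic.
    best = min(codebook, key=lambda problem: (distance(problem), problem))
    return best, distance(best)


if __name__ == "__main__":
    # The "BGP Session Down" event was lost on the way to the collector.
    print(decode({"Link Down", "LSP Down"}))
    print(decode({"BGP Session Down"}))
```

Because the match only has to be the closest one, a lost event does not prevent a diagnosis. The same tolerance is the source of the occasional mistakes: an incomplete picture may happen to sit closer to the wrong signature.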

The reader may well ask a reasonable question: if such wonderful systems exist, why are they not as widespread as conventional monitoring systems? Why, especially in budget installations, does everybody want to install Nagios, Zabbix or Cacti, while FMS is usually not even remembered? The answer is extremely simple: the price. FMS has never been a cheap pleasure. An FMS is, in essence, the distilled experience of others, and experience is worth a lot. The second reason for the high price is that, besides the developers' experience, you also have to buy the experience of integrators, since the FMS will very likely need additional tuning for the specific network.
A typical FMS implementation budget contains six or more zeros, which automatically makes these systems the preserve of large networks. An attentive reader will certainly say: fine, but we have open source! The words "open source" can, of course, be repeated like a mantra, but saying them does not make anything sweeter. It is fair to say that FMS is practically not represented in open source.

As part of the integrated approach to network management described in the previous article, we did not stand aside either and implemented a full-fledged FMS as part of NOC. The main goal of the FMS project within NOC is to create, as an open-source project, a full-fledged FMS on the level of commercial systems.

The basic idea that led us to create an open-source FMS is quite simple: an FMS is experience. Network engineers are willing to use open-source products and no less willing to develop them. So what prevents them from formalizing their experience a little and sharing it with their colleagues? As a result, the system acquires several unexpected properties that bring it closer to social networks. The learning process of the system becomes distributed. Someone runs into a fault the system did not know about and teaches the system to recognize it; the new rules go into the common database and reach everyone with the next updates. Currently, the system is able to recognize about 40 different types of faults (details are available via the link). The system has been trained on networks built on Juniper, Force10, Cisco, F5, D-Link and Zyxel equipment, and equipment support is continuously expanding. According to the plans, NOC 0.7 will be released in September and will include the new FMS with a solid knowledge base. I hope that with time FMS will cease to be the preserve of large networks and will become more widespread.

The features of the FMS implementation in the NOC will be explained in more detail in the following articles.

Source: https://habr.com/ru/post/126051/
