Perhaps you will say that I understand nothing about Zabbix or Nagios, that this is why I throw around such loud claims, and you will dock my karma for that alone. Before you do, please answer one question: what problem is a network administrator actually solving on a network of 2,000 L2 devices spread across the area of a city?
It would be great to get answers in the comments. Below is how I see the situation.
These reflections were prompted by reading J. Patton's book “User Stories, The Art of Agile Software Development.” Before that I was simply annoyed and had only a vague idea of how this ought to work. Now the story has started to take shape, so let's begin with user stories told through the eyes of several participants in network maintenance and operation.
Network Administrator:
When does my job begin? When I receive a report that a communication node has failed (somehow I never got into the habit of admiring graphs of how the network is doing; perhaps I'm defective).
Why do I go in there? I need to understand which node in my three-level network has died; whether there are downstream devices behind the point of failure does not matter to me.
What do I see? A pile of alerts, one of them with elevated priority: aha, an aggregation node ...
Degraded communication channels are a separate story. To everyone who knows the flood of failures that follows a thunderstorm: I shake your hand. There the point of failure is unknown and has to be identified from indirect signs: a loop or storm control has kicked in, CPU load has risen, or an interface has hit peak values.
In general, I don't need pretty graphs. I need a working switch tree and alerts that arrive precisely, and I have no wish to be told about every switch on a branch that has dropped off.
What do I want to see? Only the node up to which everything is fine and after which everything is bad.
What will I do with this information? I will call the city electric networks, then send out a field crew. The sooner I localize the problem, the sooner corrective action can be taken.
Does Zabbix help me with this? Of course it does, but the redundant information and the need to describe the tree by hand in the Zabbix configuration are tiring.
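To make the administrator's wish concrete, here is a minimal sketch of filtering a pile of "device down" alerts to only the topmost failed nodes in a switch tree. Everything in it is an assumption for illustration: the device names, the `parent` map describing the tree, and the `root_failures` helper are hypothetical and are not part of Zabbix or any other tool.

```python
from typing import Dict, List, Optional, Set


def root_failures(parent: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the failed nodes that have no failed ancestor.

    `parent` maps each device to its upstream device (None for the root).
    `down` is the set of devices currently reported as unreachable.
    Assumes the topology is a tree (no cycles).
    """
    roots = []
    for node in down:
        ancestor = parent.get(node)
        # Walk upstream until we hit a failed ancestor or run out of tree.
        while ancestor is not None and ancestor not in down:
            ancestor = parent.get(ancestor)
        if ancestor is None:
            # Nothing upstream has failed, so this node is a point of failure.
            roots.append(node)
    return sorted(roots)


# Hypothetical three-level topology: core -> aggregation -> access.
parent = {
    "core-1": None,
    "agg-1": "core-1",
    "agg-2": "core-1",
    "acc-11": "agg-1",
    "acc-12": "agg-1",
    "acc-21": "agg-2",
}
down = {"agg-1", "acc-11", "acc-12", "acc-21"}

print(root_failures(parent, down))  # ['acc-21', 'agg-1']
```

Four alerts collapse into two points of failure: the aggregation node `agg-1`, which hides its two access switches, and the independent access switch `acc-21`.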
Technical support by phone:
My work begins when a call from a subscriber comes in. I have access to the monitoring system and have to look through events in Zabbix to understand what the client is calling about and on which side of the subscriber cable the problem lies.
Does Zabbix help me? Of course it does, but what I need is a list of equipment failures. And one more thing: have you seen how quickly that list changes during a big flood of failures ...?
It would be easier if, when a call comes in, I already had all the information about the subscriber's connection point, accurate down to the port, and did not waste time searching.
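What the phone operator asks for could look roughly like the sketch below: one lookup that returns the subscriber's switch and port and says on which side of the cable to look. The contract number, device names, and the `ConnectionPoint` and `locate_fault` names are all hypothetical; a real system would take this data from billing and from monitoring.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class ConnectionPoint:
    switch: str  # access switch the subscriber is plugged into
    port: int    # port number on that switch


def locate_fault(
    subscriber_id: str,
    connections: Dict[str, ConnectionPoint],
    parent: Dict[str, Optional[str]],
    down: Set[str],
) -> str:
    """Tell the operator whether the fault is on the operator side or the last mile."""
    point = connections[subscriber_id]
    highest_failed = None
    node: Optional[str] = point.switch
    # Walk from the access switch up to the root, remembering the
    # failed device closest to the core.
    while node is not None:
        if node in down:
            highest_failed = node
        node = parent.get(node)
    where = f"{point.switch} port {point.port}"
    if highest_failed is not None:
        return f"operator side: {highest_failed} is down (subscriber on {where})"
    return f"last mile: check {where} and the subscriber's equipment"


parent = {"core-1": None, "agg-1": "core-1", "acc-11": "agg-1", "acc-12": "agg-1"}
connections = {"contract-1001": ConnectionPoint("acc-11", 7)}

print(locate_fault("contract-1001", connections, parent, down={"agg-1", "acc-11"}))
print(locate_fault("contract-1001", connections, parent, down=set()))
```

The first call reports the failed aggregation node, the second points at the last mile, and in both cases the operator gets the port without searching for it.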
Field technical support:
My work begins in two cases: the administrator has identified the point of failure but cannot eliminate the cause remotely, or technical support by phone has found a "last mile" fault.
Does Zabbix help me? Yes: from the number of failed nodes I can tell that I need to fill up the car.
It would be easier to have live information about the effect of my work on the client side or at the failed communication node, for example the error counters on the device ports before and after my intervention, preferably without carrying a laptop with a console cable and without running back and forth between the subscriber's computer and the switch.
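The "before and after" comparison the field engineer describes can be sketched as a diff of two counter snapshots. The counter names and values below are made up for illustration, how the snapshots are collected from the switch is left out entirely, and counter wrap-around is not handled.

```python
from typing import Dict


def counter_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the port counters that changed between two snapshots."""
    delta = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta


# Hypothetical snapshots of one port, taken before and after re-crimping a cable.
before = {"crc_errors": 1520, "input_errors": 1533, "collisions": 0}
after = {"crc_errors": 1520, "input_errors": 1534, "collisions": 0}

print(counter_delta(before, after))  # {'input_errors': 1}
```

An empty or near-empty result after the fix tells the engineer that errors have stopped growing, without a console cable or a trip back to the switch.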
Subscriber:
There has been another thunderstorm, or Zelenstroy was cutting trees, or the electric networks switched off half the city. I want to know what happened to my service and when it will be restored. I want to understand whether the problem is my router or the operator again, and if you miss the deadline, I will go to your competitors ...
These stories show that speed of response plays an important role in services. Today's user does not want to wait or dig into the operator's problems; they want a working service.
In the stories above, universal monitoring can only point the direction for further work; it does not help solve the problems themselves.
This led me to the conclusion that universal monitoring sucks!
What is your opinion on this matter?