A few tips to automate data center monitoring. Part 1

Monitoring data center infrastructure is not an easy task, and automation is often used to simplify it. It is convenient to have every notification from the monitoring system show up on your screen. We have written before that automating as much as possible is no bad thing. It is a rather complicated problem, but a solvable one. Why solve it at all: automating the discovery of new devices, connections, and software, and creating scenarios for how the system responds to triggers?

Because people are lazy, and in many cases automation simply works better. But there are pitfalls. What could go wrong, it would seem, once such a system is in place? The main problems appear to be solved, and no failure will go unnoticed. In fact, some important issues often remain unresolved, and they are very common. This article discusses two of them.

Problem 1: error notification is not everything

Strategy is finding broccoli in the store and buying it. Tactics is getting your children to eat it once it is cooked.
Thomas LaRock, SolarWinds
Before delving into automation, whether that means automatic problem detection, report delivery, or an action scenario for an unforeseen situation, one critical thing needs attention. This is the so-called DPR cycle, which stands for Detection, Prevention, Response. In other words: the procedure for detecting a problem, preventing its occurrence, and the response of data center staff when a problem does happen.
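
One way to keep all three stages in view is to track them per incident. The sketch below is a minimal illustration in Python and is not tied to any particular monitoring product; the `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical record that follows one problem through the DPR cycle."""
    summary: str
    detected: bool = False             # Detection: an alert fired
    response: Optional[str] = None     # Response: what staff did about it
    prevention: Optional[str] = None   # Prevention: what stops a repeat

    def open_stages(self) -> List[str]:
        """Return the DPR stages that are still unfinished."""
        stages = []
        if not self.detected:
            stages.append("detection")
        if self.response is None:
            stages.append("response")
        if self.prevention is None:
            stages.append("prevention")
        return stages

    def is_closed(self) -> bool:
        # An incident counts as closed only when all three stages are done.
        return not self.open_stages()


incident = Incident("disk full on backup host", detected=True)
incident.response = "old snapshots removed by on-call engineer"
print(incident.open_stages())  # → ['prevention']
```

The point of the design is that a fired alert plus a quick fix does not close the incident; the prevention field has to be filled in as well.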

Now let us dwell on errors and the messages about them. Say the support team has received an automatic warning from the system. Great. Now we need to understand why the error occurred and find a way to prevent it from recurring.

When building an automatic error notification service, keep in mind that the notification is only the beginning. After it comes the longer and harder work of analyzing the situation to find the cause. Then you need to create additional test modules that detect the situation that has already occurred. You may be sure it will never happen again, but anything can happen.
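
The idea of additional test modules can be sketched as a small registry of checks, one per incident that has already happened. This is a minimal Python illustration under stated assumptions: the registry, the incident names, the thresholds, and the metric keys (`disk_free_pct`, `cert_days_left`) are all hypothetical, and real metrics would come from your own monitoring system.

```python
from typing import Callable, Dict, List

# A check takes a snapshot of metrics and returns True when the
# previously seen problem is present again.
Check = Callable[[Dict[str, float]], bool]

REGRESSION_CHECKS: Dict[str, Check] = {}


def regression_check(name: str) -> Callable[[Check], Check]:
    """Register a check written after a past incident (hypothetical helper)."""
    def decorator(func: Check) -> Check:
        REGRESSION_CHECKS[name] = func
        return func
    return decorator


@regression_check("backup disk filled up")
def disk_almost_full(metrics: Dict[str, float]) -> bool:
    # Assumed metric key; missing data is treated as "no problem seen".
    return metrics.get("disk_free_pct", 100.0) < 10.0


@regression_check("TLS certificate expired")
def cert_expiring(metrics: Dict[str, float]) -> bool:
    return metrics.get("cert_days_left", 365.0) < 14.0


def recurring_problems(metrics: Dict[str, float]) -> List[str]:
    """Return the names of known incidents that are showing up again."""
    return [name for name, check in REGRESSION_CHECKS.items() if check(metrics)]


print(recurring_problems({"disk_free_pct": 4.0, "cert_days_left": 90.0}))
# → ['backup disk filled up']
```

Each resolved incident adds one small function, so the list of checks grows into a record of everything that has gone wrong before.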

An automatic reaction to a warning lets you relax a little, because automation takes care of the immediate response. But engineers still have to understand why the problem arose, and automation is often incapable of that.

Problem 2: deploying a monitoring automation system

Before introducing an automation system, you need a plan for what the system should be able to do. Think it through carefully so that problems do not surface later. The plan should cover the following:

A set of test machines. These can be purely "laboratory" servers on which the automation system is tried out, or production machines that for one reason or another have lower priority than the rest;
There is no need to push a machine into a genuinely critical state to rehearse a scenario. For example, to test notifications about critical server load, you do not have to drive resource usage to 90%; a lower level will suffice;
If the system supports logging, enable it. Then, if a problem occurs, you will be able to see exactly what happened and why. The event log should be as detailed as possible;
In some cases, problem notifications should not be sent by e-mail, because this adds delays. Imagine that 740 messages have arrived in the mailbox and you now have to sift through them, opening each notification in turn. It is better to store notifications locally and show them on screen at the same time;
Discuss the test results of the notification system with the support team; its members may well have useful suggestions;
Once the system has been tested in a trial setup, put it into operation gradually. Do not deploy it everywhere at once, but in stages. First, enable it for 10-20 systems and evaluate the results. Then extend it to another 50 machines and check again. Phased activation of the notification automation system helps avoid the large-scale errors that can occur when it is deployed to 100% of the machines at once. In that case, if a problem arises, it can affect the entire data center.
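
The staged rollout from the last item can be sketched as follows. This is an illustrative Python sketch, not a deployment tool: `rollout_stages`, the host names, the inventory size, and the stage sizes (20, then 50, then the rest) are assumptions based on the numbers above, and the evaluation between stages is left to the operator.

```python
from typing import Iterator, List, Sequence


def rollout_stages(hosts: Sequence[str], stage_sizes: Sequence[int]) -> Iterator[List[str]]:
    """Yield hosts in batches: one batch per stage size, then the remainder."""
    start = 0
    for size in stage_sizes:
        batch = list(hosts[start:start + size])
        if not batch:
            return  # ran out of hosts before using every stage size
        yield batch
        start += size
    remainder = list(hosts[start:])
    if remainder:
        yield remainder


# Hypothetical inventory of 200 machines.
hosts = [f"srv-{n:03d}" for n in range(200)]

for stage, batch in enumerate(rollout_stages(hosts, [20, 50]), start=1):
    print(f"stage {stage}: enable notifications on {len(batch)} hosts")
    # Evaluate the results here before continuing to the next stage.
```

With these assumed numbers the loop produces three stages of 20, 50, and 130 hosts, so a mistake found in the first stage touches a tenth of the machines at most.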

If you follow these tips, you can spot the shortcomings of the automation system and work out how to correct them before a major failure occurs. To choose tools that are actually relevant when implementing the automation system, keep discussing the next stages of work with the team. What do the engineers complain about most? That is what should be addressed first.

If everything works out, you will spare yourself and the team the recurring situations that eat up time, of which there is never enough. What is described here is only a small part of the work involved in implementing an automated data center monitoring system. Part of it has already been covered; we plan to publish the rest in the near future.

Source: https://habr.com/ru/post/317632/
