
Monitoring in IT: how to organize the work

In this article I want to share my experience of organizing an IT monitoring system. I will consider one possible solution to the organizational problems; the technical aspects of modern monitoring systems are not discussed. I want to clarify right away that in my case the monitoring objects are exclusively IT components: servers, operating systems, databases, transactions, etc.



Every time I build an integrated IT monitoring system in my professional work, I try to follow this approach, which gives the customer a well-functioning monitoring system. The approach is universal, and I am convinced it can be applied outside the IT field as well.

How it usually happens


By the nature of my work, I often come across monitoring systems that, once introduced, gather a thick layer of dust. Those for whom the systems were built happily forget that they exist, and only the hard-working servers keep faithfully consuming the data center's electricity.
What is more remarkable, the software used in such sorry systems is the most modern and the most expensive, and there can be no complaints about technical flaws. Yet the system administrator still receives failure notifications from his own script, proven over the years, just as he always did. I can supplement the picture of this administrator with the following familiar syndromes:

What is monitoring for?


To prepare the reader for what follows, I want to give a few definitions:
A monitoring system is a system that detects the deviation of an observed parameter from a given norm and, as a result, performs a certain action.
Monitoring tasks:

I am also convinced that a modern IT monitoring system must provide more functions than that, for example: collecting and storing the values of monitored parameters, predicting possible failures, automatically discovering infrastructure components, etc. But it is the two main functions (Detection and Action) that make a system a monitoring system.
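The two main functions can be illustrated with a minimal sketch. It assumes the norm is a simple numeric range; the names `Check`, `notify`, and `disk_usage_percent` are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """One observed parameter, its norm, and the action to run on deviation."""
    name: str
    low: float    # lower bound of the norm (inclusive)
    high: float   # upper bound of the norm (inclusive)
    action: Callable[[str, float], None]

    def evaluate(self, value: float) -> bool:
        # Detection: does the value deviate from the given norm?
        deviates = not (self.low <= value <= self.high)
        # Action: performed only as a result of a detected deviation.
        if deviates:
            self.action(self.name, value)
        return deviates


# Hypothetical action: collect notifications in a list instead of sending mail.
notifications: list[str] = []


def notify(name: str, value: float) -> None:
    notifications.append(f"{name} out of norm: {value}")


disk_usage = Check(name="disk_usage_percent", low=0.0, high=90.0, action=notify)
disk_usage.evaluate(42.0)   # within the norm, no action
disk_usage.evaluate(97.5)   # deviation detected, action performed
```

Everything else (storage, forecasting, auto-discovery) would be layered on top of this pair.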
Another couple of definitions:
Customer - a person interested in receiving monitoring services.
Provider - the person providing the monitoring service.

Defining the rules of the game (the Regulations)


So who is the Customer? As a rule, it is the person responsible for the functioning of a service, a piece of software, or even just a server within the organization; he is the one held accountable for the quality of operation of these objects.
The Provider is able to set up monitoring of the necessary objects and their parameters for the Customer, so that failures are detected in time or prevented.
So, certain rules of interaction between the Customer and the Provider need to be defined. Working this out is necessary to create a comfortable environment for cooperation: the Customer is an accountable person and needs to understand how responsibility is divided between him and the monitoring Provider.
Once again: in one form or another, the rules must exist!
Regulations - a document that defines the rules of interaction and the distribution of responsibilities between the Customer and the Provider.
Here is a list of questions that the Regulations should answer:

Application


The main document of interaction between the Customer and the Provider is the Application for monitoring.
An Application is a document that formalizes all of the Customer's requirements for the tasks he intends to solve with the help of the monitoring system.
I propose that the Regulations describe all the rules for filling in, approving, testing, and putting an Application into operation, as well as the list of persons eligible to act as the Customer.
An Application is like a small contract of a fixed form between the Customer and the Provider. It is also the main means of resolving disputes between them, for example: "Did monitoring accomplish its task in the manner prescribed in the Application?"
The way an Application is drawn up depends on the document management practices used in the organization:

Application structure


So, I have already covered the two most important things in organizing a monitoring system: the Regulations are the first, the Application is the second. Here the reader may exclaim, "What, is that all? That's not enough!" I will answer: "Of course it is not enough; the rest is simply outside the scope of this article."
Now I want to add technical details and give an example of the Application structure that I have used in practice. Since what follows is technical, I invite everyone who is ready to stop reading at this point to join the discussion; I will be happy to talk over any questions that have arisen.
Returning to the structure of the Application, I can describe it as the following groups:

General:
Here I propose to define the unique identifier of the Application, its version, its status, etc.

Responsible - the person acting on the Customer's side who oversees the creation of the Application. The Responsible is a mandatory attribute of the Application: it may change, but it cannot be absent. I propose that the Regulations define the circle of persons who are allowed to initiate the creation of an Application.

Configuration Units (CU) - the list of objects to which the conditions of this Application apply. The information about a CU must be sufficient for its unambiguous identification. At the design stage of the monitoring system, I propose developing a plan of CUs, their possible types, and their required attributes.
Depending on the monitoring system architecture, the list of CUs in the Application can be compiled manually or from infrastructure discovery data (there are specialized systems that can inventory the entire IT infrastructure). The list of CUs may also be generated automatically from a condition, for example: all servers belonging to a specific subnet. This can be a link to a query against the database where information about infrastructure objects is stored.
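Generating the list from a condition can be sketched as follows. This is only an illustration: the `inventory` records and the `select_units` helper are hypothetical, and in practice the data would come from a query against the infrastructure database rather than from a literal list.

```python
import ipaddress

# Hypothetical inventory records standing in for the infrastructure database.
inventory = [
    {"name": "db01", "type": "server", "ip": "10.0.1.15"},
    {"name": "web01", "type": "server", "ip": "10.0.2.20"},
    {"name": "db02", "type": "server", "ip": "10.0.1.16"},
    {"name": "sw01", "type": "switch", "ip": "10.0.1.1"},
]


def select_units(records, unit_type, subnet):
    """Return the names of all units of the given type inside the subnet."""
    network = ipaddress.ip_network(subnet)
    return [
        record["name"]
        for record in records
        if record["type"] == unit_type
        and ipaddress.ip_address(record["ip"]) in network
    ]


# Condition: "all servers belonging to subnet 10.0.1.0/24".
units = select_units(inventory, "server", "10.0.1.0/24")
```

Storing the condition instead of a fixed list means the Application picks up new servers in the subnet without being edited.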

Conditions - the exact wording of the checks that must be performed to implement the monitoring. Conditions must be recorded carefully in the Application, because they are the input data for development. Each condition is associated with a list of CUs or with individual CU instances.
Examples of condition types:

Actions
Each condition may have corresponding actions that must be performed. Each action must have a unique name (identifier). This identifier must be included in the attributes of the message that enters the monitoring system, so that the system can identify the required action. In turn, the monitoring system must have settings that let it perform an action depending on the action's name.
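Selecting an action by its identifier can be sketched as a lookup table. The action names, the message fields (`action_id`, `text`), and the handlers below are assumptions made for illustration; a real monitoring system would hold this mapping in its own settings.

```python
from typing import Callable, Dict

# Hypothetical handlers; they record what they would do instead of doing it.
log: list[str] = []


def send_email(message: dict) -> None:
    log.append(f"email: {message['text']}")


def open_incident(message: dict) -> None:
    log.append(f"incident: {message['text']}")


# Settings of the monitoring system: action identifier -> handler.
ACTIONS: Dict[str, Callable[[dict], None]] = {
    "notify_dba_by_email": send_email,
    "open_incident": open_incident,
}


def handle(message: dict) -> bool:
    """Run the action named in the message; return False if it is unknown."""
    handler = ACTIONS.get(message.get("action_id"))
    if handler is None:
        return False
    handler(message)
    return True


handle({"action_id": "open_incident", "text": "db01: tablespace is full"})
```

Because the identifier travels with the message, the check that raised it does not need to know how the action is carried out.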
Examples of action types:

Configuration - a link to the configuration of the software that implements the conditions of the Application. It may contain the following information:

The link to the monitoring system configuration can serve as input for automating the deployment process. It also makes it possible to determine which Application particular monitoring system settings belong to.
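The groups described above can be sketched as a data structure. This is one possible shape, not a prescribed form: all field names are hypothetical, and the `validate` method checks only the two rules stated in the text (the Responsible is mandatory, and conditions refer to CUs listed in the Application).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    text: str                      # exact wording of the check
    unit_names: List[str]          # CUs this condition applies to
    action_ids: List[str] = field(default_factory=list)


@dataclass
class Application:
    app_id: str                    # General: unique identifier
    version: int                   # General: version
    status: str                    # General: status
    responsible: str               # mandatory attribute
    units: List[str]               # Configuration Units
    conditions: List[Condition]
    configuration_ref: Optional[str] = None  # link to the configuration

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means none were found."""
        problems = []
        if not self.responsible:
            problems.append("Responsible is mandatory")
        known = set(self.units)
        for condition in self.conditions:
            unknown = [u for u in condition.unit_names if u not in known]
            if unknown:
                problems.append(
                    f"condition {condition.text!r} refers to unknown units: {unknown}"
                )
        return problems


app = Application(
    app_id="APP-001",
    version=1,
    status="registration",
    responsible="",
    units=["db01"],
    conditions=[Condition("CPU load above 90% for 5 minutes", ["db01", "db02"])],
)
problems = app.validate()
```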
Here is the outline of the application:


Application life cycle


The last thing I want to draw attention to is the life cycle of the Application. I will highlight the following stages:
  1. Registration;
  2. Development;
  3. Test operation;
  4. Industrial operation.

More about each:
Registration
Here the Application is filled out in the prescribed form, the monitoring conditions are agreed upon, and the Application is transferred to the next stage, development.

Development
Everything is simple! Configurations are created in the monitoring system in accordance with the terms of the Application, which is then transferred to the next stage, test operation.
The Application may also be returned to the registration stage to clarify or redefine the monitoring conditions.

Test operation
Here we settle the delicate political question of who prepares the test environment, and carry out all the necessary activities related to testing the Application.
Depending on the problems identified during testing, the Application may be returned to the registration or development stage.
After test operation is completed successfully, the Application moves to the next stage, industrial operation.

Industrial operation
Everything is serious here: the system works, incidents are detected, notifications are sent. At the stage of industrial operation, I allow for changes to Applications of two types: minor and major.

More about these types:
Minor - changes that do not require a return to the development stage, for example, changing the list of monitoring objects or the list of notification recipients. For minor changes, I do not change the version number of the Application.
Major - changes that require a return to the development stage, for example, changing the monitoring conditions. For major changes, the version number of the Application changes, and previous versions must be decommissioned.

And lastly, an Application can be decommissioned from industrial operation or switched to maintenance mode. Maintenance mode covers the case when monitoring has to be turned off temporarily for technical work.
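The stages and the transitions between them can be sketched as a small state machine. The enum and function names are hypothetical. One assumption is made explicit here: a major change is treated as a new version of the Application that starts at the development stage, so there is no direct transition from industrial operation back to development.

```python
from enum import Enum


class Stage(Enum):
    REGISTRATION = "registration"
    DEVELOPMENT = "development"
    TEST_OPERATION = "test operation"
    INDUSTRIAL_OPERATION = "industrial operation"
    MAINTENANCE = "maintenance"
    DECOMMISSIONED = "decommissioned"


# Allowed transitions, following the stage descriptions above.
TRANSITIONS = {
    Stage.REGISTRATION: {Stage.DEVELOPMENT},
    Stage.DEVELOPMENT: {Stage.TEST_OPERATION, Stage.REGISTRATION},
    Stage.TEST_OPERATION: {
        Stage.INDUSTRIAL_OPERATION,
        Stage.DEVELOPMENT,
        Stage.REGISTRATION,
    },
    Stage.INDUSTRIAL_OPERATION: {Stage.MAINTENANCE, Stage.DECOMMISSIONED},
    Stage.MAINTENANCE: {Stage.INDUSTRIAL_OPERATION},
    Stage.DECOMMISSIONED: set(),
}


def move(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the transition is allowed, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.REGISTRATION
stage = move(stage, Stage.DEVELOPMENT)
stage = move(stage, Stage.TEST_OPERATION)
stage = move(stage, Stage.INDUSTRIAL_OPERATION)
```

Keeping the allowed transitions in one table makes it easy to compare them with what the Regulations prescribe.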

Schematically, the life cycle stages of the Application are presented in the following figure:


Summary


Summing up, I want to say that the successful operation of a monitoring system rests on three pillars: the Regulations, the Application, and the technical capabilities of the system itself.

It is good practice to create a set of Applications containing typical monitoring conditions. For such Applications it is enough to specify the monitoring objects and the notification recipients and then move them to the stage of industrial operation.

That is all I wanted to tell in this article. I am pleased to invite you to the discussion and am ready to answer questions.

Source: https://habr.com/ru/post/226639/

