
What is IT monitoring, or why admins started sleeping better



What is IT monitoring for?


So that administrators learn about problems in the infrastructure before users do. Monitoring is, in essence, a rapid-diagnostics suite: it gives timely notification of problems and precise information about where and what exactly happened.

Example: at 15:05 there is a problem with the mail. Thanks to the monitoring system, by 15:07 the admin already sees that a specific Windows service failed to start on the server, which is why Exchange is down and users will not receive mail. Without monitoring, a manager would have called him at around 5:00 pm asking where that letter from the partner was, the one he had already sent three times half an hour ago.

How was it before?


Previously, information about the entire infrastructure (servers, network devices, and so on) was simply collected. The role of the "intelligent processor" fell to the administrator: like a pilot in a cockpit, he had to take in all the instruments at a glance to understand the overall picture. Understandably, not everyone could do this.

Now much more is automated, and the systems are a bit more complex. Statuses are tied as closely as possible to business services, so that there is no monitoring information "in a vacuum".

Monitoring on behalf of the end user has also been added, where user actions are emulated: a robot periodically runs a special script, as if a user were walking through the menus and clicking things. If the robot fails at some step, a human will fail there too.
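The "robot" described above can be sketched in a few lines. This is a minimal synthetic-monitoring probe, not any particular product's agent: it emulates one user action (an HTTP request) and classifies the result; the URL, timeout, and slowness threshold are illustrative assumptions.

```python
import time
import urllib.request

SLOW_AFTER_SECONDS = 3.0  # illustrative: an action slower than this is "SLOW"

def evaluate(status_code, elapsed, slow_after=SLOW_AFTER_SECONDS):
    """Classify one emulated user action by its result and duration."""
    if status_code != 200:
        return "FAIL"
    if elapsed > slow_after:
        return "SLOW"
    return "OK"

def probe(url, timeout=5.0):
    """Emulate a user action: fetch a page and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except OSError:
        status = 0  # connection refused / timed out: treated as non-200
    return evaluate(status, time.monotonic() - start)
```

A scheduler (cron, or a loop with `time.sleep`) would call `probe` every few minutes and raise an alert on anything other than "OK".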

Plus, a configuration database is now used: information about the monitored objects is represented as a set of configuration items. Each server and each network device is an item, and all of them are stored in a centralized database. This representation later lets you integrate the monitoring system with the service desk and the asset management system, and extend the functionality further.
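To make the idea concrete, here is a toy in-memory version of such a database. The field names and the dependency model are invented for illustration; real CMDB schemas are far richer.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigurationItem:
    """One monitored object: a server, a network device, an application."""
    ci_id: str
    ci_type: str                 # e.g. "server", "network_device", "application"
    name: str
    depends_on: list = field(default_factory=list)  # IDs of upstream items

cmdb = {}  # the "centralized database", here just a dict

def register(ci: ConfigurationItem) -> None:
    cmdb[ci.ci_id] = ci

def impacted_by(ci_id: str):
    """Items that directly depend on the given item: this is what lets
    a technical event be tied to the business services it affects."""
    return [ci for ci in cmdb.values() if ci_id in ci.depends_on]
```

With this, a failure event on a server can immediately be translated into "Exchange is affected", instead of an alert "in a vacuum".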

Virtualization


Previously, the entire infrastructure was physical: all servers were separate pieces of hardware sitting in a rack, and you could walk up and touch them (until the admin noticed). Now the infrastructure often consists of virtual machines: there is one physical server, but it runs, say, a dozen VMs. This adds a number of subtleties to the setup, but brings a lot of advantages. For us, as developers of monitoring systems, it is an obvious plus: everything can be put into a virtual environment. A monitoring system is software that consists of several modules, and previously each module needed a separate server. When several pieces of hardware were required, a customer could object that the system demanded too much equipment. Now those servers can be made virtual and placed on the same physical host. This also seriously reduces the cost of good monitoring systems.

An example of how this works


Here is a real-life example (names and faces changed). An HP Operations deployment was in place. Users accustomed to exchanging files via FTP at some point found that they could not upload a file. The first user got stuck: the server would not let him in. He assumed the failure was temporary and sent the file by mail. Then a couple more people tried, also without success, and someone filed a support ticket. Support began to figure out what was wrong. Everything looked fine: the server was operational, and yet the service was unavailable. Hunting for such a problem "hot" (given that you are not allowed to stop the other services) is a standard task, but a very tedious one without a monitoring system. Here the admin simply looked at the list of monitoring events and saw a flood of firewall alerts, with many connection attempts recorded from outside. Very quickly (surprise!) a DDoS attack on this FTP server was discovered and cut off. Without monitoring, I think the search for the problem would have taken three or four hours longer, which could have led to further complications.

Automation


More and more monitoring systems can perform service actions automatically. For example, a typical situation: a server runs out of space because of temporary files, and applications start to slow down. The admin logs in, cleans out the temporary files, leaves, and everything is fine until the next time. Monitoring can detect the moment when, say, the disk is 90% full, generate an event, and start the cleanup on its own, in automatic mode.

Since the monitoring system can integrate with service desks, it can automatically create problem tickets. That is, a ninja from support can quietly and suddenly solve the problem even before the first call.
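A sketch of that integration, with the service desk stood in for by an in-memory dict (a real one would be an API call): an event becomes a ticket, and repeats of the same problem are deduplicated onto the existing ticket rather than opening a new one each time.

```python
open_tickets = {}  # (source, problem) -> ticket id; stands in for a service desk

def on_event(source: str, problem: str) -> str:
    """Create a ticket for a new problem; reuse the ticket for repeats."""
    key = (source, problem)
    if key not in open_tickets:
        open_tickets[key] = f"TICKET-{len(open_tickets) + 1}"
    return open_tickets[key]
```

Deduplication matters here: a flapping service can emit hundreds of identical events, and support wants one ticket, not hundreds.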

How do you implement it at your place?


It must be said that a monitoring system, like any other large system, is rather complicated. Implementation usually proceeds in stages, regardless of whether the customer does it on his own or with the help of an integrator.

First, the monitored objects (network equipment, servers, applications, etc.) are identified. Then critical indicators are selected for each object. If you collect too much data, admins will drown in a stream of threshold-violation alerts; if too little, they may miss something important. After that, you need to decide on the architecture and choose a product, a solution, a vendor. Then configuration can begin. Sometimes a pilot zone is set up first, and that layout is then extended to the entire infrastructure.
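The "select critical indicators per object" step often ends up as a declarative plan like the one below. Object names, indicators, and limits are invented for illustration; the point is the shape: objects, each with a small set of indicators and warning/critical thresholds.

```python
MONITORING_PLAN = {
    "mail-01": {
        "type": "server",
        "indicators": {
            "cpu_percent":  {"warn": 80, "crit": 95},
            "disk_percent": {"warn": 85, "crit": 90},
        },
    },
    "core-switch": {
        "type": "network_device",
        "indicators": {
            "port_errors_per_min": {"warn": 10, "crit": 100},
        },
    },
}

def severity(obj: str, indicator: str, value: float) -> str:
    """Map a measured value onto the plan's thresholds."""
    limits = MONITORING_PLAN[obj]["indicators"][indicator]
    if value >= limits["crit"]:
        return "critical"
    if value >= limits["warn"]:
        return "warning"
    return "ok"
```

Keeping the plan small per object is exactly the balance the text describes: enough indicators to catch real problems, few enough that admins are not drowned in alerts.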

Off-the-shelf products


Monitoring systems target customers of different scales. Large, complex, and expensive solutions require enormous effort to deploy and implement, but for a large business it is worth it. There are smaller and simpler options for small and medium businesses, a kind of boxed product that is fairly easy to roll out. The best-known low-cost solution is Microsoft SCOM. There are also a number of open-source options; they are generally free and require only fairly painstaking configuration.

For what size company is the system useful?


The threshold is the point where the system administrator can no longer cope with the amount of work and can no longer keep track of each server. In small companies there is usually no point in such systems (or partial solutions will do), while a medium-sized company should already have a more or less serious monitoring system. Such systems began to develop about 10 years ago, and by now almost all large consumers of IT services have implemented something of the kind.

What else can monitoring do?




Code monitoring


Relatively recently, code-level monitoring has appeared. It applies mainly to J2EE and .NET applications. Such modules can detect delays in system calls, memory leaks, slow SQL queries, and so on.
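The core mechanism behind such modules is instrumenting calls and timing them. Here is the idea in Python (the article's J2EE/.NET agents do this at the bytecode level); the threshold and the decorator are illustrative, not any real agent's API.

```python
import functools
import time

SLOW_CALL_SECONDS = 0.5  # illustrative threshold for flagging a slow call

def monitored(fn):
    """Record every call's duration and flag slow ones, in the spirit
    of what code-level monitoring agents do for application methods."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            wrapper.calls.append(elapsed)  # history for later analysis
            if elapsed > SLOW_CALL_SECONDS:
                print(f"SLOW: {fn.__name__} took {elapsed:.3f}s")
    wrapper.calls = []
    return wrapper
```

Wrapping, say, a data-access function with `@monitored` makes slow SQL queries visible without touching the function's own code.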

Training


Initially, the system required a lot of effort to tune threshold values (what counts as an emergency: a disk 90% full, or 95%?). Naturally, with a large number of monitored objects this was a time-consuming task. Now monitoring systems can analyze historical data, study the behavior of objects, and build so-called "dynamic thresholds" on that basis. That is, the monitoring system "learns": it understands what is normal behavior for an object and what indicates an incident.
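The simplest form of a dynamic threshold is a statistical baseline learned from history, sketched below. Real systems also model daily and weekly seasonality; the mean-plus-k-sigma rule here is the minimal version of the idea, and `k=3.0` is an illustrative choice.

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Learn a threshold from historical samples: mean + k std deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

def is_anomalous(value, history, k=3.0):
    """An observation above the learned threshold 'says accident'."""
    return value > dynamic_threshold(history, k)
```

The payoff is exactly what the text describes: nobody has to hand-tune "90% or 95%?" per object; each object's own history answers the question.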

What will this change for the IT department?


Administrators will be able to shed routine work and concentrate on more important and interesting tasks. They will know exactly what is happening in the system at any given moment, i.e. the infrastructure will become transparent. The firefighting style of work, where they are constantly forced to put out fires and repair malfunctions, will go away; the handling of routine problems can be automated. Of course, unforeseen accidents will still have to be fixed "manually", but that becomes easier, since there will be an accurate diagnosis.

It only remains to read Habr and convince the accounting department that an admin with little to do is incredible happiness, not a problem.

Source: https://habr.com/ru/post/144941/

