How to reduce critical system downtime

Working with the database: a list of JDBC connections with the connection parameters.

The company from the TOP-5 in Russia earns an average of 7 to 9 million dollars per hour. Accordingly, a technical simple length of two hours, which was reduced to one hour by superhuman effort of will, is worth exactly this amount.

BSM is a class of systems designed specifically for those who suddenly realized that one minute in our program today is equal in price to an apartment in Moscow. And she really wants no downtime.
')
Now I will tell you how we implemented such systems.

An example of a pure software solution

In oil companies, for example, there is a set of software for accounting for deliveries, which, in fact, is ERP. Without this component, the company's work gets up until the “underground gnomes” repair everything back.

In one such company, each technological link had its own monitoring system, and if something happened somewhere, the first interesting quest was to find a problem. He took up half the time from each idle. We deployed a monitoring system that allowed us to clearly find the problem. And if earlier it could take from several hours to several days to find and eliminate problems in the software, the new solution significantly reduced the time to determine the root cause of the problem. The number of failures is now much less, the timing of system recovery has also decreased.

Now the system simply shows where and what is wrong. Most often we are talking about a specific non-responsive or incorrectly responding service, which is enough to restart to continue. I want to emphasize that in similar cases the whole “bodanovka” disappears with the search for who is responsible for the problem - instead of solving the issue on which side the plug (as it happens with many departments and contractors), you can immediately run and eliminate.

Iron Integration

BSM systems can integrate with iron. To illustrate the work in this case, I will tell you about how BSM is placed at the airport.

So, there is an airport. In it, the critical objects are servers, storage systems and, in general, everything that can be classified as “IT solutions of the local data center”. But there is also, for example, a navigation system, which almost by the voice of GladOS reports from time to time where and to which passenger to go. In the case of its fall, of course, you can declare the voice of a living person, but it is better, of course, to avoid this - reputation, excessive panic ... Another critical system is the baggage management system. If she gets up or starts to give the luggage in the other direction, the entire terminal stops serving passengers.

Accordingly, we approach as follows:

We carry out a complete examination of the criticality of systems.
We are looking for bottlenecks.
We design the solution. In our case, for each system you need to come up with a metric that checks its work. For example, in the case of baggage - we can connect to the baggage distribution management system and track metrics indicating interruptions or degradation of service. In the case of working with storage systems, we simply use the level of load on it for each process.
Expand the solution itself. This is a database with patterns of "good" and "bad" behavior, a set of sensors and information collectors (both in the form of hardware and software agents), an application server, a processing server, a warning system.
If necessary, set up automation at the “event - response” level. For example, if one of the applications stops responding - we can perform automated diagnostics of the application and either restore work automatically or, if work restoration cannot be formalized as an executable algorithm, automatically switch to another instance of the application if necessary. If the luggage tape is broken and information about it can be obtained from the corresponding management system, you can automatically call the repairman through an SMS alert, initiate an incident in the HelpDesk class system and notify those who are participating in the support process via e-mail.

Quality control

So, BSM is able to reduce the time to determine the place of failure. Able to track both software and hardware. And even inside BSM systems there are usually user simulators - these are stupid “robots” that can check, for example, the availability of a TCP port or the availability of a GET response for a web page. In a more complex implementation, robots can emulate a sequence of user actions with the application interface, and you can record these operations and translate them into scripting languages in an interactive visual mode. There are also modules that, using the traffic assembly to the application level, can isolate the main operations and sequences of related user operations and collect statistics on delays and availability of operations for each real user.

Now a small lyrical digression. Each time, introducing an IT-system, you need to think about what tasks it solves. For example, if customer service without an IT system lasted 12 minutes, and an automation system was introduced that allows you not to fill a part of the paper with your hands, then you want to believe that the service now takes a maximum of 10 minutes, right? And if it takes 14 minutes instead of the old 12, then there’s a problem somewhere.

So, one of the tasks of BSM is to monitor the quality of service provision. Not only its availability, but also the search for problems with braking interfaces, the delay in decision making by users, unnecessary links in the chains.

If we take as a basis the situation when the developer performs the development and testing of applications and new releases in the most conscientious and high-quality manner, independent control by the customer still requires quality performance of applications, since the reason is obvious - no one except application consumers and the customer needs quality work . A quality level can only be determined by the customer.

But in practice it happens, rather, that one fine morning after the next application release is transferred to commercial operation, users come to the office and understand that everything is slowing down. We had an example when BSM just collected information on the quality indicators of the system. A third-party system after implementation worked like a clock. But with the increase in the number of users, surprises began with the fact that in terms of operations, users observed significant delays in the response of the application. BSM ran as a user, repeating the basic patterns for real people - and caught a couple of bottlenecks, where the response of the interface could be up to 12 seconds under a number of conditions.

Such a solution can be built, for example, on HP Business Service Management (BSM) - this is an example of a couple of my recent projects. And if you also integrate the HP Executive Scorecard (XS) here, then you can correlate business operations with monitoring with IT asset management and customer service metrics.

Code Review

The same HP BSM does not know how to look out of the box, where the problem is in the code. Nevertheless, quite a lot of problems of solving a downtime problem rest on this. And so it has a convenient integration with products for working at the code level. In this case, screenshots from HP Diagnostics:

Horizontal bars indicate calls in the application, their length shows the time each call is made, and their sequence is shown below in the tree.

The same screenshot shows the ability to track exceptions.
Viewing calls allows you to understand what is the procedure for slowing down the application.

Analyzing data flows and detected components Diagnostics builds the topology of the application components between themselves and the client:

Good to know where and what

Summary

Despite the fairly simple description, BSM is an expensive and complex toy that, in fact, deploys a whole network of accompanying processes that collect data for everything that is running in the IT infrastructure.

In general, BSM is implemented at least a month or two, and allows in practice to reduce the idle time of critical services. More precisely, given that there is no 100% reliable services - to turn the inevitable downtime into a shorter one.

Source: https://habr.com/ru/post/226471/

All Articles