Monitoring IT Performance with Splunk IT Service Intelligence

It is no secret that businesses continuously strive to become more competitive. The easiest way to do this seems to be innovation: improve the efficiency of one, two or three business processes with a modern, high-tech system, and all is well. This approach works until the complexity of the IT system starts to work against us.

Do you know how effective your IT infrastructure is? How do its elements affect each other? Could one element be degrading the performance of the system as a whole? How do you find out which one?

In this article, we describe a solution that helps analyze and monitor the performance of each individual element and of the IT system as a whole, determine dependencies between services, and notify you when a performance indicator crosses a threshold.
This solution is Splunk IT Service Intelligence (ITSI), a tool for monitoring and analyzing the IT infrastructure. It runs on the Splunk operational analytics platform, which we described earlier.

The platform allows you to collect and monitor data from various devices, systems and applications. ITSI provides visibility into the performance of system elements, their KPIs, and the behavior of critical IT and business services and the services that support them.

With a single tool, you can handle many IT tasks, including the analysis of events, metrics and logs, and so identify and fix the most important problems in the IT infrastructure.

Let's look at the main elements of Splunk IT Service Intelligence and why they are needed:

Service Analyzer

Everyone wants to get the most important information as clearly and quickly as possible. That is the purpose of the ITSI start page, the Service Analyzer. When you open the application, you can assess the overall state of the entire IT system at a glance.

There are two Service Analyzer views: a tile view and a tree structure that allows you to see how some services affect others.

The tile view presents services and their main KPIs as color-coded tiles, sorted by current health status.
The tree view displays all services as a graph in which each node shows the health of its service. A node can be expanded to show the related KPIs.

In the Buttercup store example, the Buttercup store service depends on web store sales, mobile apps sales, web store services, mobile apps services and support services. The mobile apps service depends on the middleware service and on the external authorization service, which in turn depends on the database.
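To make the idea of dependencies concrete, here is a minimal sketch in plain Python, not ITSI code. It encodes the example as a graph and answers the question "which services are affected if one element fails?". The service names come from the example; the reading that only the external authorization service depends on the database is our assumption, and the function name `affected_by` is hypothetical.

```python
from collections import deque

# Dependencies from the example: service -> services it depends on.
# Assumption: only the external authorization service depends on the database.
DEPENDS_ON = {
    "Buttercup store": [
        "web store sales",
        "mobile apps sales",
        "web store services",
        "mobile apps services",
        "support services",
    ],
    "mobile apps services": ["middleware", "external authorization"],
    "external authorization": ["database"],
}


def affected_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the graph: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk upwards from the failed element.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(affected_by("database", DEPENDS_ON)))
# ['Buttercup store', 'external authorization', 'mobile apps services']
```

Under these assumptions, a database problem reaches the top-level store service through two intermediate services, which is the kind of chain the tree view makes visible.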

Glass Tables

Not every user finds it convenient or necessary to understand the structure of the entire system in order to get useful information. Glass Tables are a visualization that almost any user can understand. They show the status of services and the values of KPIs in the context of workflows or business processes. You can use various widgets and icons to display the KPIs.

Notable Events Review

Once we know how the services are performing, we want to know why: which events affect their work. The Notable Events Review panel serves this purpose.

The Notable Events Review dashboard shows warnings about problems that currently affect services or may affect them in the future. It displays notable events generated by Multi KPI Alerts, correlation searches, and anomaly detection algorithms.

A notable event can be:

A single KPI that crosses a predetermined threshold;
The result of a Multi KPI Alert, which generates a warning based on the status of several KPIs;
The result of a correlation search, which looks for relationships between data points.
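The first case, a KPI crossing a threshold, can be illustrated with a short Python sketch. This is not how ITSI is implemented or configured; the severity labels, the threshold values and the names `kpi_severity` and `notable_event` are assumptions chosen for illustration.

```python
def kpi_severity(value, thresholds):
    """Map a KPI value to a severity label.

    `thresholds` is a list of (lower_bound, label) pairs. The label of the
    highest bound that the value reaches is returned, or "normal" if the
    value is below every bound.
    """
    label = "normal"
    for lower_bound, name in sorted(thresholds):
        if value >= lower_bound:
            label = name
    return label


def notable_event(kpi_name, value, thresholds):
    """Return a small event record if the KPI is not normal, otherwise None."""
    severity = kpi_severity(value, thresholds)
    if severity == "normal":
        return None
    return {"kpi": kpi_name, "value": value, "severity": severity}


# Hypothetical thresholds for a response-time KPI, in milliseconds.
RESPONSE_TIME_MS = [(500, "medium"), (1000, "high"), (2000, "critical")]

print(notable_event("response time", 1500, RESPONSE_TIME_MS))
print(notable_event("response time", 300, RESPONSE_TIME_MS))
```

The first call produces an event with severity "high"; the second returns `None` because the value stays below every threshold.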

For convenience, events are grouped by machine learning algorithms that identify similar events. The panel shows information about each event group: the number of events in the group, their time range, the owner, the severity level, the status and a description. Clicking a group shows detailed information about the events in it.

We can manage events and configure actions and scripts to run in response to an event, for example sending notifications by e-mail or to external systems.

Deep Dives

Of course, if we look only at the current state of the system, we see only the tip of the iceberg. It is therefore very useful to be able to monitor the state over time. How long does a drop in performance last? Everything is fine now, but were there problems an hour, two hours or a day ago? Why?

The Deep Dives section adds the history of KPIs. We can see the state of the system not only now but also a few hours ago, compare different elements of the system over the same period, or compare one service with its own values a day, a week or a month earlier.
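The comparison with an earlier period is easy to picture with a few lines of Python. This sketch only illustrates the idea and is not part of ITSI; the KPI values are invented and the function name `delta_vs_lag` is hypothetical.

```python
def delta_vs_lag(values, lag):
    """Return values[i] - values[i - lag] for every index that has a lagged partner."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


# Hypothetical daily averages of a response-time KPI (ms), oldest first.
daily_response_ms = [210, 205, 220, 215, 230, 190, 185,
                     212, 208, 300, 310, 235, 192, 188]

# Compare each day of the second week with the same weekday one week earlier.
week_over_week = delta_vs_lag(daily_response_ms, lag=7)
print(week_over_week)
# [2, 3, 80, 95, 5, 2, 3]
```

In this made-up data, the third and fourth days of the second week stand out against the same weekdays a week earlier, even though the latest values look normal again.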

Multi KPI Alerts

Often, only a combination of factors reveals a problem, so we need alerts that are triggered by several KPIs at once or that depend on how long an indicator stays at a certain value.
In the Multi KPI Alerts section, you can create such composite conditions and take into account the weight of each factor's contribution to the overall problem.
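The sketch below shows one possible way to combine several KPIs with weights and a duration condition in plain Python. It does not reproduce the ITSI scoring rules; the numeric severity scale, the weights, the threshold and the names `composite_score` and `should_alert` are all assumptions made for illustration.

```python
# Assumed numeric scale for severity labels.
SEVERITY_SCORE = {"normal": 0, "medium": 1, "high": 2, "critical": 3}


def composite_score(statuses, weights):
    """Weighted average of KPI severities; `statuses` maps KPI name -> label."""
    total_weight = sum(weights.values())
    weighted = sum(SEVERITY_SCORE[statuses[kpi]] * w for kpi, w in weights.items())
    return weighted / total_weight


def should_alert(history, weights, threshold, min_consecutive):
    """Alert only if the last `min_consecutive` samples all reach the threshold.

    `history` is a list of status dicts, oldest first.
    """
    if len(history) < min_consecutive:
        return False
    recent = history[-min_consecutive:]
    return all(composite_score(s, weights) >= threshold for s in recent)


# Hypothetical weights: the error rate matters three times as much as CPU load.
WEIGHTS = {"cpu": 1, "errors": 3}

history = [
    {"cpu": "critical", "errors": "normal"},  # score 0.75
    {"cpu": "normal", "errors": "high"},      # score 1.5
    {"cpu": "medium", "errors": "high"},      # score 1.75
]

print(should_alert(history, WEIGHTS, threshold=1.0, min_consecutive=2))
print(should_alert(history, WEIGHTS, threshold=1.0, min_consecutive=3))
```

With these numbers the first call returns `True` and the second `False`: a critical CPU reading alone does not reach the threshold because of its low weight, so the condition has held for only two samples.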

Conclusion

To summarize:

Services and KPIs in the Service Analyzer show which services are currently working normally and which ones deviate.
Glass Tables allow you to arrange indicators into specific groups and visualize them.
Deep Dives allow you to compare indicators over time and determine where a particular problem started.
Multi KPI Alerts and Notable Events reveal specific important events and allow you to manage them.

The examples above do not cover all of the system's functionality, but they are enough for companies with large IT landscapes to solve their main tasks. We can say with confidence that ITSI is essential for those who want to keep their IT infrastructure under control.

We are happy to answer your questions and comments on this topic. If you are interested in something specific in this area, or in machine data analysis in general, we are ready to adapt the existing solutions to your particular task. Write to us in the comments or send a request through the form on our website.

PS

On June 28, 2018, the “Splunk Getting Started” course will be taught in Moscow. In 6 hours, participants will get a theoretical foundation and practical skills for working with Splunk. Learn more about the training and register at this link.

Source: https://habr.com/ru/post/349242/
