Monitoring Transit VoIP with the Forecasting Method

Abstract

Even if you do not use VoIP in your system, or VoIP is not your main focus, the forecast-based monitoring method itself may interest you, because it can be applied successfully well beyond transit VoIP. The method is presented through its application to transit VoIP because that task illustrates it vividly: standard methods do not solve the problem, while monitoring by forecasting is relatively simple to implement. What follows is not theoretical research; it has been used successfully in practice for several months.

Introduction

Most modern active monitoring systems for IT infrastructure work on the same principle. The monitoring system polls the equipment or software in some way, gets a result, and compares it either with a template or with predetermined maximum permissible values.

For example, to check the availability of an SMTP server, the monitoring system connects to the server on TCP port 25, sends the string “helo my.monitoring.com”, reads the response line, and disconnects. It then checks whether the response line begins with a three-digit code starting with 2. If it does, the server is alive; if not, the system raises an alert. In essence, this is pattern matching.
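A minimal sketch of such a pattern-matching check, using only the Python standard library. The function names are hypothetical, and the sketch assumes a single-line greeting and reply; a production check would also handle multi-line SMTP replies.

```python
import socket


def reply_is_positive(reply_line: str) -> bool:
    """Return True if an SMTP reply line starts with a three-digit 2xx code."""
    code = reply_line[:3]
    return len(code) == 3 and code.isdigit() and code.startswith("2")


def smtp_is_alive(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Connect, send HELO, and pattern-match the first line of the reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            stream = conn.makefile("rwb")
            stream.readline()  # greeting banner sent by the server
            stream.write(b"HELO my.monitoring.com\r\n")
            stream.flush()
            reply = stream.readline().decode("ascii", errors="replace")
            return reply_is_positive(reply)
    except OSError:
        # Connection refused, timeout, DNS failure: treat as "not alive".
        return False
```

Keeping the pattern match in its own function makes the criterion ("starts with a 2xx code") explicit and easy to test without a live server.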
Another example is a processor load check. The monitoring system polls the server, most often via SNMP, obtains the current processor load, and compares it with the maximum permissible value, say 80. If the processor is more than 80% loaded, the monitoring system raises an alert. This is the method of checking polled values against maximum permissible values.
In either case, when configuring the monitoring system you need to set clear criteria for what counts as normal operation of your hardware or software and what counts as a malfunction, or as a situation that may lead to a malfunction in the near future.
This principle works in most cases. Sometimes, however, it is too difficult or outright impossible to define what counts as a malfunction and what counts as normal operation.
One monitoring task that standard methods do not solve is monitoring the quantitative and qualitative parameters of transit VoIP. Before describing how a prediction-based monitoring system works, it is necessary to describe how a transit VoIP provider works.

How transit VoIP works

As in any field, transit VoIP has many nuances that would take too long to explain and would serve no purpose here. The description is therefore deliberately simplified.
The subject should be considered not at the level of the end user, the person making the call, but at the level of the companies that provide the VoIP service. From this point of view, two companies are involved in providing the service. The first, company A, is a consumer of VoIP traffic, whose client initiates the call; the second, company B, is a supplier, which terminates the call.
A supplier usually delivers one or several directions. A direction (also called a destination) is a group of telephone codes sold at a certain price. Examples of directions: Russia Mobile BeeLine, Spain Madrid, Kazakhstan Almaty. At present there are about 1,300 directions in the world.
A consumer company usually offers its customers all directions; this is called A-to-Z. The consumer therefore has to sign contracts with many suppliers of various directions and then route each direction to the right suppliers. This is quite difficult to do.
The supplier, for its part, is interested in attracting as much traffic as possible. But that means a large number of contracts and potential financial risks: for example, one or several of the many companies may fail to pay for the traffic they consumed.
It is therefore beneficial for both suppliers and consumers to work through an intermediary, a transit operator, which strictly speaking neither generates nor terminates VoIP traffic itself but links the two sides together.
In effect, instead of a densely connected mesh of relationships between suppliers and consumers, we get a star structure with the transit operator at its center.

Most often, the transit operator proxies both signaling and voice through itself.
The company that deployed the described monitoring system has about 250 suppliers and 250 consumers. It provides A-to-Z directions; in total, about 1,500 direction names are registered in its billing system.
It is extremely important for a transit operator to monitor how its consumers, suppliers and directions are performing, and to do so separately for each consumer, supplier and direction. In the short term, a failure costs the transit operator margin. In the long term, everyone is interested in a high level of service.
Now that the work of a transit operator has been outlined, I will formulate the task of the monitoring system.

The task of the monitoring system

When people talk about monitoring VoIP, they usually mean monitoring parameters such as jitter, delay, packet loss and MOS. Cisco has a strong technology for this; I wrote about it four years ago.
For a transit VoIP operator, however, such technologies and metrics are not suitable for detecting failures. Naturally, the transit operator does analyze packet loss percentage, jitter and delay towards its partners, but only after a failure has been detected. The failures themselves are detected by analyzing other parameters:
• Total number of minutes.
• Total number of calls.
• ASR (Answer Seizure Ratio): the ratio of successful calls, those with a non-zero duration, to the total number of calls.
• ACD (Average Call Duration): the average duration of a call.
The first two parameters are quantitative, the last two qualitative. Explaining why exactly these parameters are analyzed would take a long time and is beyond the scope of this article.
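The four parameters above can be computed from the call durations of one interval, as in the following sketch. The names are hypothetical, and one assumption goes beyond the text: ACD is averaged over answered calls only, which is a common convention but is not stated in this article.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    minutes: float  # total number of minutes
    calls: int      # total number of calls, including unanswered attempts
    asr: float      # share of calls with a non-zero duration, 0..1
    acd: float      # average duration of answered calls, in minutes (assumption)


def interval_stats(durations_sec: list[int]) -> IntervalStats:
    """Aggregate the call durations (in seconds) of one measurement interval."""
    calls = len(durations_sec)
    answered = [d for d in durations_sec if d > 0]
    minutes = sum(answered) / 60.0
    asr = len(answered) / calls if calls else 0.0
    acd = minutes / len(answered) if answered else 0.0
    return IntervalStats(minutes, calls, asr, acd)
```

For example, four call attempts with durations of 0, 60, 120 and 0 seconds give 3 minutes, 4 calls, an ASR of 0.5 and an ACD of 1.5 minutes.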
It should also be noted that, whereas a monitoring system usually polls remote devices in one way or another, the described system uses CDRs (Call Detail Records) as its data source. A CDR file is a text file in which each line describes one call or call attempt. The level of detail varies, depending on the equipment that writes the CDRs. A good CDR can contain about 100 parameters: not only who called, where the call went, its duration and its completion code, but also, for VoIP, parameters such as jitter, codecs, packet loss, MOS and others.
Let us calculate how many values the monitoring system has to check for potential failures. The parameters must be analyzed for 250 suppliers, 250 consumers and 1,500 directions. For each supplier, consumer and direction, 4 values need to be checked, which gives:

250 * 4 + 250 * 4 + 1500 * 4 = 8000

But the difficulty is not the relatively large number of values to check. The point is that each partner and each direction is unique; moreover, the operating parameters of each partner and direction depend heavily on the time of day.
To back this up, consider several examples built from real traffic data.

The graph shows the traffic, in minutes, of two partners located in the same geographical area. The values on the chart are the number of minutes consumed by a partner in a 15-minute interval. While the first partner consumes about 2,500 minutes per 15-minute interval between 8 and 22 hours, the second consumes no more than 500. If the first partner starts to consume less than 1,500 or more than 4,000 minutes between 8 and 18 hours, a failure may have occurred and the cause needs to be analyzed. For the second partner this does not hold; the criteria for a potential failure are different.
Another example.

The graph shows the traffic in calls. The values on the chart are the number of calls per 15-minute interval made to 3 different directions with comparable amounts of traffic. The first two directions are in geographically close areas; the third is geographically remote, that is, the country is in a different time zone.
I hope it is obvious that setting monitoring criteria for such traffic is very difficult. In practice, standard methods do not solve the task.

The forecasting method: why it should work

Although each direction and partner is unique, their quantitative and qualitative parameters are predictable. For example, consider the graph of one direction over three consecutive days; for clarity, each day is drawn on the same graph in a different color.

It is logical to assume that calls to this direction will be distributed the same way on the fourth day. In other words, it is possible to predict how a direction should behave and, if it behaves differently than predicted, to raise an alert about a potential failure.
The problem can thus be solved by writing software that can predict. Many forecasting methods have been developed. However, implementing one of them in software is fairly difficult; it is more practical to use a ready-made product.
Forecasting functionality is built into a well-known and popular product, rrdtool, which is distributed under the GPL.

Prediction method implemented in rrdtool

Although I have been actively using rrdtool for more than 10 years, I must admit that until recently I did not even suspect that forecasting was built into it. The very idea of using prediction to monitor VoIP arose after reading Jake D. Brutlag's article “Aberrant Behavior Detection in Time Series for Network Monitoring”, for which I am grateful to the author.
The prediction method is used in the cricket monitoring system. However, that product has not been developed for a long time; besides, cricket polls devices via SNMP, whereas monitoring VoIP traffic, as I wrote above, requires processing CDRs. Several other monitoring systems use the prediction implemented in rrdtool, but none of them fit, so I had to write my own monitoring system from scratch.
Frankly, at first there was no certainty that this would work, so the monitoring method was first tested on IP traffic. Simple scripts polled interface counters and fed the resulting values into specially created rrd databases. After the forecasting method caught several major failures, it became clear that the method works.
Jake D. Brutlag's article contains the formulas used for forecasting and a lot of other useful information, but I will explain how it works a little differently, without going into technical details.
For rrdtool to start predicting, the rrd database must be created in a particular way. At creation time you set the forecasting parameters, such as the length of the season and the various coefficients that affect the forecast. After that you simply feed measurements into the rrd database, and rrdtool takes care of forecasting and of identifying potential failures.
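As an illustration, the sketch below builds an `rrdtool create` command line for one such database. It assumes a 15-minute step, a one-day season and 28 days of history. The file name, the data source name and the alpha and beta values are illustrative assumptions, not the settings of the described system. When an HWPREDICT archive is given in this short form, rrdtool creates the companion archives for seasonal coefficients, deviations and failures itself.

```python
import subprocess

STEP_SEC = 15 * 60                    # one measurement every 15 minutes
SEASON_STEPS = 24 * 3600 // STEP_SEC  # 96 measurements in a one-day season
ROWS = 28 * SEASON_STEPS              # 28 days of history


def create_command(path: str, ds_name: str = "minutes",
                   alpha: float = 0.1, beta: float = 0.0035) -> list[str]:
    """Build the argument list for `rrdtool create` with Holt-Winters enabled."""
    return [
        "rrdtool", "create", path,
        "--step", str(STEP_SEC),
        # GAUGE data source, heartbeat of two steps, minimum 0, no maximum.
        f"DS:{ds_name}:GAUGE:{2 * STEP_SEC}:0:U",
        # Archive with the measured values themselves.
        f"RRA:AVERAGE:0.5:1:{ROWS}",
        # Forecast archive: rows, alpha, beta, season length in steps.
        f"RRA:HWPREDICT:{ROWS}:{alpha}:{beta}:{SEASON_STEPS}",
    ]


def create_rrd(path: str) -> None:
    """Create the database; requires the rrdtool binary to be installed."""
    subprocess.run(create_command(path), check=True)


if __name__ == "__main__":
    # Hypothetical file name, printed only to show the resulting command.
    print(" ".join(create_command("partner_42_minutes.rrd")))
```

Building the argument list separately from running it keeps the parameters visible in one place and lets the command be inspected before any file is created.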
rrdtool implements the Holt-Winters prediction method together with a mechanism for detecting aberrations of the measured value relative to the predicted values.
According to Wikipedia, an aberration is a deviation from the norm; an error, violation or inaccuracy (Latin aberratio, "evasion, removal, distraction", from aberrare, "to leave, deviate": ab- "from" + errare "to wander, err").
What an aberration is can be understood simply by looking at the picture.

Holt-Winters prediction uses the concept of seasons. For IP and VoIP traffic, the season is usually one day. However, you can specify a season of any length.
The Holt-Winters method forecasts two values at once: the first is the forecast of the measured value itself, and the second is the forecast of the permissible deviation of the measured value from the predicted one. In effect, the second value sets the width of the corridor of permissible deviations of the real values from the forecast.
Another important point: Holt-Winters is a short-term forecasting method, which means the forecast is not built for the whole season at once (in our case, not for the whole day at once). Instead, each new measurement produces a forecast of the next value, calculated from the history of previous measurements and the latest measurement. An adjustable coefficient determines how strongly the previous history and the latest measurement each affect the forecast.
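The step-by-step nature of the forecast can be sketched in a few lines. The class below follows the additive Holt-Winters update as I understand it from Brutlag's article. It is an illustration, not rrdtool's code; the class and attribute names are hypothetical, and the all-zero initial state is a simplification.

```python
class HoltWinters:
    """One-step-ahead additive Holt-Winters forecast with a deviation forecast."""

    def __init__(self, period: int, alpha: float = 0.1,
                 beta: float = 0.0035, gamma: float = 0.1):
        self.period = period            # season length in measurements
        self.alpha = alpha              # weight of the latest value in the baseline
        self.beta = beta                # weight of the latest change in the trend
        self.gamma = gamma              # weight for seasonal terms and deviation
        self.level = 0.0
        self.trend = 0.0
        self.seasonal = [0.0] * period  # one coefficient per point of the season
        self.deviation = [0.0] * period
        self.t = 0                      # index of the next measurement

    def forecast(self) -> tuple[float, float]:
        """Return (predicted value, predicted deviation) for the next measurement."""
        i = self.t % self.period
        return self.level + self.trend + self.seasonal[i], self.deviation[i]

    def update(self, y: float) -> None:
        """Fold one new measurement into the model."""
        i = self.t % self.period
        predicted, _ = self.forecast()
        prev_level = self.level
        self.level = (self.alpha * (y - self.seasonal[i])
                      + (1 - self.alpha) * (prev_level + self.trend))
        self.trend = (self.beta * (self.level - prev_level)
                      + (1 - self.beta) * self.trend)
        self.seasonal[i] = (self.gamma * (y - self.level)
                            + (1 - self.gamma) * self.seasonal[i])
        self.deviation[i] = (self.gamma * abs(y - predicted)
                             + (1 - self.gamma) * self.deviation[i])
        self.t += 1
```

The closer a coefficient is to 1, the more the latest measurement dominates; the closer to 0, the more the accumulated history dominates.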
An aberration is detected from several measurements, using a sliding window. That is, a single measurement falling outside the range of acceptable values does not necessarily produce an aberration. By default, an aberration is raised when seven of the last nine measurements fall outside the predicted range of acceptable values. You can change these parameters: with, say, a window length of 3 and a threshold of 2, an aberration is raised when two of the last three measurements fall outside the corridor.
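The sliding-window rule can be sketched as follows. The class name is hypothetical, and the bounds of the acceptable range are passed in by the caller; this illustrates the rule and is not rrdtool's implementation.

```python
from collections import deque


class FailureWindow:
    """Flag an aberration when enough recent measurements left the corridor."""

    def __init__(self, window_length: int = 9, threshold: int = 7):
        self.threshold = threshold
        # Only the last `window_length` violation flags are kept.
        self.violations = deque(maxlen=window_length)

    def add(self, value: float, lower: float, upper: float) -> bool:
        """Record one measurement; return True if an aberration is flagged."""
        self.violations.append(value < lower or value > upper)
        return sum(self.violations) >= self.threshold
```

With a window length of 3 and a threshold of 2, the sequence "inside, outside, inside, outside, outside" first raises an aberration on the fourth measurement, when two of the last three values are outside the corridor.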
If you have worked with rrdtool, you know how powerful a graphing tool it is. Let me illustrate the last three paragraphs with graphs of the kind built in the described monitoring system.

The blue line of width 2 shows the real measurements of the number of minutes consumed by the partner. The gray area shows the real values exactly one day earlier. The pink line is the prediction. The red and green lines mark the upper and lower limits of the range of acceptable values. The black line in the negative region is the forecast of the permissible deviation from the predicted values. In effect, the black line sets the width of the corridor. By default, the upper and lower boundaries of the corridor (the red and green lines) are obtained by adding to the predicted value, for the upper boundary, or subtracting from it, for the lower boundary, the predicted deviation multiplied by 2. Aberrations are shown in gold.
Let us look at the aberrations more closely.
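The corridor boundaries described above reduce to a one-line calculation. A small sketch with a hypothetical function name; the multiplier defaults to 2, as stated in the text.

```python
def corridor(prediction: float, deviation: float,
             delta: float = 2.0) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the range of acceptable values."""
    return prediction - delta * deviation, prediction + delta * deviation


def is_outside(value: float, prediction: float, deviation: float) -> bool:
    """True if a measurement falls outside the default corridor."""
    lower, upper = corridor(prediction, deviation)
    return value < lower or value > upper
```

For a predicted value of 2500 minutes and a predicted deviation of 300, the corridor runs from 1900 to 3100 minutes.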

You can see that the aberration was raised some time after the values left the corridor; in this case the rrd database uses the default configuration, with a window length of 9 and a threshold of 7 violations.
Aberrations are easy to show on graphs, but the monitoring system itself must detect the moment of aberration programmatically in order to signal it. Programmatic detection of aberrations is also easy to implement with rrdtool.
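One possible way to do this, sketched below, is to read the FAILURES archive with `rrdtool fetch` and look for values of 1. The function names are hypothetical. The parser assumes the usual text output of `rrdtool fetch`: a header line with data source names, then lines of the form `timestamp: value`. It looks only at the first data source.

```python
import subprocess


def fetch_failures_command(path: str, start: str = "-1h") -> list[str]:
    """Build the `rrdtool fetch` command line for the FAILURES archive."""
    return ["rrdtool", "fetch", path, "FAILURES", "--start", start]


def failure_timestamps(fetch_output: str) -> list[int]:
    """Return the timestamps at which the FAILURES value is 1."""
    stamps = []
    for line in fetch_output.splitlines():
        head, sep, tail = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header with data source names, or a blank line
        values = tail.split()
        if values and "nan" not in values[0].lower() and float(values[0]) >= 1.0:
            stamps.append(int(head))
    return stamps


def recent_failures(path: str) -> list[int]:
    """Run rrdtool (must be installed) and return recent failure timestamps."""
    result = subprocess.run(fetch_failures_command(path), check=True,
                            capture_output=True, text=True)
    return failure_timestamps(result.stdout)
```

An alert module could call `recent_failures` after each update and report only the timestamps it has not seen before.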
rrdtool renders graphs as images in which the plot area has fixed coordinates, so it is not hard to write a procedure or class that zooms graphs along the time axis with the mouse. If you use cacti, which is essentially a web GUI for rrdtool, you can see how this is done there in JavaScript.
The core of the described monitoring system is rrdtool. However, forecasting and detection of potential failures alone are not enough. Once the system has raised a failure alert, it is important to get to the data quickly to analyze the situation, so additional features for analyzing VoIP traffic were implemented. To describe these features simply, let us first review in general terms how the system works.

How the monitoring system works

The scheme of the monitoring system is shown in the figure below.

Every 15 minutes, a new CDR file is processed. The file can contain up to 150,000 records. The data is parsed and stored in a PostgreSQL database, which keeps raw CDRs for 28 days. Up to 10 million records accumulate in the CDR database per day, about 250 million in total. Naturally, to speed up queries over an arbitrary period broken down by partner and direction, additional tables are also populated with summary information sufficient for generating reports. About 200 thousand records a day go into the summary table, and at such volumes the database answers queries very quickly, even over a period of several weeks.
Next, data is selected from the CDR database and written into the rrd databases that store ASR, ACD, minutes and calls for each supplier, consumer and direction. In total, about 8,000 rrd databases are updated. Because of the way rrdtool works, if a partner or direction had no traffic, a value of 0 must be written to its database. After the rrd databases are updated, the alert module checks each of them for new aberrations. When an aberration is found, the alert module can send an email; in addition, users running the monitoring system client are notified of new alerts by a change in the color of the client's tray icon.
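The rule about writing an explicit 0 can be sketched like this. The function name and the file naming scheme are hypothetical; the sketch only builds `rrdtool update` command lines for one metric and does not run them.

```python
def update_commands(known_partners: list[str],
                    minutes_by_partner: dict[str, float],
                    interval_end: int) -> list[list[str]]:
    """Build one `rrdtool update` command per known partner."""
    commands = []
    for partner in known_partners:
        # A partner missing from the query result had no traffic: write 0.
        value = minutes_by_partner.get(partner, 0)
        commands.append(["rrdtool", "update", f"{partner}_minutes.rrd",
                         f"{interval_end}:{value}"])
    return commands
```

Iterating over the list of known partners, rather than over the query result, is what ensures that silent partners still receive a measurement in every interval.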
Processing one CDR file, including sending alerts, takes at least two and a half minutes. The server applications are written in Python and run on FreeBSD.
Owing to the author's personal preferences, the monitoring system client is not a web application. The client is written in PyQt. To communicate with the server, the client uses the XML-RPC protocol, which runs over SSL for security. HTTP Basic Authentication is used to authenticate clients. The server application that serves client requests is multi-threaded.
A detailed description of the client interface would be out of place in this article, since the user documentation alone runs to more than 80 pages. I will only describe in general terms what users get.
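On the client side, such a connection might look like the sketch below, which uses Python's standard `xmlrpc.client` module. The host name, URL path and credentials are placeholders, and nothing here reflects the actual API of the described system. The standard library accepts HTTP Basic Authentication credentials embedded in the URL.

```python
import xmlrpc.client
from urllib.parse import quote


def make_client(host: str, user: str, password: str,
                port: int = 443) -> xmlrpc.client.ServerProxy:
    """Create an XML-RPC proxy that talks HTTPS with HTTP Basic Authentication."""
    # Percent-encode the credentials so characters such as "@" or ":" are safe.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    url = f"https://{credentials}@{host}:{port}/RPC2"
    return xmlrpc.client.ServerProxy(url)


if __name__ == "__main__":
    # Placeholder host and credentials; no request is sent until a method is called.
    client = make_client("monitoring.example.com", "operator", "secret")
```

Creating the proxy does not open a connection; the first remote method call does.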

How the monitoring system is used

When you create or describe a product, you first need to position it correctly. The original plan was to build a system that signals potential failures, but initial operation showed that failure detection alone is not enough: a tool for effectively analyzing the cause of a failure is also needed. In the end it all grew into a system for monitoring and analyzing traffic, and the product is now positioned as a system for monitoring and analyzing VoIP traffic.
For monitoring, as stated above, the prediction functionality implemented in rrdtool is used. The list of alerts can be viewed as a table, and clicking an alert row builds a graph of the value that triggered the alert.
For traffic analysis, graphs built with rrdtool are used. The user can build and analyze graphs of the values that triggered an alert and can also easily find any graph of interest in a drop-down list. The graphs (strictly speaking, the rrd databases from which they are built) hold the last 28 days of data and are conveniently zoomed in time with the mouse. For example, as in the figure below, you can select a region with the left mouse button and get a graph of that time interval within a tenth of a second.

If the VoIP parameters for the time selected on the graph need closer analysis, the user can, with one click, build a summary report in table form for the data shown on the graph: ASR, ACD, minutes and calls by consumer, supplier and direction.

The summary report shows the parameters for the selected time interval and can also show the parameters for the same interval exactly one day earlier.

This makes it possible to see quickly what has changed in the traffic before analyzing the cause of the failure further.
For deeper analysis, the user can build a table detailed down to individual calls, including jitter, codecs, packet loss, MOS and more.
The monitoring system implements many other useful features whose description would take too long.
Thanks to all this, the monitoring system is used by all levels of technical support and by the routing control department. Although the monitoring system provides no data on prices or margins, the sales department uses it too, since they are often interested in the quality of partners and directions.
Now let us consider the disadvantages and advantages of monitoring by forecasting.

Disadvantages of the method

The disadvantages include:
• Relatively slow reaction. Yes, the method works and failures are detected, but not quickly. Even with a window length of 3 and a threshold of 2, a fairly long time passes before the system raises a failure alert: with measurements taken every 15 minutes, at least two intervals must pass.
• A relatively large number of false positives. In this particular monitoring system, however, this is offset by the traffic analysis functionality, which makes false positives very easy to identify.
• The system is based on the analysis of statistical data, and statistics works well only on large amounts of data. If, say, a direction has almost no traffic, there is nothing to predict.

Advantages of the method

The advantages are as follows:
• The system works completely autonomously. There is no need to define failure criteria; the system sets them itself. When the traffic profile changes, the system raises an alert. If the profile has changed for good, the system adjusts itself to the new profile after a while.
• The implementation with rrdtool is very intuitive. Employees learn it quickly, even with little experience.
• IMHO, it is currently the only feasible method of monitoring transit VoIP that does not require significant financial and human resources.

Conclusion

Despite its rather large size, this is an overview article. I plan to write several more articles on the forecasting method implemented in rrdtool, laying everything out in detail. Besides VoIP, we use the method to monitor traffic on Internet links and even CPU load on Cisco routers.

PS
By the way, the author of the popular monitoring system Munin also plans to implement monitoring by forecasting. Here is the phrase about it: “anomality detection with the help of rrd”. Hopefully we will see this feature soon; in any case, you already know what it is about.

Source: https://habr.com/ru/post/132400/
