
Correct moving average for real-time monitoring

Proper real-time monitoring of a system is not as simple as it might seem at first glance.

The most common example is measuring a server's response time to requests. Suppose we have everything in place:
- for each request, the server measures the execution time and records it in a counter
- the server can report the counter value on external request
- a monitoring server collects the values every poll interval, then stores, aggregates, and graphs them

The naive solution, reading the instantaneous value of the counter, does not make much sense. If the poll interval is one or five minutes, we get the system's performance as measured by the last request alone. If every query in the preceding 5 minutes took 2 seconds or more and the last one was an easy 20 ms, we will see only the 20 ms. Or vice versa.

The standard solution is a moving average over the last N queries. It works fine as long as N requests are executed in less time than the poll interval. If the load drops, the graph turns out like this:


Between midnight and 4 am there were either no requests at all or fewer than N. The moving average did not change, which created the deceptive impression that the server was processing all requests in 6 ms.

Below is the same counter from another server, where the moving average has been modified.



The picture is much clearer: you can see where there were requests and where there were none.

The modification is quite simple. In addition to the parameter N, the window size of the moving average, one more parameter is introduced: T, the forgetting time (expiration time). All values in the window older than T are ignored when calculating the average.
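A minimal Python sketch of such a window may make the idea concrete. It is not the original author's code: the class name `ExpiringMovingAverage`, the parameters `max_samples` (N) and `max_age` (T, here in seconds rather than milliseconds), the injectable `clock`, and the choice to return `None` when every sample has expired are all assumptions made for illustration.

```python
import time
from collections import deque


class ExpiringMovingAverage:
    """Average of at most `max_samples` recent values, none older than `max_age` seconds."""

    def __init__(self, max_samples, max_age, clock=time.monotonic):
        self.max_age = max_age      # T: forgetting time (assumed to be in seconds)
        self.clock = clock          # injectable so the class can be tested with a fake clock
        # deque(maxlen=N) silently drops the oldest sample once N is exceeded
        self.samples = deque(maxlen=max_samples)

    def add(self, value):
        """Record one measurement (e.g. a request's execution time)."""
        self.samples.append((self.clock(), value))

    def average(self):
        """Return the average of the non-expired samples, or None if there are none."""
        cutoff = self.clock() - self.max_age
        # Samples are stored in arrival order, so expired ones are always on the left.
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        if not self.samples:
            return None  # assumption: report "no data" instead of a stale value
        return sum(value for _, value in self.samples) / len(self.samples)
```

In this sketch the server would call `add()` once per request and the monitoring endpoint would return `average()`. A `None` reading can then be drawn as a gap in the graph, which is one way to get the "no requests here" effect described above.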

Choosing T (ms) for a given poll interval (ms) is another interesting problem.
If T << poll interval (much shorter), values will be lost: they expire before the next poll can see them.
If T >> poll interval (much longer), the result looks like the first graph again, with stale values persisting through quiet periods.
As a first approximation, you can take T = 2 * poll interval.
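The three cases can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper `polled_averages` is hypothetical, times are in seconds, the window size N is ignored (treated as unlimited) to isolate the effect of T, and the sample data is invented for the example.

```python
def polled_averages(samples, poll_interval, max_age, end_time):
    """Return [(poll_time, average or None)] for polls at poll_interval, 2 * poll_interval, ...

    `samples` is a list of (timestamp, value) pairs; `max_age` plays the role of T.
    """
    readings = []
    poll_time = poll_interval
    while poll_time <= end_time:
        # Keep only the samples that are not older than max_age at the moment of the poll.
        recent = [v for t, v in samples if poll_time - max_age <= t <= poll_time]
        readings.append((poll_time, sum(recent) / len(recent) if recent else None))
        poll_time += poll_interval
    return readings


# Invented data: one slow request (2.0 s) at t = 10 s, then silence.
samples = [(10, 2.0)]
POLL = 60

print(polled_averages(samples, POLL, max_age=5, end_time=180))         # T << poll interval
print(polled_averages(samples, POLL, max_age=2 * POLL, end_time=180))  # T = 2 * poll interval
print(polled_averages(samples, POLL, max_age=600, end_time=180))       # T >> poll interval
```

With T = 5 s the request is never reported, because it has expired before the first poll. With T = 600 s every poll keeps reporting 2.0 long after traffic stopped. With T = 2 * poll interval the value shows up in the polls at 60 s and 120 s and then disappears.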

Source: https://habr.com/ru/post/66812/

