Collecting advanced upstream statistics using nginx-sla

Introduction

Improving customer service consistently leads to higher loyalty. And not only in the sense of adherence to a specific online product, but also in the sense of tolerance for its disadvantages, provided, of course, that the advantages are speed, usability, functionality, etc. - they are outweighed.

Of course, we cannot measure the quality of service directly; however, even such an ephemeral value can in principle be reduced to a set of quantitative characteristics, one way or another indirectly affecting quality. Profit, number of customers, percentage of converted leads (leads - registered or interested users), etc. - all this is quite objective indicators. In addition, these values can be included in the performance monitoring system as KPIs - key performance indicators.

From our engineering point of view, similar characteristics are the response time and the HTTP code of the upstream response. Indeed, product design, product functionality, marketing efforts, and customer calling are outside our area of expertise. Therefore, we need to focus on what is in our power — accelerating the operation of the application infrastructure and processing client requests.
')
Analysis of the response and HTTP-codes is conveniently carried out on the basis of some collected statistical database, and here we smoothly approach the topic of the article.

What and why we measure

As mentioned above, our task is to assess the quality of the application, and not the actual measurement of the speed of its work (profiling) or optimization. For this there are specialized tools. The main indicators of the quality of the application are the upstream response time and the HTTP response code, since it is upstream that performs the main function of generating product content and processing user requests.

On the other hand, the analysis of the transfer rate of an upstream page through channels to the client’s computer does not interest us in this context, since tells us nothing about the quality of the product itself. A client network can have a different architecture and speed, and the low bandwidth of a client can only mean that it works, for example, via a 3G modem.

In the same sense, analysis of page rendering speed in the browser is hardly applicable at this stage. The share of upstream in the process of displaying a particular page to the user from the moment of receiving a request from him and until the end of loading of all components of the page and rendering it by the browser is about 5-10%. It turns out that, taking into account these factors, we will reduce the entire measurement result to a significantly and difficult to calculate error. However, this is a topic for a completely separate conversation.

Thus, statistics should be collected on two basic parameters:

Upstream response time to request.
Analyzing the statistics on this parameter, we will not only be able to assess the overall speed of the upstream, but also to identify certain patterns (patterns) in changing the load, varying the average response time at different times of day, etc.
HTTP response code
It is clear that it will not be possible to identify and eliminate all errors - no one has yet canceled external links. However, it is certainly worth striving to reduce the number of errors. Both the “average hospital temperature” (codes 2xx, 4xx) and the exact distribution of all HTTP codes should be saved and analyzed. The predominance, for example, of the 5xx codes in the statistics, will indicate our mistake (and therefore the reduction in the quality of the application), which needs to be corrected.

Features of measurement

When analyzing the response time, it would be nice to take into account the nature of the request, processing and analyzing statistics on various types of requests separately. Indeed, a heavy SQL query can significantly increase the page formation time, while small AJAX queries performed by users only “load” upstream slightly. In this regard, it would be most expedient to distinguish categories of requests with different response time limits, for example:

AJAX request - 200ms
Page generation - 1s
Error - 5s
Timeout or no response - 7s

Unfortunately, it is not always possible to explicitly separate these or other work scenarios. For example, the same script can generate different content, depending on the working conditions, which will lead to an incorrect interpretation of measurement results.

In this case, it is possible to formulate general rules for estimating response time based on percentage time ranges. For example:

90% <200ms
95% <500ms
99% <1s
100% <2s

Thus, the output of statistical values beyond the boundary of one of the ranges will serve as an indicator of a decrease in the quality of the application’s work.

How to evaluate the measurement result

Collect statistics (as below) and get the results of measurements of response time - this is only half the battle. Without further analysis and evaluation of the result (good, bad), all this does not make any sense. Fortunately, industry standards come to the rescue of us at the response time of the system:

ESD / MITRE (ESD-TR-86-278, “Guidelines for Designing User Interface Software”, 1986, www.dfki.de/~jameson/hcida/papers/smith-mosier.pdf )
TAFIM ("Technical Architecture Framework for Information Management", 1996, volume 8, "DoD Human Computer Interface Style Guide," www.everyspec.com/DoD/DOD-General/DISA_TAFIM_VOL8_7545 )
MIL-STD 1472 (Revision G, 2012, www.everyspec.com/MIL-STD/MIL-STD-1400-1499/MIL-STD-1472F_208 )

In accordance with these standards, the maximum user wait time should not exceed 10-15 seconds, the average request processing time should be no more than 2-5 seconds, and the system response time should be 0.2-0.5 seconds (the standards were written on the basis of the capabilities of the technology of those times which has now well increased its productivity).

Another source is “Designing and Engineering Time: The Psychology of Time Perception in Software” (2008, www.amazon.com/Designing-Engineering-Time-Psychology-Perception/dp/0321509188 ) - examines an approach based on user timeout:

The system should show the instant reaction (0.1 - 0.2 s) when pressing buttons, calling the menu, interacting with other interface elements.
Immediate response (0.5 - 1 s) should follow in response to a simple user request.
An uninterrupted reaction (2–5 s) is the time during which the user retains attention to the task, in other words, the feedback time.
Forcing reaction (7 - 10 s) - when after 7 seconds the user switches to another task or leaves the page.

Thus, comparing statistical data with the indicated values, we can carry out the conversion of a quantitative characteristic (upstream response time) into a qualitative one — compliance of the system’s response to the standards and subjective perception of the user.

We proceed to the direct measurement.

What to measure

Consider solving the problem of collecting statistics for the popular nginx server.

When using nginx as a front end, all the information we need is available in the logs. The classic way of collecting statistics through the analysis of logs involves a detailed analysis (parsing) of the contents of the logs for a specified period of time. This path has several disadvantages:

the analysis of logs in itself requires a significant investment of time and resources;
log analyzers are often expensive, and writing your own solution requires the additional involvement of developers;
exporting parsing results for further analysis is difficult.

The second approach is the use of additional modules. Consider a number of options.

nginx-statsd
A module for nginx that collects statistics and redirects it to the StatsD daemon using UDP protocol. This approach, firstly, requires a separate server, and secondly, leads to an increase in service traffic. In addition, StatsD is tied to Graphite, and the integration of other monitoring systems with it is difficult.
pinba
A module for nginx that collects statistical data at the time when the request is completed and sends them via UDP to a separate statistics collection server. In fact, the disadvantages of this solution are due to the very principle of its work: the need for a separate server, “extra” UDP traffic. In addition, the module actually fixes the time from the moment of receiving the request to its completion (shutdown), i.e. measures the execution time of the request, not the upstream response time we need.
nginx-sla
The nginx-sla module implements the collection of upstream statistics without the involvement of an additional server. These statistics are generated on request to nginx in plain text format, therefore, neither a separate server nor UDP is needed.

It should be noted that the described disadvantages of the above methods of collecting statistics, of course, in no way limit the usefulness of these approaches in general. However, it is for our task that the most simple and flexible tool is nginx-sla.

Benefits of nginx-sla

As already noted, the use of the nginx-sla module does not require a dedicated server for collecting and analyzing statistics. Statistics is collected by direct analysis of nginx logs and recorded in plain text. Obviously, in the context of further analysis, both standard tools and third-party tools (for example, zabbix), such a presentation of upstream statistics is much more profitable.

Example of statistics collected by nginx-sla, with default configuration:

main.all.http = 1024 main.all.http_200 = 914 ... main.all.http_xxx = 2048 main.all.http_2xx = 914 ... main.all.time.avg = 125 main.all.time.avg.mov = 124 main.all.300 = 233 main.all.300.agg = 233 main.all.500 = 33 main.all.500.agg = 266 main.all.2000 = 40 main.all.2000.agg = 306 main.all.inf = 0 main.all.inf.agg = 306 main.all.25% = 124 main.all.75% = 126 main.all.90% = 126 main.all.99% = 130

As you can see, the data displays the statistics of all processed HTTP responses with different statuses. For example, the value main.all.http_200 (main.all.http_404, main.all.http_500) denotes the number of processed responses with the corresponding status.
And the field main.all.http_2xx contains the number of all upstream responses that have the status “Success”.

Statistics on response time is collected in the variables main.all.300 (main.all.200, main.all.500, main.all.2000). For example, the number of requests processed by upstream in the time from 300 milliseconds to 500 milliseconds is 33. And the aggregated number of requests processed in 500 ms or less is 266.

In addition to the explicit statistics of upstream response time, nginx-sla lists information on percentiles. In particular, the time for which the Upstream is responsible for 90% of requests is 126 milliseconds in this example.

The percentile representation of the statistics, as well as the values of the average (main.all.time.avg) and moving average (main.all.time.avg.mov) response times, allow analyzing the behavior of the upstream under various load scenarios. For example:

Conclusion and conclusions

Now it is time to return to the beginning of the article and remember that our goal was to improve the quality of the product. What information can we gather from the collected statistics? Analyzing the statistics obtained for a certain period, it is natural to ask: has it become better or worse? After all, our goal, as we indicated earlier, is to convert quantitative characteristics into qualitative ones.

Here it is appropriate to recall two simple laws:

The Weber-Fechner law, which states that the intensity of the sensation is proportional to the logarithm of the intensity of the stimulus. Or otherwise: the difference in sensations appears if the stimuli differ by a certain fraction of their magnitude, and not by an absolute value.
The absolute threshold of human perception of time is approximately 15 ms.

For periods of time lasting up to 30 s, the threshold of perception of the difference is approximately 20%. In other words, users do not notice regression or acceleration within 20% of the current value. The user can detect the difference of more than 20%, but if this difference is less than the minimum perception threshold of time, then it will pass unnoticed.

findings

The quality of the application can be reduced to a set of quantitative characteristics. In our case, the quality of upstream work can be reduced to response time and HTTP responses with different statuses.
Analysis of these characteristics is conveniently carried out by modules for nginx.
As a working tool, we use nginx-sla as the most flexible and undemanding module in comparison with analogues.
nginx-sla allows you to group statistical information and, thus, convert quantitative data into values suitable for qualitative analysis (mean, floating average, quantile).
The collected statistics are analyzed in order to exceed the limit values specified by the standard and are compared with the human time perception threshold. The conclusion is made about the need for further action.

PS The module has enough time without any complaints in production. Feedback, bugs, pull-requests are welcome.

Source: https://habr.com/ru/post/177215/

All Articles