Key Performance Indicator Analysis - Part 1

Testing and performance analysis is a topic that I would like to discuss more. We are starting to publish a translation guide from the notorious Patterns & Practices team about what key performance indicators are needed with. For the translation - thanks to Igor Shcheglovitov from Kaspersky Lab, our permanent author of the materials about testing. The rest of our articles on testing can be found on the tag mstesting

Introduction

Performance analysis is a complex discipline. It examines the system for performance requirements and determines the reasons if these requirements are not achieved. The Performance Analysis Primer article from this series contains an introduction to this topic, describing the tools and approaches used in cloud development in order to achieve good performance.

The goal of this tutorial is to learn how to examine system performance. In particular, this guide describes the key performance metrics you should pay attention to, as well as how to use these metrics to determine how well your system works.

Note : Many performance problems in large applications are due to the fact that they do not reproduce under a relatively low load, while under high (stress) load they lead to a significant slowdown or system failures. An important aspect of the development of scalable systems is to prevent these situations. To help you with this task, we have published an article containing a set of anti-patterns of cloud-based anti-patterns .
')
The concept of key performance indicators

System performance is determined by the following factors:

How many operations can the system perform in a certain period of time? (bandwidth)
How many operations can be performed simultaneously? (parallelism)
How long does the system perform the operation? (response time / delay)
How much backup power does the system need to grow? (stock of resources)
How many exceptions does the system generate under load? (error rate)

These factors may be obvious, but, of course, they will vary depending on the level of load on the system. With enough resources and low load, most systems perform operations quickly. The critical moment comes when the load increases to some limit level. How will the system handle several million parallel queries or queries that consume large amounts of available resources? The graph shows a typical functioning profile of a test cloud service, which accepts and processes user requests. The graph illustrates how throughput and average response time change with increasing user load. This schedule was obtained in the process of controlled load testing. The horizontal axis is time. The test assumes that approximately the same time and amount of resources is required to complete each request. The time intervals between requests are subject to normal distribution, which simulates the realistic behavior of virtual users.

Note : Realistic user behavior is an important feature of load testing. Without this approach, the results may be wrong. For more information, see the Scenarios article.

Performance profile based on user load for a simple cloud service

At first, throughput is low, but with increasing load it grows to a certain limit, which depends on the capabilities (capacity) of the service itself. This capacity is most likely set at the design stage or determined by the resources used by the service. It may be limited to:

The volume of network traffic between clients and service
CPU load, number of threads and available memory
If the service uses a database, then you can add the number of available connections to the database, the amount of traffic between the service and the database, etc.
The capacity of external systems on which the service depends. For example, the capacity of an Azure SQL Database is defined by the term DTUs (Database Throughput Units). If you exceed the DTUs allocated to you, then subsequent queries to the database will not be executed until you acquire additional capacity. Other services have limitations on physical resources (streams, memory, connections, etc.)

Note : make sure that the tool you are using for performance testing does not produce false results. For example, if the tool is not able to provide a sufficient number of testing agents to simulate the required load, the throughput may not be correctly determined.
However, this graph does not always show the full picture. Despite the fact that it shows the maximum throughput, the latency (delay) with a further increase in performance will increase. This means that more and more requests will be in the waiting queue. At some point, these requests may begin to fall in timeout and the corresponding results will be returned to the client. In addition, if the service uses a web server, for example, IIS, then after the external connections of the web server itself are exceeded, the latter will begin to reject new connections.

Another note : the lack of service capacity on which your system depends is called “back-end pressure” (external pressure). In many systems, such services are often factors that limit performance. These services are often external, their management can be transferred (or is already controlled) to others who may affect performance.
The figure illustrates the load profile of the same test cloud service under high load.

This graph shows that as soon as the load reaches 600 users, requests start to fall in timeout and the system starts to generate exceptions. Increasing the load leads to an increase in error generation. Please note that the frequency of successful requests will also fall. I note that the average request time also falls, because the service starts to reject requests very quickly. The important point is that the measurement of response time is not an indication of how well the system works.
Another difficulty is that the system can be sufficiently stable to recover (at least temporarily) and that the number of requests waiting in the queue will disappear, and subsequent requests will be processed successfully. The system may enter the period when the frequency fluctuations of successfully executed queries will alternate cyclically with the failure rate. The figure shows a graph illustrating this phenomenon, which reflects the Improper Instantiation anti-pattern .

Frequency fluctuations of successful and unsuccessful service calls under increasing load

In this example, the system is restored periodically, and then drops again. Please note that the frequency of exceptions continues to increase with each subsequent peak, this indicates that at some point the system will completely fail.

The main purpose of these graphs is to emphasize the relationship between factors that help determine the "well-being" of the system under load: it is either healthy, is about to collapse, or something in between. Load testing in a test environment can help you determine the acceptable capacity of your services and assess whether you need scaling to meet performance requirements.

On the graphs above, it was a simple (ideal) situation where the tested services are isolated. But in the real world, there are problems of extraneous noise. "Users behave unpredictably and perform many different operations at the same time. In this case, the system performance at an arbitrary point in time will be determined by the set of point operations performed at that time.

If users begin to receive errors or the system slows down, then this may be due not to a specific piece of logic, but to the total load on the system. All this underscores the importance of continuous performance monitoring. If you have accurate and up-to-date performance information, you can quickly respond to sudden surges in user activity that require additional resources to process requests. The performance data that you collect should include a sufficient amount of contextual information so that you can reduce various metrics into a full-fledged vision of the life cycle of the system process. This information is vital during performance analysis. It helps you understand how the various parallel processes that make up your system can coexist, interact, and compete with each other. The process of collecting such information is described in detail in the Performance Analysis Primer article.

Parsing indicators by abstraction levels:

In most systems, there are many metrics that can be collected and analyzed. Without careful consideration, you can easily get lost in the set of collected data, or miss important, forgetting to collect key metrics. The classification below helps sort the metrics according to certain levels of abstraction. By making the right distribution, you can focus only on the data that determines the performance of the appropriate level. This series of articles will describe the following levels of abstraction:

Client Metrics . These metrics are focused on measuring the performance of client applications. For example, how long the client application performs actions locally and processes the response from the server side. These metrics cover data such as used memory and CPU load. On a mobile device, high CPU utilization and frequent network use can lead to a decrease in battery life, and using too much memory can prevent the application from starting.
Business Metrics . This includes data defining business processes. They relate to the activities of end users. These metrics should include key business operations that the system performs.
Application Metrics . These metrics focus on measuring application activity and performance (application source code, framework, runtime, such as the .NET Framework, ASP.NET, CLR, etc.). The purpose of these metrics is to help you research the flow of your application with a large number of parallel queries, analyze the resources consumed and estimate the likelihood of performance problems.
System Metrics . This is low-level performance data (base infrastructure level). They usually focus on key performance indicators related to memory, network utilization, disk activity, processor utilization
Service Metrics . This includes data related to the performance of external services, such as Azure Storage, messaging infrastructure, external cache, database and other external services that your application can use. These data do not reflect the net performance of external services, but merely contain information about the execution of requests that your system sends to them.

Thanks for attention. In the next part, we will look at these metrics in more detail and move on to the following topics.

Parts of the article:
Key Performance Indicator Analysis - Part 1
Analysis of key performance indicators - Part 2, analysis of user, business and application metrics
Analysis of key performance indicators - part 3, the last, about system and service metrics.

Source: https://habr.com/ru/post/271547/

All Articles

Key Performance Indicator Analysis - Part 1

More articles: