
Slow Cooker: load testing network services


Linkerd, our service mesh for cloud-based applications, has to cope with large volumes of network traffic over long periods of time. Before each release, we carefully verify that it still meets this requirement. In this article we describe the load testing strategies and tools we use and look at a few of the problems they have uncovered. Along the way we introduce slow_cooker, an open-source load testing tool written in Go, built to run long-duration load tests and to surface lifecycle issues.


Linkerd acts as a transparent proxy. On top of service-to-service requests it adds connection pooling, failover, retries, latency-aware load balancing and much more. To be viable in production, linkerd must handle very large numbers of requests over long periods of time in a changing environment. Fortunately, linkerd is built on Netty and Finagle, whose code is among the most widely tested and production-hardened in network programming. But code is one thing; real-world performance is another.


To evaluate how the system behaves under production conditions, linkerd must be put through thorough, comprehensive load testing. Moreover, since linkerd is part of the underlying infrastructure, its instances are rarely stopped or restarted, and each of them may carry billions of requests while the behavior of the services and their clients keeps changing around it. This means we also need to test for lifecycle issues. For high-throughput network servers such as linkerd, lifecycle problems include memory and socket leaks, bad GC pauses, and saturation of the network and disk subsystems. Such problems are rare, but if they are not handled properly the consequences can be disastrous.


Who tests the testing software?


In the early days of linkerd development we used the popular load testing tools ApacheBench and hey. (They only speak HTTP, while linkerd proxies a variety of protocols, including Thrift, gRPC and Mux, but we had to start somewhere.)


Unfortunately, we quickly realized that, useful as these tools are for getting quick performance numbers, they are poor at exposing the lifecycle problems we wanted to learn to identify. They print a single summary only after the test completes, which makes it easy to miss problems, and they rely on means and standard deviations, which in our view is not the best way to characterize system performance.


To identify lifecycle problems, we needed better metrics and the ability to see how linkerd behaves during long tests that run for hours or days, not minutes.


For tender code, cook it low and slow


Since we could not find a suitable tool, we built our own: slow_cooker. slow_cooker is a load testing program designed specifically for long-running load tests and for surfacing lifecycle problems. We use it extensively to hunt for performance issues and to test changes in our products. slow_cooker provides incremental reports on the progress of a test, change detection, and all of the metrics we need.


So that others can use slow_cooker and take part in its development, today we are opening its source code. See the slow_cooker source on GitHub and the freshly published 1.0 release.


Let's talk about the features that slow_cooker provides.


(For simplicity, the examples below test web services directly. In practice, of course, we use slow_cooker primarily to find problems in linkerd itself rather than in the services it proxies.)


Incremental latency reports


Since slow_cooker is aimed above all at identifying lifecycle problems that emerge over long periods of time, it is built around the idea of incremental reports. Too much can be hidden when we average a very large amount of data, especially for transient phenomena such as garbage-collector activity or network saturation. With incremental reports, we can see changes in throughput and latency directly on a running system.
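As an illustration of the idea (a simplified sketch, not slow_cooker's actual implementation), an incremental reporter can collect latencies on a channel and summarize each interval separately instead of aggregating the whole run:

// A minimal sketch of interval-based reporting: latencies arrive on a channel
// and are summarized at each tick, then the window is reset, so a single bad
// interval cannot hide inside a whole-run average.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func report(latencies <-chan time.Duration, interval time.Duration) {
	ticker := time.NewTicker(interval)
	var window []time.Duration // samples seen in the current interval only
	for {
		select {
		case l := <-latencies:
			window = append(window, l)
		case t := <-ticker.C:
			var max time.Duration
			for _, d := range window {
				if d > max {
					max = d
				}
			}
			fmt.Printf("%s  %5d requests, max %v\n",
				t.UTC().Format(time.RFC3339), len(window), max)
			window = window[:0] // reset: the next interval starts from scratch
		}
	}
}

func main() {
	latencies := make(chan time.Duration, 1024)
	go func() { // stand-in for real request workers recording response times
		for {
			latencies <- time.Duration(rand.Intn(10)) * time.Millisecond
			time.Sleep(2 * time.Millisecond)
		}
	}()
	report(latencies, 10*time.Second)
}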


The example below shows slow_cooker output captured while load testing linkerd. In this test scenario, linkerd balances load across three nginx servers, each serving static content. Latencies are reported in milliseconds: min, p50, p95, p99, p999 and max, computed over ten-second intervals.


$ ./slow_cooker_linux_amd64 -url http://target:4140 -qps 50 -concurrency 10 http://perf-target-2:8080
# sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
#                      good/b/ft good%  min [p50 p95 p99 p999]  max change
2016-10-12T20:34:20Z   4990/0/0 5000  99% 10s   0 [  1   3   4    9 ]    9
2016-10-12T20:34:30Z   5020/0/0 5000 100% 10s   0 [  1   3   6   11 ]   11
2016-10-12T20:34:40Z   5020/0/0 5000 100% 10s   0 [  1   3   7   10 ]   10
2016-10-12T20:34:50Z   5020/0/0 5000 100% 10s   0 [  1   3   5    8 ]    8
2016-10-12T20:35:00Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5   11 ]   11
2016-10-12T20:35:21Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:31Z   5019/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:41Z   5020/0/0 5000 100% 10s   0 [  1   3   6   10 ]   10
2016-10-12T20:35:51Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:36:01Z   5020/0/0 5000 100% 10s   0 [  1   3   5   10 ]   10
2016-10-12T20:36:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:36:21Z   5020/0/0 5000 100% 10s   0 [  1   3   6    9 ]    9

This report shows throughput in the good% column: how close we came to the requested rate. With -qps 50 and -concurrency 10 the target is 500 requests per second, i.e. 5000 requests per ten-second interval, so 4990 good responses shows up as 99%.


This report looks good: the system is fast and its response times are stable. But we also need to be able to see clearly where and when trouble starts. slow_cooker's output is designed to make problems and outliers easy to spot by eye, thanks to the vertical alignment of the columns and the change indicator on the right. Let's look at an example in which the server became very slow:


$ ./slow_cooker_linux_amd64 -totalRequests 100000 -qps 5 -concurrency 100 http://perf-target-1:8080
# sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
#                      good/b/ft good%  min [p50  p95  p99 p999]  max change
2016-11-14T20:58:13Z   4900/0/0 5000  98% 10s   0 [  1    2    6    8 ]    8 +
2016-11-14T20:58:23Z   5026/0/0 5000 100% 10s   0 [  1    2    3    4 ]    4
2016-11-14T20:58:33Z   5017/0/0 5000 100% 10s   0 [  1    2    3    4 ]    4
2016-11-14T20:58:43Z   1709/0/0 5000  34% 10s   0 [  1 6987 6987 6987 ] 6985 +++
2016-11-14T20:58:53Z   5020/0/0 5000 100% 10s   0 [  1    2    2    3 ]    3 --
2016-11-14T20:59:03Z   5018/0/0 5000 100% 10s   0 [  1    2    2    3 ]    3 --
2016-11-14T20:59:13Z   5010/0/0 5000 100% 10s   0 [  1    2    2    3 ]    3 --
2016-11-14T20:59:23Z   4985/0/0 5000  99% 10s   0 [  1    2    2    3 ]    3 --
2016-11-14T20:59:33Z   5015/0/0 5000 100% 10s   0 [  1    2    3    4 ]    4 --
2016-11-14T20:59:43Z   5000/0/0 5000 100% 10s   0 [  1    2    3    5 ]    5
2016-11-14T20:59:53Z   5000/0/0 5000 100% 10s   0 [  1    2    2    3 ]    3

FROM     TO #REQUESTS
   0      2     49159
   2      8      4433
   8     32         8
  32     64         0
  64    128         0
 128    256         0
 256    512         0
 512   1024         0
1024   4096         0
4096  16384       100

As you can see, the system is fast and responsive except for a single hitch at 2016-11-14T20:58:43Z, during which throughput fell to 34% before returning to normal. As the owner of this service, you would probably want to check the logs or performance metrics around that time to find the cause of the incident.
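The change column on the right is what draws the eye to that interval. As a rough, purely hypothetical illustration of how such a flag could be produced (this is not necessarily how slow_cooker computes it), one could compare each interval's p99 against an exponentially weighted baseline:

// Hypothetical change-detection sketch: flag intervals whose p99 deviates
// strongly from a smoothed baseline. Not slow_cooker's actual algorithm.
package main

import (
	"fmt"
	"strings"
	"time"
)

type changeDetector struct {
	baseline float64 // smoothed p99, in milliseconds
}

// flag returns "+"s when the current interval is much slower than the baseline
// and "--" when it is much faster (i.e. the system recovered or sped up).
func (c *changeDetector) flag(p99 time.Duration) string {
	cur := float64(p99) / float64(time.Millisecond)
	defer func() { c.baseline = 0.9*c.baseline + 0.1*cur }() // fold into baseline
	if c.baseline == 0 {
		c.baseline = cur
		return ""
	}
	ratio := cur / c.baseline
	switch {
	case ratio > 1.5:
		n := int(ratio)
		if n > 3 {
			n = 3
		}
		return strings.Repeat("+", n) // the bigger the spike, the more pluses
	case ratio < 0.67:
		return "--"
	}
	return ""
}

func main() {
	d := &changeDetector{}
	for _, p99 := range []time.Duration{
		3 * time.Millisecond, 4 * time.Millisecond,
		6987 * time.Millisecond, // the slow interval from the report above
		3 * time.Millisecond,
	} {
		fmt.Printf("p99=%-8v change=%s\n", p99, d.flag(p99))
	}
}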


Example of a lifecycle problem: GC pauses


To show the advantage of incremental reports over the usual summary-only reports, let's simulate a server suffering from garbage-collector pauses. In this example we directly test a single nginx process serving static content. To imitate GC-induced stalls, we pause and resume nginx in a loop at five-second intervals (using kill -STOP $PID and kill -CONT $PID).
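That pause/resume cycle is easy to script. Below is a small Go equivalent of the loop described above, a sketch that assumes the nginx master PID is passed as the program's only argument:

// Freeze and thaw a process in a loop to simulate long GC pauses.
// Hypothetical usage: ./pause-nginx $(pgrep -o nginx)
package main

import (
	"os"
	"strconv"
	"syscall"
	"time"
)

func main() {
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}
	for {
		syscall.Kill(pid, syscall.SIGSTOP) // freeze nginx: in-flight requests stall
		time.Sleep(5 * time.Second)
		syscall.Kill(pid, syscall.SIGCONT) // resume: traffic flows again
		time.Sleep(5 * time.Second)
	}
}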


For comparison, let's start with a report from ApacheBench:


$ ab -n 100000 -c 10 http://perf-target-1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking perf-target-1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        nginx/1.9.12
Server Hostname:        perf-target-1
Server Port:            8080

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      10
Time taken for tests:   15.776 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      84500000 bytes
HTML transferred:       61200000 bytes
Requests per second:    6338.89 [#/sec] (mean)
Time per request:       1.578 [ms] (mean)
Time per request:       0.158 [ms] (mean, across all concurrent requests)
Transfer rate:          5230.83 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       3
Processing:     0    1  64.3      0    5003
Waiting:        0    1  64.3      0    5003
Total:          0    2  64.3      1    5003

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%   5003 (longest request)

Here we see a mean latency of 1.5 ms, with a handful of much slower outliers. It would be easy to mistake this report for a healthy one, even though the service under test was unresponsive for exactly half of the time the test was running. If the SLA target is one second, the service violated it for more than half of the run, yet you might never notice that from this report!


With slow_cooker's incremental reports, we can see that there is a recurring throughput problem. It is also far more obvious that p999 stays at consistently high values throughout the test:


$ ./slow_cooker_linux_amd64 -totalRequests 20000 -qps 50 -concurrency 10 http://perf-target-2:8080
# sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
#                      good/b/ft good%  min [p50 p95 p99 p999]  max change
2016-12-07T19:05:37Z   2510/0/0 5000  50% 10s   0 [  0   0   2  4995 ] 4994 +
2016-12-07T19:05:47Z   2520/0/0 5000  50% 10s   0 [  0   0   1  4999 ] 4997 +
2016-12-07T19:05:57Z   2519/0/0 5000  50% 10s   0 [  0   0   1  5003 ] 5000 +
2016-12-07T19:06:07Z   2521/0/0 5000  50% 10s   0 [  0   0   1  4983 ] 4983 +
2016-12-07T19:06:17Z   2520/0/0 5000  50% 10s   0 [  0   0   1  4987 ] 4986
2016-12-07T19:06:27Z   2520/0/0 5000  50% 10s   0 [  0   0   1  4991 ] 4988
2016-12-07T19:06:37Z   2520/0/0 5000  50% 10s   0 [  0   0   1  4995 ] 4992
2016-12-07T19:06:47Z   2520/0/0 5000  50% 10s   0 [  0   0   2  4995 ] 4994

FROM     TO #REQUESTS
   0      2     19996
   2      8        74
   8     32         0
  32     64         0
  64    128         0
 128    256         0
 256    512         0
 512   1024         0
1024   4096         0
4096  16384        80

Percentile-based latency reports


As the ApacheBench example shows, some load testing tools report only the mean and the standard deviation. These metrics are a poor fit for latencies, which rarely follow a normal distribution and often have very long tails.


In slow_cooker we do not report the mean and standard deviation; instead we show the minimum, the maximum and several percentiles (p50, p95, p99 and p99.9). This approach matches how modern software actually behaves: a single user request may fan out into dozens or even hundreds of calls to other systems and services, so upper percentiles such as the 95th and 99th end up dominating the latency the user actually experiences.
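A toy example makes the point (the numbers are illustrative, chosen to mirror the nginx pause test above): two five-second stalls among a hundred responses barely move the mean but completely dominate p99 and max.

// Why percentiles instead of a mean: a long tail is invisible in the average.
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the value at quantile q (0..1) from an ascending-sorted slice.
func percentile(sorted []time.Duration, q float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	idx := int(q * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// 98 fast responses plus two 5-second stalls, echoing the nginx pause test.
	var samples []time.Duration
	for i := 0; i < 98; i++ {
		samples = append(samples, 1*time.Millisecond)
	}
	samples = append(samples, 5*time.Second, 5*time.Second)

	var sum time.Duration
	for _, s := range samples {
		sum += s
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })

	fmt.Println("mean:", sum/time.Duration(len(samples))) // ~101ms: looks almost fine
	fmt.Println("p50: ", percentile(samples, 0.50))       // 1ms
	fmt.Println("p99: ", percentile(samples, 0.99))       // 5s: the real story
	fmt.Println("max: ", percentile(samples, 1.0))        // 5s
}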


Conclusion


Although writing a load testing tool is not particularly hard these days (especially in a modern, network-oriented language with built-in concurrency support, such as Go), the choice of measurements and the structure of the reports have a major effect on how useful such a program is.


We currently use slow_cooker extensively to test linkerd and other services (for example, nginx). Linkerd is exercised around the clock, 24x7, against a variety of services, and slow_cooker has helped us not only stop deployments with serious bugs but also find performance problems in existing releases. Its use at Buoyant has become so ubiquitous that we have started calling load testing itself "slow cooking."


You can get started with slow_cooker by visiting the releases page on GitHub. Download the tool and point it at your favorite server to see whether it has performance problems. slow_cooker has helped us enormously in testing linkerd, and we hope you will find it just as useful.



Source: https://habr.com/ru/post/318158/

