
Improving testing with real production traffic

TL;DR: The closer your test data is to reality, the better. Try Gor: automatic replay of production traffic to a staging site in real time.

Here at Granify we process a huge amount of user-generated data; our business is built on it. We need to make sure that data is collected and processed correctly.

You cannot even imagine how strange the data coming from users can be. Sources include proxy servers, browsers you have never heard of, client-side errors, and so on.

No matter how many tests and fixtures you have, they simply cannot cover all cases. Production traffic will always differ from what you expect.

Moreover, we can break everything simply by deploying a new version, even if all the tests pass. In my experience, this happens all the time.

There is a whole class of errors that are very difficult to find with automated and manual testing: concurrency issues, server configuration errors, errors that occur only when commands are called in a certain order, and much more.
But there are several things we can do to make such bugs easier to find and improve system stability:

Always test on staging


Having a staging environment is mandatory, and it must be identical to production. Tools such as Puppet or Chef make this much easier.

You should require developers to always manually test their code on staging. This helps catch the most obvious mistakes, but it is still very far from what real production traffic can produce.

Testing on real data


There are several techniques for testing your code on real data (I recommend using both):

1. Update only one of the production servers, so that some of your users are served by the new code. This technique has several drawbacks: some of your users may see errors, and you may have to use sticky sessions. It is quite similar to A/B testing.

2. Replaying production traffic (log replay)

Ilya Grigorik wrote a wonderful article about load testing using the log replay technique.

All the articles I have read on this topic mention log replay as a means of load testing with real data. I want to show how to use this technique for day-to-day testing and bug hunting.

Programs such as JMeter, httperf, or Tsung support log replay, but that support is either rudimentary or focused on load testing rather than emulating real users. Feel the difference? A real user is not just a set of requests: the correct order, the timing between requests, the various HTTP headers, and so on all matter. For load testing this is sometimes unimportant, but it is critical for finding errors. In addition, these tools are difficult to configure and automate.
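To make the difference concrete, here is a minimal Python sketch of user-emulating log replay: it preserves request order, the delays between requests, and the original headers. The `LoggedRequest` type and the pluggable `send` callback are illustrative assumptions, not part of any of the tools above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class LoggedRequest:
    timestamp: float                 # seconds, as parsed from the access log
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)

def replay(log: List[LoggedRequest],
           send: Callable[[LoggedRequest], None],
           speedup: float = 1.0) -> None:
    """Re-send logged requests in order, keeping the original pacing.

    Unlike a pure load-testing replay, the inter-request delays and the
    recorded headers are preserved, so the traffic looks like real users.
    """
    if not log:
        return
    prev_ts = log[0].timestamp
    for req in log:
        delay = (req.timestamp - prev_ts) / speedup
        if delay > 0:
            time.sleep(delay)
        send(req)                    # e.g. an HTTP client pointed at staging
        prev_ts = req.timestamp
```

Pointing `send` at an HTTP client that targets your staging host turns this into a crude, scriptable replayer.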

Developers are lazy. If you want your developers to use a program or service, it should be as automated as possible, and ideally it should work so transparently that no one even notices it.


Replaying production traffic automatically



So I wrote a simple program: Gor.

Gor automatically replays production traffic on staging in real time, 24 hours a day, with minimal overhead. This way, your staging environment constantly receives a portion of real traffic.

Gor consists of two parts: the Listener and the Replay server. The Listener is installed on the production web servers and duplicates all traffic to the Replay server on a separate machine, which then forwards it to the right address. The principle of operation is shown in the diagram below:
[Diagram: the Listener on production duplicates traffic to the Replay server, which forwards it to staging]
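Assuming an early Gor release, wiring the two parts together might look like the following; the hostnames, ports, and flags here are illustrative, so check the project documentation for the exact syntax.

```shell
# On each production web server: capture traffic on port 80 and
# forward it to the Replay server (flags assumed from an early release).
sudo gor listen -p 80 -r replay.internal.example.com:28020

# On the Replay server: receive the captured traffic and re-send it
# to the staging environment.
gor replay -f "http://staging.example.com" -p 28020
```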

Gor supports limiting the number of requests. This is a very important setting, since staging usually runs on fewer machines than production, and you can set the maximum number of requests per second that your staging environment can handle.

You can find detailed documentation on the project page.

Since Gor is written in Go, you can simply download a precompiled binary.

At Granify, we have been using Gor in production for some time, and we are very pleased with the results.

Happy testing!

Source: https://habr.com/ru/post/182202/

