
This article is based on the talk of the same name given at Highload++ 2012. It is intended for managers, who can compare the testing described here with testing in their own projects; for programmers and system administrators, who will get a chance to see testing as a genuinely interesting job; and, of course, for testers.
In this article I will describe what testing can really be like: how we made testing productive and interesting work, what tasks we solve, and why things work well for us.

First of all, let's see what testing is. Testing is a software development activity that provides information about quality. Everyone knows that testing improves product quality; however, not everyone realizes that it does so precisely by providing this information.
Testing: boring tasks and interesting tasks
What do you need to do to test the first version of a program?

First, we need to set up the configuration: install and configure the operating system and all the programs and libraries that our program requires. Then we need to install and configure the program itself. After all this, we launch the program and, for each of the requirements placed on it by the customer, feed previously prepared input data to the program. This can be done through the user interface, input files, a database, over the network, or in some other way, depending on the function being tested. Next, we wait for processing to finish, obtain the output data and compare it with the reference. This is a typical general description of testing. In manual testing, all of these actions are performed by engineers by hand.

As the next versions of the program are released and new features are added, we will need to check more and more functions each time. That is, roughly speaking, labor costs will increase linearly with the number of functions. And at some point, the manager will be faced with a choice: either start partially skipping checks on the old functionality (regression testing), or constantly increase staff.
The situation may be particularly difficult if the program needs to be tested in several different configurations. In this case, the same actions will have to be repeated for each configuration.

It is important to note that only testing new functions will be an interesting task, and repeated testing of old functions (regression testing) is a monotonous repetitive work, a repetition of the operations that have been done more than once. Therefore, even with the constant expansion of staff, employees will be demotivated, which will negatively affect attentiveness and responsibility, and in the end will lead to missed errors.
Also, a previously prepared configuration can sometimes be lost by accident (for example, because a test server fails). Then the configuration setup step has to be carried out all over again, which is again a repetition of previous work and may take a long time. It is especially painful if the setup actions were never recorded anywhere.
With such an organization of work, testing becomes ineffective, and the work in it is truly sad and uninteresting.
How do we solve this problem?
Getting rid of boring work
Configuration setup
To organize test benches, we use Xen virtual machines: this way, parallel testing work does not interfere, and we can quickly create new benches. We could keep a reference disk image of a virtual machine for each test configuration, but this approach has drawbacks. First, changing such images requires mounting them to some virtual machine, which is inconvenient. Second, keeping reference images for every bench requires a lot of disk space. Third, versioning with this approach can only be organized by storing a complete copy of the image for every version of every bench.
To solve all these problems, we use the automatic configuration system Opscode Chef.

Chef stores "roles" in its database, i.e. sets of configuration scripts, as well as the scripts themselves ("recipes"). For each test bench, we have a "role" and the "recipes" it needs. When creating a virtual machine, we assign a "role" to it; when the machine is first powered on, it registers with the Chef server, receives its "recipes" from it and executes them. Thus, we only need to store one reference image per operating system required in testing. In addition, because the "recipes" and "roles" are kept in git, we get versioning. "Recipes" are programs in Ruby, which makes it possible to always see exactly what actions are performed to set up a bench. On top of that, Chef lets us use ready-made "recipes" written by the community.
Fast bench creation

To quickly create new virtual machines, assign roles to them, and also start, stop and delete them, we set up ConVirt Open Source, a web interface for managing virtual machines. We chose it because it is a free open-source solution written in Python, which lets us fix it and add new functionality to it. For example, we added the ability to specify a machine's "role" when it is created.
As a result, a machine is created with just one web form: in it, you select the reference disk image depending on which operating system the machine needs, assign a unique name and a "role", and click "OK".
So, test bench management is automated, and we can move on to automating the testing itself.
Test Automation

Test automation is necessary not only because it is important to save resources and get rid of boring, monotonous work. Another reason is that most of the programs we test have no user interface at all; this makes manual testing impossible and forces us to write programs that interact with them.
How do we automate tests? We write fully automated tests (that is, tests that automatically prepare the program for the test and generate input data, perform the test action, and automatically compare the result with a reference) based on UnitTest, the standard unit-testing framework for Python. It is important to note that we use this framework to write not unit tests but functional tests. Choosing UnitTest lets us use the full power of Python and its libraries in our tests. We can also use the Python libraries developed by our programmers, and Python extensions written in C allow us to use the libraries written by our C programmers.
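As an illustration, here is a minimal sketch of a functional test in this style. The program under test here is the system sort utility, used purely as a stand-in for a real program; only the unittest skeleton reflects the approach described above.

import subprocess
import unittest


class SortToolTest(unittest.TestCase):
    """A functional test: the program is exercised as a whole, from the outside."""

    def test_sorts_lines(self):
        # perform the test action: feed prepared input data to the program
        process = subprocess.Popen(['sort'], stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE)
        output, _ = process.communicate(b'pear\napple\norange\n')
        # compare the result with the reference
        self.assertEqual(output, b'apple\norange\npear\n')


if __name__ == '__main__':
    unittest.main()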
Automatic test runs
After automating the tests, it remains to automate running them. We use the Jenkins continuous integration system for this. When a new commit appears in the program's git repository, or a new package appears in the rpm repository, Jenkins creates a new virtual machine with the required "role". Jenkins then waits for the machine to be configured, builds the program (if the change came from git) or downloads the necessary rpm packages, installs the program and the tests, runs the tests, publishes the results in the web interface and sends notifications by mail. After the tests have run, the virtual machine is deleted. To speed things up, we created a "virtual machine pool" with pre-created machines for each program under test, so Jenkins simply takes an already configured machine from the pool.

Thus, we have automated configuration setup, test execution, analysis of results and publication of reports. Now all the boring and uninteresting work is done by robots, and we have time for really interesting tasks.
Examples of interesting tasks
Parallel Selenium
One of the interesting tasks we solved is not a testing task, so this story is more of a lyrical digression. There is a tool for automating web-interface testing called Selenium. It allows automated tests to open browser windows, load the pages under test into them, enter text into forms, click page elements, perform other user actions and make the necessary checks. We use this tool too, although testing web interfaces is not the largest part of our work. Our web-interface tests work through Selenium 2 (WebDriver) configured for remote operation: we have a Selenium hub server that accepts connections from tests, and several Selenium nodes, on which the browsers are installed and on which all the actions with web pages are actually performed.
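A minimal sketch of how a test talks to such a hub through the standard Selenium 2 Python bindings; the hub address and the page under test are hypothetical:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# connect to the Selenium hub, which routes the session to a free node
driver = webdriver.Remote(
    command_executor='http://selenium-hub.example:4444/wd/hub',  # hypothetical hub address
    desired_capabilities=DesiredCapabilities.FIREFOX)
try:
    driver.get('http://test-bench.example/login')  # hypothetical page under test
    driver.find_element_by_name('login').send_keys('tester')
    driver.find_element_by_name('submit').click()
    assert 'Welcome' in driver.page_source
finally:
    driver.quit()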
It was important for us to be able to run different tests in parallel and to keep these parallel tests from interfering with each other. Unfortunately, Selenium as officially distributed does not always allow this; running several tests in parallel in IE or Opera on a single node is particularly problematic.
We solved the problems of parallel test execution on a single node by making fixes in the Java and C++ code of Selenium itself. We added locks around actions that must be performed single-threaded and window-focus switching before the actions that need it. We also fixed multithreaded file upload in IE and added this function for Opera. At the time of writing, all of these fixes work with Selenium version 2.26.
Anticipating a possible question: we really do want our fixes to become part of official Selenium. We posted our patches on GitHub (for example, https://github.com/wladich/operadriver) and sent them to the developers. However, for various reasons, none of the patches has yet fully made it into Selenium, although we do see some of our lines in the code of recent Selenium versions. The freshest portion of our fixes has not been published yet, and we will be glad if the Selenium developers take an interest in it.
Provoking emergency situations

Sometimes automated tests need to check the program's behavior in abnormal situations, for example, when the program fails to write data to a file. How can such a situation be created in a test?
We solved this problem by modifying the ltrace tool. This tool lets you trace a program's calls to library and system functions, as well as the signals it receives. By slightly modifying the ltrace source code, you can teach it to substitute the value returned by a library function call. For example, on a write error the write function returns -1. In practice it turned out that, as a rule, the result needs to be substituted not for every call but only for one specific call, which can be identified by the values of the function's call parameters. To make this possible, we added to ltrace the ability to do smart substitution of a function's result from Python code.
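To give an idea of the approach, here is a purely hypothetical sketch of such a Python hook; it illustrates the concept only and is not the actual interface of our modified ltrace:

# Hypothetical illustration: a Python callback inspects the arguments of an
# intercepted write() call and decides whether to substitute its return value.
def on_write(fd, buf, count):
    if fd == 4:       # fail only writes to one specific file descriptor
        return -1     # substituted return value: "write error"
    return None       # None means: keep the real return value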
Time Machine

Sometimes you need to check that the program performs certain actions at a specific time. For example, the program dumps data to the database at 3 a.m. This function has to be checked for every version of the program, regardless of the current time. The first option that comes to mind is to change the system time on the test bench, but that affects the entire system, including the tests themselves, not just the program under test, which leads to hard-to-debug problems.
Since we had already learned to substitute the results of library function calls using ltrace, why not add functionality that substitutes the results of time-related calls? So we added to ltrace the ability to replace the values returned by the time, gettimeofday and clock_gettime functions. Since the program under test needs time to keep moving forward to work normally, that is exactly what we implemented: time moves forward relative to the starting moment specified in the ltrace parameters.
Our ltrace with call-result substitution and the time machine is available on GitHub at https://github.com/zenovich/ltrace.
Checking a Bloom filter
One of the interesting problems we solved was checking a Bloom filter.
A Bloom filter is a probabilistic data structure that stores a set of elements compactly and lets you check whether a given element belongs to the set. A false positive answer is possible (the element is not in the set, but the data structure reports that it is), but a false negative is not. A Bloom filter can use any amount of memory chosen by the user in advance: the more memory, the smaller the chance of a false positive.
Usually, the Bloom filter is used to reduce the number of requests for non-existent data in a data structure with more expensive access (for example, located on a hard disk or in a network database), that is, to “filter” requests to it.

The structure is a bit array of m bits. Initially, when the data structure stores an empty set, all m bits are set to zero. The user must define k independent hash functions h_1, ..., h_k that map each element to one of the m positions of the bit array in a reasonably uniform way.
To add an element e, the bits at positions h_1(e), ..., h_k(e) of the bit array are set to one.
To check whether an element e belongs to the set of stored elements, the state of the bits h_1(e), ..., h_k(e) is checked. If at least one of them is zero, the element cannot belong to the set. If all of them are equal to one, the data structure reports that e belongs to the set. In this case two situations are possible: either the element really does belong to the set, or all of these bits happened to be set when other elements were added, and this is the source of false positives in this data structure.
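A minimal sketch of the structure in Python; the hash functions here are illustrative (md5 salted with the function index) rather than the ones used in the real system:

import hashlib


class BloomFilter(object):
    def __init__(self, m, k):
        self.m = m                   # number of bits
        self.k = k                   # number of hash functions
        self.bits = [False] * m

    def _positions(self, element):
        # k illustrative hash functions built from md5 with different salts
        for i in range(self.k):
            digest = hashlib.md5(('%d:%s' % (i, element)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = True

    def might_contain(self, element):
        # False means "definitely not in the set"; True may be a false positive
        return all(self.bits[pos] for pos in self._positions(element))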
The probability of a false positive decreases as m (the size of the bit array) grows and increases as n (the number of inserted elements) grows. For fixed m and n, the optimal number k of hash functions that minimizes this probability (assuming that the hash functions are chosen at random, that for any element x each hash function h_i maps it to any position of the bit array with equal probability, and that the values h_i(x) are jointly independent random variables) is

k = (m / n) · ln 2.
In this case, the probability of a false positive itself is approximately

(1/2)^k ≈ 0.6185^(m/n).
In reality, the hash functions are chosen by programmers, so the actual probability of a false positive can differ greatly from the theoretical one. That is why Bloom filters need to be tested.
To verify that a Bloom filter meets the requirements, we fill it with a large amount of test data (hundreds of millions of elements) that has the same distribution as production data, and compare the percentage of collisions (false positives) with the percentage specified in the requirements.
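A sketch of such a check, assuming a filter with add and might_contain operations as in the sketch above; the element generators and the threshold stand in for real production-like data and real requirements:

def false_positive_rate(bloom, stored_elements, absent_elements):
    """Fill the filter and measure the share of false positives on absent elements."""
    for element in stored_elements:
        bloom.add(element)
    false_positives = sum(1 for e in absent_elements if bloom.might_contain(e))
    return float(false_positives) / len(absent_elements)


# usage sketch: elements 0..N-1 are stored, N..2N-1 are definitely absent
bloom = BloomFilter(m=10 ** 6, k=7)
rate = false_positive_rate(bloom,
                           ['user:%d' % i for i in range(10 ** 5)],
                           ['user:%d' % i for i in range(10 ** 5, 2 * 10 ** 5)])
assert rate <= 0.01  # threshold from the requirements (illustrative value)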
Checking randomness

Another interesting task is checking randomness of output. Suppose we need to test a program that, in response to each request, returns one of a set of predefined elements, and for each element the probability with which it should be returned is specified. This is a classic problem of mathematical statistics: testing a statistical hypothesis.
To test such functionality, we use Pearson's criterion, the χ² test. We make N requests and, for each element of the set, count how many times it was returned (O_i). At the same time, we know how many times each element should have been returned if the hypothesis holds exactly (E_i = N · p_i). From these data we calculate the value

χ² = Σ_i (O_i − E_i)² / E_i.
If the hypothesis holds (that is, if the program really does return elements at random), this value is itself random and must follow the χ² distribution with k − 1 degrees of freedom, where k is the number of elements of the set. Thus, for a given significance level α, our value of χ² must be greater than the α-quantile of this distribution; if it is not, the output is not random (it matches the expected counts suspiciously well). On the other hand, our value of χ² must be less than the (1 − α)-quantile of the same distribution; otherwise the hypothesis is rejected.
It is widely believed that such a test requires a huge number of requests to the program. In fact, it does not. The minimum number of requests N needed to apply the criterion is

N = 5 / min_i(p_i),

that is, the expected count for the rarest element must be at least 5. For example, if the "rarest" element of the set is theoretically returned with a probability of 10%, we only need to make 50 requests.

It's funny that this method of checking random number generators is recommended in the book by Donald Knuth, which programmers often use as a monitor stand.
Consider a simple example. Suppose we need to test a program that selects one of 4000 elements with equal probability, i.e. p_i = 1/4000. For the check we need to make N = 5 · 4000 = 20,000 requests to the program. As the requests are executed, we record the number of hits for each element in an array of counters (counted). The test can be made very fast by running the requests in several parallel processes.
For Python, there is the excellent SciPy library, which makes it very easy to calculate the χ² value and its P-value (that is, the probability that a random variable with the χ² distribution takes a value no less than the one actually observed). If the hypothesis assumes a uniform distribution of values, as in our simple case, then χ² and the P-value are computed with this one line of Python code:
chi_square, p_value = scipy.stats.chisquare(counted)
It remains only to verify that p_value lies in the range from 0.05 to 0.95 (for a 5% significance level). Amusingly, when we first wrote this test, the P-value turned out to be orders of magnitude smaller than 0.05, and turning off the multiprocessing produced the correct result. It turned out that in the program under test, which also works in several processes, the random number generator was seeded with the same number in every process. After the program was fixed, the multiprocess test began to pass.
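Putting the whole check together, here is a sketch in which get_element() stands in for one request to the program under test (here it simply simulates a correct uniform choice); scipy.stats.chisquare is the standard SciPy call mentioned above:

import collections
import random

import scipy.stats

N_ELEMENTS = 4000
N_REQUESTS = 5 * N_ELEMENTS      # minimum N for the criterion: 5 / (1/4000)


def get_element():
    # stand-in for a request to the program under test (hypothetical)
    return random.randrange(N_ELEMENTS)


counts = collections.Counter(get_element() for _ in range(N_REQUESTS))
# elements that were never returned must still contribute a zero count
counted = [counts[i] for i in range(N_ELEMENTS)]

chi_square, p_value = scipy.stats.chisquare(counted)  # expected distribution: uniform by default
assert 0.05 < p_value < 0.95     # 5% significance level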
Conclusion
In the course of our work, we have automated the creation of test configurations, the execution of regression tests, the analysis of test results and the publication of reports. Since all this boring and monotonous work is now performed by robots, we can devote all our time to interesting tasks: creating new tests, automating them and developing new tools.

So, as you can see, getting rid of boring, monotonous work makes testing an interesting and exciting job, with room for programming, hacking and mathematics.
Author: Dmitry Zenovich, Head of Testing for Mail.Ru Group.