Testing distributed systems - an interview with Andrei Satarin, Yandex

Testing distributed systems differs significantly from testing centralized ones. Few testers can boast serious knowledge and experience in this area.

I talked to a speaker at the Heisenbug 2016 Moscow conference, Andrei Satarin (twitter.com/asatarin). Andrei has tested projects at Mail.ru, Kaspersky Lab, and Deutsche Bank, and now tests distributed systems at Yandex. The article will be useful not only to testers but also to developers. If you have never dealt with testing distributed systems, read on.

Andrei Satarin:
... they kill the nodes right during business hours and the developers are watching ...

Methods and features of testing distributed systems

- What methods and strategies exist for testing distributed systems? How do they differ?

- In addition to the well-known classical approaches (unit testing, system testing, integration testing), distributed systems call for additional approaches designed to detect complex defects.

Fault injection is a very popular approach. While the system is running, we introduce failures using special programs and mechanisms: failures of disks or entire machines, network failures, or failures of internal components of the system under test. Since the vast majority of distributed systems must withstand this kind of failure, at least on some limited scale, the system should neither stop working nor show any anomalies. In essence, this is fault tolerance testing, and fault tolerance is one of the most important non-functional requirements for distributed systems. The more machines are involved, the higher the likelihood that one of them has a problem. For example, with a thousand machines, disks will fail, roughly speaking, about once a week. The system must survive such situations without any visible effect.
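As a rough illustration of the idea, here is a minimal Python sketch of a fault injection test. Everything in it is an assumption made for the example and not the system discussed in the interview: `Node`, `Cluster` and `run_fault_injection_test` are hypothetical names, the "cluster" is an in-memory toy with majority quorums and a single global version counter, and the only fault injected is a node crash followed by recovery.

```python
import random


class Node:
    """Hypothetical in-memory replica; the fault injector flips `alive`."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}  # key -> (version, value)


class Cluster:
    """Toy replicated store: reads and writes need a majority of live nodes."""

    def __init__(self, size):
        self.nodes = [Node(f"node-{i}") for i in range(size)]
        self.quorum = size // 2 + 1
        self.version = 0  # simplification: one global version counter

    def _live_quorum(self):
        live = [n for n in self.nodes if n.alive]
        if len(live) < self.quorum:
            raise RuntimeError("no quorum")
        return live

    def put(self, key, value):
        live = self._live_quorum()
        self.version += 1
        for node in live:
            node.data[key] = (self.version, value)

    def get(self, key):
        live = self._live_quorum()
        copies = [node.data[key] for node in live if key in node.data]
        # The newest version wins; a revived node may still hold stale data.
        return max(copies)[1] if copies else None


def run_fault_injection_test(steps=1000, size=5, seed=0):
    """Kill and revive random nodes while checking that reads see the last write."""
    rng = random.Random(seed)
    cluster = Cluster(size)
    max_down = size - cluster.quorum  # failures the toy system claims to tolerate
    expected = {}
    for step in range(steps):
        node = rng.choice(cluster.nodes)
        down = sum(not n.alive for n in cluster.nodes)
        if node.alive and down < max_down:
            node.alive = False  # inject a failure
        elif not node.alive:
            node.alive = True  # recovery
        key = f"k{rng.randrange(10)}"
        cluster.put(key, step)
        expected[key] = step
        probe = rng.choice(sorted(expected))
        assert cluster.get(probe) == expected[probe], f"stale read at step {step}"
    return steps
```

Calling `run_fault_injection_test()` runs a thousand steps with a fixed seed, so a failing run can be replayed. Against a real system the injector would act on processes, disks or the network, but the shape of the test stays the same: keep the workload running, inject failures within the promised tolerance, and check the guarantees after every operation.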

There are also more academic approaches, for example formal verification. A distributed system relies on internal algorithms and protocols that allow it to work. They are quite complex in themselves, but they guarantee certain invariants that must always hold, regardless of system failures, packet reordering in the network, and so on. The essence of the approach is that the correctness of the algorithm is verified based solely on its description in a special language. This gives confidence that the algorithm, provided it is implemented correctly, will work.

In 2015, Microsoft Research published an academic paper, "Proving Practical Distributed Systems Correct", in which the authors described a model of a distributed storage system, checked the model for correctness using special tools, and then generated code that worked right away.
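To make the idea of checking an algorithm's description against an invariant more concrete, here is a small Python sketch. It is not the tooling used in the paper and not a proof: it is a brute-force exploration of every interleaving of a toy two-process algorithm, which is closer to model checking. The step names ("read", "write", "incr"), the helper functions and the invariant (the counter must end at 2) are all assumptions made for the example.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (operation, process id) steps on a shared counter."""
    counter = 0
    local = {}  # per-process register used by the non-atomic increment
    for op, pid in schedule:
        if op == "read":
            local[pid] = counter
        elif op == "write":
            counter = local[pid] + 1
        elif op == "incr":  # atomic increment
            counter += 1
    return counter


def violations(process_steps):
    """Return every schedule of two identical processes that breaks the invariant."""
    a = [(op, 0) for op in process_steps]
    b = [(op, 1) for op in process_steps]
    return [s for s in interleavings(a, b) if run(s) != 2]


# Non-atomic increment: some interleavings lose an update.
broken = violations(["read", "write"])
# Atomic increment: the invariant holds in every interleaving.
safe = violations(["incr"])
```

Here `broken` contains the schedules in which both processes read before either one writes, while `safe` is empty. Real verification tools work on far larger state spaces and on richer specification languages, but the principle is the same: the invariant is checked against the description of the algorithm, not against a running system.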

- What features need to be considered when testing distributed systems?

- The key point is that you need to understand exactly which invariants the system under test guarantees. Take the now-popular NoSQL databases: they may offer higher performance, but they do not support transactions, so their consistency level is lower than that of classical databases (MySQL, PostgreSQL, Oracle). When you test such a distributed system, you need to understand which invariants it supports, because this determines which anomalies will show up in the tests. In complex tests, for example with several concurrent writers and readers, you can see many different states. In other words, you need to understand which effects can be observed in the system and which cannot.
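One way to turn "which effects can be observed and which cannot" into a test is to record what each client saw and check the recorded history against an invariant. The Python sketch below assumes, purely for illustration, that the system promises monotonic reads (a client never sees an older version of a key after a newer one) and that the test records reads as `(client, key, version)` tuples. Both the invariant and the history format are hypothetical.

```python
def monotonic_read_violations(history):
    """Find reads where a client observed an older version than it saw before.

    `history` is an iterable of (client, key, version) tuples, in the order
    in which the reads completed.
    """
    last_seen = {}  # (client, key) -> highest version observed so far
    found = []
    for client, key, version in history:
        previous = last_seen.get((client, key))
        if previous is not None and version < previous:
            found.append((client, key, previous, version))
        else:
            last_seen[(client, key)] = version
    return found


history = [
    ("reader-1", "x", 1),
    ("reader-2", "x", 2),
    ("reader-1", "x", 2),
    ("reader-2", "x", 1),  # reader-2 goes back in time
]
print(monotonic_read_violations(history))
```

If the system under test does not promise monotonic reads, the last read in this history is an allowed effect and not a defect, which is exactly why the guarantees have to be known before the checker is written.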

Non-functional requirements play the most significant role.

- What are the typical mistakes people make when testing distributed systems?

- The most common mistake is failing to check all the guarantees the system is supposed to provide, which leaves the system effectively untested. The second mistake, which can be costly, is not testing for the failure of some part of the system. In my experience, if a subsystem of a distributed system has not been tested with fault injection, it contains a huge number of bugs.

- Which metrics and characteristics of a distributed system are important to test, and why?

- Among the non-functional requirements, first fault tolerance and second performance. For distributed systems, non-functional requirements play a more significant role than functional ones. Fault tolerance comes first, because the system has to work at all; if it does not, the rest is not so important.

- How important is the performance of the tests themselves? Is it necessary to take possible network delays into account when developing tests for a distributed system?

- It depends on the type of tests. For unit tests, performance is important. In general, of course, it is better to have fast tests (as they say, it is better to be healthy and rich). This is true for functional tests. For non-functional tests that check, for example, consistency or fault tolerance, test performance matters because it makes defects show up sooner. For example, if a defect manifests itself once per million operations, then the faster the operations run, the sooner the defect appears. If that takes an hour, it is perfectly acceptable. If it takes several days, finding such defects becomes a problem.
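The "once per million operations" example can be made concrete with a little arithmetic. The sketch below assumes that every operation triggers the defect independently with the same probability, and the two throughput figures (1000 and 10 operations per second) are invented for illustration. The function names are hypothetical as well.

```python
import math


def operations_needed(p_defect, confidence=0.95):
    """Operations required to see the defect at least once with the given confidence.

    Solves 1 - (1 - p)^n >= confidence for n, assuming independent operations.
    """
    return math.ceil(math.log(1 - confidence) / math.log1p(-p_defect))


def hours_needed(p_defect, ops_per_second, confidence=0.95):
    """Wall-clock hours for a test that runs at a fixed operation rate."""
    return operations_needed(p_defect, confidence) / ops_per_second / 3600


p = 1e-6  # the defect shows up about once per million operations
fast = hours_needed(p, ops_per_second=1000)
slow = hours_needed(p, ops_per_second=10)
print(f"fast test: {fast:.1f} h, slow test: {slow:.1f} h")
```

Under these assumptions roughly three million operations are needed to be 95% sure of hitting the defect. At 1000 operations per second that is under an hour, and at 10 operations per second it is more than three days, which matches the "an hour versus several days" contrast above.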

98% of all defects can be reproduced on just 3 nodes

- Is it necessary to create special clusters for testing, or can production clusters be used? How do you determine the optimal size of a test cluster?

- Most often a test cluster is used. As for testing on production servers, the best-known example is Netflix, which actively promotes its approach called the "Simian Army", that is, an army of monkeys. The idea is that they do fault injection in production. They kill nodes right during business hours, and the developers watch to make sure the system does not degrade. But you have to understand that this only becomes possible at a certain scale. If the system runs on 10-20 nodes, then testing this way means a degradation of 5-10%. Not everyone is ready for such sacrifices in production. In addition, there may be a service level agreement (SLA), and violating it can make such testing expensive. In any case, even where testing in production is practiced, there is also a huge test infrastructure that catches most of the defects. The advantage of testing in production is that there is no need to replicate the production environment.

Regarding the size of the test cluster: if the system is distributed, the cluster must have more than one node; that is the lower bound. As for the upper bound, there is a paper, "Simple Testing Can Prevent Most Critical Failures", that examines what kinds of errors occur in distributed systems. The researchers concluded that 98% of all defects can be reproduced on just 3 nodes. In our own work we use more: our test cluster usually consists of 8 nodes, but this is due to the internal structure of our system.

- How to deal with complete or partial failures of a distributed system during testing?

- There are probably no special ways to deal with this, because the scale of a test environment is much smaller. If bad hardware gets in the way too much, it can simply be excluded from the test environment. We had a case where our test hardware failed, and we were rather glad, since it allowed us to find some unusual defects. Since a distributed system must be fault tolerant, such failures should not cause any problems even in tests.

- What specific technologies and tools are used to create a test environment? To automate testing?

- The test environment depends on the technologies used for development and on the technologies the team is familiar with. We, for example, actively use Python, because it is well suited to such tasks and our testers know it. It is simple to write tests in, and high-level enough that the code stays readable. In my opinion, it has a bit of "trouble" with concurrency, but that problem can be solved. The system itself is developed in C++, but C++ is rather hard to use for high-level tests: development in it is neither quick nor easy, and development speed matters for tests.

Regarding test automation: usually a set of tests is assembled that runs automatically on a dedicated server. We use TeamCity and some of our internal tools for this.

- Do you have anything else you would like to add on the topic?

- I would like to add that there is a huge amount of material on testing distributed systems, both academic and closer to industry, and a huge variety of testing approaches and methods. The search for methods and their improvement never stops. The topic is constantly evolving, and that is precisely what makes it interesting.

You can hear more talks on testing on December 10 at the Radisson Slavyanskaya Hotel at the Heisenbug conference. Registration is still open.

Source: https://habr.com/ru/post/313908/
