
Andrei Satarin, Yandex: “The biggest mistake is the lack of understanding of the system”

When testing distributed systems, non-functional requirements come to the fore, and special methods have to be applied to detect complex defects. We already discussed them with Andrei Satarin in a previous interview, and today we will develop the topic further.



Andrei Satarin tests distributed systems at Yandex. He has taken part in very different projects: he tested a game at Mail.ru, a cloud-based detection system at Kaspersky Lab, and a currency pricing system at Deutsche Bank.

- Fault tolerance is one of the most important non-functional requirements for distributed systems. How is fault tolerance testing conducted?
Andrei Satarin: Failures can be emulated in a test environment; this is how the famous Jepsen tool, created by Kyle Kingsbury, works. The second approach involves injecting failures into the production environment and is usually associated with Netflix's Chaos Monkey, from which the whole chaos engineering movement has grown. It spares us the problem of replicating the production environment and gives high confidence in the behavior of the system, but it is more dangerous and requires a certain maturity of the product.
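
To make the first approach concrete, here is a minimal sketch of Jepsen-style fault emulation in a test cluster: cutting traffic between two groups of nodes and then healing the partition. The host addresses, SSH access, and iptables rules are illustrative assumptions, not something from the interview.

```python
import subprocess

# Hypothetical test-cluster nodes; adjust to your environment.
GROUP_A = ["10.0.0.1", "10.0.0.2"]
GROUP_B = ["10.0.0.3", "10.0.0.4", "10.0.0.5"]

def ssh(host: str, command: str) -> None:
    """Run a command on a node over SSH (assumes key-based root access)."""
    subprocess.run(["ssh", host, command], check=True)

def partition(group_a, group_b):
    """Drop packets between the two groups -- a classic split-brain scenario."""
    for a in group_a:
        for b in group_b:
            ssh(a, f"iptables -A INPUT -s {b} -j DROP")
            ssh(b, f"iptables -A INPUT -s {a} -j DROP")

def heal(nodes):
    """Flush the INPUT chain to restore connectivity (coarse, for tests only)."""
    for host in nodes:
        ssh(host, "iptables -F INPUT")

partition(GROUP_A, GROUP_B)
# ... run the workload and consistency checks while the partition holds ...
heal(GROUP_A + GROUP_B)
```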

There is a third approach that makes it possible to check the correctness of algorithms even before the code is written, using special tools such as TLA+. The two best-known examples of its use are at Amazon Web Services and Azure Cosmos DB.
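
TLA+ has its own specification language, but the core idea, exhaustively exploring the states an algorithm can reach and checking an invariant in each, can be sketched in a few lines of Python. The toy algorithm below (two nodes doing an unsynchronized read-modify-write) is purely illustrative, not a real TLA+ workflow.

```python
from itertools import permutations

def run_schedule(schedule):
    """Execute one interleaving of per-node steps; return the final counter."""
    counter = 0
    local = {}  # node -> value read but not yet written back
    for node, step in schedule:
        if step == "read":
            local[node] = counter
        else:  # "write": write back the incremented local copy
            counter = local[node] + 1
    return counter

# Each node does read then write; enumerate all interleavings that preserve
# each node's own program order, and check the invariant "no lost update".
steps = [("a", "read"), ("a", "write"), ("b", "read"), ("b", "write")]
violations = []
for perm in permutations(steps):
    if (perm.index(("a", "read")) < perm.index(("a", "write"))
            and perm.index(("b", "read")) < perm.index(("b", "write"))):
        if run_schedule(perm) != 2:  # both increments should be applied
            violations.append(perm)

print(f"{len(violations)} interleavings violate the invariant")  # -> 4
```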

- Do you think it is better to use a test cluster or the production system? What are the advantages and disadvantages of each approach?

Andrei Satarin: Both approaches are acceptable, but in a test cluster we can do anything and create situations that are more dangerous for the system, whereas in the production environment we should not inject many failures: the risk of serious consequences is high, and there are fairly strict SLAs that are highly undesirable to break. There you have to build in reserves, for example in hardware or in other cluster resources. On the other hand, testing in production adds confidence in the configuration we actually use: in many cases the production environment is extremely difficult to emulate, and not only in terms of hardware; it is not always possible to reproduce similar behavior of your system in a test cluster.

Tests typically use synthetic load, but your system may interact with other systems that have fairly complex behavior. That interaction is difficult to emulate, and in production you can achieve greater coverage in terms of the breadth of functionality being exercised. But, I repeat, this approach is riskier, and to use it you need a mature product. Well-developed incident response is also needed; that is, you cannot inject failures in production at, say, 3 a.m., as that is too hard on people. It is usually done during business hours, when the whole team is available and even complex problems can be resolved relatively quickly.

- What are the most common mistakes when testing the fault tolerance of distributed systems?

Andrei Satarin: In my opinion, the biggest mistake is a lack of understanding of the system. You need a clear picture of what the system does, what constraints it has, and how to test them. If we have uptime requirements, for example, how do we verify them under failures?

Even a minor problem that does not lead to data loss can break our SLA. You need to understand which failures will actually occur in reality: a great many can potentially be injected, but your system may not be designed to withstand certain classes of failures at all. You should separate the one from the other and not try to test the system in situations in which, by design, it is not supposed to work.

- What are the main methods for testing the performance of a distributed system? What are the subtleties and pitfalls, and what mistakes are made during such testing?

Andrei Satarin: The approaches here are roughly the same as for testing single-node systems, only the behavior is somewhat more complex. To analyze the results of performance testing you need to understand the behavior of the system well, and one person cannot keep all of it in their head.
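
As a small illustration of the kind of analysis a performance run produces (my example, not the interviewee's): in distributed systems it is the tail latencies that matter, since a system can look fine at the median and still break its SLA at the 99th percentile.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Fake latency samples in milliseconds: mostly fast, with a heavy tail.
latencies_ms = [12, 15, 11, 14, 250, 13, 12, 16, 15, 900]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```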

A very important point is communicating the results of experiments to colleagues. In our practice it often happens that someone examines the performance of one part of the system and then sends their data and conclusions to a general mailing list; other team members analyze the experiment from their side and add further aspects. It is important to work together with colleagues who know other parts of the system better, and to look at it from different angles.

Another important point when testing the performance of distributed systems is scalability. It often happens that the system works fine on, say, ten nodes in a test cluster, but when you try to run it in production on a hundred nodes, it turns out that there is a bottleneck and the production cluster is only twice as fast. It is very difficult to assess this problem a priori at small scale, and the production environment is usually much larger than the test one. Building a test cluster of the same scale is most often impossible because of the cost, so special approaches have to be devised to test the scalability of a system.
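
One such approach (my illustration, not something from the interview) is to fit a capacity model such as Gunther's Universal Scalability Law to small-cluster measurements and see how throughput extrapolates. The numbers below are made up, and the sketch assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    """Universal Scalability Law: lam is the per-node rate, sigma models
    contention, kappa models crosstalk (coherency) cost between nodes."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

nodes = np.array([1, 2, 4, 8, 10])
throughput = np.array([950, 1800, 3300, 5400, 6100])  # fake ops/s

(lam, sigma, kappa), _ = curve_fit(usl, nodes, throughput, p0=[1000, 0.01, 0.001])
print(f"predicted throughput at 100 nodes: {usl(100, lam, sigma, kappa):.0f} ops/s")
```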

- What other nuances come up when testing non-functional requirements?

Andrei Satarin: Some nodes of the system may run slowly because of hardware problems or, for example, because other services consume too many resources. Such machines often slow down the entire system significantly: simply switching off the problematic nodes restores performance, but in the production environment it can be difficult to detect them automatically.
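
A crude way to automate that detection, sketched here as an assumption rather than a description of Yandex's tooling, is to flag nodes whose tail latency is far above the cluster median:

```python
from statistics import median

# Hypothetical per-node p99 latencies collected by monitoring, in ms.
node_p99_ms = {"node-1": 18, "node-2": 21, "node-3": 19, "node-4": 240}

cluster_median = median(node_p99_ms.values())
stragglers = [node for node, latency in node_p99_ms.items()
              if latency > 3 * cluster_median]  # the 3x threshold is a tunable guess

print("suspected stragglers:", stragglers)  # -> ['node-4']
```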

Another important point is testing the system configuration. Your code may work fine, but a complex distributed system has many configuration parameters, and misconfiguring them can lead, for example, to a drop in performance or even to data loss. A good example of such a situation is discussed in the paper Paxos Made Live - An Engineering Perspective. It describes an incident with Google Chubby, when a cluster configured to run on five nodes was actually running on four. Thanks to the resilience built into the system, the service kept functioning, but it could no longer withstand the loss of two nodes.
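
A configuration sanity check inspired by that incident might look like the sketch below (illustrative names and numbers): compare the configured replica count with the number of live replicas, and report how many simultaneous failures the quorum can actually survive.

```python
def fault_tolerance(replicas: int) -> int:
    """A majority quorum of n replicas tolerates floor((n - 1) / 2) failures."""
    return (replicas - 1) // 2

configured, live = 5, 4  # the cluster "thought" it had five nodes

if live < configured:
    print(f"WARNING: {configured - live} replica(s) missing; "
          f"tolerating {fault_tolerance(live)} failure(s) "
          f"instead of {fault_tolerance(configured)}")
```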

- How important is the performance of the tests themselves, and how can it be improved?

Andrei Satarin: If we are talking about classic tests (unit, system, integration), they should be fast enough. Higher-level tests of distributed systems take much longer. With randomized fault injection, for example, you have to spend a long time searching through the space of possible failures, and there is no fundamental way to reduce that time: usually it takes hours or even days.

You cannot emulate a large number of different combinations of failures in a short time, especially if you do it on real hardware. But the more load you put on the system, the higher the probability of finding defects. The load-generating part must work fast, and this is really important. The only advice here is to run tests in parallel or to invent tricks for loading the system intensively.
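
As a sketch of the parallelism idea (assumed details, not the interviewee's harness): randomized failure scenarios can be run concurrently with fixed seeds, so that any scenario that finds a defect can be replayed exactly.

```python
import random
from concurrent.futures import ThreadPoolExecutor

FAULTS = ["kill_node", "partition_network", "delay_disk", "skew_clock"]

def run_scenario(seed):
    """Build a replayable random fault schedule; a real harness would apply
    each fault to the cluster and verify invariants after every step."""
    rng = random.Random(seed)  # fixed seed -> the schedule can be replayed
    schedule = [rng.choice(FAULTS) for _ in range(5)]
    return seed, schedule

# Run several independent scenarios in parallel to use the time budget better.
with ThreadPoolExecutor(max_workers=4) as pool:
    for seed, schedule in pool.map(run_scenario, range(8)):
        print(seed, schedule)
```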

- Are there tools for automating the testing of non-functional parameters of distributed systems?

Andrei Satarin: There is the well-known tool Jepsen, which has been used for a rather wide class of systems: Apache Cassandra, MongoDB, and so on. Unfortunately, it cannot simply be launched out of the box; you have to program against it. Available tools need to be adapted to the system under test, and their entry threshold is quite high. As for performance, there are a variety of benchmarks, such as the Yahoo! Cloud Serving Benchmark (YCSB), which exercises various storage systems, like the already mentioned Cassandra and MongoDB.

- What problems in the field of testing distributed systems remain unsolved? Tell us about the main trends in this area.

Andrei Satarin: This area is becoming more mature; many companies are beginning to use complex tests like Jepsen, injecting failures while checking the consistency of their systems. Formal methods, which I have already mentioned, have recently come into active use: TLA+ and formal verification of the algorithms embedded in a distributed system.

Of course, there are pitfalls: even a fully verified distributed system whose code was generated from a formal specification (Microsoft Research, for example, has such projects) contains defects that affect, among other things, the fault tolerance and security of the system.



At Heisenbug 2017, Andrei Satarin will talk about the use of sanitizers, great tools for finding complex defects in C++ programs. At Yandex they are actively used for testing distributed systems.

Wash your hands before eating, or sanitizers in testing

The full conference program is available on the website.

Source: https://habr.com/ru/post/329974/

