Hello, Habr! Every SRE on our team once dreamed of sleeping peacefully at night. Dreams do come true. In this article I will talk about how we get there and how we achieve the performance and stability of our Dodo IS system.

A series of articles about the day the Dodo IS* system went down:
1. The day Dodo IS stopped. The synchronous scenario.
2. The day Dodo IS stopped. The asynchronous scenario.
* This material is based on my talk at DotNext 2018 in Moscow.
In the previous article, we looked at the problems of blocking code in the Preemptive Multitasking paradigm. The conclusion was that the blocking code had to be rewritten with async/await. So we did. Now let's talk about the problems that appeared once we had done this.
Introducing the term Concurrency
Before we get to async, we need to introduce the term Concurrency.
In queuing theory, Concurrency is the number of clients that are currently inside the system. Concurrency is sometimes confused with Parallelism, but they are actually two different things.
For those encountering Concurrency for the first time, I recommend Rob Pike's video. Concurrency is about dealing with many things at once, while Parallelism is about doing many things at once.
In a computer, very few things actually happen in parallel. One of them is computation on multiple processors. The degree of parallelism is limited by the number of CPU threads.
In fact, Threads are part of the concept of Preemptive Multitasking, one way to model Concurrency in a program, where we delegate the Concurrency question to the operating system. This model remains useful as long as we understand that it is a model of Concurrency, not Concurrency itself.
Async/await is syntactic sugar over a state machine, another useful Concurrency model, and one that can run even in a single-threaded environment. In essence, it is Cooperative Multitasking: the model itself does not take Parallelism into account at all. In combination with Multithreading we get one model on top of another, and life becomes much more complicated.
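To make this concrete, here is a minimal sketch of the idea (the class name, method name and URL are purely illustrative): while the I/O is awaited, the compiler-generated state machine is suspended and the thread returns to the Thread Pool, nothing is blocked.

using System.Net.Http;
using System.Threading.Tasks;

public class OrdersClient
{
    private static readonly HttpClient http = new HttpClient();

    public async Task<string> LoadOrdersAsync()
    {
        // Before the await: runs on a Thread Pool thread.
        string json = await http.GetStringAsync("https://example.com/orders");
        // After the await: the continuation resumes, possibly on another Thread Pool thread.
        return json;
    }
}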
Comparison of the two models
How it worked in the Preemptive Multitasking model
Let's say we have 20 Threads and process 20 requests per second. The picture shows a peak: 200 requests in the system at the same time. How could this happen:
- requests can bunch up if 200 clients press the button at the same time;
- the garbage collector can pause request processing for several tens of milliseconds;
- requests can be delayed in some queue, for example if a proxy in front of the server queues them.
There are many reasons why requests can accumulate for a short time and then arrive in a single bundle. In any case, nothing terrible happened: they waited in the Thread Pool queue and were slowly processed. The peak is gone and everything goes on as if nothing had happened.
Suppose that the smart Thread Pool algorithm (and there are elements of machine learning in it) decided that there was no reason to increase the number of Threads yet. The MySQL Connection Pool is also 20, because Threads = 20; accordingly, we need only 20 connections to SQL.
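As an aside, a minimal sketch of how such a cap might look on the connection side, assuming MySql.Data / MySqlConnector style connection-string options ("Maximum Pool Size" is the usual option name; server, database and credentials here are purely illustrative):

// Capping the ADO.NET connection pool to match the ~20 worker threads.
const string ConnectionString =
    "Server=db.example.local;Database=orders;Uid=app;Pwd=secret;" +
    "Maximum Pool Size=20;Minimum Pool Size=5;";

public async Task QueryAsync()
{
    using (var connection = new MySql.Data.MySqlClient.MySqlConnection(ConnectionString))
    {
        // If all 20 pooled connections are busy, this call waits for a free one.
        await connection.OpenAsync();
        // ... execute queries ...
    }
}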

In this case, the Concurrency level of the server from the point of view of the external system = 200. The server has already received these requests but has not yet completed them. However, for an application running in the Multithreading paradigm, the number of requests processed simultaneously is limited by the current Thread Pool size = 20. So inside the application we are dealing with a degree of Concurrency = 20.
How everything now works in the async model

Let's see what happens in an application running on async/await with the same load and the same distribution of requests. There is no queue before a Task is created; the request is picked up immediately. A Thread from the Thread Pool is, of course, used for a short time, and the first part of the request, up to the call to the database, executes right away. Because the Thread quickly returns to the Thread Pool, we don't need many Threads for processing. The Thread Pool is not shown in this diagram at all: it has become transparent.

What does this mean for our application? From the outside the picture is the same: the level of Concurrency = 200. But the situation inside has changed. Previously, requests were "crowded" in the Thread Pool queue; now the application's degree of Concurrency is also 200, because the TaskScheduler imposes no restrictions. Hooray! We have achieved the goal of async: the application "copes" with almost any degree of Concurrency!
Consequences: nonlinear degradation of the system
The application has become transparent with respect to Concurrency, so Concurrency is now projected onto the database, and we need a connection pool of the same size = 200. But the database is CPU, memory, network and storage; it is a service with the same kinds of problems as any other. The more queries we try to execute at the same time, the slower they run.
Under full load, in the best case the database's Response Time degrades linearly: send it twice as many queries and it works twice as slowly. In practice, contention between queries inevitably adds overhead, and the system may well degrade non-linearly.
Why does this happen?
Second-order reasons:
- the database now has to keep more data structures in memory at the same time to serve more requests;
- the database now has to work with larger collections (which is algorithmically more expensive).
The first-order reason:
In the end, async fights against limited resources and... wins! The database cannot keep up and starts to slow down. Because of that, the server's Concurrency grows even further, and the system can no longer get out of this situation gracefully.
Server Sudden Death Syndrome
Sometimes an interesting situation occurs. We have a server. It quietly does its job, everything is in order. There are enough resources, even with a margin. Then we suddenly get a message from clients that the server is slowing down. We look at the chart and see that there was some spike in customer activity, but now everything looks normal. We suspect a DoS attack or a coincidence. Everything seems fine now, except that the server keeps getting slower and slower, until timeouts start pouring in. After a while, another server that uses the same database also starts to go down. Sound familiar?
Why did the system die?
You can try to explain this by saying that at some point the server received a peak number of requests and "broke". But we know that the load then dropped, and yet the server did not get better for a very long time, not until the load disappeared completely.
A rhetorical question: was the server really supposed to break from excessive load? Is that how servers behave?
Simulating a server crash
We will not analyze graphs from a real production system here. At the moment of a server crash we often cannot get such graphs at all: the server runs out of CPU and, as a result, cannot write logs or report metrics. On the charts, the moment of the disaster often shows up as a gap in every graph.
SREs should be able to build monitoring systems that are less prone to this effect: systems that provide at least some information in any situation. They should also be able to analyze an incident post-mortem from fragmentary data. For educational purposes, we will take a slightly different approach in this article.
Let's build a model that mathematically behaves like a server under load, and then study its characteristics. We discard the nonlinearity of real servers and simulate a situation where the slowdown is linear once the load grows above nominal: give it twice as many requests as it can handle and it serves them twice as slowly.
This approach allows us to:
- see what happens in the best case;
- take accurate measurements.
Legend for the graphs:
- blue - the number of requests to the server;
- green - server responses;
- yellow - timeouts;
- dark gray - requests that wasted server resources because the client had already given up waiting by the time the timeout expired. Sometimes the client can report this to the server by disconnecting, but in general such a luxury may not be technically feasible, for example when the server does CPU-bound work without cooperating with the client.

Why does the client request graph (blue in the diagram) look like this? Normally the order curve in our pizzerias grows smoothly in the morning and falls off in the evening. But here we see three peaks on top of the usual smooth curve. This shape was not chosen for the model by chance: the model was born during the investigation of a real incident with the pizzeria contact-center server in Russia during the World Cup.
Case "World Cup"
We sat and waited for the extra orders. We had prepared for the Championship; now the servers would get a real stress test.
The first peak: football fans settle in to watch the match, they are hungry and order pizza. During the first half they are busy and cannot order, but people indifferent to football can, so the chart looks as usual.
Then the first half ends and the second peak comes. The fans are nervous and hungry and place three times as many orders as in the first peak. Pizza sells at a furious rate. Then the second half begins, and again nobody has time for pizza.
Meanwhile, the contact-center server slowly starts to buckle and serves requests more and more slowly. A component of the system, in this case the Call Center web server, has been destabilized.
The third peak comes when the match is over. A penalty shootout awaits both the fans and the system.
Analyzing the reasons for the server crash
What happened? The server could handle 100 conditional requests. We know it is designed for that capacity and cannot take any more. A peak arrives that in itself is not that big. But the gray area of Concurrency is much higher.
The model is designed so that Concurrency is numerically equal to the number of orders per second, so visually it should be on the same scale as the order graph. Yet it is much higher, because it accumulates.
Here we see a "shadow" of the graph: these are requests that have started coming back to the client completed (marked by the first red arrow). The time scale is conditional, chosen so that the offset is visible. The second peak has already knocked our server out. It crashed and started processing four times fewer requests than usual.

In the second half of the graph you can see that at first some requests were still being completed, but then yellow areas appear: requests stopped completing altogether.

Here is the whole graph once again. You can see that Concurrency is going wild: a huge mountain appears.

We used to analyze completely different metrics: how long a request took, how many requests per second. We never looked at Concurrency; we didn't even think about this metric. In vain, because it is precisely this quantity that best reveals the moment of server failure.
But where did such a huge mountain come from? The biggest peak load has already passed!
Little's law
Little's law governs Concurrency.
L (the number of customers inside the system) = λ (the rate at which they arrive) × W (the time they spend inside the system). This is an average. However, our situation is developing dramatically, and averages do not suit us. We will differentiate this equation and then integrate it. To do that, open the book by John Little, who derived this formula, and find the integral form there.
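For reference, a sketch of this in formula form: the first equality is Little's law in steady state, and the second is the usual cumulative (flow-balance) reading of it, where λ_in and λ_out denote the arrival and departure rates (my notation, not Little's):

\[
  L = \lambda \, W
\]
\[
  L(t) = L(0) + \int_{0}^{t} \bigl( \lambda_{\mathrm{in}}(s) - \lambda_{\mathrm{out}}(s) \bigr) \, ds
\]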

We have the rate at which requests enter the system and the rate at which they leave it. A request arrives, and it leaves when it is fully processed. Below is the region of the graph corresponding to linear growth of Concurrency.

There are few green requests: these are the ones actually being completed. The blue ones are those arriving. In between the peaks we have the usual number of requests and the situation looks stable, yet Concurrency keeps growing. The server can no longer cope with this on its own, which means it will fall over soon.
But why does Concurrency keep increasing? Look at the integral of a constant: nothing changes in our system, the inflow simply exceeds the outflow by a constant amount, yet the integral is a linear function that only grows.
Shall we play?
The explanation with integrals is hard to follow if your math is rusty, so let's warm up and play a game instead.
Game number 1
Prerequisites: the server receives requests, each requiring three processing periods on the CPU. The CPU resource is divided evenly between all active tasks, which is similar to how CPU is consumed under Preemptive Multitasking. The number in a cell is the amount of work left after that period. A new request arrives every conditional step.
Imagine a request arrives. It needs 3 units of work; at the end of the first processing period, 2 units remain.
In the second period another request arrives on top, and now both CPUs are busy. Each did one unit of work, leaving 1 and 2 units for the first and second requests, respectively.
Now the third request has arrived, and the fun begins. It would seem that the first request should have been completed, but in this period three requests already share the CPU resource, so the degree of completion for all three requests is now fractional at the end of the third processing period:

It gets even more interesting! A fourth request is added, and the degree of Concurrency is now 4, since all four requests needed CPU in this period. Meanwhile, by the end of the fourth period the first request has finally completed; it does not move on to the next period and has 0 work left for the CPU.
Since the first request has completed, let's sum it up: it ran a third longer than expected. Ideally, the horizontal length of each task should be 3, equal to its amount of work. We mark it orange as a sign that we are not entirely happy with the result.

The fifth request arrives. The degree of Concurrency is still 4, but the total remaining work in the fifth column is larger. This happens because the fourth column left behind more unfinished work than the third.
We continue for another three periods, waiting for responses.
- Server, hello!
- ...

"Your call is very important to us..."

Finally, the response to the second request arrives. Its response time is twice as long as expected.

The degree of Concurrency has already tripled, and nothing suggests the situation will improve. I did not draw any further, because the response time of the third request would no longer fit in the picture.
Our server has entered an undesirable state from which it will never exit on its own. Game over.
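To make the dynamics tangible, here is a minimal C# sketch of this toy model (two CPUs, a 3-unit request arriving every tick, fair sharing; the class and constant names are mine, and the model deliberately ignores mid-tick redistribution of freed CPU):

using System;
using System.Collections.Generic;

// A toy model of Game #1: preemptive fair sharing with no admission limit.
// Two CPUs give out at most 2 units of work per tick, each request can use
// at most one CPU, and a new 3-unit request arrives every tick.
class FairShareSimulation
{
    static void Main()
    {
        const double cpuUnitsPerTick = 2.0;
        const double workPerRequest = 3.0;
        var remaining = new List<double>(); // work left for each active request

        for (int tick = 1; tick <= 20; tick++)
        {
            remaining.Add(workPerRequest);                  // a new request arrives
            int concurrency = remaining.Count;              // everyone competes for CPU this tick
            double share = Math.Min(1.0, cpuUnitsPerTick / concurrency);
            for (int i = 0; i < remaining.Count; i++)
                remaining[i] -= share;                      // fair share for every request
            remaining.RemoveAll(w => w <= 1e-9);            // finished requests leave the system
            Console.WriteLine($"tick {tick,2}: concurrency = {concurrency}");
        }
    }
}

Running it shows Concurrency creeping up tick after tick and never coming back down: exactly the Game Over state described below.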
What is the GameOver state of the server characterized by?
Requests accumulate in memory indefinitely; sooner or later the memory simply runs out. In addition, as the scale grows, the CPU overhead of servicing various data structures grows too: the connection pool now has to track timeouts for more connections, the garbage collector has to scan more objects on the heap, and so on.
Exploring every possible consequence of accumulating active objects is not the goal of this article, but simply piling up data in RAM is already enough to bring the server down. We have also already seen that the server projects its Concurrency problems onto the database server and onto any other servers it calls as a client.
Most interestingly, even if you now send the server a lower load, it still will not recover: all requests end in timeouts while the server keeps consuming every available resource.
And what did we expect?! After all, we knowingly gave the server more work than it could handle.
When working on distributed system architecture, it is useful to think about how ordinary people solve such problems. Take a nightclub: it stops functioning if too many people get inside. The bouncer solves the problem simply: he watches how many people are inside, and lets one in only when another comes out. A new guest looks at the queue; if the line is too long, he goes home. What if we apply this algorithm to the server?

Let's play again.
Game number 2
Prerequisites: again two CPUs and the same 3-unit tasks arriving every period, but now we put a bouncer at the door, and the tasks are smart: if they see that the queue already holds 2, they go home right away.


The third request arrives. In this period it stands in line, so it still shows 3 at the end of the period. There are no fractional remainders, because the two CPUs run exactly two tasks, one unit each per period.
Although three requests have stacked up, the degree of Concurrency inside the system = 2: the third one is in the queue and does not count.

The fourth arrives: the same picture, although more work has accumulated by now.

...
...
In the sixth period the third request completes, a third late, and the degree of Concurrency is already 4.

The degree of Concurrency has doubled and cannot grow any further, because we have explicitly forbidden it. Only the first two requests completed at maximum speed: the ones that came to the club first, while there was still room for everyone.
The yellow requests stayed in the system longer, but they waited in line and did not eat into the CPU, so those already inside could enjoy themselves in peace. This could go on until someone decided he would not stand in line and went home instead. That is a rejected request:

The situation can repeat endlessly, while request execution time stays at the same level: exactly twice as long as we would like.

We can see that a simple limit on the Concurrency level removes the threat to the server's viability.
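Extending the sketch from Game #1, here is a minimal version with a "bouncer" (MaxInside = 2 and MaxQueue = 2 mirror the game; the exact ordering of admission and promotion within a tick is my simplification):

using System;
using System.Collections.Generic;

// A toy model of Game #2: same CPUs and tasks, but a bouncer admits at most
// MaxInside requests onto the CPUs and keeps at most MaxQueue waiting;
// anyone else is turned away immediately.
class BouncerSimulation
{
    const int MaxInside = 2;  // matches the two CPUs
    const int MaxQueue = 2;   // any longer and the "guest" goes home

    static void Main()
    {
        var inside = new List<double>();   // work left for admitted requests
        var queue = new Queue<double>();   // requests waiting at the door
        int rejected = 0;

        for (int tick = 1; tick <= 20; tick++)
        {
            // Free CPU slots are filled from the queue first.
            while (inside.Count < MaxInside && queue.Count > 0)
                inside.Add(queue.Dequeue());

            // A new 3-unit request arrives: admit, queue, or reject.
            if (inside.Count < MaxInside) inside.Add(3.0);
            else if (queue.Count < MaxQueue) queue.Enqueue(3.0);
            else rejected++;

            // Each admitted request gets one unit of CPU this tick.
            for (int i = 0; i < inside.Count; i++) inside[i] -= 1.0;
            inside.RemoveAll(w => w <= 1e-9);

            Console.WriteLine($"tick {tick,2}: inside = {inside.Count}, " +
                              $"queued = {queue.Count}, rejected so far = {rejected}");
        }
    }
}

Unlike the first sketch, the number of requests inside the system stays bounded; the price is an occasional rejection.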
How to increase server viability by limiting the Concurrency level
The simplest "bouncer" you can write yourself. Below is code based on a semaphore. There is no limit on the length of the line outside. The code is for illustration only; no need to copy it.
const int MaxConcurrency = 100;
// At most MaxConcurrency requests are processed at the same time;
// everyone else waits on the semaphore (an unbounded "line outside").
SemaphoreSlim bulkhead = new SemaphoreSlim(MaxConcurrency, MaxConcurrency);

public async Task ProcessRequest()
{
    // The overload with a timeout returns bool, so the rejection branch compiles;
    // Timeout.Infinite means we wait as long as it takes.
    if (!await bulkhead.WaitAsync(Timeout.Infinite))
    {
        throw new OperationCanceledException();
    }
    try
    {
        await ProcessRequestInternal();
    }
    finally
    {
        bulkhead.Release();
    }
}
To get a bounded queue as well, you need two semaphores. For this, the Polly library, which Microsoft recommends, is a good fit. Pay attention to the Bulkhead pattern. A bulkhead is literally the structural element that keeps a ship from sinking; honestly, I think the term "bouncer" fits better. The important thing is that this pattern lets the server survive in hopeless situations.
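A minimal sketch of what this might look like with Polly's classic (v7-style) bulkhead API; ProcessRequestInternal and the 100/100 limits are illustrative:

using Polly;
using Polly.Bulkhead;

// At most 100 requests execute concurrently, at most 100 more wait in the
// queue; anything beyond that is rejected with BulkheadRejectedException.
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 100,
    maxQueuingActions: 100);

public async Task ProcessRequest()
{
    try
    {
        await bulkhead.ExecuteAsync(() => ProcessRequestInternal());
    }
    catch (BulkheadRejectedException)
    {
        // The "bouncer" turned this request away: fail fast and cheap
        // instead of letting work pile up inside the server.
        throw;
    }
}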
First, on a load-testing bench, we squeeze everything we can out of the server to determine how many requests it can hold. Suppose we determined that it is 100. That is where we put the bulkhead.
The server will then let through only that number of requests; the rest will wait in the queue. It is wise to choose a slightly lower number so that there is some margin. I have no ready-made recommendation here, because it depends heavily on the context and the specific situation.
- If the server's resource consumption depends on the load in a stable way, this number can come close to the limit.
- If the environment is subject to load fluctuations, choose a more conservative number that accounts for the size of those fluctuations. Such fluctuations can have various causes; for example, a runtime with a GC typically shows small spikes of CPU load.
- If the server also runs periodic scheduled tasks, take that into account too. You could even build an adaptive bulkhead that calculates how many requests can be admitted simultaneously without degrading the server (but that is beyond the scope of this study).
Experiments with the queue
Take one last look at this post-mortem; we will not see it again.

This whole gray heap correlates unambiguously with the server crash. Gray is death for the server. Let's simply cut it off and see what happens. It seems some number of requests will just go home unserved. But how many?
100 inside, 100 outside

It turns out our server now lives a happy and busy life. It constantly works at full power. Of course, when a peak hits, it gets knocked off its stride, but not for long.
Inspired by this success, let's try to avoid turning anyone away at all by increasing the length of the queue.
100 inside, 500 outside

It got better, but a tail has grown: these are requests that complete much later, after a long wait.
100 inside, 1000 outside
Since things improved, let's push it to the point of absurdity and allow a queue 10 times longer than the number we can serve simultaneously:

In terms of the club-and-bouncer metaphor, this situation is hardly realistic: nobody wants to wait at the door longer than they will spend inside the club. And we should not pretend it is a normal situation for our system either.
It is better not to serve a client at all than to torment them on the website or in the mobile app, loading each screen for 30 seconds and ruining the company's reputation. It is better to tell a small share of customers honestly and immediately that we cannot serve them right now. Otherwise we end up serving all customers several times slower, and the graph shows that this situation persists for quite a while.
There is one more risk: other system components may not be designed for such server behavior, and, as we already know, Concurrency gets projected onto clients too.
So we return to the first option, "100 by 100", and think about how to scale our capacity.
Winner - 100 inside, 100 outside

¯\_(ツ)_/¯
With these parameters, the worst-case degradation in execution time is exactly 2 times the "face value", i.e. a 100% degradation in request execution time.
If your clients are sensitive to execution time (and this is usually true for both human clients and server clients), you can think about shortening the queue further. For example, set the queue length to some percentage of the internal Concurrency; then you know that, on average, response time will not degrade by more than that percentage.
In fact, we are not trying to build a queue; we are trying to protect ourselves from load fluctuations. Here, just as when determining the first bulkhead parameter (the number inside), it is useful to determine what load fluctuations the client can produce. Then we know, roughly speaking, in which cases we forgo the revenue of requests we could have served.
It is even more important to determine what Latency fluctuations the other components interacting with the server can withstand. Then we know that we are really squeezing the maximum out of the existing system without risking losing service completely.
Diagnosis and treatment
We are treating Uncontrolled Concurrency with Bulkhead Isolation.
This method, like the others discussed in this series of articles, is conveniently implemented with the Polly library.
The advantage of the method is that it becomes extremely difficult to destabilize an individual component of the system. The system's behavior becomes very predictable in terms of how long successful requests take, and the chances that a request completes successfully are much higher.
However, this does not solve every problem, for example the problem of insufficient server capacity. In that situation we deliberately decide to "drop ballast" when a load spike arrives that we have judged to be excessive.
Further measures that our study does not address may include, for example, dynamic scaling.