Two simple rules to prevent deadlocks on mutexes

Hello, dear Habrayuzer!

It so happened that this is the third post in the blog of our company, and, like the first two, it is devoted to issues of multi-threaded programming and the problems that arise. It turned out so by chance, because we have experienced in our own skin that situations that arise when writing multi-threaded programs are incredibly difficult to debug, since they are largely determined by the dynamics of the program on a specific hardware platform. I am sure that most programmers have come across a situation where a program that works fine on one computer, on the other, suddenly starts to slow down almost “out of the blue”.

When writing our previous posts, we made an unforgivable mistake by starting to consider deadlocks from complex, but, as it seemed to us, the most interesting cases. Unfortunately, the respected audience Habra did not appreciate this. We are in a hurry to improve, and, if not the first time, but still approach the issue of blocking from the very beginning. However, we still expect that you know why and what synchronization tools are needed when writing multi-threaded programs.
')
The recipes described in this post are quite simple, but not all even professional programmers use consciously, being guided rather by sensations “for some reason it seems that you should not capture that mutex from under this captured one”. So we lived for many, many years, when at one terribly crucial moment we didn’t find that our favorite software couldn’t live an hour without a deadlock on the “piece of hardware” that needs to be urgently sent to the customer. After killing several days of our leading programmers to solve this problem, we made a decision that changed our life — we took up the severe formalization of interlocking situations, including rigorous mathematical proofs of why this could be done, and this was not possible. I must say that our research has degenerated into a candidate thesis of one of the employees, but I’m not sure that the presentation format used there will be interesting here ...

So, about blocking from the very beginning ...

The nature of interlocks

We will not go into complex mathematically verified definitions of interlocking situations and let's just say: interlocking is a state of the system in which one thread is waiting for something to happen, and this is something that cannot happen because another thread is waiting for something to happen. then from the first stream.

Traditionally, it is assumed that the cause of locks is always mutexes (mutex), but this is not entirely accurate. Blocks can be caused by any synchronization tools and mechanisms that imply waiting for one thread from another, for example, waiting for a signal on a conditional variable or, much less obviously, waiting for another thread to wait (join / join thread). . In theory, in fact, the second case is the same “waiting for a signal”, however, due to the implicitness of this synchronization operation, when searching for deadlocks, it is simply forgotten as a potential source of threat and is often not noticed in the code.

Before proceeding to the consideration of specific situations of the simplest interlocks, I would like to say a few words about the model that we use to describe multi-threaded programs and the situations that arise in them.

A few words about the model of multi-threaded programs

We called this model “the transition model” and it represents a set of directed graphs, where each graph is a stream (subject). Each graph has one initial vertex corresponding to the state when no synchronization tool is involved yet, and has one final vertex corresponding to the state when none synchronization tool is already activated. It is assumed that when the end point is reached, the flow automatically starts over. All other vertices of the graphs represent an operation with respect to one or another synchronization tool, for example, L (lock) - mutex capture, U (unlock) - mutex release, etc. For the proofs of the statements, it is important that the model ignores the time between the execution of separate operations with respect to the means of synchronization, thereby expanding the possible range of speakers to infinity. If the audience is interested in the mathematical and physical essence of the model, then I am ready to write a separate post on this topic, but here ... just the beginning of a long but interesting story about multi-threaded programming.

An example of a single stream (subject) model:

Picture 1.

According to this picture, a subject can go through two branches: 0, L1, U1, 0 or 0, L1, L2, U2, U1, 0. This scheme can be considered as a state machine, the grammar of which includes two phrases {0, L1, U1, 0} and {0, L1, L2, U2, U1, 0}. We assume that the transition time between actions with respect to means of synchronization is, of course, i.e. algorithmically correct. We will not consider the synchronization error to capture and hold a mutex while waiting for any user action that could potentially never occur.
In order to study a program for the potential occurrence of a situation of mutual blocking, it is necessary to make chains for the implementation of all possible subjects in the program.

The simplest interlock with mutexes

Suppose that in our program, in addition to the stream (subject) shown in Figure 1, there is one more:

Figure 2.

I dare to say that our program has a potential deadlock and even if your testers claim that everything works fine, then you are not immune to the situation, that on another “hardware” your program will behave like this:

Figure 3.

Nothing prevents our two independent threads from running like this: thread 1 managed to capture mutex 1, then the scheduler switched execution to thread 2 that captured mutex 2, and after that both our threads try to capture the mutexes that are already captured - deadlock!

Let us call the system of flows (subjects) incompatible if there is at least one variant of imposing chains of their execution, in which a situation of mutual blocking occurs. Accordingly, a system of subjects is compatible for which there is no dynamics, in which a situation of mutual blocking is possible.
The two subjects considered are incompatible, since there is an overlap option, shown in Figure 3, in which a deadlock situation occurs. Note that such subjects will not necessarily lead to blocking (freezing) of the program. The dynamics of a particular work (the time of transitions between calls to synchronization tools) may be such that the found overlay variant will never manifest itself in reality. In any case, the software code described by such a model is potentially dangerous and blocking may occur when porting to another software or hardware platform, or simply when the operating conditions change.

Figure 4

It is important to remember that threads can be pairwise compatible with each other, but not compatible in a larger combination. Figure 5 shows the program model, where all subjects are pairwise compatible with each other, but all together lead to a situation of mutual blocking.

Figure 5

Figure 6 shows another version of the alignment of chains of execution for the model shown in Figure 5, in which a deadlock occurs.

Figure 6

Just and I want to say: How scary to live!

And this would be true if there were not two very simple rules that professional programmers intuitively follow, often without even wondering why they do that.

First rule

Always release the captured mutexes in the reverse order of the capture, i.e. follow the logic of the "first captured - the last released . "

Second rule

Always follow the same mutex capture order.
If you are capturing mutex 1 in one thread, and then mutex 2, then it is unacceptable to capture them in a different order in another thread.
In fact, the rule is not as simple as it seems at first glance. Look carefully again at Figure 6. This rule is violated there, but this is somewhat unclear. Looking at the first stream, we fix that we capture mutex 2 after mutex 1. Looking at the second stream, we fix that we capture mutex 3 after mutex 2. Combining these observations means that we capture mutex 3 after mutex 1, which does not executed in thread 3. The result of this failure is a deadlock, which is shown in the figure.

Despite the simplicity of these rules, I dare to say that if your program uses only mutexes as a means of synchronization and you follow these rules, then you are not afraid of mutual locks! We, of course, do not take into account the situation of an attempt to double capture a non-recursive mutex in one stream, but this is more stupid than deadlock.

In conclusion, I would like to ask the question: is this topic interesting for the Habr audience? If so, then there is the desire and opportunity to talk about a little more interesting situations that arise when using signal variables and, perhaps, a little deeper plunge into the evidence base - this is already for sophisticated programmers of multi-threaded applications.

Hope this post was helpful.

Read the sequel - Recipes against interlocks on signal variables .

Source: https://habr.com/ru/post/176541/

All Articles