We struggle with deadlock: pattern unlocked callbacks

Interlock situations

Wikipedia gives the following definition of interlocking: "Mutual blocking (eng. Deadlock) is a situation in a multitasking environment or a DBMS, in which several processes are in a state of endless waiting for resources occupied by these processes themselves."

Interlocks are usually dynamic in nature: their manifestation depends on such factors as user actions, availability of network services, positioning of the hard disk head, switching tasks in a system with preemptive multitasking, etc.

A classic example of interlocking: the first thread (A) captures the M1 mutex and then the M2 mutex. The second thread (B) captures the M2 mutex, and after that the M1 mutex. Mutual blocking of these two streams can occur as follows: flow A captures M1, flow B captures M2, then both flows are "doomed": neither flow A can capture M2, nor flow B can capture M1; attempts to capture mutexes will block both threads.
')
The described mutual blocking will occur only if both threads manage to capture exactly one mutex. Otherwise, the threads will continue.

This situation is very common in complex multi-threaded systems. As a rule, participating mutexes are located far from each other (in various components of the system), and it turns out to be quite difficult to identify participants of interlocking.

Common situation in the real system

One of the particular cases of the interlocking described above is as follows:

some object (Worker) performs some activity in a separate thread; at the beginning of its existence, the Worker object creates this thread, and terminates it itself before its destruction;
the flow activity consists in cyclically repeating certain actions (for example, retrieving data from the network);
the results of these individual actions are provided to consumers via callback functions or interfaces;
at any time during the life of the Worker object, a new consumer may appear or the existing one disappear; The Worker object maintains a consumer list and provides functions for registering and deleting consumers (registerCallback and unregisterCallback, respectively);
the consumer list is a shared resource: it is accessed by the internal flow of the Worker object, as well as by consumers who make registerCallback and unregisterCallback calls from the context of their own flow;
since the consumer list is a shared resource, it is protected by a mutex owned by the Worker object; with a captured mutex, the consumer list is modified; with a captured mutex, data is transmitted to consumers from the Worker object.

How dangerous is the situation described above? By the fact that during the development phase of the Worker object, the developer did not yet know exactly which functions of the system would be called through the callback interface. He only made demands on the interface: the function must have such and such parameters through which such and such data will be transmitted. And this call “in an unknown direction” is made with a captured mutex.

It is enough to add a few strokes to the picture described (this happens in complex systems):

inside one of the Worker object's consumers, a mutex M is captured (and then released, of course);
another system object with a captured mutex M registers itself as a consumer of the Worker object.

Everything. It turned out a situation in which interlocking is possible. It will not occur in 100% of cases (a certain dynamics is necessary so that each of the participating threads has time to capture only one mutex), and this significantly complicates the search for such errors.

The following are two ways to solve this problem.

Method 1: change the lock order

The Worker object will provide separate functions for blocking and unlocking its internal mutex, and the user will be registered as follows:

First, the internal mutex of the Worker object will be blocked (foreseen).
After that, the consumer will perform actions that require the capture of the mutex M.
The consumer will register with the Worker object; no internal mutex is captured inside the Worker object.

The disadvantages of this method are obvious:

each consumer has the burden of preliminary actions before working with the Worker object; the probability of error on the part of the programmer increases;
this is not always possible at all: what if a secure access to the pointer to the Worker object was provided with the help of the M mutex?
"Gut out" - it's just ugly.

Method 2: Do not block the mutex when transferring data to consumers

This method sounds promising: if data is transferred from the internal flow of the Worker object to consumers when the mutexes are not locked, this will correct all possible deadlocks when working with the Worker object.

Why not just make a callback when the mutex is not captured? Because the flow of the Worker object must go through the list of registered consumers and call the interface function of each consumer. If the list is not protected by the mutex, and the contents of the list change during this cycle, it is likely that the program will loop or even crash due to incorrect memory access.

Why not make a copy of the list of consumers (when creating a copy, capture a mutex), and then go through a cycle on the copy? Because you need to guarantee the consumer that after calling unregisterCallback data will not be transferred to him. If the consumer calls unregisterCallback from his destructor, the subsequent transfer of data to the callback interface of this consumer will cause the program to crash.

Thus, we almost came to the decision:

Worker's object flow must create a copy of the list of consumers before calling interfaces of registered consumers; a copy is created when a captured mutex is captured; to transfer data to consumers, the thread makes a cycle on a copy of the list;
it must be ensured that after the completion of the unregisterCallback call, no data will be passed to the consumer; Since a copy of the list of consumers is created each time a call is received from a stream to consumers, the risk of a “late call” exists only once, if unregisterCallback is called during the passage through the copy of the list of consumers;
and thus, it is necessary to somehow identify the state when the flow of the Worker object transmits data to consumers and works with a copy of the list of consumers; if unregisterCallback is called when the stream is in this state, the return from unregisterCallback must be delayed until the data transfer to consumers is completed.

Here is a turnkey solution. To implement it, you need another synchronization object - the “condition variable” (eng. Condition variable):

Worker object contains:
- consumer list;
- a flag indicating that the stream is currently transmitting data to consumers;
- Mutex for differentiation of access to the list of consumers and the flag;
- conditional variable for organizing the expectation and awakening of the consumer that caused the unregisterCallback "at the wrong time";
initially the flag is initialized to false;
registerCallback captures the mutex, adds a consumer to the list and releases the mutex;
the stream before the data transfer cycle to consumers performs the following actions:
- captures the mutex;
- creates a copy of the list of consumers; a copy is created on the thread stack;
- sets the flag to true;
- frees up mutex;
the stream transfers data to consumers, passing in a cycle over a copy of the list of consumers; the mutex is not captured, but the flag is set;
after transmitting data to consumers, the stream performs the following actions:
- captures the mutex;
- resets the flag to false;
- using a condition variable wakes up all pending threads;
- frees up mutex;
and - the most interesting - unregisterCallback:
- captures the mutex;
- removes a consumer from the list;
- as long as the flag is set, waits on a condition variable; atomically with a transition to the idle state, the mutex is released, and atomically with awakening, the mutex is captured;
- frees up mutex.

Important note: if unregisterCallback can be called from the callback interface implemented by the consumer, then the described algorithm will result in a 100% hang inside unregisterCallback. This is easily solved: if unregisterCallback is called in the context of the internal flow of the Worker object, there is no need to check the flag and wait for the change of the condition variable.

Implementation using the Qt library synchronization tools

Header file:

class ICallback { public: virtual void dataReady(QByteArray data) = 0; }; class Worker : public QThread { public: Worker(); void registerCallback(ICallback *callback); void unregisterCallback(ICallback *callback); protected: virtual void run(); private: QMutex _mutex; QWaitCondition _wait; bool _callingNow; QLinkedList<ICallback *> _callbacks; };

Implementation:

 Worker::Worker() : QThread(), _mutex(QMutex::NonRecursive), _callingNow(false) { ... } void Worker::registerCallback(ICallback *callback) { QMutexLocker locker(&_mutex); _callbacks.append(callback); } void Worker::unregisterCallback(ICallback *callback) { QMutexLocker locker(&_mutex); _callbacks.removeOne(callback); if(QThread::currentThread()!=this) { while(_callingNow) _wait.wait(&_mutex); } } void Worker::run() { while(...) { QByteArray data; ... QLinkedList<ICallback *> callbacksCopy; _mutex.lock(); _callingNow=true; callbacksCopy=_callbacks; _mutex.unlock(); for(QLinkedList<Callback *>::const_iterator it=callbacksCopy.begin(); it!=callbacksCopy.end(); ++it) { (*it)->dataReady(data); } _mutex.lock(); _callingNow=false; _wait.wakeAll(); _mutex.unlock(); } }

Source: https://habr.com/ru/post/175401/

All Articles