Raymond Chen wrote an interesting series of blog posts about lockless synchronization, and I would like to share these notes with Habr readers. This post is an introduction to the series, compiled from three of Chen's older posts:
- Lazy initialization with built-in C++ tools
- Lockless synchronization
- Lock-free thread-safe lazy initialization
Initializing static local variables in C++ is not thread-safe, and deliberately so!
The specification states that static local variables (unlike global ones) are initialized the first time execution enters the block in which they are declared.
Try to spot the race condition yourself in this code fragment, which computes the result of some function “lazily”: on the first call the result is computed, and then it is saved for all subsequent calls.
int ComputeSomething()
{
static int cachedResult = ComputeSomethingSlowly();
return cachedResult;
}
(Approximately this code is recommended in the popular C++ FAQ, to avoid depending on the order in which the compiler initializes global static variables.)
The trouble is that the compiler turns the initialization of a static local variable into something like this:
int ComputeSomething()
{
static bool cachedResult_computed = false;
static int cachedResult;
if (!cachedResult_computed) {
cachedResult_computed = true;
cachedResult = ComputeSomethingSlowly();
}
return cachedResult;
}
Now the race condition is visible to the naked eye. Imagine two threads calling ComputeSomething() at the same time. The first thread gets only as far as cachedResult_computed = true before the system switches to the second thread. The second thread now sees the cachedResult_computed flag set, skips the initialization, and reads the uninitialized value.
And this is not a compiler bug: this behavior was prescribed by the C++ standard. (Later, in the TC1 revision, that requirement was removed, leaving the behavior undefined in the case of concurrent initialization.)
Multithreaded initialization of static local variables can lead to even more serious problems:
class Something { ... };
int ComputeSomething()
{
static Something s;
return s.ComputeIt();
}
This code unfolds into:
class Something { ... };
int ComputeSomething()
{
static bool s_constructed = false;
static uninitialized Something s; // pseudocode: raw, unconstructed storage
if (!s_constructed) {
s_constructed = true;
new(&s) Something; // construct s in place
atexit(DestructS);
}
return s.ComputeIt();
}
// destroys s at program exit
void DestructS()
{
ComputeSomething::s.~Something(); // pseudocode: refers to the function-local s
}
Now there are several possible race conditions. As before, one thread can get ahead of the other and grab the value of s before it is initialized. But besides that, if the first thread managed to check s_constructed but had not yet set it to true, the object s will be constructed twice and destroyed twice. (More precisely, one object will “leak”, and the other will be destroyed twice.) No joke!
But that's not all. Look what happens when there are two static local variables with initializers:
int ComputeSomething()
{
static Something s(0);
static Something t(1);
return s.ComputeIt() + t.ComputeIt();
}
The compiler, to save memory, packs both flags into a single variable:
class Something { ... };
int ComputeSomething()
{
static char constructed = 0;
static uninitialized Something s;
if (!(constructed & 1)) {
constructed |= 1;
new(&s) Something; // construct s in place
atexit(DestructS);
}
static uninitialized Something t;
if (!(constructed & 2)) {
constructed |= 2;
new(&t) Something; // construct t in place
atexit(DestructT);
}
return s.ComputeIt() + t.ComputeIt();
}
Great: now several threads perform logical operations on a shared variable without any synchronization. Let's see what happens when one thread attempts constructed |= 1 at the same time as another thread performs constructed |= 2.
x86/x64 processors have instructions that operate directly on an operand in memory. The compiler will generate machine code like
or constructed, 1
...
or constructed, 2
without lock prefixes, so on a multiprocessor machine both operations may manage to read the same old value, and one of the set bits will be “lost”.
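For illustration, here is a minimal sketch of how such a flag update could be made atomic with the Win32 InterlockedOr function (my illustration; the compiler does not, of course, generate this by itself):
#include <windows.h>

static volatile LONG constructed = 0;

void SetConstructedBit(LONG bit)
{
    // Atomic read-modify-write: InterlockedOr compiles to a lock-prefixed
    // OR on x86/x64, so two concurrent updates cannot lose a bit.
    InterlockedOr(&constructed, bit);
}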
On ia64/Alpha processors, logical instructions operate only on registers, so this can happen even on a single-processor machine. The compiler splits each |= operation into three machine instructions:
ldl t1,0(a0) ; load the old value
bis t1,1,t1 ; set the bit
stl t1,0(a0) ; store the new value
If the thread is interrupted between reading the old value and writing the new one, the new value can “wipe out” changes made by other threads in the meantime.
Perhaps the initialization of static variables should be protected with a critical section:
int ComputeSomething()
{
EnterCriticalSection(...);
static int cachedResult = ComputeSomethingSlowly();
LeaveCriticalSection(...);
return cachedResult;
}
Now initialization can be performed only by one thread at a time.
But what if ComputeSomething() is called again from the same thread? (“We've traced the call; it's coming from inside the thread!”) For example, what if ComputeSomethingSlowly() explicitly or implicitly calls ComputeSomething()? (Sounds far-fetched? Then imagine that ComputeSomethingSlowly() shows a dialog box, and the message loop, via DispatchMessage(), calls an event handler, and that handler uses ComputeSomething().)
In that case the calling thread already owns the critical section, enters it again without any difficulty, and reads the same uninitialized value.
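Roughly speaking, if the compiler sets the flag before running the initializer (as in the expansion shown earlier), the protected version behaves like this sketch (g_cs is a hypothetical critical section; this is not actual compiler output):
int ComputeSomething()
{
    EnterCriticalSection(&g_cs); // re-entry by the owning thread succeeds
    static bool cachedResult_computed = false;
    static int cachedResult;
    if (!cachedResult_computed) {
        cachedResult_computed = true;            // the flag is set first...
        cachedResult = ComputeSomethingSlowly(); // ...and this call may re-enter
                                                 // ComputeSomething(), see the flag
                                                 // already set, and return the
                                                 // still-uninitialized cachedResult
    }
    LeaveCriticalSection(&g_cs);
    return cachedResult;
}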
So, when you see a static local variable initialized at run time, be on your guard.
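One conventional way out of this whole class of problems, sketched here as an aside (Chen describes a similar pattern elsewhere in the series; memory-ordering subtleties, discussed at the end of this post, are glossed over), is to construct the object explicitly and publish the pointer with an interlocked operation:
#include <windows.h>

Something* g_pSomething = nullptr; // nullptr means "not yet constructed"

Something* GetSomething()
{
    if (g_pSomething == nullptr) {
        Something* candidate = new Something; // may run in several threads at once
        // Publish our candidate only if nobody beat us to it;
        // otherwise discard ours and use the winner's object.
        if (InterlockedCompareExchangePointer(
                reinterpret_cast<PVOID volatile*>(&g_pSomething),
                candidate, nullptr) != nullptr) {
            delete candidate;
        }
    }
    return g_pSomething;
}
The occasional extra Something is constructed and immediately thrown away, so this only works when the constructor has no side effects.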
InterlockedXxx operations are an efficient implementation of atomic operations on 32-bit values and pointers. But atomicity alone does not guarantee thread safety.
Suppose you have a shared variable protected by a critical section, and somewhere in another function you want to increment that variable. “Great, I can do without the critical section and just call InterlockedIncrement.”
But wasn't the whole point of the critical section to guarantee that nobody changes the variable while the protected code is executing? And you have just gone and “quietly” changed it.
Another misconception is implementing compound atomic operations with a critical section. For example, here is how people try to implement InterlockedMultiply:
// Wrong!
LONG InterlockedMultiply(volatile LONG *plMultiplicand, LONG lMultiplier)
{
EnterCriticalSection(&SomeCriticalSection);
LONG lResult = *plMultiplicand *= lMultiplier;
LeaveCriticalSection(&SomeCriticalSection);
return lResult;
}
Yes, this code prevents two InterlockedMultiply calls from conflicting with each other, but it does not guarantee atomicity with respect to other operations on the variable. Take this example:
int x = 2;
Thread1()
{
InterlockedIncrement(&x);
}
Thread2()
{
InterlockedMultiply(&x, 5);
}
If the multiplication were atomic, the two operations together could produce x = 15 or x = 11, depending on which operation manages to execute first. But our “atomic” multiplication allows other, rather unexpected, values:
| Thread 1 | Thread 2 |
|---|---|
| initial: x = 2 | |
| | InterlockedMultiply(&x, 5) |
| | EnterCriticalSection |
| | read x (reads 2) |
| InterlockedIncrement(&x); now x = 3 | |
| | multiply by 5 (result: 10) |
| | write x (writes 10) |
| | LeaveCriticalSection |
| result: x = 10 | |
Not so atomic after all! How do we fix it?
If the operation being performed is a “pure function” that does not depend on global data, you can implement it atomically with InterlockedCompareExchange:
LONG InterlockedMultiply(volatile LONG *plMultiplicand, LONG lMultiplier)
{
LONG lOriginal, lResult;
do {
lOriginal = *plMultiplicand;
lResult = lOriginal * lMultiplier;
} while (InterlockedCompareExchange(plMultiplicand,
lResult, lOriginal) != lOriginal);
return lResult;
}
The operation consists of three steps.
The first is the “capture” of parameter values:
lOriginal = *plMultiplicand;
The second is to perform an operation on the captured values:
lResult = lOriginal * lMultiplier;
The third is to store the result, but only if the parameter values have not changed:
InterlockedCompareExchange(plMultiplicand, lResult, lOriginal)
If they have changed, some other operation got ahead of ours, and our operation must be performed again.
Returning to the example above: after an intervening InterlockedIncrement call, the multiplication will notice that the value of x has changed and will retry with the new value. The atomicity of the compare in InterlockedCompareExchange guarantees that the result is stored only if the parameters have not changed.
Do not forget that the operation must not use any other data! InterlockedCompareExchange cannot check whether the global variables used by the operation have changed. Otherwise a situation known as the “ABA problem” becomes possible: the first time the operation ran, the parameter's value was A; then, unnoticed by the running thread, it changed to B, and then back to A. InterlockedCompareExchange will not notice that anything changed and will store the value computed the first time around. If other data affecting the result changed along with the parameter, the stored result will be wrong.
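A classic illustration of the ABA problem (my own example, not from Chen's posts; Win32 declarations from windows.h assumed, as before) is popping from a naive lock-free stack:
struct Node { Node* next; int value; };

Node* volatile g_top; // top of a naive lock-free stack

int UnsafePop()
{
    for (;;) {
        Node* top = g_top;      // capture: suppose we read node A
        Node* next = top->next; // at this moment A->next is node B
        // Suppose another thread now pops A, pops and frees B,
        // then pushes A back: g_top equals A again...
        if (InterlockedCompareExchangePointer(
                reinterpret_cast<PVOID volatile*>(&g_top),
                next, top) == top) {
            // ...so the compare sees "A == A" and succeeds, yet 'next'
            // is stale: the freed node B has become the top of the stack.
            int value = top->value;
            delete top;
            return value;
        }
    }
}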
If all operations on a shared variable can be implemented atomically, it does not need to be protected by a critical section at all; the gain is in performance, in scalability, and in the consumption of system resources. This approach to designing multithreaded programs is called “lockless synchronization.” Its main advantage is this: if one thread acquires a lock (critical section, mutex, semaphore, etc.) and then “gets stuck” for one reason or another, all the threads using that lock get stuck behind it. With lockless synchronization, no “stuck” thread can stop the entire system: no matter how many threads get stuck, at least one of the remaining ones will keep making progress. Lockless synchronization is required in real-time systems, where task completion time must not depend on the vagaries of how the operating system allots processor time to each thread.
Suppose we decide to lazily initialize a pair of variables. In a single-threaded program this poses no difficulty:
// invariants for a and b: a ≥ 0 and b ≥ a.
// the value -1 means
// "not yet initialized"
int a = -1, b = -1;
void LazyInitialize()
{
if (a != -1) return; // already initialized
a = calculate_nominal_a();
b = calculate_nominal_b();
// enforce the invariants
a = max(0, a);
b = max(a, b);
}
As in the examples above, in a multithreaded program such initialization can lead to a race condition:
| Thread 1 | Thread 2 |
|---|---|
| if (a != -1) [not taken] | |
| a = calculate_nominal_a(); // returns 2 | |
| | if (a != -1) return; // too early! |
The first thread was interrupted before it could compute b, so the second thread picks up its uninitialized value.
“Ah, that's easy to fix! To find out whether initialization is done, we'll check b instead of a.”
void LazyInitialize()
{
if (b != -1) return; // already initialized
a = calculate_nominal_a();
b = calculate_nominal_b();
// enforce the invariants
a = max(0, a);
b = max(a, b);
}
A race condition is still possible:
| Thread 1 | Thread 2 |
|---|---|
| if (b != -1) [not taken] | |
| a = calculate_nominal_a(); // returns 2 | |
| b = calculate_nominal_b(); // returns 1 | |
| | if (b != -1) return; // too early! |
The second thread picks up values of a and b that do not satisfy the stated invariants.
“You can guard against that, too! I'll compute the values of a and b in local variables and write them to the globals only once the computation is complete, so the second thread can't see “half-ready” values.”
void LazyInitialize()
{
if (b != -1) return; // already initialized
// compute in local variables
int temp_a = calculate_nominal_a();
int temp_b = calculate_nominal_b();
// enforce the invariants
temp_a = max(0, temp_a);
temp_b = max(temp_a, temp_b);
// publish the results
a = temp_a;
b = temp_b;
}
Almost perfect, but a race condition is still possible:
| Thread 1 | Thread 2 |
|---|---|
| if (b != -1) [not taken] | |
| temp_a = calculate_nominal_a(); // returns 2 | |
| temp_b = calculate_nominal_b(); // returns 1 | |
| temp_a = max(0, temp_a); // temp_a = 2 | |
| temp_b = max(temp_a, temp_b); // temp_b = 2 | |
| a = temp_a; // written to the processor cache | |
| b = temp_b; // written to the processor cache | |
| the value of b reaches memory | |
| | if (b != -1) return; // too early! |
| the value of a reaches memory | |
The compiler does not guarantee that the new value of b becomes visible to other processors before the new value of a. Even though the processor performs the assignments in that order, the memory subsystem of a multiprocessor machine may commit them in the reverse order, so another processor can see a = -1 and b = 2.
The write of b to memory must be issued as a release operation: then all preceding writes are guaranteed to be reflected in memory before other processors see the new value of b.
void LazyInitialize()
{
if (b != -1) return; // already initialized
// compute in local variables
int temp_a = calculate_nominal_a();
int temp_b = calculate_nominal_b();
// enforce the invariants
temp_a = max(0, temp_a);
temp_b = max(temp_a, temp_b);
// publish the results
a = temp_a;
// the write of b must happen only if b was still uninitialized,
// and must be performed with Release semantics
InterlockedCompareExchangeRelease(
 reinterpret_cast<LONG*>(&b), temp_b, -1);
}
The resulting code resembles a lock-free atomic operation, only without input parameters: first all computations are performed in local variables, and then the result is committed to memory in a single InterlockedCompareExchangeRelease operation. Assuming that calculate_nominal_a() and calculate_nominal_b() always return the same values, we do not even need to retry on failure: if the variables were not initialized by us, the other thread has already done our work for us. (Note: even if InterlockedCompareExchangeRelease fails, the variable a will still have been overwritten by our thread!)
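As a modern aside (my addition, not part of the original posts): C++11 atomics let you state the required ordering explicitly on both the writer and the reader side. A minimal sketch of the same idea, assuming calculate_nominal_a() and calculate_nominal_b() are defined elsewhere:
#include <algorithm>
#include <atomic>

int calculate_nominal_a(); // assumed defined elsewhere
int calculate_nominal_b(); // assumed defined elsewhere

int a = -1;
std::atomic<int> b{-1}; // -1 still means "not yet initialized"

void LazyInitialize()
{
    // acquire: if we see b already published, the write to 'a' is visible too
    if (b.load(std::memory_order_acquire) != -1) return;

    int temp_a = std::max(0, calculate_nominal_a());
    int temp_b = std::max(temp_a, calculate_nominal_b());

    a = temp_a; // as in Chen's version, several threads may write the same value
    // release: the write to 'a' becomes visible before any thread observes
    // the new value of b; publish only if b was still -1
    int expected = -1;
    b.compare_exchange_strong(expected, temp_b,
                              std::memory_order_release,
                              std::memory_order_relaxed);
}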
Finally, Raymond posed a riddle: can the temp_a variable be dispensed with? After all, the value of a is only used after a thread has made sure that initialization is complete.