
The construction of lock-free data structures rests on two pillars: atomic operations and ways of ordering memory accesses. This article focuses on atomicity and atomic primitives.
Announcement. Thank you for the warm welcome! I see that the lock-free theme is interesting to the Habr community, and that makes me happy. I had planned to build the series along academic lines, moving smoothly from fundamentals to algorithms and illustrating the text along the way with code from libcds. But some readers demand a show: to be told without delay how to use the library, without beating around the bush. I agree, there is a point to that. Ultimately, I am not that interested in what is inside boost either: just describe how to use it! So I will divide my epic series into three parts: Basics, Inside and Outside. Each article of the epic will belong to one of these parts. Basics will cover low-level things, down to the structure of modern processors; this is the part for those who, like me, enjoy such details. Inside will highlight interesting algorithms and approaches of the lock-free world; it is more of a theory of how to implement a lock-free data structure, with libcds as an inexhaustible source of C++ code. Outside will contain articles on the practice of using libcds: software solutions, tips and FAQ. Outside will feed on your questions, comments and suggestions, dear readers.
In the meantime, while I frantically prepare the beginning of Outside, here is the first part of Basics. This article is largely not about C++ (although about it too) and not even about lock-free as such (although lock-free algorithms are unworkable without atomic primitives), but about the implementation of atomic primitives in modern processors and about the basic problems that arise when using such primitives.
Atomicity is the first circle of hell, the first of two low-level circles.
Atomic operations are divided into simple ones (read and write) and atomic read-modify-write (RMW) operations. An atomic operation can be defined as an indivisible operation: it has either happened completely or not at all; we cannot observe any intermediate stage of its execution, any partial effect. In this sense, even simple read/write operations can, in principle, be non-atomic; for example, reading unaligned data is non-atomic: on the x86 architecture such a read results in an internal exception that makes the processor read the data in parts, while on other architectures (SPARC, Intel Itanium) it results in a fatal error (segmentation fault), which can in principle be intercepted and handled, but atomicity is out of the question there. Modern processors guarantee atomic read/write only for aligned simple (integral) types: integers and pointers. A modern compiler guarantees correct alignment of variables of the basic types, although it is not hard to construct a counterexample, for instance by casting a misaligned address to an integer type. If you want to work atomically with a structure (4-8 bytes in size), you should take care of proper alignment yourself using compiler directives (attributes); each compiler supports its own unique way of specifying data/type alignment. By the way, the libcds library contains a set of auxiliary types and macros that hide the compiler-dependent part when declaring aligned data.
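For instance, a portable C++11 way to get such alignment is the alignas specifier; here is a minimal sketch (the struct and field names are mine, and on pre-C++11 compilers you would use __declspec(align(8)) or __attribute__((aligned(8))) instead):

    #include <atomic>
    #include <cstdint>

    // Force 8-byte alignment so that the 8-byte struct can be treated
    // as a single machine word.
    struct alignas( 8 ) Pair {
        int32_t nLo;
        int32_t nHi;
    };

    std::atomic<Pair> g_Pair;  // trivially copyable + aligned: atomic access is possible

Whether std::atomic<Pair> is actually lock-free on a given platform can be checked with g_Pair.is_lock_free().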
Compare-and-swap
It is rather difficult, if at all possible, to invent an algorithm for a lock-free container using only reads/writes (I know of no such data structures for an arbitrary number of threads). Therefore, processor architects have implemented RMW operations that atomically read an aligned memory cell and write a new value to it: compare-and-swap (CAS), fetch-and-add (FAA), test-and-set (TAS) and others. In academia, compare-and-swap (CAS) is considered the base operation; its pseudocode is simple:
    bool CAS( int * pAddr, int nExpected, int nNew )
    atomically {
        if ( *pAddr == nExpected ) {
            *pAddr = nNew ;
            return true ;
        }
        else
            return false ;
    }
In words: if the current value of the variable at address pAddr equals the expected nExpected, assign the variable the value nNew and return true; otherwise return false and leave the value of the variable unchanged. All this is done atomically, that is, indivisibly and without any visible partial effects. All other RMW operations can be expressed through CAS; for example, fetch-and-add would look like this:
    int FAA( int * pAddr, int nIncr )
    {
        int ncur ;
        do {
            ncur = *pAddr ;
        } while ( !CAS( pAddr, ncur, ncur + nIncr )) ;
        return ncur ;
    }
The "academic" version of the CAS operation is not always convenient in practice: often, when CAS fails, we want to know what value is in the cell now. Therefore another CAS variant is often considered (the so-called valued CAS, also performed atomically):

    int CAS( int * pAddr, int nExpected, int nNew )
    atomically {
        if ( *pAddr == nExpected ) {
            *pAddr = nNew ;
            return nExpected ;
        }
        else
            return *pAddr ;
    }
In C++11, the compare_exchange function (strictly speaking, C++11 has no such function; it has the variants compare_exchange_strong and compare_exchange_weak, which I will discuss later) covers both of these options:

    bool compare_exchange( int volatile * pAddr, int& nExpected, int nNew );
The nExpected argument is passed by reference: on input it contains the expected value of the variable at address pAddr, and on output it contains the value before the change. The function returns a success flag: true if the value at the address equals nExpected (in which case it is changed to nNew), false on failure (in which case nExpected receives the current value of the variable at address pAddr). This universal construction of the CAS operation covers both variants of the "academic" CAS definition, but in practice using compare_exchange is fraught with subtle errors, since one must keep in mind that the nExpected argument is passed by reference and may be changed, which is not always acceptable.
Using compare_exchange, the fetch-and-add primitive shown above can be written like this:

    int FAA( int * pAddr, int nIncr )
    {
        int ncur = *pAddr;
        do {} while ( !compare_exchange( pAddr, ncur, ncur + nIncr )) ;
        return ncur ;
    }
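For comparison, here is the same loop against the real C++11 interface, as a minimal sketch over std::atomic. Note how the by-reference argument works in our favor here: on failure, compare_exchange_strong reloads ncur with the current value, so the loop body stays empty (in real code you would simply call the ready-made a.fetch_add( nIncr )):

    #include <atomic>

    int FAA( std::atomic<int>& a, int nIncr )
    {
        int ncur = a.load();
        // On failure compare_exchange_strong writes the current value
        // of 'a' back into ncur, so we just retry with the fresh value.
        while ( !a.compare_exchange_strong( ncur, ncur + nIncr ))
            ;
        return ncur;
    }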
ABA problem
The CAS primitive is good in every way, but its use can give rise to a serious problem known as the ABA problem. To describe it, consider a typical pattern of CAS usage:

    int nCur = *pAddr ;
    while (true) {
        int nNew = /* compute the new value from nCur */ ;
        if ( compare_exchange( pAddr, nCur, nNew ))
            break;
    }
In effect, we spin in this loop until the CAS succeeds; the loop is necessary because between reading the current value of the variable at address pAddr and computing the new value nNew, the variable at pAddr may have been changed by another thread.

The ABA problem is described as follows: suppose thread X reads the value A from some shared cell containing a pointer to some data. Then another thread, Y, changes the cell's value to B and then back to A, but now the pointer A points to different data. When thread X tries to change the cell's value using the CAS primitive, the comparison with the previously read value A succeeds and the CAS reports success, yet A now points to completely different data! As a result, the thread may break the internal invariants of the object, which in turn can lead to a crash.
Here is an implementation of a lock-free stack that is subject to the ABA problem [Mic04]:

    // Shared variables
    static NodeType * Top = NULL; // Initially null

    Push(NodeType * node) {
        NodeType * t ;
        do {
            t = Top;
            node->Next = t;
        } while ( !CAS(&Top, t, node) );
    }

    NodeType * Pop() {
        NodeType * t ;
        NodeType * next ;
        do {
            t = Top;
            if ( t == NULL )
                return NULL;
            next = t->Next;
        } while ( !CAS(&Top, t, next) );
        return t;
    }
The following sequence of actions leads to a violation of the stack structure (note that it is not the only sequence illustrating the ABA problem):

Thread X | Thread Y
--- | ---
Calls Pop(). Executes up to the CAS: t == A, next == A->Next |
 | NodeType * pTop = Pop(): pTop == the top of the stack, i.e. A. Calls Pop() once more, then Push( pTop ). The top of the stack is again A; note that A->Next has changed
Executes the CAS. It succeeds, but Top is assigned the value next, which no longer exists in the stack: thread Y popped both A and A->Next off the stack, and the local variable next holds the old value of A->Next |
The ABA problem is the scourge of all lock-free containers based on the CAS primitive. It can arise only in multithreaded code, as a result of element A being removed from the container and replaced by another element (B), and then by A again (hence the name "ABA problem"), while some other thread holds a pointer to the element being removed. Even if a thread physically deletes A (delete A) and then calls new to create a new element, there is no guarantee that the allocator will not return the address of A (good allocators will do exactly that). The problem often manifests itself in more sophisticated ways than described above, affects not two but more threads (in this sense one can speak of the ABCBA problem, the ABABA problem and so on, as far as imagination reaches), and identifying it is always a nontrivial task. The remedy for the ABA problem is deferred physical removal (deferred deallocation, or safe memory reclamation) of an element at the moment when it is fully guaranteed that no one (none of the competing threads) holds any local or global references to the element being removed.
Thus, deleting an element from a lock-free structure is two-phase (see the sketch after this list):
- The first phase is the exclusion of the element from the lock-free container;
- The second (deferred) phase is the physical removal of the element once no references to it remain.
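Purely as an illustration of the two phases, here is a sketch of Pop for the stack above, with a hypothetical retire() hook standing in for a real reclamation scheme; ValueType, the value field and retire() are my placeholders, and establishing the "no references remain" guarantee is exactly what the schemes discussed later are for:

    // Sketch only: two-phase removal from the lock-free stack above.
    bool Pop( ValueType& dest ) {
        NodeType * t ;
        NodeType * next ;
        do {
            t = Top;
            if ( t == NULL )
                return false;
            next = t->Next;
        } while ( !CAS( &Top, t, next ));  // phase 1: exclude the node from the container

        dest = t->value;
        retire( t );  // phase 2 (deferred): the actual 'delete t' happens later,
                      // once no competing thread can still hold a reference to t
        return true;
    }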
I will discuss the various deferred-removal schemes in detail in one of the following articles.
Load-Linked / Store-Conditional
It was probably the existence of the ABA problem with CAS that prompted processor designers to look for other RMW operations not subject to it. And such operations were found: the load-linked/store-conditional (LL/SC) pair. These operations are very simple; here is their pseudocode:
    word LL( word * pAddr ) {
        return *pAddr ;
    }

    bool SC( word * pAddr, word New ) {
        if ( /* *pAddr has not changed since the LL( pAddr ) call */ ) {
            *pAddr = New ;
            return true ;
        }
        else
            return false ;
    }
The LL/SC pair works like operator brackets. The load-linked (LL) operation simply returns the current value of the variable at address pAddr. The store-conditional (SC) operation stores to the address pAddr previously read by LL, but only if the data at pAddr has not changed since that read. Here "changed" means any modification of the cache line containing the address pAddr. To implement the LL/SC pair, processor designers had to change the cache structure: roughly speaking, each cache line carries an extra status bit. An LL read sets this bit (link); any write to the cache line (or its eviction from the cache) clears it; and before storing, the SC operation checks whether the bit is still set for that cache line. If the bit is 1, nobody has changed the line: the value at address pAddr is changed to the new one and SC succeeds. Otherwise the operation fails and the value at address pAddr is not changed.
The CAS primitive can be implemented using the LL / SC pair like this:
    bool CAS( word * pAddr, word nExpected, word nNew ) {
        if ( LL( pAddr ) == nExpected )
            return SC( pAddr, nNew ) ;
        return false ;
    }
Note that despite the multi-step nature of this code, it performs an atomic CAS: the contents of the target memory cell either do not change at all or change atomically. Implementing an LL/SC pair on architectures that support only the CAS primitive is possible, but far less trivial; I will not cover it here and instead refer interested readers to [Mic04].
Modern processor architectures fall into two large camps: some support the CAS primitive in their instruction set, others the LL/SC pair. CAS is implemented in x86, Intel Itanium and SPARC; the primitive first appeared in the IBM System/370 architecture. The LL/SC pair is found in PowerPC, MIPS, Alpha and ARM; it was first proposed by DEC. It is worth noting that LL/SC is not implemented in modern architectures in the most ideal way: for example, nested LL/SC calls with different addresses are not allowed, spurious resets of the link flag are possible (the so-called false sharing, see below), and there is no primitive for validating the link flag inside an LL/SC pair.
From the C++ point of view, the C++11 standard does not consider an LL/SC pair; it describes only the atomic primitive compare_exchange (CAS) and the atomic primitives derived from it: fetch_add, fetch_sub, exchange, etc. The standard assumes that implementing CAS via LL/SC is fairly easy, whereas the reverse implementation, LL/SC via CAS, is very nontrivial. Therefore, so as not to complicate life for the developers of the standard C++ library, the standardization committee introduced only compare_exchange into C++, which is, in principle, sufficient for implementing lock-free algorithms.
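This, incidentally, is where the two C++11 variants mentioned earlier come from: on LL/SC architectures, compare_exchange_weak typically maps onto a single LL/SC pair and is therefore allowed to fail spuriously, for example when the link flag is reset by an unrelated write to the cache line, so it is intended for use in a loop. A minimal sketch (the function name is mine):

    #include <atomic>

    // Atomically stores max(a, val) into a: the canonical
    // compare_exchange_weak retry loop.
    void store_max( std::atomic<int>& a, int val )
    {
        int nCur = a.load();
        // On failure, real or spurious, nCur is reloaded with the
        // current value, and we simply retry.
        while ( nCur < val && !a.compare_exchange_weak( nCur, val ))
            ;
    }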
False sharing
In modern processors, the length L of a cache line is 64-128 bytes and tends to grow in newer models. Data is exchanged between main memory and the cache in blocks of L bytes (usually L is a power of 2). When even a single byte in a cache line changes, the whole line is considered invalid and requires synchronization with main memory. This is governed by the cache-coherence protocol of the multiprocessor/multicore architecture.
The MESI cache-coherence protocol. For example ([Cha05]), the MESI (Modified / Exclusive / Shared / Invalid) protocol, which maintains the coherence of Intel processor caches, marks a cache line as invalid on every write to a shared line, which leads to intensive memory traffic. When data is first loaded into a cache line, MESI marks the line as E (Exclusive), which lets the processor perform an unlimited number of reads and writes on it. If another processor needs the same data, which sits in the first processor's cache, MESI marks the cache line as S (Shared). Now any write instruction to such a line in any cache marks the line as M (Modified), which in turn marks the line as I (Invalid) in the caches of the other processors and, as a result, forces a load from main memory. I will not dwell on the MESI protocol here, as I plan, in one of the following notes, to translate a remarkable article on the internal organization of modern processors, where MESI will also be covered.

If distinct shared variables fall into the same cache line (that is, lie at adjacent addresses), then, from the processor's point of view, a change to one of them invalidates the others in the same line. This situation is called false sharing. For LL/SC primitives false sharing is destructive, since these primitives are implemented in terms of cache lines: the load-linked operation links the cache line, and store-conditional checks whether the link flag of that line has been reset; if it has, the store is not performed and SC returns false. Since the line length L is rather large, the SC primitive fails spuriously (that is, the link flag gets reset) on any change to the cache line that has nothing to do with the target data. As a result we can get a livelock: a situation in which the processors/cores are 100% busy but there is no progress.
The fight against false sharing is fairly simple (though wasteful): each shared variable should occupy a whole cache line of its own; padding is usually used for this:

    struct Foo {
        int volatile nShared1;
        char _padding1[64];     // padding for cache line = 64 byte
        int volatile nShared2;
        char _padding2[64];     // padding for cache line = 64 byte
    };
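In C++11 the same intent can be expressed with alignas instead of manual padding; a sketch, assuming a 64-byte cache line (C++17 later standardized the constant as std::hardware_destructive_interference_size):

    #include <atomic>

    // Each member starts on its own 64-byte boundary, so writes to one
    // cannot invalidate the cache line holding the other.
    struct Foo {
        alignas( 64 ) std::atomic<int> nShared1;
        alignas( 64 ) std::atomic<int> nShared2;
    };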
It should be noted that the physical structure of the cache affects all operations (including CAS), not only the LL/SC primitives. Some papers investigate data structures built specifically with the cache structure in mind (mainly the cache-line length); with a proper ("cache-aware") layout, performance can improve by an order of magnitude.
CAS Varieties
I will briefly dwell on two more useful atomic primitives: double-width CAS (double-word CAS, dwCAS) and double CAS (DCAS).
Double-width CAS is similar to ordinary CAS, but operates on a memory cell of twice the size: 64 bits on 32-bit architectures and 128 bits (more precisely, at least 96 bits) on 64-bit ones. On architectures that provide an LL/SC pair instead of CAS, LL/SC must likewise operate on double-length words. Of all the architectures I know, only x86 supports dwCAS. Why is this primitive so useful? It makes it possible to organize one of the first schemes for solving the ABA problem: tagged pointers. The scheme was proposed by IBM precisely to solve the ABA problem and consists in pairing each shared pointer with a tag, an integer counter. A tagged pointer can be described by the following structure:
    template <typename T>
    struct tagged_pointer {
        T *       ptr ;
        uintptr_t tag ;

        tagged_pointer()
            : ptr( new T )
            , tag( 1 )
        {}
    };
To guarantee atomicity, a variable of this type must be double-word aligned: 8 bytes on 32-bit architectures and 16 bytes on 64-bit ones. The tag contains the "version number" of the data pointed to by ptr. I will discuss tagged pointers in detail in one of the following articles on safe memory reclamation (SMR); here I will only note that the tagged-pointer scheme implies that memory once allocated for data of type T (and for the corresponding tagged_pointer) must never be physically deleted (delete); instead it should be placed in a free-list (a separate one for each type T), from which data will later be allocated again with the tag incremented. This is exactly what solves the ABA problem: our pointer is in effect composite, carrying the version number (the ordinal number of the allocation) in the tag, and dwCAS fails if its tagged_pointer arguments have equal pointers but different tag values.
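In C++11 terms this can be sketched roughly as follows; whether compare_exchange on such a double-width structure is actually lock-free (on x86-64, via the cmpxchg16b instruction) depends on the platform and compiler options, which is what is_lock_free() reports:

    #include <atomic>
    #include <cstdint>

    // Sketch: a double-width tagged pointer, 16-byte aligned on 64-bit
    // targets (8-byte on 32-bit ones).
    template <typename T>
    struct alignas( 2 * sizeof(void *) ) tagged_ptr {
        T *       ptr;
        uintptr_t tag;
    };

    template <typename T>
    bool cas_tagged( std::atomic< tagged_ptr<T> >& a,
                     tagged_ptr<T> expected, tagged_ptr<T> desired )
    {
        // Fails when the pointers match but the tags differ:
        // this is precisely what defeats the ABA scenario.
        return a.compare_exchange_strong( expected, desired );
    }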
The second atomic primitive, double CAS (DCAS), is at present purely hypothetical: it is not implemented in any modern processor architecture. Its pseudocode is:

    bool DCAS( int * pAddr1, int nExpected1, int nNew1,
               int * pAddr2, int nExpected2, int nNew2 )
    atomically {
        if ( *pAddr1 == nExpected1 && *pAddr2 == nExpected2 ) {
            *pAddr1 = nNew1 ;
            *pAddr2 = nNew2 ;
            return true ;
        }
        else
            return false ;
    }
DCAS operates on two arbitrary aligned memory cells and changes the values of both if their current values match the expected ones.
Why is this primitive so interesting? Because it makes it fairly easy to build a lock-free doubly linked list (deque), and such a data structure underlies many interesting algorithms. A large number of academic papers are devoted to DCAS-based data structures. Although this primitive is not implemented "in hardware", there are works (for example, [Fra03], the best known of them) describing algorithms for building DCAS (and multi-CAS in general, for any number of addresses pAddr1 ... pAddrN) out of ordinary CAS.
Performance

And what about the performance of atomic primitives?
Modern processors are so complex and unpredictable that manufacturers give no firm guarantees about the duration of particular instructions, especially atomic ones, which involve internal synchronization, processor bus signaling and so on. There are quite a few works in which the authors try to measure the duration of processor instructions; I will quote the measurements from [McKen05]. In that paper, among other things, the authors compare the duration of atomic increment and of the CAS primitive with the duration of the nop instruction (no-operation). For an Intel Xeon 3.06 GHz (a 2005 model), atomic increment costs about 400 nops and CAS 850-1000 nops; for an IBM Power4 1.45 GHz, 180 nops for atomic increment and 250 nops for CAS. The measurements are rather old (2005), and processor architecture has taken several steps forward since then, but I believe the order of magnitude has stayed the same.
As you can see, atomic primitives are rather heavy, so applying them everywhere is quite expensive: for example, if a binary tree search algorithm used CAS to read the current tree node, I would expect nothing good from such an algorithm (and I have seen such algorithms). In fairness it should be noted that, by my impression, the CAS primitive gets faster with each new generation of the Intel Core architecture; apparently, Intel's engineers are putting considerable effort into improving the micro-architecture.
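If you want to get a feel for these costs on your own machine, a crude sketch follows. Treat the numbers as rough orders of magnitude only: such micro-benchmarks are heavily distorted by pipelining, frequency scaling and the optimizer (which may delete the plain loop outright unless its result is used, as below):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main()
    {
        const long N = 100000000;
        std::atomic<long> a( 0 );
        long plain = 0;

        auto t1 = std::chrono::steady_clock::now();
        for ( long i = 0; i < N; ++i )
            ++plain;            // ordinary increment
        auto t2 = std::chrono::steady_clock::now();
        for ( long i = 0; i < N; ++i )
            a.fetch_add( 1 );   // atomic increment (lock xadd on x86)
        auto t3 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf( "plain : %ld ms\natomic: %ld ms\n",
            (long) std::chrono::duration_cast<ms>( t2 - t1 ).count(),
            (long) std::chrono::duration_cast<ms>( t3 - t2 ).count() );
        return plain == N && a.load() == N ? 0 : 1;
    }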
Volatile and atomicity
C++ has the mysterious volatile keyword. volatile is sometimes associated with atomicity and ordering; this is wrong, though it has a certain historical basis.
In fact, volatile guarantees only that the compiler will not cache the value in a register (optimizing compilers love doing that: the more registers, the more caching). Reading a volatile variable always means a read from memory, and writing a volatile variable always means a write to memory. So if someone concurrently changes volatile data, it would seem we will see it immediately.
In fact, we will not, because of the absence of memory barriers. In some languages (Java, C#) volatile is given a magic status that does provide ordering, but not in C++11. Nor does volatile guarantee any special alignment, and as we now know, correct data alignment is a necessary condition for atomicity.
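A sketch of the classic trap: both variables below are volatile, so the compiler will not reorder the two stores, but on weakly ordered architectures (ARM, PowerPC) the hardware still may, so the consumer can observe fReady == true while nData is still 0. std::atomic, which we will look at together with memory ordering in the next article, fixes this:

    #include <atomic>

    volatile int  nData  = 0;
    volatile bool fReady = false;                // broken: no ordering guarantee
    std::atomic<bool> fReadyAtomic( false );     // fixed: ordered by default

    void producer()
    {
        nData = 42;
        fReady = true;   // may become visible to other cores before nData = 42 (!)
        // fReadyAtomic.store( true );  // this store would order correctly
    }

    void consumer()
    {
        while ( !fReady )
            ;            // busy-wait
        // with plain volatile, nData == 42 is NOT guaranteed here
    }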
Thus, for a C++11-compatible compiler, specifying volatile for an atomic variable is redundant. But for an older compiler it is necessary if you implement atomic yourself. In a declaration like this:

    class atomic_int {
        int m_nAtomic;
        // ...
    };
the compiler has every right to "optimize" accesses to m_nAtomic (even though the access goes indirectly, through this). Therefore it is necessary to declare the member as int volatile m_nAtomic.
volatile methods in C++. Studying the atomic interface, you will surely notice that many of its methods carry the volatile specifier:

    void store( T val ) volatile ;
What is this? Is the this pointer volatile (that is, of type T * volatile)? That is not really possible: this is often passed in a register, and some ABIs stipulate as much. In fact, the type here is T volatile *.
By analogy with const methods, this specifier says that this points to volatile data, that is, all data members of the object are volatile inside such a method. That, in turn, forbids the compiler from optimizing accesses to the data members: essentially the same as if we had declared the data itself volatile.
Finally, let me remind you that the const and volatile specifiers are not opposites; they can coexist, as in atomic<T>:

    T load() const volatile ;

Here const promises that the method does not modify the object, while volatile forbids optimizing accesses to the object's data members.
libcds
In conclusion: what do we have in the libcds library? We have yet another home-grown implementation of atomic primitives in the spirit of C++11, for the x86/amd64, Intel Itanium and SPARC architectures. If the compiler does not support C++11 (and only the newest compiler versions available to me do: GCC 4.7+, MS Visual C++ 2012, Clang 3.0+), libcds uses its own implementation of the atomic primitives. The main atomic primitive for building lock-free data structures, besides ordinary atomic reads/writes, is CAS; double-word CAS (dwCAS) is used only very rarely. There is no DCAS (or multi-CAS) implementation in libcds yet, though it may well appear in the future: very interesting data structures can be built on it. What holds me back is that, according to many studies, the DCAS implementation by the [Fra03] algorithm is rather inefficient. But, as I have already noted, the criterion of efficiency in the lock-free world is a highly individual thing: what is inefficient now, on this hardware and for this task, may prove extremely efficient on other hardware or for another task!
In the next article of the Basics series we will talk about ordering memory accesses (memory ordering), that is, about memory barriers (memory fence, memory barrier).
Links:
[Cha05] Dean Chandler. Reduce False Sharing in .NET. 2005, Intel Corporation
[Fra03] Keir Fraser. Practical Lock Freedom. 2004; technical report based on a dissertation submitted September 2003 by K. Fraser for the degree of Doctor of Philosophy, University of Cambridge, King's College
[McKen05] Paul McKenney, Thomas Hart, Jonathan Walpole. Practical Concerns for Scalable Synchronization
[Mic04] Maged Michael. ABA Prevention using single-word instructions. IBM Research Report, 2004