
Lock-free data structures. Part 1: The beginning


I hope this article will be the first in a series of notes about lock-free data structures. I want to share with the community my experience, observations and reflections: what lock-free data structures are, how to implement them, whether the container concepts of the standard STL library fit lock-free containers, and when it is worth (and whether it is worth it at all) applying lock-free data structures.



Talking about lock-free data structures is meaningless without covering topics such as atomic operations, the memory model implemented in a particular programming language (unless, of course, the language is so old that its authors never thought about a memory model), safe memory reclamation, compilers and the optimizations they apply, and modern processor architectures. All of these topics will be covered to some extent in this series. I take the liberty of talking about all these things even though I do not consider myself an absolute expert in any of them. On the other hand, without some understanding of them I could not have written and developed libcds, an open-source C++ library of lock-free containers and safe memory reclamation algorithms. CDS stands for Concurrent Data Structures, and the prefix "lib" is, oddly enough, "library".
I will begin with the history of the library. It all started back in 2006.
2006


At the time I worked at a rather large company that wrote software for a telecom operator from the "Big Three". We developed fairly complex server applications for a whole zoo of hardware platforms, applications in which the problem of performance came first (and always will). It was solved, among other things, by parallelizing data processing. As usual, parallelization led to shared data, and access to shared data had to be synchronized. One day, in a discussion, a colleague asked in passing: "Have you heard anything about lock-free queues?" I knew nothing about them then. But after consulting Google I found several articles giving pseudocode for a lock-free queue. Having read them several times, I understood nothing. More precisely, I reached the state of "understood nothing" only after I rolled up my sleeves, announced "right now!" to the whole world (you are all fools, I am the only smart one here), and tried to "simplify" the algorithms by reconciling them with common sense. After a month of fighting segmentation faults, my common sense gave up. That is when I "understood nothing": I did not understand how IT works at all, and even if IT somehow works, how it could possibly be implemented in C++. But it had to work somehow, otherwise smart people would not write those articles, and other smart people would not cite them (the articles were academic, and each ended with a list of references).

Following those references, over the next year I read several tens of megabytes of useful (and not so useful) information, ranging from software developer guides for processor architectures to survey articles on general approaches to implementing lock-free algorithms. Along the way I wrote something on the topic (in C++, of course), implementing various primitives. It is worth noting that at the time (2006-2007) the C++11 standard was still optimistically called C++0x, STL had no atomic primitives yet (their interface was only being sketched out), and compilers were sometimes capricious about my home-grown atomic primitives, emitting non-working code in particularly critical places.

By 2008 the hazy contours of the libcds library began to take shape. The first tests on different platforms gave encouraging, sometimes stunning ("it became 50 times faster!!!") results, and I plunged completely into the world of lock-free. In 2010 I published the first version (0.5.0) of the library on SourceForge. Today the latest version of the library is 1.4.0, and work on version 1.5.0 is under way.

Let me turn to a general overview of lock-free data structures. The main task of a programmer designing and developing complex software projects, especially server ones, is to use all available resources of the target platform as effectively as possible. A modern computer, even a smartphone or a tablet, is a multiprocessor machine, so parallelization is one of the ways to improve performance. Parallel threads process some shared data, so our main task is to organize parallel access to that shared data.


In the 1980s, so-called structured programming was popular, promoted as a method of writing good programs. One of its champions was Niklaus Wirth, the author of Pascal, who wrote the bestseller "Algorithms + Data Structures = Programs". Interestingly, this old formula points at the weak spot of modern APIs such as pthreads or the Win32 API offered by operating systems: the APIs provide the means for building parallel programs (that is, threads), but they do not provide the means for building parallel data structures with shared access. Instead, an API provides means of restricting access to data in the form of synchronization primitives.

However, synchronization is the bottleneck of parallel programs. By its nature, synchronization is the opposite of parallelism: when parallelizing algorithms, we work with sequential data structures and guard their operation with synchronization primitives (critical sections, mutexes, condition variables); as a result, we line up all our threads in a queue for access to the data structure, thereby killing parallelism. Synchronization primitives are sometimes OS kernel objects, so a call to such an object can be very expensive: it may require a context switch, a transition into the OS kernel, maintaining wait queues for access to the protected data, and so on. And all this, perhaps, just to change the value of a single pointer, that is, to execute one or two assembly instructions. The overhead can be (and in fact is) very high. Besides, one should not forget that OS kernel objects are a resource limited in quantity.
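To make the disproportion concrete, here is a minimal sketch (my own illustration in C++11 terms, not code from libcds; the names are arbitrary) of updating a single shared pointer through a mutex versus through an atomic store:

#include <atomic>
#include <mutex>

std::mutex        g_lock;
int *             g_ptr_locked = nullptr;    // protected by g_lock
std::atomic<int*> g_ptr_atomic( nullptr );   // needs no external protection

void update_with_mutex( int * p )
{
    // may involve a trip into the OS kernel, a context switch, a wait queue...
    std::lock_guard<std::mutex> guard( g_lock );
    g_ptr_locked = p;                         // ...all for a single store
}

void update_with_atomic( int * p )
{
    // a single atomic store with release ordering: one or two
    // instructions on most modern architectures
    g_ptr_atomic.store( p, std::memory_order_release );
}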

Another disadvantage of synchronization is poor scalability. As the number of threads grows, synchronized access to data becomes the program's bottleneck; often, instead of a proportional performance gain from increasing the degree of parallelism, we get a degradation under high contention.

Guided by Wirth's formula "Algorithms + Data Structures = Programs", in my work on libcds I focused on data structures only: you will not find parallel sorting algorithms or a parallel for-each in the library. The library contains only concurrent data structures (a queue, a list, a map, a set, and so on) and the algorithms needed to support lock-free data structures, first of all safe memory reclamation algorithms. Often one and the same data structure is represented by several implementations. That was intentional: as a rule, there are several interesting algorithms implementing a queue or a map, and in general I do not know which one is "better". First, "better" or "worse" is relative and depends on the specific hardware and the specific task; second, until you implement an algorithm and compare it with others, you will not know whether it is better. And if an algorithm is implemented and debugged, why shouldn't it take its place in the library (and confront the user with a difficult choice)?..
Lyrical digression
By the way, in this respect I have a complaint about STL: why, for example, is the map in every STL implementation I know of built as a red-black tree? There are many other data structures suitable for implementing a map, for example the AVL tree, which is also a binary tree but with a stricter balance criterion (and, incidentally, a domestic invention: Adelson-Velsky and Landis were Soviet scientists). Apparently, I am not the only one tormented by such questions: I have come across articles comparing red-black tree and AVL tree implementations, and their conclusions were not unambiguously in favor of the red-black tree; in many problems (and on many architectures) the AVL tree was faster.



In academia, the study of ways to implement concurrent data structures that allow simultaneous access to shared data has given rise to several main directions:

- lock-free data structures;
- fine-grained algorithms;
- algorithms based on transactional memory.


There are no implementations based on transactional memory in libcds. Transactional memory is a vast area of research aimed mostly at the future. Algorithms based on transactional memory assume, roughly speaking, that memory supports atomic transactions, which can be atomically committed or rolled back. Obviously, such memory should be implemented in hardware; existing software implementations, as the researchers themselves admit, lack adequate performance. In fairness, it is worth noting that Intel processors of the Haswell architecture already have transactional support in their instruction set, so the flowering of algorithms built on transactional memory principles may not be far off.
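To give a flavor of what hardware transactional memory looks like from C++, here is a sketch using Intel's RTM intrinsics (illustrative only: a real program must provide a non-transactional fallback path, which is omitted here; compile with RTM support, e.g. -mrtm for gcc/clang):

#include <immintrin.h>

int g_value = 0;

void transactional_increment()
{
    if ( _xbegin() == _XBEGIN_STARTED ) {
        ++g_value;   // reads and writes inside the transaction are buffered...
        _xend();     // ...and become visible atomically on commit
    }
    else {
        // the transaction aborted (conflict, capacity, interrupt...);
        // a real program falls back to, say, a mutex-protected path
    }
}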

Fine-grained algorithms are refined synchronization techniques built, as a rule, not on the synchronization primitives provided by the OS but on "light" atomic primitives such as spin-locks. On top of such primitives, data structures are built that allow parallel reading or even parallel writing, with synchronization performed at the level of a node or a page (bucket) of the data structure and embedded into the operations on the structure itself. Fine-grained containers often show performance comparable to lock-free containers under relatively light load. The libcds library does not shy away from such data structures either.
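Here is a minimal sketch of such a "light" primitive: a test-and-set spin-lock on top of std::atomic_flag (my own C++11 illustration, not the spin-lock from libcds):

#include <atomic>

class SpinLock {
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // spin until test_and_set returns false, i.e. the flag was clear
        // and we are the thread that set it
        while ( m_flag.test_and_set( std::memory_order_acquire ))
            ;   // a production implementation would back off or yield here
    }
    void unlock()
    {
        m_flag.clear( std::memory_order_release );
    }
};

No OS kernel object is involved: a waiting thread simply spins in user space, which is cheap when the protected section is only a few instructions long.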

I will call lock-free those data structures that require no external synchronization of access. This is an informal, purely technical definition reflecting the internal structure of the container and of the operations on it. The word "external" is highlighted deliberately: one must understand that without special processor support a lock-free data structure most likely cannot be built at all. But in lock-free containers this support is provided not by synchronizing access around the container's methods, but by an atomic modification mechanism wired inside the container's methods, or by internal synchronization at the level of the container's components (nodes, buckets, pages).
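What "an atomic modification mechanism wired inside the container's methods" means can be sketched with the push operation of a Treiber-style lock-free stack, the simplest classic example (this is my illustrative code, not libcds code; safe memory reclamation, which pop would need, is deliberately left out):

#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node {
        T      value;
        Node * next;
        Node( T v ): value( v ), next( nullptr ) {}
    };
    std::atomic<Node *> m_head { nullptr };
public:
    void push( T v )
    {
        Node * p = new Node( v );
        p->next = m_head.load( std::memory_order_relaxed );
        // CAS loop: if another thread changed the head between our read and
        // our attempt, compare_exchange_weak refreshes p->next and we retry
        while ( !m_head.compare_exchange_weak( p->next, p,
                    std::memory_order_release, std::memory_order_relaxed ))
        {}
    }
};

No caller-side locking is needed: the retry loop around compare_exchange_weak is the synchronization.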

The formal definition of a lock-free object is as follows [Her91]: a shared object is called lock-free (non-blocking) if it guarantees that some thread completes its operation on the object in a finite number of steps, regardless of the work of other threads (even if those other threads fail). A stricter notion is that of a wait-free object: an object is wait-free if every thread completes its operation on the object in a finite number of steps. The lock-free condition guarantees progress for at least one thread, while the stronger wait-free condition guarantees successful completion for all threads. In the theory of concurrent data structures there is also the notion of linearizability [Her90], which plays an important role in formal proofs of correctness of lock-free algorithms. Roughly speaking, an algorithm is linearizable if the result of each operation is visible by the time the operation completes: for example, the result of an insertion into a list is visible by the time the insert function returns. That sounds trivial, but one can easily come up with a list-insertion algorithm that is not linearizable. For example, any kind of buffering can violate this property: instead of inserting an element, we can put it into a buffer and hand another thread the command "insert the element from the buffer into the list" without waiting for the insertion to happen; or we insert only when enough elements have accumulated in the buffer. Then, when the insert function returns, we cannot guarantee that the element is in the list; all we can guarantee is that the element will certainly be inserted (future tense!) into the list, now or later.
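Here is a sketch of that buffering trick (illustrative names; a mutex-based list is used for brevity, since the point is only the moment at which the element becomes visible):

#include <mutex>
#include <vector>

class BufferedList {
    std::mutex       m_lock;
    std::vector<int> m_list;    // the shared list
public:
    // Each thread passes its own private buffer.
    void insert( int v, std::vector<int>& buffer )
    {
        buffer.push_back( v );
        if ( buffer.size() >= 16 ) {    // flush threshold
            std::lock_guard<std::mutex> g( m_lock );
            m_list.insert( m_list.end(), buffer.begin(), buffer.end() );
            buffer.clear();
        }
        // When insert() returns, v may still sit in 'buffer': we can only
        // promise that it will appear in m_list eventually, which is exactly
        // what breaks linearizability.
    }
};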

These notions are widely used in research papers; my articles do not pretend to be research, so I will use the term lock-free in the "everyday" sense, to denote the class of concurrent containers built without traditional synchronization patterns, in other words, requiring no external synchronization.

Lyrical digression 2 - About time
It should be noted that the notion of time (of cause-and-effect relations) is somewhat blurred in lock-free data structures. With the traditional approach we act like this: lock access to the container, do something with it (for example, add an element), unlock. At the moment of unlocking we know that the inserted element is in the container. With a lock-free container everything is different: we do not need to lock anything, we simply call the add method. If it returns true, the insertion succeeded. But that does not mean the element is in the container: it may already have been removed by a concurrent thread. This demonstrates an important difference between lock-free containers and traditional STL-style ones: we cannot expose the container's internals to the outside; for example, the concept of iterators, widely used in STL, is not applicable to concurrent data structures. I hope to discuss the design of interfaces for concurrent container classes in detail in one of the following articles.
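A small sketch of this "blurred time" effect; the Set interface here (insert and contains returning bool) is hypothetical, not a specific libcds class:

// 'Set' is some lock-free set shared between threads.
template <class Set>
void blurred_time_demo( Set& set )
{
    if ( set.insert( 42 ) ) {
        // The insertion succeeded, but by the time we check, a concurrent
        // thread may already have erased 42: both true and false are
        // legitimate results here.
        bool stillThere = set.contains( 42 );
        (void) stillThere;  // we cannot assert( stillThere ) in a lock-free world
    }
}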



What characterizes lock-free algorithms? Probably the first thing that catches the eye is their complexity. Can you imagine what an ordinary queue implemented on a singly-linked list looks like? Very simple code:
Show very simple code
#include <cstddef>   // NULL

struct Node {
    Node * m_pNext;
};

class queue {
    Node * m_pHead;
    Node * m_pTail;
public:
    queue(): m_pHead( NULL ), m_pTail( NULL ) {}

    void enqueue( Node * p )
    {
        // append the new node after the current tail
        p->m_pNext = NULL;
        if ( m_pTail )
            m_pTail->m_pNext = p;
        else
            m_pHead = p;
        m_pTail = p;
    }

    Node * dequeue()
    {
        if ( !m_pHead ) return NULL;
        Node * p = m_pHead;
        m_pHead = p->m_pNext;
        if ( !m_pHead )
            m_pTail = NULL;
        return p;
    }
};


It could be written even shorter. And here is how the enqueue/dequeue methods (synonyms: push/pop) look in the classic lock-free queue algorithm of Michael & Scott ([Mic96]; the code is taken, with minimal abbreviations, from the cds::intrusive::MSQueue class of the libcds library):
Show the same thing, but written in a much more cumbersome way
bool enqueue( value_type& val )
{
    node_type * pNew = node_traits::to_node_ptr( val );
    typename gc::Guard guard;
    back_off bkoff;

    node_type * t;
    while ( true ) {
        t = guard.protect( m_pTail, node_to_value() );

        node_type * pNext = t->m_pNext.load( memory_model::memory_order_acquire );
        if ( pNext != null_ptr<node_type *>() ) {
            // Tail is misplaced, advance it
            m_pTail.compare_exchange_weak( t, pNext,
                memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed );
            continue;
        }

        node_type * tmp = null_ptr<node_type *>();
        if ( t->m_pNext.compare_exchange_strong( tmp, pNew,
                memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed ))
        {
            break;
        }
        bkoff();
    }
    ++m_ItemCounter;

    m_pTail.compare_exchange_strong( t, pNew,
        memory_model::memory_order_acq_rel, CDS_ATOMIC::memory_order_relaxed );
    return true;
}

value_type * dequeue()
{
    node_type * pNext;
    back_off bkoff;

    typename gc::template GuardArray<2> guards;
    node_type * h;
    while ( true ) {
        h = guards.protect( 0, m_pHead, node_to_value() );
        pNext = guards.protect( 1, h->m_pNext, node_to_value() );

        if ( m_pHead.load( memory_model::memory_order_relaxed ) != h )
            continue;

        if ( pNext == null_ptr<node_type *>() )
            return NULL;    // empty queue

        node_type * t = m_pTail.load( memory_model::memory_order_acquire );
        if ( h == t ) {
            // It is needed to help enqueue
            m_pTail.compare_exchange_strong( t, pNext,
                memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed );
            continue;
        }

        if ( m_pHead.compare_exchange_strong( h, pNext,
                memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed ))
        {
            break;
        }
        bkoff();
    }
    --m_ItemCounter;
    dispose_node( h );
    return pNext;
}



The algorithm is complicated. It is the same singly-linked list, but even a simple visual comparison of the ordinary queue with the lock-free one already says a lot. In the lock-free queue we see:

- atomic operations on pointers with explicit memory ordering (memory_model::memory_order_*);
- compare-and-swap primitives (compare_exchange_weak / compare_exchange_strong) in retry loops;
- a back-off strategy (back_off bkoff) to ease contention;
- guards (gc::Guard, GuardArray) protecting nodes, that is, a safe memory reclamation scheme;
- helping: one thread finishing work started by another (advancing the misplaced tail).


I will not explain any of this now: each of these topics is very broad and deserves a separate conversation. Let the intrigue remain. I hope to cover these topics in future articles.

The next article will be devoted to the cornerstone on which all lock-free data structures rest: atomicity and atomic primitives.


And finally, some useful books (not articles) which, from my point of view, cover the questions of concurrent programming most fully.
At the moment I know of two such works:

1. Nir Shavit, Maurice Herlihy, "The Art of Multiprocessor Programming". As far as I know, there is no Russian translation. The book is very well known in the lock-free world; the authors describe many parallel algorithms and methods of building them. All examples are in Java, so the authors did not have to worry about freeing memory, the C++ memory model (the Java memory model is stricter), and other things that are built into Java but that in C++ you have to write yourself. Despite that, the book is very useful.

2. Anthony Williams, "C++ Concurrency in Action" (there is a Russian translation). A book by an author well known in the C++ world, devoted to multithreaded programming in C++. It describes the new C++11 standard and the facilities the standard provides for implementing parallel algorithms. Highly recommended.

References:

[Her90] M. Herlihy, J.M. Wing. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 1990

[Her91] M. Herlihy. A Methodology for Implementing Highly Concurrent Data Objects. 1991

[Mic96] M. Michael, M. Scott. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. 1996

Source: https://habr.com/ru/post/195770/

