
Lock-free data structures. Concurrent map: warm up


I was honored to be invited to speak at the first C++ 2015 Russia conference on February 27-28. I was bold enough to ask for two hours instead of one, and declared the topic that interests me most: concurrent associative containers, that is, hash set/map and trees. The organizer, sermp, accommodated my request, for which many thanks.
How does one prepare for such a responsible talk? The first step is to draw a presentation, that is, a bunch of pictures, preferably close to the subject. But you also have to narrate those pictures for two hours. How do you remember all of it? How do you avoid endless "ummm", "here we see", "this slide shows", incoherent jumps in the narrative, and the other things that mark a speaker as not very fluent in his native language (I mean Russian; with C++ I sorted things out quickly: no code in the presentation, only pictures)?
Of course, you need to write down your thoughts while looking at the slides. And once something is written, it would be a shame not to publish it. And if you publish, then on Habr.
So, in the run-up to C++ 2015 Russia! The author's talk, I hope without the author's clumsy language, without omissions, and with digressions on the topic, written before the event itself, in several parts.

As a C++ programmer, I have mostly used not the queues/stacks/deques/lists of the standard container library, but the associative containers: hash set/map, i.e. std::unordered_set/map, and trees, i.e. std::set/map. As we know, the C++11 standard does give a thread-safety guarantee for such containers, but a very weak one: concurrent reads only; any modification requires external locking! In practice such a guarantee is worth almost nothing, because if I see that a map is read-only, I will most likely build a sorted array and search it with binary search: experience shows this is faster, uses memory sparingly, and treats the processor cache kindly.
Is it possible to implement a concurrent map container without the need for external synchronization? Let's explore this question.
To warm up, let's look at the internal structure of the hash map.

Hash table


Internally, the simplest hash map with collision lists is extremely simple: there is an array T[N], each element of which is a list. The index into the table is the hash value of the key. All keys with the same hash value (modulo N) fall into the same list, called the collision list.
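As a sketch (single-threaded for now, with illustrative names, not any real library's API), such a hash map might look like this:

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

// A minimal single-threaded hash map with separate chaining, illustrating
// the structure described above: an array T[N] whose cells are collision
// lists. N is fixed here; rehashing is discussed later in the article.
template <class Key, class Value>
class ChainedHashMap {
    static const std::size_t kBuckets = 16;    // N
    using Bucket = std::forward_list<std::pair<Key, Value>>;
    std::vector<Bucket> table_;                // T[N]

    std::size_t index(Key const& k) const {
        return std::hash<Key>{}(k) % kBuckets; // hash(key) mod N picks the cell
    }

public:
    ChainedHashMap() : table_(kBuckets) {}

    void insert(Key const& k, Value const& v) {
        table_[index(k)].emplace_front(k, v);  // prepend to the collision list
    }

    Value* find(Key const& k) {
        for (auto& kv : table_[index(k)])      // linear scan of one list
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
};
```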


In general, the most interesting part is hidden inside the hash function: a hash map works well if and only if a proper hash function is chosen for the given set of keys. "Proper" means, above all, one that "spreads" the keys well over the table cells. But from the point of view of increasing the level of concurrency, how the hash function is built internally does not matter at all, so we will not discuss it further.
Looking at the picture, you can immediately guess how to make an ordinary hash map concurrent: instead of locking the entire hash map, lock at the level of each cell of the table T[N]. Then work with each particular collision list T[i] is serialized, but different lists T[i] and T[j] (i != j) can be worked on in parallel. This technique is described by M. Herlihy and N. Shavit in "The Art of Multiprocessor Programming" and is called striping.
A complaint to publishers
Why has this book still not been translated, even though the original has already gone through two editions?! This fundamental work by the world's best-known authors is, for parallel algorithms, an analogue of Knuth's monumental monograph: an invaluable guide for students and for anyone who wants to understand "how it works", a launching pad for growing home-grown talent.
No, instead we publish "C++ in 35 minutes for dummies"! Dummies do not need C++!!! And with only such books, dummies are all we will have. This rant does not apply to Habr readers, of course.


Striped map




So, in the striping technique we have two arrays: the lock array Lock[N] and the hash table T[N] itself. Initially the sizes of these two arrays are the same (this is an important requirement; we will see why later). The algorithm of operation is simple: for any operation (insert/erase/find) we compute the hash value i of the key modulo N, lock the mutex Lock[i], and perform the operation on the list T[i]. But there is one "but": the hash table needs to be expanded from time to time. The criterion for expansion is the so-called load factor: the ratio of the number of elements in the hash map to the size N of the hash table. If we never expand the table, the collision lists become very long and all the advantages of a hash map turn into a pumpkin: instead of the O(1) estimate of operation complexity we get O(S/N), where S is the number of elements in the table, which for S >> N and a fixed N is equivalent to O(S), that is, the complexity of linear search! This is unacceptable. To combat it, rehashing is required: increase N once the load factor exceeds some threshold, and rebuild the hash table according to the new hash values std::hash(key) % N'.
The rehashing algorithm for the striped map is as follows:



First we lock all the mutexes from left to right; then we multiply N by an integer factor (an integer factor is important, see below), rebuild the hash table, and finally unlock all the mutexes in the reverse order, from right to left. Remember that the main cause of deadlock is violating the order in which several mutexes are locked and unlocked.

Locking all the mutexes is important here so that no one interferes with the rehashing: it is quite a heavy operation, redistributing the keys over the cells of the new table T[N'].
Note that above we spoke about increasing the size of the hash table, but said nothing about increasing the size of the lock array Lock[N]. This is no accident: the size of the lock array does not change.



If the sizes of the arrays Lock and T differ, then in principle a situation is possible where the same cell T[k] corresponds to two different mutexes Lock[i] and Lock[j], which breaks our whole slender algorithm. To rule this out, two things are required:
  1. Initially, the sizes of the arrays Lock and T coincide and are equal to N.
  2. When rehashing, the new size N' of the array T is calculated as N' = k * N, where k is a positive integer (usually 2).

Under these conditions the size of the array T is always a multiple of the size of the array Lock (which does not change); from the properties of modular arithmetic it is easy to prove that in this case the same cell T[k] cannot correspond to two different mutexes Lock[i] and Lock[j].
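A minimal sketch of the striping scheme might look as follows. It keeps a fixed-size array of mutexes plus a growable bucket table whose size is always a multiple of the lock count, so each bucket maps to exactly one mutex; the threshold, names, and growth factor are illustrative choices, not taken from any real library, and a production version would re-check the load factor under the locks:

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <utility>
#include <vector>

template <class Key, class Value>
class StripedMap {
    using Bucket = std::list<std::pair<Key, Value>>;
    std::vector<std::mutex> locks_;      // Lock[L]: its size never changes
    std::vector<Bucket>     table_;      // T[N]:   N == k * L, k a positive integer
    std::atomic<std::size_t> size_{0};

    static std::size_t hash(Key const& k) { return std::hash<Key>{}(k); }

public:
    explicit StripedMap(std::size_t n = 8) : locks_(n), table_(n) {}

    void insert(Key const& k, Value const& v) {
        {
            std::lock_guard<std::mutex> g(locks_[hash(k) % locks_.size()]);
            table_[hash(k) % table_.size()].emplace_back(k, v);
            ++size_;
        }   // the stripe lock must be released before a full rehash
        if (size_ > 4 * table_.size())   // load-factor threshold, illustrative
            rehash();
    }

    bool contains(Key const& k) {
        std::lock_guard<std::mutex> g(locks_[hash(k) % locks_.size()]);
        for (auto const& kv : table_[hash(k) % table_.size()])
            if (kv.first == k) return true;
        return false;
    }

private:
    void rehash() {
        // Lock ALL stripes, left to right, so no other thread can interfere...
        for (auto& m : locks_) m.lock();
        std::vector<Bucket> bigger(table_.size() * 2);   // N' = 2 * N
        for (auto& b : table_)
            for (auto& kv : b)
                bigger[hash(kv.first) % bigger.size()].push_back(std::move(kv));
        table_.swap(bigger);
        // ...and unlock them in the reverse order, right to left.
        for (auto it = locks_.rbegin(); it != locks_.rend(); ++it) it->unlock();
    }
};
```

Note that insert releases its stripe lock before calling rehash; calling rehash while still holding one stripe would deadlock against the "lock all stripes left to right" rule.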
Refinable hash map
The authors of "The Art of Multiprocessor Programming" go further and solve the rehashing problem for the Lock array as well, proposing an algorithm that allows the size of Lock[] to grow along with the hash table. They call this algorithm a refinable hash map. Interested readers are referred to the original.

The striping technique extends easily to other data structures, for example trees:



The idea is very simple: each element of the mutex array Lock[L] protects its own tree. The index into Lock[L] is the hash value of the key (modulo L). By locking Lock[i] we gain access to the corresponding tree, in which we then perform the required operation.
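A sketch of striping applied to trees, using std::map as the per-stripe tree; all names here are illustrative:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>

// L mutexes, each protecting its own ordered tree. hash(key) % L selects
// both the mutex and the tree, so work on different stripes proceeds in
// parallel while work on one stripe is serialized.
template <class Key, class Value, std::size_t L = 8>
class StripedTreeMap {
    std::array<std::mutex, L>           locks_;  // Lock[L]
    std::array<std::map<Key, Value>, L> trees_;  // one tree per stripe

    static std::size_t stripe(Key const& k) { return std::hash<Key>{}(k) % L; }

public:
    void insert(Key const& k, Value const& v) {
        std::size_t i = stripe(k);
        std::lock_guard<std::mutex> g(locks_[i]);  // serialize work on tree i
        trees_[i][k] = v;
    }

    bool find(Key const& k, Value& out) {
        std::size_t i = stripe(k);
        std::lock_guard<std::mutex> g(locks_[i]);
        auto it = trees_[i].find(k);
        if (it == trees_[i].end()) return false;
        out = it->second;
        return true;
    }
};
```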
libcds
The libcds library provides striped and refinable hash set/map implementations, as well as adapters for all the associative containers of STL and boost, including the intrusive ones.

The striping technique is not bad, but from the standpoint of this series of articles on lock-free data structures it has a significant flaw: it is blocking. Is there a lock-free algorithm for a hash map?..


Lock-free ordered list


Take another look at the internal structure of the hash map:



For now let us leave the question of rehashing overboard and assume that we know the number of elements in the hash table a priori and have a good hash function for our set of keys. Then, to build a lock-free hash map, we lack only one little thing: a lock-free collision list. And a collision list is nothing more than an ordered (by key) singly-linked list. Let's try to build one, following T. Harris and M. Michael.
Searching a lock-free list is fairly trivial: a linear pass through the list starting from its head. Given that we use atomic operations and one of the safe memory reclamation schemes (Hazard Pointer, user-space RCU, or similar), the search algorithm differs little from an ordinary search in an ordered list (actually, if you look at the code, it differs quite a lot; that is the price of lock-free).
Inserting a new item is also not a problem:



We find the insertion position (remember, our list is ordered), create a new element, and link it in with a single CAS primitive (compare-and-swap, a.k.a. std::compare_exchange).
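A sketch of this insertion step, ignoring concurrent deletion (the marked-pointer complications come later in the article) and safe memory reclamation; the Node type and helper names are illustrative:

```cpp
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n = nullptr) : key(k), next(n) {}
};

// 'head' is a dummy node, so every real node has a predecessor.
bool insert(Node* head, int key) {
    for (;;) {                                   // "try until it works"
        Node* prev = head;
        Node* cur  = prev->next.load();
        while (cur && cur->key < key) {          // find the insertion position
            prev = cur;
            cur  = cur->next.load();
        }
        if (cur && cur->key == key)
            return false;                        // key already present
        Node* n = new Node(key, cur);
        // A single CAS links the node in; it fails if prev->next changed
        // behind our back, in which case we retry from the head.
        if (prev->next.compare_exchange_weak(cur, n))
            return true;
        delete n;                                // lost the race, retry
    }
}
```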
Removing is also quite trivial:



We linearly search for the element to be deleted and atomically, with CAS, swing the pointer from its predecessor to the next element.
Problems begin when all of this is done together and in parallel: thread A removes the element with key 3 while thread B inserts key 4:
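A sketch of this naive deletion: find the node, then one CAS swings the predecessor's pointer past it. Safe memory reclamation (Hazard Pointer / RCU) is omitted and the names are illustrative; the following paragraphs show why this alone is not enough under concurrent inserts:

```cpp
#include <atomic>

struct DNode {
    int key;
    std::atomic<DNode*> next;
    DNode(int k, DNode* n = nullptr) : key(k), next(n) {}
};

// 'head' is a dummy node. Returns false if the key is absent or the CAS
// loses a race to a concurrent modification.
bool erase(DNode* head, int key) {
    DNode* prev = head;
    DNode* cur  = prev->next.load();
    while (cur && cur->key < key) {     // linear search for the victim
        prev = cur;
        cur  = cur->next.load();
    }
    if (!cur || cur->key != key)
        return false;                   // key not found
    DNode* nxt = cur->next.load();
    // One CAS swings prev->next from cur to cur's successor.
    return prev->next.compare_exchange_strong(cur, nxt);
}
```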



Threads A and B search simultaneously, and each stores in local variables the position it found: for deletion (A) and for insertion (B). Then thread A successfully removes 3 from the list, while thread B, say, gets preempted at that moment. Later the operating system schedules thread B again and it continues its insertion of key 4. It already holds the position, the prev and next pointers, in its local variables, and it executes the atomic CAS primitive to insert the element. Where? After the element with key 3, which has already been unlinked from the list! (Hazard Pointer / RCU guarantee that element 3 has not yet been physically deleted, so we will not get "access to garbage", but 3 has already been unlinked by thread A.) Everything completes without errors, and as a result the element with key 4 is lost and, on top of that, we get a memory leak.

T. Harris found an elegant solution to this problem in 2001. He proposed two-phase deletion of an element: the first phase, logical deletion, marks the element as deleted without unlinking it from the list; the second phase, physical exclusion, unlinks the element from the list (and the third phase, physical deletion, is performed by the safe memory reclamation scheme: Hazard Pointer, RCU, or similar). The point of logical deletion is to mark the element being deleted in such a way that a CAS on it fails. For this, Harris suggested using the low bit of the element's next pointer. Indeed, on modern architectures (both processors and OSes) data is aligned on a 4-byte (32-bit) or 8-byte (64-bit) boundary, so the lower 2 or 3 bits of any pointer are always zero and can be used as bit flags. This trick earned its own name, the marked pointer, and is widely used in lock-free programming.
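A sketch of the marked pointer trick: node alignment keeps the low bit of every pointer zero, so it can carry the "logically deleted" flag. The helper names and the MNode type here are illustrative, not a real API:

```cpp
#include <atomic>
#include <cstdint>

template <class T>
T* mark(T* p) {
    return reinterpret_cast<T*>(reinterpret_cast<std::uintptr_t>(p) | 1);
}
template <class T>
T* unmark(T* p) {
    return reinterpret_cast<T*>(reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(1));
}
template <class T>
bool is_marked(T* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 1) != 0;
}

struct MNode {
    int key;
    std::atomic<MNode*> next;
    MNode(int k, MNode* n = nullptr) : key(k), next(n) {}
};

// Phase 1 of Harris-style deletion: logically delete a node by CAS-ing the
// mark bit into its next pointer. Any insert CAS that expects an unmarked
// next pointer will now fail.
bool logically_delete(MNode* n) {
    MNode* nxt = n->next.load();
    while (!is_marked(nxt)) {
        if (n->next.compare_exchange_weak(nxt, mark(nxt)))
            return true;                // we set the mark
        // a failed CAS reloads nxt; the loop re-checks the mark bit
    }
    return false;                       // already marked by someone else
}
```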
Using marked pointer, parallel insert / delete looks like this:



Thread A marks element 3 as logically deleted by setting the low bit of the found->next pointer. Now thread B's attempt to insert 4 after 3 will fail: its CAS will not succeed, since it expects the pointer 3.next to be unmarked.
The price of the marked pointer approach is one additional CAS when deleting an element.
It may seem that all of the above is superfluous, and that it would be much faster and simpler just to assign nullptr to the next pointer of the element being deleted; then the CAS on insertion would fail. Yes, that is simpler, but those would be two independent operations: one unlinks the element, the other zeroes next. And since there are two operations, someone can squeeze in between them and insert a new element after the already unlinked one. Thus we would only reduce the probability of the error, not eliminate it. Such are the grimaces of lock-free: everything must be done as atomically as possible and, in any case, without breaking the container's internal structure.

Historical excursion
The evolution of Harris's lock-free ordered list algorithm is interesting. The original algorithm is notable in that the Hazard Pointer scheme is, in principle, inapplicable to it. The point is that Harris's algorithm assumes physical removal of a whole chain of linked list nodes, a chain that in principle can be unbounded. As we know, the Hazard Pointer scheme requires that deleted elements first be protected by declaring them hazard, and only then may anything be done with them. But the number of hazard pointers is limited, so we cannot protect an unbounded chain of elements!
M. Michael, while developing Hazard Pointer, discovered this fundamental inapplicability of his scheme to Harris's elegant algorithm and proposed his own modification, which also uses two-phase deletion with the marked pointer trick but removes elements one at a time, thereby making the use of Hazard Pointer permissible.
There is another modification of this algorithm, one that addresses the problem of restarting the search. As I have repeatedly stressed, every lock-free algorithm is an endless loop of "keep trying until it works". In describing the interesting situations above I did not mention what to do when a CAS fails. In fact, it is simple: if the CAS on insertion/deletion fails, we must try the insertion/deletion again from the very beginning, from the search for the position, until we either perform the operation or realize that it cannot be performed (for insertion, because the key is already in the list; for deletion, because the key is not in the list). Restarting the operation from the head of the list is expensive if the list is long (and we will see later that, based on the list described here, one can build a rather efficient ordered-container algorithm supporting millions of elements), so a thought arises: what if the repeated search started not from the beginning but from some previous element? But the notion of "previous" in the ever-changing lock-free world is very fragile. In his dissertation, Fomitchev suggests adding to each element of the lock-free list a back-reference to some previous element, and describes the rather complex techniques for working with these pointers and keeping them up to date.

And, to top it off, there is an original ordered-list algorithm based on fine-grained locking at the level of each element: the lazy list. It is also implemented in libcds, using spin-locks as its mutexes by default. Despite its lock-based nature, the hash map built on top of it is not much inferior to the lock-free analogue discussed in this article.


So, we have examined a lock-free ordered list algorithm with which one can implement a lock-free hash map. The drawback of this algorithm is that it does not provide for rehashing; that is, we must know a priori the expected number of elements S in such a container and the optimal load factor. The size N of the hash table T[N] is then S / load factor. Nevertheless, despite these limitations, or rather thanks to them, such an algorithm is one of the fastest and most scalable, since there is no rehashing overhead.
According to my experiments, a good load factor is 2-4, that is, each lock-free collision list contains 2-4 elements on average. With load factor = 1 (each collision list contains one element on average) the results are slightly better, but there is more memory overhead for the essentially degenerate collision lists.
In the next article, we will look at an algorithm that quite originally supports rehashing and is based on the lock-free ordered list described here.

Source: https://habr.com/ru/post/250383/

