Lock-free algorithms and stack implementation

In this article I want to raise a somewhat holivar topic - the topic of the slab-free algorithms, and in particular the implementation of the slab-free stack. More precisely, the stack is conditionally bezlokovy, why - it will be clear further. I want to immediately warn you that all the examples will be given in C.

To begin with, for those who are not very much in the subject, I want to briefly tell you what the slab-free algorithms are and why they are needed. Often, multi-threaded applications use access to the same data from multiple threads, as an example I can cite a processing queue. In order for these data to remain consistent (holistic), methods are needed to protect them from simultaneous uncoordinated changes. Typically, these methods are all kinds of locks, (spinlock, mutex), which completely prevent simultaneous access to data, closing before accessing data for reading or writing, and opening after the necessary operation is completed.

Without going into the differences between the spinlock and mutexes, I want to say that this principle remains the same, no matter what locks you use.

In turn, the lock-free algorithms use atomic operations like CAS (compare and swap) to negotiate simultaneous access so that the data maintain integrity. As a result, lock-free algorithms usually have significantly better performance. Everything would be fine if it were not for the high complexity of their implementation, starting with the ABA problem that has filled everyone with disbelief, ending with the danger of access to remote memory segments, and as a consequence of the program crash. And precisely because of their high complexity, lock-free algorithms are still very little used.
')
But this is all the lyrics, and now let's get to the bottom line. I have met quite a lot of stackless implementations of the stack: some unconditionally non-working, some naive, i.e. Do not take into account the problem of ABA, many and workers. For example, one of these stacks with a very good description of the problems that can occur when implementing a lock-free stack I found here: blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms .

For those who want to understand this is a pretty good article that shows the problems of lock-free algorithms and methods for solving them.

One of the problems of the lock-free stack is dynamic memory management: stack nodes must be allocated and (if we don’t want memory leaks) removed. In this case, the problem arises with the removal. Take the “naive” stack implementation:

void push(void *data) { struct stack_node_s *node = malloc(sizeof(struct stack_node_s)); node->data = data; do { node->next = head; } while(!compare_and_swap(&head, node->next, node); } void *pop() { struct stack_node_s *node; void *result = NULL; do { node = head; } while(node && !compare_and_swap(&head, node, node->next); if(node) { result = node->data; free(node); } return result; }

This implementation is referred to as “naive” because it believes that if CAS was a success, then we found exactly the data that we had in mind. In fact, there are many scenarios in which the head may be the same as at the beginning of the iteration, but it will mean something completely different.

For example, suppose a situation in which one of the reading threads managed to keep the head at the node, and take the node-> next, but compare_and_swap hadn’t been called up yet. Then the thread is interrupted, and gives control to other threads, one of which completely manages to remove and remove the head, the other - to remove and remove head-> next, and the third to place the element on the stack, and the memory for this element was allocated from the same address where old head. Control then returns to the first thread.

In this case, the node will coincide with the head, but the node-> next which we managed to take will not only be irrelevant, but also point to a remote memory location.

This problem is called ABA ', or ABA' problem, in many sources. It is a true scourge of beginner enthusiasts involved in lock-free algorithms. It is solved most often by introducing in addition to the pointers of the counter-tags, which change with each placement on the stack, so that even with the same value of the pointer indicator, the tag-pointer pair will differ.

The second problem that this implementation will face is the release of memory. To demonstrate this problem, let's assume a situation in which one of the reading threads retained the head at the node and transferred control to other threads. The other reading thread managed to carry out the entire procedure of extracting from the stack and deleting memory. When control returns to the first thread, an attempt to take a node-> next will refer to the freed memory, which, of course, is not very good, and may even lead to a program crash. This problem is solved in different ways. In this article I used the so-called hazard pointers: a list of pointers that can not be deleted. Before deleting a chunk of memory, the stream is checked against this list. There are other ways: for example, temporary replacement of the stack head with a conditional value preventing its removal in another thread. The implementation I want to propose is based on one simple fact: access to the stack is done through a single element: the head of the list, which still cannot be changed at the same time, therefore, compared to other containers, such as, for example, a queue where there is a head reading and writing in the tail, the gain from the lock-free approach is rather low.

Therefore, I want to propose a combined approach to the stack. In this algorithm, the recording is performed in the lock-free mode, i.e. CAS th, reading, except CAS uses also spinlock to avoid simultaneous reading in several flows. Due to the fact that reading will be carried out only in one thread, the problem of access to the freed memory immediately disappears. Moreover, since during the read operation, the option with deletion and subsequent insertion is impossible (you can only insert), then the ABA problem disappears by itself. In other words, I propose an algorithm that uses locks in only one of the parts that can cause the greatest problems.

In my case, the implementation will look like this:

 void thread_buffer_push(struct thread_buffer_s *head, void *data) { struct thread_buffer_s *tb, *oldhead; tb = malloc(sizeof(struct thread_buffer_s)); tb->data = data; oldhead = head->next; tb->next = oldhead; while(!__sync_bool_compare_and_swap(&head->next, oldhead, tb)) { usleep(1); oldhead = head->next; tb->next = oldhead; } } void* thread_buffer_pop(struct thread_buffer_s *head) { struct thread_buffer_s *current; void *result = NULL; // Spinlock acquire while(!__sync_bool_compare_and_swap(&head->list_mutex, 0, 1)) { usleep(1); } current = head->next; // NOONE can pop and delete the same element, since it's read-locked // But it can change since the stack can be pushed without locking while(current && !__sync_bool_compare_and_swap(&head->next, current, current->next)) { usleep(1); current = head->next; } if(current) { result = current->data; free(current); } // Spinlock release while(!__sync_bool_compare_and_swap(&head->list_mutex, 1, 0)) { usleep(1); } return result; }

This implementation was tested for speed along with the lock-free algorithm that takes into account the above errors, as well as algorithms using spinlock and mutex (pthread_mutex). Usleep (1) was added after the experiments in accordance with the recommendations given in the article on the link I indicated.

The estimated execution time for each of these algorithms is the following (during all tests it has changed only slightly):

mutex:
real 0m1.336s
user 0m1.173s
sys 0m3.628s

lock-free:
real 0m0.533s
user 0m0.792s
sys 0m0.046s

spinlock:
real 0m0.520s
user 0m0.630s
sys 0m0.018s

semi-locked:
real 0m0.353s
user 0m0.360s
sys 0m0.075s

The minor differences in the lock-free and spinlock algorithm are explained by the fact that, as I wrote above, the stack has only one common access point (head). The fact that these differences are not in favor of the lock-free algorithm is a side effect of adding protection against access to remote pointers.

I would like to draw the following conclusions: before implementing the lock-free algorithm, it is necessary to analyze the data structures used for the possibility of parallel access of several threads simultaneously. It is possible that it is in your case (as, for example, in the case of the stack) that the lock-free approach will only add hemorrhoids to you, without significantly improving performance. This article also demonstrates the possibility of combined approaches to the implementation of parallel algorithms.

Source: https://habr.com/ru/post/271549/

All Articles

Lock-free algorithms and stack implementation

More articles: