In this article, we will examine in detail the atomic operations and memory barriers of C++11 and the assembler instructions they generate on x86_64 processors.
Next, we will show how to speed up contfree_safe_ptr<std::map> to the level of complex and highly optimized
lock-free data structures with functionality similar to std::map<>, for example SkipListMap and BronsonAVLTreeMap from the libCDS library (Concurrent Data Structures library):
github.com/khizmax/libcds
And we will be able to get this multithreaded performance for any of your originally non-thread-safe classes T used as contfree_safe_ptr<T>. We are interested in optimizations that increase performance by ~1000%, so we will not pay attention to weak and dubious optimizations.
Three related articles:
- Making any object thread-safe
- Accelerating std::shared_mutex 10 times
- A thread-safe std::map with the performance of a lock-free map
→ My article in English
→ Examples and tests from all three articles
High performance lock-based data structures
contfree_safe_ptr<T> is the class safe_ptr<T, contention_free_shared_mutex>, where contention_free_shared_mutex is our own optimized shared mutex, and safe_ptr is the thread-safe pointer from the previous article.
In this article, we will show how to implement our own high-performance
contention-free shared-mutex, which is almost conflict-free on reads. We will implement our own active locks, spinlock and recursive-spinlock, to lock individual rows during update operations. We will create RAII locking pointers to avoid the cost of multiple locking, and we will present the results of performance tests.
As a bonus, "just for fun", we will show how to implement our own simplified partitioned type partitioned_map, even more optimized for multithreading, consisting of several std::map, by analogy with a partitioned table from a DBMS, when the boundaries of each partition are known in advance.
Basics of atomic operations
Let us consider the basics of multithreading, atomicity and memory barriers. Suppose we change the same data from multiple threads, i.e. we run the thread_func() function simultaneously in several threads:
int a; void thread_func() { a = a + 1; }
Then each thread calling the thread_func() function adds 1 to the ordinary shared variable int a;. In general, such code is not thread-safe, because compound operations (RMW - read-modify-write) on ordinary variables consist of several smaller operations, between which another thread can change the data. The operation a = a + 1; consists of at least three mini-operations:
- Load the value of the variable "a" into a processor register
- Add 1 to the value in the register
- Write the value of the register back to the variable "a"
For example, if int a = 0; and 2 threads perform the operation a = a + 1; then the result should be a = 2. But the following can happen, step by step:
int a = 0;
Each of the two threads added +1 to the same global variable, yet in the end the variable a = 1 - this problem is called a data race.
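Below is a minimal sketch of this lost-update scenario. The interleaving shown in the comments is one possible (assumed) timeline; since a data race on an ordinary int is undefined behavior, the actual output may vary from run to run:

#include <thread>
#include <iostream>

int a = 0;                          // ordinary (non-atomic) shared variable
void thread_func() { a = a + 1; }   // compound RMW on a plain int - not thread safe

int main() {
    std::thread t1(thread_func), t2(thread_func);
    t1.join(); t2.join();
    // One possible interleaving (assumed):
    //   T1: load a (0) -> reg       T2: load a (0) -> reg
    //   T1: reg + 1 = 1             T2: reg + 1 = 1
    //   T1: store reg -> a (a = 1)  T2: store reg -> a (a = 1)
    std::cout << "a = " << a << std::endl;   // expected 2, but 1 is a possible outcome
    return 0;
}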
To avoid this problem, there are 3 approaches:
- Use atomic instructions on atomic variables - the drawback is that the set of atomic functions is very small, so it is hard to implement complex logic with them: en.cppreference.com/w/cpp/atomic/atomic
- Develop your own complex lock-free algorithms for each new container.
- Use locks (std::mutex, std::shared_timed_mutex, spinlock ...) - they let threads into the locked code one at a time, so data races do not occur and we can use arbitrarily complex logic with any ordinary thread-unsafe objects.
For std::atomic<int> a; if initially a = 0; and 2 threads perform the operation a += 1; then the result will always be a = 2:
std::atomic<int> a; void thread_func() { a += 1; }
On x86_64 processors the following will always occur (step by step):
std::atomic<int> a = 0;
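For comparison, here is a minimal self-contained sketch of the atomic version; the assert is guaranteed to hold:

#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> a(0);
void thread_func() { a += 1; }   // a single atomic RMW (a LOCK-prefixed instruction on x86_64)

int main() {
    std::thread t1(thread_func), t2(thread_func);
    t1.join(); t2.join();
    assert(a.load() == 2);       // always holds, unlike the non-atomic version above
    return 0;
}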
On processors with LL/SC support, for example ARM or PowerPC, other steps will occur, but with the same result: a = 2.
Atomic variables were introduced in the C++11 standard:
en.cppreference.com/w/cpp/atomic/atomic
The member functions of the std::atomic<> class template are always atomic.
Here are 3 code fragments with identical meaning:
1. Example:
std::atomic<int> a; a = 20; a += 50;
2. This is identical to example 1:
std::atomic<int> a; a.store( 20 ); a.fetch_add( 50 );
3. And this is identical to example 1:
std::atomic<int> a; a.store( 20, std::memory_order_seq_cst ); a.fetch_add( 50, std::memory_order_seq_cst );
This means that for std::atomic:
- load() and store() are the same as operator T and operator=
- fetch_add() and fetch_sub() are the same as operator+= and operator-=
- Sequential Consistency (std::memory_order_seq_cst) is the default memory barrier (the strictest and most reliable, but also the slowest of all).
Note the differences between std::atomic and volatile in C++11:
www.drdobbs.com/parallel/volatile-vs-volatile/212701484
There are 5 main differences:
- Optimizations: For std::atomic<T> a; two optimizations are possible that are impossible for volatile T a;
• Merging optimization: a = 10; a = 20; can be replaced by the compiler with a = 20;
• Constant-propagation optimization: a = 1; local = a; can be replaced by the compiler with a = 1; local = 1;
- Reordering: Operations on std::atomic<T> a; can restrict reordering around themselves for operations on ordinary variables and operations on other atomic variables, in accordance with the memory barrier used, std::memory_order_... In contrast, volatile T a; does not affect the order of ordinary variables (non-atomic/non-volatile), but accesses to all volatile variables always keep a strict mutual order, i.e. the order of execution of any two volatile operations cannot be changed by the compiler, though it can be changed by the processor.
- Spilling: The memory barriers std::memory_order_release, std::memory_order_acq_rel and std::memory_order_seq_cst specified for operations on std::atomic<T> a; initiate spilling of all ordinary variables before the execution of the atomic operation. That is, these barriers unload ordinary variables from processor registers into RAM/cache, unless the compiler can guarantee 100% that the local variable cannot be used in other threads.
- Atomicity / alignment: Operations on std::atomic<T> a; are visible to other threads either completely or not at all. For integral types T, this is achieved by the compiler aligning the atomic variable in memory - at the very least the variable must lie within one cache line, so that it can be changed or read by a single CPU operation. Conversely, the compiler does not guarantee the alignment of volatile variables or the atomicity of operations on them. Volatile variables are commonly used to access device memory and in similar cases, but not to exchange data between threads. A device driver API returns a pointer to volatile variables, and if alignment is necessary, that API provides it.
- Atomicity of RMW operations (read-modify-write): Operations on std::atomic<T> a; such as (++, --, +=, -=, *=, /=, CAS, exchange) are executed atomically, i.e. if two threads perform the operation ++a; then the variable is guaranteed to be increased by 2. This is achieved by locking the cache line (x86_64) or by marking that the cache line has not changed on processors supporting LL/SC (ARM, PowerPC) for the whole duration of the RMW operation. Volatile variables do not provide atomicity for compound RMW operations (see the small sketch after this list).
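As an illustration of the last point, here is a small sketch (an assumed example, not from the original article) contrasting volatile and std::atomic for compound RMW operations. Note that the race on the volatile variable is formally undefined behavior and is shown only to demonstrate lost updates:

#include <atomic>
#include <thread>
#include <iostream>

volatile int v = 0;         // every access goes to memory, but ++v is NOT an atomic RMW
std::atomic<int> at(0);     // ++at is a single atomic RMW

void worker() {
    for (int i = 0; i < 100000; ++i) { ++v; ++at; }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    std::cout << "volatile: " << v             // often less than 200000 (lost updates)
              << ", atomic: " << at.load()     // always exactly 200000
              << std::endl;
    return 0;
}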
There is one general rule for std::atomic and volatile variables: every read or write operation always accesses memory/cache, i.e. the values are never cached in processor registers.
Also note that for ordinary variables and objects (non-atomic/non-volatile), any optimizations and any reordering of independent instructions relative to each other are possible, by the compiler or by the processor.
Recall that write operations to memory on atomic variables with the memory barriers std::memory_order_release, std::memory_order_acq_rel and std::memory_order_seq_cst guarantee spilling (writing from registers to memory) of all non-atomic/non-volatile variables that are at that moment in processor registers:
en.wikipedia.org/wiki/Register_allocation#Spilling
Reordering of instructions
The compiler and processor change the order of instructions to optimize the program and improve performance.
Here is an example of the optimizations performed by the GCC compiler and by the x86_64 processor:
godbolt.org/g/n91hpt
Full-size picture:
hsto.org/files/3d2/18b/c7b/3d218bc7b3584f82820f92077096d7a0.jpg
The optimizations shown in the picture:
- Reordering by the GCC 7.0 compiler:
• It swaps the write to memory b = 5; and the load from memory into a register int tmp_c = c;. This allows the value of "c" to be requested as early as possible, and while the processor waits for this long operation, its pipeline can simultaneously start the operation b = 5; - these 2 operations are independent of each other.
• It combines the load from memory into a register int tmp_a = a; with the addition tmp_c = tmp_c + tmp_a; - as a result, instead of two instructions you get one: add eax, a[rip]
- Reordering by the x86_64 processor:
The processor can swap the actual write to memory mov b[rip], 5 and the read from memory combined with the addition add eax, a[rip].
When the write to memory is initiated by the instruction mov b[rip], 5, the following happens: first, the value 5 and the address b[rip] are placed into the hardware Store-Buffer queue, the cache lines containing the address b[rip] in all other processor cores are invalidated and their responses are awaited, then CPU-Core-0 sets the "eXclusive" state for the cache line containing b[rip], and only after that is the actual value 5 written from the Store-Buffer into this cache line at b[rip]. For more details on the x86_64 cache coherency protocols MOESI / MESIF, in which changes are visible to all cores instantly, see: en.wikipedia.org/wiki/MESIF_protocol
In order not to wait all this time, immediately after 5 is placed into the Store-Buffer, and without waiting for the actual write to the cache, we can start executing the following instructions: reads from memory or operations on registers. This is exactly what the x86_64 processor does.
Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide: www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf
8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.
Processors of the x86_64 family have a strong memory model, while processors with a weak memory model, for example PowerPC and ARM v7/v8, can perform even more reorderings.
Let us give some examples of possible reordering of writes to memory for ordinary variables, volatile variables and atomic variables.
Reordering of ordinary variables:
This code with ordinary variables can be reordered by the compiler or by the processor, since within a single thread its meaning does not change. But across multiple threads such reordering can affect the logic of the program.
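Since the figure with the code is not reproduced here, a hypothetical example of such a reordering (all names assumed) might look like this:

void process(int);            // hypothetical consumer of the data

int data = 0;
int ready = 0;                // both are ordinary variables - no ordering guarantees

void producer_thread() {
    data = 42;
    ready = 1;                // the compiler or CPU may make this store visible BEFORE data = 42
}

void consumer_thread() {
    if (ready == 1)
        process(data);        // may observe data == 0 - a classic multithreading bug
}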
If the 2 variables are volatile, then the following reorderings are possible: the compiler cannot reorder operations on volatile variables at compile time, but the compiler allows the processor to perform this reordering at run time.
To prevent all or only some of these reorderings, atomic operations exist (recall that by default atomic operations use the strictest memory barrier, std::memory_order_seq_cst):
Another thread will see the variable changes in memory in exactly this order.
If we do not specify a memory barrier for an atomic operation, the std::memory_order_seq_cst barrier is used by default, and no atomic or non-atomic operations can be reordered with such operations (but there are exceptions, which we will look at later).
In the case above, we first write to the ordinary variables a and b, and then to the atomic variables a_at and b_at, and this order cannot be changed. Also, the write of b_at to memory cannot happen earlier than the write of a_at. But the writes to the variables a and b can be reordered relative to each other.
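A sketch of the case just described (the variable names a, b, a_at, b_at are taken from the text above):

#include <atomic>

int a = 0, b = 0;                      // ordinary variables
std::atomic<int> a_at(0), b_at(0);     // atomic variables, default barrier is seq_cst

void thread_func() {
    a = 1;                             // these two plain stores may be reordered
    b = 2;                             //   with respect to each other...
    a_at.store(3);                     // ...but cannot be moved below this seq_cst store
    b_at.store(4);                     // the write of b_at cannot happen before the write of a_at
}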
When we say "they can be reordered", it means that they may be, but not necessarily. It depends on how the compiler decides to optimize the C++ code at compile time, or how the CPU does at run time.
Below we will look at weaker memory barriers, which allow instructions to be reordered in the permitted directions - this lets the compiler and processor optimize the code better and further improve performance.
Barriers to reordering memory accesses
The memory model of the C++11 standard provides us with 6 types of memory barriers that correspond to the capabilities of modern CPUs for out-of-order (speculative) execution. By using them we do not prohibit reordering completely, but only in the necessary directions. This leaves the compiler and processor free to optimize the code as much as possible, while the forbidden reordering directions preserve the correctness of our code.
en.cppreference.com/w/cpp/atomic/memory_order
enum memory_order { memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst };
- memory_order_consume. Let us note right away that we will practically never use the memory_order_consume barrier, since the standard itself expresses doubts about the expediency of its use - a quote from the standard: www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
§29.3
(1.3) - memory_order_consume: a load operation performs a consume operation on the affected memory location. [Note: Prefer memory_order_acquire, which provides stronger guarantees than memory_order_consume. Implementations have found it infeasible to provide performance better than that of memory_order_acquire. Specification revisions are under consideration. - end note]
- memory_order_acq_rel. Note also that the memory_order_acq_rel barrier is used only for atomic compound RMW (Read-Modify-Write) operations such as compare_exchange_weak()/_strong(), exchange(), fetch_(add, sub, and, or, xor) and the corresponding operators: en.cppreference.com/w/cpp/atomic/atomic
The remaining 4 memory barriers can be used for any operations, with two exceptions: acquire is not used for store(), and release is not used for load().
Depending on the memory barrier chosen, the compiler and the processor are prohibited from moving executable code relative to the barrier in different directions.
Now we will show what exactly the arrows in the picture denote - what can change places and what cannot:
For 2 instructions to be swapped, the barriers of both instructions must allow such an exchange. Since "any other code" means ordinary non-atomic variables with no barriers, they allow any change of order. In the lower-left example,
Relaxed-Release-Relaxed, you can see that the possibility of reordering identical memory barriers depends on the sequence in which they are placed.
Let's take a look at what these memory barriers mean and what advantages they give us, using the implementation of the simplest lock, a spin-lock, as an example - it requires the most common reordering semantics: Acquire-Release. In terms of usage, a spin-lock is similar to std::mutex.
To begin with, we implement the spin-lock concept directly in the body of our program, and then we implement a separate spin-lock class. To implement locks (mutex, spinlock ...), you must use Acquire-Release semantics,
C++11 Standard § 1.10.1 (3):
www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
"For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A."
The basic meaning of Acquire-Release semantics is that Thread-2, after performing the operation flag.load(std::memory_order_acquire), must see all the changes to any variables/structures/classes (even non-atomic ones) that were made by Thread-1 before it performed the operation flag.store(0, std::memory_order_release).
The basic purpose of locks (mutex, spinlock ...) is to create a piece of code that can be executed only by one thread at a time, i.e. that cannot be executed by two or more threads in parallel. Such a section of code is called a critical section. Inside it you can use any ordinary code, including code without std::atomic.
Memory barriers (memory fences) prevent the compiler from optimizing the program in such a way that any operation from the critical section leaks outside of it.
The thread that captures the lock first executes this block of code, while the remaining threads wait in a loop (possibly temporarily falling asleep). When the first thread releases the lock, the processor decides which of the waiting threads captures it next, and so on.
Here are 2 examples identical in meaning:
- using std::atomic_flag: coliru.stacked-crooked.com/a/1ec9a0c2b10ce864
- using std::atomic<int>: coliru.stacked-crooked.com/a/03c019596b65199a
Example 1 is preferable, so we will schematically show the meaning of using memory barriers for it - the atomic operations are shown in solid blue:
coliru.stacked-crooked.com/a/1ec9a0c2b10ce864
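Since the linked example is only available online, here is a minimal sketch of what example 1 (a spin-lock written inline with std::atomic_flag) looks like, assuming shared_value is the data protected by the lock:

#include <atomic>
#include <thread>
#include <iostream>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
int shared_value = 0;                     // ordinary non-atomic shared data

void add_to_shared() {
    while (lock_flag.test_and_set(std::memory_order_acquire));  // enter the critical section
    shared_value += 25;                                          // any ordinary code is allowed here
    int new_shared_value = shared_value;                         // copy shared data into a local
    lock_flag.clear(std::memory_order_release);                  // leave the critical section
    std::cout << new_shared_value << std::endl;                  // thread-local work from here on
}

int main() {
    std::thread t1(add_to_shared), t2(add_to_shared);
    t1.join(); t2.join();
    std::cout << "shared_value = " << shared_value << std::endl; // always 50
    return 0;
}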
The meaning of the barriers is very simple: the compiler optimizer is forbidden to move instructions out of the critical section:
- No instruction located after memory_order_acquire can be executed before it.
- No instruction located before memory_order_release can be executed after it.
Any other changes in the order of execution of independent instructions can be performed by the compiler (compile-time) or processor (run-time) - in order to optimize performance.
For example, the line int new_shared_value = shared_value; can be executed before lock_flag.clear(std::memory_order_release);. Such reordering is acceptable and does not create data races, since all code that refers to the data shared between threads is always enclosed within the two barriers, acquire and release. Outside of them there is only code that works with data local to the thread, and it does not matter in what order it is executed. Dependencies local to a thread are always preserved just as in single-threaded execution, which is why int new_shared_value = shared_value; cannot be executed earlier than shared_value += 25;.
So, this is what the acquire-release barriers of the critical section forbid and allow:
- Any operation inside the critical section is forbidden from being executed outside the critical section.
- Any operation outside the critical section is allowed to be executed inside the critical section (if there are no additional memory barriers).
What specific actions does the compiler perform for std::memory_order:
- 1, 6: The compiler generates the assembler instructions of an acquire barrier for the load operation and of a release barrier for the store operation, if such barriers are necessary for the given processor architecture
- 2: The compiler discards the previous caching of variables in processor registers in order to reload the values of these variables, which may have been changed by another thread - after the load(acquire) operation
- 5: The compiler saves the values of all variables from processor registers to memory so that they become visible to other threads, i.e. it performs spilling (see the link above) before store(release)
- 3, 4: The compiler prevents the optimizer from changing the order of instructions in the forbidden directions - shown by the red arrows
And now let's see what happens if we use the Relaxed-Release semantics instead of Acquire-Release:
- Left: in the case of Acquire-Release everything is executed correctly.
- Right: in the case of Relaxed-Release the algorithm will work incorrectly, since part of the code inside the critical section protected by the lock can be moved outside by the compiler or processor. Then there will be a data race: many threads start working with the shared data even before taking the lock, performing non-atomic operations on it - and we can get an erroneous result.
Note that it is usually impossible to implement all the logic on shared data using only atomic operations, because there are very few of them and they become rather slow when there are many of them.
Therefore it is simpler and faster to: set the "closed" flag with one atomic operation, perform all the non-atomic operations on the data shared between threads, and then set the "open" flag.
We show this process schematically in time:
For example, two threads start executing the add_to_shared() function.
- Thread-1 arrives a little earlier, and with one atomic instruction test_and_set() it performs 2 operations at once: it checks whether lock_flag == false, sets it to true (i.e. locks the spin-lock), and returns false. Therefore the expression while (lock_flag.test_and_set()); completes immediately, and the code of the critical section starts executing.
- Thread-2 at this moment also starts executing this atomic instruction test_and_set(): it checks whether lock_flag == false, and if so sets it to true, otherwise it changes nothing and returns the current value, true. Therefore the expression while (lock_flag.test_and_set()); keeps executing, just like while (lock_flag);.
- Thread-1 performs the addition shared_value += 25; and then the atomic operation sets lock_flag = false (i.e. unlocks the spin-lock).
- Thread-2, finally having waited for the condition lock_flag == false, sets lock_flag = true, gets false returned and exits the loop. It then performs the addition shared_value += 25; and sets lock_flag = false (unlocks the spin-lock).
At the beginning of this chapter we gave 2 examples:
- using std::atomic_flag and test_and_set(): coliru.stacked-crooked.com/a/1ec9a0c2b10ce864
- using std::atomic<int> and compare_exchange_weak(): coliru.stacked-crooked.com/a/03c019596b65199a
Learn more about these operations at: en.cppreference.com/w/cpp/atomic/atomic
Let's look at how the x86_64 assembler code generated by the GCC compiler for these two examples differs:
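Since the generated assembler is shown only in the picture, here is a sketch of what the second variant (example 2, with std::atomic<int> and compare_exchange_weak()) might look like in C++; the exact code is in the linked example:

std::atomic<int> lock_int(0);      // 0 - unlocked, 1 - locked
int shared_value = 0;

void add_to_shared() {
    int expected = 0;
    // CAS: if lock_int == 0, set it to 1; on x86_64 this compiles to LOCK CMPXCHG,
    // whereas test_and_set() from example 1 typically compiles to an (implicitly locked) XCHG
    while (!lock_int.compare_exchange_weak(expected, 1, std::memory_order_acquire))
        expected = 0;               // compare_exchange_weak() overwrites expected on failure
    shared_value += 25;
    lock_int.store(0, std::memory_order_release);
}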
To make it convenient to use, we can combine all this into a single class:
class spinlock_t {
    std::atomic_flag lock_flag;
public:
    spinlock_t() { lock_flag.clear(); }

    bool try_lock() { return !lock_flag.test_and_set(std::memory_order_acquire); }
    void lock() { for (size_t i = 0; !try_lock(); ++i) if (i % 100 == 0) std::this_thread::yield(); }
    void unlock() { lock_flag.clear(std::memory_order_release); }
};
An example of using the spinlock_t class is available at the following link: coliru.stacked-crooked.com/a/92b8b9a89115f080
To make sure you always understand which barrier you are using and what it compiles to on x86_64, we give the following table:
The picture can be viewed in full size at the link: hsto.org/files/4f8/7b4/1b6/4f87b41b64a54549afca679af1028f84.jpg
The following is important to know only if you write code in x86_64 assembler yourself, where the compiler cannot reorder assembler instructions for optimization:
- seq_cst. The Clang and MSVC compilers, unlike GCC, optimize the Store operation with the Sequential Consistency barrier differently: for a.store(val, memory_order_seq_cst); Clang and MSVC generate [LOCK] XCHG reg, [addr], which flushes the CPU Store-Buffer in the same way as the MFENCE barrier does, while GCC uses the two instructions MOV [addr], reg and MFENCE.
- RMW (CAS, ADD, ...) are always seq_cst. Since all atomic RMW (Read-Modify-Write) instructions on x86_64 carry the LOCK prefix, which flushes the Store-Buffer, at the assembler level they all correspond to Sequential Consistency semantics, regardless of which memory_order you specify for the RMW operation, even memory_order_acq_rel.
- LOAD(acquire), STORE(release). As you can see, on x86_64 the first 4 memory barriers (relaxed, consume, acquire, release) produce identical assembler code for LOAD and STORE - i.e. on x86_64 acquire-release semantics are provided automatically by the hardware cache-coherency protocols MESIF (Intel) / MOESI (AMD). This is true only for memory marked as Write Back; it does not hold for memory marked as Un-cacheable or Write Combined, which is used for memory areas mapped from devices.
As we know, dependent operations can never be reordered anywhere, for example: (Read-X, Write-X) or (Write-X, Read-X).
Slides from Herb Sutter's talk for x86_64: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c
• On x86_64, the following can never be reordered:
- Read-X - Read-Y
- Read-X - Write-Y
- Write-X - Write-Y
• On x86_64, the following can be reordered: Write-X <-> Read-Y. To prevent this, the std::memory_order_seq_cst barrier is used, which can generate 4 variants of code on x86_64 depending on the compiler:
- load: MOV (from memory) store: MOV (to memory), MFENCE
- load: MOV (from memory) store: LOCK XCHG (to memory)
- load: MFENCE + MOV (from memory) store: MOV (to memory)
- load: LOCK XADD (0, from memory) store: MOV (to memory)
A summary table mapping the memory barriers to processor instructions for the architectures x86_64, PowerPC, ARMv7, ARMv8, AArch64 and Itanium is available at: www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
You can view the actual assembly code for different compilers at the links below, and you can also choose other architectures: ARM, ARM64, AVR, PowerPC.
GCC 6.1 (x86_64):
Clang 3.8 (x86_64):
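The listings themselves are shown only in the pictures; as a hedged illustration of the difference for a seq_cst store, here is the typical codegen as described above (assumed, and dependent on the compiler version - check the godbolt links for the exact output):

#include <atomic>

std::atomic<int> a;

void store_seq_cst(int val) {
    a.store(val, std::memory_order_seq_cst);
    // Typical codegen (assumed):
    //   GCC 6.1 (x86_64):   mov DWORD PTR a[rip], edi
    //                       mfence
    //   Clang 3.8 (x86_64): xchg edi, DWORD PTR [a]   ; XCHG with memory is implicitly LOCK-ed
}

int load_seq_cst() {
    return a.load(std::memory_order_seq_cst);
    // Both compilers: a plain mov eax, DWORD PTR a[rip] - same as load(acquire) / load(relaxed)
}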
We will also briefly show in a table what effect various memory barriers have on the ordering around CAS (Compare-and-swap) instructions: en.cppreference.com/w/cpp/atomic/atomic/compare_exchange
Examples of reordering memory accesses
Now we will show more complex examples of reordering, with 4 consecutive operations: LOAD, STORE, LOAD, STORE.
- Blue rectangles are atomic operations.
- The dark blue rectangles inside the light blue ones (in examples 3 and 4) are the parts of compound atomic Read-Modify-Write (RMW) instructions, which consist of several atomic operations.
- White rectangles are ordinary non-atomic operations.
We give 4 examples, in each of which we show the possible reordering of operations on ordinary variables around operations on atomic variables. But only in examples 1 and 3 is some reordering of the atomic operations among themselves also possible.
The 1st case is interesting because several critical sections can be fused into one.
The compiler cannot perform such reordering at compile time, but the compiler allows the processor to do it at run time. Therefore, fusing critical sections that are executed in different order in different threads cannot lead to a deadlock, because the original sequence of instructions is still what the processor sees. The processor will try to enter the second critical section in advance, but if it fails, it will continue executing the first critical section and, only after fully completing it, will wait at the entrance of the second critical section.
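A hypothetical sketch of case 1 (names assumed): two back-to-back critical sections guarded by different spin-locks, where the CPU may speculatively start entering the second one before the first unlock has completed:

spinlock_t mtx1, mtx2;      // the spinlock_t class defined earlier
int x = 0, y = 0;

void two_sections() {
    mtx1.lock();            // RMW/LOAD (acquire)
    x += 1;                 // body of the first critical section
    mtx1.unlock();          // STORE (release)

    mtx2.lock();            // may be speculatively started before mtx1.unlock() completes,
    y += 1;                 //   effectively fusing the two critical sections; if the attempt
    mtx2.unlock();          //   fails, the CPU simply finishes section 1 first - no deadlock
}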
The 3rd case is interesting in that parts of two compound atomic instructions can be reordered: STORE A <-> LOAD B.
The 2nd case is interesting because std::memory_order_seq_cst is the strongest barrier and it would seem that it should prohibit any reordering of atomic or non-atomic operations around itself. But seq_cst barriers have only one additional property compared to acquire/release: only if both atomic operations have a seq_cst barrier is the sequence STORE-A(seq_cst); LOAD-B(seq_cst); guaranteed not to be reordered. Here are 2 quotes from the C++ standard:
- A single total order of execution is guaranteed only for operations with the memory_order_seq_cst barrier, C++11 Standard § 29.3 (3): www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values: ...
- The memory_order_seq_cst barrier guarantees a single total order only relative to other memory_order_seq_cst operations, C++11 Standard § 29.3 (8): www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf
[Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. - end note]
The allowed directions for reordering non-seq_cst operations (atomic and non-atomic) around seq_cst operations are the same as for acquire/release:
- a.load(memory_order_seq_cst) - guarantees for non-seq_cst operations the same order as a.load(memory_order_acquire)
- b.store(memory_order_seq_cst) - guarantees for non-seq_cst operations the same order as b.store(memory_order_release)
A stricter order is possible, but not guaranteed.
For seq_cst, just as for acquire and release: nothing that comes before STORE(seq_cst) can be executed after it, and nothing that comes after LOAD(seq_cst) can be executed before it.
But in the opposite directions reordering is possible.
Now let us show what changes in the order of instructions the compilers allow for the processors, using GCC for x86_64 and PowerPC as an example - C++ code examples and the generated assembler code for x86_64 and PowerPC.
The following reorderings are possible around memory_order_seq_cst operations:
- On x86_64, Store-Load reordering is possible, i.e. the sequence
STORE-C(release); LOAD-B(seq_cst); ==> LOAD-B(seq_cst); STORE-C(release);
because on x86_64 the MFENCE is added only after STORE(seq_cst), while LOAD(seq_cst) is a plain MOV, the same as LOAD(release) and LOAD(relaxed): godbolt.org/g/BsLqas
- On PowerPC, Store-Load and Store-Store reorderings are possible, i.e. the sequence
STORE-A(seq_cst); STORE-C(relaxed); LOAD-C(relaxed); ==>
LOAD-C(relaxed); STORE-C(relaxed); STORE-A(seq_cst);
because on the PowerPC architecture, for the seq_cst barrier, sync (hwsync) is added only before STORE(seq_cst) and before LOAD(seq_cst), so all instructions that are between STORE(seq_cst) and LOAD(seq_cst) can be executed before STORE(seq_cst): godbolt.org/g/dm7tWd
Let us show in more detail an example with 3 variables and the semantics seq_cst and relaxed.
What reordering the C++ compiler allows
Why are such changes of order possible? Because the C++ compiler generates assembler code that allows the x86_64 and PowerPC processors to perform such reordering:
What reordering different CPUs may perform in the absence of assembler barriers between instructions is shown in "Memory Barriers: a Hardware View for Software Hackers", Paul E. McKenney, June 7, 2010, Table 5: www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
There is another data-sharing peculiarity between threads, which manifests itself in the interaction of 4 threads or more. If at least one of the following operations does not use the strictest barrier, memory_order_seq_cst, then different threads can see the same changes in a different order. For example:
- If thread-1 changed some value first
- And thread-2 changed some value second
- Then thread-3 may first see the changes made by thread-2, and only later the changes made by thread-1
- While thread-4 may first see the changes made by thread-1, and only later the changes made by thread-2
This is possible due to the hardware peculiarities of the cache-coherency protocol, the additional load/store buffers and the topology of the cores inside the processors. In that case two cores may see each other's changes before they see the changes made by the other cores. For all threads to see the changes in the same order, i.e. for there to be a single total order (C++11 § 29.3 (3)), all the operations (LOAD, STORE, RMW) must be performed with the memory_order_seq_cst memory barrier: en.cppreference.com/w/cpp/atomic/memory_order
In the following example the program will never terminate with an assert error (z.load() != 0), because all the operations use the strictest memory barrier, memory_order_seq_cst: coliru.stacked-crooked.com/a/52726a5ad01f6529
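Since the example itself is only linked, here is a sketch along the lines of the canonical seq_cst example from cppreference, which the linked code follows:

#include <atomic>
#include <thread>
#include <cassert>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst));
    if (y.load(std::memory_order_seq_cst)) ++z;
}
void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst));
    if (x.load(std::memory_order_seq_cst)) ++z;
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0);   // never fires: all threads see the two stores in the same single total order
    return 0;
}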
In the figure we show which reorderings are possible for Acquire-Release semantics and for Sequential semantics, using an example with 4 threads:
- Acquire-Release:
• The order of ordinary variables can change in the permitted directions
• The order of atomic variables with Acquire-Release semantics can change
- Sequential:
• The order of ordinary variables can change in the permitted directions
• The order of atomic variables with Sequential semantics cannot change
Active spin-lock and recursive-spin-lock locks
Let us show how memory barriers are applied to atomic operations, using the implementation of our own active locks as an example: spin-lock and recursive-spin-lock.
Later we will need these locks to lock individual rows of a table (row-lock) instead of locking the whole table (table-lock) - this increases the degree of parallelism and improves performance, because different threads can work with different rows in parallel without blocking the whole table.
The number of synchronization objects provided by the operating system may be limited. The number of rows in a table can be millions or billions, and many operating systems do not allow creating that many mutexes. The number of spin-locks, however, can be anything RAM permits - so they can be used to lock each individual row.
In fact, a spinlock costs +1 byte for each row, or +17 bytes when using a recursive-spin-lock (1 byte for the flag + 8 bytes for the recursion counter + 8 bytes for the thread id).
The main difference between recursive_spinlock and an ordinary spinlock is that recursive_spinlock can be locked repeatedly by the same thread, i.e. it supports recursive nested locks. std::recursive_mutex and std::mutex differ in the same way.
An example of nested locks:
Let's see how recursive_spinlock_t works.
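Since the class itself is shown only in the picture, here is a simplified sketch of a recursive_spinlock_t (a sketch, not necessarily identical to the original implementation): the lock remembers the owning thread id and a recursion counter, and really releases the flag only when the counter drops back to zero:

#include <atomic>
#include <thread>
#include <cassert>

class recursive_spinlock_t {
    std::atomic_flag lock_flag;
    int64_t recursive_counter;
    std::atomic<std::thread::id> owner_thread_id;
public:
    recursive_spinlock_t() : recursive_counter(0), owner_thread_id(std::thread::id()) {
        lock_flag.clear();
    }

    bool try_lock() {
        if (!lock_flag.test_and_set(std::memory_order_acquire)) {
            owner_thread_id.store(std::this_thread::get_id(), std::memory_order_release);
        } else if (owner_thread_id.load(std::memory_order_acquire) != std::this_thread::get_id()) {
            return false;                       // locked by another thread
        }
        ++recursive_counter;                    // same thread - just go one nesting level deeper
        return true;
    }

    void lock() {
        for (size_t i = 0; !try_lock(); ++i)
            if (i % 100 == 0) std::this_thread::yield();
    }

    void unlock() {
        assert(owner_thread_id.load(std::memory_order_acquire) == std::this_thread::get_id());
        if (--recursive_counter == 0) {         // release only at the outermost unlock
            owner_thread_id.store(std::thread::id(), std::memory_order_release);
            lock_flag.clear(std::memory_order_release);
        }
    }
};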
If you try to run this code compiled with MSVC 2013, you will get a very severe slowdown due to the std::this_thread::get_id() function: connect.microsoft.com/VisualStudio/feedback/details/1558211
We will modify the recursive_spinlock_t class to cache the thread id in a __declspec(thread) variable - an analogue of thread_local from the C++11 standard. This example shows good performance in MSVC 2013: coliru.stacked-crooked.com/a/3090a9778d02f6ea
This is a temporary fix for the old MSVC 2013, so we will not worry about the elegance of this solution.
It is believed that in most cases repeated (recursive) locking of a mutex is a design error, but in our case it may be slow but working code. Besides, everyone makes mistakes, and with nested recursive locks recursive_spinlock_t will work noticeably slower, while spinlock_t will hang forever - which is better is for you to decide.
When using our thread-safe pointer safe_ptr<T>, both of the given examples are quite logical, but the second one works only with recursive_spinlock:
safe_int_spin_recursive->second++;
Implement your own high-performance shared-mutex
As you know, new types of mutexes appeared gradually in the C++ standards: en.cppreference.com/w/cpp/thread
- C++11: mutex, timed_mutex, recursive_mutex, recursive_timed_mutex
- C++14: shared_timed_mutex
- C++17: shared_mutex
shared_mutex is a mutex that allows many threads to read the same data simultaneously, as long as no thread is modifying the data at that moment. shared_mutex did not appear immediately, because there was a debate about its performance compared to the ordinary std::mutex.
The classic implementation of shared_mutex, with a counter of the number of readers, showed a speed advantage only when many readers held the lock for a long time - i.e. when reads were long. With short reads, shared_mutex only slowed the program down and complicated the code.
But are all shared_mutex implementations so slow with short reads?
The main reason for the slow performance of std::shared_mutex and boost::shared_mutex is the atomic counter of readers. Each reader thread increments the counter when locking and decrements it when unlocking. As a result, the threads constantly bounce one cache line between the cores (more precisely, they bounce its exclusive state (E)). By the logic of such an implementation, each reader thread counts how many readers there are right now, but this is completely unimportant to a reader - the only thing that matters to it is that there is not a single writer. Moreover, since increment and decrement are RMW operations, they always flush the Store-Buffer (MFENCE on x86_64) and at the x86_64 asm level actually correspond to the slowest semantics, Sequential Consistency.
Let's try to solve this problem.
There is a class of algorithms classified as write contention-free: algorithms in which there is not a single memory cell into which more than one thread writes. In the more general case, there is not a single cache line into which more than one thread writes. For our shared-mutex with only readers present to be classified as write contention-free, readers must not interfere with each other - i.e. each reader must write the flag "I am reading" into its own cell and remove the flag from that same cell at the end of the reading, without RMW operations.
Write contention-free is the most productive guarantee, stronger in terms of performance than wait-free or lock-free.
Each cell can be placed in a separate cache line to prevent false sharing, or the cells can be packed tightly, 16 per cache line - the performance loss depends on the CPU and the number of threads. To eliminate false sharing, each variable should be placed in a separate cache line; for this purpose C++17 has alignas(std::hardware_destructive_interference_size), and before that you can use a processor-dependent solution such as char tmp[60]; (on x86_64 the cache line size is 64 bytes):
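A small sketch of the two layout options described above (the struct names are assumed):

#include <atomic>
#include <new>        // std::hardware_destructive_interference_size (C++17)

// C++17: one reader slot per cache line, so readers never share a line
struct alignas(std::hardware_destructive_interference_size) padded_flag_t {
    std::atomic<int> value{0};
};

// Pre-C++17, processor-dependent padding (the cache line on x86_64 is 64 bytes)
struct padded_flag_old_t {
    std::atomic<int> value{0};   // 4 bytes
    char tmp[60];                // 4 + 60 = 64 bytes, so neighbouring slots never share a cache line
};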
Before setting its flag, the reader checks whether there is a writer, i.e. whether there is an exclusive lock. And since a shared-mutex is used in cases where there are very few writers, all the cores in use can keep a copy of this value in their L1 cache in the shared state (S), from where they obtain the value of the writer's flag in 3 cycles, until it changes.
For all writers, as usual, there is one common flag, want_x_lock - it means that a writer is present at the moment. Writer threads set and clear it using RMW operations:
lock(): while(!want_x_lock.compare_exchange_weak(flag, true, std::memory_order_acquire))
unlock(): want_x_lock.store(false, std::memory_order_release);
But for readers not to interfere with each other, they must write information about their shared locks into different memory cells. We allocate an array for these locks, whose size is a template parameter equal to 20 by default. On the first call to lock_shared(), the thread is registered automatically, taking a particular place in this array. If there are more threads than the size of the array, the remaining threads fall back to the writer's exclusive lock when calling lock_shared(). Threads are created rarely, and the time the operating system spends on creating them is so long that the time to register a new thread in our object will be negligible.
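A highly simplified sketch of the reader path (an assumed sketch, without recursion, thread registration and the writer's side - see the full class in the repository linked in the "How to use" section):

#include <atomic>

// each registered reader owns one cell; want_x_lock is the single writer flag
bool try_lock_shared(std::atomic<int> &my_cell, std::atomic<bool> &want_x_lock) {
    my_cell.store(1, std::memory_order_seq_cst);         // 1. mark "I am reading" in my own cell
    if (want_x_lock.load(std::memory_order_seq_cst)) {   // 2. is there a writer?
        my_cell.store(0, std::memory_order_release);     //    yes - back off and try again later
        return false;
    }
    return true;                                          // no writer - shared lock is taken
}

void unlock_shared(std::atomic<int> &my_cell) {
    my_cell.store(0, std::memory_order_release);          // no RMW, no shared reader counter
}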
Let us make sure there are no obvious mistakes: with examples we will show that everything works correctly, and later we will schematically prove the correctness of our approach.
Here is an example of such a fast shared lock in which readers do not interfere with each other: coliru.stacked-crooked.com/a/b78467b7a3885e5b
- During a shared lock there can be no changes to the object. This line with two recursive shared locks shows it: assert(readonly_safe_map_string->at("apple") == readonly_safe_map_string->at("potato")); - the values of both strings must always be equal, because we change the 2 strings in std::map under one eXclusive lock, std::lock_guard.
- When reading, we really do call the lock_shared() function. Let's shorten the loop to two iterations, remove the lines that modify the data, and leave only the first two inserts into std::map in the main() function. Now let's add printing of the letter S to the lock_shared() function and of the letter X to the lock() function. We see that first there are two inserts, X, and then only the letters S - meaning that when reading a const object we do call shared_lock(): coliru.stacked-crooked.com/a/515ba092a46135ae
- During modifications, we really do call the lock() function. Now let's comment out the reading and leave only the operations that modify the array; now only the letters X are printed: coliru.stacked-crooked.com/a/882eb908b22c98d6
The main task is to ensure that only one of two states can exist at any given time:
- Any number of threads have successfully executed lock_shared(), and all threads attempting to execute lock() must wait
- One of the threads has successfully executed lock(), and all other threads attempting to execute lock_shared() or lock() must wait
Schematically, the table of compatible states looks like this.
Consider the algorithm of our contention_free_shared_mutex for the cases when 2 threads try to perform operations at the same time:
- T1-read & T2-read: the reader threads take the lock via lock_shared() - each of them sets the lock flag only in its own cell, and both only check that there is no writer (want_x_lock == false). Since the readers write to different cells, they do not interfere with each other and do not need a CAS operation of the form want_x_lock = true.
- T1-write & T2-write: each writer thread tries to set the common flag (want_x_lock) to true using the CAS operation want_x_lock.compare_exchange_weak(); whichever thread succeeds takes the lock, the other spins and waits - exactly as in recursive_spinlock_t described above.
- T1-read & T2-write: the reader thread T1 sets the flag in its own cell, then checks the writer flag (want_x_lock), and if it is set (true), the reader removes the flag from its cell and waits until (want_x_lock == false), after which it starts again from the beginning.
The writer thread T2 sets the flag (want_x_lock = true), and then waits until all reader threads have cleared the lock flags from their cells.
In our scheme, the writer thread has priority over the reader. If they set their lock flags simultaneously, the reader thread, with its very next operation, checks whether there is a writer thread (want_x_lock == true), and if there is, the reader cancels its lock. The writer thread sees that there are no more locks from readers and successfully completes the locking function. The global order of these locks is maintained by Sequential Consistency semantics (std::memory_order_seq_cst).
Schematically, the interaction of the two threads (a reader and a writer) looks as follows.
Full size: habrastorage.org/getpro/habr/post_images/5b2/3a3/23b/5b23a323b6a1c9e13ade90bbd82ed98b.jpg
In both functions, lock_shared() and lock(), both operations 1. and 2. use std::memory_order_seq_cst - i.e. a single total order over all threads is guaranteed for these operations.
Automatic deregistration of a thread in the cont-free shared-mutex
When a thread first accesses our lock, it is registered. When this thread terminates or the lock is destroyed, this registration must be cancelled.
But now let's see what happens if 20 threads work with our mutex, then these threads terminate, and 20 new threads try to register, given that the lock array holds 20 entries: coliru.stacked-crooked.com/a/f1ac55beedd31966
As you can see, the first 20 threads registered successfully, but the next 20 threads could not register (register_thread = -1) and had to use the writer's exclusive lock, even though the previous 20 threads had already gone and no longer use the lock.
To solve this problem, we add automatic deregistration of a thread when the thread is destroyed. If the thread has worked with many such locks, then at the moment of thread termination the registration must be cancelled in all of them. And if at the moment the thread is destroyed some of these locks have themselves already been destroyed, there must be no error from attempting to cancel the registration in a non-existent lock.
Example: coliru.stacked-crooked.com/a/4172a6160ca33a0f
As you can see, 20 threads registered first, and after they were destroyed and 20 new threads created, the new ones were also able to register under the same numbers, register_thread = 0 - 19 (see the output of the example).
Now we will show that even if threads worked with a lock and then the lock was destroyed, there will be no error at thread termination from attempting to cancel registration in a non-existent lock object: coliru.stacked-crooked.com/a/d2e5a4ba1cd787da
We set timers so that 20 threads are created, perform reading using our lock and fall asleep for 500 ms; meanwhile the contfree_safe_ptr object containing our contention_free_shared_mutex lock is destroyed after 100 ms, and only after that do the 20 threads wake up and terminate. On their termination there was no error from cancelling registration in the destroyed lock object.
As a small addition, we will support MSVS2013 so that owners of the old compiler can also see the example. We add simplified support for thread registration, but without the possibility of deregistering a thread.
We show the final result as an example in which all the above considerations are taken into account.
Example: coliru.stacked-crooked.com/a/0a1007765f13aa0d
The correct functioning of the algorithm and the selected barriers
Above, we conducted tests that showed no obvious errors. But to prove correctness, it is necessary to create a scheme of the possible changes in the order of operations and the possible states of variables.
The exclusive lock lock()/unlock() is almost as simple as in the case of recursive_spinlock_t, so we will not consider it in detail.
The competition between a reader thread in lock_shared() and a writer thread in lock() was discussed in detail above.
Now the main task is to show that lock_shared() in all cases uses at least acquire semantics, and unlock_shared() in all cases uses at least release semantics. But this is not a requirement for repeated (recursive) locking/unlocking.
Now we will show how these barriers are implemented in our code.
The barriers of lock_shared() are shown schematically below; the red crossed-out arrows show the directions in which reordering is prohibited:
The barriers of unlock_shared(), schematically:
Full size: hsto.org/files/065/9ce/bd7/0659cebd72424256b6254c57d35c7d07.jpg
The full-size picture of the states and transitions: hsto.org/files/fa7/4d2/2b7/fa74d22b7cda4bf4b1015ee263cad9ee.jpg
We also present a block diagram of the same lock_shared() function.
Picture in full size: hsto.org/files/c3a/abd/4e7/c3aabd4e7cfa4e9db55b2843467d878f.jpg
The oval blocks mark a strict sequence of operations:
- First, the operation marked in red is executed
- Then, the operation marked in purple is executed
Green marks changes that can be executed in any order, because these changes do not lock our "shared-mutex" but only increment the recursion nesting counter - these changes matter only for local use, i.e. these green operations are not the actual entry into the lock.
Since the following 2 conditions are fulfilled, all necessary side-effects of multithreading are considered to be taken into account:
- The moment of making the decision about entering the lock always has semantics of no less than "acquire":
• want_x_lock.load(std::memory_order_seq_cst)
• want_x_lock.compare_exchange_weak(flag, true, std::memory_order_seq_cst)
- The moment of exiting the lock always has semantics of no less than "release".
Further, the correctness of the algorithm can be checked by simply comparing the results of these operations and their sequence with the logic of the algorithm.
All other functions of our contention_free_shared_mutex lock are more obvious from the point of view of multithreaded execution logic.
Note also that on repeated (recursive) locking, the atomic operations do not have to have the std::memory_order_acquire barrier (as shown in the figure); std::memory_order_relaxed is enough, because it is not the actual entry into the lock - we are already locked. But this does not add much speed and could complicate understanding.
How to use
Here is an example of using contention_free_shared_mutex<> in C++ as a highly optimized shared_mutex.
Download this code for Linux (GCC 6.3.0) and Windows (MSVS 2015/2013) at the following link: github.com/AlexeyAB/object_threadsafe/tree/master/contfree_shared_mutex
To compile this example with Clang++ 3.8.0 on Linux you must change the Makefile.
#include <iostream>
#include <thread>
#include <vector>
#include <cassert>
#include "safe_ptr.h"

template<typename T>
void func(T &s_m, int &a, int &b)
{
    for (size_t i = 0; i < 100000; ++i)
    {
        // x-lock for modification
        {
            s_m.lock();
            a++;
            b++;
            s_m.unlock();
        }
        // s-lock for reading
        {
            s_m.lock_shared();
            assert(a == b);   // will never happen
            s_m.unlock_shared();
        }
    }
}

int main() {
    int a = 0;
    int b = 0;
    sf::contention_free_shared_mutex<> s_m;

    // 20 threads
    std::vector<std::thread> vec_thread(20);
    for (auto &i : vec_thread) i = std::move(std::thread([&]() { func(s_m, a, b); }));
    for (auto &i : vec_thread) i.join();

    std::cout << "a = " << a << ", b = " << b << std::endl;
    getchar();

    return 0;
}
This code in the online compiler: coliru.stacked-crooked.com/a/11c191b06aeb5fb6
As you can see, our sf::contention_free_shared_mutex<> mutex is used in exactly the same way as the standard std::shared_mutex.
Test: std :: shared_mutex vs contention_free_shared_mutex
Let us give an example of a test on 16 threads on a single Intel Xeon E5-2660 v3 2.6 GHz server processor. First of all, we are interested in the blue and purple lines:
- safe_ptr<std::map, std::shared_mutex>
- contfree_safe_ptr<std::map>
The source code of the test: github.com/AlexeyAB/object_threadsafe/tree/master/bench_contfree
Command line for starting:
numactl --localalloc --cpunodebind=0 ./benchmark 16
If you have only one CPU on the motherboard, then run: ./benchmark
Performance of various locks for different ratios of read locks (shared-lock) and write locks (exclusive-lock):
- % of exclusive locks = (% of write operations)
- % of shared locks = 100 - (% of write operations)
(in the case of std::mutex, an exclusive lock is always used)
Performance (the bigger the better), MOps - millions of operations per second
- With 0% of modifications, our shared-mutex (as part of contfree_safe_ptr<map>) shows a performance of 34.60 MOps, which is 14 times faster than the standard std::shared_mutex (as part of safe_ptr<map, std::shared_mutex>), which shows only 2.44 MOps.
- With 15% of modifications, our shared-mutex (as part of contfree_safe_ptr<map>) shows a performance of 5.16 MOps, which is 7 times faster than the standard std::shared_mutex (as part of safe_ptr<map, std::shared_mutex>), which shows only 0.74 MOps.
Our contention-free shared-mutex lock works for any number of threads: for the first 36 threads as contention-free, and for the subsequent ones as an exclusive lock. As the graphs show, even the "exclusive-lock" std::mutex works faster than std::shared_mutex when there are 15% of modifications.
The number of threads 36 for contention-free is set by the template parameter and can be changed.
Now we will show the median latency for different ratios of lock types: read (shared-lock) and write (exclusive-lock).
In the test code main.cpp you need to set: const bool measure_latency = true;
Command line for starting:
numactl --localalloc --cpunodebind=0 ./benchmark 16
If you have only one CPU on the motherboard, then run: ./benchmark
Median latency (the lower the better), microseconds
Thus, we have created a shared lock in which readers do not interfere with each other during locking and unlocking, unlike std::shared_timed_mutex and boost::shared_mutex. However, we have additional allocations per thread: 64 bytes in the lock array + 24 bytes occupied by the unregister_t structure used for deregistration + an element pointing to this structure from a hash_map - about 100 bytes per thread.
A deeper issue is scalability. For example, if you have 8 CPUs (Intel Xeon Processor E7-8890 v4) with 24 cores each (48 threads each with HyperThreading), that is 384 logical cores in total. Before writing, each writer thread must read 24,576 bytes (64 bytes from each of the 384 cores), but it can read them in parallel. That is certainly better than waiting while one cache line travels sequentially from each of the 384 threads to the next, as happens in an ordinary std::shared_timed_mutex or boost::shared_mutex (for any type of unique/shared lock). However, parallelism across 1000 cores and more is usually achieved through a different approach, rather than by calling an atomic operation for every message. All the options discussed above - atomic operations, active locks, lock-free data structures - are needed for low latency (0.5 - 5 usec) of individual messages.
For a high number of operations per second, i.e. for high overall throughput and scalability over tens of thousands of logical cores, approaches of massive parallelism are used: "hide latency" and "batch processing", where messages are sorted (for a map) or grouped (for a hash_map) and merged with an already sorted or grouped array every 50 - 500 usec. As a result, each operation has a latency 10-100 times higher, but these latencies occur simultaneously in a huge number of threads, so the latency is hidden thanks to "Fine-grained Temporal Multithreading".
Suppose each message has a latency 100 times higher, but 10,000 times more messages are processed. Per unit of time that is 100 times more efficient. Such principles are applied when developing on GPUs. Perhaps in the following articles we will analyze this in more detail.
Conclusion:
We have developed our own "shared-mutex" that does not require reader threads to synchronize with each other, as is done in the standard std::shared_mutex. We strictly proved the correctness of our "shared-mutex". We also studied atomic operations, memory barriers and the allowed reordering directions in detail, for maximum performance optimization. Next, we will see how much we managed to increase the performance of multithreaded programs compared to the highly optimized lock-free algorithms from the libCDS library (Concurrent Data Structures library): github.com/khizmax/libcds