Making any object thread safe

In these 3 articles, I will talk in detail about atomic operations, memory barriers and fast data exchange between threads, as well as “sequence-points” using the example “execute-around-idiom”, and at the same time we will try to do something useful — a smart pointer that makes any object thread safe for any operations with its members — variables or functions. And then we will show how using it to achieve the performance of highly optimized lock-free algorithms on 8 - 64 cores.

Three related articles:

Making any object thread safe
Accelerate std :: shared_mutex 10 times
Thread safe std :: map with lock-free map performance

→ My article in English
→ Examples and tests from all three articles
')
In the standard C ++ library, there are no thread-safe containers (array, list, map ...) that can be used in several threads without additional locks. Using standard containers for multi-stream exchange there is a danger that you forget to protect one of the sections of the mutex code or mistakenly protect it with another mutex.

Obviously, the developer will make more mistakes if he uses his own solution instead of the standard one. And if the task is so complicated that the solution is not in the standard, then the developer, solving it, will drown in mistakes.

Based on the principle of “practicality is more important than impeccability” (“practicality beats purity”), we will try to create not an ideal, but practical solution to this problem.

In the article, we implement a smart pointer that makes any object thread safe, with a performance that is not inferior to optimized lock-free containers.

A simplified non-optimized example of using such a pointer:

int main() { contfree_safe_ptr< std::map<std::string, int> > safe_map_string_int; std::thread t1([&]() { safe_map_string_int->emplace("apple", 1); }); std::thread t2([&]() { safe_map_string_int->emplace("potato", 2); }); t1.join(); t2.join(); std::cout << "apple = " << (*safe_map_string_int)["apple"] << ", potato = " << (*safe_map_string_int)["potato"] << std::endl; return 0; }

At each step of our algorithm, we will quote from the standard to guarantee the necessary behavior.

We will take a closer look at the C ++ memory model, the various possible synchronization and data exchange options between threads.

In the second article, we will implement the optimized contention-free shared-mutex and optimized pointer based on its contfree_safe_ptr <>. And in the third article we will show examples of the optimal use of contfree_safe_ptr <> and give performance tests.

Start

Let's start with the development of the smart pointer template safe_ptr <T>, which will be thread safe for any type T and show multi-threaded performance at the level of optimized lock-free algorithms.

Moreover, with the ability to work atomically and consistently over several different objects, where even for lock-free data structures, we would have to use locks and there would be a risk of perpetual deadlock. But we will develop a special mutex lock class that resolves deadlocks situations. Then we implement our own high-performance contention-free shared-mutex, which is much faster than the standard std :: shared_mutex. And based on this, we will create an optimized version of the safe pointer safe_ptr <T>, called contfree_safe_ptr <>. At the end, we will perform performance tests comparing with the lock-free library CDSlib. We will see similar performance to CDSlib (SkipListMap and BronsonAVLTreeMap) using the example contfree_safe_ptr <std :: map <>>.

As a result, to make any of your T classes flow-safe, you do not need time, just write: contfree_safe_ptr <T> ptr_thread_safe;

And the performance will be close to the fact that if you had been developing lock-free algorithms in the methods of your class for a month. In addition, it will be possible to atomically change several contfree_safe_ptr <> at once. Like std :: shared_ptr <>, our smart pointer will contain a reference count. It can be copied, and after deleting the last copy, the dynamically allocated memory will automatically be freed.

At the end, 1 safe_ptr.h file will be provided, which is enough to connect via #include “safe_ptr.h” so that you can use this class.

Basic principles of multi-threaded data exchange

As you know, we can read and modify the same object from different threads only in the following 4 cases:

1. Lock-based. The object is protected by locking: spin-lock, std (mutex, recursive_mutex, timed_mutex, shared_timed_mutex, shared_mutex ...): en.cppreference.com/mwiki/index.php?title=Special%3ASearch&search=mutex
2. Atomic. The object is of type std :: atomic <T>, where T is a pointer, bool, or integral type (std :: is_integral <T> :: value == true), and only if at type T there are atomic operations at the CPU level: en .cppreference.com / w / cpp / atomic / atomic
2 + 1 (Lock-based-Atomic) Otherwise, if type T is a trivially copyable type, i.e. satisfy the condition std :: is_trivially_copyable <T> :: value == true, then std :: atomic <T> works like Lock-based - the lock is automatically used inside it
3. Transaction-safe. Functions implemented to work with the object provide a stream-safe transaction_safe guarantee (Transactional Memory TS ( ISO / IEC TS 19841: 2015 ) - Experimental C ++): en.cppreference.com/w/cpp/language/transactional_memory
4. Lock-free. Functions for working with an object are implemented based on lock-free algorithms, i.e. provide a stream-safe lock-free guarantee

If you are well aware of all 4 ways to ensure flow-safety, you can skip this chapter.

Consider in reverse order:

(4) Lock-free algorithms are rather complicated, and often several scientific papers are required to create each complex algorithm. Examples of lock-free algorithms for working with containers: unordered_map, ordered_map, queue ... and links to scientific papers can be found in the library - Concurrent Data Structures (CDS) library: github.com/khizmax/libcds
These are very fast and reliable multi-threaded data structures, but so far there are no plans to include them in the standard C ++ 17 or C ++ 20 library and they are not included in the draft standards: www.open-std.org/JTC1/SC22/WG21

(3) Transaction-safe is planned to be included in the experimental part of the C ++ standard and there is already a draft of the standard ISO / IEC TS 19841: 2015: www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4514.pdf
But not even all STL-containers are planned to be transaction-safe. For example, the container std :: map is not even planned to make transaction-safe, since for it, only the following functions are defined as transaction-safe: begin, end, size, max_size, empty. But not defined as thread safe functions: find, modify, insert. And implementing your own object with transaction-safe member functions is not at all easy, otherwise you would have done it for std :: map.

(2) Atomic. This approach is already standardized in C ++ 11 and you can easily use it. For example, it is enough to declare the variable std :: atomic shared_val; then pass the link or pointer to it in several threads and all calls through member functions and std :: atomic operators will be thread safe: [3] coliru.stacked-crooked.com/a/9d0219a5cfaa3cd5
Member functions, Specialized member functions: en.cppreference.com/w/cpp/atomic/atomic
It is important to understand that if you do several operations on an atomic variable, no matter in one expression or in several, then between them another thread can change the value of this variable. Therefore, for the atomic execution of several operations, the Lock-free approach based on the CAS function (Compare-And-Swap) compare_exchange_weak () is used - namely: we read the value from the atomic variable into the local variable (old_local_val), do the many operations we need and write the result into a local variable (new_local_val), and at the end of the CAS function we compare the current value of the atomic variable with the initial one (old_local_val) and if they are not equal, then we perform the cycle again, and if they are equal, it means that during this time another thread did not make any changes, then we change atomic value variable to a new value (new_local_val). Moreover, the comparison and assignment are done by a single operation. Compare_exchange_weak () is an atomic function and until it is completely executed, no one can change the value of a variable: [4] coliru.stacked-crooked.com/a/aa52b45150e5eb0a

This looping approach is known as optimistic locking. Pessimistic locks are called: spin-lock, mutex ...

And if all operations of such a cycle are performed without pessimistic locks, then such an algorithm is called lock-free.

Often, the atomic CAS function is replaced with pointers, namely: a new memory is allocated, a modified object is copied into it and a pointer to it is obtained, a series of actions are performed on this copy, and at the end the old pointer to the pointer to a new object is replaced with a CAS function, time another thread did not change the old pointer. But if the pointer was changed by another thread, then everything repeats from the beginning.

There may be a problem called “ ABA ”. When other threads have time to change the pointer twice, and the second time the pointer is changed to the original value, but at this address other threads already have time to delete the object and create a new one. Those. the pointer value also points to another object, and we see that the value has not changed and we think that the object has not been replaced. There are many ways to solve this problem, for example: LL / SC, RCU, hazard_pointer, garbage collector ...

Atomic is the fastest way to exchange data between threads. In addition, less strict and faster memory barriers can be used for all atomic operations, which we will discuss in more detail later. By default, the most secure and strict reordering barrier is used: std :: memory_order_seq_cst. But as we noted above: it takes a lot of effort to implement complex logic using atomic variables.

(2) - (1) Atomic & Lock-based.
But if you need to read or change atomic several variables at once, std :: atomic a, b, c; and you do not want to implement the lock-free algorithm and solve the ABA problem, you need to use a lock. The processor atomic CAS function on most CPUs can check whether only one variable with a maximum width of 64-bit was changed, but at that time another variable could have changed. Solution: std :: atomic <T> allows you to use for T - type structure of any size.

The C ++ standard introduced the ability to use any type T for std :: atomic <T>, if it is “trivially copyable type”, i.e. satisfies the condition std :: is_trivially_copyable <T> :: value == true

What the C ++ Working Draft Standard says , Standard for Programming Language C ++ N4606 2016-07-12 : www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

§29.5 / 1
There is a generic class template atomic <T> . This is a trivially copyable (3.9). [Note: It can be difficult to use. - end note]

§3.9 / 9
Scalar types, (c) Clause 9, collectively called trivially copyable types

But if the atomic processor CAS function can check whether only one variable with a maximum width of 64-bit has been changed, and we have 3 variables for 32-bit, then how does the CAS function work in std :: atomic <T>? CAS and all other functions will automatically use the lock (std :: mutex or some other) contained in the standard implementation of std :: atomic <T>, for T - trivially copyable types.

For the atomic change of several variables - we can use the structure of the variables struct T {int price, count, total; }; as the type for the std :: atomic <T> template.
Example: [5] coliru.stacked-crooked.com/a/bdfb3dc440eb6e51

Example output: 10, 7, 70

In this example, the last value 70 at any time will be equal to the product of the first two values 10 * 7 - i.e. the whole structure changes only atomically.

This gcc and clang code for x86 must be compiled with the -latomic flag
In addition, each call to std :: atomic <T> shared_val; will cause a lock inside it, as indicated by the value of the function shared_val.is_lock_free () == false.

Those. Globally, we used optimistic locking (loop), and locally used 2 pessimistic locks when accessing an atomic variable: getting the old value and calling the CAS function.

(1) Lock-based.
But we will not be able to use std :: atomic <T> for any type T created by you because of the mandatory condition “trivially copyable” for type T. Of all STL containers, we can only use std :: array <>. For example, we cannot use std :: atomic <std :: map <int, int >>, since type std :: map <> is not trivially copyable, for any arguments of its template. And your own classes most likely cannot be used as a type T for std :: atomic <T> either.

In this case, you will have to create the mutex objects yourself, block them each time before using shared objects and unlock them later. Concept: en.cppreference.com/w/cpp/concept/Mutex .

C ++ has the following mutexes: std :: mutex, std :: recursive_mutex, std :: timed_mutex, std :: recursive_timed_mutex, std :: shared_timed_mutex, std :: shared_mutex. Read more about them here: en.cppreference.com/w/cpp/thread .

For example, we create any object shared by streams between std :: map <int, T> and create a mutex protecting it, then pass references to them in several streams. And in each thread we block the mutex before using the shared object: [6] coliru.stacked-crooked.com/a/a97e905d54ae1fbb .

And we block using RAII idioms: std :: lock_guard <std :: mutex> lock (mtx); - when creating this object, its constructor locks the mutex, and at the end of the object’s life, the destructor unlocks the mutex. Thus, we will not forget to unblock it, because The destructor will be called automatically.

But there are still 4 main problems:

Deadlocks - if you write the code so that stream-1 blocks mtx1, and stream-2 blocks mtx2, and holding blocking thread-1 will try to capture mtx2, and stream-2 will try to capture mtx1, then these threads will wait for each other forever. Lock-free algorithms are deprived of this problem, but not any logic can be implemented with the help of lock-free - we will show this using an example of coordinated atomic change of several containers.
If you write the code so that while the mutex is locked, you assign a reference to the shared object to the pointer, whose life is longer than that of the std :: lock_guard lock, then after removing the lock you can refer to the shared object using this pointer - this will lead to Data races i.e. to the non-consistent state of the shared object and to the incorrect operation or crash of the program. The same will happen if the iterator received from the shared object will be used after unlocking the mutex.
You can confuse mutexes and block a mutex that protects another object - Data races.
You can just forget to lock the mutex in the right place - Data races.

Execute Around Pointer Idiom

In addition to RAII idioms, there is another interesting idiom - Execute Around Pointer, which will help to cope with the last two problems:

The mutex will be merged with your object, and you will not be able to block a separate mutex, but your object itself.
The mutex will be blocked automatically when accessing any member of the class of the protected object — and it will be blocked for the entire duration of the expression.

As a result, you will not be able to forget to block the mutex, and you will not be able to confuse mutexes.

Making any object thread safe

Execute Around Pointer Idiom is a long-known idiom with a strictly defined order of execution, used for various purposes from visualization to logging: en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Execute-Around_Pointer
Example: [7] coliru.stacked-crooked.com/a/4f3255b5473c73cc

  execute_around<std::vector<int>> vecc(10, 10); int res = my_accumulate(vecc->begin(), vecc->end(), 0);

First, temporary objects of the proxy type will be created that block the mutexes inside execute_around, then iterators returned by the begin () and end () functions will be passed to the function, then the my_accumulate () function will be executed, and only after its completion will temporary objects of the proxy type be deleted , and their destructors unlock mutexes.

Read more in the article: C ++ Patterns Executing Around Sequences. Kevlin Henney : hillside.net/europlop/HillsideEurope/Papers/ExecutingAroundSequences.pdf
In C ++, there are two definitions that strictly define the order of operations of Standard § 1.9 (13): sequenced before and sequenced after. In the standard quotes below, you will see “sequenced before” 2 times.

The principle and sequence of execution of all actions in Execute Around Pointer Idiom are strictly described in the C ++ Standard Draft Standard for C ++ N4606 Standard Programming Language 2016-07-12 : www.open-std.org/jtc1/sc22/wg21/docs/papers/2016 /n4606.pdf
First we give five quotes from the standard, and then we will show how each of the quotes explains the behavior of Execute Around Pointer Idiom.

1. For all types of T other than raw-pointers: x-> m is interpreted as (x.operator -> ()) -> m. Those. expression (x-> m) will unfold multiple times ((x.operator -> (). operator -> ()) -> m until we get a raw pointer. An example with three expressions identical in meaning: [8] coliru. stacked-crooked.com/a/dda2474024b78a0b

§ 13.5.6
operator-> shall be a non-static member function taking no parameters. It implements the class member access syntax that uses ->.
postfix-expression -> template opt id-expression
postfix-expression -> pseudo-destructor-name
An expression x-> m is interpreted as (x.operator -> ()) -> m for a class object by the overload resolution mechanism (13.3).

2. When a function is called, even if it is “inline,” absolutely any calculations and effects from expressions that calculate the function's arguments are executed before the function body begins to execute.

§ 1.9 / 16
If you are calling on a line, it’s called function.

3. The entire expression is completely executed before the temporary object is destroyed.

§ 1.9 / 10

 void f() { if (S(3).v()) // full-expression includes lvalue-to-rvalue and // int to bool conversions, performed before // temporary is deleted at end of full-expression { } }

4. After the entire expression is completely executed, the temporary objects are destroyed in the reverse order of the order in which they were created.

§ 1.9 Footnote 8
It has been determined that it has been determined that it has been defined as

5. Three cases when the temporary object is not destroyed at the end of the whole expression - 2 cases when the array elements are initialized, but the 3rd case when the reference to the temporary object is created.

§ 12.2 Temporary objects
§ 12.2 / 5
There are three contexts in which they can be destroyed. An initial context (8.6). The second context is where the copy is used (5.1.5, 12.8). In the case of the next array element, it should be The third context is when a reference is bound to a temporary.

For example, we have a simplified class execute_around <>

 template<typename T, typename mutex_type = std::recursive_mutex> class execute_around { std::shared_ptr<mutex_type> mtx; std::shared_ptr<T> p; void lock() const { mtx->lock(); } void unlock() const { mtx->unlock(); } public: class proxy { std::unique_lock<mutex_type> lock; T *const p; public: proxy (T * const _p, mutex_type& _mtx) : lock(_mtx), p(_p) { std::cout << "locked \n";} proxy(proxy &&px) : lock(std::move(px.lock)), p(px.p) {} ~proxy () { std::cout << "unlocked \n"; } T* operator -> () {return p;} const T* operator -> () const {return p;} }; template<typename ...Args> execute_around (Args ... args) : mtx(std::make_shared<mutex_type>()), p(std::make_shared<T>(args...)) {} proxy operator -> () { return proxy(p.get(), *mtx); } const proxy operator -> () const { return proxy(p.get(), *mtx); } template<class Args> friend class std::lock_guard; };

And we use our sample execute_around <> class as follows, example: [45] coliru.stacked-crooked.com/a/d2e02b61af6459f5

 int main() { typedef execute_around<std::vector<int>> T; T vecc(10, 10); int res = my_accumulate(vecc->begin(), vecc->end(), 0); return 0; }

Then the last expression can be converted to the following form through several transformations.

1. According to the 1st quotation of the standard, x-> m is interpreted as (x.operator -> ()) -> m:

  int res = my_accumulate( (vecc.operator->())->begin(), (vecc.operator->())->end(), 0);

2. Since vecc.operator -> () returns a temporary object T :: proxy (), we get:

  int res = my_accumulate( T::proxy(vecc.p.get(), *vecc.mtx)->begin(), T::proxy(vecc.p.get(), *vecc.mtx)->end(), 0);

3. Further, according to quotes 2, 3 and 4, temporary objects of proxy type will be created before the function starts to be executed and will be destroyed after the end of the function (after the end of the whole expression):

 T::proxy tmp1(vecc.p.get(), *vecc.mtx); // lock-1 std::recursive_mutex T::proxy tmp2(vecc.p.get(), *vecc.mtx); // lock-2 std::recursive_mutex int res = my_accumulate(tmp1->begin(), tmp2->end(), 0); tmp2.~ T::proxy(); // unlock-2 std::recursive_mutex tmp1.~ T::proxy(); // unlock-1 std::recursive_mutex

4. According to the 1st quotation again:
• tmp1-> begin () is equivalent to (tmp1.operator -> ()) -> begin ()
• tmp1.operator -> () returns p
As a result, we get where p is a shared_ptr pointer to our object of type std :: vector:

 typedef execute_around<std::vector<int>> T; T vecc(10, 10); T::proxy tmp1(vecc.p.get(), *vecc.mtx); // lock-1 std::recursive_mutex T::proxy tmp2(vecc.p.get(), *vecc.mtx); // lock-2 std::recursive_mutex int res = my_accumulate(tmp1.p->begin(), tmp2.p ->end(), 0); tmp2.~ T::proxy(); // unlock-2 std::recursive_mutex tmp1.~ T::proxy(); // unlock-1 std::recursive_mutex

In 4 steps, we have described a strict sequence of all actions of the idiom. Note that the standard does not guarantee the mutual order of creating temporary variables tmp1 and tmp2, i.e. tmp2 can be created first and then tmp1, but this does not change the logic of our program.
Note that we did not refer to the fifth quotation of the standard, since It describes 3 cases in which the time of removal of an object may differ from the given one, and as we see, none of these cases can correspond to ours. The first 2 cases in the standard citation are initialization or copying of an array, they shorten the life of a temporary object, and the 3rd case is the extension of the life of a temporary object due to the presence of a reference to it.

Thread safe associative array

Agree, it would be convenient to have such a template class safe_ptr <> in which you can pass any type, and as a result get a thread-safe result type?
safe_ptr <std :: map <std :: string, std :: pair <std :: string, int> >> safe_map_strings;
Then you can work with this object as with a pointer to an associative array: std :: shared_ptr <std :: map <std :: string, std :: pair <std :: string, int> >>
But now we can safely work with them from different streams, and each individual expression will be thread-safe:

  (*safe_map_strings)["apple"].first = "fruit"; (*safe_map_strings)["potato"].first = "vegetable"; safe_map_strings->at("apple").second = safe_map_strings->at("apple").second * 2; safe_map_strings->find("potato")->second.second++;

Here is a fully working example of a thread-safe associative: std :: map <>. [9]
coliru.stacked-crooked.com/a/5def728917274b22

 #include <iostream> #include <string> #include <vector> #include <memory> #include <mutex> #include <thread> #include <map> template<typename T, typename mutex_t = std::recursive_mutex, typename x_lock_t = std::unique_lock<mutex_t>, typename s_lock_t = std::unique_lock<mutex_t >> class safe_ptr { typedef mutex_t mtx_t; const std::shared_ptr<T> ptr; std::shared_ptr<mutex_t> mtx_ptr; template<typename req_lock> class auto_lock_t { T * const ptr; req_lock lock; public: auto_lock_t(auto_lock_t&& o) : ptr(std::move(o.ptr)), lock(std::move(o.lock)) { } auto_lock_t(T * const _ptr, mutex_t& _mtx) : ptr(_ptr), lock(_mtx){} T* operator -> () { return ptr; } const T* operator -> () const { return ptr; } }; template<typename req_lock> class auto_lock_obj_t { T * const ptr; req_lock lock; public: auto_lock_obj_t(auto_lock_obj_t&& o) : ptr(std::move(o.ptr)), lock(std::move(o.lock)) { } auto_lock_obj_t(T * const _ptr, mutex_t& _mtx) : ptr(_ptr), lock(_mtx){} template<typename arg_t> auto operator [] (arg_t arg) -> decltype((*ptr)[arg]) { return (*ptr)[arg]; } }; void lock() { mtx_ptr->lock(); } void unlock() { mtx_ptr->unlock(); } friend struct link_safe_ptrs; template<typename mutex_type> friend class std::lock_guard; //template<class... mutex_types> friend class std::lock_guard; // C++17 public: template<typename... Args> safe_ptr(Args... args) : ptr(std::make_shared<T>(args...)), mtx_ptr(std::make_shared<mutex_t>()) {} auto_lock_t<x_lock_t> operator -> () { return auto_lock_t<x_lock_t>(ptr.get(), *mtx_ptr); } auto_lock_obj_t<x_lock_t> operator * () { return auto_lock_obj_t<x_lock_t>(ptr.get(), *mtx_ptr); } const auto_lock_t<s_lock_t> operator -> () const { return auto_lock_t<s_lock_t>(ptr.get(), *mtx_ptr); } const auto_lock_obj_t<s_lock_t> operator * () const { return auto_lock_obj_t<s_lock_t>(ptr.get(), *mtx_ptr); } }; // --------------------------------------------------------------- safe_ptr<std::map<std::string, std::pair<std::string, int> >> safe_map_strings_global; void func(decltype(safe_map_strings_global) safe_map_strings) { //std::lock_guard<decltype(safe_map_strings)> lock(safe_map_strings); (*safe_map_strings)["apple"].first = "fruit"; (*safe_map_strings)["potato"].first = "vegetable"; for (size_t i = 0; i < 10000; ++i) { safe_map_strings->at("apple").second++; safe_map_strings->find("potato")->second.second++; } auto const readonly_safe_map_string = safe_map_strings; std::cout << "potato is " << readonly_safe_map_string->at("potato").first << " " << readonly_safe_map_string->at("potato").second << ", apple is " << readonly_safe_map_string->at("apple").first << " " << readonly_safe_map_string->at("apple").second << std::endl; } int main() { std::vector<std::thread> vec_thread(10); for (auto &i : vec_thread) i = std::move(std::thread(func, safe_map_strings_global)); for (auto &i : vec_thread) i.join(); std::cout << "end"; int b; std::cin >> b; return 0; }

Conclusion:
potato is vegetable 65042, apple is fruit 65043
potato is vegetable 81762, apple is fruit 81767
potato is vegetable 84716, apple is fruit 84720
potato is vegetable 86645, apple is fruit 86650
potato is vegetable 90288, apple is fruit 90291
potato is vegetable 93070, apple is fruit 93071
potato is vegetable 93810, apple is fruit 93811
potato is vegetable 95788, apple is fruit 95790
potato is vegetable 98951, apple is fruit 98952
potato is vegetable 100000, apple is fruit 100000
end

From here 2 conclusions:

100000 , 10 -. , , operator -> auto_lock_t auto_lock_obj_t , , - – data-race: [10] coliru.stacked-crooked.com/a/45d47bcb066adf2e
10000 , , .. . Those.Before each increment operator ++ blocked the mutex, and immediately after the increment it was unlocked, and then the mutex could be blocked by another thread that executed the increment. At the beginning of each thread, we can immediately block the mutex until the end of the execution of the stream function using std :: lock_guard <>, and see what the result would be if the threads were executed sequentially rather than pseudo- parallel : [11] coliru.stacked-crooked.com/ a / cc252270fa9f7a78

Both findings confirm that our smart pointer class template safe_ptr <T> automatically ensures the flow-safety of the protected object of type T.

Flow safety, atomicity and consistency of several objects.

We show how to atomically change several objects at once so that their consistency is preserved. And we will show when it is needed, how to do it and what happens if this is not done.
Let's give a simplified example, let's say we have 2 tables:

user_accounts (INT user_id, STRING user_name, INT money) - a table with the amount of money for each client - sorted by the user_id field
cash_flows (INT unique_id, INT src_id, INT dst_id, INT time, INT money) - a table showing the movement of money - each entry is referenced by two associative arrays sorted: by src_id field and by dst_id field

 // 1-st table struct user_accounts_t { std::string user_name; int64_t money; user_accounts_t(std::string u, int64_t m) : user_name(u), money(m) {} }; std::map<uint64_t, user_accounts_t> user_accounts; // 2-nd table struct cash_flows_t { uint64_t unique_id, src_id, dst_id, time; int64_t money; }; std::atomic<uint64_t> global_unique_id; // SQL-sequence std::multimap<uint64_t, std::shared_ptr<cash_flows_t>> cash_flows_src_id; std::multimap<uint64_t, std::shared_ptr<cash_flows_t>> cash_flows_dst_id;

In terms of a DBMS (RDBMS):

The 1st table with an index for the user_id field is the Index-Organized-Table (Oracle) or Table with Clustered Index (MS SQL).
The 2nd table is a table with two ordered indexes with one src_id field and one dst_id field, respectively.

In real tasks, the table can contain millions of records for clients and billions of records for cash flow; in this case, the indices for the user_id, src_id, dst_id fields speed up the search for them hundreds of thousands of times, so they are urgently needed.

Suppose from three users send requests to perform three tasks in three parallel streams:

1. move_money () - the stream transfers money from one client to another

a) takes away money from one client
b) add the same amount of money to another client
c) in the index on the field id-source adds a pointer to the cash transaction
d) in the index on the id-destination field adds a pointer to the same cash transaction (in real tasks this is not necessary, but for example we will do it)

2. show_total_amount () - show the amount of money of all customers

a) In the cycle we go through each client and summarize all the money.

3. show_user_money_on_time () - show the amount of client money with the specified user_id at the time point

a) incoming - summarize all the money that came to the client from the time and later (using the index on the id-source field)
b) outcoming - we summarize all the money that has left the client since time and later (using the index on the id-destination field)
c) user_money - get the current money from the client
d) user_ user_money - incoming + outcoming is the amount of customer money at the time.

We know that any of the threads can be interrupted at any time by the operating system, for example, to give the CPU-Core to another thread. The most dangerous thing is that it happens extremely rarely and you may never meet with it when debugging, but one day it will happen at the client, and if this results in data-races, then the money may simply disappear in the financial system.

Therefore, we will deliberately add wait functions that put the stream to sleep for a few milliseconds in the most critical places to see the errors immediately.

We will make our user_accounts, cash_flows_src_id, cash_flows_dst_id tables safe-safe using safe_ptr <>, but will the entire program become safe from this?
[12] coliru.stacked-crooked.com/a/5bbba59b63384a2b
Look at the "main lines" in the output of the program, labeled <<<:

Init table safe_user_accounts:
at time = 0 <<<
1 => John Smith, 100
2 => John Rambo, 150
- start transaction ... show_total_amount ()
1 => John Smith, 100
2 => John Rambo, 100
Result: all accounts total_amount = 200 <<<

- start transaction ... show_user_money_on_time ()
1 => John Smith, 150, at time = 0 <<<

Two errors are immediately visible:

Initially, the sum of all (two) users had 250 money, and the show_total_amount () function showed only 200, another 50 disappeared somewhere
At the time point, time = 0, user 1 had 100 money, and the show_user_money_on_time () function showed an incorrect result — user 1 had 150 money at time = 0

The problem is that atomicity is observed only at the level of individual tables and indexes, but not in the aggregate, therefore consistency is violated. The solution is to lock all used tables and indexes for the duration of all operations that must be performed atomically - this will preserve consistency.

Added lines are highlighted in yellow.

Correct example: [13] coliru.stacked-crooked.com/a/c8bfc7f5372dfd0c
Let's look at the "main lines" in the output of the program, labeled <<<:

Init table safe_user_accounts:
at time = 0 <<<
1 => John Smith, 100
2 => John Rambo, 150

Result: all accounts total_amount = 250 <<<

1 => John Smith, 100, at time = 0 <<<

Now everything is correct, the amount of money of all customers is 250, and the amount of money from customer 1 was 100 at time 0.

Those.we were able to atomically perform operations not with one object, but immediately with 3 objects, while maintaining consistency of data for any operations.

But there may be another problem. If you or another developer in some of the functions block container mutexes in a different order, then a situation called deadlock is possible - when 2 threads hang forever waiting for each other.

In the correct example above, we blocked mutexes in both the move_money () and show_user_money_on_time () functions in the same order:

lock1 (safe_user_accounts)
lock2 (safe_cash_flows_src_id)
lock3 (safe_cash_flows_dst_id)

Now let's see what happens if we block the mutexes in the containers in each function in a different order:
• move_money ()

lock2 (safe_cash_flows_src_id)
lock3 (safe_cash_flows_dst_id)
lock1 (safe_user_accounts)

• show_user_money_on_time ()

lock3 (safe_cash_flows_dst_id)
lock2 (safe_cash_flows_src_id)
lock1 (safe_user_accounts)

The move_money () function has locked lock2 and waits until lock3 is released to lock it. The show_user_money_on_time () function has locked lock3 and waits until lock2 is released to lock it. And they will wait for each other forever.
Example: [14] coliru.stacked-crooked.com/a/0ddcd1ebe2be410b

Conclusion:

Init table safe_user_accounts:
at time = 0 <<<
1 => John Smith, 100
2 => John Rambo, 150

- start transaction ... move_money ()
- start transaction ... show_total_amount ()

1 => John Smith, 100
2 => John Rambo, 150

Those.the move_money () and show_user_money_on_time () functions were not completed and stopped forever at deadlock.
Solutions are 4:

1. All developers in all functions always block mutexes in the same order and never make mistakes — this is a very unreliable assumption

2.You initially combine all the objects that will be used atomically into one structure, and use a safe pointer with the type of this structure struct all_t {std :: map <int, int> m1; std :: multimap <int, int> m2; ...}; safe_ptr <all_t> safe_all_obj; - but if you initially used these 2 containers only separately safe_ptr <map <int, int >> m1; safe_ptr <multimap <int, int >> m2; and have already written a lot of code, and then decided to combine them into one structure protected by one mutex, then you have to rewrite all the places where you use them, for example, instead of m2-> at (5); need safe_all_obj-> m2.at (5); To rewrite a lot of code is not very convenient.

3You can combine safe_ptr <> once together, so that they use the same recursive mutex, after which it will not matter in what order they are blocked, the consistency of these objects will always be preserved and never deadlock. To do this, you only need to add 1 line - it is very convenient. But it can reduce performance, because now locking one of the containers always results in locking all containers associated with it. You will get consistency even when it is not needed - at the cost of reduced performance. Example: [15] coliru.stacked-crooked.com/a/2a6f1535e0b95f7b

All changes in the code are just one line:

static link_safe_ptrs tmp_link (safe_user_accounts, safe_cash_flows_src_id, safe_cash_flows_dst_id);

Conclusion - the main lines are shown:

Init table safe_user_accounts:
at time = 0 <<<
1 => John Smith, 100
2 => John Rambo, 150

Result: all accounts total_amount = 250 <<<

1 => John Smith, 100, at time = 0 <<<

4. You can use locking for several mutexes of different types at once with setting timeout for locking each mutex. And if it is not possible to block at least one of the mutexes during this time, then all previously blocked mutexes will be unblocked, the thread will wait for some time and will try to block all mutexes in turn again. To do this, it is enough to add one line before each use of the containers lock_timed_any_infinity lock_all (safe_cash_flows_src_id, safe_cash_flows_dst_id, safe_user_accounts); and no matter in what order you block container mutexes. Example: [16] coliru.stacked-crooked.com/a/93206f216ba81dd6
Those. even if we block mutexes in different sequences:

Then when using lock_timed_any_infinity, deadlock does not occur: [17] coliru.stacked-crooked.com/a/7ac7640918228090
And in a similar example, when using lock_guard <>, deadlock occurs: [18] coliru.stacked-crooked.com/a/a281f90299771434

Thus, with the help of locks, we solved the problem of composability ( Composable ) guaranteed without permanent deadlock: https://en.wikipedia.org/wiki/Lock_(computer_science)#Lack_of_composability

Composable and Deadlocks

Sinceabove, for thread-safety, we used locks, then our algorithms are called lock-based.

And is everything really good with the absence of deadlocks in lock-free containers, algorithms based on Transactional Memory, and whether there are deadlocks in modern DBMS: MSSQL (lock-based IL) and Oracle (multi-version concurrency control).

The lock-free algorithms do not allow atomic change of several containers at once. RDBMS (RDBMS) have all the same problems with the deadlock, as in the lock-based algorithms, and often solve them in the same way via lock timeouts or lock graphs. And the new transaction-safe section in the C ++ standard does not allow using sophisticated algorithms such as std :: map <>.

Lock-free algorithms do not have the property Composable operations - the joint use of several lock-free algorithms. Those.several lock-free data structures cannot be changed or read atomically at once. For example, you can use lock-free containers of associative arrays from libCDS, and they will be individually thread safe. But if you want to atomically execute operations with several lock-free containers at once and maintain consistency, then you cannot do this, because their API does not provide functions of lock-free operations simultaneously on multiple containers. While you are changing or reading one container, another will be changed at this time. To avoid this, you will have to use locks, in this case, these will already be containers based on locks, which means that all the problems of lock-based algorithms, namely the possibility of deadlocks, will become common. In addition, locks are sometimes used when using just one container:

In transactional RDBMS such as MSSQL (lock-based) and Oracle (multi-version concurrency control), locks are also used, and that is why there are deadlocks problems that, for example, can be solved automatically by building a graph of locks and finding cyclic expectations, or through setting timeout lock select col from tbl where id in (....) for update wait 30; If the timeout has expired or deadlock is found in the blocking column, a rollback (rollback) of one of the transaction occurs - i.e. cancellation of all changes that have already been made by this transaction, unlocking of everything that was blocked, and then you can try to complete the transaction from the very beginning (and so many times):

Oracle Database Online Documentation 11g Release 1 (11.1) - Deadlocks: docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#i5337
Oracle Locks: docs.oracle.com/cd/B28359_01/server.111 /b28318/consist.htm#i5249
As you can see, exclusive incompatible locks are used by any expression: Insert / Update / Delete / Select-For-Update
MSSQL Detecting and Ending Deadlocks: https://technet.microsoft.com/en-us/library/ms178104(v=sql.105).aspx
MS SQL Locks: https://technet.microsoft.com/en-us/ library / ms186396 (v = sql.105) .aspx

In turn, Transactional Memory, unlike Lock-free containers, can atomically work with multiple containers / data. Those.Transactional Memory has the property Composable operations. Inside, either pessimistic locks are used (with a probability of a deadlock conflict) or optimistic locks (with a greater probability of conflicting modifications with competitive modifications). And in case of any conflicts, the transaction is automatically canceled and repeated from the very beginning, which entails multiple repetitions of all operations - and this incurs a large overhead. These overheads are trying to reduce by creating a Hardware Transactional Memory at the CPU level, but so far there are no implementations showing acceptable performance, although Intel has already added Hardware Transactional Memory in the CPU Haswell. The C ++ standard also promises to include Transactional Memory, but only in the future, so far only as an experimental one and without the support of working with std :: map. Those.so far everything is beautiful only in theory. But in the future it will most likely replace the synchronization methods we are used to

Total:

Lock-based algorithms are not compiled, if such a possibility is not built into when they are implemented, but this possibility can be implemented, and we have successfully implemented it in the previous sections.
Lock-free algorithms are not compiled and linking them without locks is an overly complex task, and with locks such an algorithm is no longer lock-free and there is a risk of eternal deadlocks.
RDBMS : MSSQL (lock-based IL) and Oracle (MVCC) - eternal deadlocks are possible, which are removed via lock graphs or timeouts
The transactional memory from the experimental part of the C ++ standard is so far limited to use only in the simplest algorithms and does not allow the use of algorithms such as in the methods of std :: map <> or more complicated.

Current conclusion: the deadlock problem exists in all types of algorithms and systems showing high performance, where several data structures are involved simultaneously, and we have proposed 2 options for its solution for safe_ptr <>

static link_safe_ptrs tmp_link (safe_user_accounts, safe_cash_flows_src_id, safe_cash_flows_dst_id); - use one mutex for multiple containers
lock_timed_any_infinity lock_all (safe_cash_flows_src_id, safe_cash_flows_dst_id, safe_user_accounts); - use lock timeouts, and at the end of time unlock everything and try to lock again

In the case when only 1 container and a recursive mutex are used for safe_ptr <>, then deadlock is impossible in safe_ptr <>, since deadlock requires at least 2 recursive mutexes (or 1 non-recursive).

Composable lock-based algorithms

In the general case, it is considered that lock-based programs are not composable (Composable), i.e. if you simply take 2 lock-based data structures and atomically change them in turn, you will not get a consistent state at any time.

But above we easily put together three lock-based containers, how did we manage it? There is a slight clarification to this effect - highlighted in bold:

Perhaps the most fundamental objection [...] is that lock-based programs do not compose: correct fragments may fail when combined. For example, consider a hash table with thread-safe insert and delete operations. Now suppose that we want to delete one item A from table t1, and insert it into table t2; but the intermediate state (in which neither table contains the item) must not be visible to other threads. Unless the implementor of the hash table anticipates this need, there is simply no way to satisfy this requirement. [...] In short, operations that are individually correct (insert, delete) cannot be composed into larger correct operations.

—Tim Harris et al., “Composable Memory Transactions”, Section 2: Background, pg.2 [6]
www.microsoft.com/en-us/research/wp-content/uploads/2005/01/2005-ppopp- composable.pdf
The fact is that the lock-based algorithms are not assembled, if such a possibility is not included in their implementation. Those.lock-based data structures are not compiled automatically, but they can be manually assembled, for example, like with the lock_timed_any_infinity class, if you have access to their mutexes for layout operations from the outside.

We implemented the lock-based template class safe_ptr <T> and in it for any type T we provided for the need to link and solve deadlocks using linking operations: link_safe_ptrs , lock_timed_any_infinity , lock_timed_any_once .

So why did we choose locks and their pessimistic option?

Locking is a standard mechanism for ensuring the security of operating systems and C ++.
With the help of locks, we can realize the composability and consistency of multiple data structures.
deadlock , , . conflicting modifications, , C++- .
Tom Kyte is a Senior Technical Architect in Oracle's Server Technology Division – Oracle DB (Multi-Version Concurrency Control): https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5771117722373
, Lockout Deadlock :
I am a huge fan of so called pessimistic locking . The user has very clearly announced their intent to UPDATE THE DATA. The lockouts they refer to are easily handled with session timeouts (trivial) and the deadlock is so rare and would definitely be an application bug (in both Oracle and RDB).
Deadlock – . , , lock_timed_any_infinity . , , : link_safe_ptrs std::lock_guard <>.
There is also no need for automatic escalation of locks. For example, Oracle DB never does this: docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#CIHCFHGE
Oracle Database never escalates locks. Lock escalation greatly increases the likelihood of deadlocks.

We will continue to periodically refer to the rich experience of industrial DBMS in the implementation of our algorithms.

An example of using safe_ptr <T>
from safe_ptr.h: github.com/AlexeyAB/object_threadsafe/tree/master/any_object_threadsafe

Conclusion:

We proved the correctness of Execute Around Pointer Idiom to automatically ensure secure access from different threads. Showed an example of its composable. And also showed the advantages of using pessimistic locks to ensure thread-safety.

Source: https://habr.com/ru/post/328348/

All Articles

Making any object thread safe

Start

Basic principles of multi-threaded data exchange

Execute Around Pointer Idiom

Making any object thread safe

Thread safe associative array

Flow safety, atomicity and consistency of several objects.

Composable and Deadlocks

Composable lock-based algorithms

Conclusion:

More articles: