Thread safe std :: map with lock-free map performance

Case studies and testing of a stream-safe pointer and contention-free shared-mutex

In this article, we show: additional optimizations, examples of use and testing of a thread-safe pointer developed by us with an optimized shared mutex. Contfree_safe_ptr <T> is equivalent to safe_ptr <T, contention_free_shared_mutex <>>
At the end we will show comparative graphs of tests of our thread-safe pointer and one of the best lock-free algorithms from libCDS on Intel Core i5 / i7, Xeon, 2 x Xeon processors.

Three related articles:

Making any object thread safe
Accelerate std :: shared_mutex 10 times
Thread safe std :: map with lock-free map performance

→ My article in English
→ Examples and tests from all three articles

You can find the libCDS library with which we will compare our solution at the link: github.com/khizmax/libcds
All tests in this article use this commit from libCDS: github.com/khizmax/libcds/tree/5e2a9854b0a8625183818eb8ad91e5511fed5898
')

Different granularity of locks

First, we will show how to optimally use shared-mutex, using the example of working with a table (an array of structures). Referring to the experience of industrial databases. For example, in the MSSQL DBMS, locks of different granularity are used — a lock: one or several rows, a page, a extent, one section of a table, an index, the entire table, the entire database. https://technet.microsoft.com/en-us/library/ms189849 (v=sql.105).aspx

Indeed, if we work with one row for a long time and it is important for us that at this time the row is not changed by another thread, then there is no need to block the entire table all this time - it is enough to block only 1 row.

Lock the entire table with shared locking (shared)
We are looking for the desired line or several lines.
Then we block the found string.
Unlock the table
And working with a locked string

Then other threads will be able to work in parallel with the other lines.
So far, we have only used table level locking, i.e. blocked one or more tables.

Or all the tables used in the expression were automatically locked until it was fully completed.

(*safe_map_1)["apple"].first = "fruit"; // locked Table-1 (safe_map_1) // unlocked Table-1 safe_map_1->at("apple").second = safe_map_1->at("apple").second * 2; // locked Table-1 (safe_map_1) // unlocked Table-1 safe_map_1->at("apple").second = safe_map_2->at("apple").second*2; // locked Table-1(safe_map_1) and Table-2(safe_map_2) // unlocked Table-1 and Table-2

In other cases, we manually locked one or more tables using the RAII objects of the lock until the end of the scope of these locks (until they are destroyed):

 { std::lock_guard<decltype(safe_map_1)> lock(safe_map_1); // locked Table-1 (safe_map_1) (*safe_map_1)["apple"].first = "fruit"; safe_map_1->at("apple").second = safe_map_1->at("apple").second * 2; // unlocked Table-1 } { lock_timed_any_infinity lock_all(safe_map_1, safe_map_2); // locked Table-1(safe_map_1) and Table-2(safe_map_2) safe_map_1->at("apple").second = safe_map_2->at("apple").second*2; //locked Table-1(safe_map_1) and Table-2(safe_map_2) safe_map_1->at("potato").second = safe_map_2->at("potato").second*2; //locked Table-1(safe_map_1) and Table-2(safe_map_2) // unlocked Table-1 and Table-2 }

Let's look at an example in which we randomly select an index to insert, then randomly one of four operations (insert, delete, update, read) and perform it on a thread-safe object of type contfree_safe_ptr <std :: map> .

Example: [37] coliru.stacked-crooked.com/a/5b68b65bf2c5abeb

In this case, we will impose the following locks on the table:

Insert - eXclusive lock
Delete - eXclusive lock
Update - eXclusive lock
Read - Shared lock

For Update or Read operations, we do:

Lock the entire table (xlock for Update, slock for Read)
We search for the necessary line, we read or we change it
Unlock the table

One iteration code of our example is 1:

  int const rnd_index = index_distribution(generator); // 0 - container_size int const num_op = operation_distribution(generator);// insert_op, update_op, delete_op, read_op switch (num_op) { case insert_op: { safe_map->emplace(rnd_index, (field_t(rnd_index, rnd_index))); // insert with X-lock on Table break; } case delete_op: { size_t erased_elements = safe_map->erase(rnd_index); // erase with X-lock on Table } break; case update_op: { auto x_safe_map = xlock_safe_ptr(safe_map); // X-lock on Table auto it = x_safe_map->find(rnd_index); if (it != x_safe_map->cend()) { it->second.money += rnd_index; // still X-lock on Table (must necessarily be) } } break; case read_op: { auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) { volatile int money = it->second.money; // still S-lock on Table (must necessarily be) // volatile here only to avoid optimization for unused money-variable } } break; default: std::cout << "\n wrong way! \n"; break; }

Now we will make it so that during the Update operation the table is blocked by the read lock (shared), instead of the change lock (exclusive). This will greatly speed up Update operations when using our “write contention free shared mutex” that we developed earlier.
In this case, multiple threads can simultaneously perform Update and Read operations on a single table. For example, one thread reads one line, and another thread changes another line. But if one thread tries to change the same line that another thread is reading, then in order to avoid Data-races we have to block the line itself when reading it and changing it.

Example: [38] coliru.stacked-crooked.com/a/89f5ebd2d96296d3

Now for Update or Read operations we do:

Lock the entire table with shared locking (shared)
We are looking for the desired line or several lines.
Then we block the found string (xlock for Update, slock for Read)
And we work with a locked row (X / S-lock) and a locked table (S-lock)
Unlock the string
Unlock the table

Diff - what we changed:

One iteration code for our example is 2:

  int const rnd_index = index_distribution(generator); // 0 - container_size int const num_op = operation_distribution(generator);// insert_op, update_op, delete_op, read_op switch (num_op) { case insert_op: { safe_map->emplace(rnd_index, (field_t(rnd_index, rnd_index))); // insert with X-lock on Table break; } case delete_op: { size_t erased_elements = safe_map->erase(rnd_index); // erase with X-lock on Table } break; case update_op: { auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) { auto x_field = xlock_safe_ptr(it->second); x_field->money += rnd_index; // X-lock on field, still S-lock on Table (must necessarily be) } } break; case read_op: { auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) { auto s_field = slock_safe_ptr(it->second); volatile int money = s_field->money; // S-lock on field, still S-lock on Table (must necessarily be) // volatile here only to avoid optimization for unused money-variable } } break; default: std::cout << "\n wrong way! \n"; break; }

Here, for thread-safe string handling, we used safe_obj. Inside safe_obj <T> there is an object of type T, and not a pointer to it, as in safe_ptr <T>. Therefore, when using safe_obj, you do not need to separately allocate memory for the object itself and change the atomic reference counter, as is required in safe_ptr. Therefore, the Insert and Delete operations are performed much faster with safe_obj than with safe_ptr.

It should be noted that when copying safe_obj, not the pointer to the object is copied, but the object itself is copied, while previously blocking the initial and final safe_obj.

Note: Strictly speaking, we do not block the entire line, but all fields of the line, except for the index of the line we are looking for. Therefore, we called our object field, not row. And also, to emphasize that in this way we can block not only one line, but even individual fields in one line, if we place them in separate safe_obj objects. This would allow different threads to block different fields and work with them in parallel.

Now we use the following locks for different operations:

This example is very fast for a large number of short time operations. But we still hold the read lock (shared) on the table in the process of changing or reading a row (field). And if we have rare but very long operations on the rows of the table, then all of this will hold the lock on the entire table for a long time.

However, if according to the logic of your task it doesn’t matter that a row can be deleted by one thread, while another thread is reading or modifying the same row, then we only need to block the table for the duration of the search row. And to avoid accessing the freed memory when another thread has deleted a string, we need to use std :: shared_ptr <T> - a pointer with an atomic thread-safe reference counter. In this case, the memory will be released only when no thread has pointers to this string.

Instead of safe_obj, we will use safe_ptr to protect the string. This will allow us to copy the pointer to the string and use the thread-safe reference counter in std :: shared_ptr <T> which is contained in safe_ptr.

Example: [39] coliru.stacked-crooked.com/a/f2a051abfbfd2716

Now for Update or Read operations we do:

Lock the entire table with shared locking (shared)
We are looking for the desired line or several lines.
Then we block the found string (xlock for Update, slock for Read)
Unlock the table
And working with a locked string (X / S-lock) as long as it takes.
Unlock the string

Diff - what we changed:

Example-3:

  int const rnd_index = index_distribution(generator); // 0 - container_size int const num_op = operation_distribution(generator);// insert_op, update_op, delete_op, read_op safe_ptr_field_t safe_nullptr; switch (num_op) { case insert_op: { safe_map->emplace(rnd_index, (field_t(rnd_index, rnd_index))); // insert with X-lock on Table break; } case delete_op: { size_t erased_elements = safe_map->erase(rnd_index); // erase with X-lock on Table } break; case update_op: { auto pair_result = [&]() { auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) return std::make_pair(it->second, true); // found else return std::make_pair(safe_nullptr, false); // null-value }(); // unlock Table if(pair_result.second) { auto x_field = xlock_safe_ptr(pair_result.first); // X-lock on field x_field->money += rnd_index; // if a long time is read } // unlock field } break; case read_op: { auto pair_result = [&]() { auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) return std::make_pair(it->second, true); // found else return std::make_pair(safe_nullptr, false); // null-value }(); // unlock Table if(pair_result.second) { auto s_field = slock_safe_ptr(pair_result.first); // S-lock on field volatile int money = s_field->money; // if a long time is read // volatile here only to avoid optimization for unused money-variable } // unlock field } break; default: std::cout << "\n wrong way! \n"; break; }

Well-designed multi-threaded programs use short references to shared objects, so in the future we will use not the last but the penultimate example for short read operations.

Disadvantages of Execute Around Idiom

Let's look at possible problems and criticize our code. In the previous chapter, we looked at a fairly convenient and high-performance example, explicitly setting the type of blocking for Update and Read operations, using functions:

slock_safe_ptr () - read only
xlock_safe_ptr () - for reading and editing

Here, the lock is held until the end of the life of the lock_obj object returned by these functions: auto lock_obj = slock_safe_ptr (sf_p);
However, for the Insert and Delete operations, implicit locks were used, i.e. our safe_ptr <std :: map> object was blocked automatically using the Execute Around Pointer idiom, and unlocked immediately after the end of the Insert or Delete operation.

Example: [40] coliru.stacked-crooked.com/a/89f5ebd2d96296d3

But you can forget to use explicit locks on Update and Read operations. In this case, safe_ptr <std :: map> will be unlocked immediately after the search operation ends, and then you continue to use:

found iterator that may be invalidated by another thread
or found item that can be deleted by another thread

To partially solve this problem, instead of safe_ptr <> and safe_obj <>, you can use safe_hide_ptr <> and safe_hide_obj <> - they do not use Execute Around Pointer and you can only access members of the class after explicit blocking:

  safe_hide_obj<field_t, spinlock_t> field_hide_tmp; //safe_obj<field_t, spinlock_t> &field_tmp = field_hide_tmp; // conversion denied - compile-time error //field_hide_tmp->money = 10; // access denied - compile-time error auto x_field = xlock_safe_ptr(field_hide_tmp); // locked until x_field is alive x_field->money = 10; // access granted

Example: [41] coliru.stacked-crooked.com/a/d65deacecfe2636b

If earlier you could be mistaken and write the following - the erroneous code:

  auto it = safe_map->find(rnd_index); // X-lock, find(), X-unlock if (it != s_safe_map->cend()) volatile int money = it->second ->money; // X-lock, operator=(), X-unlock

Now, such a call will not compile and will require explicit blocking of objects — the correct code:

  auto s_safe_map = slock_safe_ptr(safe_map); // S-lock on Table auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) { auto s_field = slock_safe_ptr(it->second); volatile int money = s_field->money; // S-lock on field, still S-lock on Table // volatile here only to avoid optimization for unused money-variable } // S-unlock Table, S-unlock field

However, you still have the danger of using locks as temporary objects — not correctly :

  auto it = slock_safe_ptr(safe_map)->find(rnd_index); // S-lock, find(), S-unlock if (it != s_safe_map->cend()) { volatile int money = slock_safe_ptr(it->second)->money; // S-lock, =, S-unlock }

You have a choice:

Use safe_ptr and safe_obj to be able to explicitly or automatically (Execute Around Idiom) block your object
Or use safe_hide_ptr and safe_hide_obj leaving only the ability to explicitly block your object.

It is up to you what to choose:

use convenient automatic locking (Execute Around Idiom)
or slightly reduce the possibility of making a mistake by explicitly blocking

In addition, the following functions are planned to be added to the standard C ++ 17 for std :: map <> :

insert_or_assign () - if there is an element then assign, if not, then insert
try_emplace () - if there is no element, then create an element
merge () - merge 2 maps to 1
extract () - get the element, and remove it from the map

The introduction of such functions allows you to perform frequently used composite operations without using iterators - in this case, using Execute Around Idiom will always guarantee the flow-safety of these operations. In general, avoiding iterators for all containers (except for std :: array and std :: vector arrays) is a big help in building multithreaded programs. The less often you use iterators, the less likely you are to access an iterator that is invalid by this or another thread. But the very idea of iterators does not contradict the idea of multithreading, for example, the DBMS (Oracle, MSSQL) supports cursors (analogs of iterators) and different isolation levels of translations (different consistency of multithreading).

But whatever you choose: use Execute Around Idiom, use explicit locks for safe_hide_ptr , discard them and use standard std :: mutex locks ... or even write your own lock-free algorithms - you always have many opportunities to make a mistake.

We section the table - still we increase productivity

Let's return again to the experience of industrial relational DBMS. For example, in a DBMS, table partitioning may be used to improve performance. In this case, instead of the entire table, you can block only the used partition (partition): https://technet.microsoft.com/en-us/library/ms187504(v=sql.105).aspx

Although the DBMS does not normally lock the entire table for the Delete and Insert operations, and this is always true for Delete operations. But for Insert operations, it is possible to perform very fast data loading into the table, the necessary condition of which is the exclusive table blocking:

MS SQL (dbcc traceon (610, -1)): INSERT INTO sales_hist WITH ( TABLOCKX )
Oracle: INSERT / * + APPEND * / INTO sales_hist
https://docs.oracle.com/cd/E18283_01/server.112/e17120/tables004.htm#i1009887
Locking Considerations with Direct-Path INSERT
During the INSERT, the database of the partitioned table locks. If you’re not in use, you can’t make it. But it is supported, but it is supported by the request before the insert operation.

Since our task is to create the fastest multithreaded container, then we also blocked the container entirely on Insert / Delete operations. But now let's try blocking only part of our container.

Let's try to implement our own partitioned ordered partitioned_map associative array and see how the performance increases. We will block in it only the section currently required.

The meaning is: std :: map <safe_ptr <std :: map <>>>
Here the first std :: map will be constant and will contain sections (sub-tables).
This will be a very simplified example where the number of sections is set in the constructor and does not change further.

Each section is a stream-safe associative array safe_ptr <std :: map <>> .

For maximum performance, the number of sections and their ranges must be such that the probability of accessing each section is the same. If you have a key range of 0 - 1000000, and the read / update / insert / delete requests to the beginning of the range are greater than to the end of the range, then the sections with a small key value should be larger and their ranges smaller. For example, 3 sections: [0 - 100000], [100001 - 400000], [400001 - 1000000].

But in our examples we will assume that query keys have a uniform distribution.
Section ranges can be set in two ways:

safe_map_partitioned_t <std :: string, int> safe_map {"a", "f", "k", "p", "u"};

safe_map_partitioned_t <int, int> (0, 100000, 10000);
// set the boundaries of values 0 - 100000 and step for each section 10000

If, when accessing the container, the key goes beyond the section boundaries, then the request will address the nearest section — The program will work correctly.

Example: [42] coliru.stacked-crooked.com/a/fc148b08eb4b0580

Also, for maximum performance, it is necessary to use the previously implemented "contention-free shared-mutex" inside safe_ptr <> , i.e. the meaning is: std :: map <contfree_safe_ptr <std :: map <>>>
Take the code from the previous example and add the code contfree_safe_ptr from the previous chapter.
Replace: safe_map_partitioned_t <std :: string, std :: pair <std :: string, int >>
To: safe_map_partitioned_t <std :: string, std :: pair <std :: string, int>, contfree_safe_ptr>
Example: [43] coliru.stacked-crooked.com/a/23a1f7a3982063a1
This class safe_map_partitioned_t <> we did “Just for fun”, i.e. It is not recommended to use it in real programs. We just showed an example of how you can implement your own algorithms based on the contfree_safe_ptr <> pointer and contention_free_shared_mutex <> blocking.

How to use

First download the safe_ptr.h file from the repository root: github.com/AlexeyAB/object_threadsafe
Then include this file in your cpp-file: #include "safe_ptr.h"
As an optimal use case, we will focus on Example 2 shown above - this is simple and highly productive:

 struct field_t { int money, time; field_t(int m, int t) : money(m), time(t) {} field_t() : money(0), time(0) {} }; typedef safe_obj<field_t, spinlock_t> safe_obj_field_t; // thread-safe ordered std::map by using execute around pointer idiom with contention-free shared-lock contfree_safe_ptr< std::map<int, safe_obj_field_t > > safe_map_contfree; bool success_op; switch (num_op) { case insert_op: slock_safe_ptr(test_map)->find(rnd_index); // find for pre-cache to L1 with temprorary S-lock test_map->emplace(rnd_index, field_t(rnd_index, rnd_index)); break; case delete_op: slock_safe_ptr(test_map)->find(rnd_index); // find for pre-cache to L1 with temprorary S-lock success_op = test_map->erase(rnd_index); break; case read_op: { auto s_safe_map = slock_safe_ptr(test_map); // S-lock on Table (must necessarily be) auto it = s_safe_map->find(rnd_index); if (it != s_safe_map->cend()) { auto x_field = xlock_safe_ptr(it->second); volatile int money = x_field->money; // get value x_field->money += 10; // update value } } break; default: std::cout << "\n wrong way! \n"; break; }

Insert and delete - change the map : Because our slock_safe_ptr () shared lock is as fast as possible, then before the change (insert_op or delete_op) we find the element we need to delete or closest to the one we need to insert using slock_safe_ptr (), which will be unlocked immediately after the find () operation. , -L1 map. test_map->emplace() test_map->erase() exclusive-lock . Insert/delete — , , — shared-lock exclusive-lock ( deadlock — ).

Read and update — ( map) : (read_op) — map ( map). slock_safe_ptr() map, map. s_safe_map, xlock_safe_ptr() , . , {}, x_field , s_safe_map map.

test_map — — .

, auto const& some_ptr = test_map; , , some_ptr->find(); (shared lock) , ~~some_ptr->emplace();~~ . .

slock_safe_ptr(test_map) , slock_safe_ptr(test_map)->find(); , slock_safe_ptr(test_map)->emplace(); — . .
.

safe_ptr

. .

– (MOps), 0 – 90%. : insert, delete, update ( Update std::map, ). , 15% – : 5% insert, 5% delete, 5% update, 85% read. : g++ 4.9.2 (Linux Ubuntu) x86_64

Linux (GCC 4.9.2) Windows (MSVS2015): github.com/AlexeyAB/object_threadsafe/tree/master/benchmark
Clang++ 3.8.0 Linux Makefile.

16 Intel Xeon E5-2660 v3 2.6 GHz. : safe<map,contfree>&rowlock .

CPU, :

numactl --localalloc --cpunodebind=0 ./benchmark 16

CPU, : ./benchmark

Conclusion:

, contfree- contfree_safe_ptr<map> , std::shared_mutex safe_ptr<map,std::shared_mutex>
, 15% std::mutex , std::shared_mutex ( safe_ptr<map,std::mutex> , safe_ptr<map,std::shared_mutex>)
, 30% std:map – 1 thread contfree_safe_ptr<map>&rowlock. . - , .
safe_map_partitioned<,,contfree_safe_ptr>, , «Just-for-fun» — .
15% , shared-mutex ( contfree_safe_ptr<map> & rowlock) 8.37 Mops, 10 , std::shared_mutex ( safe_ptr<map, std::shared_mutex>), 0.74 Mops.

«contention-free shared-mutex» : 70 contention-free, exclusive-lock. , «exclusive-lock» std::mutex , std::shared_mutex 15% .

, .. .

main.cpp : const bool measure_latency = true;
: numactl --localalloc --cpunodebind=0 ./benchmark 16

, std::map & std::mutex safe_ptr<map,std::mutex>, .. safe_ptr<> - , .
, 60% , contfree_safe_ptr<map>&rowlock , contfree_safe_ptr<map>. , contfree_safe_ptr<map>&rowlock .

contfree_safe_ptr<std::map> CDS-lib desktop-CPU

desktop-CPU, .

, contfree_safe_ptr<map> , lock-free-map CDS-lib.
«Just-for-fun» safe_map_partitioned<,,contfree_safe_ptr> 1.7 .

contfree_safe_ptr<std::map> CDS-lib server-CPU

CPU, .

Linux (GCC 4.9.2) Windows (MSVS2015): github.com/AlexeyAB/object_threadsafe/tree/master/CDS_test
Clang++ 3.8.0 Linux Makefile.

Xeon , , :

numactl --localalloc --cpunodebind=0 ./benchmark 16

CPU, : ./benchmark

, 16 , , lock-free CDS-lib , contfree_safe_ptr<map>. Those. lock-free . 16 , 8 , contfree_safe_ptr<map> , lock-free .
«Just-for-fun»- safe_map_partitioned<,,contfree_safe_ptr> , , .

– 50% .

main.cpp : const bool measure_latency = true;
: numactl --localalloc --cpunodebind=0 ./benchmark 16

contfree_safe_ptr<map> lock-free 8 , 16 .

.

. .

for(volatile int i = 0; i < 9000; ++i); - . 100 000 . Intel Xeon E5-2686 v4 2.3 GHz 20.5 .

Linux (GCC 4.9.2) Windows (MSVS2015) : github.com/AlexeyAB/object_threadsafe/tree/master/Real_app_bench
Clang++ 3.8.0 Linux Makefile.

2- : 2 x Intel Xeon E5-2686 v4 2.3 GHz (Broadwell) : 36 72 (Hyper Threading).

:

cd libcds
make
cd…
make
./benchmark

std::mutex std::shared_mutex std::map lock-free-map 16 . std::mutex&std::map std::shared_mutex&std::map , 32 .
- contfree_safe_ptr< std::map<> > lock-free-map libCDS 1 64. , 20 .

— main.cpp : const bool measure_latency = true;

Linux : ./benchmark

, 64 std::mutex , std::shared_mutex.
- contfree_safe_ptr<std::map<>> lock-free-map libCDS 1 64. , 20 .
64 30 , 20 — , 10 — . , 30% , contfree_safe_ptr<T> (MOPS -), lock-free libCDS.

lock-free wait-free (queue, stack …) libCDS , .

Conclusion:

, .. 8 , CDSlib: github.com/khizmax/libcds
thread-safe CDSlib – .
- , 30% , contfree_safe_ptr<> , map- CDSlib.
, CDSlib, contfree_safe_ptr<T> . , lock-free .

C++ Execute Around Pointer Idiom. Contention-free shared-mutex. lock-free libCDS lock-based .

, safe_ptr.h , : github.com/AlexeyAB/object_threadsafe

boost?

Source: https://habr.com/ru/post/328374/

All Articles