
Shared pointers and multithreading, one more time

A chapter of the book "Modern C++ Design" might as well be titled "Smart pointers, for the hundred and first time". All well and good, but the book came out in 2001, so is it worth returning to this topic yet again? It seems to me that right now is exactly the time. Over these fifteen years the very point of view has changed, the angle from which we look at the problem. Back then the first de facto standard implementation, boost::shared_ptr<>, had only just been released; before that, everyone rolled their own implementation as needed and at least understood the details, the strengths and weaknesses, of their own code. Every C++ book of that era dutifully described some variation of a smart pointer in minute detail.


Now we have been given a standard, and that is good. On the other hand, you no longer need to understand what is inside; it is enough to repeat the mantra "use smart pointers wherever you would use ordinary pointers" three times, and that is not so good. I suspect not everyone realizes that the standard interface is only one of the possible designs, to say nothing of the differences between vendors' implementations. When the standard was chosen, a trade-off was made among various possibilities with various factors in mind; optimal or not, that choice is clearly not the only one possible.


Meanwhile, on Stack Overflow the same question is asked again and again: "Are smart pointers from the standard library thread-safe?" The answers are usually categorical, but some are not very informative. If I did not already know what was going on, I probably would not have understood them. And, by the way, the relatively new books describing the new C++ standard also give this issue little attention.


So let's try to lift the covers and get into the details.


Let's define the terminology right away. We are not talking about protecting the data the pointer refers to; that can be an object of arbitrary complexity, and multi-threaded access to it generally requires separate synchronization. By the thread-safety of a smart pointer we mean the validity of the pointer itself and of its internal reference counter. Figuratively speaking: if the pointers are created via std::make_shared<>(), then no assignments, passing to functions or to other threads, swaps, or destruction can leave them in an invalid state. As long as we do not call reset() or get(), we can expect the pointer to refer to some valid object, although not necessarily the one we had in mind.


One popular answer to the question in the title is: "It is only the control block itself which is thread-safe." So let's see what exactly is meant by the control block and by safe.
For the experiments I used g++ 5.4.0.


Let's start with an example


Suppose some information in shared memory is packed into a structure and accessed through a pointer. There are one or more independent threads that must read and use this data without modifying it; as a rule, access speed turns out to be critical for them. At the same time there are one or more threads modifying the data, temporarily breaking its integrity in the process; in practice modifications usually happen much less often, and access speed matters much less there. Yet if we stay within classic (exclusive lock) synchronization, we are forced to serialize read access even when no changes to the data occur at all. Naturally, this hits efficiency in the most fatal way, and this situation comes up so often, perhaps in slightly different guises, that I would venture to call it the central problem of multi-threaded programming.


Of course, there are standard solutions: boost::shared_mutex and its younger standard sibling std::shared_mutex, which allow two access levels, shared for reading and exclusive for writing. However, having heard that std::shared_ptr gives thread-safe access to the control block (and not really understanding what that means), and also that the operations on it are implemented lock-free, I want to offer my own elegant solution:


```cpp
// shared data
std::shared_ptr<SHARED_DATA> data;

reading_thread
{
    // take a local copy of the pointer
    auto read_copy = data;
    // ... and work with read_copy
}

writing_thread
{
    // prepare the updated data
    auto update = std::make_shared<SHARED_DATA>(...args);
    // (?) publish it
    data = update;
    // the old data block is destroyed either here
    // or in the last reader still holding a copy
}
```

Here we have to recreate the whole structure on every update, but in practice that is often a perfectly acceptable price.


So, will it work? Of course not! But why?


How it works


If you look at the structure of the shared pointer (/usr/include/c++/5/bits/shared_ptr_base.h:1175):


```cpp
template<typename _Tp, _Lock_policy _Lp>
class __shared_ptr
{
    ...
private:
    _Tp*                _M_ptr;       // Contained pointer.
    __shared_count<_Lp> _M_refcount;  // Reference counter.
};
```

it can be seen that it consists of two members: the pointer to the actual data and the control block proper. The trouble is that there is no way to assign, move, or otherwise change both members atomically and without locking. That is, shared pointers cannot be safe(.?) A period or a question mark? It seems to be a period, but a vague, inconclusive one; it all looks too trivial, too simple. We were told that "only access to the control block is safe", and we have not checked that yet.


Let's figure it out, dig deeper


```cpp
auto data = std::make_shared<int>(0);

void read_data()
{
    // reading thread: only copies the pointer
    for (;;) {
        auto read_copy = data;
    }
}

int main()
{
    std::thread(read_data).detach();
    // writing thread: keeps replacing the pointer
    for (;;)
        data = std::make_shared<int>(0);
    return 0;
}
```

This minimalist example implements the proposed idea and, on the whole, lives up to expectations: it crashes after winding through just a few hundred iterations. Note, however, that we never touch the data itself; the pointer is never dereferenced. So something is going wrong with the control block itself? At least now we have code we can attack with a debugger. But first, let's look at other possible variants:


```cpp
auto data = std::make_shared<int>(0);

void read_data()
{
    for (;;) {
        auto sp = std::atomic_load(&data);
    }
}

int main(int argc, char** argv)
{
    std::thread(read_data).detach();
    for (;;) {
        std::atomic_exchange(&data, std::make_shared<int>(0));
        assert(std::atomic_is_lock_free(&data));
    }
    return 0;
}
```

Here everything works fine, if it were not for the assert() in the body of the loop, which fires. That is, atomic operations on std::shared_ptr are defined, but they are blocking. Well, that is not our way; mutexes I can do myself. Another option:


```cpp
std::shared_ptr<int> variant[] = {
    std::make_shared<int>(0),
    std::make_shared<int>(0)
};
auto data = variant[0];

void read_data()
{
    for (;;) {
        auto sp = data;
    }
}

int main()
{
    std::thread(read_data).detach();
    for (size_t n = 0;; ++n)
        data = variant[n % 2];
    return 0;
}
```

Almost identical, yet it works perfectly, loading two cores at a full 100%. The difference is that here the reference count never drops to zero (the variant array always holds a reference), so the disposal path is never taken. So, are the destructors of standard pointers unsafe? I do not believe it. Let's go back to the original version and


Digging deeper


Let's look more closely at the reading thread:


```cpp
auto sp = data;
```

Here the copy constructor and the destructor are called over and over in a loop.


Here are the extracts from the source code:

```cpp
// shared_ptr_base.h:662 -- copy constructor
__shared_count(const __shared_count& __r) noexcept
: _M_pi(__r._M_pi)
{
    if (_M_pi != 0)
        _M_pi->_M_add_ref_copy();
}

// shared_ptr_base.h:134
void _M_add_ref_copy()
{ __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }

// shared_ptr_base.h:658 -- destructor
~__shared_count() noexcept
{
    if (_M_pi != nullptr)
        _M_pi->_M_release();
}

// shared_ptr_base.h:147
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
    _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
    _M_dispose();
}
```


which, with everything inessential stripped away, comes down to


```cpp
// copy ctor
__gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1);

// old instance dtor
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
    _M_dispose();
```

or, moving to pseudocode:


```cpp
++_M_pi->_M_use_count;               // copy ctor

if (--_M_pi->_M_use_count == 0)      // dtor
    dispose();
```

Here the increment and decrement operators are assumed to be atomic, and the dispose() function frees the memory and, in particular, invalidates the pointer to the reference counter _M_pi. It must be said that to anyone versed in multithreading, an expression like:


```cpp
if (--cnt == 0)
    do_something();
```


looks like a grenade with the pin pulled: between these two lines anything at all can happen, and inevitably will. The only thing such a construction reliably protects against is the very same call in another thread: no matter how many times the atomic decrement operator is invoked, only one of them will bring the counter to zero.


But what is happening meanwhile in the other, writing, thread?



Somewhere along these lines:

```cpp
// shared_ptr.h:291 -- move assignment
shared_ptr& operator=(shared_ptr&& __r) noexcept
{ this->__shared_ptr<_Tp>::operator=(std::move(__r)); }

// shared_ptr_base.h:997
__shared_ptr& operator=(__shared_ptr&& __r) noexcept
{ __shared_ptr(std::move(__r)).swap(*this); }

// shared_ptr_base.h:932 -- move constructor of the temporary
__shared_ptr(__shared_ptr&& __r) noexcept
: _M_ptr(__r._M_ptr), _M_refcount()
{ _M_refcount._M_swap(__r._M_refcount); }

// shared_ptr_base.h:684
void _M_swap(__shared_count& __r) noexcept
{
    _Sp_counted_base<_Lp>* __tmp = __r._M_pi;
    __r._M_pi = _M_pi;
    _M_pi = __tmp;
}

// shared_ptr_base.h:1073
void swap(__shared_ptr<_Tp, _Lp>& __other) noexcept
{
    std::swap(_M_ptr, __other._M_ptr);
    _M_refcount._M_swap(__other._M_refcount);
}

// shared_ptr_base.h:658 -- the temporary's destructor
~__shared_count() noexcept
{
    if (_M_pi != nullptr)
        _M_pi->_M_release();
}

// shared_ptr_base.h:142
void _M_release() noexcept
{
    // Be race-detector-friendly.  For more info see bits/c++config.
    _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
    if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
    {
        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
        _M_dispose();
    }
}

// shared_ptr.h:93 -- the moved-from pointer's destructor
// shared_ptr_base.h:658
~__shared_count() noexcept
{
    if (_M_pi != nullptr)   // _M_pi == nullptr is true here
        _M_pi->_M_release();
}
```

Dropping everything superfluous once more, we are left with roughly these fragments:


```cpp
_Sp_counted_base<_Lp>* __tmp = __r._M_pi;
__r._M_pi = _M_pi;
_M_pi = __tmp;

if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
    _M_dispose();
```

Again, the first three lines (the swap()) look extremely suspicious, but on careful analysis they turn out to be perfectly safe (in this context, of course), and all that remains is (in pseudocode):


```cpp
if (--_M_pi->_M_use_count == 0)
    dispose();
```

the very same expression that we just deemed safe. And here comes the moment of truth:

```cpp
// assert(_M_use_count == 1);

// writing thread                        // reading thread
if (--_M_pi->_M_use_count == 0)  // true
                                         ++_M_pi->_M_use_count;           // count = 1
                                         if (--_M_pi->_M_use_count == 0)  // true
    dispose(); // I'm first! BANG!!          dispose(); // no, me! BANG!!
```

This is how the combination of an atomic increment and an atomic decrement produces a race between threads, and std::shared_ptr<> turns out not to be thread-safe even at the level of the control block. Now that really is a period.


A little later I found an aphoristic summary of this principle in the documentation for boost::shared_ptr<>. It goes roughly like this: pointers are safe either for concurrent writes only or for concurrent reads only, and unsafe under concurrent reading and writing. Developing this idea a little, and realizing that pure writing with no reading of the data is practically meaningless, we see that standard smart pointers are safe for reading, that is, exactly as safe as ordinary constant pointers. In other words, all that sophisticated internal machinery exists merely to reach the level of ordinary pointers, and no further. Full stop.


Instead of a conclusion. From analysis to synthesis


I would like to finish on an upbeat note, that is, to propose, at least at the concept level, an algorithm that uses no blocking mutexes and allows safe operations on shared pointers from different threads. However, I failed; moreover, I came away convinced that on the basis of the existing elementary non-blocking primitives it is simply impossible. Of course, one could just write a spinlock-based version, but that would be unsportsmanlike; I do not consider such algorithms genuinely non-blocking. You can take any existing mutex-based multithreaded implementation and replace each mutex with a corresponding spinlock, that is, algorithmically reduce the task to choosing a more efficient kind of mutex. Obviously, that is not our way.


What are we missing for a full-fledged non-blocking implementation? We have only a very small set of non-blocking primitives, all of which work with built-in types alone:

- atomic increment and decrement (fetch-and-add);
- atomic exchange;
- atomic compare-and-swap (CAS).

Given the effective ban on a plain if() check in non-blocking algorithms, only the last of these can play the role of a branch operator, but its most serious limitation is that it checks and assigns one and the same variable. In general, it is easy to see what all three primitives have in common: each works atomically with one and only one memory location the size of a machine word; the reasons, I think, are obvious. Looking closely at the generalized structure of a shared pointer, we see that it must contain at least a pointer to the shared data (including the control block), and somewhere in that block there must be a reference counter. For any operation on the pointer itself, assignment for example, we must atomically check the counter and simultaneously change the pointer to the control block, which is impossible with the existing atomic primitives. It follows that a full-fledged non-blocking shared pointer is impossible in principle.


In fact, I would be glad to be wrong here; maybe there is something I do not know or have misunderstood? Anyone who wants to dispute this is welcome in the comments.



Source: https://habr.com/ru/post/311560/

