On entering the read-side critical section, the reader calls rcu_read_lock(), on exit - rcu_read_unlock(). These are very lightweight functions with virtually no effect on performance; in the Linux kernel they cost nothing at all (zero overhead). Having excluded an element from the container, the writer calls synchronize_rcu(), which opens a grace-period and waits for it to end. A deleted item can be accessed only by those readers that declared their read-side critical section concurrently with the writer (in the figure such sections are highlighted in gray). By definition, all such readers will finish their work before the end of the grace-period. At the end of the grace-period, that is, when all read-side critical sections initiated or active during the grace-period have completed, the second stage of deletion begins - reclamation - the physical freeing of the element's memory.
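As an illustration, here is a minimal sketch of this two-phase deletion pattern, written against the user-space liburcu library, whose API mirrors the kernel primitives named above. The singly linked list, contains, and pop_front are illustrative, and a single writer is assumed:

    // Sketch of the two-phase RCU deletion pattern using liburcu.
    // Every participating thread must first call rcu_register_thread().
    #include <urcu.h>     // rcu_read_lock, rcu_read_unlock, synchronize_rcu
    #include <atomic>

    struct node {
        int                 nKey;
        std::atomic<node*>  pNext;
    };

    std::atomic<node*> g_pHead{ nullptr };

    // Reader: traverses the list inside a read-side critical section
    bool contains( int key )
    {
        rcu_read_lock();                       // enter read-side critical section
        bool bFound = false;
        for ( node* p = g_pHead.load( std::memory_order_acquire );
              p != nullptr;
              p = p->pNext.load( std::memory_order_acquire ))
        {
            if ( p->nKey == key ) { bFound = true; break; }
        }
        rcu_read_unlock();                     // exit read-side critical section
        return bFound;
    }

    // Writer: stage 1 - unlink the element; stage 2 - reclaim after the grace-period
    void pop_front()
    {
        node* p = g_pHead.load( std::memory_order_relaxed );
        if ( p ) {
            g_pHead.store( p->pNext.load( std::memory_order_relaxed ),
                           std::memory_order_release );   // removal
            synchronize_rcu();                 // wait for the grace-period to end
            delete p;                          // reclamation
        }
    }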
The user-space URCU implementation is built around the following data:

    std::atomic<uint32_t>  g_nGlobalCtl(1);

    struct thread_record {
        std::atomic<uint32_t>  nThreadCtl;
        thread_record *        pNext;

        thread_record(): nThreadCtl(0), pNext(nullptr) {}
    };

The thread_record structure contains the thread-local data and links all such objects into a list of RCU threads. The low 31 bits of nThreadCtl hold the nesting-depth counter of URCU calls (yes, URCU allows almost unlimited nesting of read-side critical sections), while the high bit holds the identifier of the grace-period at the moment the thread enters the read-side critical section. In the scheme described, two grace-period identifiers are sufficient. The global variable g_nGlobalCtl holds the identifier of the current grace-period in its high bit; its low-order bits serve only to initialize the per-thread nThreadCtl variables and never change. The functions access_lock and access_unlock are used to enter and exit the read-side critical section, respectively:

    static uint32_t const c_nControlBit = 0x80000000;
    static uint32_t const c_nNestMask   = c_nControlBit - 1;

    void access_lock()
    {
        thread_record * pRec = get_thread_record();
        assert( pRec != nullptr );

        uint32_t tmp = pRec->nThreadCtl.load( std::memory_order_relaxed );
        if ( (tmp & c_nNestMask) == 0 ) {
            // outermost entry: record the current grace-period identifier
            pRec->nThreadCtl.store( g_nGlobalCtl.load( std::memory_order_relaxed ),
                                    std::memory_order_relaxed );
            std::atomic_thread_fence( std::memory_order_acquire );
        }
        else {
            // nested entry: just increment the nesting counter
            pRec->nThreadCtl.fetch_add( 1, std::memory_order_relaxed );
        }
    }

    void access_unlock()
    {
        thread_record * pRec = get_thread_record();
        assert( pRec != nullptr );

        pRec->nThreadCtl.fetch_sub( 1, std::memory_order_release );
    }
On the first, outermost entry to the critical section, the current thread's nThreadCtl is assigned the value of the global variable g_nGlobalCtl: the thread thereby marks that it entered the critical section in a particular grace-period (the high bit of g_nGlobalCtl), while the 1 in the low bits of g_nGlobalCtl initializes the thread's nesting counter. On this outermost entry an acquire memory barrier is applied. It guarantees that the code following the barrier will not be moved ("optimized") above it by either the processor or the compiler, which ensures that the thread's current grace-period is visible to all processors; if this order were violated, the URCU algorithm would fall apart. On entering a nested critical section no barrier is required, since the current grace-period (the high bit) does not change.

On exit from the critical section (access_unlock), the nesting counter in the current thread's nThreadCtl is simply decremented, with release semantics on the atomic operation. Strictly speaking, the release barrier is needed only when leaving the outermost critical section (when the nesting counter goes from 1 to 0); when leaving a nested section, relaxed semantics would suffice. The release barrier is required when the counter reaches zero because the transition from 1 to 0 is, in effect, the declaration "this thread no longer uses RCU", that is, the exit from the grace-period, which is critical for the URCU algorithm: reordering by the compiler or the processor would render the algorithm inoperable. Distinguishing the "0 / not 0" cases in the code would require a conditional branch, which would hardly improve the performance of access_unlock, and the basic usage pattern of URCU critical sections is without nesting, so release semantics is used here unconditionally.
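A short worked example of the nThreadCtl transitions (the concrete values assume g_nGlobalCtl == 0x00000001, that is, grace-period bit 0, and the constants defined above):

    access_lock();    // outermost: nThreadCtl = g_nGlobalCtl = 0x00000001
                      //            (grace-period bit 0, nesting depth 1)
    access_lock();    // nested:    nThreadCtl = 0x00000002 (depth 2, bit unchanged)
    access_unlock();  //            nThreadCtl = 0x00000001 (back to depth 1)
    access_unlock();  //            nThreadCtl = 0x00000000 - the thread left URCU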
How does the writer determine that a grace-period has ended? For that, each reader thread must satisfy one of two conditions: either the low 31 bits (the nesting counter) of the thread's nThreadCtl are zero, which means the thread is not inside a URCU critical section at all, or the high bit of nThreadCtl matches the high bit of g_nGlobalCtl, which means the reader entered its critical section after the beginning of the current grace-period. The check_grace_period predicate returns true while the writer still has to wait for the given thread:

    bool check_grace_period( thread_record * pRec )
    {
        uint32_t const v = pRec->nThreadCtl.load( std::memory_order_relaxed );
        return ( v & c_nNestMask )   // the thread is inside a critical section...
            && (( v ^ g_nGlobalCtl.load( std::memory_order_relaxed )) & ~c_nNestMask );
                                     // ...that it entered in the other grace-period
    }
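A worked check, assuming the writer has just flipped the grace-period bit so that g_nGlobalCtl == 0x00000001 (the values are illustrative):

    // reader entered in grace-period 1, depth 1: high bits differ -> wait
    //   v = 0x80000001: (v & c_nNestMask) != 0, ((v ^ 0x00000001) & ~c_nNestMask) != 0
    // reader entered after the flip, depth 1: high bits match -> no wait
    //   v = 0x00000001: ((v ^ 0x00000001) & ~c_nNestMask) == 0
    // reader outside any critical section: nesting counter is zero -> no wait
    //   v = 0x80000000: (v & c_nNestMask) == 0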
All of this is used by the writer's synchronize function, which waits for the end of the current grace-period:

    std::mutex  g_Mutex;

    void synchronize()
    {
        std::atomic_thread_fence( std::memory_order_acquire );
        {
            cds::lock::scoped_lock<std::mutex>  sl( g_Mutex );
            flip_and_wait();
            flip_and_wait();
        }
        std::atomic_thread_fence( std::memory_order_release );
    }

Here g_Mutex is the global mutex of the URCU algorithm (yes, yes! URCU is still a synchronization technique, so there is no getting away from a mutex). Thus only one writer thread at a time can be inside synchronize. Do not forget that RCU is positioned for "almost read-only" data, so no particular contention on this mutex is expected. All the work is done by the flip_and_wait function:

    void flip_and_wait()
    {
        g_nGlobalCtl.fetch_xor( c_nControlBit, std::memory_order_seq_cst );

        for ( thread_record * pRec = g_ThreadList.head( std::memory_order_acquire );
              pRec != nullptr;
              pRec = pRec->pNext )
        {
            while ( check_grace_period( pRec )) {
                sleep( 10 );   // wait 10 milliseconds
                CDS_COMPILER_RW_BARRIER;
            }
        }
    }
This function flips the grace-period identifier with fetch_xor, thereby starting a new grace-period, and then waits (by polling check_grace_period) until all reader threads have left the previous grace-period. In this pseudocode the waiting is a simple sleep of 10 milliseconds; in the real libcds code a template parameter defines the back-off strategy.
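As an illustration, a back-off functor for such a template parameter might look like this (a sketch only; this is not the actual libcds back-off interface):

    // Illustrative back-off strategy: yield first, then sleep.
    #include <chrono>
    #include <thread>

    struct backoff_yield_then_sleep {
        unsigned nPass = 0;
        void operator()() {
            if ( ++nPass < 16 )
                std::this_thread::yield();   // cheap at first
            else
                std::this_thread::sleep_for( std::chrono::milliseconds( 10 ));
        }
    };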
Why does the writer call flip_and_wait twice? To clarify, consider the following sequence of actions with two threads, A and B, and suppose that synchronize called flip_and_wait only once:
1. Thread A calls access_lock. In the body of this function it is determined that the call is not nested; the global g_nGlobalCtl is read, but its value has not yet been assigned to the thread's nThreadCtl (everything runs in parallel, so this situation is entirely possible). The grace-period identifier read by A is 0.
2. Thread B calls synchronize. The single flip_and_wait changes the grace-period bit in g_nGlobalCtl; the current grace-period identifier becomes 1.
3. Since no thread has declared itself in the old grace-period (thread A has not yet stored its nThreadCtl), thread B completes synchronize.
4. Thread A assigns its nThreadCtl variable. Recall that the thread read the old grace-period identifier 0.
5. Thread A finishes access_lock and continues execution inside the critical section.
6. Thread B calls synchronize again (apparently, it wants to delete something else). The grace-period bit in g_nGlobalCtl is flipped again, so the identifier is now 0.

Recall that synchronize is called by the writer before physically freeing the memory of an element. But now the grace-period identifier stored by thread A (0) coincides with the current one, so check_grace_period returns false for A and thread B does not wait for it - even though A entered its critical section before this grace-period began and may still hold a pointer to the element that B is about to free. By calling flip_and_wait twice, that is, by waiting for the end of a grace-period twice, we solve this problem, whose cause is the concurrent execution of threads.
We will dwell on this common scheme: a single bit as the grace-period identifier and two flip_and_wait calls inside synchronize before each deletion. Is there any way to improve it? Yes: accumulate the elements to be deleted in a buffer, so that synchronize is called only when the buffer is full. Unlike the Hazard Pointer scheme, in URCU the buffer can be common to all threads (although nothing prevents per-thread buffers either); a sketch of this idea follows.
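A minimal sketch of such buffering, assuming the synchronize function above. The buffer here is protected by a mutex for brevity, whereas libcds uses a lock-free queue; retired_node, c_nBufferSize, and retire_node are illustrative names:

    #include <mutex>
    #include <vector>

    static size_t const c_nBufferSize = 256;   // flush threshold (illustrative)

    struct retired_node {
        void *  pNode;
        void (* pfnFree)( void * );            // type-erased deleter
    };

    std::mutex                 g_BufferMutex;
    std::vector<retired_node>  g_Buffer;

    void retire_node( void * p, void (* pfnFree)( void * ))
    {
        std::vector<retired_node> batch;
        {
            std::lock_guard<std::mutex> lock( g_BufferMutex );
            g_Buffer.push_back( retired_node{ p, pfnFree } );
            if ( g_Buffer.size() < c_nBufferSize )
                return;                        // just defer the deletion
            batch.swap( g_Buffer );            // buffer full - take the whole batch
        }
        synchronize();                         // one grace-period wait per batch
        for ( retired_node const& rn : batch )
            rn.pfnFree( rn.pNode );            // now the memory can be freed safely
    }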
The cds::urcu namespace of libcds provides the following URCU implementations:

- general_instant - an implementation that follows the described URCU algorithm exactly: every deletion triggers a synchronize, with no buffering. If deletion is a fairly frequent operation, that is, the structure is not really "almost read-only", this implementation is rather slow.
- general_buffered - an implementation with a common lock-free buffer of a predefined size. Dmitry Vyukov's queue cds::container::VyukovMPMCCycleQueue is used as the lock-free buffer. The performance of this implementation is comparable to the Hazard Pointer scheme.
- general_threaded - similar to general_buffered, but the buffer is cleared by a dedicated thread. This implementation is slightly inferior to general_buffered because of the additional synchronization with the dedicated thread, but it does not slow down the writers.
- signal_buffered - an analogue of general_buffered, but based on signal-handling URCU. Not for Windows systems.
- signal_threaded - an analogue of general_threaded for signal-handling URCU. Also not for Windows.
To unify all these implementations, the wrapper class cds::urcu::gc was introduced:

    template <typename RCUimpl> class gc;

where RCUimpl is one of the URCU implementations: general_instant, general_buffered, etc. With such a wrapper, the URCU specialization of a container is easy to write, and only one specialization is needed:

    template < class RCU, typename Key, typename Value, class Traits >
    class SplitListMap< cds::urcu::gc< RCU >, Key, Value, Traits > ...
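For example, a concrete map type could then be declared along these lines (a sketch only: the exact header paths and the default Traits differ between libcds versions, so treat the names as illustrative):

    // Illustrative: a split-list map specialized for general_buffered URCU
    #include <cds/gc/general_buffered.h>   // header name as used in this article

    typedef cds::urcu::gc< cds::urcu::general_buffered<> >  rcu_type;
    typedef SplitListMap< rcu_type, int, int >               map_type;  // default Traits assumed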
Internally, the deletion methods of the URCU-based containers call not synchronize but retire_ptr. This function places the element to be removed into the URCU buffer and, at the right time (for example, when the buffer is full), calls synchronize. So an explicit call to synchronize is not required, although it is allowed. In addition, this solution unifies the interfaces of URCU and Hazard Pointer.

To use URCU, declare a cds::urcu::gc<cds::urcu::general_buffered<> > object at the beginning of main() after calling cds::Initialize():

    #include <cds/init.h>                 // cds::Initialize and cds::Terminate
    #include <cds/gc/general_buffered.h>  // general_buffered URCU

    int main(int argc, char** argv)
    {
        // Initialize libcds
        cds::Initialize() ;

        {
            // Initialize the general_buffered URCU
            cds::urcu::gc<cds::urcu::general_buffered<> >  gbRCU ;

            // If the main thread uses lock-free containers,
            // it must be attached to the libcds infrastructure
            cds::threading::Manager::attachThread() ;

            // Now the lock-free containers of libcds can be used
            // ...
        }

        // Terminate libcds
        cds::Terminate() ;
    }
Every other thread that works with lock-free containers must also be attached to the libcds infrastructure:

    // cds::threading::Manager is declared in this header
    #include <cds/threading/model.h>

    int myThreadEntryPoint(void *)
    {
        // attach the thread to the libcds infrastructure
        cds::threading::Manager::attachThread() ;

        // now this thread can use the lock-free containers of libcds
        // ...

        // detach the thread from libcds
        cds::threading::Manager::detachThread() ;
        return 0;
    }
Container methods that return a pointer to an element must be called inside the read-side critical section: call access_lock before invoking the container method, and call access_unlock after finishing work with the pointer; the best (exception-safe) technique is to use a scoped-lock in a separate code block, as in the sketch below.
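A minimal sketch of this pattern (rcu_type is the gc typedef from the earlier sketch; the scoped_lock nested type and the map's get interface are illustrative, not a verbatim libcds signature):

    {
        // RAII entry into the read-side critical section
        rcu_type::scoped_lock sl;

        // a hypothetical search that returns a pointer to the found value
        Value * pVal = map.get( key );
        if ( pVal )
            process( *pVal );   // the pointer is valid only inside the critical section
    }   // leaving the block exits the section, even if process() throws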
When entering the critical section, call gc::access_lock(); on exit, call gc::access_unlock() (here gc is one of the URCU implementations; for exception safety it is better to use the scoped-lock technique rather than calling these functions directly). The only subtle point is element removal: the deletion method must also be called inside the read-side critical section, but the physical removal of the element via gc::retire_ptr must be done outside the critical section, otherwise a deadlock is possible: the gc::retire_ptr method may call synchronize internally.

Source: https://habr.com/ru/post/206984/