
"Lock-free, or not lock-free, that is the question" or "Healthy sleep worse than bitter radish"

I was prompted to write this article by the comments to the article "How to sleep correctly and wrongly".


This article focuses on developing multi-threaded applications, the applicability of lock-free techniques to some cases that came up while working on LAppS, the nanosleep function, and abuse of the task scheduler.


NB: Everything below concerns C++ under Linux; other systems with POSIX.1-2008 support should behave similarly (with reservations).

In general it is all rather tangled, but I hope the train of thought in the presentation will be clear. If you are interested, read on below the cut.


Event-oriented software is always waiting for something. Whether it is a GUI or a network server, it waits for events: keyboard input, mouse events, the arrival of a data packet over the network. But different software waits differently. Lock-free systems should not wait at all; at least, lock-free algorithms should be used where waiting is unnecessary and even harmful. We are talking about concurrent (multi-threaded) systems, though, and strangely enough lock-free algorithms wait as well. Yes, they do not block the execution of parallel threads, but at the same time they themselves wait for an opportunity to do something without blocking.


LAppS makes heavy use of mutexes and semaphores. Semaphores, unfortunately, are missing from standard C++. The mechanism is very important and convenient, but C++ has to work on systems without semaphore support, which is why semaphores are not in the standard. Moreover, if I use semaphores because they are convenient, I use mutexes because I am forced to.
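For reference, the system primitive in question is the POSIX counting semaphore; a minimal usage sketch (the producer/consumer roles here are purely illustrative):

```cpp
#include <semaphore.h>
#include <thread>
#include <cstdio>

int main()
{
  sem_t sem;
  sem_init(&sem, 0, 0);      // pshared=0: shared only between threads of this process

  std::thread consumer([&sem]{
    sem_wait(&sem);          // sleeps in the kernel until the counter becomes positive
    std::puts("woken up");
  });

  sem_post(&sem);            // increments the counter and wakes one waiter
  consumer.join();
  sem_destroy(&sem);
  return 0;
}
```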


The behavior of a mutex under a contended lock(), as well as of sem_wait() in Linux, is to put the waiting thread at the end of the task scheduler queue; when its turn comes, the check is repeated and, without returning to userland, the thread is put back in the queue if the expected event has not happened yet. This is a very important point.


So I decided to check whether I could drop std::mutex and the POSIX semaphores by emulating them with std::atomic, moving most of the load into userland. In fact it did not work out, but first things first.


First of all, there are several places in the project where these experiments could be useful (referred to below as case 1, case 2 and case 3).



Let's start with non-blocking locks. Let's write our own mutex using atomics, as shown in some of H. Sutter's talks (I do not have the original code, so this is from memory and does not match the original 100%; Sutter was also referring to C++20 progress, so there are differences). Despite the simplicity of this code, it has pitfalls.


```cpp
#include <atomic>
#include <system_error>
#include <cerrno>
#include <pthread.h>

namespace test
{
  class mutex
  {
  private:
    std::atomic<pthread_t> mLock;
  public:
    explicit mutex() : mLock{0}
    {
    }
    mutex(const mutex&)=delete;
    mutex(mutex&)=delete;

    void lock()
    {
      pthread_t locked_by=0;
      // on failure compare_exchange_strong writes the current owner into locked_by
      while(!mLock.compare_exchange_strong(locked_by,pthread_self()))
      {
        locked_by=0; // reset the expected value before the next attempt
      }
    }

    void unlock()
    {
      pthread_t current=pthread_self();
      if(!mLock.compare_exchange_strong(current,0))
      {
        throw std::system_error(
          EACCES, std::system_category(),
          "An attempt to unlock the mutex owned by other thread"
        );
      }
    }

    const bool try_lock()
    {
      pthread_t unused=0;
      return mLock.compare_exchange_strong(unused,pthread_self());
    }
  };
}
```
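Since the class provides lock()/unlock()/try_lock(), it can be used with the standard guards just like std::mutex; a hypothetical usage sketch (the counter here is only for illustration):

```cpp
#include <mutex>   // std::lock_guard

test::mutex mtx;
unsigned long counter=0;

void increment()
{
  // lock() busy-waits while another thread holds mtx
  std::lock_guard<test::mutex> sync(mtx);
  ++counter;
}
```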

Unlike std::mutex::unlock(), the behavior of test::mutex::unlock() when unlocking from another thread is deterministic: an exception is thrown. This is good, though not consistent with the behavior of the standard. What is wrong with this class? The bad part is that test::mutex::lock() will shamelessly burn CPU time within its allotted quota trying to grab a mutex that another thread already owns; that is, the loop in test::mutex::lock() wastes CPU resources. What are our options for dealing with this?


We can use sched_yield() (as suggested in one of the comments to the article mentioned above). Is it that simple? First of all, for sched_yield() to be of any use, the threads must be scheduled under the SCHED_RR or SCHED_FIFO real-time policies. Otherwise calling sched_yield() is a waste of CPU time. Secondly, calling sched_yield() very frequently still drives CPU usage up. Moreover, using real-time policies in your application, provided there are no other real-time applications in the system, limits the scheduler queue of the chosen policy to your threads alone. It would seem that this is good. No, it is not: the whole system becomes less responsive because it is busy with higher-priority tasks, and the threads handled by the normal (CFS) scheduling class are sidelined. Meanwhile there are other threads in the application, and very often the thread holding the mutex ends up at the end of the queue (its quota expired) while the thread waiting for the mutex to be released is right in front of it. In my experiments (case 2) this method gave roughly the same results (about 3.8% worse) as std::mutex, but the system is less responsive and CPU consumption grows by 5%-7%.
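For reference, a sketch of what that variant of test::mutex::lock() might look like (assuming the threads really do run under SCHED_RR or SCHED_FIFO, as discussed above):

```cpp
#include <sched.h>

void lock()
{
  pthread_t locked_by=0;
  while(!mLock.compare_exchange_strong(locked_by,pthread_self()))
  {
    sched_yield();   // give up the slice; pointless without a real-time scheduling policy
    locked_by=0;     // reset the expected value before the next attempt
  }
}
```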


You can also try changing test::mutex::lock() like this (also a bad option):


```cpp
void lock()
{
  pthread_t locked_by=0;
  while(!mLock.compare_exchange_strong(locked_by,pthread_self()))
  {
    // ~4ns turned out to be close to optimal on my CPU
    static thread_local const struct timespec pause{0,4};
    nanosleep(&pause,nullptr);
    locked_by=0;
  }
}
```

Here you can experiment with the sleep duration in nanoseconds; a 4ns pause turned out to be optimal for my CPU, and the performance drop relative to std::mutex in the same case 2 was 1.2%. It is not a given that nanosleep actually slept 4ns: in reality it sleeps either longer (in general) or less (if interrupted). The drop (!) in CPU consumption was 12%-20%. That is, it really was healthy sleep.
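How long nanosleep actually sleeps can be checked with clock_gettime(); a rough measurement sketch (the 4ns request matches the code above, the iteration count is arbitrary):

```cpp
#include <time.h>
#include <cstdio>

int main()
{
  const struct timespec pause{0,4};   // request 4ns
  struct timespec start, stop;
  const int iterations=100000;

  clock_gettime(CLOCK_MONOTONIC,&start);
  for(int i=0;i<iterations;++i)
    nanosleep(&pause,nullptr);
  clock_gettime(CLOCK_MONOTONIC,&stop);

  double total_ns=(stop.tv_sec-start.tv_sec)*1e9+(stop.tv_nsec-start.tv_nsec);
  std::printf("nanosleep(4ns) took %.1f ns on average\n",total_ns/iterations);
  return 0;
}
```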


OpenSSL and LibreSSL have two functions that install locking callbacks for using these libraries in a multi-threaded environment. It looks like this:


```cpp
// the locking callback
void openssl_crypt_locking_function_callback(int mode, int n, const char* file, const int line)
{
  static std::vector<std::mutex> locks(CRYPTO_num_locks());
  if(n>=static_cast<int>(locks.size()))
  {
    abort();
  }

  if(mode & CRYPTO_LOCK)
    locks[n].lock();
  else
    locks[n].unlock();
}

// installing the locking callback
CRYPTO_set_locking_callback(openssl_crypt_locking_function_callback);

// installing the thread id callback
CRYPTO_set_id_callback(pthread_self);
```

And now the worst part: using the above test::mutex in LibreSSL reduces LAppS performance by almost a factor of two, regardless of the variant (empty busy-wait loop, sched_yield(), nanosleep()).


So for case 1 and case 2 these options are out, and we stay with std::mutex.


On to semaphores. There are many examples of how to implement a semaphore with std::condition_variable. All of them use std::mutex as well, and such semaphore emulations are slower (according to my tests) than the system POSIX semaphores.
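A typical emulation of that kind looks roughly like this (a textbook sketch for comparison, not the LAppS code):

```cpp
#include <mutex>
#include <condition_variable>
#include <cstdint>

class cv_semaphore
{
private:
  std::mutex mtx;
  std::condition_variable cv;
  int64_t counter;
public:
  explicit cv_semaphore() : counter{0} {}

  void post()
  {
    std::lock_guard<std::mutex> lk(mtx);
    ++counter;
    cv.notify_one();      // wake one waiter
  }

  void wait()
  {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk,[this]{ return counter>0; });  // sleeps until post() makes the counter positive
    --counter;
  }
};
```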


Therefore, let's build a semaphore on atomics:


```cpp
class semaphore
{
private:
  std::atomic<bool> mayRun;
  mutable std::atomic<int64_t> counter;
public:
  explicit semaphore() : mayRun{true},counter{0}
  {
  }
  semaphore(const semaphore&)=delete;
  semaphore(semaphore&)=delete;

  const bool post() const
  {
    ++counter;
    return mayRun.load();
  }

  const bool try_wait()
  {
    if(mayRun.load())
    {
      if(counter.fetch_sub(1)>0)
        return true;
      else
      {
        ++counter;
        return false;
      }
    }else{
      throw std::system_error(ENOENT,std::system_category(),"Semaphore is destroyed");
    }
  }

  void wait()
  {
    while(!try_wait())
    {
      static thread_local const struct timespec pause{0,4};
      nanosleep(&pause,nullptr);
    }
  }

  void destroy()
  {
    mayRun.store(false);
  }

  const int64_t decrimentOn(const size_t value)
  {
    if(mayRun.load())
    {
      return counter.fetch_sub(value);
    }else{
      throw std::system_error(ENOENT,std::system_category(),"Semaphore is destroyed");
    }
  }

  ~semaphore()
  {
    destroy();
  }
};
```
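A minimal harness for this semaphore with one producer and several consumers, roughly mirroring the test below (this is my own sketch, not the actual test code; the thread and post counts are illustrative):

```cpp
#include <thread>
#include <vector>
#include <atomic>
#include <cstdint>
#include <system_error>

std::atomic<uint64_t> wakes{0};

void run_test(semaphore& sem)   // the atomic semaphore from the listing above
{
  std::vector<std::thread> consumers;
  for(int i=0;i<20;++i)
    consumers.emplace_back([&sem]{
      try {
        while(true) { sem.wait(); ++wakes; }
      } catch(const std::system_error&) {
        // the semaphore was destroyed, leave the loop
      }
    });

  for(int i=0;i<10000000;++i)
    sem.post();

  sem.destroy();                // pending wait() calls throw and the consumers exit
  for(auto& t : consumers) t.join();
}
```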

And this semaphore is many times faster than the system one. The result of testing it separately with one producer and 20 consumers:


```
OS semaphores test. Started 20 threads waiting on a semaphore
Thread(OS): wakes: 500321
Thread(OS): wakes: 500473
Thread(OS): wakes: 501504
Thread(OS): wakes: 502337
Thread(OS): wakes: 498324
Thread(OS): wakes: 502755
Thread(OS): wakes: 500212
Thread(OS): wakes: 498579
Thread(OS): wakes: 499504
Thread(OS): wakes: 500228
Thread(OS): wakes: 499696
Thread(OS): wakes: 501978
Thread(OS): wakes: 498617
Thread(OS): wakes: 502238
Thread(OS): wakes: 497797
Thread(OS): wakes: 498089
Thread(OS): wakes: 499292
Thread(OS): wakes: 498011
Thread(OS): wakes: 498749
Thread(OS): wakes: 501296
OS semaphores test. 10000000 of posts for 20 waiting threads have taken 9924 milliseconds
OS semaphores test. Post latency: 0.9924ns
=======================================
AtomicEmu semaphores test. Started 20 threads waiting on a semaphore
Thread(EmuAtomic) wakes: 492748
Thread(EmuAtomic) wakes: 546860
Thread(EmuAtomic) wakes: 479375
Thread(EmuAtomic) wakes: 534676
Thread(EmuAtomic) wakes: 501014
Thread(EmuAtomic) wakes: 528220
Thread(EmuAtomic) wakes: 496783
Thread(EmuAtomic) wakes: 467563
Thread(EmuAtomic) wakes: 608086
Thread(EmuAtomic) wakes: 489825
Thread(EmuAtomic) wakes: 479799
Thread(EmuAtomic) wakes: 539634
Thread(EmuAtomic) wakes: 479559
Thread(EmuAtomic) wakes: 495377
Thread(EmuAtomic) wakes: 454759
Thread(EmuAtomic) wakes: 482375
Thread(EmuAtomic) wakes: 512442
Thread(EmuAtomic) wakes: 453303
Thread(EmuAtomic) wakes: 480227
Thread(EmuAtomic) wakes: 477375
AtomicEmu semaphores test. 10000000 of posts for 20 waiting threads have taken 341 milliseconds
AtomicEmu semaphores test. Post latency: 0.0341ns
```

That is, this semaphore, with its nearly free post() that is 29 times faster than the system one, is also very fast at waking its waiting threads: 29325 wake-ups¹ per millisecond versus 1007 wake-ups per millisecond for the system semaphore. Its behavior is deterministic when waiting on a destroyed semaphore or when the semaphore is destroyed while being waited on (an exception is thrown). And, naturally, a segfault when trying to use one that has already been deleted.


(¹) In reality a thread cannot be suspended and woken up by the scheduler that many times per second. Since post() is non-blocking, in this synthetic test wait() very often finds itself in a situation where it does not need to sleep at all. Meanwhile at least 7 threads are reading the semaphore value in parallel.


But using it in case 3 in LAppS leads to a loss of performance regardless of the sleep duration: it wakes up too often to check, while events in LAppS arrive much more slowly (network latency, latency of the client side generating the load, etc.), and checking less often means losing performance.


Moreover, using sleep in such cases and in such a way is downright harmful, because on different hardware the results may turn out completely different (as with the pause assembly instruction), and the delay would also have to be tuned for every CPU model.
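For reference, this is the kind of busy-wait the pause instruction is meant for; an x86-only sketch of yet another lock() variant using the _mm_pause intrinsic (its latency differs between CPU generations, which is exactly the problem described above):

```cpp
#include <immintrin.h>   // _mm_pause, x86 only

void lock()
{
  pthread_t locked_by=0;
  while(!mLock.compare_exchange_strong(locked_by,pthread_self()))
  {
    _mm_pause();   // spin-wait hint to the CPU; how long it stalls depends on the CPU model
    locked_by=0;
  }
}
```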


The advantage of the system mutex and semaphore is that the waiting thread is not woken up until the event (mutex unlock or semaphore increment) occurs. No extra CPU cycles are wasted; profit.


In general, all of this is from the evil one. By the way, disabling iptables on my system gives from 12% (with TLS) to 30% (without TLS) performance gain...



Source: https://habr.com/ru/post/428087/

