This article explains why, when developing Win32 applications, the
Slim Reader/Writer Lock (SRWL) mechanism is often preferable to the classic
critical section.
Lightweight
The SRWL object occupies only 8 bytes in memory on the x64 architecture, while a critical section takes 40 bytes. A critical section must be initialized and deleted through OS API calls (InitializeCriticalSection and DeleteCriticalSection), while an SRWL is initialized by simply assigning it the SRWLOCK_INIT constant, and there is nothing to delete at all. Using SRWL produces more compact code and consumes less RAM at run time.
If you have 100,000 objects that each require some internal synchronization, the memory savings are already substantial. The performance gain from avoiding unnecessary cache misses is even more tangible. In modern processors (starting with
Intel Nehalem, released in 2008), one
cache line is 64 bytes. If you spend 40 of them on the synchronization object, this significantly affects access performance for small objects in your software.
Speed
First of all, keep in mind that the SRWL implementation in the OS kernel has been significantly reworked over the past few years. If you are reading an Internet benchmark comparing the speed of various synchronization primitives on Windows, pay attention to when it was written.
Both the critical section and the SRWL spin in a loop in user mode for some time, and only then go into a kernel wait. Only the critical section lets you tune the user-mode spin count (via InitializeCriticalSectionAndSpinCount or SetCriticalSectionSpinCount).
I have not investigated the implementation details any deeper. Nor have I ever tried to build a proper benchmark to compare the speed of critical sections and SRWLs in a fully correct way: it is very difficult to make a benchmark that is both theoretically sound and practically useful at the same time.
But I have replaced critical sections with SRWLs in my applications about 20 times, in different scenarios. SRWL was always faster (or at least not slower) and often gave a visible performance boost.
I will not give specific numbers here. The amount of work done while the lock is held, the granularity of the locks, the level of parallelism, the read/write ratio, cache usage, processor load, and other factors influence the final result too strongly.
I will not claim that SRWL is always faster than a critical section. In each case, profiling is needed to see the whole picture.
SRWL's lack of reentrancy
This is not a bug, but a feature.
The absence of lock reentrancy immediately leads to more transparent public contracts and requires care in deciding where to acquire and release the lock, which ultimately helps us avoid deadlocks. Well, at least until you do something reckless, like invoking a callback while holding a lock.
Reentrant locks are, of course, also useful: for example, when you are trying to add parallelism to some old code and do not want to dive too deep into refactoring it. The original POSIX mutex
was made reentrant by accident. One can only imagine how many problems with parallel code and locks could have been happily avoided if reentrant synchronization primitives had not become mainstream.
A thread that tries to acquire the same SRWL for writing twice will deadlock itself. This type of error is easy to identify and fix the first time it appears: just look at the call stack, which contains all the necessary information. Timings and other threads have no influence here.
Recursive read locks also used to cause deadlocks; at least I am about 90% sure of this :). If I am not mistaken, Microsoft quietly changed the behavior, either in some update or in the transition from Windows 8 to Windows 10, and now there is no deadlock. Unfortunately, this has made reentrancy-related errors harder to find. Mistakenly nested read locks lead to unpleasant bugs when the inner lock is released too early. Perhaps even worse, the outer release can free a lock taken by another reader.
Microsoft's SAL annotations for locks can, in theory, help detect this type of problem at compile time, but I personally have never tried them in practice.
Parallel reading
In practice, parallel reads happen quite often. The critical section offers no parallelism in this case: readers exclude each other just as writers do.
Write performance problems
The flip side of parallel reading is that a write lock cannot be acquired until all read locks are released. Moreover, the SRWL gives a pending writer no preference, and no fairness guarantees at all, in deciding who gets the lock next: new read locks can be acquired successfully while a write lock remains pending. On the one hand, critical sections are no better in this respect (you cannot set read or write priorities there either); on the other hand, since they allow no parallel read acquisitions, the problem occurs less often.
The Windows thread scheduler
provides some fairness in allocating CPU time to all threads. This helps the other threads complete their user-mode wait loops while one thread holds the resource. But since the scheduler's algorithm is not part of any public contract, you should not write code that depends on its current implementation.
If steady progress of writers is important, then neither the critical section nor the SRWL is a suitable synchronization mechanism. Other constructs, such as a reader-writer queue, may be preferable.
The Concurrency Runtime
concurrency::reader_writer_lock provides stricter priority guarantees than SRWL and is designed specifically for heavily contended locks. This has its price: in my experience, this primitive is noticeably slower than both critical sections and SRWL, and it also takes up more memory (72 bytes).
Personally, I find scheduling individual tasks (jobs) merely to attempt a lock acquisition excessive, but it may well suit someone.
False sharing
False sharing is much more likely with SRWLs than with critical sections, again because of the size difference (8 bytes versus 40). When a critical section lands in the cache, its 40 bytes occupy most of a 64-byte cache line, leaving little room for another critical section to fall into the same line. If you are creating an array of locks, try to take the cache line size of your platform into account.
You should not, however, focus on this prematurely. Even SRWLs rarely end up in the same cache line, and it only matters when a very large number of threads simultaneously modify some relatively small number of objects. If you have, say, several thousand small objects, it is hardly worth inflating their size just to reduce the probability of false sharing: the game is usually not worth the candle. But, of course, you can only say for certain after profiling each specific situation.
OS kernel bug
I have to mention a bug in the Windows kernel that made me lose a little faith in SRWL, and indeed in Windows itself. Several years ago, my colleagues and I began to notice strange failures where some threads would occasionally fail to acquire one SRWL or another. This happened mostly on dual-core processors, but sometimes, very rarely, on single-core ones as well. Debugging showed that at the moment of the failed acquisition, no other thread held the lock. Even more surprisingly, an instant later a retry on the same lock in the same thread succeeded. After a long investigation, I managed to reduce the reproduction time for this bug from several days to half an hour. In the end, I proved that it was a problem in the OS kernel, one that also affects IOCP.
Eight months passed between the discovery of the bug and the release of the
hotfix, and of course it took additional time for the update to reach user PCs.
Conclusions
Most locks protect the data inside some object from accidental simultaneous access by different threads. The key word here is "accidental", since a requirement of truly simultaneous access is rarely programmed intentionally. Both the critical section and the SRWL perform well when acquiring and releasing a currently free lock. In that case, the overall size of the protected object comes to the fore: if the object is small enough to fit into the same cache line as its lock, that alone gives a performance boost. The SRWL's 32-byte size advantage is the main reason to use it for this purpose.
For code where the lock is usually already taken at the moment of an acquisition attempt, no such unequivocal conclusions can be drawn; every optimization has to be measured. But in that case, the acquire/release speed of the lock itself is unlikely to be the bottleneck. What matters most is reducing the time spent inside the lock. Everything that can be done before or after the lock should be done there. Consider using several separate locks instead of one. Try to touch the necessary data before taking the lock (this gives the code inside the lock a chance to run faster, because the data will come from the cache). Do not allocate from the global heap inside the lock (try an allocator with preallocated memory). And so on.
Finally, non-reentrant locks are much easier to read in code. Reentrant locks are a kind of "goto of concurrency": they obscure the current locking state and the causes of deadlocks.