
Mutexes and Closure Capture in Swift



A translation of an article by Matt Gallagher.

This article looks at the absence of threading and thread-synchronization tools in Swift. We will discuss the proposal for introducing concurrency to Swift and how, until that feature arrives, threaded code in Swift will mean traditional mutexes and shared mutable state.
Using a mutex in Swift is not particularly difficult, but it makes a good backdrop for a subtle performance nuance in Swift: dynamic memory allocation during closure capture. We want our mutex to be fast, yet passing a closure to be executed inside a mutex can slow it down by a factor of 10 because of memory allocation overhead. Let's look at several ways to solve this problem.

The absence of threading in Swift


When Swift was first announced in June 2014, it had two obvious omissions: error handling and threading.


Error handling was implemented in Swift 2 and was one of the key features of that release.

Threading, for the most part, is still ignored by Swift. Instead of language-level threading support, Swift includes the Dispatch module (libdispatch, aka Grand Central Dispatch) on all platforms and implicitly suggests that we use Dispatch rather than expect help from the language.

Delegating this responsibility to a bundled library seems particularly strange compared to other modern languages, such as Go and Rust, where threading primitives and strict thread safety (respectively) are headline features. Even @synchronized and atomic properties in Objective-C look like a generous offering compared to the absence of anything similar in Swift.

What is the reason for such an obvious omission in this language?

Future "multithreading" in Swift


The answer is briefly discussed in the concurrency proposal in the Swift repository.

I mention this proposal to emphasize that the Swift developers would like to do something about concurrency in the future, but keep in mind what Swift developer Joe Groff says: "this document is just a proposal and not an official statement of direction."

The proposal appears to describe a scenario where, as in Cyclone or Rust, references cannot be shared between threads of execution. Whether or not the result resembles those languages, it seems Swift plans to eliminate shared memory between threads, except for types implementing Copyable that are passed through strictly controlled channels (called Streams in the proposal). There would also be a kind of coroutine (called Tasks in the proposal) that behaves like an asynchronous dispatch block that can be paused and resumed.

The proposal goes on to state that the most common language-level concurrency features (analogous to chan in Go, async/await in .NET, actors in Erlang) could then be implemented in libraries on top of the Stream/Task/Copyable primitives.

Sounds good, but when should we expect concurrency in Swift? Swift 4? Swift 5? Not any time soon.

So for now it does not help us; if anything, it gets in the way.

The impact of future features on current libraries


The problem is that Swift avoids adding simple concurrency primitives to the language, or thread-safe versions of language features, on the grounds that they would be replaced or made obsolete by some future facility.

You can find evidence of this attitude in discussions on the Swift-Evolution mailing list.


In search of a fast, general-purpose mutex


In short: if we need concurrent behavior, we have to build it ourselves using the existing threading and mutex primitives.

The standard mutex advice in Swift is: use a DispatchQueue and call sync on it.
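
For example, serializing access to some state through a private queue looks like this (a minimal sketch; the class, queue label, and counter are illustrative, not from the article's code):

    import Dispatch

    final class Counter {
        private let queue = DispatchQueue(label: "com.example.counter")
        private var count = 0

        func increment() {
            // `sync` runs the closure on the queue, one caller at a time,
            // so access to `count` is serialized like a mutex would do.
            queue.sync {
                count += 1
            }
        }
    }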

I like libdispatch, but in most cases DispatchQueue.sync used as a mutex is the slowest way to solve the problem, more than an order of magnitude slower than other solutions, due to the unavoidable cost of capturing the closure passed to the sync function. The closure must capture the surrounding state (in particular, a reference to the protected resource), and that capture implies a closure context allocated in dynamic memory. Until Swift can optimize non-escaping closures onto the stack, the only way to avoid the cost of heap-allocating closures is to make sure they are inlined. Unfortunately, that is not possible across module boundaries, such as the boundary of the Dispatch module. This makes DispatchQueue.sync a needlessly slow mutex in Swift.

The next most frequently suggested option is objc_sync_enter/objc_sync_exit. At 2-3 times faster than libdispatch, it is still a little slower than ideal (because it is always a re-entrant mutex) and it depends on the Objective-C runtime (so it is limited to Apple platforms).
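
Usage is straightforward (a sketch; any Objective-C compatible object can serve as the lock token, here self):

    import Foundation

    final class SynchronizedCounter {
        private var count = 0

        func increment() {
            // Equivalent to Objective-C's @synchronized(self) { ... }
            objc_sync_enter(self)
            defer { objc_sync_exit(self) }
            count += 1
        }
    }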

The fastest mutex option, OSSpinLock, is more than 20 times faster than dispatch_sync. Beyond the general limitations of spinlocks (high CPU usage when several threads try to enter at once), it has serious problems on iOS that make it completely unsuitable on that platform. Accordingly, it can only be used on the Mac.

If you are targeting iOS 10 or macOS 10.12 or newer, you can use os_unfair_lock_t. Its performance should be close to OSSpinLock while avoiding its most serious problems. However, this lock is not FIFO: the mutex is granted to an arbitrary waiter (hence "unfair"). You need to decide whether that is a problem for your program, but in general it means this should not be your first choice for a general-purpose mutex.
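
Because an os_unfair_lock must not be copied or moved once in use, a sketch in current Swift syntax is safest with explicitly allocated storage (the class and property names are illustrative):

    import os

    final class UnfairLocked {
        // Allocate stable storage for the lock; a plain stored property
        // could be moved by Swift, which os_unfair_lock does not tolerate.
        private let lock = os_unfair_lock_t.allocate(capacity: 1)
        private var value = 0

        init() {
            lock.initialize(to: os_unfair_lock())
        }
        deinit {
            lock.deinitialize(count: 1)
            lock.deallocate()
        }

        func increment() {
            os_unfair_lock_lock(lock)
            defer { os_unfair_lock_unlock(lock) }
            value += 1
        }
    }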

All these problems leave pthread_mutex_lock/pthread_mutex_unlock as the only reasonable, performant, and portable option.

Mutexes and capture pitfalls


Like most things in pure C, pthread_mutex_t has a rather clumsy interface that benefits from a Swift wrapper (especially for construction and automatic cleanup). A "scoped" mutex is also useful: one that accepts a function and executes it inside the mutex, providing balanced lock and unlock on either side of the function.
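
Let's call the wrapper PThreadMutex. A minimal sketch of its construction and cleanup (an approximation of the class in CwlUtils; a class is used so the pthread_mutex_t has a stable address and is never copied):

    import Darwin // Glibc on Linux

    public final class PThreadMutex {
        // The underlying mutex; exposed so helper functions can lock it directly.
        public var m = pthread_mutex_t()

        public init() {
            pthread_mutex_init(&m, nil)
        }
        deinit {
            pthread_mutex_destroy(&m)
        }
    }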

Here is an implementation of a simple scoped-mutex sync function on this PThreadMutex wrapper:

    public func sync<R>(execute: () -> R) -> R {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        return execute()
    }

It should be fast, but it isn't. Do you see why?

The problem arises when this is implemented as a reusable function, as it is in the separate CwlUtils module. This leads to exactly the same problem as DispatchQueue.sync: capture by the closure causes a dynamic memory allocation. Because of that overhead, the function runs more than 10 times slower than necessary (3.124 seconds for 10 million calls, compared to an ideal 0.263 seconds).

What exactly is "captured"? Let's consider the following example:

 mutex.sync { doSomething(&protectedMutableState) } 

To do anything useful inside the mutex, a reference to protectedMutableState must be stored in the "closure context", which is data allocated in dynamic memory.

This may seem harmless enough (after all, capturing is what closures do). But if the sync function cannot be inlined into its caller (because it is in another module, or in another file with whole-module optimization turned off), then the capture requires a dynamic memory allocation.

And we don't want that. To avoid it, we can pass the protected state to the closure as a parameter instead of capturing it.

WARNING: the next few code examples get increasingly ridiculous, and in most cases I suggest not following them. I do this to demonstrate the depth of the problem. Read the "A different approach" section below to see what I use in practice.

    public func sync_2<T>(_ p: inout T, execute: (inout T) -> Void) {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        execute(&p)
    }

That's better... now the function runs at full speed (0.282 seconds for the 10-million-call test).
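
Calling it looks like this (assuming the same protectedMutableState and doSomething as in the earlier capture example):

    // The protected state is passed in as an inout parameter,
    // so the closure captures nothing and no context is allocated.
    mutex.sync_2(&protectedMutableState) { doSomething(&$0) }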

We solved the problem for values passed into the function. A similar problem occurs with returning a result. The next function:

    public func sync_3<T, R>(_ p: inout T, execute: (inout T) -> R) -> R {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        return execute(&p)
    }

exhibits the same slowness as the original, even though the closure captures nothing (1.371 seconds). To return its result, the closure requires a dynamic memory allocation.

We can fix this by returning the result through another inout parameter instead.

    public func sync_4<T, U>(_ p1: inout T, _ p2: inout U, execute: (inout T, inout U) -> Void) -> Void {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        execute(&p1, &p2)
    }

and call it like this:

    // Assuming `mutableState` and `result` are mutable variables in the current scope
    mutex.sync_4(&mutableState, &result) { $1 = doSomething($0) }

We are back at full speed, or close enough to it (0.307 seconds for 10 million calls).

A different approach


One advantage of closure capture is how effortless it looks. Captured items have the same names inside and outside the closure, and the connection between them is obvious. When we avoid capture and instead pass everything as parameters, we either have to rename all of our variables or give them shadowed names, which does not help comprehension, and we still risk accidentally capturing a variable and silently degrading performance again.

Let's put everything aside and solve the problem differently.

We can create a free sync function in our own file that takes the mutex as a parameter:

    private func sync<R>(mutex: PThreadMutex, execute: () throws -> R) rethrows -> R {
        pthread_mutex_lock(&mutex.m)
        defer { pthread_mutex_unlock(&mutex.m) }
        return try execute()
    }

With this function placed in the file from which it is called, everything almost works. We avoid the dynamic memory allocation, and execution time drops from 3.043 to 0.374 seconds. But we still haven't reached the 0.263 seconds of direct pthread_mutex_lock/pthread_mutex_unlock calls. What's wrong this time?

It turns out that even with a private function in the same file, where Swift can fully inline it, Swift does not eliminate the redundant retain and release of the PThreadMutex parameter (which is a class so that the pthread_mutex_t is not broken by copying).

We can force the compiler to avoid these retains and releases by making the function an extension of PThreadMutex rather than a free function:

    extension PThreadMutex {
        private func sync<R>(execute: () throws -> R) rethrows -> R {
            pthread_mutex_lock(&m)
            defer { pthread_mutex_unlock(&m) }
            return try execute()
        }
    }

This causes Swift to treat the self parameter as @guaranteed, eliminating the retain/release cost, and we finally get down to 0.264 seconds.

Semaphores, not mutexes?


Why not use a dispatch_semaphore_t? The advantage of dispatch_semaphore_wait and dispatch_semaphore_signal is that no closure is required - they are separate, unscoped calls.

You can use a dispatch_semaphore_t to create a mutex-like construct:

    public struct DispatchSemaphoreWrapper {
        let s = DispatchSemaphore(value: 1)
        init() {}
        func sync<R>(execute: () throws -> R) rethrows -> R {
            _ = s.wait(timeout: DispatchTime.distantFuture)
            defer { s.signal() }
            return try execute()
        }
    }
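
Usage mirrors the mutex version (a sketch; the counter is illustrative):

    let wrapper = DispatchSemaphoreWrapper()
    var counter = 0

    // The semaphore starts at 1, so the first `wait` succeeds immediately
    // and any concurrent caller blocks until `signal` is called.
    wrapper.sync { counter += 1 }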

It turns out this is about a third faster than the pthread_mutex_lock/pthread_mutex_unlock mutex (0.168 seconds versus 0.244). But despite the speed increase, using a semaphore as a mutex is not the best choice for a general-purpose mutex.

Semaphores are prone to a number of errors and problems. The most serious of these are forms of priority inversion. Priority inversion is the same class of problem that makes OSSpinLock unusable on iOS, but for semaphores it is a little more complicated.

With a spinlock, priority inversion means:

  1. A high-priority thread is active, spinning, waiting for a lock held by a lower-priority thread.
  2. The low-priority thread never releases the lock because it is starved of CPU time by the high-priority thread.

With a semaphore, priority inversion means:

  1. A high-priority thread waits on a semaphore.
  2. A medium-priority thread does nothing related to the semaphore.
  3. A low-priority thread is expected to signal the semaphore so the high-priority thread can continue.

The medium-priority thread will starve the low-priority thread (that is normal for thread priorities). But since the high-priority thread is waiting for the low-priority thread to signal the semaphore, the high-priority thread is effectively starved by the medium-priority one. Ideally, this should never happen.

If a proper mutex were used instead of the semaphore, the high-priority thread's priority would be donated to the lower-priority thread holding the mutex while the high-priority thread waits, allowing the low-priority thread to finish its work and unblock the high-priority one. Semaphores, however, are not "held" by any thread, so this priority inheritance cannot occur.

Ultimately, semaphores are a good way to communicate completion notifications between threads (something that is not easy with mutexes), but semaphore designs are complex and carry risks. Their use should be limited to situations where you know all the threads involved and their priorities in advance - where the priority of the waiting thread is known to be equal to or lower than the priority of the signaling thread.
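
That completion-notification use is the one place semaphores shine. A sketch (expensiveComputation is a hypothetical stand-in for real work):

    import Dispatch

    let semaphore = DispatchSemaphore(value: 0)
    var result: Int? = nil

    DispatchQueue.global().async {
        result = expensiveComputation() // hypothetical work function
        semaphore.signal()              // notify: the work is done
    }

    // Blocks the current thread until the worker signals completion.
    semaphore.wait()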

All this may seem a bit abstract, since you probably don't deliberately create threads with different priorities in your programs. But the Cocoa frameworks add a complication: they use dispatch queues everywhere, and every dispatch queue has a "QoS class", which may cause the queue to run at a different thread priority. Unless you know the QoS of every task in your program (including user-interface and other tasks queued by the Cocoa frameworks), you may suddenly find yourself in a multi-priority threading situation. Better to avoid it.
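
For example, two innocuous-looking queues may already run at different priorities (a sketch; labels are illustrative):

    import Dispatch

    // These queues run at different underlying thread priorities, so code
    // waiting across them can hit the inversion described above.
    let background = DispatchQueue(label: "com.example.work", qos: .background)
    let interactive = DispatchQueue(label: "com.example.ui", qos: .userInteractive)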

Usage


A project containing the PThreadMutex and DispatchSemaphoreWrapper implementations is available on GitHub.

The CwlMutex.swift file is fully self-contained, so you can simply copy it if that's all you need.

Alternatively, the ReadMe.md file contains detailed information on cloning the whole repository and adding the framework it builds to your own projects.

Conclusion


pthread_mutex_t remains the best and safest mutex option in Swift on both Mac and iOS. In the future, Swift will likely gain the ability to optimize non-escaping closures onto the stack, or to inline across module boundaries. Either of those would remove the inherent problems of DispatchQueue.sync, probably making it the better option. For now, though, it is too inefficient.

While semaphores and other "lightweight" locks are valid approaches in some scenarios, they are not general-purpose mutexes and bring additional design considerations and risks.

Whichever mutex machinery you choose, you need to be careful about inlining to get maximum performance; otherwise closure-capture overhead can make your mutex 10 times slower. In the current version of Swift, that may mean copying and pasting code into the file where it is used.

Threading, inlining, and optimization are all topics where we can expect significant changes beyond Swift 3. But current Swift users have to work with Swift 2.3 and Swift 3, and this article describes the behavior of those versions when trying to get maximum performance from a scoped mutex.

Appendix: performance measurements


A simple loop was run 10 million times: enter the mutex, increment a counter, exit the mutex. The "slow" versions of DispatchSemaphore and PThreadMutex were compiled as part of a dynamic framework, separate from the test code.

Results:

Mutex option                                   | Seconds (Swift 2.3) | Seconds (Swift 3)
-----------------------------------------------|---------------------|------------------
PThreadMutex.sync (closure capture)            | 3.043               | 3.124
DispatchQueue.sync                             | 2.330               | 3.530
PThreadMutex.sync_3 (returning a result)       | 1.371               | 1.364
objc_sync_enter                                | 0.869               | 0.833
sync(PThreadMutex) (function in the same file) | 0.374               | 0.387
PThreadMutex.sync_4 (two inout parameters)     | 0.307               | 0.310
PThreadMutex.sync_2 (single inout parameter)   | 0.282               | 0.284
PThreadMutex.sync (inlined, non-capturing)     | 0.264               | 0.265
Direct pthread_mutex_lock/unlock calls         | 0.263               | 0.263
OSSpinLockLock                                 | 0.092               | 0.108

The test code is part of the linked CwlUtils project, but the file containing these performance tests (CwlMutexPerformanceTests.swift) is not attached to the test module by default and must be enabled deliberately.

Source: https://habr.com/ru/post/336260/
