Translation of an article by Matt Gallagher. This article focuses on the absence of threading and thread-synchronization tools in Swift. We will look at the proposal to add concurrency to Swift, and at how, until that capability arrives, threaded execution in Swift will involve traditional mutexes and shared mutable state.
Using a mutex in Swift is not particularly difficult, but it highlights a subtle performance issue in Swift: dynamic memory allocation during closure capture. We want our mutex to be fast, yet passing a closure to be executed inside a mutex can make it 10 times slower due to the memory overhead. Let's look at several ways to solve this problem.
The absence of threading in Swift
When Swift was first announced in June 2014, it had two obvious omissions:
- error handling,
- threading and thread synchronization.
Error handling was implemented in Swift 2 and was one of the key features of this release.
Threading, for the most part, is still ignored by Swift. Instead of language-level threading support, Swift includes the Dispatch module (libdispatch, aka Grand Central Dispatch) on all platforms, and implicitly suggests that we use Dispatch rather than expect help from the language.
Delegating this responsibility to a bundled library seems especially strange compared to other modern languages, such as Go and Rust, where threading primitives and strict thread safety (respectively) are headline features. Even the @synchronized and atomic properties of Objective-C look like a generous offer compared to the absence of anything similar in Swift.
What is the reason for such an obvious omission in this language?
Future "multithreading" in Swift
The answer is briefly discussed in the concurrency proposal in the Swift repository. I mention this proposal to emphasize that the Swift developers would like to do something about concurrency in the future, but keep in mind what Swift developer Joe Groff says: "this document is just a proposal, not an official statement of direction." The proposal describes a scenario where, as in Cyclone or Rust, references cannot be shared between threads of execution. Whether or not the result resembles those languages, it appears that Swift plans to eliminate shared memory between threads, except for types implementing Copyable that are passed through strictly controlled channels (called Streams in the proposal). A kind of coroutine would also appear (called Tasks in the proposal), behaving like asynchronous dispatch blocks that can be paused and resumed.
The proposal goes on to state that the most common language-level concurrency features (analogous to chan in Go, async/await in .NET, actors in Erlang) could then be implemented in libraries on top of the Stream/Task/Copyable primitives.
Sounds good, but when should we expect this concurrency support in Swift? Swift 4? Swift 5? Not soon.
So for now it does not help us; if anything, it gets in the way.
The impact of future features on the current language
The problem is that Swift avoids simple concurrency primitives in the language, and thread-safe versions of language features, on the grounds that they would be replaced or made obsolete by a future concurrency model.
Evidence of this can be found in the Swift-Evolution mailing list:
- Accesses to references (both strong and weak) are undefined "if there is a read/write, write/write, or anything/destroy race on the variable". There is no intention to change this behavior or to offer a built-in "atomic" approach, since this is "one of the few rules of undefined behavior that we accept". A possible "fix" for this undefined behavior would be the new concurrency model.
- Result types (or throws semantics beyond function interfaces) would be useful for numerous "continuation passing style" algorithms; they were carefully discussed, but will ultimately be set aside until Swift "provides proper language support [for coroutines or asynchronous promises]" as part of the concurrency changes.
Trying to find a fast general-purpose mutex
In short: if we need multi-threaded behavior, we have to build it ourselves using the existing threading and mutex facilities.
The standard mutex advice in Swift: use a DispatchQueue and call sync on it.
I like libdispatch, but in most cases using DispatchQueue.sync as a mutex is the slowest way to solve the problem: more than an order of magnitude slower than the alternatives, because of the unavoidable cost of capturing the closure passed to sync. The closure must capture the surrounding state (in particular, a reference to the protected resource), and that capture means a closure context allocated in dynamic memory. Until Swift is able to optimize non-escaping closures onto the stack, the only way to avoid the heap cost of closures is to make sure they are inlined, which, unfortunately, is impossible across module boundaries such as the Dispatch module boundary. That makes DispatchQueue.sync an unnecessarily slow Swift mutex.
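For reference, this is what the standard advice looks like in practice (a minimal sketch; the queue label and protectedState are illustrative names, not from the original):

import Dispatch

let queue = DispatchQueue(label: "com.example.mutex")
var protectedState = 0

func increment() -> Int {
    return queue.sync { () -> Int in
        // The closure captures `protectedState`; this capture is the
        // hidden cost discussed in the rest of this article.
        protectedState += 1
        return protectedState
    }
}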
The next most commonly suggested option is objc_sync_enter/objc_sync_exit. While 2-3 times faster than libdispatch, it is still somewhat slower than ideal (because it is always a re-entrant mutex), and it depends on the Objective-C runtime (so it is limited to Apple platforms).
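A sketch of that approach, for comparison (the class and its names are illustrative; any Objective-C-compatible object can serve as the lock token):

import Foundation

final class ObjcSyncCounter {
    private var counter = 0

    func increment() {
        // Locking on `self` is the common pattern here.
        objc_sync_enter(self)
        defer { objc_sync_exit(self) }
        counter += 1
    }
}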
The fastest mutex option, OSSpinLock, is more than 20 times faster than dispatch_sync. Beyond the general limitations of spinlocks (high CPU usage when several threads try to enter at the same time), it has serious problems on iOS that make it completely unsuitable there; accordingly, it can only be used on the Mac.
If you are targeting iOS 10 or macOS 10.12 or newer, you can use os_unfair_lock_t. Its performance should be close to OSSpinLock while avoiding its most serious problems. However, this lock is not FIFO: the mutex is handed to an arbitrary waiter (hence "unfair"). You'll have to decide whether that matters for your program, but in general it means this should not be your first choice for a general-purpose mutex.
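A minimal sketch of wrapping os_unfair_lock (the class wrapper is an assumption, used here to give the lock a stable address):

import os.lock

final class UnfairLock {
    private var unfairLock = os_unfair_lock()

    func sync<R>(execute: () throws -> R) rethrows -> R {
        os_unfair_lock_lock(&unfairLock)
        defer { os_unfair_lock_unlock(&unfairLock) }
        return try execute()
    }
}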
All of these problems leave pthread_mutex_lock/pthread_mutex_unlock as the only reasonable, performant and portable option.
Mutexes and the pitfalls of closure capture
Like most pure C APIs, pthread_mutex_t has a rather clumsy interface, so a Swift wrapper helps (especially for construction and automatic cleanup). It is also useful to have a "scoped" mutex: one that accepts a function and executes it inside the mutex, guaranteeing balanced "lock" and "unlock" calls on either side of the function. Let's call our wrapper PThreadMutex. Here is the implementation of a simple scoped-mutex function in that wrapper:
public func sync<R>(execute: () -> R) -> R {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    return execute()
}
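For context, the wrapper class around it might look like this (a minimal sketch assuming a default, non-recursive mutex; the actual CwlUtils implementation offers more configuration):

import Darwin

public final class PThreadMutex {
    // Exposed so the file-local `sync` variants discussed below can lock it directly.
    public var m = pthread_mutex_t()

    public init() {
        pthread_mutex_init(&m, nil)
    }
    deinit {
        pthread_mutex_destroy(&m)
    }
}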
It should be fast, but it isn't. Can you see why?
The problem arises from compiling this reusable function as part of a separate module, CwlUtils. This leads to exactly the same problem as DispatchQueue.sync: the capture performed by the closure causes a heap allocation. Because of that overhead, the function runs more than 10 times slower than it should (3.124 seconds for 10 million calls, versus an ideal 0.263 seconds).
What exactly is "captured"? Let's consider the following example:
mutex.sync { doSomething(&protectedMutableState) }
To do anything useful inside the mutex, a reference to protectedMutableState must be stored in the "closure context", which is data located in dynamic memory.
That may seem harmless enough (after all, capturing is what closures do). But if the sync function cannot be inlined into its caller (because it is in another module, or in another file with whole-module optimization turned off), then the capture will incur a heap allocation.
And we don't want that. To avoid it, instead of capturing state, pass it in through a parameter.
WARNING: the next few code examples get increasingly ridiculous, and in most cases I suggest not imitating them. I do this to show the depth of the problem. See the section "Another approach" for what I actually use in practice.
public func sync_2<T>(_ p: inout T, execute: (inout T) -> Void) {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    execute(&p)
}
That's better... now the function runs at full speed (0.282 seconds for the 10-million-call test).
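The call site then passes the protected state in as a parameter instead of capturing it (a sketch, reusing the earlier hypothetical names):

mutex.sync_2(&protectedMutableState) { doSomething(&$0) }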
That solves the problem for values passed into the function. A similar problem occurs with returning a result. The following function:
public func sync_3<T, R>(_ p: inout T, execute: (inout T) -> R) -> R {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    return execute(&p)
}
shows the same slow behavior as the original, even when the closure captures nothing (1.371 seconds: better than the capturing version, but still far from ideal). To deliver its result, the closure performs a heap allocation.
We can fix this by adding a second inout parameter for the result:
public func sync_4<T, U>(_ p1: inout T, _ p2: inout U, execute: (inout T, inout U) -> Void) {
    pthread_mutex_lock(&m)
    defer { pthread_mutex_unlock(&m) }
    execute(&p1, &p2)
}
and call it like this:
// assumes `mutableState` and `result` are declared beforehand
mutex.sync_4(&mutableState, &result) { $1 = doSomething($0) }
We're back to full speed, or close enough to it (0.307 seconds for 10 million calls).
Another approach
One advantage of closure capture is how effortless it seems. Captured items have the same names inside and outside the closure, and the connection between them is obvious. When we avoid capture and instead pass everything as parameters, we have to either rename all our variables or give them shadowed names, which does nothing for comprehension, and we still risk accidentally capturing a variable and degrading performance again.
Let's set all that aside and solve the problem differently.
We can write a free sync function in our own file that takes the mutex as a parameter:
private func sync<R>(mutex: PThreadMutex, execute: () throws -> R) rethrows -> R {
    pthread_mutex_lock(&mutex.m)
    defer { pthread_mutex_unlock(&mutex.m) }
    return try execute()
}
With the function placed in the same file it is called from, everything almost works. We get rid of the heap allocation, and execution time drops from 3.043 to 0.374 seconds. But we still haven't reached the 0.263 seconds of direct pthread_mutex_lock/pthread_mutex_unlock calls. What's wrong this time?
It turns out that even with a private function in the same file, where Swift can fully inline it, Swift does not eliminate the redundant retain and release of the PThreadMutex parameter (which is a class, so that the pthread_mutex_t isn't broken by copying).
We can force the compiler to avoid those retains and releases by making the function an extension on PThreadMutex rather than a free function:
extension PThreadMutex {
    private func sync<R>(execute: () throws -> R) rethrows -> R {
        pthread_mutex_lock(&m)
        defer { pthread_mutex_unlock(&m) }
        return try execute()
    }
}
This makes Swift treat the self parameter as @guaranteed, eliminating the retain/release cost, and we finally reach 0.264 seconds.
Semaphores, not mutexes?
Why not use dispatch_semaphore_t? The advantage of dispatch_semaphore_wait and dispatch_semaphore_signal is that no closure is required: they are separate, unscoped calls.
You can use a dispatch_semaphore_t to build a mutex-like construct:
public struct DispatchSemaphoreWrapper {
    let s = DispatchSemaphore(value: 1)
    init() {}
    func sync<R>(execute: () throws -> R) rethrows -> R {
        _ = s.wait(timeout: DispatchTime.distantFuture)
        defer { s.signal() }
        return try execute()
    }
}
It turns out to be about a third faster than the pthread_mutex_lock/pthread_mutex_unlock mutex (0.168 seconds versus 0.244). But despite the speed gain, a semaphore is not the best choice for a general-purpose mutex.
Semaphores are prone to a number of bugs and problems, the most serious being forms of priority inversion. Priority inversion is the same class of problem that makes OSSpinLock unusable on iOS, but for semaphores it is a bit more involved.
With a spinlock, priority inversion means:
- A high-priority thread is active, spinning, and waiting for a lock held by a lower-priority thread.
- The low-priority thread never releases the lock, because it is starved of CPU time by the higher-priority thread.
With a semaphore, priority inversion means:
- A high-priority thread waits on the semaphore.
- A medium-priority thread does work unrelated to the semaphore.
- A low-priority thread is expected to signal the semaphore so that the high-priority thread can continue.
The medium-priority thread will starve the low-priority one (that's normal for thread priorities). But because the high-priority thread is waiting for the low-priority thread to signal the semaphore, the high-priority thread is also starved by the medium-priority one. Ideally, that should never happen.
If a proper mutex were used instead of the semaphore, the high-priority thread's priority would be donated to the lower-priority thread holding the mutex while the high-priority thread waits on it, letting the low-priority thread finish its work and unblock the high-priority one. A semaphore, however, is not "held" by any thread, so no priority donation can occur.
Ultimately, semaphores are a good way to communicate completion notifications between threads (something that is not easy with mutexes), but semaphore designs are complex and carry risks. Their use should be limited to situations where you know all the participating threads and their priorities in advance, that is, when the priority of the waiting thread is known to be equal to or lower than the priority of the signaling thread.
All this may sound contrived: you probably don't deliberately create threads with different priorities in your programs. However, the Cocoa frameworks add a complication: they use dispatch queues everywhere, and every dispatch queue has a "QoS class", which may cause the queue to run at a different thread priority. Unless you know the QoS of every task in your program (including user-interface and other tasks queued by the Cocoa frameworks), you may suddenly find yourself in a multi-threaded priority mess. Better to avoid it.
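For example, two queues created like this (an illustrative sketch; the labels are made up) can end up backed by threads of very different priorities:

import Dispatch

let maintenanceQueue = DispatchQueue(label: "com.example.maintenance", qos: .background)
let interactiveQueue = DispatchQueue(label: "com.example.interactive", qos: .userInteractive)

// Work on `interactiveQueue` that waits for a semaphore signaled from
// `maintenanceQueue` is exactly the inversion pattern described above.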
Usage
A project containing the PThreadMutex and DispatchSemaphoreWrapper implementations is available on Github. The CwlMutex.swift file is fully self-contained, so you can simply copy that file if it's all you need. Alternatively, the ReadMe.md file explains how to clone the whole repository and add the framework it builds to your own projects.
Conclusion
pthread_mutex_t remains the best and safest mutex option in Swift, on both Mac and iOS. In the future, Swift will likely gain the ability to optimize non-escaping closures onto the stack, or to inline across module boundaries. Either improvement would eliminate the inherent problems of DispatchQueue.sync, probably making it the better option. For now, though, it is too inefficient.
While semaphores and other "lightweight" locks are valid approaches in some scenarios, they are not general-purpose mutexes and bring extra design considerations and risks.
Whichever mutex machinery you choose, you need to be careful about inlining to get maximum performance; otherwise closure-capture overhead can slow the mutex down by a factor of 10. In the current version of Swift, that may mean copying and pasting the code into the file where it is used.
Threading, inlining, and optimization are all areas where we can expect significant changes beyond Swift 3. But current Swift users have to work with Swift 2.3 and Swift 3, and this article describes the behavior of those versions when trying to get maximum performance from a scoped mutex.
Appendix: performance measurements
A simple loop was run 10 million times: enter the mutex, increment a counter, exit the mutex. The "slow" versions of DispatchSemaphore and PThreadMutex were compiled as part of a dynamic framework, separate from the test code.
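The harness in CwlMutexPerformanceTests.swift is the authoritative version; as a rough illustration of the shape of each test, it looks something like this (an XCTest-based sketch with assumed names):

import XCTest

class MutexPerformanceTests: XCTestCase {
    func testPThreadMutexCapturingSyncPerformance() {
        let mutex = PThreadMutex()
        var counter = 0
        measure {
            // Enter the mutex, increment, exit; 10 million iterations.
            for _ in 0..<10_000_000 {
                mutex.sync { counter += 1 }
            }
        }
    }
}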
Results:
Mutex option | Seconds (Swift 2.3) | Seconds (Swift 3)
---|---|---
PThreadMutex.sync (closure capture) | 3.043 | 3.124
DispatchQueue.sync | 2.330 | 3.530
PThreadMutex.sync_3 (returning a result) | 1.371 | 1.364
objc_sync_enter | 0.869 | 0.833
sync(PThreadMutex) (function in the same file) | 0.374 | 0.387
PThreadMutex.sync_4 (two inout parameters) | 0.307 | 0.310
PThreadMutex.sync_2 (single inout parameter) | 0.282 | 0.284
PThreadMutex.sync (inlined, non-capturing) | 0.264 | 0.265
Direct pthread_mutex_lock/unlock calls | 0.263 | 0.263
OSSpinLockLock | 0.092 | 0.108
The test code is part of the linked CwlUtils project, but the file containing these performance tests (CwlMutexPerformanceTests.swift) is not attached to the test module by default and must be enabled deliberately.