
Swift Speed Secrets

From the moment Swift was announced, speed has been a key part of its marketing; it is right there in the name of the language (swift means "fast"). Apple claimed it is faster than dynamic languages like Python and JavaScript, potentially faster than Objective-C, and in some cases even faster than C! But how exactly is that achieved?

Caveats


Although the language itself leaves tremendous room for optimization, the current version of the compiler does not yet take full advantage of it, and getting any wins at all in performance tests cost me a lot of effort. This is mainly because the compiler emits a lot of unnecessary retain/release operations. I expect this will be fixed quickly in upcoming versions, but for now I can only talk about how Swift can potentially be faster than Objective-C in the future.

Faster Method Dispatch


As you know, every method call in Objective-C is translated by the compiler into a call to the objc_msgSend function, which finds and invokes the right method at runtime. It receives the method selector and the object, and searches the object's method tables for the implementation that will handle the call. The function is very fast, but it often does far more work than is actually needed.
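As a rough illustration, the shape of that work can be sketched in C. This is a hedged toy model, not the real Objective-C runtime (which uses per-class hash caches and hand-tuned assembly); all of the names here are invented:

```c
#include <stddef.h>
#include <string.h>

/* Toy model of selector-based dispatch: find the implementation by
   comparing selectors, walking up the superclass chain. */
typedef const char *SEL;

typedef struct {
    SEL sel;
    int (*imp)(void *self);
} Method;

typedef struct ClassInfo {
    const struct ClassInfo *superclass;
    const Method *methods;
    int methodCount;
} ClassInfo;

typedef struct {
    const ClassInfo *isa;
} Object;

int msg_send(Object *obj, SEL sel) {
    for (const ClassInfo *c = obj->isa; c; c = c->superclass)
        for (int i = 0; i < c->methodCount; i++)
            if (strcmp(c->methods[i].sel, sel) == 0)
                return c->methods[i].imp(obj);
    return -1; /* the real runtime would enter message forwarding here */
}

/* A sample class with a single method, for demonstration. */
static int hello_imp(void *self) { (void)self; return 42; }
static const Method base_methods[] = { { "hello", hello_imp } };
static const ClassInfo SomeBaseClass = { NULL, base_methods, 1 };
```

Even in this stripped-down form, every call pays for a search; the real runtime's caches make the search fast, but not free.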
Message passing is convenient because the compiler makes no assumptions about which object will arrive at runtime. It may be an instance of the expression's type, a subclass, or an entirely different class. You can even lie to the compiler, and everything will still work as expected.

On the other hand, in 99.999% of cases you do not lie to the compiler. When an object is declared as NSView *, it really is either an NSView or a subclass of it. Dynamic dispatch is necessary, but true message forwarding almost never is; the nature of Objective-C, however, forces you to pay for the most expensive kind of call every time.

Here is an example of Swift code:

 class Class {
     func testmethod1() { print("testmethod1") }
     @final func testmethod2() { print("testmethod2") }
 }

 class Subclass: Class {
     override func testmethod1() { print("overridden testmethod1") }
 }

 func TestFunc(obj: Class) {
     obj.testmethod1()
     obj.testmethod2()
 }

In the equivalent Objective-C code, the compiler would turn both method calls into objc_msgSend calls, and that would be the end of it.

In Swift, the compiler can take advantage of the stricter guarantees the language provides. We cannot lie to it: if the expression's type is Class, the object is either exactly of that type or of a subclass.

Instead of calling objc_msgSend, the Swift compiler generates code that dispatches the call through a virtual method table (vtable), which is essentially just an array of function pointers stored inside the class. The code generated for the first call looks roughly like this:

 methodImplementation = object->class.vtable[indexOfMethod1]
 methodImplementation()

Despite all the caching and hand-written assembly in objc_msgSend, a plain indexed load from an array will always be noticeably faster, and that is a tangible win.
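For contrast with the selector search, here is an equally hypothetical C model of vtable dispatch: one pointer load plus an indexed call, with no search at all. All names are invented for illustration:

```c
/* Hypothetical C model of vtable dispatch: the class holds a fixed
   array of function pointers, and a call is just an indexed load. */
typedef struct VObject VObject;
typedef int (*VMethod)(VObject *);

typedef struct {
    VMethod vtable[2]; /* slot 0: testmethod1, slot 1: testmethod2 */
} VClass;

struct VObject {
    const VClass *isa;
};

static int base_method1(VObject *self) { (void)self; return 1; }
static int base_method2(VObject *self) { (void)self; return 2; }
static int sub_method1(VObject *self)  { (void)self; return 101; }

/* The subclass overrides slot 0 and inherits slot 1 unchanged. */
static const VClass BaseClass = { { base_method1, base_method2 } };
static const VClass SubClass  = { { sub_method1,  base_method2 } };

int call_method1(VObject *obj) {
    return obj->isa->vtable[0](obj); /* load the table, index, call */
}
```

The override still works dynamically, but the dispatch cost is constant and tiny compared with a selector lookup.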

The call to testmethod2 fares even better. Because the method is declared with the @final modifier, the compiler can guarantee it is never overridden: no matter what, the call always resolves to the implementation in Class. Thanks to this, the compiler can skip even the vtable lookup and call the implementation directly; in my build it lives in a function with the mangled name __TFC9speedtest5Class11testmethod2fS0_FT_T_ .

Of course, this is not a huge performance breakthrough by itself, and Swift will still use objc_msgSend when talking to Objective-C objects. But it is a real gain nonetheless.

Smarter Method Calls


Optimizing method calls can go much further than simply picking a cheaper dispatch scheme. By analyzing the control flow, the compiler can, for example, inline method calls or eliminate them entirely.

For example, let's strip out the body of testmethod2, leaving it empty:

 @final func testmethod2() {} 

The compiler can see that this method now does nothing. With optimizations enabled, no call to it is generated at all; only testmethod1 gets called.

Similar techniques work not only for methods marked with the @final attribute. For example, modify the code slightly:

 let obj = Class()
 obj.testmethod1()
 obj.testmethod2()

Since the compiler sees where and how the obj variable is initialized, it can be sure that by the time testmethod1 is called, obj cannot hold an instance of a subclass, so dynamic dispatch is unnecessary for either call.

Consider another extreme case:

 for i in 0..1000000 {
     obj.testmethod2()
 }

In Objective-C, this code would send a million messages and take a noticeable amount of time. Swift knows the method has no side effects and can, with a clear conscience, remove the entire loop, letting the code finish instantly.

Fewer memory allocations


Given enough information, the compiler can also remove unnecessary memory allocations. For example, if an object is created and used only within a local scope, it can be placed on the stack instead of the heap, which is much faster. In rare cases where the method calls on an object never touch the object itself, the allocation can be avoided altogether! Here, for example, is a rather silly piece of Objective-C code:

 for (int i = 0; i < 1000000; i++)
     [[[NSObject alloc] init] self];

Objective-C will dutifully create and destroy a million objects, sending three million messages along the way. For the equivalent Swift code, a sufficiently smart compiler may generate no instructions at all, provided the self method does nothing useful and never touches the object it was called on.

More efficient use of registers


Every Objective-C method takes two implicit parameters, self and _cmd, before all the declared ones. On most architectures (including x86-64, ARM, and ARM64) the first few parameters are passed in registers and the rest go on the stack. Registers are much faster, so how parameters are passed has a real effect on performance.

The implicit _cmd parameter is almost never used. It is only needed for true dynamic message forwarding, which 99.999% of Objective-C code never performs. Yet it still occupies a register, and argument registers are scarce: four on ARM, six on x86-64, eight on ARM64.

Swift has no such parameter, which frees up a register for one more useful argument. For methods that take many arguments, this means a small performance boost on every call.
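To make the register cost concrete, here is a hedged C sketch of the two calling conventions. The function names are invented, and the lowered form of an Objective-C method is heavily simplified:

```c
typedef void *id;
typedef void *SEL;

/* Roughly how an Objective-C method is lowered to a plain function:
   self and _cmd occupy the first two argument registers before any
   declared parameters get a chance. */
int objc_style_sum(id self, SEL _cmd, int a, int b, int c) {
    (void)self; (void)_cmd;
    return a + b + c;
}

/* A Swift-style entry point passes self but no _cmd, freeing one
   argument register. On 32-bit ARM, with only four argument
   registers, objc_style_sum must spill `c` to the stack, while
   swift_style_sum keeps every argument in a register. */
int swift_style_sum(id self, int a, int b, int c) {
    (void)self;
    return a + b + c;
}
```

The functions compute the same thing; the difference is purely in how the arguments travel, which is exactly where the cycles are saved or lost.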

Pointer Aliasing


There are plenty of examples where Swift beats Objective-C, but what about plain C?

A pointer is said to alias another when both point to the same region of memory. For example:

 int *ptrA = malloc(100 * sizeof(*ptrA));
 int *ptrB = ptrA;

This complicates things: a write through ptrA affects subsequent reads through ptrB, and vice versa. That severely limits the optimizations the compiler is allowed to perform.
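A minimal illustration of why this matters (a hypothetical function, not from the article):

```c
/* If a and b may alias, the compiler cannot assume *a is still 1 at
   the return statement: it must reload it after the store through b. */
int store_and_read(int *a, int *b) {
    *a = 1;
    *b = 2;
    return *a; /* 1 when a and b are distinct, 2 when they alias */
}
```

Both behaviors are legal C, so the compiler has to emit code that is correct for both, paying for a reload that is almost always unnecessary.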

For example, consider a naive implementation of the standard library's memcpy function:

 void *mymemcpy(void *dst, const void *src, size_t n) {
     char *dstBytes = dst;
     const char *srcBytes = src;
     for (size_t i = 0; i < n; i++)
         dstBytes[i] = srcBytes[i];
     return dst;
 }

Byte-by-byte copying is, of course, quite inefficient. We would much rather copy the data in larger chunks: SIMD instructions can move 16 or 32 bytes at a time, which would speed the function up considerably. In theory the compiler could recognize what this loop does and emit those instructions, but because the pointers may alias, it has no right to.

To see why, consider the following code:

 char *src = strdup("hello, world");
 char *dst = src + 1;
 mymemcpy(dst, src, strlen(dst));

With the standard memcpy this would be undefined behavior, since memcpy forbids copying between overlapping regions. Our function performs no such check, and in this case it behaves in an unexpected way: on the first iteration the 'h' is copied from the first position into the second, on the next from the second into the third, and so on, until the entire string is filled with the same character. Not exactly what we were hoping for.
It is precisely for this reason that memcpy does not accept overlapping pointers. For that case there is a separate function, memmove, but it requires extra work and is accordingly slower.
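The difference is easy to demonstrate. Here is a small self-contained sketch contrasting the naive copy with memmove on overlapping buffers (it reproduces the article's mymemcpy so it can stand alone):

```c
#include <string.h>

/* The naive forward copy from above, repeated here so this sketch is
   self-contained; it misbehaves on overlapping regions. */
void *mymemcpy(void *dst, const void *src, size_t n) {
    char *dstBytes = dst;
    const char *srcBytes = src;
    for (size_t i = 0; i < n; i++)
        dstBytes[i] = srcBytes[i];
    return dst;
}

/* With dst = src + 1, the naive copy keeps re-reading bytes it has
   already overwritten and smears the first character across the
   buffer. memmove copies as if through a temporary buffer and shifts
   the string correctly instead. */
```

Calling mymemcpy(buf + 1, buf, 11) on "hello, world" yields "hhhhhhhhhhhh", while memmove with the same arguments yields the shifted "hhello, worl".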

The compiler knows nothing about this context. It does not know that we intend to pass non-overlapping pointers into the function. If an optimization changes the result in the overlapping case, it cannot be applied, even though we only care about the non-overlapping one. As far as the compiler can tell, we want the string "hhhhhhhhhhhh": the code we wrote demands it, and any optimization must preserve that behavior exactly, even if we could not care less.

Clang does an admirable job on this function: it generates code that checks at runtime whether the pointers overlap and uses the optimized algorithm only when they do not. The cost of that check, which the compiler is forced to emit for lack of context, is small, but it is not zero.

This problem is pervasive in C, because any two pointers of the same type may refer to the same memory. Most code is written on the assumption that pointers do not overlap, but by default the compiler must account for the possibility. This makes programs harder to optimize and slower than they could be.

The problem is common enough that the C99 standard added a new keyword, restrict, which tells the compiler that the pointers do not alias. Applying this qualifier to our parameters lets the compiler generate better code:

 void *mymemcpy(void * restrict dst, const void * restrict src, size_t n) {
     char *dstBytes = dst;
     const char *srcBytes = src;
     for (size_t i = 0; i < n; i++)
         dstBytes[i] = srcBytes[i];
     return dst;
 }

So, problem solved? Well... how often have you used this keyword in your own code? I suspect most readers' answer is "never". In my case, I used it for the first time in my life while writing the example above. It gets applied in the most performance-critical spots; everywhere else we just shrug at the lost performance and move on.

Aliasing can surface in places where you least expect it. For example:

 - (int)zero {
     _count++;
     memset(_ptr, 0, _size);
     return _count;
 }

The compiler must allow for the possibility that _count occupies the same memory that _ptr points to. So it generates code that increments _count, stores the value, calls memset, and then reloads _count before returning. We know that _count cannot change during the memset and the reload is unnecessary, but the compiler is obliged to do it, just in case. Compare this with the following:

 - (int)zero {
     memset(_ptr, 0, _size);
     _count++;
     return _count;
 }

With the memset call hoisted to the top, the reload of _count disappears. A tiny win, but a win nonetheless.

Even an innocent-looking NSError ** can affect things. Imagine a method whose interface allows for an error, but whose current implementation never produces one:

 - (int)increment:(NSError **)outError {
     _count++;
     *outError = nil;
     return _count;
 }

Again, the compiler is forced to generate a redundant reload of _count in case outError happens to point at the same memory as _count. That would be very strange, and C's aliasing rules normally forbid pointers of different types from overlapping, which would make it safe to drop the reload; apparently Objective-C's additions relax those rules somehow. You could, of course, add restrict, but you are unlikely to remember to at the right moment.

In Swift this comes up much less often: as a rule you do not work with raw pointers, and the semantics of arrays rule out aliasing. This lets the Swift compiler generate better code, with fewer redundant stores and reloads, even in situations where pointers could otherwise overlap.

Summing up


Swift has several tricks up its sleeve that allow it to generate better code than Objective-C. Less dynamic dispatch, the ability to inline methods, and dropping unneeded implicit arguments all make calls faster. And since raw pointers are rare in Swift, the compiler can optimize more aggressively.

Translator's Note:

The article was written a month and a half ago. Since then, posts have appeared confirming the optimizer's excellent work in practice. You do not need to read English, just look at the table.

Source: https://habr.com/ru/post/233927/

