📜 ⬆️ ⬇️

ARM64 and you

Somewhat overdue translation of a blog post that interested me about what actually gives a 64-bit processor on the iPhone without marketing husks. If the text seems too obvious to you, skip the “Basic Advantages and Disadvantages” section.

As soon as the iPhone 5S was announced, the technical media was filled with inaccurate articles. Unfortunately, writing good articles takes time, and the world of technical journalism values ​​speed more than reliability. Today, at the request of several of my readers, I will summarize what gives 64-bit ARM to the iPhone 5S in terms of performance, features and design.

64 bits


Let's first consider what 64 bits actually mean. There is a lot of confusion associated with this term, mainly due to the fact that there is no single well-established definition. However, there is a general understanding of this term. Bitness usually means either the size of a numeric register, or the size of a pointer. Fortunately, for most modern processors, their size is the same. Thus, 64-bit means that the processor has 64-bit numeric registers and 64-bit pointers.
')
It is also important to note that 64-bit does not mean, because and there are a lot of misunderstandings here. So, 64-bit does not define:

  1. The size of the addressable memory. The number of bits actually used in the index is not related to the processor’s bit depth. ARM processors use from 26 to 40 bits, and this number can vary in isolation from the processor’s bit depth.
  2. The width of the data bus. The amount of data requested from the RAM or cache is also not related to bit depth. Separate processor instructions may request arbitrary amounts of data, but the amount of actually requested data at a time may differ, either by splitting the requests into parts, or by requesting more than necessary. Already in the iPhone 5, the size of the requested data block is 64 bits, while the PC reaches 192 bits.
  3. Everything related to floating-point calculations. FPU registers are not related to architecture and ARM processors used 64-bit registers long before ARM64.


Basic advantages and disadvantages


If you compare identical processors 32 and 64 bit CPUs, you will not find big differences, so the significance of Apple’s transition to 64-bit ARMs is somewhat exaggerated. This is an important step, but an important one, mainly because of the features of ARM and the peculiarity of processor usage by Apple. However, there are some differences. The most obvious is 64-bit numeric registers work more effectively with 64-bit numbers. You can work with 64-bit numbers on a 32-bit processor, but this usually leads to working with two 32-bit parts, which works noticeably slower. 64-bit processors usually perform operations on 64-bit numbers as quickly as on 32-bit ones, so code that actively uses calculations with 64-bit numbers will run much faster.

Despite the fact that 64-bit is not directly related to the amount of addressable memory, it greatly facilitates the use of a large amount of RAM in one program. A program running on a 32-bit processor can address no more than 4GB of address space. Part of the memory is allocated for the operating system and standard libraries, which leaves 1-3GB for the program itself. If a 32-bit system has more than 4GB of RAM, then the use of this entire address space for the program becomes much more complicated. You will have to do frauds like sequential mapping of different parts of RAM into a part of the virtual address space or splitting one program into several processes.

Such tricks are extremely hard-working and can slow down the system greatly, so very few programmers actually use them. In practice, on 32-bit processors, each program uses up to 1-3GB RAM, and the whole value of having more physical RAM is the ability to run programs more at the same time and the ability to cache more data from the disk.

Increasing the amount of address space is also useful for systems with a small amount of RAM - memory-mapped files, the size of which may be more available RAM, because the operating system actually loads only those parts of the file that were accessed and, moreover, is able to “wipe out” the downloaded data back into the file, freeing up RAM. On 32-bit systems, files larger than 1-3GB cannot be displayed. On 64-bit systems, the address space is much larger, so there is no such problem.

Increasing the pointer size can be a tangible disadvantage: the program will use more memory (perhaps more) while running on a 64-bit processor. Increasing the memory used also “clogs” the cache, which reduces performance.

In a nutshell: 64-bit can increase the performance of some parts of the code and simplifies some techniques, like memory-mapped files. However, performance may suffer due to increased memory usage.

ARM64


The 64-bit processor in the iPhone 5S is not just an ARM with an increased register size, there are significant changes.

Firstly, I’ll note the name: the official name from ARM is “AArch64”, but this is a stupid name, which annoys me to type. Apple calls the ARM64 architecture and I will call as well.

ARM64 doubled the number of integer registers. The 32-bit ARM provides 16 integer registers, one of which is the program counter, two more are used for the pointer to the stack and the link register and 13 general-purpose registers. The ARM64 has 32 integer registers, with a zero register, a link register, and a frame pointer register. Another register is reserved by the platform, which leaves 28 general purpose registers.

ARM64 also increases the number of registers for floating-point numbers. The registers in 32-bit ARM are a bit strange, so it's hard to compare. A 32-bit ARM has 32 32-bit floating point registers, which can be represented as 16 overlapping 64-bit registers. In addition, there are 16 more independent 64-bit registers. ARM64 simplifies this up to 32 non-overlapping 128-bit registers that can be used for smaller data.

The number of registers can significantly affect performance. Memory is much slower than the processor, and reading / writing memory takes much more time than the execution of processor instructions. The processor tries to fix this with caches, but even the fastest cache is much slower than the processor registers. More registers - more data can be stored inside the processor. How much this affects performance depends on the specific code and the efficiency of the compiler, which optimizes the use of registers. When Intel architecture moved from 32 to 64 bits, the number of registers increased from 8 to 16, and this was a significant performance change. ARM already had more registers than the 32-bit Intel architecture, so the increase in registers would have a smaller effect on performance, but this change will still be noticeable.

ARM64 also introduced significant changes in addition to the increase in the number of registers.

Most 32-bit ARM instructions may be executed / not executed depending on the state of the register-condition. This allows you to broadcast conditional expressions (if-statements) without using branching. It was assumed that this would increase productivity, however, judging by the fact that ARM64 refused this opportunity, it gave rise to more problems than it did.

In the ARM64 SIMD (one-instruction-multi-data) suite, NEON fully supports the IEEE754 standard for double- precision floating-point numbers, while the 32-bit version of NEON only supports single precision and does not exactly follow the standard for some bits.

ARM64 added specialized instructions for AES encryption and SHA-1 & SHA-256 hashes. Not very useful in general, but a significant bonus if you deal with exactly these issues.

In general, the most important difference is the increase in the number of general-purpose registers and full support for IEEE754-compatible arithmetic on double-precision numbers in NEON. This can give a significant increase in performance in a large number of places.

Compatibility with 32-bit applications


It is important to note that A7 includes 32-bit compatibility mode, which allows you to run 32-bit applications without any changes. This means that the iPhone 5S can run any old applications without any impact on performance.

Changes in the execution period system


Apple takes advantage of the new architecture in its libraries. Since they do not need to worry about binary backward compatibility with such changes, this is a great time to make changes that would otherwise “break” existing applications.

In Max OS X 10.7, Apple introduced tagged pointers. Labeled pointers allow you to store some classes with a small amount of data in an instance directly in the pointer. This avoids memory allocations in some cases, for example NSNumber, and can provide significant performance gains. Labeled pointers are supported only on a 64-bit platform, partly due to the performance issue, and partly due to the fact that there is not much space under the “tags” in the 32-bit pointer. Apparently because of this, iOS did not have support for labeled pointers. Thus, support for labeled pointers is included in the ARM64 runtime Objective-C, which gives the same advantages as in Mac.

Although the pointer size is 64 bits, not all of these bits are actually used. On Mac OS X on x86-64, only 47 bits are used. In iOS, the ARM64 uses even less - only 33 bits. If you mask these bits each time before use, you can use the remaining bits to store additional data. This made it possible to make one of the most significant changes to the Objective-C runtime in its entire history.

Rethinking the isa pointer


Most of the information in this section is drawn from an article by Greg Parker . First, to refresh the memory: objects in Objective-C represent allocated blocks of memory. The first part, the size of a pointer, is isa. Normally, isa is a pointer to an object class. To learn more about how objects are stored in memory, read my other article .

To use the entire pointer size to the isa pointer is somewhat wasteful, especially on a 64-bit platform that does not use all 64-bit. ARM64 on iOS actually uses 33 bits, leaving 31 bits for other things. Classes in memory are aligned at the 8 byte boundary, so the last 3 bits can be discarded, giving 34 bits of isa available to store additional information. And Apple's runtime in ARM64 uses this to improve performance.

Probably the most important optimization was the inline link counter. Virtually all objects in Objective-C have a reference count (except for immutable objects, such as the NSString literals) and retain / release operations that change this counter occur very often. This is especially critical for ARC, which inserts retain / release calls more often than a programmer would. Thus, high performance retain / release methods are extremely important.

Normally, the reference count is not stored in the object itself. It would be possible to store a reference counter in the object itself, but this would take up too much space. It is not so important now, but then, long ago, it was very significant. Because of this, the reference count is stored in an external hash table. Each time you call retain, the following actions are taken:
  1. Get the global hash table of pointer counters
  2. Block the hash table so that the operation is thread-safe
  3. Find a counter in the hash table for a given object
  4. Increment the counter by one and save it back to the table.
  5. Release the hash table lock

Slow enough! The implementation of the hash table of counters has been made very efficient for a hash table, but it is still much slower than DMA. In ARM64, the 19-bit isa pointer is used to store the reference count. This means that the retain call is simplified to:
  1. Perform an atomic increase of a portion of the isa-field

And that's it! It should be much, much faster.

In fact, everything is somewhat more complicated due to different regional cases, which also need to be processed. The true sequence of actions is approximately as follows:
  1. The last bit in isa tells if the extra bits are used in isa to store the reference count. If not, the old algorithm with hash tables is used.
  2. If the object is already deleted, nothing is done.
  3. If the counter overflows (which rarely happens, but it is quite possible with 19 bits), the old algorithm with the hash table is used.
  4. Make atomic isa change to new value

Most of these actions were still required in the previous approach, and they do not slow down the operation too much. So the new approach should still be much faster than the old one.

Using the remaining links unused under the counter in isa allowed to speed up the allocation of objects. Theoretically, you need to perform a bunch of actions when an object in Objective-C is deleted and the ability to skip unnecessary ones can significantly increase performance. These steps are:
  1. If an object did not have associated objects set using objc_setAssociatedObject, they should not be deleted.
  2. If the object does not have a C ++ destructor (which is called during dealloc), it also does not need to be called.
  3. If the object has never been referenced by a weak (__weak) pointer, then these pointers should not be reset.

Previously, all of these flags were stored in classes. For example, if at least one instance of the class ever stored the associated object, then all instances of this class deleted the associated object during the deployment. Tracking these flags for each instance separately allowed calling the appropriate methods only for those instances for which it was really needed.

In total, this is a significant gain. My benchmarks showed that creating and deleting a simple object takes 380ns per 5S in 32-bit mode, while in 64-bit it takes only 200ns. If at least one copy ever had a weak reference to itself, then in 32-bit mode the deletion time for all increased to 480 ns, while in 64-bit mode, the time remained in the region of 200 ns for all instances on which weak links are not It was.

In short, improvements in runtime are such that in 64-bit mode, the allocation time takes 40-50% of the allocation time in 32-bit mode. If your application creates and deletes many objects, this can be significant.

Conclusion


64-bit A7 is not just a marketing ploy, but it is also an unbelievable breakthrough that will allow you to create a new class of applications. The truth, as always, lies in the middle.

The mere fact of switching to 64 bits gives a bit. In some cases, this speeds up the applications, and most of the programs increase the amount of memory used. In general, there is no big difference.

ARM architecture has changed not only in 64 bits. The increased number of registers and the revised, upgraded instruction set gives a good performance boost compared to 32-bit ARM.

Apple used the transition to a new architecture to improve runtime. The main change is the embedded link counter, which allows you to avoid expensive hash table searches. So retain / release operations are very frequent in Objective-C, this is a significant gain. Removing resources, depending on the flags, makes the removal of objects almost twice as fast. Tagged pointers also add performance and reduce memory consumption.

ARM64 is a nice addition from Apple. We all knew that it would happen sooner or later, but few expected it so soon. But it is, and it is great.

Source: https://habr.com/ru/post/197854/


All Articles