
Increasing code performance: think about the data first



Working in graphics rendering, we live in a world where low-level optimizations are required to keep GPU frames within 30 ms. To get there we use various techniques and build new, faster rendering passes from scratch (geometry attributes, texture cache, export, etc.), GPR compression, latency hiding, ROP...

On the CPU side, various tricks were used to improve performance back in the day, and it is noteworthy that some of them are now applied to modern video cards to speed up ALU computations (Low Level Optimization for AMD GCN, Fast Inverse Square Root in Quake).

Fast inverse square root in Quake

But recently, especially in light of the transition to 64 bits, I have noticed more and more unoptimized code, as if the industry were rapidly losing all of its previously accumulated knowledge. Yes, old tricks like the fast inverse square root are counterproductive on modern processors. But programmers should not forget about low-level optimizations and hope that compilers will solve all their problems. They won't.

This article is not an exhaustive hardcore hardware guide. It is just an introduction, a reminder, a set of basic principles for writing efficient code for the CPU. I want to show that low-level thinking is still useful today, even when we are talking about modern processors.

In this article, we will look at caching, vector programming, reading and understanding assembly code, and writing compiler-friendly code.

Why bother?


Don't forget about the gap


In the 1980s, the memory bus ran at the same frequency as the CPU, and latency was almost zero. But processor performance grew exponentially in accordance with Moore's law, while the performance of RAM chips did not keep pace, so memory soon became a bottleneck. The point is not that faster memory is impossible to build: it is possible, just not economically viable.


Processor vs. memory speed over time

To reduce the impact of slow memory, CPU designers added a tiny amount of very fast, very expensive memory between the processor and main memory - this is how the processor cache appeared.



The idea is this: there is a good chance that the same code or data will be needed again within a short period of time.


The CPU cache is a sophisticated mechanism for increasing performance, but without the programmer's help it cannot do its job well. Unfortunately, many developers are unaware of the cost of memory accesses and the structure of the CPU cache.

Data-oriented design


We are interested in game engines. They process ever-increasing amounts of data, transforming and displaying it in real time. Given this, and the need to do it efficiently, the programmer must understand what data he is processing and know the hardware his code will run on. In other words, he must embrace data-oriented design (DoD).

Or maybe the compiler will do it for me?



Simple addition. On the left, C++; on the right, the resulting assembly

Let's look at the example above on an AMD Jaguar processor (similar to those used in game consoles); useful links: AMD's Jaguar Microarchitecture: Memory Hierarchy, AMD Athlon 5350 APU and AM1 Platform Review - Performance - System Memory:


Even in such a simple example, most of the processor's time is spent waiting for data, and in more complex programs the situation does not get better until the programmer pays attention to the underlying architecture.

In short, compilers:


The compiler has very little room to maneuver when it comes to optimizing memory access. Only the programmer knows the context and what the code is really meant to do. Therefore, to squeeze everything possible out of modern CPUs, you need to understand how data flows through your program and design around the data first.

Brutal Truth: OOP vs. DoD




Impact of memory access on performance (Mike Acton GDC15)

Object-oriented programming (OOP) is the dominant paradigm today; it is the first one taught to future programmers. It makes you think in terms of real-world objects and their relationships.

A class usually encapsulates both code and data, so an object contains all of its information. By pushing us toward arrays of structures and, worse, arrays of *pointers to* structures/objects, OOP violates the principle of spatial locality on which cache acceleration is based. Remember the gap between processor and memory performance?
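For illustration, here is a minimal sketch of mine (not from the original article) contrasting the typical OOP layout - an array of pointers to heap-allocated objects - with a flat, contiguous array. Iterating the first chases pointers all over the heap; iterating the second walks memory linearly, cache line by cache line.

#include <memory>
#include <vector>

struct Particle { float x, y, z; float velocity; };

// Typical OOP-style storage: each particle lives somewhere on the heap.
// Every iteration step dereferences a pointer to a potentially distant address.
void UpdateScattered(std::vector<std::unique_ptr<Particle>>& particles, float dt)
{
    for (auto& p : particles)
        p->x += p->velocity * dt;   // likely a cache miss per object
}

// Data-oriented storage: particles are packed contiguously.
// The hardware prefetcher can stream whole cache lines of useful data.
void UpdateContiguous(std::vector<Particle>& particles, float dt)
{
    for (auto& p : particles)
        p.x += p.velocity * dt;     // sequential, cache-friendly access
}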



Excessive encapsulation is detrimental when working on modern hardware.

My point is that when developing software you need to shift the focus from the code itself to the data transformations, and to push back against the current programming culture and the state of affairs imposed by OOP advocates.

To wrap up this part, I want to quote the three big lies called out by Mike Acton (CppCon 2014: Mike Acton, "Data-Oriented Design and C++"):


Learn the hardware


Microprocessor cache


The processor is not connected directly to main memory. On modern processors, all RAM operations (loads and stores) go through the cache.

When the processor executes a load instruction, the memory controller first searches the cache for an entry whose tag matches the memory address to be read. If such an entry is found - a cache hit - the data can be loaded directly from the cache. If not - a cache miss - the controller tries to fetch the data from lower cache levels (for example, first L1D, then L2, then L3) and, finally, from RAM. The data is then stored in L1, L2 and L3 (inclusive cache).


Console memory latency - Jason Gregory

In this simplified illustration, the processor (AMD Jaguar, used in the PS4 and XB1) has two cache levels - L1 and L2. As you can see, it is not only data that gets cached: L1 is split into an instruction cache (L1I) and a data cache (L1D). The memory needed for code and for data is cached independently. In general, L1I causes far fewer problems than L1D.

In terms of latency, L1 is an order of magnitude faster than L2, which in turn is about 10 times faster than main memory. The numbers look grim, but you do not pay the full price for every cache miss. The cost can be reduced by latency hiding, scheduling, and so on, but that is beyond the scope of this post.


Memory latency - Andreas Fredriksson

Each entry in the cache - a cache line - holds several adjacent words (64 bytes on AMD Jaguar or Core i7). When the CPU executes an instruction that loads or stores a value, the entire cache line is transferred into L1D. In the case of a store, the cache line being written to is marked as dirty until it is written back to RAM.


Write from register to memory
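Since a whole 64-byte line is fetched on every miss, the order in which you walk memory matters. Here is a minimal sketch of mine (not from the original) comparing row-wise and column-wise traversal of a large 2D array:

#include <cstddef>

constexpr std::size_t N = 1024;
static float grid[N][N];

// Row-wise: consecutive elements share cache lines, so most accesses hit L1D.
float SumRowMajor()
{
    float sum = 0.0f;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            sum += grid[row][col];
    return sum;
}

// Column-wise: each access lands N * sizeof(float) bytes away from the
// previous one, touching a new cache line almost every time.
float SumColumnMajor()
{
    float sum = 0.0f;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            sum += grid[row][col];
    return sum;
}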

To load new data into the cache, it is almost always necessary to first free up space by evicting a cache line.


Recent processors from Intel and AMD use an inclusive cache. At first this may seem like a questionable decision, but it has two advantages:


Cache line contention: although several cores can read a cache line efficiently, writes can lead to poor performance. False sharing occurs when different cores modify independent data that happens to live in the same cache line. Under the cache coherence protocols, if one core writes to a cache line, the copy of that line held by another core referring to the same memory is invalidated (cache thrashing). As a result, every write operation effectively stalls the other core. False sharing can be avoided by making different cores work with different cache lines (adding extra padding, aligning structures to 64 bytes, and so on).


We avoid false sharing by having each thread write to its own cache line.
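A minimal sketch of this padding trick (the struct and field names are illustrative, not from the original): each per-thread counter is aligned to its own 64-byte cache line, so writes from different threads never contend for the same line.

#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;

// Without alignment, several counters would share one cache line and
// concurrent writes from different threads would cause false sharing.
struct alignas(kCacheLineSize) PerThreadCounter
{
    std::uint64_t value = 0;
    // alignas pads each instance out to a full cache line.
};

PerThreadCounter counters[8]; // one counter per worker thread, one line each

static_assert(sizeof(PerThreadCounter) == kCacheLineSize,
              "each counter occupies exactly one cache line");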

As you can see, understanding the hardware architecture is the key to finding and fixing problems that might otherwise go unnoticed.

Coreinfo is a command-line utility. It provides detailed information about the instruction sets supported by the processor and reports which caches are assigned to each logical processor. Here is an example for a Core i5-3570K:

*--- Data Cache          0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache   0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Unified Cache       0, Level 2, 256 KB, Assoc 8, LineSize 64
**** Unified Cache       1, Level 3, 6 MB, Assoc 12, LineSize 64
-*-- Data Cache          1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache   1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Unified Cache       2, Level 2, 256 KB, Assoc 8, LineSize 64
--*- Data Cache          2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache   2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Unified Cache       3, Level 2, 256 KB, Assoc 8, LineSize 64
---* Data Cache          3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache   3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Unified Cache       4, Level 2, 256 KB, Assoc 8, LineSize 64

Here each core has a 32 KB L1 data cache, a 32 KB L1 instruction cache and a 256 KB L2 cache, plus there is a 6 MB L3 cache. In this architecture, L1 and L2 are private to each core, and L3 is shared by all cores.

In the case of the AMD Jaguar CPU, each core has a dedicated L1 cache, and L2 is shared by a cluster of 4 cores (there is no L3 on Jaguar).


4-core cluster (AMD Jaguar)

Special care is needed when working with such clusters. When a core writes to a cache line, that line may be invalidated in the other cores, which hurts performance. Moreover, with this architecture things can get even worse: a core fetching data from the nearby L2 in its own cluster takes about 26 cycles, while fetching from the L2 of another cluster can take up to 190 cycles. That is comparable to fetching data from RAM!


L2 latency in clusters in AMD Jaguar - Jason Gregory

For more information on cache coherency, see the Cache Coherency Primer article.

Assembly Basics


x86-64, x64, IA-64, AMD64... or the birth of the x64 architecture


Intel and AMD each developed their own 64-bit architecture: IA-64 and AMD64. IA-64 is very different from x86-32 processors in that it inherits nothing from the x86 architecture. x86 applications have to run on IA-64 through an emulation layer and therefore perform poorly on it. Because of this lack of x86 compatibility, IA-64 never took off outside the enterprise market. AMD, on the other hand, created a more conservative architecture, extending the existing x86 with a new set of 64-bit instructions. Intel, having lost the 64-bit war, was forced to adopt the same extensions in its own x86 processors. In this section we look at x86-64, also known as the x64 architecture, or AMD64.

For many years, PC programmers used x86 assembly to write high-performance code: Mode X, CPU skinning, collisions, software rasterizers... But 32-bit computers were gradually replaced by 64-bit ones, and the assembly code changed with them.

Knowing assembly is necessary if you want to understand why some things are slow and others fast. It also helps you use intrinsic functions to optimize critical parts of the code, and debug optimized (for example, -O3) builds when source-level debugging no longer makes sense.

Registers


Registers are small pieces of very fast memory with almost zero latency (usually one processor cycle). They serve as the processor's internal memory and hold the data that instructions operate on directly.

An x64 processor has 16 general-purpose registers (GPRs). They are not tied to specific data types; at execution time they hold operands and addresses.

In x64, the eight x86 registers are extended to 64 bits, and 8 new 64-bit registers are added. The names of the 64-bit registers begin with r. For example, the 64-bit extension of eax (32-bit) is called rax. The new registers are named r8 through r15.


General architecture (software.intel.com)

The x64 registers include:


In newer processors:



Relationships between ZMM, YMM, and XMM Registers

For historical reasons, several GPRs have special names. For example, ax was the Accumulator register, cx the Counter, dx the Data register. Today most of them have lost their specific purpose, with the exception of rsp (Stack Pointer) and rbp (Base Pointer), which are reserved for managing the hardware stack (although rbp can often be "optimized out" and used as a GPR - -fomit-frame-pointer in Clang).

The low bits of the x64 registers can be accessed through subregisters. The first eight registers use legacy names; the newer registers (r8-r15) follow the same, but simplified, scheme:


Named scalar registers

Addressing


When an assembly instruction takes two operands, the first is usually the destination and the second the source. Each of them contains either the data to be processed or the address of that data. There are three main addressing modes:


dword ptr is called a size directive. It tells the assembler what operand size to use when the size of the referenced memory is ambiguous (for example, mov [rcx], 5: should it write a byte? a dword?).
The size can be: byte (8-bit), word (16-bit), dword (32-bit), qword (64-bit), xmmword (128-bit), ymmword (256-bit), zmmword (512-bit).

SIMD instruction sets


A scalar implementation operates on one pair of operands at a time. Vectorization is the process of transforming an algorithm so that, instead of working on a single piece of data at a time, it processes several pieces at once (we will see how below).

Modern processors can take advantage of a set of SIMD instructions (vector instructions) for parallel data processing.


SIMD processing

The SIMD instruction sets available in x86 processors:




Vector registers in x64 processors

Game engines typically spend 90% of their execution time in a small portion of the codebase, mostly iterating over and processing data. In such scenarios, SIMD can make a big difference. SSE instructions are usually used to process sets of four floating-point values packed into 128-bit vector registers in parallel.

SSE is mainly geared toward the vertical representation of data - Structure of Arrays (SoA) - and its processing. In general, though, the performance of SoA versus Array of Structures (AoS) depends on the memory access patterns.


// Array of Structures
struct Sphere
{
    float x;
    float y;
    float z;
    double r;
};
Sphere* AoS;

// Memory layout (array of 8 spheres):
// ------------------------------------------------------------------
// | x | y | z | r | pad | x | y | z | r | pad | x | y | z | r | pad
// ------------------------------------------------------------------

// Structure of Arrays
struct SoA
{
    float* x;
    float* y;
    float* z;
    double* r;
    size_t size;
};

// Memory layout:
// ------------------------------------------------------------------
// | x | x | x ..| pad | y | y | y ..| pad | z | z | z ..| pad | r..
// ------------------------------------------------------------------

AVX is a natural extension of SSE. The vector registers grow to 256 bits, which means up to 8 floating-point values can be packed and processed in parallel. Intel processors support 256-bit registers natively, while on AMD there can be problems: AMD's early AVX processors, such as Bulldozer and Jaguar, split 256-bit operations into pairs of 128-bit ones, which increases latency compared to SSE.

In short, it is risky to rely solely on AVX (perhaps for internal tools, if all your machines run on Intel), since AMD processors of that era mostly do not execute it natively. On the other hand, on any x64 processor you can rely on SSE2 a priori - it is part of the specification.
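If you do want to ship an AVX path, you can select it at run time. Here is a minimal sketch of mine, assuming GCC/Clang and their __builtin_cpu_supports builtin (MSVC would use __cpuid instead); the two Transform functions are hypothetical kernels, not part of the original article:

// Hypothetical kernels: one compiled for SSE2, one for AVX.
void TransformVerticesSSE2(const float* in, float* out, int count);
void TransformVerticesAVX(const float* in, float* out, int count);

using TransformFn = void (*)(const float*, float*, int);

TransformFn SelectTransform()
{
    // __builtin_cpu_supports is a GCC/Clang builtin that queries CPUID.
    // SSE2 is always present on x64, so it is the safe fallback.
    if (__builtin_cpu_supports("avx"))
        return TransformVerticesAVX;
    return TransformVerticesSSE2;
}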

Out-of-order execution


If the processor pipeline runs in out-of-order (OoO) execution mode and an instruction is delayed because its input data is not yet available, the processor tries to find later instructions whose inputs are ready and executes them first.

The instruction cycle (the fetch-decode-execute cycle) is the process by which the processor fetches an instruction from memory, determines what needs to be done, and executes it. In out-of-order mode, the cycle looks like this:



AMD Jaguar processor architecture

In the AMD Jaguar architecture we can find all of the blocks mentioned above. For the integer pipeline:


Examples of micro-operations:

                    µops
add reg, reg        1: add
add reg, [mem]      2: load, add
addpd xmm, xmm      1: addpd
addpd xmm, [mem]    2: load, addpd

Looking at the AMD Jaguar section of the wonderful instruction tables on Agner Fog's website, we can work out what the execution pipeline for this code looks like:

mov eax, [mem1]    ; 1 - load
imul eax, 5        ; 2 - mul
add eax, [mem2]    ; 3 - load, add
mov [mem3], eax    ; 4 - store

Execution pipeline (Jaguar):
  I0    | I1 | LAGU   | SAGU    | FP0 | FP1
        |    | 1-load |         |     |
  2-mul |    | 3-load |         |     |
  3-add |    |        |         |     |
        |    |        | 4-store |     |

Here, breaking instructions into micro-operations allows the processor to take advantage of its parallel execution units, partially or completely "hiding" instruction latency (3-load and 2-mul execute in parallel on two different units).

But this is not always possible. The dependency chain between 2-mul, 3-add and 4-store prevents the processor from reordering these micro-operations (4-store needs the result of 3-add, and 3-add needs the result of 2-mul). So, to use the parallel execution units efficiently, avoid long dependency chains.
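A classic C++ illustration of this advice (my sketch, not from the original): summing an array with a single accumulator creates one long dependency chain, while several independent accumulators let the out-of-order core overlap the additions.

#include <cstddef>

// One accumulator: every add depends on the previous one,
// so the adds execute strictly one after another.
float SumSingle(const float* data, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}

// Four independent accumulators: the four chains have no dependencies
// between them, so the processor can keep several adds in flight.
// (Assumes count is a multiple of 4 to keep the sketch short.)
float SumUnrolled(const float* data, std::size_t count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (std::size_t i = 0; i < count; i += 4)
    {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}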

Visual Studio Options


To illustrate compiler-generated assembly, I will use MSVC++ 14.0 (VS2015) and Clang. I strongly recommend you do the same and get used to comparing different compilers. This will help you better understand how all the pieces interact and form your own opinion about the quality of the generated code.

A few goodies:


Basic disassembly examples


Here we will look at very simple C++ code samples and their disassembly. All the assembly has been rearranged and fully commented to make it easier for beginners, but I recommend looking things up if you have any doubts about what an instruction does.

For readability, function prologues and epilogues have been removed; we will not discuss them here.

Note: local variables live on the stack. For example, mov dword ptr [rbp + 4], 0Ah ; int b = 10 means that the local variable 'b' is placed on the stack (addressed via rbp) at offset 4 and initialized to 0Ah, i.e. 10 in decimal.

Single-precision floating-point arithmetic

Floating-point arithmetic can be performed either with the x87 FPU (80-bit precision, scalar) or with SSE (32-bit or 64-bit precision, vectorized). On x64, the SSE2 instruction set is always available and is used for floating-point arithmetic by default.



Simple floating-point arithmetic using SSE. MSVC++

Initialization


Calculates x * x


Calculates y * y and adds to x * x


Calculates z * z and adds to x * x + y * y


Saves the final result


In this example, the XMM registers hold just one floating-point value each. SSE lets you work with both single (scalar) and multiple (packed) values, and with different data types. Look at the SSE addition instructions:
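For example (a sketch of mine using the standard SSE intrinsics rather than the original's screenshots), addss adds only the lowest lanes, while addps adds all four lanes at once:

#include <xmmintrin.h> // SSE intrinsics

void AddExamples(const float* a, const float* b, float* scalarOut, float* packedOut)
{
    __m128 va = _mm_loadu_ps(a); // four floats from a
    __m128 vb = _mm_loadu_ps(b); // four floats from b

    // addss: scalar single - adds only the lowest element, upper lanes copied from va.
    __m128 scalarSum = _mm_add_ss(va, vb);

    // addps: packed single - adds all four elements in parallel.
    __m128 packedSum = _mm_add_ps(va, vb);

    _mm_store_ss(scalarOut, scalarSum);
    _mm_storeu_ps(packedOut, packedSum);
}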


Branching

An example of branching. MSVC++

Initialization


Condition


'then' result = a


'else' result = b


The cmp instruction compares the first source operand with the second and sets the status flags in the RFLAGS register according to the result. RFLAGS is the x86 status register that holds the current state of the processor. cmp is usually paired with a conditional jump (for example, jge); the condition codes used by these jumps depend on the result of the cmp (RFLAGS condition codes).
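To make the connection to the screenshots concrete, here is the kind of C++ that produces this cmp/jge pattern (my sketch; the names are purely illustrative):

// The compiler emits roughly:
//   cmp  a, b        ; compare a and b, set RFLAGS
//   jge  ELSE        ; if a >= b, jump to the 'else' branch
//   mov  result, a   ; 'then': result = a
//   jmp  END
// ELSE:
//   mov  result, b   ; 'else': result = b
// END:
int SelectSmaller(int a, int b)
{
    int result;
    if (a < b)
        result = a;
    else
        result = b;
    return result;
}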

Integer arithmetic and a 'for' loop

In assembly, loops are mostly expressed as a series of conditional jumps (= if ... goto).



Integer arithmetic and a 'for' loop. MSVC++

Initialization


The part of the code responsible for incrementing i


The part of the code that tests the exit condition (i >= k)


The "real work": sum += i
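The C++ behind these screenshots is something like the following (my reconstruction from the captions, so treat the exact names as illustrative):

int SumUpTo(int k)
{
    int sum = 0;
    // Compiled as: initialize i, jump to the exit test (i >= k),
    // and loop back with a conditional jump while the test fails.
    for (int i = 0; i < k; ++i)
        sum += i;   // the "real work"
    return sum;
}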



SSE built-in functions (intrinsics)

Compilers expose SSE instructions through built-in (intrinsic) functions - each one maps more or less directly to a single instruction - so you can vectorize code without writing assembly. As an example, let's compute the dot product of every vertex in an array with a constant vector A:




SSE intrinsics, MSVC++

Loading the data (an xmmword is 128 bits - four dwords)


Computing dot(v[i], A) = xi * Ax + yi * Ay + zi * Az + wi * Aw for four vertices at a time:


Summing the products and storing the result
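The original shows this as screenshots; a minimal SSE sketch of mine of the same idea, assuming SoA vertex data in 16-byte-aligned xs/ys/zs/ws arrays and a count that is a multiple of 4, could look like this:

#include <xmmintrin.h>

void DotProducts(const float* xs, const float* ys, const float* zs, const float* ws,
                 const float* A /* Ax, Ay, Az, Aw */, float* results, int vertexCount)
{
    // Broadcast each component of A into all four lanes of a register.
    __m128 Ax = _mm_set1_ps(A[0]);
    __m128 Ay = _mm_set1_ps(A[1]);
    __m128 Az = _mm_set1_ps(A[2]);
    __m128 Aw = _mm_set1_ps(A[3]);

    for (int i = 0; i < vertexCount; i += 4) // 4 vertices per iteration (128-bit)
    {
        __m128 x4 = _mm_load_ps(xs + i);
        __m128 y4 = _mm_load_ps(ys + i);
        __m128 z4 = _mm_load_ps(zs + i);
        __m128 w4 = _mm_load_ps(ws + i);

        __m128 dx = _mm_mul_ps(Ax, x4);
        __m128 dy = _mm_mul_ps(Ay, y4);
        __m128 dz = _mm_mul_ps(Az, z4);
        __m128 dw = _mm_mul_ps(Aw, w4);

        // Sum the four partial products and store four dot products at once.
        __m128 dots = _mm_add_ps(_mm_add_ps(dx, dy), _mm_add_ps(dz, dw));
        _mm_store_ps(results + i, dots);
    }
}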


The same with AVX (256-bit registers, 8 floats at a time):

__m256 Ax = _mm256_broadcast_ss(A);
...
for (int i = 0; i < vertexCount; i += 8) // 8 vertices per iteration (256-bit)
{
    __m256 x4 = _mm256_load_ps(xs + i);
    ...
    __m256 dx = _mm256_mul_ps(Ax, x4);
    ...
    __m256 a0 = _mm256_add_ps(dx, dy);
    ...
    _mm256_store_ps(results + i, dots);
}


The 'switch' statement



A 'switch' statement. MSVC++




Conditions


Case 0


Case 1


...

Here the C++ compiler turns the switch into a chain of if-else-style comparisons. With more (and denser) case labels it can generate a jump table instead.
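A small switch of the kind that produces this pattern (my illustrative reconstruction, since the original source is only shown as screenshots):

int Describe(int value)
{
    // With only a few sparse cases, the compiler emits a chain of
    // cmp/jump pairs ("Conditions"), followed by the code for each case.
    switch (value)
    {
        case 0:  return 10;   // Case 0
        case 1:  return 20;   // Case 1
        case 7:  return 30;
        default: return -1;
    }
}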

Useful links



Source: https://habr.com/ru/post/319194/

