
Optimizing games for the iOS platform. Code vectorization

I have long wanted to write a couple of articles laying out my experience and knowledge of how to optimize games for the ARMv7 CPU architecture and the PowerVR SGX 5 series GPUs - in other words, the iOS platform. Most of the tips, however, apply equally to other systems with the same hardware, i.e. Android devices. This material is useful not only for games but also for most demanding applications - image, audio and video processing, and so on. I will begin my first article with the most important optimization, IMHO: vectorizing code for NEON.

This article began as a talk for a conference to be held on November 24th. A bunch of optimization tips for the iPhone can be found here. The following articles will expand the topics of that presentation in breadth and depth.

What is NEON? NEON is a general-purpose SIMD engine used in ARM processors. It has 16 registers of 128 bits each, which can also be viewed as 32 registers of 64 bits. NEON shares its registers with VFP, although it has its own pipeline. As with SSE, data must be aligned to 16 bytes. NEON can also work with unaligned data, but that is usually about 2 times slower.

NEON can work with 8-, 16-, 32- and 64-bit integer data and with 32-bit single-precision floating-point data.
It is great for multimedia tasks, including games.
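To make that concrete, here is a minimal intrinsics sketch (the function and its names are illustrative, not taken from the demo project) that scales an array of floats four lanes at a time:

#include <arm_neon.h>

// Multiply `count` floats by `scale`, 4 lanes per iteration.
// Assumes count is a multiple of 4 and the pointers are 16-byte aligned.
void ScaleFloats(const float* src, float* dst, int count, float scale)
{
    for (int i = 0; i < count; i += 4)
    {
        float32x4_t v = vld1q_f32(src + i);   // load 4 floats into one Q register
        v = vmulq_n_f32(v, scale);            // multiply all 4 lanes by the scalar
        vst1q_f32(dst + i, v);                // store 4 floats back
    }
}

Each iteration touches 4 floats at once, which is exactly the kind of data-parallel work NEON was built for.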
Let's start with the core - the heart of every modern mobile system: the system on a chip, or SoC (System on Chip). It is known that iOS devices use the Apple A series of SoCs - A4, A5, A5X, A6 and A6X. The most important specifications of these chips are given in the table below:
| CPU specification | A4 | A5 | A5X | A6 |
|---|---|---|---|---|
| Architecture | ARMv7 | ARMv7 | ARMv7 | ARMv7 |
| Core | Cortex-A8 | Cortex-A9 | Cortex-A9 | Self-developed |
| Number of cores | 1 | 2 | 2 | 2 |
| Frequency, MHz | 800 | 1000 | 1000 | 1300 |
| Extensions | VFPv3 (VFPLite), NEON | VFPv3, NEON | VFPv3, NEON | VFPv4, NEON |

| GPU specification | A4 | A5 | A5X | A6 |
|---|---|---|---|---|
| Model | PowerVR SGX 535 | PowerVR SGX 543MP2 | PowerVR SGX 543MP4 | PowerVR SGX 543MP3 |
| Frequency, MHz | 200 | 200 | 200 | 266 |

* Note: NEON runs at the CPU frequency.

It is easy to see that NEON has a 5-fold frequency advantage over the GPU! Of course, this does not mean we will get a 5-fold performance increase compared with the GPU - IPC, pipelining and so on matter a lot. But NEON has one killer feature - it can process 4 32-bit floats in parallel, while the PowerVR SGX processes only one. It seems that the PowerVR SGX 5-series SIMD registers are 64 bits long, since the GPU can process 4 half-precision (16-bit) floats in parallel. Consider an example:

highp vec4 v1, v2;
highp float s1, s2;

v2 = (v1 * s1) * s2;
// v1 * s1 - 4 operations, then the result is multiplied by s2 - another 4 operations.
// 8 operations in total.

v2 = v1 * (s1 * s2);
// s1 * s2 - 1 operation; the result * v1 - 4 operations.
// 5 operations in total.

Now let's look at another example, executed on the GPU's vector engine:

mediump vec4 v1, v2, v3;
highp vec4 s1, s2, s3;

v3 = v1 * v2; // mediump multiplication - 1 operation
s3 = s1 * s2; // highp multiplication - 4 operations

You will need the highp qualifier for data such as vertex positions. The benefit of NEON here is obvious.

We now turn to another advantage of NEON. It is known that the PowerVR SGX 5 series has a USSE - a unified shader processor that does not care which type of shader it runs, vertex or pixel. This means the programmer has a certain performance budget and has to decide how to spend it - on vertex or on pixel processing. This is where NEON comes to the rescue: it is your new vertex processor. You might think I forgot to insert a troll-face here, but I am quite serious. The performance of almost every mobile system is limited by fill rate, especially in 2D games, and especially in our time of the screen-resolution race. By moving all vertex processing to NEON, you free up resources for pixel processing. In addition, NEON helps reduce the number of draw calls - compute the positions of all vertices of one batch in world coordinates and draw N objects in a single call.

The theory is over! Now for the hardcore! There are several ways to take advantage of NEON: use a ready-made vectorized library (on iOS that is GLKMath from GLKit), write NEON assembly by hand, use compiler intrinsics, or let the compiler auto-vectorize your code.

It is time to discover the advantages and disadvantages of each method. For this, I wrote a simple demo: every frame, each of 10,000 sprites changes its position to a random one within the screen. The goal is to get the fastest code with a minimal load on the CPU, because in games the CPU has to do plenty of things besides preparing data for rendering.

All data is stored in one VBO. The Update method multiplies the projection matrix by the ModelView matrix of a random position. Then each vertex of each sprite is multiplied by the resulting ModelViewProjection matrix. The final position of each vertex is simply passed to gl_Position in the vertex shader. All data is aligned to a 16-byte boundary.
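For illustration, here is a minimal sketch of how that 16-byte alignment might be declared; the structure name and layout are assumptions, not the demo project's actual code:

#include <arm_neon.h>

// Hypothetical vertex layout for the sprite demo; only a sketch.
typedef struct
{
    float32x4_t pos; // x, y, z, w; float32x4_t is itself 16-byte aligned
} SpriteVertex;

// Explicit alignment so NEON loads and stores take the fast, aligned path.
static SpriteVertex data[10000 * 4] __attribute__((aligned(16)));
static float32x4_t squareVertices[4] __attribute__((aligned(16)));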

Update method code:
void Update()
{
    GLKMatrix4 modelviewMat =
    {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1,
    };

    const u32 QUADS_COUNT = 10000;
    const u32 VERTS_PER_QUAD = 4;
    const float Y_DELTA = 420.0f / QUADS_COUNT; // step along the Y axis

    float vertDelta = Y_DELTA;

    for (int i = 0; i < QUADS_COUNT * VERTS_PER_QUAD; i += VERTS_PER_QUAD)
    {
        float randX = random() % 260; // random position along the X axis

        modelviewMat.m[12] = randX;
        modelviewMat.m[13] = vertDelta;

        float32x4x4_t mvp;
        Matrix4ByMatrix4((float32x4x4_t*)proj.m, (float32x4x4_t*)modelviewMat.m, &mvp);

        for (int j = 0; j < 4; ++j)
        {
            Matrix4ByVec4(&mvp, &squareVertices[j], &data[i + j].pos);
        }

        vertDelta += Y_DELTA;
    }

    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_STREAM_DRAW);
}

Well, now we come to the essence of this article - vectorizing the code. Below is the code used in the three compared approaches for the operations most frequently used in game development - multiplying a matrix by a vector and multiplying a matrix by a matrix.

Copy-paste from GLKMath:
static __inline__ GLKVector4 GLKMatrix4MultiplyVector4(GLKMatrix4 matrixLeft, GLKVector4 vectorRight)
{
    float32x4x4_t iMatrix = *(float32x4x4_t *)&matrixLeft;
    float32x4_t v;

    iMatrix.val[0] = vmulq_n_f32(iMatrix.val[0], (float32_t)vectorRight.v[0]);
    iMatrix.val[1] = vmulq_n_f32(iMatrix.val[1], (float32_t)vectorRight.v[1]);
    iMatrix.val[2] = vmulq_n_f32(iMatrix.val[2], (float32_t)vectorRight.v[2]);
    iMatrix.val[3] = vmulq_n_f32(iMatrix.val[3], (float32_t)vectorRight.v[3]);

    iMatrix.val[0] = vaddq_f32(iMatrix.val[0], iMatrix.val[1]);
    iMatrix.val[2] = vaddq_f32(iMatrix.val[2], iMatrix.val[3]);

    v = vaddq_f32(iMatrix.val[0], iMatrix.val[2]);

    return *(GLKVector4 *)&v;
}

static __inline__ GLKMatrix4 GLKMatrix4Multiply(GLKMatrix4 matrixLeft, GLKMatrix4 matrixRight)
{
    float32x4x4_t iMatrixLeft = *(float32x4x4_t *)&matrixLeft;
    float32x4x4_t iMatrixRight = *(float32x4x4_t *)&matrixRight;
    float32x4x4_t m;

    m.val[0] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[0], 0));
    m.val[1] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[1], 0));
    m.val[2] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[2], 0));
    m.val[3] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[3], 0));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[0], 1));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[1], 1));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[2], 1));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[3], 1));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[0], 2));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[1], 2));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[2], 2));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[3], 2));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[0], 3));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[1], 3));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[2], 3));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[3], 3));

    return *(GLKMatrix4 *)&m;
}
It is easy to see that Apple's implementation of these operations is far from optimal: arguments are passed by value and variables are copied. It looks pretty slow, at least in a debug build it will be. Let's see how this code performs under the profiler.

Assembly approach:
inline void Matrix4ByVec4(float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec, float32x4_t* __restrict__ result)
{
    asm
    (
        "vldmia %0, { d24-d31 } \n\t"
        "vld1.32 {q1}, [%1]\n\t"

        "vmul.f32 q0, q12, d2[0]\n\t"
        "vmla.f32 q0, q13, d2[1]\n\t"
        "vmla.f32 q0, q14, d3[0]\n\t"
        "vmla.f32 q0, q15, d3[1]\n\t"

        "vstmia %2, { q0 }"
        :
        : "r" (mat), "r" (vec), "r" (result)
        : "memory", "q0", "q1", "q12", "q13", "q14", "q15"
    );
}

inline void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
{
    asm
    (
        "vldmia %1, { q0-q3 } \n\t"
        "vldmia %2, { q8-q11 }\n\t"

        "vmul.f32 q12, q8, d0[0]\n\t"
        "vmul.f32 q13, q8, d2[0]\n\t"
        "vmul.f32 q14, q8, d4[0]\n\t"
        "vmul.f32 q15, q8, d6[0]\n\t"

        "vmla.f32 q12, q9, d0[1]\n\t"
        "vmla.f32 q13, q9, d2[1]\n\t"
        "vmla.f32 q14, q9, d4[1]\n\t"
        "vmla.f32 q15, q9, d6[1]\n\t"

        "vmla.f32 q12, q10, d1[0]\n\t"
        "vmla.f32 q13, q10, d3[0]\n\t"
        "vmla.f32 q14, q10, d5[0]\n\t"
        "vmla.f32 q15, q10, d7[0]\n\t"

        "vmla.f32 q12, q11, d1[1]\n\t"
        "vmla.f32 q13, q11, d3[1]\n\t"
        "vmla.f32 q14, q11, d5[1]\n\t"
        "vmla.f32 q15, q11, d7[1]\n\t"

        "vstmia %0, { q12-q15 }"
        :
        : "r" (r), "r" (m2), "r" (m1)
        : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
    );
}
For a person not familiar with assembly, this all looks pretty scary - I am such a person myself, NEON assembly is the only assembly I can read. But in fact everything here is simple: q0-q15 are the NEON registers; vldmia / vld1.32 are load instructions; vstmia is a store to memory; vmul.f32 / vmla.f32 are multiply and multiply-accumulate.

Intrinsics approach:
inline void Matrix4ByVec4(float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec, float32x4_t* __restrict__ result)
{
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}

inline void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
{
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
}
Almost the same code as in GLKMath, but with some minor differences. Explanations: vmulq_n_f32 - multiply a vector by a scalar; vgetq_lane_f32 - a macro that extracts a scalar lane from a vector; vmlaq_n_f32 - multiply by a scalar and accumulate. This code is simply a mirror of the assembly written with intrinsics. Let's see how it compares to the assembly version.

I ran the test on an iPod Touch 4. The table contains the profiling results for the Update function:
| Approach | Execution time, ms | CPU load, % |
|---|---|---|
| FPU | 6058 + 5067 * | 35-38 |
| GLKMath | 2789 | 20-23 |
| Assembly | 5304 | 23-25 |
| Intrinsics | 2803 | 18-20 |
* In the Instruments screenshot you can see that the Matrix4ByMatrix4 function has not been inlined.

Here's another tip - aggressively inline your performance-critical code. Prefer __attribute__((always_inline)) over plain inline in such cases.
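As a sketch (assuming GCC/Clang attribute syntax), the intrinsics version of Matrix4ByVec4 can be force-inlined like this:

// Force inlining even when the compiler's own heuristic would refuse.
__attribute__((always_inline)) inline void Matrix4ByVec4(
    float32x4x4_t* __restrict__ mat,
    const float32x4_t* __restrict__ vec,
    float32x4_t* __restrict__ result)
{
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}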

Updated results table:
| Approach | Execution time, ms | CPU load, % |
|---|---|---|
| FPU, force-inlined | 6209 | 25-28 |
| GLKMath | 2789 | 20-23 |
| Assembly | 5304 | 23-25 |
| Intrinsics | 2803 | 18-20 |
Forced inlining gave a very good performance boost! Now let's see how compiler auto-vectorization performs. All we need is to add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C Flags in the project settings.
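These flags enable LLVM's basic-block vectorizer and restrict it to aligned accesses. A plain scalar (FPU) matrix-by-vector multiply like the sketch below - not the exact demo code - is the kind of loop the vectorizer can then fuse into NEON instructions, provided the data is 16-byte aligned:

// Plain FPU version; with -mllvm -vectorize -mllvm -bb-vectorize-aligned-only
// the compiler may combine these scalar operations into NEON instructions.
inline void Matrix4ByVec4_FPU(const float* __restrict__ m,   // 4x4 matrix, column-major
                              const float* __restrict__ v,   // 4 floats
                              float* __restrict__ r)         // 4 floats
{
    for (int row = 0; row < 4; ++row)
    {
        r[row] = m[0 * 4 + row] * v[0]
               + m[1 * 4 + row] * v[1]
               + m[2 * 4 + row] * v[2]
               + m[3 * 4 + row] * v[3];
    }
}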

Final results table:
| Approach | Execution time, ms | Execution time (vectorized), ms | CPU load, % | CPU load (vectorized), % |
|---|---|---|---|---|
| FPU, force-inlined | 6209 | 5028 | 25-28 | 22-24 |
| GLKMath | 2789 | 2776 | 20-23 | 20-23 |
| Assembly | 5304 | 5291 | 23-25 | 22-24 |
| Intrinsics | 2803 | 2789 | 18-20 | 18-20 |

Quite strange results can be observed for assembly versus intrinsics: the code is essentially the same, yet the results differ dramatically - by almost 2 times! The answer lies in the assembly listing (those who wish can look into it themselves). In the case of hand-written assembly, the listing contains exactly what we wrote. In the case of intrinsics, the compiler optimized the code further. The seemingly slow GLKMath code was optimized by the compiler so well that it gave the same execution time as the hand-written intrinsics.


It is time to sum up. A few conclusions can be drawn from the numbers above:

- Any of the NEON approaches is roughly twice as fast as the plain FPU code, while also lowering CPU load.
- The compiler optimizes intrinsics (and the GLKMath code built on them) very well; hand-written assembly ties the compiler's hands and ends up almost 2 times slower here.
- Aggressive inlining of performance-critical functions pays off.
- Auto-vectorization helps plain FPU code a little, but changes almost nothing for code that is already vectorized.


Links
www.arm.com/products/processors/technologies/neon.php
blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores
code.google.com/p/math-neon
llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf
Demo project

Source: https://habr.com/ru/post/156809/

