I have long wanted to write a couple of articles laying out my experience and knowledge of optimizing games for the ARMv7 CPU architecture and the PowerVR SGX 5 series of GPUs, that is, the iOS platform. Most, if not all, of these tips apply equally to other systems with the same hardware, meaning Android devices. The material is applicable not only to games but also to most demanding applications: image, audio, and video processing, and so on. I will begin my first article with what is, IMHO, the most important optimization: vectorizing code for NEON.
This article began as a talk for a conference to be held on November 24th. A bunch of iPhone optimization tips can be found here. The following articles will develop the material of that presentation in breadth and depth.
What is NEON? NEON is a general-purpose SIMD engine used in ARM processors. It has 16 registers of 128 bits each, which can also be viewed as 32 registers of 64 bits. NEON shares its registers with VFP, although it has its own pipeline. As with SSE, data must be aligned to 16 bytes; NEON can also work with unaligned data, but usually about 2 times slower.
NEON can work with:
- signed/unsigned 8/16/32/64-bit integer data types;
- single-precision floating-point numbers, i.e. 32-bit float.

It is great for multimedia tasks, including games.
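As a quick illustration, here is a minimal sketch of adding four float pairs with NEON (the function and variable names are mine; compiling with NEON enabled, e.g. -mfpu=neon, is assumed):

```c
#include <arm_neon.h>

// Sources aligned to 16 bytes so the loads take the fast, aligned path.
__attribute__((aligned(16))) static float a[4] = { 1, 2, 3, 4 };
__attribute__((aligned(16))) static float b[4] = { 5, 6, 7, 8 };

void AddFourFloats(float* __restrict__ out)
{
    float32x4_t va = vld1q_f32(a);       // one 128-bit load: four floats
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));   // four additions in one instruction
}
```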
Let's start with the core, the heart of every modern mobile system: the system on a chip, or SoC (System on Chip). As is well known, iOS devices use Apple's A series SoCs: the A4, A5, A5X, A6, and A6X. The most important specifications of these chips are given in the table:
| | A4 | A5 | A5X | A6 |
|---|---|---|---|---|
| **CPU** | | | | |
| Architecture | ARMv7 | ARMv7 | ARMv7 | ARMv7 |
| Core | Cortex-A8 | Cortex-A9 | Cortex-A9 | Custom Apple design |
| Number of cores | 1 | 2 | 2 | 2 |
| Frequency, MHz | 800 | 1000 | 1000 | 1300 |
| Extensions | VFPv3 (VFPLite), NEON | VFPv3, NEON | VFPv3, NEON | VFPv4, NEON |
| **GPU** | | | | |
| Model | PowerVR SGX 535 | PowerVR SGX 543MP2 | PowerVR SGX 543MP4 | PowerVR SGX 543MP3 |
| Frequency, MHz | 200 | 200 | 200 | 266 |
\* Note: NEON runs at the CPU frequency.

It is easy to see that NEON's clock is 5 times the GPU's! Of course, this does not mean we get a 5-fold performance increase over the GPU; IPC, pipelining, and so on all matter. But NEON has one killer feature: it can process 4 32-bit floats in parallel, while the PowerVR SGX processes only one. It seems that the PowerVR SGX 5 series SIMD registers are 64 bits long, since the GPU can process 4 half-precision (16-bit) floats in parallel. Consider an example:
```glsl
highp vec4 v1, v2;
highp float s1, s2;
```

Since this data is highp (32-bit), the SGX processes it one float at a time. Now let's look at another example, one that executes on the GPU's vector engine:
```glsl
mediump vec4 v1, v2, v3;
highp vec4 s1, s2, s3;
v3 = v1 * v2;
```
But you will need the highp specifier for some of your data, for example vertex positions, and that is where the profit from NEON is obvious.
We now turn to another advantage of NEON. The PowerVR SGX 5 series has USSE, a unified shader engine that does not care which type of shader it processes, vertex or pixel. This means the programmer has a fixed performance budget and must decide how to split it between vertex and pixel processing. This is where NEON comes to the rescue: it is your new vertex processor. You might think I forgot to insert a troll face here, but I am quite serious. The performance of almost every mobile system is limited by fill rate, especially in 2D games, and especially in our age of the screen-resolution race. By moving all vertex processing to NEON, you free up resources for pixel processing. In addition, NEON will help reduce the number of draw calls: compute the world-space positions of all vertices in a batch, and draw N objects in one call.
The theory is over. Now for the hardcore! There are several ways to take advantage of NEON:
- Let the compiler vectorize the code for you. A bad way. The compiler may vectorize... or it may not. Even if it does vectorize the code, it is far from certain that the result will be optimal. On the other hand, this method requires no effort on your part, and you may still get a profit. Still, do not blindly rely on the compiler; vectorize at least the most critical code by hand. (A sketch of the kind of loop the compiler handles well follows this list.)
- NEON assembly. Here it is, the hardcore. The path of a true Jedi and all that. You will have to learn dark magic and spend nights with the ARM manuals. Keep in mind that NEON code works in both ARM and Thumb-2 modes.
- NEON intrinsics (the analogue of SSE intrinsics on x86). Unlike assembly, which the compiler inserts verbatim, intrinsics are optimized by the compiler. Life with them is much easier: there is no need to learn instruction timings or shuffle instructions around to avoid pipeline stalls.
- Use already-vectorized code: GLKMath, math-neon.
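To make the first option concrete, here is the kind of loop a vectorizing compiler copes with well: independent iterations with unit stride over contiguous data (a generic sketch of mine, not code from the demo project):

```c
// Every iteration is independent, so the compiler is free to process
// four floats at a time with NEON instructions.
void ScaleArray(float* __restrict__ dst, const float* __restrict__ src,
                float s, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}
```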
It is time to discover the advantages and disadvantages of each method. For this I wrote a simple demo: every frame, each of 10,000 sprites changes its position to a random one within the screen. The goal is to get the fastest code with minimal CPU load, because in games you need to compute plenty of things besides the data for rendering.

All data is stored in one VBO. The Update method multiplies the projection matrix by the ModelView matrix of a random position. Then each vertex of each sprite is multiplied by the resulting ModelViewProjection matrix. The final position of each vertex is simply passed to gl_Position in the vertex shader. All data is aligned to a 16-byte boundary.
Update method code:
```c
void Update()
{
    GLKMatrix4 modelviewMat =
    {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1,
    };

    const u32 QUADS_COUNT = 10000;
    const u32 VERTS_PER_QUAD = 4;
    const float Y_DELTA = 420.0f / QUADS_COUNT;
```
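The listing is cut off above. Based on the description of Update, the rest of the method plausibly looks like the sketch below; projectionMat, quadVerts, and transformedVerts are assumed names, and the random-position details are my guess:

```c
    // Hypothetical continuation: move each quad to a random position,
    // rebuild the MVP matrix, and transform its four vertices on the CPU.
    float y = 0.0f;
    for (u32 i = 0; i < QUADS_COUNT; ++i)
    {
        modelviewMat.m[12] = (float)(rand() % 320); // random x within the screen
        modelviewMat.m[13] = y;                     // y spread across the screen
        y += Y_DELTA;

        GLKMatrix4 mvpMat = GLKMatrix4Multiply(projectionMat, modelviewMat);
        for (u32 v = 0; v < VERTS_PER_QUAD; ++v)
        {
            const u32 idx = i * VERTS_PER_QUAD + v;
            transformedVerts[idx] = GLKMatrix4MultiplyVector4(mvpMat, quadVerts[idx]);
        }
    }
}
```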
Well, now we come to the essence of this article: vectorization of the code. Below is the code used by each of the three compared approaches for the two operations most frequently used in game development: matrix-by-vector and matrix-by-matrix multiplication.

Copy-pasted from GLKMath:
```c
static __inline__ GLKVector4 GLKMatrix4MultiplyVector4(GLKMatrix4 matrixLeft, GLKVector4 vectorRight)
{
    float32x4x4_t iMatrix = *(float32x4x4_t *)&matrixLeft;
    float32x4_t v;

    iMatrix.val[0] = vmulq_n_f32(iMatrix.val[0], (float32_t)vectorRight.v[0]);
    iMatrix.val[1] = vmulq_n_f32(iMatrix.val[1], (float32_t)vectorRight.v[1]);
    iMatrix.val[2] = vmulq_n_f32(iMatrix.val[2], (float32_t)vectorRight.v[2]);
    iMatrix.val[3] = vmulq_n_f32(iMatrix.val[3], (float32_t)vectorRight.v[3]);

    iMatrix.val[0] = vaddq_f32(iMatrix.val[0], iMatrix.val[1]);
    iMatrix.val[2] = vaddq_f32(iMatrix.val[2], iMatrix.val[3]);

    v = vaddq_f32(iMatrix.val[0], iMatrix.val[2]);

    return *(GLKVector4 *)&v;
}

static __inline__ GLKMatrix4 GLKMatrix4Multiply(GLKMatrix4 matrixLeft, GLKMatrix4 matrixRight)
{
    float32x4x4_t iMatrixLeft = *(float32x4x4_t *)&matrixLeft;
    float32x4x4_t iMatrixRight = *(float32x4x4_t *)&matrixRight;
    float32x4x4_t m;

    m.val[0] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[0], 0));
    m.val[1] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[1], 0));
    m.val[2] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[2], 0));
    m.val[3] = vmulq_n_f32(iMatrixLeft.val[0], vgetq_lane_f32(iMatrixRight.val[3], 0));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[0], 1));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[1], 1));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[2], 1));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[1], vgetq_lane_f32(iMatrixRight.val[3], 1));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[0], 2));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[1], 2));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[2], 2));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[2], vgetq_lane_f32(iMatrixRight.val[3], 2));

    m.val[0] = vmlaq_n_f32(m.val[0], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[0], 3));
    m.val[1] = vmlaq_n_f32(m.val[1], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[1], 3));
    m.val[2] = vmlaq_n_f32(m.val[2], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[2], 3));
    m.val[3] = vmlaq_n_f32(m.val[3], iMatrixLeft.val[3], vgetq_lane_f32(iMatrixRight.val[3], 3));

    return *(GLKMatrix4 *)&m;
}
```
It is easy to see that Apple's implementation of these operations takes a far from optimal approach: arguments are passed by value and variables are copied around. It looks pretty slow; at least in a debug build it will be. Let's see how this code shows itself under profiling.
Assembly approach:
```c
inline void Matrix4ByVec4(float32x4x4_t* __restrict__ mat,
                          const float32x4_t* __restrict__ vec,
                          float32x4_t* __restrict__ result)
{
    asm
    (
        "vldmia %0, { d24-d31 } \n\t"   // load the matrix columns into q12-q15
        "vld1.32 {q1}, [%1]     \n\t"   // load the vector into q1 (d2:d3)
        "vmul.f32 q0, q12, d2[0]\n\t"
        "vmla.f32 q0, q13, d2[1]\n\t"
        "vmla.f32 q0, q14, d3[0]\n\t"
        "vmla.f32 q0, q15, d3[1]\n\t"
        "vstmia %2, { q0 }"
        :
        : "r" (mat), "r" (vec), "r" (result)
        : "memory", "q0", "q1", "q12", "q13", "q14", "q15"
    );
}

inline void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                             const float32x4x4_t* __restrict__ m2,
                             float32x4x4_t* __restrict__ r)
{
    asm
    (
        "vldmia %1, { q0-q3 } \n\t"     // load m2 into q0-q3
        "vldmia %2, { q8-q11 }\n\t"     // load m1 into q8-q11
        "vmul.f32 q12, q8, d0[0]\n\t"
        "vmul.f32 q13, q8, d2[0]\n\t"
        "vmul.f32 q14, q8, d4[0]\n\t"
        "vmul.f32 q15, q8, d6[0]\n\t"
        "vmla.f32 q12, q9, d0[1]\n\t"
        "vmla.f32 q13, q9, d2[1]\n\t"
        "vmla.f32 q14, q9, d4[1]\n\t"
        "vmla.f32 q15, q9, d6[1]\n\t"
        "vmla.f32 q12, q10, d1[0]\n\t"
        "vmla.f32 q13, q10, d3[0]\n\t"
        "vmla.f32 q14, q10, d5[0]\n\t"
        "vmla.f32 q15, q10, d7[0]\n\t"
        "vmla.f32 q12, q11, d1[1]\n\t"
        "vmla.f32 q13, q11, d3[1]\n\t"
        "vmla.f32 q14, q11, d5[1]\n\t"
        "vmla.f32 q15, q11, d7[1]\n\t"
        "vstmia %0, { q12-q15 }"
        :
        : "r" (r), "r" (m2), "r" (m1)
        : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
          "q12", "q13", "q14", "q15"
    );
}
```
For a person not familiar with assembly everything looks pretty scary. I am such a person myself: NEON assembly is the only assembly I can read. But in fact everything here is simple:
- q0-q15 are the NEON registers (each q register is also addressable as a pair of d registers, so q1 is d2:d3);
- vldmia / vld1.32 are load instructions;
- vstmia stores to memory;
- vmul.f32 / vmla.f32 are multiply and multiply-accumulate.
The intrinsics approach:
```c
inline void Matrix4ByVec4(float32x4x4_t* __restrict__ mat,
                          const float32x4_t* __restrict__ vec,
                          float32x4_t* __restrict__ result)
{
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}

inline void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                             const float32x4x4_t* __restrict__ m2,
                             float32x4x4_t* __restrict__ r)
{
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
}
```
This is almost the same code as in GLKMath, with minor differences. Explanations:
- vmulq_n_f32 multiplies a vector by a scalar;
- vgetq_lane_f32 extracts a single scalar lane from a vector;
- vmlaq_n_f32 multiplies by a scalar and accumulates.

This code is simply the assembly mirrored into intrinsics. Let's see how it shows itself in comparison.
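For completeness, a usage sketch showing how these two helpers plug together in the per-sprite loop (projection, modelview, quadVerts, and transformedVerts are assumed names):

```c
float32x4x4_t mvp;
Matrix4ByMatrix4(&projection, &modelview, &mvp);  // mvp = projection * modelview

// Transform the four vertices of one quad.
for (int v = 0; v < 4; ++v)
    Matrix4ByVec4(&mvp, &quadVerts[v], &transformedVerts[v]);
```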
I ran the test on an iPod touch 4. The table contains the profiling results for the Update function:
| Approach | Execution time, ms | CPU load, % |
|---|---|---|
| FPU | 6058 + 5067* | 35-38 |
| GLKMath | 2789 | 20-23 |
| Assembly | 5304 | 23-25 |
| Intrinsics | 2803 | 18-20 |
\* In the Instruments screenshot you can see that the Matrix4ByMatrix4 function was not inlined.

Here is another tip: aggressively inline your performance-critical code. Prefer `__attribute__((always_inline))` to the plain `inline` in such cases.
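For example, this is roughly how the declaration of Matrix4ByVec4 changes with forced inlining (a sketch; the attribute syntax is the standard GCC/Clang one):

```c
// The compiler must now inline this function wherever it is called.
__attribute__((always_inline)) inline void Matrix4ByVec4(
    float32x4x4_t* __restrict__ mat,
    const float32x4_t* __restrict__ vec,
    float32x4_t* __restrict__ result);
```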
Updated results table:
| Approach | Execution time, ms | CPU load, % |
|---|---|---|
| FPU forceinlined | 6209 | 25-28 |
| GLKMath | 2789 | 20-23 |
| Assembly | 5304 | 23-25 |
| Intrinsics | 2803 | 18-20 |
Forced inlining gave a very good performance boost! Now let's see how code auto-vectorization shows itself. All we need is to add `-mllvm -vectorize -mllvm -bb-vectorize-aligned-only` to Other C Flags in the project settings.
Final results table:
| Approach | Execution time, ms | Execution time (vectorized), ms | CPU load, % | CPU load (vectorized), % |
|---|---|---|---|---|
| FPU forceinlined | 6209 | 5028 | 25-28 | 22-24 |
| GLKMath | 2789 | 2776 | 20-23 | 20-23 |
| Assembly | 5304 | 5291 | 23-25 | 22-24 |
| Intrinsics | 2803 | 2789 | 18-20 | 18-20 |
Quite strange results can be observed for assembly versus intrinsics: the code is essentially the same, but the results differ dramatically, by almost 2 times! The answer lies in the assembly listing (those who wish can look for themselves). With hand-written assembly, the listing contains exactly what we wrote. With intrinsics, the compiler optimized the code further. And the seemingly slow GLKMath code was optimized by the compiler so well that it matched the execution time of the hand-written intrinsics.
Results in the form of screenshots.

It is time to sum up. Several conclusions can be drawn:
- The engineers on the LLVM team did a great job: the compiler generates well-optimized code from intrinsics. I ran a similar test more than a year ago, when the only compiler in Xcode was GCC 4.2, and it gave a very poor result: only a 10-15% performance gain over the FPU code. This is great news, because there is no need to learn assembly, and I am extremely happy about that!
- The Clang compiler can auto-vectorize code. For the programmer this is a productivity bonus that costs writing just four flags. What else can I say: it is a cool thing!
- NEON code gives a very good performance boost over plain C code: 2.22 times! After this optimization, vertex processing became faster than copying those same vertices over to the GPU! If you look at the memcpy disassembly, you can see that it also uses NEON code; the lack of a hardware prefetcher in the Cortex-A8 is apparently the reason the copy is slower (a software-prefetch sketch follows this list).
- Learning all these low-level things is worth your time, especially if your goal is to become a professional.
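As an aside of mine, not from the original tests: on a Cortex-A8 a copy loop can compensate for the missing hardware prefetcher with software prefetch hints, for example:

```c
#include <arm_neon.h>

// Copy n floats (n a multiple of 4), pulling future cache lines in early.
void CopyWithPrefetch(float* __restrict__ dst, const float* __restrict__ src,
                      unsigned n)
{
    for (unsigned i = 0; i < n; i += 4)
    {
        __builtin_prefetch(src + i + 64);        // hint: prefetch ~256 bytes ahead
        vst1q_f32(dst + i, vld1q_f32(src + i));  // 16-byte NEON load/store
    }
}
```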
Links
- www.arm.com/products/processors/technologies/neon.php
- blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores
- code.google.com/p/math-neon
- llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf
- Demo project