📜 ⬆️ ⬇️

ARM NEON skinning

What is it?


What is ARM NEON? - ARM® NEON ™ is a SIMD engine ... - in other words, it is an extended set of instructions like x86 CPU SSE / SSE2 but for processors with an ARM architecture.

What for?


Everything was fine until I added support for the FSAA. After that, the FPS sank lower than 15.
After optimization, I again had about 25 FPS. But in memory, one function sat down that consumed 10% of the time per frame in which I did not already know what could be optimized.

Thanks to one friend of mine, who from time to time asked a question like “Would you like to use NEON in your engine”, I decided (with his support) to rewrite this function on NEON.
')

Original C code for skinning ( Matrix palette skinnig ).



Structures:

// ready to use with glSubData for vertex buffer struct PN { Math::Vec3f p; Math::Vec3f n; }; 


Transformation of one weight:

 forceinline void transformPointNormal4x3Weight_NoW(const Matrix44f& mat,const Vec3f& inV, const Vec3f& inN, BaseRenderScene::PN& outPN) { outPN.p.vec[0] = (inV.vec[0]*mat.mat[0][0] + inV.vec[1]*mat.mat[1][0] + inV.vec[2]*mat.mat[2][0] + mat.mat[3][0]); outPN.n.vec[0] = (inN.vec[0]*mat.mat[0][0] + inN.vec[1]*mat.mat[1][0] + inN.vec[2]*mat.mat[2][0]); outPN.p.vec[1] = (inV.vec[0]*mat.mat[0][1] + inV.vec[1]*mat.mat[1][1] + inV.vec[2]*mat.mat[2][1] + mat.mat[3][1]); outPN.n.vec[1] = (inN.vec[0]*mat.mat[0][1] + inN.vec[1]*mat.mat[1][1] + inN.vec[2]*mat.mat[2][1]); outPN.p.vec[2] = (inV.vec[0]*mat.mat[0][2] + inV.vec[1]*mat.mat[1][2] + inV.vec[2]*mat.mat[2][2] + mat.mat[3][2]); outPN.n.vec[2] = (inN.vec[0]*mat.mat[0][2] + inN.vec[1]*mat.mat[1][2] + inN.vec[2]*mat.mat[2][2]); } forceinline void transformPointNormal4x3Weight(const Matrix44f& mat,const Vec3f& inV, const Vec3f& inN, BaseRenderScene::PN& outPN,float w ) { outPN.p.vec[0] = (inV.vec[0]*mat.mat[0][0] + inV.vec[1]*mat.mat[1][0] + inV.vec[2]*mat.mat[2][0] + mat.mat[3][0])*w; outPN.n.vec[0] = (inN.vec[0]*mat.mat[0][0] + inN.vec[1]*mat.mat[1][0] + inN.vec[2]*mat.mat[2][0])*w; outPN.p.vec[1] = (inV.vec[0]*mat.mat[0][1] + inV.vec[1]*mat.mat[1][1] + inV.vec[2]*mat.mat[2][1] + mat.mat[3][1])*w; outPN.n.vec[1] = (inN.vec[0]*mat.mat[0][1] + inN.vec[1]*mat.mat[1][1] + inN.vec[2]*mat.mat[2][1])*w; outPN.p.vec[2] = (inV.vec[0]*mat.mat[0][2] + inV.vec[1]*mat.mat[1][2] + inV.vec[2]*mat.mat[2][2] + mat.mat[3][2])*w; outPN.n.vec[2] = (inN.vec[0]*mat.mat[0][2] + inN.vec[1]*mat.mat[1][2] + inN.vec[2]*mat.mat[2][2])*w; } forceinline void transformPointNormal4x3AddWeighted(const Matrix44f& mat,const Vec3f& inV, const Vec3f& inN, BaseRenderScene::PN& outPN,float w ) { outPN.p.vec[0] += (inV.vec[0]*mat.mat[0][0] + inV.vec[1]*mat.mat[1][0] + inV.vec[2]*mat.mat[2][0] + mat.mat[3][0])*w; outPN.n.vec[0] += (inN.vec[0]*mat.mat[0][0] + inN.vec[1]*mat.mat[1][0] + inN.vec[2]*mat.mat[2][0])*w; outPN.p.vec[1] += (inV.vec[0]*mat.mat[0][1] + inV.vec[1]*mat.mat[1][1] + inV.vec[2]*mat.mat[2][1] + mat.mat[3][1])*w; outPN.n.vec[1] += (inN.vec[0]*mat.mat[0][1] + inN.vec[1]*mat.mat[1][1] + inN.vec[2]*mat.mat[2][1])*w; outPN.p.vec[2] += (inV.vec[0]*mat.mat[0][2] + inV.vec[1]*mat.mat[1][2] + inV.vec[2]*mat.mat[2][2] + mat.mat[3][2])*w; outPN.n.vec[2] += (inN.vec[0]*mat.mat[0][2] + inN.vec[1]*mat.mat[1][2] + inN.vec[2]*mat.mat[2][2])*w; } 


Transformation of one vertex:

 const Vec3f& vx = pVerticies[v]; const Vec3f& vxN = pNormals[v]; float w = pVertexWeight[v].vec[0]; int boneIndex = pVertexBones[v].vec[0]; const Matrix44f& boneTM = pBoneTMList[boneIndex]; if( wCount==1 ) { transformPointNormal4x3Weight_NoW(boneTM,vx,vxN,skinTempPN[v]); } else { // 1st vertex without add transformPointNormal4x3Weight_N(boneTM,vx,vxN,skinTempPN[v],w); for(size_t i=1;i<wCount;i++) { // other verticies w = pVertexWeight[v].vec[i]; boneIndex = pVertexBones[v].vec[i]; const Matrix44f& boneTM = pBoneTMList[boneIndex]; transformPointNormal4x3AddWeighted_N(boneTM,vx,vxN,skinTempPN[v],w); } } 

A knowledgeable person will immediately notice that I store the number of non-zero weights for each vertex. In my case, about 30% of the peaks were with the same weight, which allowed us to win a little time.

ASM with NEON (xCode style)


The code below is asm / C code optimized using ARM NEON.
Some nuances:


 #if defined(__ARM_NEON__) #define USE_NEON #endif #if defined(USE_NEON) #ifdef __thumb__ #error "This file should be compiled in ARM mode only." // Note in Xcode, right click file, Get Info->Build, Other compiler flags = "-marm" #endif #define OP "q0" #define OPS0 "s0" #define OPS1 "s1" #define OPS2 "s2" #define ON "q1" #define ONS0 "s4" #define ONS1 "s5" #define ONS2 "s6" #define IP "q2" #define IN "q3" #define IPX "d4[0]" #define IPY "d4[1]" #define IPZ "d5[0]" #define IPW "d5[1]" #define INX "d6[0]" #define INY "d6[1]" #define INZ "d7[0]" #define INW "d7[1]" #define WQ "q4" #define W0D "d8[0]" #define W1D "d8[1]" #define W2D "d9[0]" #define W3D "d9[1]" #define QM0 q8 #define QM1 q9 #define QM2 q10 #define QM3 q11 #define QT "q14" // outP = mt.row0*pos + mt.row1*pos + mt.row2*pos + mt.row3*pos #define mat_pos(_RES) \ "vmul.f32 " _RES ", q8, " IPX "\n\t" \ "vmla.f32 " _RES ", q9, " IPY "\n\t" \ "vmla.f32 " _RES ", q10, " IPZ "\n\t" \ "vmla.f32 " _RES ", q11, " IPW "\n\t" #define mat_pos_w_set(_RES,_QT,_WD) \ mat_pos(_QT) \ "vmul.f32 " _RES ", " _QT ", " _WD "\n\t" #define mat_pos_w_add(_RES,_QT,_WD) \ mat_pos(_QT) \ "vmla.f32 " _RES ", " _QT ", " _WD "\n\t" // outN = mt.row0*nor + mt.row1*nor + mt.row2*nor #define mat_nor(_RES) \ "vmul.f32 " _RES ", q8, " INX "\n\t" \ "vmla.f32 " _RES ", q9, " INY "\n\t" \ "vmla.f32 " _RES ", q10, " INZ "\n\t" #define mat_nor_w_set(_RES,_QT,_WD) \ mat_nor(_QT) \ "vmul.f32 " _RES ", " _QT ", " _WD "\n\t" #define mat_nor_w_add(_RES,_QT,_WD) \ mat_nor(_QT) \ "vmla.f32 " _RES ", " _QT ", " _WD "\n\t" #define STORE3_P3N3(_R) \ "fsts "OPS0",[" _R "] \n\t" \ "fsts "OPS1",[" _R ",#4] \n\t" \ "fsts "OPS2",[" _R ",#8] \n\t" \ "fsts "ONS0",[" _R ",#12] \n\t" \ "fsts "ONS1",[" _R ",#16] \n\t" \ "fsts "ONS2",[" _R ",#20] \n\t" #define mat_load(_R) \ "vldmia " _R ", { q8-q11 } \n\t" __attribute__((always_inline)) void clalcSkin1( const Matrix44f* mat0, const Vec4f* posnorm, Vec3f* outPN) { // asm volatile ( // q4-q7 need to be preserved "vldmia %1, { " IP " - " IN " } \n\t" // pos norm // OP p temp // ON n temp // // mat0 mat_load("%0") mat_pos(OP) mat_nor(ON) STORE3_P3N3("%2") : // no output : "r" (mat0), "r" (posnorm), "r" (outPN) : "memory", IP, IN, WQ, QT, OP, ON, "q8", "q9", "q10", "q11" //clobber ); } __attribute__((always_inline)) void clalcSkin2( const Matrix44f* mat0, const Matrix44f* mat1, const Vec4f* posnorm, const Vec4f* weight, Vec3f* outPN) { // asm volatile ( // q4-q7 need to be preserved "vmov q15," WQ "\n\t" // "vldmia %2, { " IP " - " IN " } \n\t" // pos norm "vldmia %3, { " WQ " } \n\t" // weights // QT intermediate temp // OP p temp // ON n temp // // mat0 mat_load("%0") mat_pos_w_set(OP,QT,W0D) mat_nor_w_set(ON,QT,W0D) // mat 1 mat_load("%1") mat_pos_w_add(OP,QT,W1D) mat_nor_w_add(ON,QT,W1D) // output pos3f,norm3f STORE3_P3N3("%4") // restore q4 (WQ) "vmov " WQ ", q15 \n\t" : // no output : "r" (mat0), "r" (mat1), "r" (posnorm), "r" (weight), "r" (outPN) : "memory", IP, IN, WQ, QT , OP, ON, "q8", "q9", "q10", "q11", "q15" //clobber ); } __attribute__((always_inline)) void clalcSkin3( const Matrix44f* mat0, const Matrix44f* mat1, const Matrix44f* mat2, const Vec4f* posnorm, const Vec4f* weight, Vec3f* outPN) { // asm volatile ( // q4-q7 need to be preserved "vmov q15," WQ "\n\t" // "vldmia %3, { " IP " - " IN " } \n\t" // pos norm "vldmia %4, { " WQ " } \n\t" // weights // QT intermediate temp // OP p temp // ON n temp // // mat0 mat_load("%0") mat_pos_w_set(OP,QT,W0D) mat_nor_w_set(ON,QT,W0D) // mat 1 mat_load("%1") mat_pos_w_add(OP,QT,W1D) mat_nor_w_add(ON,QT,W1D) // mat 2 mat_load("%2") mat_pos_w_add(OP,QT,W2D) mat_nor_w_add(ON,QT,W2D) // output pos,normal STORE3_P3N3("%5") // restore q4 (WQ) "vmov " WQ ", q15 \n\t" : // no output : "r" (mat0), "r" (mat1), "r" (mat2),"r" (posnorm), "r" (weight), "r" (outPN) : "memory", IP, IN, WQ, QT, OP, ON, "q8", "q9", "q10", "q11", "q15" //clobber ); } __attribute__((always_inline)) void clalcSkin4( const Matrix44f* mat0, const Matrix44f* mat1, const Matrix44f* mat2, const Matrix44f* mat3, const Vec4f* posnorm, const Vec4f* weight, Vec3f* outPN) { // asm volatile ( // q4-q7 need to be preserved "vmov q15," WQ "\n\t" // "vldmia %4, { " IP " - " IN " } \n\t" // pos norm "vldmia %5, { " WQ " } \n\t" // weights // QT intermediate temp // OP p temp // ON n temp // // mat0 mat_load("%0") mat_pos_w_set(OP,QT,W0D) mat_nor_w_set(ON,QT,W0D) // mat 1 mat_load("%1") mat_pos_w_add(OP,QT,W1D) mat_nor_w_add(ON,QT,W1D) // mat 2 mat_load("%2") mat_pos_w_add(OP,QT,W2D) mat_nor_w_add(ON,QT,W2D) // mat 3 mat_load("%3") mat_pos_w_add(OP,QT,W3D) mat_nor_w_add(ON,QT,W3D) // output pos,normal STORE3_P3N3("%6") // restore q4 (WQ) "vmov " WQ ", q15\n\t" : // no output : "r" (mat0), "r" (mat1), "r" (mat2), "r" (mat3), "r" (posnorm), "r" (weight), "r" (outPN) : "memory", IP, IN, WQ, QT, OP, ON, "q8", "q9", "q10", "q11", "q15" //clobber ); } 


results


I did not do any synthetic tests - I checked everything on the working draft.

502ms (c ++) versus 307ms (arm neon) at ~ 10 second interval for iPhone 4 (39% faster than C).

Questions and Answers


Try to answer a few questions at once?

Q: Why isn't it described how ARM NEON works and what is it?
A: There is no sense to retell specs.

Q: Why not use a shader?
A: OpenGL 1.1

Q: Why not use OpenGL 2.0+?
A: Only after porting to Windows Phone 8 (there is no FF there and at that very moment I will add “shader” to the engine and then already on GL 2.0).

Q: Why not use GL_OES_matrix_palette for FF?
A: It is necessary to beat the model into groups of 11 (for iphone) matrices and there is no time for this - perhaps in the future.

Q: And where can I find out more and preferably with examples?
A: I advise you to look here . Watch out there LGPL.

Q: And how much did it take?
A: A week is exactly the reason for writing the article (if someone saves time, I will be happy).

Q: I did not understand anything, but can I read more?
A: You can (depending on the comments), but I tried to write very clear code.

Ps. Errors in lichku.

Source: https://habr.com/ru/post/153015/


All Articles