
Accelerating the multiplication of float 4x4 matrices using SIMD

Many years have passed since I first encountered the MMX, SSE, and later AVX instructions on Intel processors. At the time they seemed like some kind of magic compared to plain x86 assembler, which had long since become commonplace. They hooked me so much that a couple of years ago I got the idea of writing my own software renderer for one well-known game. What pushed me was the performance these instructions promised. At some point I even thought about writing about it, but the text turned out to be much harder than the code.

Back then I wanted to avoid compatibility problems across different processors: I wanted to be able to test my renderer on as many machines as possible. I still have friends with old AMD processors whose ceiling is SSE3, so at the time I decided to limit myself to SSE3 at most. That is how a vector math library appeared, implemented almost entirely in SSE with occasional use of instructions up to SSE3. At some point, however, I wondered what maximum performance I could squeeze out of the processor for a number of critical vector math operations. One such operation is the multiplication of float 4x4 matrices.


Honestly, I decided to do this mostly for fun. The matrix multiplication for my software renderer has long been written in SSE and it seemed good enough for me. But then I decided to see how few clock cycles I could, in principle, squeeze out of multiplying two float4x4 matrices. My current SSE version takes 16 cycles. True, after the recent switch to IACA 3 it started showing 19, since the analyzer now reports 1* instead of 0* for some instructions; apparently the earlier number was simply a flaw in the analyzer.

Briefly about the utilities used


To analyze the code I use the well-known Intel Architecture Code Analyzer (IACA). For the analysis I pick the Haswell (HSW) architecture as the minimal one with AVX2 support. When writing the code, the Intel Intrinsics Guide and the Intel optimization manual are also very handy.
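For reference on how the numbers below are obtained: IACA analyzes the region of machine code between special marker macros from its iacaMarks.h header. A minimal sketch of how such a region can be marked up (the macro names are the ones I remember from the header shipped with IACA 3; the function body is just a placeholder):

#include "iacaMarks.h"   // ships with IACA
#include <xmmintrin.h>

void analyzed_fragment(__m128* const r, __m128 const* const m, __m128 const* const n)
{
#ifdef _M_X64
    IACA_VC64_START;     // the x64 VC++ compiler has no inline assembler, so the VC64 macros are used
#else
    IACA_START;
#endif
    r[0] = _mm_mul_ps(m[0], n[0]);   // ...the code under analysis goes here...
#ifdef _M_X64
    IACA_VC64_END;
#else
    IACA_END;
#endif
}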

For building I use MSVS 2017 Community from the console. I write the code with intrinsics: you write it once and it usually works right away on different platforms. Besides, the VC++ x64 compiler does not support the inline assembler, and I want this to work under x64.

Since this article is already somewhat beyond the beginner level of SIMD programming, I will not describe registers and instructions, draw (or borrow) pretty pictures, or try to teach SIMD programming from scratch. The Intel site is full of excellent, clear and detailed documentation.

I wanted to keep everything simple... and it turned out as always


This is where the part begins that complicates the implementation as well as the article, so let me dwell on it briefly. I am not interested in writing matrix multiplication for the standard row-major layout of elements; whoever needed that has already learned it at university or on their own. Our goal is performance. First, I switched to column-major storage long ago. My software renderer is built around the OpenGL API, and to avoid unnecessary transpositions I started storing elements by columns. This also matters because matrix-by-matrix multiplication itself is not that critical: you multiply 2-5-10 matrices, and that's it. Then you multiply the finished matrix by thousands, even millions, of vertices, and that operation is far more critical. You could, of course, transpose every time, but why, if it can be avoided.

But back to the matrices. Column-major storage is settled, yet it can be complicated a bit further. It is more convenient for me to store the elements of vectors and matrix rows in SIMD registers so that x sits in the highest float (index 3) and w in the lowest (index 0). Here, apparently, another digression is needed on why.

The thing is that in the software renderer the w component of a vector has to be manipulated most often (1/z is stored there), and it is very convenient to do this through the _ss variants of operations (which act only on the lowest float of an xmm register) without touching x, y, z. So in the SSE register the vector is stored in the natural order x, y, z, w, while in memory it lies in the reverse order w, z, y, x.
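To illustrate with a minimal sketch of my own (not a fragment of the renderer): with w sitting in the lowest float of the register, touching only w, for example scaling the stored 1/z, is a single _ss operation:

#include <xmmintrin.h>

// assumption for the illustration: v holds {x, y, z, w} with w in the lowest float
inline __m128 scale_w(__m128 const v, float const k)
{
    return _mm_mul_ss(v, _mm_set_ss(k));   // multiplies only the lowest float; x, y, z pass through unchanged
}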

Further on, all multiplication variants are implemented as separate functions. This is done because I substitute the appropriate variant depending on which instruction sets are supported, as sketched below. This approach is well described here.
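The dispatch itself can be as simple as a function pointer chosen once at startup. A rough sketch under my own naming (the two variants referenced here are defined further down in the article; a production check would also verify OSXSAVE/xgetbv before trusting the AVX bit):

#include <intrin.h>      // __cpuid on MSVC
#include <xmmintrin.h>

void mul_mtx4_mtx4_sse_v2(__m128* const, __m128 const* const, __m128 const* const);
void mul_mtx4_mtx4_avx_fma(__m128* const, __m128 const* const, __m128 const* const);

using mul_mtx4_fn = void (*)(__m128* const, __m128 const* const, __m128 const* const);

mul_mtx4_fn select_mul_mtx4()
{
    int info[4];
    __cpuid(info, 1);
    bool const has_fma = (info[2] & (1 << 12)) != 0;   // CPUID.1:ECX, FMA bit
    bool const has_avx = (info[2] & (1 << 28)) != 0;   // CPUID.1:ECX, AVX bit
    return (has_avx && has_fma) ? mul_mtx4_mtx4_avx_fma : mul_mtx4_mtx4_sse_v2;
}

static mul_mtx4_fn const mul_mtx4 = select_mul_mtx4();   // later simply call mul_mtx4(r, m, n)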

We implement basic functionality


Loop-based multiplication, row-major order


Option for the row-major element layout
for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[i][j] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[i][j] += m[i][k] * n[k][j];
        }
    }
}


Everything is simple and clear: for each element we do 4 multiplications and 3 additions. In total that is 64 multiplications and 48 additions, not counting the reads of the source elements.

In short, it's all rather sad. For the inner loop of this variant IACA reports 3.65 cycles for the x86 build and 2.97 for the x64 build. Don't ask why the numbers are fractional, I don't know; IACA 2.1 did not suffer from this. In any case, these figures should be multiplied by roughly 4 * 4 * 4 = 64. Even taking x64, that is about 192 cycles. Clearly this is a rough estimate, and I see no reason to evaluate this variant's performance more precisely.

Loop-based implementation, column-major order


Transposed matrix: swap the row and column indices
for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[j][i] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[j][i] += m[k][i] * n[j][k];
        }
    }
}


Loop-based multiplication, SIMD-oriented storage


Additionally, store the register contents in reverse order in memory
for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[j][3-i] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[j][3-i] += m[k][3-i] * n[j][3-k];
        }
    }
}


This implementation makes it somewhat easier to understand what is going on inside, but it is clearly not enough.

Helper classes


For ease of understanding, and for writing the reference and debugging code, it is convenient to implement a couple of helper classes. Nothing fancy, everything is purely for clarity. Note that implementing full-fledged vector and matrix classes is a separate, difficult topic that is beyond the scope of this article.

Vector and matrix classes
#include <immintrin.h>
#include <cmath>

struct alignas(sizeof(__m128)) vec4
{
    union
    {
        struct { float w, z, y, x; };   // reversed in memory, so x ends up in the highest float
        __m128 fmm;
        float arr[4];
    };

    vec4() {}
    vec4(float a, float b, float c, float d) : w(d), z(c), y(b), x(a) {}

    static bool equ(float const a, float const b, float t = .00001f)
    {
        return fabs(a - b) < t;
    }

    bool operator == (vec4 const& v) const
    {
        return equ(x, v.x) && equ(y, v.y) && equ(z, v.z) && equ(w, v.w);
    }
};

struct alignas(sizeof(__m256)) mtx4
{
    // elements are stored by columns, each column in reverse order in memory
    union
    {
        struct
        {
            float _30, _20, _10, _00,
                  _31, _21, _11, _01,
                  _32, _22, _12, _02,
                  _33, _23, _13, _03;
        };
        __m128 r[4];
        __m256 s[2];
        vec4 v[4];
    };

    // constructors
    mtx4() {}
    mtx4(
        float i00, float i01, float i02, float i03,
        float i10, float i11, float i12, float i13,
        float i20, float i21, float i22, float i23,
        float i30, float i31, float i32, float i33)
        : _00(i00), _01(i01), _02(i02), _03(i03)
        , _10(i10), _11(i11), _12(i12), _13(i13)
        , _20(i20), _21(i21), _22(i22), _23(i23)
        , _30(i30), _31(i31), _32(i32), _33(i33)
    {}

    // conversions to a pointer to the SSE registers
    operator __m128 const* () const { return r; }
    operator __m128* () { return r; }

    // comparison
    bool operator == (mtx4 const& m) const
    {
        return v[0]==m.v[0] && v[1]==m.v[1] && v[2]==m.v[2] && v[3]==m.v[3];
    }

    // constants
    static mtx4 identity()
    {
        return mtx4(
            1.f, 0.f, 0.f, 0.f,
            0.f, 1.f, 0.f, 0.f,
            0.f, 0.f, 1.f, 0.f,
            0.f, 0.f, 0.f, 1.f);
    }

    static mtx4 zero()
    {
        return mtx4(
            0.f, 0.f, 0.f, 0.f,
            0.f, 0.f, 0.f, 0.f,
            0.f, 0.f, 0.f, 0.f,
            0.f, 0.f, 0.f, 0.f);
    }
};


Reference function for tests


Since the accepted element order in the matrix complicates understanding quite a bit, a clear reference function will not hurt either: it will show in the further implementations that everything works correctly. We will compare all subsequent results against it.

To create it, we simply take the loop and unroll it.
void mul_mtx4_mtx4_unroll(__m128* const _r, __m128 const* const _m, __m128 const* const _n)
{
    mtx4 const& m = **reinterpret_cast<mtx4 const* const*>(&_m);
    mtx4 const& n = **reinterpret_cast<mtx4 const* const*>(&_n);
    mtx4& r = **reinterpret_cast<mtx4* const*>(&_r);

    r._00 = m._00*n._00 + m._01*n._10 + m._02*n._20 + m._03*n._30;
    r._01 = m._00*n._01 + m._01*n._11 + m._02*n._21 + m._03*n._31;
    r._02 = m._00*n._02 + m._01*n._12 + m._02*n._22 + m._03*n._32;
    r._03 = m._00*n._03 + m._01*n._13 + m._02*n._23 + m._03*n._33;

    r._10 = m._10*n._00 + m._11*n._10 + m._12*n._20 + m._13*n._30;
    r._11 = m._10*n._01 + m._11*n._11 + m._12*n._21 + m._13*n._31;
    r._12 = m._10*n._02 + m._11*n._12 + m._12*n._22 + m._13*n._32;
    r._13 = m._10*n._03 + m._11*n._13 + m._12*n._23 + m._13*n._33;

    r._20 = m._20*n._00 + m._21*n._10 + m._22*n._20 + m._23*n._30;
    r._21 = m._20*n._01 + m._21*n._11 + m._22*n._21 + m._23*n._31;
    r._22 = m._20*n._02 + m._21*n._12 + m._22*n._22 + m._23*n._32;
    r._23 = m._20*n._03 + m._21*n._13 + m._22*n._23 + m._23*n._33;

    r._30 = m._30*n._00 + m._31*n._10 + m._32*n._20 + m._33*n._30;
    r._31 = m._30*n._01 + m._31*n._11 + m._32*n._21 + m._33*n._31;
    r._32 = m._30*n._02 + m._31*n._12 + m._32*n._22 + m._33*n._32;
    r._33 = m._30*n._03 + m._31*n._13 + m._32*n._23 + m._33*n._33;
}


Here the classic algorithm is written out explicitly; it is hard to make a mistake (though still possible :-)). For it IACA reports: x86 - 69.95 cycles, x64 - 64 cycles. We will measure the speedup of the following variants against these 64 cycles.
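A minimal sketch of how such a comparison can be wired up with the helper classes above (my own test shape, not the harness from the sources):

bool test_mul_variant(void (*mul)(__m128* const, __m128 const* const, __m128 const* const))
{
    mtx4 const a(
         1.f,  2.f,  3.f,  4.f,
         5.f,  6.f,  7.f,  8.f,
         9.f, 10.f, 11.f, 12.f,
        13.f, 14.f, 15.f, 16.f);
    mtx4 const b = mtx4::identity();        // any matrix will do; identity keeps the expectation obvious

    mtx4 expected, result;
    mul_mtx4_mtx4_unroll(expected, a, b);   // reference result
    mul(result, a, b);                      // variant under test (uses the mtx4 -> __m128* conversions)
    return expected == result;
}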

SSE implementation


Classic SSE algorithm


Why classic? Because it has long existed in the FVec implementation shipped with MSVS. To begin with, let's write out how the matrix elements are laid out in the SSE registers. Here it already looks simpler: it is just the transposed matrix.

// element layout in the SIMD registers / in memory
00, 10, 20, 30   // m[0]
01, 11, 21, 31   // m[1]
02, 12, 22, 32   // m[2]
03, 13, 23, 33   // m[3]

Let's take the unrolled code of the variant above. As it stands, it is unfriendly to SSE. The first group of lines produces r._00, r._01, r._02, r._03, which in our layout are spread across the registers of the result matrix: we get a column of the SIMD representation, while we need a whole row, i.e. one register, at once. The accesses to m and n also look inconvenient for the computation. So let's rearrange the lines of the algorithm so that the result r comes out register by register.

// first group, the result goes into register r[0]
r00 = m00*n00 + m01*n10 + m02*n20 + m03*n30;
r10 = m10*n00 + m11*n10 + m12*n20 + m13*n30;
r20 = m20*n00 + m21*n10 + m22*n20 + m23*n30;
r30 = m30*n00 + m31*n10 + m32*n20 + m33*n30;

// second group, the result goes into register r[1]
r01 = m00*n01 + m01*n11 + m02*n21 + m03*n31;
r11 = m10*n01 + m11*n11 + m12*n21 + m13*n31;
r21 = m20*n01 + m21*n11 + m22*n21 + m23*n31;
r31 = m30*n01 + m31*n11 + m32*n21 + m33*n31;

// third group, the result goes into register r[2]
r02 = m00*n02 + m01*n12 + m02*n22 + m03*n32;
r12 = m10*n02 + m11*n12 + m12*n22 + m13*n32;
r22 = m20*n02 + m21*n12 + m22*n22 + m23*n32;
r32 = m30*n02 + m31*n12 + m32*n22 + m33*n32;

// fourth group, the result goes into register r[3]
r03 = m00*n03 + m01*n13 + m02*n23 + m03*n33;
r13 = m10*n03 + m11*n13 + m12*n23 + m13*n33;
r23 = m20*n03 + m21*n13 + m22*n23 + m23*n33;
r33 = m30*n03 + m31*n13 + m32*n23 + m33*n33;

Now it looks much better. What exactly do we see? Down the columns of each group of the algorithm we use the rows of the SIMD representation of matrix m, i.e. its registers:

m[0] = {00, 10, 20, 30}
m[1] = {01, 11, 21, 31}
m[2] = {02, 12, 22, 32}
m[3] = {03, 13, 23, 33}

each of which is multiplied by a single element of matrix n. For example, for the first group these elements are n._00, n._10, n._20, n._30. And for each group of algorithm lines these elements of n again lie in one register of the matrix n.

Then everything is simple: we take the registers of matrix m by index, and as for the elements of n, we take its register and use the shuffle instruction to broadcast the needed element into all 4 elements of a register, so it can be multiplied by a register of matrix m. For example, for the element n._00 (remember that its offset inside the register is index 3) it looks like this:
_mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(3, 3, 3, 3))

In simplified form, the algorithm looks like this:

// elements taken from n[0] = {00, 10, 20, 30}
r[0] = m[0]*n00 + m[1]*n10 + m[2]*n20 + m[3]*n30;
// elements taken from n[1] = {01, 11, 21, 31}
r[1] = m[0]*n01 + m[1]*n11 + m[2]*n21 + m[3]*n31;
// elements taken from n[2] = {02, 12, 22, 32}
r[2] = m[0]*n02 + m[1]*n12 + m[2]*n22 + m[3]*n32;
// elements taken from n[3] = {03, 13, 23, 33}
r[3] = m[0]*n03 + m[1]*n13 + m[2]*n23 + m[3]*n33;

Basic SSE implementation
void mul_mtx4_mtx4_sse_v1(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(0,0,0,0)))));
    r[1] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(0,0,0,0)))));
    r[2] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(0,0,0,0)))));
    r[3] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(0,0,0,0)))));
}


Now in the algorithm we replace each element of n with the corresponding shuffle, multiplication with _mm_mul_ps, addition with _mm_add_ps, and everything is ready. It works. The code, however, looks much worse than the algorithm itself did. For this code IACA reports: x86 - 18.89, x64 - 16 cycles. That is 4 times faster than the previous variant: an SSE register holds 4 floats, so the dependence is almost linear.

Tidying up the SSE implementation


Still, the code looks awful. Let's try to improve that by writing a little syntactic sugar.

Operators and helpers
// operators for readability (it is better to hide them in a namespace)
__m128 operator + (__m128 const a, __m128 const b) { return _mm_add_ps(a, b); }
__m128 operator - (__m128 const a, __m128 const b) { return _mm_sub_ps(a, b); }
__m128 operator * (__m128 const a, __m128 const b) { return _mm_mul_ps(a, b); }
__m128 operator / (__m128 const a, __m128 const b) { return _mm_div_ps(a, b); }

// _mm_shuffle_ps(u, v, _MM_SHUFFLE(3,2,1,0)) becomes shuf<3,2,1,0>(u, v)
template <int a, int b, int c, int d>
__m128 shuf(__m128 const u, __m128 const v)
{
    return _mm_shuffle_ps(u, v, _MM_SHUFFLE(a, b, c, d));
}

template <int a, int b, int c, int d>
__m128 shuf(__m128 const v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(a, b, c, d));
}

// broadcast of a single element
template <int i>
__m128 shuf(__m128 const u, __m128 const v)
{
    return _mm_shuffle_ps(u, v, _MM_SHUFFLE(i, i, i, i));
}

template <int i>
__m128 shuf(__m128 const v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(i, i, i, i));
}

// float shuffle through the integer shuffle instruction (a single source operand)
template <int a, int b, int c, int d>
__m128 shufd(__m128 const v)
{
    return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(a, b, c, d)));
}

template <int i>
__m128 shufd(__m128 const v)
{
    return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(i, i, i, i)));
}


The compiler inlines these functions perfectly well (although sometimes it just won't do it without __forceinline).

And the code turns into...
void mul_mtx4_mtx4_sse_v2(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = m[0]*shuf<3>(n[0]) + m[1]*shuf<2>(n[0]) + m[2]*shuf<1>(n[0]) + m[3]*shuf<0>(n[0]);
    r[1] = m[0]*shuf<3>(n[1]) + m[1]*shuf<2>(n[1]) + m[2]*shuf<1>(n[1]) + m[3]*shuf<0>(n[1]);
    r[2] = m[0]*shuf<3>(n[2]) + m[1]*shuf<2>(n[2]) + m[2]*shuf<1>(n[2]) + m[3]*shuf<0>(n[2]);
    r[3] = m[0]*shuf<3>(n[3]) + m[1]*shuf<2>(n[3]) + m[2]*shuf<1>(n[3]) + m[3]*shuf<0>(n[3]);
}


This is already much better and more readable. For it IACA produces approximately the expected result: x86 - 19 (why not fractional this time?), x64 - 16. Performance has effectively not changed, but the code is much prettier and clearer.

Small contribution to future optimization


Let's introduce one more improvement at the function level, based on an operation that appeared in hardware not so long ago: multiply-add (fma), fma(a, b, c) = a*b + c.

Implementing multiply-add
__m128 mad(__m128 const a, __m128 const b, __m128 const c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}


Why is this needed? First of all, for future optimization. For example, you can later replace mad in the finished code with a real fma through macros in whatever way you like. But we lay the foundation for that optimization now:

Option with multiply-add
void mul_mtx4_mtx4_sse_v3(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = mad(m[0], shuf<3>(n[0]), m[1]*shuf<2>(n[0])) + mad(m[2], shuf<1>(n[0]), m[3]*shuf<0>(n[0]));
    r[1] = mad(m[0], shuf<3>(n[1]), m[1]*shuf<2>(n[1])) + mad(m[2], shuf<1>(n[1]), m[3]*shuf<0>(n[1]));
    r[2] = mad(m[0], shuf<3>(n[2]), m[1]*shuf<2>(n[2])) + mad(m[2], shuf<1>(n[2]), m[3]*shuf<0>(n[2]));
    r[3] = mad(m[0], shuf<3>(n[3]), m[1]*shuf<2>(n[3])) + mad(m[2], shuf<1>(n[3]), m[3]*shuf<0>(n[3]));
}


IACA: x86 - 18.89, x64 - 16. Fractional again; IACA does give odd results sometimes. The code has not changed much, and is arguably even a bit worse to read, but optimization sometimes demands such sacrifices.
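As a side note, one possible way to later swap mad for a true fma without touching the call sites is a conditional alias; this is my own sketch, and the flag name is hypothetical:

#include <immintrin.h>

#if defined(USE_FMA)   // hypothetical build flag, defined only when FMA3 support is guaranteed
    #define MAD_PS(a, b, c) _mm_fmadd_ps((a), (b), (c))
#else
    #define MAD_PS(a, b, c) _mm_add_ps(_mm_mul_ps((a), (b)), (c))
#endif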

Switching to stores via _mm_stream


Various optimization guides recommend not churning the cache during bulk store operations. That usually makes sense when you are processing vertices, of which there are thousands or more. For matrices it is probably not that important, but let's add it anyway.

Streaming store option
void mul_mtx4_mtx4_sse_v4(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    _mm_stream_ps(&r[0].m128_f32[0],
        mad(m[0], shuf<3>(n[0]), m[1]*shuf<2>(n[0])) + mad(m[2], shuf<1>(n[0]), m[3]*shuf<0>(n[0])));
    _mm_stream_ps(&r[1].m128_f32[0],
        mad(m[0], shuf<3>(n[1]), m[1]*shuf<2>(n[1])) + mad(m[2], shuf<1>(n[1]), m[3]*shuf<0>(n[1])));
    _mm_stream_ps(&r[2].m128_f32[0],
        mad(m[0], shuf<3>(n[2]), m[1]*shuf<2>(n[2])) + mad(m[2], shuf<1>(n[2]), m[3]*shuf<0>(n[2])));
    _mm_stream_ps(&r[3].m128_f32[0],
        mad(m[0], shuf<3>(n[3]), m[1]*shuf<2>(n[3])) + mad(m[2], shuf<1>(n[3]), m[3]*shuf<0>(n[3])));
}


The cycle count has not changed at all. But, following the recommendations, we no longer touch the cache.
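One caveat worth keeping in mind (my own note, not from the original text): non-temporal stores are weakly ordered, so when a bulk of streamed results is handed over to another thread, the batch is normally finished with a store fence. A sketch of the bulk-vertex case those guides actually have in mind, with hypothetical names:

#include <cstddef>
#include <xmmintrin.h>

// transform `count` vertices by the column-stored matrix m; in/out must be 16-byte aligned
void transform_vertices(__m128* const out, __m128 const* const in, size_t const count, __m128 const* const m)
{
    for (size_t i = 0; i < count; ++i)
    {
        __m128 const v = in[i];
        __m128 res =          _mm_mul_ps(m[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));  // * x
        res = _mm_add_ps(res, _mm_mul_ps(m[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)))); // * y
        res = _mm_add_ps(res, _mm_mul_ps(m[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)))); // * z
        res = _mm_add_ps(res, _mm_mul_ps(m[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)))); // * w
        _mm_stream_ps(&out[i].m128_f32[0], res);   // bypass the cache for the bulk output
    }
    _mm_sfence();   // make the non-temporal stores visible before anyone reads them
}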

AVX implementation


Basic AVX option



Next we proceed to the next stage of optimization. An SSE register holds 4 floats, an AVX register already holds 8. So theoretically there is a chance to reduce the number of operations and raise performance, if not twofold then at least by a factor of 1.5. But something tells me the transition to AVX will not be that easy: will we manage to pull the data we need out of the doubled registers?

Let's try to figure it out. Once again I will write out the multiplication algorithm used above. We could skip this, but it is more convenient to work with the code when everything is nearby and you don't have to scroll half a page up.

// element layout in the registers: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
// the SSE algorithm:
r0 = m0*n00 + m1*n10 + m2*n20 + m3*n30
r1 = m0*n01 + m1*n11 + m2*n21 + m3*n31
r2 = m0*n02 + m1*n12 + m2*n22 + m3*n32
r3 = m0*n03 + m1*n13 + m2*n23 + m3*n33

At the output we expect to get the result as ymm = {r0:r1} and ymm = {r2:r3}. In the SSE version the algorithm was generalized over columns; now we also need to generalize it over rows. So acting exactly as in the SSE case will not work.

If we load matrix m into ymm registers, we get ymm = {m0:m1} and ymm = {m2:m3} respectively. Previously we only had matrix columns in a register; now a single register holds both columns and rows.

If we try to act as before, then ymm = {m0:m1} would have to be multiplied by a register of the form {n00, n00, n00, n00} : {n10, n10, n10, n10}. But n00 and n10 lie in the same register of n, and, judging by the available AVX instruction set, scattering them across the whole ymm like that is expensive: both shuffle and permute work separately on each of the two 4-float halves (the upper and lower xmm) inside a ymm register.

If we load ymm from matrix n instead, the elements n00 and n10 end up together in one of the two xmm halves of the ymm register: {n00, n10, n20, n30} : {n01, n11, n21, n31}. The index in the existing instructions runs from 0 to 3 and addresses a float only within one xmm half of the ymm register. Moving n10 cheaply from one half to the other will not work, and this trick would have to be repeated several times. We cannot afford to lose that many cycles; we have to come up with something else.
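A tiny standalone illustration of this per-lane behaviour (the values are arbitrary, the point is only to see where they land; build with AVX enabled):

#include <immintrin.h>
#include <cstdio>

int main()
{
    // _mm256_set_ps lists elements from the highest lane down to the lowest
    __m256 const v = _mm256_set_ps(7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
    __m256 const p = _mm256_permute_ps(v, 0xFF);   // take index 3 within *each* 128-bit half
    alignas(32) float out[8];
    _mm256_store_ps(out, p);
    for (float const f : out) std::printf("%g ", f);   // prints: 3 3 3 3 7 7 7 7
    std::printf("\n");                                 // element 3 of the low half never reaches the high half
}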

We used to generalize over columns, now over rows, so let's approach from a slightly different side. We need to obtain the result as {r0:r1}, which means the algorithm should be processed not line by line but two lines at a time. And here what was a minus of shuffle and permute becomes a plus for us. Let's look at what ends up in the ymm registers when we load matrix n.

n0n1 = {00, 10, 20, 30} : {01, 11, 21, 31}
n2n3 = {02, 12, 22, 32} : {03, 13, 23, 33}

Aha: in the two different xmm halves of the ymm register we have the elements 00 and 01. They can be broadcast in one go by the permute instruction into {_00, _00, _00, _00} : {_01, _01, _01, _01}, specifying the single index 3 for both xmm halves. This is exactly what we need, since these coefficients are used in different rows of the computation. Only now the ymm register they are multiplied by must hold {m0:m0}, i.e. the duplicated first register of matrix m.

So let's write the algorithm out in more detail. We load the doubled registers of matrix m into ymm:

mm[0] = {m0:m0}
mm[1] = {m1:m1}
mm[2] = {m2:m2}
mm[3] = {m3:m3}

And then we will calculate the multiplication as:

r0r1 = mm[0] * {n00,n00,n00,n00 : n01,n01,n01,n01} +   // permute<3,3,3,3>(n0n1)
       mm[1] * {n10,n10,n10,n10 : n11,n11,n11,n11} +   // permute<2,2,2,2>(n0n1)
       mm[2] * {n20,n20,n20,n20 : n21,n21,n21,n21} +   // permute<1,1,1,1>(n0n1)
       mm[3] * {n30,n30,n30,n30 : n31,n31,n31,n31}     // permute<0,0,0,0>(n0n1)

r2r3 = mm[0] * {n02,n02,n02,n02 : n03,n03,n03,n03} +   // permute<3,3,3,3>(n2n3)
       mm[1] * {n12,n12,n12,n12 : n13,n13,n13,n13} +   // permute<2,2,2,2>(n2n3)
       mm[2] * {n22,n22,n22,n22 : n23,n23,n23,n23} +   // permute<1,1,1,1>(n2n3)
       mm[3] * {n32,n32,n32,n32 : n33,n33,n33,n33}     // permute<0,0,0,0>(n2n3)

Rewrite more clearly:

r0r1 = mm[0]*n0n1<3,3,3,3> + mm[1]*n0n1<2,2,2,2> + mm[2]*n0n1<1,1,1,1> + mm[3]*n0n1<0,0,0,0>
r2r3 = mm[0]*n2n3<3,3,3,3> + mm[1]*n2n3<2,2,2,2> + mm[2]*n2n3<1,1,1,1> + mm[3]*n2n3<0,0,0,0>

Or in a simplified form:

r0r1 = mm[0]*n0n1<3> + mm[1]*n0n1<2> + mm[2]*n0n1<1> + mm[3]*n0n1<0>
r2r3 = mm[0]*n2n3<3> + mm[1]*n2n3<2> + mm[2]*n2n3<1> + mm[3]*n2n3<0>

It seems everything is clear.

It remains only to write the implementation
void mul_mtx4_mtx4_avx_v1(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 mm0 = _mm256_set_m128(m[0], m[0]);
    __m256 mm1 = _mm256_set_m128(m[1], m[1]);
    __m256 mm2 = _mm256_set_m128(m[2], m[2]);
    __m256 mm3 = _mm256_set_m128(m[3], m[3]);

    __m256 n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    __m256 y1 = _mm256_permute_ps(n0n1, 0xFF); // 3,3,3,3
    __m256 y2 = _mm256_permute_ps(n0n1, 0xAA); // 2,2,2,2
    __m256 y3 = _mm256_permute_ps(n0n1, 0x55); // 1,1,1,1
    __m256 y4 = _mm256_permute_ps(n0n1, 0x00); // 0,0,0,0
    y1 = _mm256_mul_ps(y1, mm0);
    y2 = _mm256_mul_ps(y2, mm1);
    y3 = _mm256_mul_ps(y3, mm2);
    y4 = _mm256_mul_ps(y4, mm3);
    y1 = _mm256_add_ps(y1, y2);
    y3 = _mm256_add_ps(y3, y4);
    y1 = _mm256_add_ps(y1, y3);

    __m256 n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    __m256 y5 = _mm256_permute_ps(n2n3, 0xFF);
    __m256 y6 = _mm256_permute_ps(n2n3, 0xAA);
    __m256 y7 = _mm256_permute_ps(n2n3, 0x55);
    __m256 y8 = _mm256_permute_ps(n2n3, 0x00);
    y5 = _mm256_mul_ps(y5, mm0);
    y6 = _mm256_mul_ps(y6, mm1);
    y7 = _mm256_mul_ps(y7, mm2);
    y8 = _mm256_mul_ps(y8, mm3);
    y5 = _mm256_add_ps(y5, y6);
    y7 = _mm256_add_ps(y7, y8);
    y5 = _mm256_add_ps(y5, y7);

    _mm256_stream_ps(&r[0].m128_f32[0], y1);
    _mm256_stream_ps(&r[2].m128_f32[0], y5);
}


Here are the curious numbers from IACA: x86 - 12.53, x64 - 12. Although, of course, I hoped for better. Something has been missed.

AVX optimization plus “syntactic sugar”


It seems the code above does not use AVX to its full potential. It turns out that instead of assembling two identical halves into a ymm register by hand we can use broadcast, which fills a ymm register with two copies of an xmm value. Along the way let's also add a little "syntactic sugar" for the AVX functions.

Improved AVX implementation
__m256 operator + (__m256 const a, __m256 const b) { return _mm256_add_ps(a, b); }
__m256 operator - (__m256 const a, __m256 const b) { return _mm256_sub_ps(a, b); }
__m256 operator * (__m256 const a, __m256 const b) { return _mm256_mul_ps(a, b); }
__m256 operator / (__m256 const a, __m256 const b) { return _mm256_div_ps(a, b); }

template <int i>
__m256 perm(__m256 const v)
{
    return _mm256_permute_ps(v, _MM_SHUFFLE(i, i, i, i));
}

template <int a, int b, int c, int d>
__m256 perm(__m256 const v)
{
    return _mm256_permute_ps(v, _MM_SHUFFLE(a, b, c, d));
}

template <int i, int j>
__m256 perm(__m256 const v)
{
    return _mm256_permutevar_ps(v, _mm256_set_epi32(i, i, i, i, j, j, j, j));
}

template <int a, int b, int c, int d, int e, int f, int g, int h>
__m256 perm(__m256 const v)
{
    return _mm256_permutevar_ps(v, _mm256_set_epi32(a, b, c, d, e, f, g, h));
}

__m256 mad(__m256 const a, __m256 const b, __m256 const c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

void mul_mtx4_mtx4_avx_v2(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 const mm[]
    {
        _mm256_broadcast_ps(m+0),
        _mm256_broadcast_ps(m+1),
        _mm256_broadcast_ps(m+2),
        _mm256_broadcast_ps(m+3)
    };

    __m256 const n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    _mm256_stream_ps(&r[0].m128_f32[0],
        mad(perm<3>(n0n1), mm[0], perm<2>(n0n1)*mm[1]) +
        mad(perm<1>(n0n1), mm[2], perm<0>(n0n1)*mm[3]));

    __m256 const n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    _mm256_stream_ps(&r[2].m128_f32[0],
        mad(perm<3>(n2n3), mm[0], perm<2>(n2n3)*mm[1]) +
        mad(perm<1>(n2n3), mm[2], perm<0>(n2n3)*mm[3]));
}


And here the results get more interesting. IACA gives: x86 - 10, x64 - 8.58, which looks much better, though still not a factor of 2.

AVX + FMA option (final)


Let's make one more attempt. Now it is logical to recall the FMA instruction set, which was added to processors after AVX, and replace the separate mul + add pairs with a single operation. We still keep one plain multiplication instruction, though, to give the compiler more room for optimization and the processor a chance to execute the multiplications in parallel. I usually look at the generated assembly to see which option comes out better.

In this case we need to compute a*b + c*d + e*f + g*h. It can be done head-on: fma(a, b, fma(c, d, fma(e, f, g*h))). But, as we can see, no operation there can start until the previous one has finished, so we cannot exploit the paired multiplications that the SIMD pipeline allows. If we slightly rearrange the computation into fma(a, b, c*d) + fma(e, f, g*h), we see that it can be parallelized: first two independent multiplications, then two independent fma operations.

AVX + FMA implementation
__m256 fma(__m256 const a, __m256 const b, __m256 const c)
{
    return _mm256_fmadd_ps(a, b, c);
}

void mul_mtx4_mtx4_avx_fma(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 const mm[]
    {
        _mm256_broadcast_ps(m + 0),
        _mm256_broadcast_ps(m + 1),
        _mm256_broadcast_ps(m + 2),
        _mm256_broadcast_ps(m + 3)
    };

    __m256 const n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    _mm256_stream_ps(&r[0].m128_f32[0],
        fma(perm<3>(n0n1), mm[0], perm<2>(n0n1)*mm[1]) +
        fma(perm<1>(n0n1), mm[2], perm<0>(n0n1)*mm[3]));

    __m256 const n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    _mm256_stream_ps(&r[2].m128_f32[0],
        fma(perm<3>(n2n3), mm[0], perm<2>(n2n3)*mm[1]) +
        fma(perm<1>(n2n3), mm[2], perm<0>(n2n3)*mm[3]));
}


IACA: x86 - 9.21, x64 - 8. Now that is quite good. Someone will probably say it can be done even better, but I don't know how.

Benchmarks


Right away I note that these figures should not be taken as the ultimate truth. Even within a fixed test they drift within certain limits, and they certainly behave differently on different platforms. Whatever you optimize, take measurements for your specific case.
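For completeness, a bare-bones sketch of the kind of cycle measurement meant here; this is not the exact harness from the sources, and a serious benchmark would add warm-up, repetitions and taking the minimum:

#include <intrin.h>   // __rdtsc / __rdtscp on MSVC
#include <xmmintrin.h>

// hypothetical micro-benchmark: average cycles per call over `iters` iterations
double measure_cycles(void (*mul)(__m128* const, __m128 const* const, __m128 const* const),
                      __m128* const r, __m128 const* const m, __m128 const* const n, int const iters)
{
    unsigned int aux;
    unsigned __int64 const t0 = __rdtsc();
    for (int i = 0; i < iters; ++i)
        mul(r, m, n);
    unsigned __int64 const t1 = __rdtscp(&aux);   // rdtscp waits for the preceding instructions to retire
    return double(t1 - t0) / iters;
}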

Notes on the tables




For loop_m the per-iteration cycle count quoted above was multiplied by 64, so it is a very rough value. Here is what actually came out.

i3-3770:


x86
Function    IACA cycles    Measured cycles    IACA speedup    Measured speedup
unroll_m    70.00          50.75              1.00            1.00
loop_m      233.60         119.21             0.30            0.43
sse_v1m     18.89          27.51              3.70            1.84
sse_v2m     19.00          27.61              3.68            1.84
sse_v3m     18.89          27.22              3.70            1.86
sse_v4s     18.89          27.18              3.70            1.87
avx_v1m     13.00          19.21              5.38            2.64
avx_v1s     13.00          20.03              5.38            2.53
avx_v2m     10.00          12.91              6.99            3.93
avx_v2s     10.00          17.34              6.99            2.93

x64
Function    IACA cycles    Measured cycles    IACA speedup    Measured speedup
unroll_m    70.00          68.60              1.00            1.00
loop_m      233.60         119.37             0.30            0.57
sse_v1m     18.89          21.98              3.70            3.12
sse_v2m     19.00          21.09              3.68            3.25
sse_v3m     18.89          22.19              3.70            3.09
sse_v4s     18.89          22.39              3.70            3.06
avx_v1m     13.00          9.61               5.38            7.13
avx_v1s     13.00          16.90              5.38            4.06
avx_v2m     10.00          9.20               6.99            7.45
avx_v2s     10.00          14.64              6.99            4.68

i7-8700K:


x86
Function    IACA cycles    Measured cycles    IACA speedup    Measured speedup
unroll_m    69.95          40.25              1.00            1.00
loop_m      233.60         79.49              0.30            0.51
sse_v1m     18.89          19.31              3.70            2.09
sse_v2m     19.00          19.98              3.68            2.01
sse_v3m     18.89          19.69              3.70            2.04
sse_v4s     18.89          19.67              3.70            2.05
avx_v1m     13.00          14.22              5.38            2.83
avx_v1s     13.00          14.13              5.38            2.85
avx_v2m     10.00          11.73              6.99            3.43
avx_v2s     10.00          11.81              6.99            3.41
AVX+FMAm    9.21           10.38              7.60            3.88
AVX+FMAs    9.21           10.32              7.60            3.90

x64
Function    IACA cycles    Measured cycles    IACA speedup    Measured speedup
unroll_m    69.95          57.11              1.00            1.00
loop_m      233.60         75.73              0.30            0.75
sse_v1m     18.89          15.83              3.70            3.61
sse_v2m     19.00          17.22              3.68            3.32
sse_v3m     18.89          15.92              3.70            3.59
sse_v4s     18.89          16.18              3.70            3.53
avx_v1m     13.00          7.03               5.38            8.12
avx_v1s     13.00          12.98              5.38            4.40
avx_v2m     10.00          5.40               6.99            10.57
avx_v2s     10.00          11.39              6.99            5.01
AVX+FMAm    9.21           9.73               7.60            5.87
AVX+FMAs    9.21           9.81               7.60            5.82

The test code is in the sources. If you have reasonable suggestions on how to improve it, write in the comments.

BONUS: from the realm of fantasy


Fantasy because I have only ever seen processors with AVX-512 support in pictures, if at all. Nevertheless, I tried to implement the algorithm. I won't explain anything here: it is a complete analogue of the AVX + FMA version, the algorithm is the same, there are just fewer operations.

As they say, I'll just leave it here.
__m512 operator + (__m512 const a, __m512 const b) { return _mm512_add_ps(a, b); }
__m512 operator - (__m512 const a, __m512 const b) { return _mm512_sub_ps(a, b); }
__m512 operator * (__m512 const a, __m512 const b) { return _mm512_mul_ps(a, b); }
__m512 operator / (__m512 const a, __m512 const b) { return _mm512_div_ps(a, b); }

template <int i>
__m512 perm(__m512 const v)
{
    return _mm512_permute_ps(v, _MM_SHUFFLE(i, i, i, i));
}

template <int a, int b, int c, int d>
__m512 perm(__m512 const v)
{
    return _mm512_permute_ps(v, _MM_SHUFFLE(a, b, c, d));
}

__m512 fma(__m512 const a, __m512 const b, __m512 const c)
{
    return _mm512_fmadd_ps(a, b, c);
}

void mul_mtx4_mtx4_avx512(__m128* const r, __m128 const* const m, __m128 const* const _n)
{
    __m512 const mm[]
    {
        _mm512_broadcast_f32x4(m[0]),
        _mm512_broadcast_f32x4(m[1]),
        _mm512_broadcast_f32x4(m[2]),
        _mm512_broadcast_f32x4(m[3])
    };

    __m512 const n = _mm512_load_ps(&_n[0].m128_f32[0]);
    _mm512_stream_ps(&r[0].m128_f32[0],
        fma(perm<3>(n), mm[0], perm<2>(n)*mm[1]) +
        fma(perm<1>(n), mm[2], perm<0>(n)*mm[3]));
}


The numbers are fantastic: x86 - 4.79, x64 - 5.42 (IACA with the SKX architecture). And this for an algorithm with 64 multiplications and 48 additions.

P.S. The code from the article




This is my first experience writing an article. Thank you all for the comments. They help make the code and article better.

Source: https://habr.com/ru/post/418247/

