Many years have passed since I first came across the MMX, SSE, and later AVX instructions on Intel processors. At the time they felt like magic compared to plain x86 assembly, which had long since become routine. They hooked me enough that a couple of years ago I got the idea to write my own software renderer for one well-known game, lured by the performance these instructions promised. At some point I even thought about writing it all up, but the text turned out to be much harder than the code.
Back then I wanted to avoid compatibility problems across different processors: I wanted to be able to test my renderer on as many machines as possible. I still have friends with old AMD processors whose ceiling is SSE3, so at the time I limited myself to SSE3 at most. That is how a vector math library appeared, implemented almost entirely on SSE with the occasional SSE3 instruction. At some point, though, I wondered how much performance I could squeeze out of the processor for a handful of critical vector math operations. One such operation is multiplying two 4x4 float matrices.

To be honest, I did this mostly for fun. Matrix multiplication on SSE was written long ago, I use it in my software renderer, and it seemed good enough. But then I decided to see how few clock cycles I could get, in principle, for multiplying two float4x4 matrices. My current SSE version takes 16 cycles. True, after the recent switch to IACA 3 it started showing 19, because it now reports 1* instead of 0* on some instructions; apparently the earlier figure was just a flaw in the analyzer.
Briefly about the tools used
To analyze the code I used the well-known Intel Architecture Code Analyzer (IACA). For the analysis I target the Haswell (HSW) architecture, as the earliest one with AVX2 support. When writing the code, two more references are very handy: the Intel Intrinsics Guide and the Intel optimization manual.
For building I use MSVS 2017 Community from the console. I write the code with intrinsics: you write it once and it usually works right away on different platforms. Besides, the x64 VC++ compiler does not support inline assembly, and I want this to work under x64.
Since this article is already somewhat beyond the beginner level of SIMD programming, I will not describe registers and instructions, draw (or borrow) pretty pictures, or try to teach SIMD programming from scratch. The Intel site is full of excellent, clear and detailed documentation.
I wanted to keep everything simple... and it turned out as always
This is where the part begins that complicates both the implementation and the article, so let me dwell on it briefly. I am not interested in writing matrix multiplication for the standard row-major element layout; whoever needed that has already learned it at university or on their own. Our goal is performance. First, I switched to column-major storage long ago. My software renderer is built on the OpenGL API, so to avoid unnecessary transpositions I store elements by columns. This also matters because matrix-by-matrix multiplication is not that critical: you multiply 2, 5, maybe 10 matrices, and that is it. Then you multiply the finished matrix by thousands or millions of vertices, and that operation is far more critical. You could, of course, transpose every time, but why, if it can be avoided.
But back to matrices. Column storage is settled, but it can be complicated further. It is more convenient for me to store vectors and matrix rows in SIMD registers so that x sits in the highest float (index 3) and w in the lowest (index 0). Here I apparently have to digress again on why.
The point is that in the software renderer the w component of a vector gets manipulated more often than the others (1/z is stored there), and it is very convenient to do that through the _ss variants of the instructions, which operate only on the lowest float of an xmm register, without touching x, y, z. So in the SSE register the vector is stored in the natural order x, y, z, w, and in memory in the reverse order w, z, y, x.
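A tiny illustration of why this is convenient (my example, not code from the renderer): the _ss forms touch only element 0, so 1/z can be written into w without disturbing x, y, z.

#include <xmmintrin.h>

__m128 store_inv_z_in_w(__m128 const v) // v = {w, z, y, x}, i.e. w at index 0, x at index 3
{
    __m128 const z    = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast z (index 1)
    __m128 const invz = _mm_rcp_ss(z);   // approximate 1/z in element 0, upper elements kept
    return _mm_move_ss(v, invz);         // replace w (element 0) with 1/z, keep x, y, z
}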
Further, all multiplication variants are implemented as separate functions. I do this because I substitute the needed variant at run time, depending on the instruction sets the processor supports.
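To make the idea concrete, here is a sketch of how such a run-time dispatch could look (my illustration, not the author's dispatcher; the variant names match functions that appear later in the article, and the CPUID checks are simplified: a production version should also verify OSXSAVE/XGETBV before trusting the AVX and FMA bits).

#include <intrin.h>
#include <xmmintrin.h>

using mul_fn = void (*)(__m128* const, __m128 const* const, __m128 const* const);

// variants defined later in the article
void mul_mtx4_mtx4_sse_v2 (__m128* const, __m128 const* const, __m128 const* const);
void mul_mtx4_mtx4_avx_v2 (__m128* const, __m128 const* const, __m128 const* const);
void mul_mtx4_mtx4_avx_fma(__m128* const, __m128 const* const, __m128 const* const);

mul_fn select_mul_mtx4_mtx4()
{
    int info[4];
    __cpuid(info, 1);
    bool const avx = (info[2] & (1 << 28)) != 0; // CPUID.1:ECX.AVX
    bool const fma = (info[2] & (1 << 12)) != 0; // CPUID.1:ECX.FMA

    if (avx && fma) return mul_mtx4_mtx4_avx_fma;
    if (avx)        return mul_mtx4_mtx4_avx_v2;
    return mul_mtx4_mtx4_sse_v2; // SSE is the baseline assumed in this article
}

// picked once at startup, then called through the pointer
static mul_fn const mul_mtx4_mtx4 = select_mul_mtx4_mtx4();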
The approach itself has been described well by others.

We implement the basic functionality
Loop multiplication, row-major order

A variant for the row-major element layout:

for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[i][j] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[i][j] += m[i][k] * n[k][j];
        }
    }
}
Everything is simple and clear: for each element we do 4 multiplications and 3 additions. In total that is 64 multiplications and 48 additions, and that is without counting the reads of the input elements.
In short, it is all rather sad. For the inner loop of this variant IACA reported 3.65 cycles for the x86 build and 2.97 for the x64 build. Do not ask why the numbers are fractional, I do not know; IACA 2.1 did not suffer from this. In any case, these figures should be multiplied by roughly 4 * 4 * 4 = 64. Even for x64 the result is about 192 cycles. Clearly this is a rough estimate, and I see no reason to evaluate the performance of this variant any more precisely.
Loop multiplication, column-major order

Transpose the matrix: swap the row and column indexes.

for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[j][i] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[j][i] += m[k][i] * n[j][k];
        }
    }
}
Loop multiplication, SIMD-oriented storage

Additionally, the rows are stored in reverse order in memory.

for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
        r[j][3-i] = 0.f;
        for (int k = 0; k < 4; ++k) {
            r[j][3-i] += m[k][3-i] * n[j][3-k];
        }
    }
}
This implementation makes what is happening inside a bit easier to follow, but it is clearly not enough.
Helper classes
For ease of understanding, and for writing reference and debugging code, it is convenient to implement a couple of auxiliary classes. Nothing fancy, everything only for clarity. Note that implementing full-fledged vector and matrix classes is a separate, difficult topic that is outside the scope of this article.
Vector and matrix classes:

struct alignas(sizeof(__m128)) vec4
{
    union
    {
        struct { float w, z, y, x; };
        __m128 fmm;
        float arr[4];
    };

    vec4() {}
    vec4(float a, float b, float c, float d) : w(d), z(c), y(b), x(a) {}

    static bool equ(float const a, float const b, float t = .00001f)
    {
        return fabs(a - b) < t;
    }

    bool operator == (vec4 const& v) const
    {
        return equ(x, v.x) && equ(y, v.y) && equ(z, v.z) && equ(w, v.w);
    }
};

struct alignas(sizeof(__m256)) mtx4 {
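The mtx4 body did not survive in this copy, so below is my guess at what it plausibly contains, consistent with the column storage and reversed component order described earlier (element _00 sits in the high float of the first register, so it comes last in memory within that column). The member names follow the reference function below; treat this as an assumption, not the author's exact code.

struct alignas(sizeof(__m256)) mtx4
{
    union
    {
        struct
        {
            float _30, _20, _10, _00;   // first stored row, reversed in memory
            float _31, _21, _11, _01;
            float _32, _22, _12, _02;
            float _33, _23, _13, _03;
        };
        __m128 r[4];      // the four SSE registers the multiplication functions work with
        float arr[16];
    };

    mtx4() {}
};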
Reference function for tests
Since the accepted element order in the matrix makes things much harder to follow, a plain reference function will not hurt either: it will show, for the implementations that follow, that everything works correctly. We will compare all subsequent results against it.
To create it, just take the loop and unroll it.

void mul_mtx4_mtx4_unroll(__m128* const _r, __m128 const* const _m, __m128 const* const _n)
{
    mtx4 const& m = **reinterpret_cast<mtx4 const* const*>(&_m);
    mtx4 const& n = **reinterpret_cast<mtx4 const* const*>(&_n);
    mtx4& r = **reinterpret_cast<mtx4* const*>(&_r);

    r._00 = m._00*n._00 + m._01*n._10 + m._02*n._20 + m._03*n._30;
    r._01 = m._00*n._01 + m._01*n._11 + m._02*n._21 + m._03*n._31;
    r._02 = m._00*n._02 + m._01*n._12 + m._02*n._22 + m._03*n._32;
    r._03 = m._00*n._03 + m._01*n._13 + m._02*n._23 + m._03*n._33;

    r._10 = m._10*n._00 + m._11*n._10 + m._12*n._20 + m._13*n._30;
    r._11 = m._10*n._01 + m._11*n._11 + m._12*n._21 + m._13*n._31;
    r._12 = m._10*n._02 + m._11*n._12 + m._12*n._22 + m._13*n._32;
    r._13 = m._10*n._03 + m._11*n._13 + m._12*n._23 + m._13*n._33;

    r._20 = m._20*n._00 + m._21*n._10 + m._22*n._20 + m._23*n._30;
    r._21 = m._20*n._01 + m._21*n._11 + m._22*n._21 + m._23*n._31;
    r._22 = m._20*n._02 + m._21*n._12 + m._22*n._22 + m._23*n._32;
    r._23 = m._20*n._03 + m._21*n._13 + m._22*n._23 + m._23*n._33;

    r._30 = m._30*n._00 + m._31*n._10 + m._32*n._20 + m._33*n._30;
    r._31 = m._30*n._01 + m._31*n._11 + m._32*n._21 + m._33*n._31;
    r._32 = m._30*n._02 + m._31*n._12 + m._32*n._22 + m._33*n._32;
    r._33 = m._30*n._03 + m._31*n._13 + m._32*n._23 + m._33*n._33;
}
Here the classic algorithm is written out explicitly, and it is hard to make a mistake (though still possible :-)). For it IACA reported: x86 - 69.95 cycles, x64 - 64 cycles. We will measure the coming speedups against these 64 cycles.
SSE implementation
Classic SSE algorithm
Why classic? Because it has long existed in the FVec implementation that ships with MSVS. To begin, let us write down how the matrix elements are laid out in the SSE registers. Here it already looks simpler: it is just the transposed matrix.
// 00, 10, 20, 30   - m[0], one SIMD register
// 01, 11, 21, 31   - m[1]
// 02, 12, 22, 32   - m[2]
// 03, 13, 23, 33   - m[3]
We take the code of the unroll variant above. It is somewhat unfriendly for SSE. The first group of lines consists of results for a column of the resulting matrix: r._00, r._01, r._02, r._03. We have a column, but we need a row. And m and n look inconvenient for the calculations as well. Therefore we rearrange the lines of the algorithm so that the result r comes out row by row.
// first group, r[0]
r00 = m00*n00 + m01*n10 + m02*n20 + m03*n30;
r10 = m10*n00 + m11*n10 + m12*n20 + m13*n30;
r20 = m20*n00 + m21*n10 + m22*n20 + m23*n30;
r30 = m30*n00 + m31*n10 + m32*n20 + m33*n30;

// second group, r[1]
r01 = m00*n01 + m01*n11 + m02*n21 + m03*n31;
r11 = m10*n01 + m11*n11 + m12*n21 + m13*n31;
r21 = m20*n01 + m21*n11 + m22*n21 + m23*n31;
r31 = m30*n01 + m31*n11 + m32*n21 + m33*n31;

// third group, r[2]
r02 = m00*n02 + m01*n12 + m02*n22 + m03*n32;
r12 = m10*n02 + m11*n12 + m12*n22 + m13*n32;
r22 = m20*n02 + m21*n12 + m22*n22 + m23*n32;
r32 = m30*n02 + m31*n12 + m32*n22 + m33*n32;

// fourth group, r[3]
r03 = m00*n03 + m01*n13 + m02*n23 + m03*n33;
r13 = m10*n03 + m11*n13 + m12*n23 + m13*n33;
r23 = m20*n03 + m21*n13 + m22*n23 + m23*n33;
r33 = m30*n03 + m31*n13 + m32*n23 + m33*n33;
That is much better. What exactly do we see? Going down the columns of the algorithm, in every group we use the rows of matrix m:

m[0] = {00, 10, 20, 30}, m[1] = {01, 11, 21, 31}, m[2] = {02, 12, 22, 32}, m[3] = {03, 13, 23, 33},

which are multiplied by one and the same element of matrix n. For the first group, for example, those elements are n._00, n._10, n._20, n._30. And within each group of algorithm lines the elements of n again lie in a single row of the matrix.
From there everything is simple: we take the rows of matrix m by index, and for each element of n we take its row and use the shuffle instruction to broadcast that element across all 4 slots of a register, to be multiplied by the row of matrix m sitting in a register. For the element n._00, for example (remember that its offset in the register is index 3), this is:

_mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(3,3,3,3))
In simplified form, the algorithm looks like this:
// n[0] = {00, 10, 20, 30}
r[0] = m[0]*n00 + m[1]*n10 + m[2]*n20 + m[3]*n30;
// n[1] = {01, 11, 21, 31}
r[1] = m[0]*n01 + m[1]*n11 + m[2]*n21 + m[3]*n31;
// n[2] = {02, 12, 22, 32}
r[2] = m[0]*n02 + m[1]*n12 + m[2]*n22 + m[3]*n32;
// n[3] = {03, 13, 23, 33}
r[3] = m[0]*n03 + m[1]*n13 + m[2]*n23 + m[3]*n33;
The basic SSE implementation:

void mul_mtx4_mtx4_sse_v1(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[0], n[0], _MM_SHUFFLE(0,0,0,0)))));
    r[1] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[1], n[1], _MM_SHUFFLE(0,0,0,0)))));
    r[2] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[2], n[2], _MM_SHUFFLE(0,0,0,0)))));
    r[3] = _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(m[0], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(3,3,3,3))),
            _mm_mul_ps(m[1], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(2,2,2,2)))),
        _mm_add_ps(
            _mm_mul_ps(m[2], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(1,1,1,1))),
            _mm_mul_ps(m[3], _mm_shuffle_ps(n[3], n[3], _MM_SHUFFLE(0,0,0,0)))));
}
Now in the algorithm we replace the elements of n with the corresponding shuffle, the multiplications with _mm_mul_ps, the sums with _mm_add_ps, and everything is ready and works. The code, however, looks much worse than the algorithm itself did. For this code IACA reported: x86 - 18.89, x64 - 16 cycles. That is 4 times faster than the previous variant, which makes sense: an SSE register holds 4 floats, so the dependence is almost linear.
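It also does not hurt to check the SSE variant against the reference. A minimal check might look like this (my sketch, not the article's test code; it assumes the two functions declared below are the ones defined above):

#include <cmath>
#include <cstdio>
#include <xmmintrin.h>

void mul_mtx4_mtx4_unroll(__m128* const r, __m128 const* const m, __m128 const* const n);
void mul_mtx4_mtx4_sse_v1(__m128* const r, __m128 const* const m, __m128 const* const n);

bool test_mul_sse_v1()
{
    // 32-byte alignment also satisfies the later AVX variants
    alignas(32) float ma[16], na[16], ra[16], rb[16];
    for (int i = 0; i < 16; ++i) { ma[i] = float(i + 1); na[i] = float(16 - i); }

    mul_mtx4_mtx4_unroll(reinterpret_cast<__m128*>(ra),
                         reinterpret_cast<__m128 const*>(ma),
                         reinterpret_cast<__m128 const*>(na));
    mul_mtx4_mtx4_sse_v1(reinterpret_cast<__m128*>(rb),
                         reinterpret_cast<__m128 const*>(ma),
                         reinterpret_cast<__m128 const*>(na));

    // compare with a tolerance: the summation order differs between the two versions
    for (int i = 0; i < 16; ++i)
    {
        if (std::fabs(ra[i] - rb[i]) > 1e-3f)
        {
            std::printf("mismatch at element %d: %f vs %f\n", i, ra[i], rb[i]);
            return false;
        }
    }
    return true;
}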
Dressing up the SSE implementation
Still, the code looks awful. Let us try to improve it by writing a little syntactic sugar. The compiler inlines these functions perfectly well (although sometimes only with __forceinline).
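The helper block itself did not survive here, so below is my guess at what it contains, mirroring the AVX helpers shown later: arithmetic operators for __m128 (this overloading works on MSVC, where __m128 is not a built-in vector type) and a shuf template that broadcasts one element of a register to all four slots.

#include <xmmintrin.h>

__m128 operator + (__m128 const a, __m128 const b) { return _mm_add_ps(a, b); }
__m128 operator - (__m128 const a, __m128 const b) { return _mm_sub_ps(a, b); }
__m128 operator * (__m128 const a, __m128 const b) { return _mm_mul_ps(a, b); }
__m128 operator / (__m128 const a, __m128 const b) { return _mm_div_ps(a, b); }

// broadcast element i of v to all four floats
template <int i>
__m128 shuf(__m128 const v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(i, i, i, i));
}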
And so the code turns into...

void mul_mtx4_mtx4_sse_v2(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = m[0]*shuf<3>(n[0]) + m[1]*shuf<2>(n[0]) + m[2]*shuf<1>(n[0]) + m[3]*shuf<0>(n[0]);
    r[1] = m[0]*shuf<3>(n[1]) + m[1]*shuf<2>(n[1]) + m[2]*shuf<1>(n[1]) + m[3]*shuf<0>(n[1]);
    r[2] = m[0]*shuf<3>(n[2]) + m[1]*shuf<2>(n[2]) + m[2]*shuf<1>(n[2]) + m[3]*shuf<0>(n[2]);
    r[3] = m[0]*shuf<3>(n[3]) + m[1]*shuf<2>(n[3]) + m[2]*shuf<1>(n[3]) + m[3]*shuf<0>(n[3]);
}
This is already much better and more readable. For it IACA produced roughly the expected result: x86 - 19 (why not fractional this time?), x64 - 16. Performance has not really changed, but the code is much prettier and clearer.
A small contribution to future optimization
Let us introduce one more improvement at the function level: the multiply-add operation (fma), which appeared in hardware not so long ago. fma(a, b, c) = a * b + c.
Implementing multiply-add:

__m128 mad(__m128 const a, __m128 const b, __m128 const c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
Why is this needed? First of all, for future optimization: later, mad in the finished code can simply be replaced with a real fma, for example via a macro. But we lay the groundwork for that optimization now.
The variant with multiply-add:

void mul_mtx4_mtx4_sse_v3(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    r[0] = mad(m[0], shuf<3>(n[0]), m[1]*shuf<2>(n[0])) + mad(m[2], shuf<1>(n[0]), m[3]*shuf<0>(n[0]));
    r[1] = mad(m[0], shuf<3>(n[1]), m[1]*shuf<2>(n[1])) + mad(m[2], shuf<1>(n[1]), m[3]*shuf<0>(n[1]));
    r[2] = mad(m[0], shuf<3>(n[2]), m[1]*shuf<2>(n[2])) + mad(m[2], shuf<1>(n[2]), m[3]*shuf<0>(n[2]));
    r[3] = mad(m[0], shuf<3>(n[3]), m[1]*shuf<2>(n[3])) + mad(m[2], shuf<1>(n[3]), m[3]*shuf<0>(n[3]));
}
IACA: x86 - 18.89, x64 - 16. Fractional again; IACA does produce strange results sometimes. The code has not changed much, and has probably even become slightly uglier, but optimization sometimes demands such sacrifices.
Switching to stores via _mm_stream
Various optimization guides recommend not thrashing the cache with bulk store operations. That is usually sensible when you are processing vertices, of which there are thousands and more; for matrices it is probably not that important, but let us add it anyway.
The streaming-store variant:

void mul_mtx4_mtx4_sse_v4(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    _mm_stream_ps(&r[0].m128_f32[0],
        mad(m[0], shuf<3>(n[0]), m[1]*shuf<2>(n[0])) + mad(m[2], shuf<1>(n[0]), m[3]*shuf<0>(n[0])));
    _mm_stream_ps(&r[1].m128_f32[0],
        mad(m[0], shuf<3>(n[1]), m[1]*shuf<2>(n[1])) + mad(m[2], shuf<1>(n[1]), m[3]*shuf<0>(n[1])));
    _mm_stream_ps(&r[2].m128_f32[0],
        mad(m[0], shuf<3>(n[2]), m[1]*shuf<2>(n[2])) + mad(m[2], shuf<1>(n[2]), m[3]*shuf<0>(n[2])));
    _mm_stream_ps(&r[3].m128_f32[0],
        mad(m[0], shuf<3>(n[3]), m[1]*shuf<2>(n[3])) + mad(m[2], shuf<1>(n[3]), m[3]*shuf<0>(n[3])));
}
The cycle count did not change at all. But, following the recommendations, we now leave the cache alone.
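A short usage note as a sketch (my addition, not something the article's benchmark needs): _mm_stream_ps requires a 16-byte aligned destination, and if another thread is going to read the result, the non-temporal stores should be ordered with a store fence.

#include <xmmintrin.h>

void mul_mtx4_mtx4_sse_v4(__m128* const r, __m128 const* const m, __m128 const* const n);

void multiply_streamed(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    mul_mtx4_mtx4_sse_v4(r, m, n); // r must be 16-byte aligned (alignas(16) or _mm_malloc)
    _mm_sfence();                  // only needed when publishing the result to other threads
}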
AVX implementation
Basic AVX option

Now let us move on to the next stage of optimization. An SSE register holds 4 floats; an AVX register already holds 8. So in theory there is a chance to reduce the number of operations and raise performance, if not by a factor of two, then at least by 1.5. But something tells me the transition to AVX will not be that easy: will we manage to pull the data we need out of the doubled registers?
Let us try to figure it out. First, write out once more the multiplication algorithm used above. We would not have to, but it is easier to work with the code when everything is close by and you do not have to scroll half a page up.
// memory layout: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
// the SSE scheme:
r0 = m0*n00 + m1*n10 + m2*n20 + m3*n30
r1 = m0*n01 + m1*n11 + m2*n21 + m3*n31
r2 = m0*n02 + m1*n12 + m2*n22 + m3*n32
r3 = m0*n03 + m1*n13 + m2*n23 + m3*n33
At the output we expect to get the result in ymm = {r0 : r1} and ymm = {r2 : r3}. If in the SSE version our algorithm was generalized over columns, now we need to generalize it over rows, so acting exactly as in the SSE variant will not work.
If we load matrix m into ymm registers, we get ymm = {m0 : m1} and ymm = {m2 : m3} respectively. Previously we had only matrix columns in a register; now both columns and rows at once.
If we try to act as before, then ymm = {m0 : m1} would have to be multiplied by the register ymm = {n00, n00, n00, n00} : {n10, n10, n10, n10}. But n00 and n10 lie in the same row of matrix n, i.e. in the same register, and judging by the available AVX instruction set it is expensive to scatter them across a ymm like that: both shuffle and permute work separately on each of the two float quadruples (the upper and lower xmm halves) inside a ymm register.
If we load a ymm from matrix n, both n00 and n10 end up in the same one of the two xmm halves of the ymm register: {n00, n10, n20, n30} : {n01, n11, n21, n31}. The index of the existing instructions usually runs from 0 to 3, and it addresses floats only within one xmm half of the ymm register. Throwing n10 from one xmm half into the other cheaply will not work, and this trick would have to be repeated several times. Such a loss of cycles is unacceptable; we have to come up with something else.
Earlier we generalized over columns; now we generalize over rows. So let us approach from a slightly different side. We need the result as {r0 : r1}, which means the algorithm should be advanced not one line at a time but two lines at once. And here what was a minus in how shuffle and permute work becomes a plus. Look at what lands in the ymm registers when we load matrix n:

n0n1 = {00, 10, 20, 30} : {01, 11, 21, 31}
n2n3 = {02, 12, 22, 32} : {03, 13, 23, 33}

Aha: we notice that elements _00 and _01 sit in different xmm halves of the ymm register. They can be conveniently spread by the permute instruction into {_00, _00, _00, _00} : {_01, _01, _01, _01}, specifying the single index 3 for both xmm halves at once. This is exactly what we need, because these coefficients are used in different lines of the algorithm anyway. Only now the matching ymm register for the multiplication must hold {m0 : m0}, that is, the first row of matrix m duplicated.
So let us write the algorithm out in more detail. We read the duplicated rows of matrix m into ymm registers:

mm[0] = {m0 : m0}
mm[1] = {m1 : m1}
mm[2] = {m2 : m2}
mm[3] = {m3 : m3}

And then we compute the multiplication as:

r0r1 = mm[0] * {n00,n00,n00,n00 : n01,n01,n01,n01} +   // permute<3,3,3,3>(n0n1)
       mm[1] * {n10,n10,n10,n10 : n11,n11,n11,n11} +   // permute<2,2,2,2>(n0n1)
       mm[2] * {n20,n20,n20,n20 : n21,n21,n21,n21} +   // permute<1,1,1,1>(n0n1)
       mm[3] * {n30,n30,n30,n30 : n31,n31,n31,n31}     // permute<0,0,0,0>(n0n1)

r2r3 = mm[0] * {n02,n02,n02,n02 : n03,n03,n03,n03} +   // permute<3,3,3,3>(n2n3)
       mm[1] * {n12,n12,n12,n12 : n13,n13,n13,n13} +   // permute<2,2,2,2>(n2n3)
       mm[2] * {n22,n22,n22,n22 : n23,n23,n23,n23} +   // permute<1,1,1,1>(n2n3)
       mm[3] * {n32,n32,n32,n32 : n33,n33,n33,n33}     // permute<0,0,0,0>(n2n3)
Rewritten more compactly:

r0r1 = mm[0]*n0n1<3,3,3,3> + mm[1]*n0n1<2,2,2,2> + mm[2]*n0n1<1,1,1,1> + mm[3]*n0n1<0,0,0,0>
r2r3 = mm[0]*n2n3<3,3,3,3> + mm[1]*n2n3<2,2,2,2> + mm[2]*n2n3<1,1,1,1> + mm[3]*n2n3<0,0,0,0>
Or in a simplified form:
r0r1 = mm[0]*n0n1<3> + mm[1]*n0n1<2> + mm[2]*n0n1<1> + mm[3]*n0n1<0>
r2r3 = mm[0]*n2n3<3> + mm[1]*n2n3<2> + mm[2]*n2n3<1> + mm[3]*n2n3<0>
It seems everything is clear.
It remains only to write the implementation.

void mul_mtx4_mtx4_avx_v1(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 mm0 = _mm256_set_m128(m[0], m[0]);
    __m256 mm1 = _mm256_set_m128(m[1], m[1]);
    __m256 mm2 = _mm256_set_m128(m[2], m[2]);
    __m256 mm3 = _mm256_set_m128(m[3], m[3]);
    __m256 n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    __m256 y1 = _mm256_permute_ps(n0n1, 0xFF);
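The listing above breaks off after the first permute in this copy. Following the r0r1 / r2r3 scheme just described, the rest of the function plausibly looks like this (my reconstruction, not the author's exact code; it assumes r is 32-byte aligned, as the mtx4 struct above is declared to be):

#include <immintrin.h>

void mul_mtx4_mtx4_avx_v1(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 const mm0 = _mm256_set_m128(m[0], m[0]);
    __m256 const mm1 = _mm256_set_m128(m[1], m[1]);
    __m256 const mm2 = _mm256_set_m128(m[2], m[2]);
    __m256 const mm3 = _mm256_set_m128(m[3], m[3]);

    // r0r1 = mm0*n0n1<3> + mm1*n0n1<2> + mm2*n0n1<1> + mm3*n0n1<0>
    __m256 const n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    __m256 const r0r1 = _mm256_add_ps(
        _mm256_add_ps(_mm256_mul_ps(mm0, _mm256_permute_ps(n0n1, 0xFF)),
                      _mm256_mul_ps(mm1, _mm256_permute_ps(n0n1, 0xAA))),
        _mm256_add_ps(_mm256_mul_ps(mm2, _mm256_permute_ps(n0n1, 0x55)),
                      _mm256_mul_ps(mm3, _mm256_permute_ps(n0n1, 0x00))));
    _mm256_store_ps(&r[0].m128_f32[0], r0r1); // the streaming variant swaps this for _mm256_stream_ps

    // r2r3 = mm0*n2n3<3> + mm1*n2n3<2> + mm2*n2n3<1> + mm3*n2n3<0>
    __m256 const n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    __m256 const r2r3 = _mm256_add_ps(
        _mm256_add_ps(_mm256_mul_ps(mm0, _mm256_permute_ps(n2n3, 0xFF)),
                      _mm256_mul_ps(mm1, _mm256_permute_ps(n2n3, 0xAA))),
        _mm256_add_ps(_mm256_mul_ps(mm2, _mm256_permute_ps(n2n3, 0x55)),
                      _mm256_mul_ps(mm3, _mm256_permute_ps(n2n3, 0x00))));
    _mm256_store_ps(&r[2].m128_f32[0], r2r3);
}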
Here are the numbers from IACA: x86 - 12.53, x64 - 12. Interesting, although of course I hoped for better. Something has been missed.
AVX optimization plus “syntactic sugar”
It seems the code above does not use AVX to its full potential. Instead of assembling two identical rows into a ymm register one by one, we can use broadcast, which fills a ymm register with two copies of the same xmm value. Along the way, let us also add a little "syntactic sugar" for the AVX functions.
The improved AVX implementation:

__m256 operator + (__m256 const a, __m256 const b) { return _mm256_add_ps(a, b); }
__m256 operator - (__m256 const a, __m256 const b) { return _mm256_sub_ps(a, b); }
__m256 operator * (__m256 const a, __m256 const b) { return _mm256_mul_ps(a, b); }
__m256 operator / (__m256 const a, __m256 const b) { return _mm256_div_ps(a, b); }

template <int i>
__m256 perm(__m256 const v)
{
    return _mm256_permute_ps(v, _MM_SHUFFLE(i, i, i, i));
}

template <int a, int b, int c, int d>
__m256 perm(__m256 const v)
{
    return _mm256_permute_ps(v, _MM_SHUFFLE(a, b, c, d));
}

template <int i, int j>
__m256 perm(__m256 const v)
{
    return _mm256_permutevar_ps(v, _mm256_set_epi32(i, i, i, i, j, j, j, j));
}

template <int a, int b, int c, int d, int e, int f, int g, int h>
__m256 perm(__m256 const v)
{
    return _mm256_permutevar_ps(v, _mm256_set_epi32(a, b, c, d, e, f, g, h));
}

__m256 mad(__m256 const a, __m256 const b, __m256 const c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

void mul_mtx4_mtx4_avx_v2(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 const mm[]
    {
        _mm256_broadcast_ps(m + 0),
        _mm256_broadcast_ps(m + 1),
        _mm256_broadcast_ps(m + 2),
        _mm256_broadcast_ps(m + 3)
    };

    __m256 const n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    _mm256_stream_ps(&r[0].m128_f32[0],
        mad(perm<3>(n0n1), mm[0], perm<2>(n0n1)*mm[1]) +
        mad(perm<1>(n0n1), mm[2], perm<0>(n0n1)*mm[3]));

    __m256 const n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    _mm256_stream_ps(&r[2].m128_f32[0],
        mad(perm<3>(n2n3), mm[0], perm<2>(n2n3)*mm[1]) +
        mad(perm<1>(n2n3), mm[2], perm<0>(n2n3)*mm[3]));
}
And here the results are more interesting. IACA gives: x86 - 10, x64 - 8.58, which looks much better, though still not twice as fast.
AVX + FMA option (final)
Let us make one more attempt. Now it is logical to recall the FMA instruction set again, since it was added to processors after AVX: just replace the separate mul + add pairs with a single operation. We still keep a plain multiplication in part of the expression, though, to give the compiler more room for optimization and the processor more opportunity to run the multiplications in parallel. I usually look at the generated assembly to see which variant comes out better.
In this case we need to compute a*b + c*d + e*f + g*h. You can do it head-on: fma(a, b, fma(c, d, fma(e, f, g*h))). But, as you can see, no operation here can start before the previous one finishes, so we lose the ability to do paired multiplications, which the SIMD pipeline allows. If we transform the calculation slightly into fma(a, b, c*d) + fma(e, f, g*h), we see that it can be parallelized: first two independent multiplications, then two independent fma operations.
The AVX + FMA implementation:

__m256 fma(__m256 const a, __m256 const b, __m256 const c)
{
    return _mm256_fmadd_ps(a, b, c);
}

void mul_mtx4_mtx4_avx_fma(__m128* const r, __m128 const* const m, __m128 const* const n)
{
    __m256 const mm[]
    {
        _mm256_broadcast_ps(m + 0),
        _mm256_broadcast_ps(m + 1),
        _mm256_broadcast_ps(m + 2),
        _mm256_broadcast_ps(m + 3)
    };

    __m256 const n0n1 = _mm256_load_ps(&n[0].m128_f32[0]);
    _mm256_stream_ps(&r[0].m128_f32[0],
        fma(perm<3>(n0n1), mm[0], perm<2>(n0n1)*mm[1]) +
        fma(perm<1>(n0n1), mm[2], perm<0>(n0n1)*mm[3]));

    __m256 const n2n3 = _mm256_load_ps(&n[2].m128_f32[0]);
    _mm256_stream_ps(&r[2].m128_f32[0],
        fma(perm<3>(n2n3), mm[0], perm<2>(n2n3)*mm[1]) +
        fma(perm<1>(n2n3), mm[2], perm<0>(n2n3)*mm[3]));
}
IACA: x86 - 9.21, x64 - 8. Now that is quite good. Someone will probably say it can be done even better, but I do not know how.
Benchmarks
Let me note right away that these figures should not be taken as the ultimate truth. Even with a fixed test they drift within certain limits, and they certainly behave differently on different platforms. With any optimization, take measurements specifically for your case.

Table legend:
- Function: the name of the function. Names ending in s are the streaming variants; the rest use ordinary mov stores (no streaming). Added for clarity, since it matters quite a bit.
- IACA cycles: the number of cycles per call according to IACA.
- Measured cycles: the measured number of cycles (lower is better).
- IACA speedup: IACA cycles of the first row (unroll_m) divided by IACA cycles of the given row.
- Measured speedup: measured cycles of the first row (unroll_m) divided by measured cycles of the given row (higher is better).
For loop_m, the IACA value is the per-iteration figure from earlier in the article multiplied by 64, so it is a very rough estimate. Here is what actually came out.

i3-3770:

x86:

| Function | IACA cycles | Measured cycles | IACA speedup | Measured speedup |
|---|---|---|---|---|
| unroll_m | 70.00 | 50.75 | 1.00 | 1.00 |
| loop_m | 233.60 | 119.21 | 0.30 | 0.43 |
| sse_v1m | 18.89 | 27.51 | 3.70 | 1.84 |
| sse_v2m | 19.00 | 27.61 | 3.68 | 1.84 |
| sse_v3m | 18.89 | 27.22 | 3.70 | 1.86 |
| sse_v4s | 18.89 | 27.18 | 3.70 | 1.87 |
| avx_v1m | 13.00 | 19.21 | 5.38 | 2.64 |
| avx_v1s | 13.00 | 20.03 | 5.38 | 2.53 |
| avx_v2m | 10.00 | 12.91 | 6.99 | 3.93 |
| avx_v2s | 10.00 | 17.34 | 6.99 | 2.93 |
x64:

| Function | IACA cycles | Measured cycles | IACA speedup | Measured speedup |
|---|---|---|---|---|
| unroll_m | 70 | 68.60 | 1.00 | 1.00 |
| loop_m | 233.60 | 119.37 | 0.30 | 0.57 |
| sse_v1m | 18.89 | 21.98 | 3.70 | 3.12 |
| sse_v2m | 19.00 | 21.09 | 3.68 | 3.25 |
| sse_v3m | 18.89 | 22.19 | 3.70 | 3.09 |
| sse_v4s | 18.89 | 22.39 | 3.70 | 3.06 |
| avx_v1m | 13.00 | 9.61 | 5.38 | 7.13 |
| avx_v1s | 13.00 | 16.90 | 5.38 | 4.06 |
| avx_v2m | 10.00 | 9.20 | 6.99 | 7.45 |
| avx_v2s | 10.00 | 14.64 | 6.99 | 4.68 |
i7-8700K:

x86:

| Function | IACA cycles | Measured cycles | IACA speedup | Measured speedup |
|---|---|---|---|---|
| unroll_m | 69.95 | 40.25 | 1.00 | 1.00 |
| loop_m | 233.60 | 79.49 | 0.30 | 0.51 |
| sse_v1m | 18.89 | 19.31 | 3.70 | 2.09 |
| sse_v2m | 19.00 | 19.98 | 3.68 | 2.01 |
| sse_v3m | 18.89 | 19.69 | 3.70 | 2.04 |
| sse_v4s | 18.89 | 19.67 | 3.70 | 2.05 |
| avx_v1m | 13.00 | 14.22 | 5.38 | 2.83 |
| avx_v1s | 13.00 | 14.13 | 5.38 | 2.85 |
| avx_v2m | 10.00 | 11.73 | 6.99 | 3.43 |
| avx_v2s | 10.00 | 11.81 | 6.99 | 3.41 |
| AVX+FMAm | 9.21 | 10.38 | 7.60 | 3.88 |
| AVX+FMAs | 9.21 | 10.32 | 7.60 | 3.90 |
x64:

| Function | IACA cycles | Measured cycles | IACA speedup | Measured speedup |
|---|---|---|---|---|
| unroll_m | 69.95 | 57.11 | 1.00 | 1.00 |
| loop_m | 233.60 | 75.73 | 0.30 | 0.75 |
| sse_v1m | 18.89 | 15.83 | 3.70 | 3.61 |
| sse_v2m | 19.00 | 17.22 | 3.68 | 3.32 |
| sse_v3m | 18.89 | 15.92 | 3.70 | 3.59 |
| sse_v4s | 18.89 | 16.18 | 3.70 | 3.53 |
| avx_v1m | 13.00 | 7.03 | 5.38 | 8.12 |
| avx_v1s | 13.00 | 12.98 | 5.38 | 4.40 |
| avx_v2m | 10.00 | 5.40 | 6.99 | 10.57 |
| avx_v2s | 10.00 | 11.39 | 6.99 | 5.01 |
| AVX+FMAm | 9.21 | 9.73 | 7.60 | 5.87 |
| AVX+FMAs | 9.21 | 9.81 | 7.60 | 5.82 |
The test code is in the sources. If you have reasonable suggestions on how to improve the tests, write them in the comments.

BONUS from the realm of science fiction
From the realm of science fiction because I have seen processors with AVX512 support only in pictures, if at all. Nevertheless, I tried to implement the algorithm. I will not explain anything here: it is a complete analogue of AVX + FMA, the same algorithm, only with fewer operations. As they say, I will just leave this here.

__m512 operator + (__m512 const a, __m512 const b) { return _mm512_add_ps(a, b); }
__m512 operator - (__m512 const a, __m512 const b) { return _mm512_sub_ps(a, b); }
__m512 operator * (__m512 const a, __m512 const b) { return _mm512_mul_ps(a, b); }
__m512 operator / (__m512 const a, __m512 const b) { return _mm512_div_ps(a, b); }

template <int i>
__m512 perm(__m512 const v)
{
    return _mm512_permute_ps(v, _MM_SHUFFLE(i, i, i, i));
}

template <int a, int b, int c, int d>
__m512 perm(__m512 const v)
{
    return _mm512_permute_ps(v, _MM_SHUFFLE(a, b, c, d));
}

__m512 fma(__m512 const a, __m512 const b, __m512 const c)
{
    return _mm512_fmadd_ps(a, b, c);
}

void mul_mtx4_mtx4_avx512(__m128* const r, __m128 const* const m, __m128 const* const _n)
{
    __m512 const mm[]
    {
        _mm512_broadcast_f32x4(m[0]),
        _mm512_broadcast_f32x4(m[1]),
        _mm512_broadcast_f32x4(m[2]),
        _mm512_broadcast_f32x4(m[3])
    };

    __m512 const n = _mm512_load_ps(&_n[0].m128_f32[0]);
    _mm512_stream_ps(&r[0].m128_f32[0],
        fma(perm<3>(n), mm[0], perm<2>(n)*mm[1]) +
        fma(perm<1>(n), mm[2], perm<0>(n)*mm[3]));
}
The numbers are fantastic: x86 - 4.79, x64 - 5.42 (IACA with the SKX architecture). And that is for an algorithm with 64 multiplications and 48 additions.

P.S. The code from the article
This is my first experience writing an article. Thank you all for the comments; they help make both the code and the article better.