Applications often demand the highest possible performance, and moving to a new platform can help achieve it. Additional gains come not only from the architectural changes of the new platform, such as a higher processor frequency or greater memory bandwidth, but also from new processor instruction sets. This article looks at that approach using the example of the vector instructions that have appeared in Intel processors over time, from SSE up to the new AVX 3.1 instruction set recently published in the corresponding specification on the Intel website.
The author of this article works on the Intel IPP library, so the performance gain is demonstrated here on one of the functions from that library. I hope the article will be useful to the following categories of readers:
- novice developers who want a concrete example of how SIMD instructions can improve code performance;
- developers who already have experience with SIMD instructions and are interested in the new AVX3.1 instruction set for optimizing their applications.
In addition, developers who want to optimize their applications while minimizing the cost of low-level optimization may find this article a good introduction to the IPP library.
The ippiSwapChannels_32f_C3C4R function
The IPP library contains optimized 32-bit and 64-bit versions for processors that support the SSSE3, SSE41, AVX, AVX2 and AVX31 instruction sets, so we will consider optimization with these instruction sets. In general, IPP contains many functions for various data-processing domains. Some of them implement mathematically involved algorithms, such as the Fourier transform, while others are quite simple. It is one of these simple functions, ippiSwapChannels_32f_C3C4R, that we will consider in this article, applying to it one instruction set after another, from SSSE3 to AVX31.
The API for this function is as follows.
int ippiSwapChannels_32f_C3C4R( const float* pSrc, int srcStep, float* pDst, int dstStep, int width, int height, const int dstOrder[4], float val )
The function reads the source float image at the pSrc pointer, with a row step of srcStep bytes and size width x height, rearranges the channels of each pixel according to the new dstOrder order, and writes the resulting image at the pDst pointer. The function can also write a user-specified fixed value val into a channel. Duplicating channels in the output image is allowed. The scheme of the function is shown in Figure 1.
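A minimal usage sketch based on the simplified prototype quoted above; the buffer setup and the dstOrder value are invented for the example, and the actual IPP header declares the function with IPP-specific types (e.g. Ipp32f, a size structure for the ROI, and an IppStatus return code).

#include <stdlib.h>

/* Prototype as quoted in this article (illustrative declaration only). */
int ippiSwapChannels_32f_C3C4R(const float* pSrc, int srcStep,
                               float* pDst, int dstStep,
                               int width, int height,
                               const int dstOrder[4], float val);

int main(void)
{
    int width = 640, height = 480;
    int srcStep = width * 3 * (int)sizeof(float);   /* packed RGB  rows, step in bytes */
    int dstStep = width * 4 * (int)sizeof(float);   /* packed RGBA rows, step in bytes */
    float *pSrc = (float*)malloc((size_t)srcStep * height);
    float *pDst = (float*)malloc((size_t)dstStep * height);

    /* RGB -> BGR plus a constant 4th channel: indices 2, 1, 0 copy source channels,
       index 3 writes val (here 1.0f) into the 4th channel of every pixel.           */
    const int dstOrder[4] = { 2, 1, 0, 3 };
    ippiSwapChannels_32f_C3C4R(pSrc, srcStep, pDst, dstStep,
                               width, height, dstOrder, 1.0f);

    free(pSrc);
    free(pDst);
    return 0;
}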
C code
The C code of the function is fairly simple (see below). Each iteration of the loop forms one pixel of the output image, and one of three possible actions is taken for each channel of the output pixel:
- the channel of the source pixel is copied if the index in dstOrder is less than 3;
- the fixed value val is written if the index in dstOrder is equal to 3;
- the old value is left unmodified if the index in dstOrder is greater than 3.
Code
int s, d, h, ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
for( h = 0; h < height; h++ ) {
    for( s = 0, d = 0; d < width*4; s += 3, d += 4 ) {
        if( ord0 < 3 ) { pDst[d]     = pSrc[s + ord0]; } else if( ord0 == 3 ) { pDst[d]     = val; }
        if( ord1 < 3 ) { pDst[d + 1] = pSrc[s + ord1]; } else if( ord1 == 3 ) { pDst[d + 1] = val; }
        if( ord2 < 3 ) { pDst[d + 2] = pSrc[s + ord2]; } else if( ord2 == 3 ) { pDst[d + 2] = val; }
        if( ord3 < 3 ) { pDst[d + 3] = pSrc[s + ord3]; } else if( ord3 == 3 ) { pDst[d + 3] = val; }
    }
    pSrc = (float*)((char*)pSrc + srcStep);
    pDst = (float*)((char*)pDst + dstStep);
}
First version of SIMD code using SSSE3
And now, as planned, let us try to optimize this code with vector SIMD instructions, all the way up to AVX 3.1. We will compile with Intel Composer and measure the performance of all code variants on an internal simulator of the server version of Skylake (SKX), which supports the AVX31 set, because we want to evaluate the benefit of the new instructions without the influence of architectural changes. We also use the same, maximum possible compiler optimization level for all code variants. The performance of the C code measured in this way is taken as the reference value P = 1.
We hope the reader is already familiar with the concept of vector registers; if not, you can get acquainted with them and use the convenient Intel Intrinsics Guide.
So, let's see how this function could be vectorized, i.e. rewritten using SSE vector registers. According to the logic of the function, we need to rearrange float elements within a vector register, so we need instructions that permute elements. The SSE instruction set offers the following:
SHUFPS xmm1, xmm2/m128, imm8
UNPCKHPS xmm1, xmm2/m128
UNPCKLPS xmm1, xmm2/m128

__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int mask)
__m128 _mm_unpackhi_ps(__m128, __m128)
__m128 _mm_unpacklo_ps(__m128, __m128)
These instructions are useful and could be applied to our task, but a problem arises. The SHUFPS instruction forms the permutation from its immediate imm8 operand, so the third parameter of the intrinsic must be a compile-time constant; if you try to pass a variable, the compiler reports an error. The UNPCKHPS/UNPCKLPS instructions, in turn, interleave the elements of their registers in only one fixed way.
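To make the imm8 limitation concrete, here is a small illustrative sketch (not library code): _mm_shuffle_ps accepts only a compile-time constant mask, so a mask computed at run time from dstOrder cannot be passed to it.

#include <stdio.h>
#include <xmmintrin.h>

/* Compiles: the shuffle mask is a compile-time constant built with _MM_SHUFFLE. */
static __m128 bgr_from_rgb(__m128 px)
{
    return _mm_shuffle_ps(px, px, _MM_SHUFFLE(3, 0, 1, 2)); /* r g b a -> b g r a */
}

/* Does NOT compile on typical compilers: the imm8 operand must be an immediate,
   so a mask derived from dstOrder at run time cannot be used here.
static __m128 shuffle_runtime(__m128 px, unsigned mask)
{
    return _mm_shuffle_ps(px, px, mask);   // error: last argument must be a constant
}
*/

int main(void)
{
    float out[4];
    _mm_storeu_ps(out, bgr_from_rgb(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* prints 3 2 1 4 */
    return 0;
}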
There are quite a few possible channel permutations that can be specified in dstOrder, so to implement all combinations we would have to write code with many branches and a different mask value for every possible combination of channels. That is a rather tedious task; moreover, the original SSE instruction set has since been significantly extended, and the code can be made much simpler. The closest successor of SSE that contains an instruction very useful for our purposes is SSSE3, namely:
PSHUFB xmm1, xmm2/m128
__m128i _mm_shuffle_epi8(__m128i a, __m128i b)
This instruction can be used effectively in a variety of algorithms where several neighboring input elements are needed to compute a single output element. It rearranges the bytes of the destination in an arbitrary order and can also zero a destination byte when the high bit of the corresponding mask byte is set, see Figure 2.
Although the instruction shuffles bytes, we can apply it to float data as well.
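A small self-contained sketch of this behavior (values chosen only for illustration): each control byte selects a source byte, a control byte with its high bit set zeroes the destination byte, and four consecutive control bytes 4*k .. 4*k+3 therefore move a whole 32-bit float.

#include <stdio.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

int main(void)
{
    __m128  src  = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   /* elements 0..3 hold 1 2 3 4 */
    /* Control bytes: copy bytes 0..3, 4..7, 8..11 (the first three floats)
       and zero the last float via control bytes with the high bit set.     */
    __m128i ctrl = _mm_set_epi8((char)0x80, (char)0x80, (char)0x80, (char)0x80,
                                11, 10, 9, 8,   7, 6, 5, 4,   3, 2, 1, 0);
    __m128  dst  = _mm_castsi128_ps(
                       _mm_shuffle_epi8(_mm_castps_si128(src), ctrl));

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* prints 1 2 3 0 */
    return 0;
}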
Code
int s, d, h, ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
__m128 xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m128i xmm8 = _mm_set_epi8(4*ord3+3, 4*ord3+2, 4*ord3+1, 4*ord3,
                            4*ord2+3, 4*ord2+2, 4*ord2+1, 4*ord2,
                            4*ord1+3, 4*ord1+2, 4*ord1+1, 4*ord1,
                            4*ord0+3, 4*ord0+2, 4*ord0+1, 4*ord0);
__m128i xmm9  = _mm_set_epi32(0, 0, 0, 0); // unmodified channel mask
__m128i xmm10 = _mm_set_epi32(0, 0, 0, 0); // val
if( ord3 >= 3 ){
    xmm8 = _mm_or_si128(xmm8, _mm_set_epi32(-1, 0, 0, 0));
    if( ord3 > 3 ){ xmm9  = _mm_or_si128(xmm9,  _mm_set_epi32(-1, 0, 0, 0)); }
    else          { xmm10 = _mm_or_si128(xmm10, _mm_set_epi32(val, 0, 0, 0)); }
}
if( ord2 >= 3 ){
    xmm8 = _mm_or_si128(xmm8, _mm_set_epi32( 0,-1, 0, 0));
    if( ord2 > 3 ){ xmm9  = _mm_or_si128(xmm9,  _mm_set_epi32( 0,-1, 0, 0)); }
    else          { xmm10 = _mm_or_si128(xmm10, _mm_set_epi32( 0,val, 0, 0)); }
}
if( ord1 >= 3 ){
    xmm8 = _mm_or_si128(xmm8, _mm_set_epi32( 0, 0,-1, 0));
    if( ord1 > 3 ){ xmm9  = _mm_or_si128(xmm9,  _mm_set_epi32( 0, 0,-1, 0)); }
    else          { xmm10 = _mm_or_si128(xmm10, _mm_set_epi32( 0, 0,val, 0)); }
}
if( ord0 >= 3 ){
    xmm8 = _mm_or_si128(xmm8, _mm_set_epi32( 0, 0, 0,-1));
    if( ord0 > 3 ){ xmm9  = _mm_or_si128(xmm9,  _mm_set_epi32( 0, 0, 0,-1)); }
    else          { xmm10 = _mm_or_si128(xmm10, _mm_set_epi32( 0, 0, 0,val)); }
}
for( h = 0; h < height; h++ ) {
    s = 0; d = 0;
    for( ; d < (width&(~3))*4; s += 3*4, d += 4*4 ) {   // 4 pixels per iteration
        xmm0 = _mm_load_sd(pSrc+s    );  // g0 r0
        xmm1 = _mm_load_ss(pSrc+s+2  );  // b0
        xmm2 = _mm_load_sd(pSrc+s  +3);
        xmm3 = _mm_load_ss(pSrc+s+2+3);
        xmm4 = _mm_load_sd(pSrc+s  +6);
        xmm5 = _mm_load_ss(pSrc+s+2+6);
        xmm6 = _mm_load_sd(pSrc+s  +9);
        xmm7 = _mm_load_ss(pSrc+s+2+9);
        xmm0 = _mm_unpacklo_pd(xmm0, xmm1);  // b0 g0 r0
        xmm1 = _mm_unpacklo_pd(xmm2, xmm3);
        xmm2 = _mm_unpacklo_pd(xmm4, xmm5);
        xmm3 = _mm_unpacklo_pd(xmm6, xmm7);
        xmm4 = _mm_loadu_ps(pDst+d+ 0);      // dst
        xmm5 = _mm_loadu_ps(pDst+d+ 4);
        xmm6 = _mm_loadu_ps(pDst+d+ 8);
        xmm7 = _mm_loadu_ps(pDst+d+12);
        xmm0 = _mm_shuffle_epi8(xmm0, xmm8); // permute channels, zero the val/unmodified slots
        xmm1 = _mm_shuffle_epi8(xmm1, xmm8);
        xmm2 = _mm_shuffle_epi8(xmm2, xmm8);
        xmm3 = _mm_shuffle_epi8(xmm3, xmm8);
        xmm4 = _mm_and_ps(xmm4, xmm9);       // keep the unmodified dst channel
        xmm5 = _mm_and_ps(xmm5, xmm9);
        xmm6 = _mm_and_ps(xmm6, xmm9);
        xmm7 = _mm_and_ps(xmm7, xmm9);
        xmm0 = _mm_or_ps(xmm0, xmm4);        // join with the unmodified channel
        xmm1 = _mm_or_ps(xmm1, xmm5);
        xmm2 = _mm_or_ps(xmm2, xmm6);
        xmm3 = _mm_or_ps(xmm3, xmm7);
        xmm0 = _mm_or_ps(xmm0, xmm10);       // join with val
        xmm1 = _mm_or_ps(xmm1, xmm10);
        xmm2 = _mm_or_ps(xmm2, xmm10);
        xmm3 = _mm_or_ps(xmm3, xmm10);
        _mm_storeu_ps(pDst+d   , xmm0);      // store 4 output pixels
        _mm_storeu_ps(pDst+d+ 4, xmm1);
        _mm_storeu_ps(pDst+d+ 8, xmm2);
        _mm_storeu_ps(pDst+d+12, xmm3);
    }
    for( ; d < width*4; s += 3, d += 4 ) {               // remainder: 1 pixel at a time
        __m128 xmm0, xmm1, xmm4;
        xmm0 = _mm_load_sd(pSrc+s  );
        xmm1 = _mm_load_ss(pSrc+s+2);
        xmm0 = _mm_unpacklo_pd(xmm0, xmm1);
        xmm4 = _mm_loadu_ps(pDst+d);
        xmm0 = _mm_shuffle_epi8(xmm0, xmm8);
        xmm4 = _mm_and_ps(xmm4, xmm9);       // apply unmodified mask
        xmm0 = _mm_or_ps(xmm0, xmm4);
        xmm0 = _mm_or_ps(xmm0, xmm10);
        _mm_storeu_ps(pDst+d, xmm0);
    }
    pSrc = (float*)((char*)pSrc + srcStep);
    pDst = (float*)((char*)pDst + dstStep);
}
As you can see, the SSSE3 code uses quite a few SIMD intrinsics; the comments explain what they are used for. The _mm_unpacklo_pd intrinsics merge values read into separate registers into a single XMM register. The _mm_and_ps and _mm_or_ps intrinsics merge, by mask, with the unmodified channel and with the val value. The actual channel permutation is done by _mm_shuffle_epi8. Measuring the execution time gives P = 0.4. Although the code looks more cumbersome than the C version, the performance improved by roughly 2.5X, which is a pretty good result.
Second version of SIMD code using AVX
Today, high-end Intel processors have long supported at least the AVX instruction set. The vector registers used by AVX instructions are 256 bits long, i.e. twice as wide as the 128-bit SSE registers. However, AVX has two significant limitations for our task. First, AVX provides 256-bit permutation instructions only for the float and double data types, so there is no 256-bit analogue of _mm_shuffle_epi8 we could use. Second, AVX introduced the so-called lane concept: a 256-bit AVX register is treated as a pair of 128-bit halves called lanes, and most 256-bit extensions of 128-bit horizontal instructions operate only within each lane. As a result, the AVX version of code that already has an SSE version often looks like the scheme shown in Figure 3.
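A small self-contained sketch of the lane restriction (values made up for illustration): _mm256_permutevar_ps (VPERMILPS) interprets each control index only within its own 128-bit lane, so an element can never cross from the lower lane into the upper one.

#include <stdio.h>
#include <immintrin.h>   /* AVX */

int main(void)
{
    /* elements 0..7 hold the values 0..7 */
    __m256  v   = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0);
    /* Ask every output element for source index 7; inside each lane only
       the two low bits of the index are used, i.e. 7 & 3 == 3.            */
    __m256i idx = _mm256_set1_epi32(7);
    __m256  r   = _mm256_permutevar_ps(v, idx);

    float out[8];
    _mm256_storeu_ps(out, r);
    /* The lower lane broadcasts its element 3, the upper lane its element 7:
       prints 3 3 3 3 7 7 7 7, so index 7 never reaches the lower lane.      */
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}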
Therefore, the AVX version of our code might look something like this:
Code
int s, d, h, ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
__m256 ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
__m128 xmm0, xmm1, xmm2, xmm3;
__m256 ymm10 = _mm256_setzero_ps(); // val
__m256i ymm8 = _mm256_set_epi32(dstOrder[3], dstOrder[2], dstOrder[1], dstOrder[0],
                                dstOrder[3], dstOrder[2], dstOrder[1], dstOrder[0]);
__m256 ymm9  = _mm256_setzero_ps();
__m256 ymm11 = _mm256_setzero_ps();
__m256i YMM_N[4], YMM_V[4];
YMM_N[0] = _mm256_set_m128(_mm_set_epi32( 0, 0, 0,-1), _mm_set_epi32( 0, 0, 0,-1));
YMM_N[1] = _mm256_set_m128(_mm_set_epi32( 0, 0,-1, 0), _mm_set_epi32( 0, 0,-1, 0));
YMM_N[2] = _mm256_set_m128(_mm_set_epi32( 0,-1, 0, 0), _mm_set_epi32( 0,-1, 0, 0));
YMM_N[3] = _mm256_set_m128(_mm_set_epi32(-1, 0, 0, 0), _mm_set_epi32(-1, 0, 0, 0));
YMM_V[0] = _mm256_set_m128(_mm_set_epi32( 0, 0, 0,val), _mm_set_epi32( 0, 0, 0,val));
YMM_V[1] = _mm256_set_m128(_mm_set_epi32( 0, 0,val, 0), _mm_set_epi32( 0, 0,val, 0));
YMM_V[2] = _mm256_set_m128(_mm_set_epi32( 0,val, 0, 0), _mm_set_epi32( 0,val, 0, 0));
YMM_V[3] = _mm256_set_m128(_mm_set_epi32(val, 0, 0, 0), _mm_set_epi32(val, 0, 0, 0));
// build the masks for the unmodified channel and for val
for( s = 0; s < 4; s++ ){
    if( dstOrder[s] >= 3 ){
        ymm9 = _mm256_or_ps(ymm9, YMM_N[s]);
        if( dstOrder[s] > 3 ){ ymm11 = _mm256_or_si256(ymm11, YMM_N[s]); }  // unmodified channel
        else                 { ymm10 = _mm256_or_si256(ymm10, YMM_V[s]); }  // val
    }
}
for( h = 0; h < height; h++ ) {
    s = 0; d = 0;
    for( ; d < (width&(~1))*4; s += 3*2, d += 4*2 ) {   // 2 pixels per iteration
        xmm0 = _mm_loadu_ps    (pSrc+s      );          // src
        xmm2 = _mm_load_sd     (pSrc+s   + 3);
        xmm3 = _mm_load_ss     (pSrc+s+2 + 3);
        xmm2 = _mm_unpacklo_pd (xmm2, xmm3  );
        ymm0 = _mm256_set_m128 (xmm2, xmm0  );          // combine into one 256-bit register
        ymm4 = _mm256_loadu_ps (pDst+d      );          // dst
        ymm0 = _mm256_permutevar_ps(ymm0, ymm8);        // permute within each lane
        ymm4 = _mm256_and_ps   (ymm4, ymm11);           // keep the unmodified dst channel
        ymm0 = _mm256_andnot_ps(ymm9, ymm0 );           // clear the val/unmodified slots
        ymm0 = _mm256_or_ps    (ymm0, ymm4 );           // join with the unmodified channel
        ymm0 = _mm256_or_ps    (ymm0, ymm10);           // join with val
        _mm256_storeu_ps(pDst+d, ymm0);
    }
    // remainder: 1 pixel at a time
    pSrc = (float *)((char*)pSrc + srcStep);
    pDst = (float *)((char*)pDst + dstStep);
}
A natural question arises: why does our 256-bit AVX code load data with 128-bit instructions? The reason is the VPERMILPS instruction used in the line
ymm0 = _mm256_permutevar_ps(ymm0, ymm8);
As explained above, this instruction permutes elements within the lower and upper halves of the register (i.e. within each lane) independently. Therefore, even though two pixels lie side by side in memory, we are forced to read them so that they land in different lanes. Notice how the ymm8 mask register is formed: its upper and lower halves contain exactly the same permutation indices. The code is written so that the number of loop iterations is preserved and does not affect the final performance. Measuring the performance gives
P = 0.31. Thus, the speed-up of the AVX code relative to C is 3.2X, versus 2.48X for SSSE3.
Third version of SIMD code using AVX2
AVX2 is the further development of AVX. The length of the vector registers did not change, but new instructions appeared that help us speed up the code. In particular, AVX2 introduced cross-lane permutation instructions such as _mm256_permutevar8x32_ps, which can rearrange the elements of a 256-bit register in an arbitrary way, so our code could look like this.
Code
__m256i ymm8 = _mm256_set_epi32(dstOrder[3]+3, dstOrder[2]+3, dstOrder[1]+3, dstOrder[0]+3,
                                dstOrder[3],   dstOrder[2],   dstOrder[1],   dstOrder[0]);

for( ; d < (width&(~1))*4; s += 3*2, d += 4*2 ) {
    ymm0 = _mm256_loadu_ps(pSrc+s);
    ymm4 = _mm256_loadu_ps(pDst+d);
    ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
    ymm4 = _mm256_and_ps   (ymm4, ymm11);  // apply unmodified mask
    ymm0 = _mm256_andnot_ps(ymm9, ymm0 );  // apply unmodified and val mask with inverse
    ymm0 = _mm256_or_ps    (ymm0, ymm4 );  // join with unmodified
    ymm0 = _mm256_or_ps    (ymm0, ymm10);  // join with val
    _mm256_storeu_ps(pDst+d, ymm0);
}
Notice that the permutation mask in the ymm8 register has changed completely: it now spans the whole register, i.e. its indices refer to all elements of the 256-bit register.
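Here is a small self-contained worked example of this cross-lane permutation (the pixel values and dstOrder are invented for illustration, but the ymm8 construction follows the code above): with dstOrder = {2, 1, 0, 3} both pixels are converted to BGR in a single shuffle.

#include <stdio.h>
#include <immintrin.h>   /* AVX2 */

int main(void)
{
    /* Two packed RGB pixels followed by the next pixel's data:
       r0 g0 b0  r1 g1 b1  r2 g2                                         */
    float src[8] = { 10, 11, 12,   20, 21, 22,   30, 31 };
    /* dstOrder = {2,1,0,3}: built exactly like ymm8 above,
       the upper four indices are dstOrder[i] + 3.                        */
    __m256i idx = _mm256_set_epi32(3+3, 0+3, 1+3, 2+3,  3, 0, 1, 2);
    __m256  v   = _mm256_loadu_ps(src);
    __m256  r   = _mm256_permutevar8x32_ps(v, idx);

    float out[8];
    _mm256_storeu_ps(out, r);
    /* Prints 12 11 10 20  22 21 20 30: b0 g0 r0 ?  b1 g1 r1 ?,
       where the '?' slots are later overwritten with val because
       dstOrder[3] == 3.                                                  */
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}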
However, this code has a problem with how the ymm0 register is loaded: it reads 2 extra float elements. That is not acceptable, because the read may go beyond the image boundary. Here the masked load instruction that appeared in AVX2 helps us again:
__m256i _mm256_maskload_epi32 (int const *, __m256i);
This instruction reads only the required elements from memory and nothing extra, which prevents a possible segmentation fault. Now the corrected code looks like this.
Code
__m256i ld_mask = _mm256_set_epi32(0, 0, -1, -1, -1, -1, -1, -1);

for( ; d < (width&(~1))*4; s += 3*2, d += 4*2 ) {
    ymm0 = _mm256_maskload_ps(pSrc+s, ld_mask);
    ymm4 = _mm256_loadu_ps   (pDst+d);
    ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
    ymm4 = _mm256_and_ps   (ymm4, ymm11);  // apply unmodified mask
    ymm0 = _mm256_andnot_ps(ymm9, ymm0 );  // apply unmodified and val mask with inverse
    ymm0 = _mm256_or_ps    (ymm0, ymm4 );  // join with unmodified
    ymm0 = _mm256_or_ps    (ymm0, ymm10);  // join with val
    _mm256_storeu_ps(pDst+d, ymm0);
}
The performance of this code is P = 0.286, so the speed-up increased from k = 3.2X to k = 3.5X. In addition to the masked load, we can also leave the unmodified channel untouched on store by using the masked store instruction
void _mm256_maskstore_ps (float *, __m256i, __m256);
Code
__m256i ymmNA = _mm256_set_epi32(0, -1, -1, -1, 0, -1, -1, -1);

for( ; d < (width&(~1))*4; s += 3*2, d += 4*2 ) {
    ymm0 = _mm256_maskload_ps(pSrc+s, ld_mask);
    ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
    ymm0 = _mm256_andnot_ps(ymm9, ymm0);       // apply unmodified and val mask with inverse
    ymm0 = _mm256_or_ps    (ymm0, ymm10);      // join with val
    _mm256_maskstore_ps(pDst+d, ymmNA, ymm0);  // skip the unmodified channel on store
}
Note that the 128-bit predecessor of this instruction,
void _mm_maskmoveu_si128 (__m128i, __m128i, char *),
stores the data with a non-temporal store, whereas in the AVX versions _mm256_maskstore_ps and _mm_maskstore_ps this mechanism was dropped. What a non-temporal store is and how it is used in the IPP library will be discussed in one of the following articles. Measuring the performance shows a small additional gain:
P = 0.27, k = 3.7X.
SIMD version with AVX512
So we come to the most interesting part of this article: optimizing the code with AVX512.
As the name suggests, the vector registers have doubled in length from 256 to 512 bits, and the lane concept is preserved. The second innovation is that mask operations are now pervasive and apply to most vector operations; special mask registers were introduced for this purpose. Mask registers have their own data types __mmask8, __mmask16, __mmask32 and __mmask64, but you can operate on them as ordinary integer types, since they are declared in the header file zmmintrin.h. Thanks to the masks, the main processing loop of the AVX512 version of our function becomes quite minimalistic.
Code
int s, d, h, ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
__mmask16 mskv;  // mask for joining with val
__mmask16 mska;  // mask for the unmodified channel
__m512 z0;
__m512 zv = _mm512_set1_ps(val);
__m512i zi0 = _mm512_set_epi32(ord3+9, ord2+9, ord1+9, ord0+9,
                               ord3+6, ord2+6, ord1+6, ord0+6,
                               ord3+3, ord2+3, ord1+3, ord0+3,
                               ord3+0, ord2+0, ord1+0, ord0+0);
mska = mskv = 0xFFFF;
for (int i = 0; i < 4; i++) {
    if (dstOrder[i] == 3) { mskv ^= (0x1111 << i); }
    if (dstOrder[i] >  3) { mska ^= (0x1111 << i); }
}
for( h = 0; h < height; h++ ) {
    s = 0; d = 0;
    for( ; d < ((width*4) & (~15)); s += 3*4, d += 4*4 ) {
        z0 = _mm512_mask_loadu_ps(_mm512_setzero_ps(), 0xFFF, (float *)(pSrc + s));
        z0 = _mm512_mask_permutexvar_ps(zv, mskv, zi0, z0);
        _mm512_mask_storeu_ps(pDst + d, mska, z0);
    }
    pSrc = (float *)((char *)pSrc + srcStep);
    pDst = (float *)((char *)pDst + dstStep);
}
The number of intrinsics in one loop iteration has been reduced to three, and each of them uses a mask. The first intrinsic,
z0 = _mm512_mask_loadu_ps(_mm512_setzero_ps(), 0xFFF, (float *)(pSrc + s)),
uses the mask 0xFFF to read 4 RGB pixels, i.e. 12 floats, from memory; in binary, 0xFFF = 111111111111b. The next intrinsic,
z0 = _mm512_mask_permutexvar_ps(zv, mskv, zi0, z0),
performs two functions at once: it rearranges the channels of the input pixels and merges them, according to the mask, with the value val. The mskv mask is formed before the loop, and its bits work as follows: a bit equal to '1' means "put the result of the operation into the destination register", while '0' means "keep the value from the register passed as the first argument", in our case zv. The third intrinsic,
_mm512_mask_storeu_ps(pDst + d, mska, z0),
stores the resulting 4 pixels to memory while writing nothing to the unmodified channel. The performance of this code is P = 0.183, k = 5.44X.
The reader can compare the SSSE3 code with the AVX512 code, which uses only 3 intrinsics per iteration. It is this brevity of the AVX512 code that inspired the author to write this article.
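To make the mask construction concrete, here is a small self-contained worked example (the pixel values and dstOrder are invented for illustration); it follows the kernel above with the masks for dstOrder = {2, 1, 0, 3} written out as constants: mskv = 0xFFFF ^ (0x1111 << 3) = 0x7777 because dstOrder[3] == 3, and mska stays 0xFFFF because no index is greater than 3.

#include <stdio.h>
#include <immintrin.h>   /* AVX-512F */

int main(void)
{
    /* Four packed RGB pixels. */
    float src[12] = { 10, 11, 12,  20, 21, 22,  30, 31, 32,  40, 41, 42 };
    float dst[16];
    float val = 1.0f;

    __mmask16 mskv = 0xFFFF ^ (0x1111 << 3);   /* = 0x7777: slot 3 of each pixel keeps val */
    /* Indices built as in zi0 above for dstOrder = {2,1,0,3}:
       {2,1,0,3, 5,4,3,6, 8,7,6,9, 11,10,9,12} in element order 0..15.     */
    __m512i zi0 = _mm512_set_epi32(12,  9, 10, 11,   9, 6, 7, 8,
                                    6,  3,  4,  5,   3, 0, 1, 2);
    __m512  zv  = _mm512_set1_ps(val);

    __m512 z0 = _mm512_mask_loadu_ps(_mm512_setzero_ps(), 0xFFF, src);   /* 12 floats */
    z0 = _mm512_mask_permutexvar_ps(zv, mskv, zi0, z0);                  /* BGR + val */
    _mm512_mask_storeu_ps(dst, 0xFFFF, z0);                              /* mska = 0xFFFF */

    /* Prints: 12 11 10 1  22 21 20 1  32 31 30 1  42 41 40 1 */
    for (int i = 0; i < 16; i++) printf("%g ", dst[i]);
    printf("\n");
    return 0;
}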
Summary of the performance gains:

Code version | P (time relative to C) | Speed-up vs C
C            | 1.0                    | 1.0X
SSSE3        | 0.4                    | 2.5X
AVX          | 0.31                   | 3.2X
AVX2         | 0.286 / 0.27           | 3.5X / 3.7X
AVX512       | 0.183                  | 5.44X
Conclusion
Of course, the use of masks in this algorithm is fairly obvious, but they can also be applied in other, more complex algorithms, which will be the subject of another article.
After reading this article, the reader can try to use SIMD code in their own applications, and readers who already have such experience can appreciate the advantages of the cross-lane permutation instructions and masked instructions in the new Intel processors that support the AVX512 instruction set.
I would also like to note that this article shows only a small part of the optimization techniques used in the IPP library that allow it to achieve its performance.
The reader may also simply try the IPP library in their own application, freeing up time for their actual applied problems. An additional benefit is that by the time processors with the AVX512 instruction set reach the market, the IPP library will already contain versions of the most frequently used functions optimized for AVX512.