
A gradual increase in performance when applying vector permutation instructions, from SSE to AVX3.1

Applications often require the highest possible performance, and moving to a new platform can help achieve it. Additional performance gains come not only from the architectural changes of the new platform, such as a higher processor frequency or greater memory bandwidth, but also from using new processor instruction sets. This article examines that approach using the vector instructions that appeared successively in Intel processors, from SSE up to the new AVX 3.1 (AVX-512) instruction set, recently published in the corresponding specification on the Intel website.



The author of this article participates in the development of the Intel IPP library, so the performance gains are discussed here using one of the functions from this library as an example. I hope this article will be useful to the following categories of readers:



In addition to these categories of readers, the article can help developers who are interested in optimizing their applications and want to minimize the cost of low-level optimization to become more closely acquainted with the IPP library.



The ippiSwapChannels_32f_C3C4R function

The IPP library contains optimized 32-bit and 64-bit versions of its functions for processors supporting the SSSE3, SSE4.1, AVX, AVX2, and AVX31 instruction sets, so we will consider optimization opportunities using these sets. In general, IPP contains many functions for various data-processing domains. These functions implement both mathematically involved algorithms, such as the Fourier transform, and algorithms that present no particular difficulty. It is one of these fairly simple functions, ippiSwapChannels_32f_C3C4R, that we will consider in this article, applying to it, one after another, the instruction sets from SSSE3 to AVX31.



The API for this function is as follows.

int ippiSwapChannels_32f_C3C4R( const float* pSrc, int srcStep,
                                float* pDst, int dstStep,
                                int width, int height,
                                const int dstOrder[4], float val );


The function reads the source float image through the pSrc pointer, with srcStep bytes between rows and a region of width x height pixels, rearranges the channels of each pixel according to the new order given in dstOrder, and writes the resulting image through the pDst pointer. The function can also write a user-specified fixed value val into a channel. Duplicating channels in the output image is allowed. The scheme of the function is shown in Figure 1.

[Figure 1: channel reordering performed by ippiSwapChannels_32f_C3C4R]
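
As an illustration, here is a minimal usage sketch that follows the signature given above; the image size, buffers, and the RGB-to-BGRA order are assumptions made for this example, not part of the original article.

#include <stdlib.h>

int example(void)
{
    int   width = 640, height = 480;
    int   srcStep = width * 3 * sizeof(float);   /* bytes per source row (3-channel image)      */
    int   dstStep = width * 4 * sizeof(float);   /* bytes per destination row (4-channel image) */
    float *pSrc = (float*)malloc((size_t)srcStep * height);
    float *pDst = (float*)malloc((size_t)dstStep * height);
    const int dstOrder[4] = { 2, 1, 0, 3 };      /* B, G, R, then val in the fourth channel */

    /* fill pSrc with RGB pixel data here */
    int status = ippiSwapChannels_32f_C3C4R(pSrc, srcStep, pDst, dstStep,
                                            width, height, dstOrder, 1.0f);
    free(pSrc); free(pDst);
    return status;
}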


C code

The C code of the function is fairly simple (see below). Each iteration of the loop forms one pixel of the output image, and one of three possible actions happens to each channel of the output pixel:

- if dstOrder[i] < 3, the i-th output channel is copied from the corresponding source channel;
- if dstOrder[i] == 3, the value val is written to the i-th output channel;
- if dstOrder[i] > 3, the i-th output channel is left unchanged.
Code
  int s, d, h;
  int ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];

  for( h = 0; h < height; h++ ) {
      for( s = 0, d = 0; d < width*4; s += 3, d += 4 ) {
          /* channel 0 of the output pixel */
          if( ord0 < 3 )       { pDst[d]     = pSrc[s + ord0]; }
          else if( ord0 == 3 ) { pDst[d]     = val; }
          /* channel 1 */
          if( ord1 < 3 )       { pDst[d + 1] = pSrc[s + ord1]; }
          else if( ord1 == 3 ) { pDst[d + 1] = val; }
          /* channel 2 */
          if( ord2 < 3 )       { pDst[d + 2] = pSrc[s + ord2]; }
          else if( ord2 == 3 ) { pDst[d + 2] = val; }
          /* channel 3 */
          if( ord3 < 3 )       { pDst[d + 3] = pSrc[s + ord3]; }
          else if( ord3 == 3 ) { pDst[d + 3] = val; }
          /* ord >= 4: the output channel is left untouched */
      }
      pSrc = (float*)((char*)pSrc + srcStep);
      pDst = (float*)((char*)pDst + dstStep);
  }




First version of SIMD code using SSSE3

Now, as planned, we will try to optimize this code with vector SIMD instructions, all the way up to AVX3.1. We will use the Intel compiler (Intel Composer) and measure the performance of every code variant on an internal simulator of the server version of Skylake, which supports the AVX31 (SKX) set, since we want to evaluate the benefit of the new instructions without the influence of architectural changes. We also specify the maximum optimization level for the compiler, the same for all code variants. The performance of the C code measured in this way is taken as the reference value P = 1; a lower P means less execution time, i.e. faster code.

We hope that the reader is already familiar with the concept of vector registers; if not, you can get acquainted with them, and with the instructions used below, through Intel's convenient Intrinsics Guide navigator.

So let us see how this function could be vectorized, that is, rewritten using SSE vector registers. Following the logic of the function, we need to rearrange float elements within a vector register, so we need suitable element-permutation instructions. The SSE instruction set offers the following:

  SHUFPS   xmm1, xmm2/m128, imm8
  UNPCKHPS xmm1, xmm2/m128
  UNPCKLPS xmm1, xmm2/m128

  __m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int mask)
  __m128 _mm_unpackhi_ps(__m128 a, __m128 b)
  __m128 _mm_unpacklo_ps(__m128 a, __m128 b)


These instructions are very useful and could be applied to our task as well, but a problem arises. The third operand of the SHUFPS instruction is an immediate imm8 value that encodes the permutation, i.e. a compile-time constant; if you try to pass a variable to the intrinsic, the compiler reports an error. The UNPCKHPS / UNPCKLPS instructions interleave the elements of two registers in only one fixed way. There are quite a few possible channel permutations that can be specified in dstOrder, so implementing all combinations would require code with many branches and a different mask value for every possible combination of channels. That is a rather tedious job; moreover, the original SSE instruction set has since been significantly extended, and the code can be made much simpler. The generation closest to SSE that contains an instruction very useful for our purpose is called SSSE3, and the instruction in question is:

  PSHUFB xmm1, xmm2/m128

  __m128i _mm_shuffle_epi8(__m128i a, __m128i b)


This instruction can be used effectively in a variety of algorithms where several neighboring input elements are needed to compute a single output element. It rearranges the bytes of the destination register in a completely arbitrary order, and it can also zero a byte when the high bit of the corresponding mask byte is set, see Figure 2.

[Figure 2: byte permutation performed by PSHUFB (_mm_shuffle_epi8): each mask byte selects a source byte, and a set high bit zeroes the result byte]



Although the instruction permutes bytes, we can apply it to float data as well.
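
For example, here is a minimal sketch (the values and the helper function are illustrative, not from the original) that uses _mm_shuffle_epi8 to swap the R and B channels of a single RGBA float pixel:

#include <tmmintrin.h>   /* SSSE3 */

static __m128 swap_r_and_b(__m128 px)   /* px: element 0 = R, 1 = G, 2 = B, 3 = A */
{
    /* byte indices of the output, highest byte first in _mm_set_epi8:
       output float 0 takes source bytes 8..11 (B), float 1 takes bytes 4..7 (G),
       output float 2 takes source bytes 0..3  (R), float 3 takes bytes 12..15 (A) */
    __m128i mask = _mm_set_epi8(15,14,13,12,  3,2,1,0,  7,6,5,4,  11,10,9,8);
    return _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(px), mask));
}
/* swap_r_and_b(_mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f)) returns { 2.0f, 1.0f, 0.0f, 3.0f } */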

Code
  int s, d, h;
  int ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
  __m128  xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
  /* byte permutation mask for _mm_shuffle_epi8: 4 bytes per output channel */
  __m128i xmm8 = _mm_set_epi8(4*ord3+3, 4*ord3+2, 4*ord3+1, 4*ord3,
                              4*ord2+3, 4*ord2+2, 4*ord2+1, 4*ord2,
                              4*ord1+3, 4*ord1+2, 4*ord1+1, 4*ord1,
                              4*ord0+3, 4*ord0+2, 4*ord0+1, 4*ord0);
  __m128  xmm9  = _mm_setzero_ps();   /* mask of unmodified channels */
  __m128  xmm10 = _mm_setzero_ps();   /* val placed into its channel */

  if( ord3 >= 3 ) {
      xmm8 = _mm_or_si128(xmm8, _mm_set_epi32(-1, 0, 0, 0));   /* shuffle will zero channel 3 */
      if( ord3 > 3 ) { xmm9  = _mm_or_ps(xmm9,  _mm_castsi128_ps(_mm_set_epi32(-1, 0, 0, 0))); }
      else           { xmm10 = _mm_or_ps(xmm10, _mm_set_ps(val, 0, 0, 0)); }
  }
  if( ord2 >= 3 ) {
      xmm8 = _mm_or_si128(xmm8, _mm_set_epi32(0, -1, 0, 0));
      if( ord2 > 3 ) { xmm9  = _mm_or_ps(xmm9,  _mm_castsi128_ps(_mm_set_epi32(0, -1, 0, 0))); }
      else           { xmm10 = _mm_or_ps(xmm10, _mm_set_ps(0, val, 0, 0)); }
  }
  if( ord1 >= 3 ) {
      xmm8 = _mm_or_si128(xmm8, _mm_set_epi32(0, 0, -1, 0));
      if( ord1 > 3 ) { xmm9  = _mm_or_ps(xmm9,  _mm_castsi128_ps(_mm_set_epi32(0, 0, -1, 0))); }
      else           { xmm10 = _mm_or_ps(xmm10, _mm_set_ps(0, 0, val, 0)); }
  }
  if( ord0 >= 3 ) {
      xmm8 = _mm_or_si128(xmm8, _mm_set_epi32(0, 0, 0, -1));
      if( ord0 > 3 ) { xmm9  = _mm_or_ps(xmm9,  _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, -1))); }
      else           { xmm10 = _mm_or_ps(xmm10, _mm_set_ps(0, 0, 0, val)); }
  }

  for( h = 0; h < height; h++ ) {
      s = 0; d = 0;
      /* main loop: 4 pixels per iteration */
      for( ; d < (width & ~3)*4; s += 3*4, d += 4*4 ) {
          xmm0 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s)));      /* r0 g0 */
          xmm1 = _mm_load_ss(pSrc + s + 2);                                  /* b0    */
          xmm2 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s + 3)));
          xmm3 = _mm_load_ss(pSrc + s + 2 + 3);
          xmm4 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s + 6)));
          xmm5 = _mm_load_ss(pSrc + s + 2 + 6);
          xmm6 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s + 9)));
          xmm7 = _mm_load_ss(pSrc + s + 2 + 9);
          /* merge each pixel into one register: xmm0 = { r0, g0, b0, x }, ... */
          xmm0 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm0), _mm_castps_pd(xmm1)));
          xmm1 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm2), _mm_castps_pd(xmm3)));
          xmm2 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm4), _mm_castps_pd(xmm5)));
          xmm3 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm6), _mm_castps_pd(xmm7)));
          xmm4 = _mm_loadu_ps(pDst + d + 0);                                 /* destination pixels */
          xmm5 = _mm_loadu_ps(pDst + d + 4);
          xmm6 = _mm_loadu_ps(pDst + d + 8);
          xmm7 = _mm_loadu_ps(pDst + d + 12);
          /* rearrange the channels */
          xmm0 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(xmm0), xmm8));
          xmm1 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(xmm1), xmm8));
          xmm2 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(xmm2), xmm8));
          xmm3 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(xmm3), xmm8));
          /* keep the unmodified channel of dst */
          xmm4 = _mm_and_ps(xmm4, xmm9);
          xmm5 = _mm_and_ps(xmm5, xmm9);
          xmm6 = _mm_and_ps(xmm6, xmm9);
          xmm7 = _mm_and_ps(xmm7, xmm9);
          /* merge with the unmodified channel */
          xmm0 = _mm_or_ps(xmm0, xmm4);
          xmm1 = _mm_or_ps(xmm1, xmm5);
          xmm2 = _mm_or_ps(xmm2, xmm6);
          xmm3 = _mm_or_ps(xmm3, xmm7);
          /* merge with the val channel */
          xmm0 = _mm_or_ps(xmm0, xmm10);
          xmm1 = _mm_or_ps(xmm1, xmm10);
          xmm2 = _mm_or_ps(xmm2, xmm10);
          xmm3 = _mm_or_ps(xmm3, xmm10);
          /* store */
          _mm_storeu_ps(pDst + d,      xmm0);
          _mm_storeu_ps(pDst + d +  4, xmm1);
          _mm_storeu_ps(pDst + d +  8, xmm2);
          _mm_storeu_ps(pDst + d + 12, xmm3);
      }
      /* tail loop: the remaining pixels, one at a time */
      for( ; d < width*4; s += 3, d += 4 ) {
          xmm0 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s)));
          xmm1 = _mm_load_ss(pSrc + s + 2);
          xmm0 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm0), _mm_castps_pd(xmm1)));
          xmm4 = _mm_loadu_ps(pDst + d);
          xmm0 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(xmm0), xmm8));
          xmm4 = _mm_and_ps(xmm4, xmm9);            /* apply the unmodified-channel mask */
          xmm0 = _mm_or_ps(xmm0, xmm4);
          xmm0 = _mm_or_ps(xmm0, xmm10);
          _mm_storeu_ps(pDst + d, xmm0);
      }
      pSrc = (float*)((char*)pSrc + srcStep);
      pDst = (float*)((char*)pDst + dstStep);
  }




As you can see, our SSSE3 code uses quite a few SIMD instructions, and the comments explain what each is used for. The _mm_unpacklo_pd instructions merge values read into different registers into a single XMM register. The _mm_and_ps and _mm_or_ps instructions perform the mask merge with the unmodified channel and with the val value. The actual channel permutation is done by _mm_shuffle_epi8. Measuring the execution time gives P = 0.4. Even though the code looks more cumbersome than the C version, the performance gain is approximately 2.5X, which is a pretty good result.



Second version of SIMD code using AVX

Today, top-of-the-line Intel processors have long supported at least the AVX instruction set. The vector registers used by AVX instructions are 256 bits long, i.e. twice as wide as the 128-bit SSE registers. However, AVX has two significant limitations for us. First, AVX provides 256-bit permutation instructions only for the float and double data types, so we cannot use a 256-bit analogue of _mm_shuffle_epi8. Second, AVX introduced the so-called lane concept for vector instructions: a 256-bit AVX register is treated as a pair of 128-bit halves, called lanes, and most 256-bit extensions of 128-bit instructions with horizontal access work only inside each lane. As a result, an AVX port of existing SSE code often looks like what is shown in Figure 3.

[Figure 3: a typical AVX port of SSE code processes the two 128-bit lanes independently]
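
A minimal sketch (illustrative values, not from the original) shows the in-lane behavior: _mm256_permutevar_ps (VPERMILPS) interprets each index only within its own 128-bit lane, so the same index vector picks different elements in the two halves:

#include <immintrin.h>

static __m256 permute_in_lanes_demo(void)
{
    __m256  v   = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0); /* low lane: 0 1 2 3, high lane: 4 5 6 7 */
    __m256i idx = _mm256_set1_epi32(0);                   /* "take element 0" in every position   */
    return _mm256_permutevar_ps(v, idx);
    /* result: { 0, 0, 0, 0,  4, 4, 4, 4 }: each lane replicates its own element 0,
       the permutation never crosses the lane boundary */
}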

Therefore, the AVX version of our code might look something like this:

Code
  int s, d, h;
  int ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
  __m256  ymm0, ymm4;
  __m128  xmm0, xmm2, xmm3;
  /* per-lane permutation indices: the same dstOrder in both 128-bit lanes */
  __m256i ymm8  = _mm256_set_epi32(dstOrder[3], dstOrder[2], dstOrder[1], dstOrder[0],
                                   dstOrder[3], dstOrder[2], dstOrder[1], dstOrder[0]);
  __m256  ymm9  = _mm256_setzero_ps();   /* mask of channels taken from val or left unmodified */
  __m256  ymm10 = _mm256_setzero_ps();   /* val placed into its channel                        */
  __m256  ymm11 = _mm256_setzero_ps();   /* mask of unmodified channels                        */
  __m256  YMM_N[4], YMM_V[4];

  YMM_N[0] = _mm256_castsi256_ps(_mm256_set_epi32( 0, 0, 0,-1,  0, 0, 0,-1));
  YMM_N[1] = _mm256_castsi256_ps(_mm256_set_epi32( 0, 0,-1, 0,  0, 0,-1, 0));
  YMM_N[2] = _mm256_castsi256_ps(_mm256_set_epi32( 0,-1, 0, 0,  0,-1, 0, 0));
  YMM_N[3] = _mm256_castsi256_ps(_mm256_set_epi32(-1, 0, 0, 0, -1, 0, 0, 0));
  YMM_V[0] = _mm256_set_ps( 0,  0,  0, val,  0,  0,  0, val);
  YMM_V[1] = _mm256_set_ps( 0,  0, val, 0,   0,  0, val, 0);
  YMM_V[2] = _mm256_set_ps( 0, val, 0,  0,   0, val, 0,  0);
  YMM_V[3] = _mm256_set_ps(val, 0,  0,  0,  val, 0,  0,  0);

  /* build the masks */
  for( s = 0; s < 4; s++ ) {
      if( dstOrder[s] >= 3 ) {
          ymm9 = _mm256_or_ps(ymm9, YMM_N[s]);
          if( dstOrder[s] > 3 ) { ymm11 = _mm256_or_ps(ymm11, YMM_N[s]); } /* unmodified channel */
          else                  { ymm10 = _mm256_or_ps(ymm10, YMM_V[s]); } /* val channel        */
      }
  }

  for( h = 0; h < height; h++ ) {
      s = 0; d = 0;
      for( ; d < (width & ~1)*4; s += 3*2, d += 4*2 ) {   /* 2 pixels per iteration */
          xmm0 = _mm_loadu_ps(pSrc + s);                                      /* r0 g0 b0 r1 */
          xmm2 = _mm_castpd_ps(_mm_load_sd((const double*)(pSrc + s + 3)));   /* r1 g1       */
          xmm3 = _mm_load_ss(pSrc + s + 2 + 3);                               /* b1          */
          xmm2 = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(xmm2), _mm_castps_pd(xmm3)));
          ymm0 = _mm256_set_m128(xmm2, xmm0);          /* one source pixel per 128-bit lane */
          ymm4 = _mm256_loadu_ps(pDst + d);            /* destination pixels                */
          ymm0 = _mm256_permutevar_ps(ymm0, ymm8);     /* permute the channels in each lane */
          ymm4 = _mm256_and_ps(ymm4, ymm11);           /* keep the unmodified dst channel   */
          ymm0 = _mm256_andnot_ps(ymm9, ymm0);         /* clear the val/unmodified channels */
          ymm0 = _mm256_or_ps(ymm0, ymm4);             /* merge with the unmodified channel */
          ymm0 = _mm256_or_ps(ymm0, ymm10);            /* merge with the val channel        */
          _mm256_storeu_ps(pDst + d, ymm0);
      }
      /* the remaining single pixel (odd width) is processed as in the SSE version */
      pSrc = (float*)((char*)pSrc + srcStep);
      pDst = (float*)((char*)pDst + dstStep);
  }




A natural question may arise: why does our 256-bit AVX code load data with 128-bit instructions? The reason is the VPERMILPS instruction used in the line

ymm0 = _mm256_permutevar_ps (ymm0, ymm8)

As explained above, this instruction permutes elements within the lower and upper halves of the register (the lanes) independently. Therefore, even though the two pixels lie next to each other in memory, we have to load them so that they land in different lanes. Note how the mask register ymm8 is formed: it contains exactly the same permutation indices in its upper and lower halves. The code is written so that the number of loop iterations is preserved and does not affect the final performance. Measuring performance gives P = 0.31; the speedup of the AVX code relative to C is therefore 3.2X, versus approximately 2.5X for SSSE3.



Third version of SIMD code using AVX2

AVX2 is the further development of AVX. The length of the vector registers did not change, but new instructions appeared that help us speed up the code. AVX2 introduced the cross-lane permutation instruction VPERMPS, available as the intrinsic _mm256_permutevar8x32_ps. It allows the elements of a 256-bit register to be rearranged in an arbitrary way, so our code could look like this.

Code
  /* permutation indices for two pixels: the second source pixel starts 3 floats further on */
  __m256i ymm8 = _mm256_set_epi32(dstOrder[3]+3, dstOrder[2]+3, dstOrder[1]+3, dstOrder[0]+3,
                                  dstOrder[3],   dstOrder[2],   dstOrder[1],   dstOrder[0]);

  for( ; d < (width & ~1)*4; s += 3*2, d += 4*2 ) {
      ymm0 = _mm256_loadu_ps(pSrc + s);
      ymm4 = _mm256_loadu_ps(pDst + d);
      ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
      ymm4 = _mm256_and_ps(ymm4, ymm11);        /* apply the unmodified-channel mask     */
      ymm0 = _mm256_andnot_ps(ymm9, ymm0);      /* clear the val and unmodified channels */
      ymm0 = _mm256_or_ps(ymm0, ymm4);          /* merge with the unmodified channel     */
      ymm0 = _mm256_or_ps(ymm0, ymm10);         /* merge with the val channel            */
      _mm256_storeu_ps(pDst + d, ymm0);
  }




Notice that the permutation mask in the ymm8 register has changed completely: its indices now address the whole 256-bit register, not a single lane.
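
For contrast with the in-lane example above, here is a minimal sketch (illustrative values, not from the original) in which _mm256_permutevar8x32_ps reverses all eight elements, something VPERMILPS cannot do:

#include <immintrin.h>

static __m256 reverse_demo(void)
{
    __m256  v   = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0);  /* elements 0..7 */
    __m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    return _mm256_permutevar8x32_ps(v, idx);
    /* result: { 7, 6, 5, 4, 3, 2, 1, 0 }: every index addresses the full register,
       so the permutation freely crosses the lane boundary */
}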

However, this code has a problem with how the ymm0 register is loaded: the load reads 2 extra float elements. That is not acceptable, because the read may go beyond the image boundary. Here the masked load instruction introduced in AVX2 helps us again:

__m256i _mm256_maskload_epi32 (int const * mem_addr, __m256i mask);
__m256  _mm256_maskload_ps    (float const * mem_addr, __m256i mask);

This instruction reads only the required elements from memory and nothing extra, which prevents a possible segmentation fault; in our code we use its float counterpart, _mm256_maskload_ps. Now the corrected code looks like this.

Code
  __m256i ld_mask = _mm256_set_epi32(0, 0, -1, -1, -1, -1, -1, -1);   /* load only 6 floats (2 RGB pixels) */

  for( ; d < (width & ~1)*4; s += 3*2, d += 4*2 ) {
      ymm0 = _mm256_maskload_ps(pSrc + s, ld_mask);
      ymm4 = _mm256_loadu_ps(pDst + d);
      ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
      ymm4 = _mm256_and_ps(ymm4, ymm11);        /* apply the unmodified-channel mask     */
      ymm0 = _mm256_andnot_ps(ymm9, ymm0);      /* clear the val and unmodified channels */
      ymm0 = _mm256_or_ps(ymm0, ymm4);          /* merge with the unmodified channel     */
      ymm0 = _mm256_or_ps(ymm0, ymm10);         /* merge with the val channel            */
      _mm256_storeu_ps(pDst + d, ymm0);
  }




The performance of this code is P = 0.286, so the speedup changes from k = 3.2X to k = 3.5X. In addition to masked loads, we can also use a masked store to leave the unmodified channel untouched, using the instruction

void _mm256_maskstore_ps (float *, __m256i, __m256);

Code
  /* store mask: in this example the fourth output channel is assumed unmodified (dstOrder[3] > 3) */
  __m256i ymmNA = _mm256_set_epi32(0, -1, -1, -1,  0, -1, -1, -1);

  for( ; d < (width & ~1)*4; s += 3*2, d += 4*2 ) {
      ymm0 = _mm256_maskload_ps(pSrc + s, ld_mask);
      ymm0 = _mm256_permutevar8x32_ps(ymm0, ymm8);
      ymm0 = _mm256_andnot_ps(ymm9, ymm0);          /* clear the val and unmodified channels */
      ymm0 = _mm256_or_ps(ymm0, ymm10);             /* merge with the val channel            */
      _mm256_maskstore_ps(pDst + d, ymmNA, ymm0);   /* the masked-out channel is not written */
  }




Please note that the 128-bit predecessor of this instruction

void _mm_maskmoveu_si128 (__m128i a, __m128i mask, char* mem_addr)

stores its data with a non-temporal hint, but in the AVX versions _mm256_maskstore_ps and _mm_maskstore_ps it was decided to abandon this mechanism. What a non-temporal store is and how it is used in the IPP library will be discussed in one of the following articles. Measuring the performance, we get a small additional gain: P = 0.27, k = 3.7X.



SIMD version with AVX512

So we turn to the most interesting part of this article: optimizing the code for AVX512.

As the name suggests, the vector register length has doubled again, from 256 bits to 512; the lane concept is also preserved. The second innovation is that mask operations have become pervasive and now apply to most vector operations. Dedicated mask registers were introduced for this purpose. They have their own data types, __mmask8, __mmask16, __mmask32, __mmask64, but you can operate on them as ordinary integer types, since that is how they are declared in the header file zmmintrin.h. The use of masks makes the main processing loop of the AVX512 version quite minimalistic.
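
As a quick illustration (a minimal sketch with assumed inputs, not code from the article), a mask can be built with ordinary integer arithmetic and passed to a merge-masking intrinsic: lanes whose bit is set receive the new result, the rest keep the value of the src operand:

#include <immintrin.h>

/* merge-masking sketch: even elements get a + b, odd elements keep src */
static __m512 masked_add_demo(__m512 src, __m512 a, __m512 b)
{
    __mmask16 m = 0x5555;                    /* a mask register handled as a plain integer */
    return _mm512_mask_add_ps(src, m, a, b); /* bit set: take a+b; bit clear: keep src     */
}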

Code
  int s, d, h;
  int ord0 = dstOrder[0], ord1 = dstOrder[1], ord2 = dstOrder[2], ord3 = dstOrder[3];
  __mmask16 mskv;                             /* mask for merging with val              */
  __mmask16 mska;                             /* mask that skips the unmodified channel */
  __m512    z0;
  __m512    zv  = _mm512_set1_ps(val);
  /* permutation indices for 4 output pixels: source pixel p starts at float 3*p */
  __m512i   zi0 = _mm512_set_epi32(ord3+9, ord2+9, ord1+9, ord0+9,
                                   ord3+6, ord2+6, ord1+6, ord0+6,
                                   ord3+3, ord2+3, ord1+3, ord0+3,
                                   ord3+0, ord2+0, ord1+0, ord0+0);
  mska = mskv = 0xFFFF;
  for( int i = 0; i < 4; i++ ) {
      if( dstOrder[i] == 3 ) { mskv ^= (__mmask16)(0x1111 << i); }  /* these lanes take val        */
      if( dstOrder[i] >  3 ) { mska ^= (__mmask16)(0x1111 << i); }  /* these lanes are not written */
  }

  for( h = 0; h < height; h++ ) {
      s = 0; d = 0;
      for( ; d < ((width*4) & ~15); s += 3*4, d += 4*4 ) {          /* 4 pixels per iteration */
          z0 = _mm512_mask_loadu_ps(_mm512_setzero_ps(), 0xFFF, (const float*)(pSrc + s));
          z0 = _mm512_mask_permutexvar_ps(zv, mskv, zi0, z0);
          _mm512_mask_storeu_ps(pDst + d, mska, z0);
      }
      pSrc = (float*)((char*)pSrc + srcStep);
      pDst = (float*)((char*)pDst + dstStep);
  }




The number of intrinsics in one iteration of the cycle has been reduced to three, and each of them uses a mask. First intrinsic

z0 = _mm512_mask_loadu_ps (_mm512_setzero_ps (), 0xFFF, (float *) (pSrc + s))

uses the mask 0xFFF to read 4 RGB pixels, i.e. 12 floats, from memory; in binary, 0xFFF = 0000111111111111b, so only the 12 lower elements are loaded. The next intrinsic

z0 = _mm512_mask_permutexvar_ps (zv, mskv, zi0, z0)

performs two functions at once: it rearranges the channels of the input pixels and merges them, according to the mask, with the value val. The mskv mask is formed before the loop, and its bits work as follows: a bit equal to '1' means "put the result of the operation into this element of the destination register", while '0' means "keep the value of the register passed as the first (src) argument", in our case zv. The third intrinsic

_mm512_mask_storeu_ps (pDst + d, mska, z0)

stores the resulting 4 pixels to memory while writing nothing to the unmodified channel. The performance of this code is P = 0.183, k = 5.44X.

The reader can compare the SSSE3 code with the AVX512 code, which uses only three intrinsics per loop iteration. It is this brevity of the AVX512 code that inspired the author to write this article.



Summary table of performance gains (collected from the measurements above)

Code version        P (relative time)   Speedup vs. C
C                   1                   1X
SSSE3               0.4                 2.5X
AVX                 0.31                3.2X
AVX2                0.286               3.5X
AVX2 (mask store)   0.27                3.7X
AVX512              0.183               5.44X

Conclusion

Of course, the use of masks in this algorithm is fairly obvious, but they can also be used in other, more complex algorithms, and that will be the subject of another article.

After reading this article, the reader can try to use SIMD code in their own applications, and readers who already have such experience can appreciate the advantages of the cross-lane permutation instructions and masked instructions in the new Intel processors that support the AVX512 instruction set.

I would also like to note that this article presents only a small part of the optimization techniques used in the IPP library that allow it to achieve its unique performance.

The reader may also try to use the IPP library in their own application, freeing up time for solving their actual application problems. An additional benefit of using IPP is that by the time processors with the AVX512 instruction set reach the market, the IPP library will already provide versions of the most frequently used functions optimized for AVX512.

Source: https://habr.com/ru/post/245755/


