This is a free translation of my recent post on the English version of the Intel Software Network. So those who know me there as Victoria Zhislina (vikky13) and have already seen that post can read just the first and last paragraphs, which are missing in the original.
- Hello everyone, I need a translator from Russian into C++ program code. That is, I write down a task, and the translator implements its solution in C++. Where can I find one? If there is none for C++, maybe there is one for other languages?
- There is one: it is called the head of a development department. You write the task in Russian, hand it to your subordinates, and that's it, the code is ready! In C++, in Delphi, even in Java. I checked, it works!
They say this is not a joke but a real question from a programmers' forum. They also say that a human is much smarter than a machine and can therefore help it by sharing some intelligence. But there are many cases where doing so is definitely not worth it: the result will be the opposite of what was expected.
Here is a vivid example from the well-known open-source library OpenCV:
template<typename T, class Op> static void
cvtScale_( const Mat& srcmat, Mat& dstmat, double _scale, double _shift )
{
    Op op;
    typedef typename Op::type1 WT;
    typedef typename Op::rtype DT;
    Size size = getContinuousSize( srcmat, dstmat, srcmat.channels() );
    WT scale = saturate_cast<WT>(_scale), shift = saturate_cast<WT>(_shift);

    for( int y = 0; y < size.height; y++ )
    {
        const T* src = (const T*)(srcmat.data + srcmat.step*y);
        DT* dst = (DT*)(dstmat.data + dstmat.step*y);
        int x = 0;

        for( ; x <= size.width - 4; x += 4 )
        {
            DT t0, t1;
            t0 = op(src[x]*scale + shift);
            t1 = op(src[x+1]*scale + shift);
            dst[x] = t0; dst[x+1] = t1;
            t0 = op(src[x+2]*scale + shift);
            t1 = op(src[x+3]*scale + shift);
            dst[x+2] = t0; dst[x+3] = t1;
        }

        for( ; x < size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }
}
This is a simple template function that works with char, short, float and double.
Its authors decided to help the compiler with SSE vectorization by manually unrolling the inner loop by a factor of 4 and processing the remaining data "tail" separately.
Do you think modern compilers (on Windows) generate optimized code in line with the authors' intention?
Let's check by compiling this code with Intel Compiler 12.0 and the /QxSSE2 switch (I verified that the other SSEx and AVX options give the same result).
The result is quite unexpected. The assembly listing produced by the compiler irrefutably shows that the unrolled loop is NOT vectorized: the compiler generates SSE instructions, but only scalar ones, not vector ones. Meanwhile the remaining data, the "tail" of just 1 to 3 elements handled by the non-unrolled loop, is fully vectorized!
If we remove the unrolling:
for( int y = 0; y < size.height; y++ )
{
    const T* src = (const T*)(srcmat.data + srcmat.step*y);
    DT* dst = (DT*)(dstmat.data + dstmat.step*y);
    int x = 0;

    for( ; x < size.width; x++ )
        dst[x] = op(src[x]*scale + shift);
}
... and look at the assembly again (I will spare you the listing), we find that the loop is now fully vectorized for all data types, which certainly improves performance.
Conclusion:
More work, less performance; less work, more. If only it were always that simple.

Note that the Microsoft compilers from Visual Studio 2008 and 2010, with the /arch:SSE2 switch, do NOT vectorize the code above in either the unrolled or the rolled-up form; the code they produce is very similar in appearance and performance in both cases. So while loop unrolling is harmful for the Intel compiler here, for Microsoft's it is simply useless :).
And what if you still want to keep the unrolled loop (it is dear to you as a memory) but also want vectorization? Then use the Intel compiler pragmas as shown below:
#pragma simd
        for( x = 0; x <= size.width - 4; x += 4 )
        {
            DT t0, t1;
            t0 = op(src[x]*scale + shift);
            t1 = op(src[x+1]*scale + shift);
            dst[x] = t0; dst[x+1] = t1;
            t0 = op(src[x+2]*scale + shift);
            t1 = op(src[x+3]*scale + shift);
            dst[x+2] = t0; dst[x+3] = t1;
        }
#pragma novector
        for( ; x < size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }
And one last thing. By itself, loop unrolling can have a positive effect on performance. But, first, the potential gain from vectorization usually outweighs that effect, and, second, unrolling can be entrusted to the compiler, in which case vectorization does not suffer. Among other things, I plan to touch on this topic at the webinar on October 27.