
"The Simdsons" - the finale

Did you know that Homer Simpson, the head of the family in "The Simpsons", once really did official work for Intel, and quite successfully? Namely, he advertised the Pentium II processor. You can see how it looked here.
Well, under the cut is the conclusion of the previous post with 21 interesting facts about Intel SIMD.

  1. Although the first 128-bit SSE instructions appeared long ago, in the Pentium III (yes, such a processor existed back in the last century), in reality only 64 bits of data were loaded and processed per processor cycle. Only with the arrival of processors on the Intel Core microarchitecture did SSE become truly 128-bit per clock, which effectively doubled the throughput of SSE-SSE3 arithmetic instructions. Intel's marketers happily named this change "Advanced Digital Media Boost".
  2. MMX instructions, despite their respectable age, are still very much alive. Up to and including SSSE3, all SSE integer instruction sets have not only 128-bit but also 64-bit versions. However, using MMX and the 64-bit SSE forms is not recommended unless absolutely necessary: on modern CPUs this may slow the code down rather than speed it up. There are reports that old MMX code is precisely what inhibits Adobe Flash video playback on the Intel Atom.
    Moreover, 64-bit integer SIMD instructions use the MMX registers, which share their state with the x87 floating-point registers. Therefore, mixing 64-bit integer SIMD with x87 floating-point operations in code requires a special instruction (EMMS) to restore the register state and ensure functional correctness.
  3. SSSE3 is not a typo for SSE3; Supplemental Streaming SIMD Extensions 3 is a set of vector instructions for working with packed integers that first appeared in Intel Core and is also supported in Intel Atom. The set of 16 instructions contains very useful operations for shuffling bytes in packed data, working with the sign, and horizontal addition and subtraction.
    SSSE3 brings not so much extra performance (horizontal operations on xmm registers are not particularly fast at all) as convenience when combined with other vector instructions. A processor with SSSE3 support is a prerequisite for running Snow Leopard, Mac OS X 10.6.
  4. As you know, the performance of processor instructions is described by two parameters: latency and throughput. Latency is the instruction's execution time in clocks, and throughput is the number of instructions executed per unit of time. Moreover, throughput is not simply (1 / latency)! The point is that instructions are usually pipelined, so the next one can begin executing before the previous one completes. This is important information for estimating performance.
    Information on the latency and throughput of Intel SIMD (as well as other IA instructions) is in the Intel® 64 and IA-32 Architectures Optimization Reference Manual. But that document does not cover all instructions and processors. Therefore, those wishing to compare the latency and throughput of SSE instructions (unfortunately, integer only) across various Intel processors, including Atom, as well as AMD processors, can turn to this document created by an independent researcher.
  5. Among the SIMD instructions for floating point there are, not even hidden, SISD instructions: non-vector instructions that process a single data element. Their distinguishing feature is the letters "ss" at the end of the name. For example, rsqrtss computes the approximate value of the reciprocal square root, with the corresponding intrinsic __m128 r = _mm_rsqrt_ss(__m128 a):
    {r0 := recip(sqrt(a0)); r1 := a1; r2 := a2; r3 := a3}
    Interestingly, even despite the need to first load the data into an SSE register and then unload it, in many cases these instructions are preferable to the corresponding "normal" x87 floating-point ones. Such "vector instructions for non-vectors" can beat x87 in both latency and throughput.
  6. And more about improving performance. Denormalized (very small) floating-point numbers hurt the performance of SSE computations just as they do regular x87 instructions. Therefore, if not all elements of a SIMD register are used, it is better to zero the register before use, so as not to waste considerable time processing "garbage", which is quite likely to be a denormal. Another solution is to enable the special processor modes Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ). Those wishing to understand the details (in English) can look here.
  7. Even if the performance of your application suits you, but there is some chance it will be used on a mobile computer, you should still try using SIMD: it helps save power and extend the computer's battery life, which users will appreciate very much.
  8. Although SSE has no instructions for trigonometric, exponential, logarithmic and other mathematical functions, we can pretend that it does: the Intel Compiler includes the Short Vector Math Library (SVML), which implements the above-mentioned functions for single- and double-precision floating point in the form of standard-looking SSE intrinsics.
    Initially the library was developed for the internal needs of Intel's compiler team, to implement automatic vectorization of loops containing math functions, but it turned out to be so useful that it was made available externally. In the latest Intel compiler the use of SVML intrinsics is described in the Help, and for use with previous compiler versions, or for attempts to attach SVML to another compiler, you can look here: software.intel.com/en-us/articles/how-to-implement-the-short-vector-math-library
  9. From the very appearance of SSE, binary conditional branches in code have usually not been a serious obstacle to vectorization: a vector instruction checks the branch condition and produces a mask at its output; then the results of both the "true" and "false" branches are computed; finally, the condition mask is ANDed with the result of the "true" branch and combined by OR with (the inverted condition mask AND the result of the "false" branch). In SSE4.1 this task is simplified. The blend family of instructions lets you mix two vectors directly by a mask (in this case, the condition check), so all the ANDs and ORs are no longer needed, which undoubtedly simplifies and speeds up the code.
    Moreover, in the Larrabee vector instructions all arithmetic operations have masked versions: in effect, they can be executed (write the corresponding element to the destination register) only for the elements that satisfy the condition forming the mask. This is a further improvement in branch support.
    But the AVX instruction set, despite its support for three-operand instructions, has no such capability.
  10. As you know, Intel does not plan further development of SSE. In its place a new, 256-bit instruction set, AVX (Advanced Vector Extensions), should appear soon. All the details about AVX are on the corresponding Intel site. To this we can add that AVX support is also planned in AMD processors, and the first engineering samples of AVX-capable processors already exist inside Intel.


Source: https://habr.com/ru/post/94789/


All Articles