
The story of one processor instruction. Part 1. Differences between the lddqu and movdqu assembly instructions


Back in the not-so-distant year 2000, Intel brought the NetBurst microarchitecture to market in its Pentium 4 processors. In 2004, when processors based on the Prescott core appeared, the LDDQU instruction was introduced as part of the SSE3 instruction set.

It was intended for one specific application, namely video encoding. In more detail:

In video coding, the largest share of computation is usually spent in motion estimation (ME), which compares blocks of the current frame with blocks of the previous frame and looks for the best match. Several metrics can be used to find the best match; the most common is the L1 metric, the sum of absolute differences. ME works in such a way that loads of previous-frame blocks are unaligned, while loads of current-frame blocks are aligned. The unaligned loads incur two kinds of penalty:

First, the NetBurst microarchitecture has no micro-operation for loading 128 bits of unaligned data. For this reason, 128-bit unaligned load instructions such as movups and movdqu are emulated in microcode using two 64-bit loads whose results are merged into a 128-bit result. Second, in addition to the emulation cost, unaligned loads pay a cache-line split penalty whenever the access crosses a 64-byte boundary.
To solve the cache-line split problem for 128-bit unaligned loads, the lddqu instruction was added to the SSE3 instruction set. This instruction loads a 32-byte block aligned on a 16-byte boundary and extracts the 16 bytes that correspond to the unaligned access. Because the instruction loads more bytes than were requested, some usage restrictions apply. lddqu should not be used on uncached (UC) or write-combining (USWC) memory regions. In addition, because of the way lddqu is implemented, it should not be used where store-to-load forwarding is expected. Where the data is only loaded and no UC or USWC memory regions are involved, lddqu can successfully replace movdqu / movups / movupd.
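The two ideas in the paragraphs above (a 16-byte load that straddles a 64-byte cache line, and a 32-byte aligned window from which 16 bytes are extracted) can be illustrated with a small portable C model. This is only a sketch of the behaviour as described in the text, not of the actual hardware implementation; the names `crosses_cache_line`, `lddqu_model`, `buf` and `buf_len` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u  /* cache-line size quoted in the text */
#define VEC        16u  /* size of one 128-bit load in bytes */

/* Returns 1 if a 16-byte load starting at addr touches two cache lines. */
static int crosses_cache_line(uintptr_t addr)
{
    return (addr % CACHE_LINE) > (CACHE_LINE - VEC);
}

/* Software model (an assumption, not the real microarchitecture):
   read the 32-byte window that starts at the 16-byte-aligned position
   at or below `offset`, then extract the 16 requested bytes.
   `buf` is a hypothetical byte array standing in for memory. */
static void lddqu_model(const uint8_t *buf, size_t buf_len,
                        size_t offset, uint8_t out[VEC])
{
    size_t base = offset & ~(size_t)(VEC - 1);  /* round down to 16 */
    uint8_t window[2 * VEC];

    /* The model reads more than was requested, so the whole window
       must be readable; this mirrors the usage restrictions above. */
    assert(base + sizeof window <= buf_len);

    memcpy(window, buf + base, sizeof window);
    memcpy(out, window + (offset - base), VEC);
}
```

Note that the model always touches bytes outside the requested 16, which is why the text restricts the instruction to memory where extra reads have no side effects.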
The code below is an example of using the new instruction. The two code sequences are identical, except that the old unaligned load instruction (movdqu) is replaced with the new one (lddqu). Assuming that 25% of unaligned loads cross a cache line, the new instruction can improve the performance of ME by 30%. MPEG-4 encoders showed a speedup of more than 10%.
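Motion Estimator without SSE3:
movdqa xmm0, < >      ; aligned 128-bit load (current-frame block)
movdqu xmm1, < >      ; unaligned 128-bit load (previous-frame block)
psadbw xmm0, xmm1     ; sums of absolute byte differences
paddw xmm2, xmm0      ; accumulate the partial sums in xmm2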


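Motion Estimator with SSE3:
movdqa xmm0, < >      ; aligned 128-bit load (current-frame block)
lddqu xmm1, < >       ; unaligned load, replaces movdqu
psadbw xmm0, xmm1     ; sums of absolute byte differences
paddw xmm2, xmm0      ; accumulate the partial sums in xmm2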
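For reference, the quantity that the psadbw / paddw pair accumulates in both listings is the sum of absolute differences (the L1 metric) between a current-frame block and a previous-frame block. A scalar C sketch of that metric is shown here; the 16x16 block size, the `stride` parameter and the function names are assumptions made for the illustration and do not come from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences over n bytes (the L1 metric). */
static uint32_t sad_bytes(const uint8_t *cur, const uint8_t *prev, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)cur[i] - (int)prev[i];
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}

/* SAD of a hypothetical 16x16 block; `stride` is the number of bytes
   between the starts of consecutive rows in both frames. */
static uint32_t sad_block16(const uint8_t *cur, const uint8_t *prev,
                            size_t stride)
{
    assert(cur != NULL && prev != NULL);

    uint32_t total = 0;
    for (size_t row = 0; row < 16; row++) {
        /* One row corresponds to one 16-byte load pair in the listings. */
        total += sad_bytes(cur + row * stride, prev + row * stride, 16);
    }
    return total;
}
```

A motion search would evaluate `sad_block16` for many candidate positions in the previous frame and keep the position with the smallest result; those candidate positions are what make the previous-frame loads unaligned.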


More details are available at the link: download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf

And also, the most interesting discussions:


In summary, starting with the Intel Core 2 (that is, the Core microarchitecture, which appeared in mid-2006, with Merom processors and later) and in all subsequent models, the lddqu instruction does exactly the same thing as the movdqu instruction.
In other words, if the processor supports the Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction set, lddqu behaves the same as movdqu. If the processor does not support SSSE3 but does support SSE3, use lddqu (and do not forget the restrictions on the memory types involved).
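The rule of thumb above can be written down as a small decision function. The sketch below assumes the caller has already obtained the ECX value returned by CPUID leaf 1, where bit 0 reports SSE3 and bit 9 reports SSSE3; the function and type names are hypothetical. It covers only the feature-flag part of the rule and does not model the UC/USWC or store-to-load forwarding restrictions.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits in ECX of CPUID leaf 1. */
#define CPUID1_ECX_SSE3  (1u << 0)
#define CPUID1_ECX_SSSE3 (1u << 9)

typedef enum { LOAD_MOVDQU, LOAD_LDDQU } unaligned_load_t;

/* Picks the unaligned load instruction according to the rule in the
   text: lddqu only pays off on SSE3 processors without SSSE3. */
static unaligned_load_t pick_unaligned_load(uint32_t cpuid1_ecx)
{
    int has_sse3  = (cpuid1_ecx & CPUID1_ECX_SSE3)  != 0;
    int has_ssse3 = (cpuid1_ecx & CPUID1_ECX_SSSE3) != 0;

    if (has_sse3 && !has_ssse3)
        return LOAD_LDDQU;   /* SSE3 without SSSE3 */
    return LOAD_MOVDQU;      /* SSSE3 present (same behaviour), or no SSE3 */
}

/* Human-readable mnemonic for logging or code generation. */
static const char *load_name(unaligned_load_t kind)
{
    assert(kind == LOAD_MOVDQU || kind == LOAD_LDDQU);
    return kind == LOAD_LDDQU ? "lddqu" : "movdqu";
}
```

Taking the raw ECX value as a parameter keeps the function portable and easy to test; how that value is read (inline assembly, a compiler builtin, or an OS facility) is left to the caller.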

And finally, regarding patents: note the existence of patent number 6721866, which also describes some details of the implementation and use.

PS: For reference, here is a useful article that lists all Intel microarchitectures: en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
(as always, Wikipedia)

Source: https://habr.com/ru/post/141416/

