
About the history of memcpy implementations and their performance

void * memcpy (void * destination, const void * source, size_t num);
What could possibly be difficult here? And yet you can write a whole story about the implementations of this function.
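For reference, the semantics of the function fit into a few lines. The sketch below is a naive byte-by-byte version (my own illustration, nothing like what real libraries ship), just to pin down what memcpy is supposed to do:

#include <stddef.h>

/* Naive reference version: copy one byte at a time.
   Real library implementations are far more elaborate, as the rest
   of this post shows. */
void *naive_memcpy(void *destination, const void *source, size_t num)
{
    unsigned char *d = destination;
    const unsigned char *s = source;
    while (num--)
        *d++ = *s++;
    return destination;
}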

When I look at the window of my favorite desktop tool, the VTune XE profiler, I often see that it has once again found a lot of time spent copying memory. It usually reads something like: clockticks spent in libgcc/[g]libc/kernel memcpy - XX%.

Perhaps that is why memcpy has been rewritten so often; threads about it regularly appear on lkml, for example. (Probably only sorting has more implementations.) It would seem that, unlike sorting, which has many algorithms and variants, copying memory is simple. In fact, even if we are talking only about correctness, not performance, there is room for variation. (As confirmation, see the discussion of the epic bug involving Linus Torvalds and Ulrich Drepper.)
Back in the days of the 8086, that is, thirty-four years ago, the memcpy implementation contained the following code:
mov [E]SI, src
mov [E]DI, dst
mov [E]CX, len
rep movsb
(all checks, etc. are omitted hereafter for simplicity)
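The same idea can be sketched from C with GCC-style inline assembly (a sketch only, assuming an x86/x86-64 target and a GCC-compatible compiler; the function name is mine):

#include <stddef.h>

/* Classic "string move": the CPU copies len bytes by itself.
   [E/R]SI = source, [E/R]DI = destination, [E/R]CX = count. */
static void *rep_movsb_copy(void *dst, const void *src, size_t len)
{
    void *d = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (len)
                      :
                      : "memory");
    return dst;
}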

What has changed since then? Below: assembly code and a single picture.

Agner Fog's classic Optimizing Assembly contains a good (though not very detailed) explanation of most aspects of memcpy performance.

In the mid-90s, programmers discovered that, even though new instructions do not always speed up the Internet, the wide SIMD registers can be used to copy memory faster than REP MOVS does.

First MMX registers were used as intermediate storage, then XMM. Looking ahead, I will say that it never got as far as YMM.
movups XMM[0-3], [src] (x4, a full cache line)
movups [dst], XMM[0-3]
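In terms of compiler intrinsics the same scheme looks roughly like this (a sketch, assuming SSE2 is available and the length is a multiple of 64; the function name is mine):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy 64 bytes (a full cache line) per iteration through XMM registers.
   Sketch only: a real version also handles the unaligned head/tail and
   chooses between aligned and unaligned loads/stores. */
static void sse_copy64(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;
    for (size_t i = 0; i < len; i += 64) {
        __m128i x0 = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i x1 = _mm_loadu_si128((const __m128i *)(s + i + 16));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(s + i + 32));
        __m128i x3 = _mm_loadu_si128((const __m128i *)(s + i + 48));
        _mm_storeu_si128((__m128i *)(d + i),      x0);
        _mm_storeu_si128((__m128i *)(d + i + 16), x1);
        _mm_storeu_si128((__m128i *)(d + i + 32), x2);
        _mm_storeu_si128((__m128i *)(d + i + 48), x3);
    }
}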

Then various combinations of [un]aligned reads, [re]alignment, and [un]aligned writes were added; at best (SSE4.1) it came out something like this:
; XMM0 carries the last 16 bytes of source from the previous iteration
mov[nt]dqa XMM2, [src + i*2]
mov[nt]dqa XMM3, [src + i*2 + 1]
movdqa XMM1, XMM0
movdqa XMM0, XMM3
palignr XMM3, XMM2, shift
palignr XMM2, XMM1, shift
mov[nt]dqa [dst + i*2], XMM2
mov[nt]dqa [dst + i*2 + 1], XMM3
A minor complication is that the alignment instruction (PALIGNR) exists only with an immediate last operand (shift); because of this the code balloons nicely (see glibc). By the way, starting with the Nehalem architecture, the unaligned memory access that the code above is fighting is no longer the main source of slowdowns, although it is still not free.
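The intrinsic form makes the immediate-operand problem obvious: _mm_alignr_epi8 (SSSE3 and later) only accepts a compile-time constant shift, so code that handles an arbitrary misalignment has to be instantiated once per shift value. A minimal sketch of one such step, for a fixed shift of 4 bytes:

#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* prev and cur are two consecutive aligned 16-byte chunks of the source.
   The result is the 16 output bytes that start 4 bytes into prev.
   Because the shift must be an immediate, a full memcpy needs a separate
   copy of the loop for every misalignment from 1 to 15 - hence the bloat
   mentioned above. */
static inline __m128i combine_shift4(__m128i prev, __m128i cur)
{
    return _mm_alignr_epi8(cur, prev, 4);
}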

Thus, several memcpy implementations appeared, each of which ran faster on some processors and slower on others. After a while, several variants, plus code that chooses which one to call, made their way into glibc (a sketch of such dispatch is given below). The CLR and JVM runtimes can probably also pick an efficient implementation of System.arraycopy on the fly. Unlike glibc, you cannot just shove such SSE code into the kernel: things are even more interesting there, since the SIMD registers have to be saved, and that is not exactly a quick operation, whether in Linux or in Windows.
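The dispatch itself can be sketched like this (a sketch only: the variant names are mine, their bodies just forward to the library memcpy so the file compiles, and __builtin_cpu_supports is a GCC/Clang extension; glibc does the real thing through its ifunc mechanism):

#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Stand-ins for real variants (e.g. the REP MOVS and SSE sketches above). */
static void *memcpy_rep_movs(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *memcpy_ssse3(void *d, const void *s, size_t n)    { return memcpy(d, s, n); }

/* Pick an implementation once, based on what the CPU reports. */
static memcpy_fn resolve_memcpy(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? memcpy_ssse3 : memcpy_rep_movs;
}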

And then recently, all of a sudden, everything came full circle. (Maybe so that nobody is forced to rewrite memcpy yet again for AVX?) On the latest processors the classic memcpy implementation is once again the fastest. So if someone has been asleep for 34 years, it is time to pull out the old code and look victoriously at the colleagues who rewrote memcpy in turn for MMX, SSE2, SSE3, and SSE4.1.

By the way, to make measuring copy performance even more interesting (especially in the context of real software), it is also affected by non-temporal reads and writes, memory bandwidth limits, effects of the shared last-level cache, and DTLB misses.

Conclusions:
1. Writing yet another implementation is now pointless; std::memcpy implemented via rep movs will remain efficient.
2. For the history and a performance review on older platforms, see this article and Agner Fog.
3. On Atom, on other x86 platforms, and on older (pre-Nehalem) processors, rep movs is still slow.

Upd:
As repeatedly requested in the comments, I ran a simple microbenchmark: a couple of kilobytes copied with memcpy in a loop, over all alignment combinations (a rough sketch of such a loop is given below, after the numbers).
The number shows how many times faster the most advanced SSE4.1 code is than std::memcpy implemented via rep movs:
Bulldozer - 1.22x (thanks to stepmex for the data)
Penryn - 1.6x
Nehalem - 1.5x
Sandy Bridge - 1.008x
This benchmark is not particularly accurate; in real software many other factors come into play, which I briefly listed above.
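Roughly, the loop looked like the sketch below (my reconstruction, not the original code; it times repeated 2 KB copies for every combination of source and destination misalignment, and the SSE variant is substituted for memcpy to compare):

#include <stdio.h>
#include <string.h>
#include <time.h>   /* clock_gettime, POSIX/Linux */

enum { SIZE = 2048, REPS = 1000000 };

unsigned char src_buf[SIZE + 64], dst_buf[SIZE + 64];

int main(void)
{
    for (int src_off = 0; src_off < 16; ++src_off) {
        for (int dst_off = 0; dst_off < 16; ++dst_off) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int r = 0; r < REPS; ++r)
                memcpy(dst_buf + dst_off, src_buf + src_off, SIZE);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("src+%2d dst+%2d: %.1f ns per copy\n",
                   src_off, dst_off, ns / REPS);
        }
    }
    return 0;
}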

Source: https://habr.com/ru/post/133962/

