
Extended processor instructions in .NET, or "C# Intrinsics"

In chess programs, bitboards are widely used to represent the pieces on the board; the same goes for other games played on the 8×8 board, and even for some card games. Typical bitboard operations include finding the first set bit and counting the number of set bits. Many "tricky" algorithms have been invented for these operations, and on modern processors some of them are available as instructions from extended instruction sets: for example, popcnt, part of SSE4.2 (ABM on AMD). There is also the bsf/bsr pair, which has been available for ages, but the JIT compiler gives us no access to any of these.
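For reference, the software fallbacks for these operations are built from well-known bit tricks. Here is a minimal sketch of the textbook versions (the library's actual fallbacks may differ):

    static class BitTricks
    {
        // SWAR popcount: count bits in parallel using magic masks
        public static int PopCount(ulong x)
        {
            x = x - ((x >> 1) & 0x5555555555555555UL);
            x = (x & 0x3333333333333333UL) + ((x >> 2) & 0x3333333333333333UL);
            x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FUL;
            return (int)((x * 0x0101010101010101UL) >> 56);
        }

        // Kernighan's loop: one iteration per set bit, quick on sparse boards
        public static int PopCountSparse(ulong x)
        {
            int count = 0;
            while (x != 0) { x &= x - 1; count++; }
            return count;
        }

        // naive BitScanForward: index of the lowest set bit (x must be non-zero)
        public static int BitScanForward(ulong x)
        {
            int index = 0;
            while ((x & 1) == 0) { x >>= 1; index++; }
            return index;
        }
    }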
Of course, all serious chess programs are written in C++, but for prototyping algorithms I prefer C#: I know it better and have fewer chances to shoot myself in the foot. Still, I don't want to simply give up performance: in C/C++ the instructions of interest are available as so-called built-in functions (intrinsics). I tried to build something similar for C#.

There are ready-made libraries for SIMD instructions, but not for bit manipulation.

What to do? We must somehow patch the existing code.
Let us settle on the assumption that we are running on a 64-bit processor (working with bitboards on a 32-bit one is no fun), and that this processor is x86_64. For incompatible architectures we need fallbacks (pure C# implementations). Under these conditions bsf/bsr are always available, and popcnt is missing only on rather ancient hardware, but more on that later.
The easiest option is to write a DLL in C++ and export the functions we need from it. But each of these "functions" would consist of a single instruction, surrounded by the extra jumps that P/Invoke introduces.
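For illustration, the P/Invoke route might look like this (intrinsics.dll and its export are hypothetical names):

    using System.Runtime.InteropServices;

    static class NativeBits
    {
        // hypothetical native DLL exporting a one-instruction wrapper around bsr;
        // every call still pays the full P/Invoke transition cost
        [DllImport("intrinsics.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern ulong BitScanReverse(ulong value);
    }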
The second option is GetDelegateForFunctionPointer: similar to the first, but it does not require a separate native DLL.
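A sketch of this second option, assuming Windows and x86_64 (the delegate type and helper names are mine):

    using System;
    using System.Runtime.InteropServices;

    static class DelegateIntrinsics
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                          uint flAllocationType, uint flProtect);

        const uint MemCommitReserve = 0x3000;    // MEM_COMMIT | MEM_RESERVE
        const uint PageExecuteReadWrite = 0x40;  // PAGE_EXECUTE_READWRITE

        public delegate ulong BsrDelegate(ulong value);

        public static BsrDelegate CreateBsr()
        {
            // 48 0F BD C1 = bsr rax, rcx; C3 = ret (x86_64 calling convention)
            byte[] code = { 0x48, 0x0F, 0xBD, 0xC1, 0xC3 };
            IntPtr buf = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)code.Length,
                                      MemCommitReserve, PageExecuteReadWrite);
            Marshal.Copy(code, 0, buf, code.Length);
            return (BsrDelegate)Marshal.GetDelegateForFunctionPointer(buf, typeof(BsrDelegate));
        }
    }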
The third is to replace an already compiled method; many interesting articles have been written about this. But then we run into the fact that it is (a) difficult and (b) dependent on runtime internals, and therefore on the runtime version. The result is a cumbersome and fragile solution, and that is not our way.
That leaves the last option: patching the JIT-compiled code in place.

Implementation details


First we need to make sure the method is compiled: RuntimeHelpers.PrepareMethod(). To get at the machine code, Microsoft kindly provides the RuntimeMethodHandle.GetFunctionPointer() method. However, depending on the runtime, the returned pointer may lead either to the method itself or to a stub, and fortunately the stub comes in only one flavor: a JMP. We take advantage of this:
    static byte* GetMethodPtr(MethodInfo method)
    {
        RuntimeHelpers.PrepareMethod(method.MethodHandle);
        var ptr = (byte*)method.MethodHandle.GetFunctionPointer();
        // skip all jumps
        while (*ptr == 0xE9)
        {
            var delta = *(int*)(ptr + 1) + 5;
            ptr += delta;
        }
        return ptr;
    }

So we have the address; let us try to write something there. The calling convention must be taken into account; fortunately, on x86_64 (Windows) there is only one (the first four arguments in RCX, RDX, R8, R9, the result in RAX):
    // bsr rax, rcx; ret
    *ptr++ = 0x48;
    *ptr++ = 0x0F;
    *ptr++ = 0xBD;
    *ptr++ = 0xC1;
    *ptr++ = 0xC3;

We try it... Hooray, it works!
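Condensed into a runnable sketch (the fallback body and names are mine; this assumes the JIT's code page is writable, as it was on the desktop CLR of the time, and uses GetMethodPtr from the snippet above; the NoInlining attribute is explained just below):

    using System;
    using System.Reflection;
    using System.Runtime.CompilerServices;

    unsafe class Demo
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        static ulong BitScanReverse(ulong value)
        {
            // managed fallback; its body gets overwritten with "bsr rax, rcx; ret"
            ulong index = 0;
            while ((value >>= 1) != 0) index++;
            return index;
        }

        static void Main()
        {
            var method = typeof(Demo).GetMethod("BitScanReverse",
                BindingFlags.Static | BindingFlags.NonPublic);
            byte* ptr = GetMethodPtr(method);
            // overwrite the method body: bsr rax, rcx; ret
            *ptr++ = 0x48; *ptr++ = 0x0F; *ptr++ = 0xBD; *ptr++ = 0xC1; *ptr = 0xC3;
            Console.WriteLine(BitScanReverse(1UL << 42)); // prints 42
        }
    }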
But so far our version is no faster than PInvoke/GetDelegateForFunctionPointer: there is still an extra call/jmp on the caller's side. Moreover, if the compiler has inlined our fallback function, patching its body will not help at all.
Inlining is easy to suppress with the [MethodImpl(MethodImplOptions.NoInlining)] attribute.
Also, as with C++ intrinsics, we must get rid of the call/jmp entirely. This, too, is quite manageable, although we will have to get our hands a little dirty with assembler: we will patch the call site on the fly. To do this, we replace our fallback code not simply with the desired instructions, but with a call to a "patcher" that receives the necessary parameters: at what address, and what to write there.
Let's start by determining the address: the return address lies on the stack (at qword ptr [rsp]), and the call to our method usually sits immediately before it. If it is a call instruction (and it usually is), we examine the preceding 5 bytes, and if they really were a call to the desired method, we replace them wholesale with the opcode of our instruction. The patcher method looks like this:
    static void PatchCallSite(byte* rsp, ulong code)
    {
        var ptr = rsp - 5;  // a near call is 5 bytes: E8 + rel32
        // check call opcode
        if (*ptr != 0xE8)
        {
            // throw ERROR
        }
        // replace the lower 5 bytes (the whole call) with the new code
        const ulong Mask5Bytes = (1uL << 40) - 1;
        *(ulong*)ptr = *(ulong*)ptr & ~Mask5Bytes | (code & Mask5Bytes);
    }

We call it manually from assembler, so our fallback method ends up replaced with something like this:
    bsr rax, rcx                  ; the replacement instruction, padded with nop to 5 bytes
    nop
    push rax                      ; save the result of the replacement instruction
    sub rsp, 0x20                 ; shadow space for register spilling
    mov rcx, QWORD PTR [rsp+0x28] ; first argument: the return address;
                                  ; we pushed rax and reserved 0x20 bytes,
                                  ; so it lies at 0x20 + 8
    mov rdx, QWORD PTR [rip-0x16] ; second argument: the replacement code itself,
                                  ; read back from the start of this patch
    movabs rax, <PatchCallSite>
    call rax
    add rsp, 0x20
    pop rax                       ; restore the result
    ret

All of this is done by the Patcher.PatchMethod function, which we call for every method that needs patching.
Here is how it looks in a live example (from PerfTest). Before patching:

[disassembly screenshot]

and after the first pass:

[disassembly screenshot]

A few loose ends remain: the JIT may inline something very aggressively, or perform tail-call optimization, in which case the return address will not actually correspond to a call of our function. Unfortunately, I could not produce examples of such aggressive optimizations, but just in case this should be checked too. Knowing the place the call was made from, we can verify that it really targets our method. Here, too, we need to hop over all the jmps and check that they lead to the right place. And because we placed the replacement instructions, unchanged, at the very beginning of the patch, this is very easy to do:
    // check call target
    var calltarget = ptr + *(int*)(ptr + 1) + 5;
    while (*calltarget == 0xE9)
    {
        var delta = *(int*)(calltarget + 1) + 5;
        calltarget += delta;
    }
    code &= Mask5Bytes;
    if ((*(ulong*)calltarget & Mask5Bytes) != code)
    {
        // throw ERROR
    }

We also don't want to touch anything if we find ourselves on an unsupported architecture. First, check that the process is 64-bit at all (IntPtr.Size == 8). Next, make sure it is really x86_64; I found nothing better than comparing the prologue of some function against known-correct instructions (see Patcher.VerifyPrologue in the code).
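On modern runtimes these first two checks can be expressed directly (RuntimeInformation appeared later than this article, which is why the prologue trick was needed):

    using System;
    using System.Runtime.InteropServices;

    static bool IsSupportedPlatform()
    {
        if (IntPtr.Size != 8)
            return false;  // not a 64-bit process
        // newer runtimes can simply ask for the architecture
        return RuntimeInformation.ProcessArchitecture == Architecture.X64;
    }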

Some instructions, such as popcnt, are not available on every processor. If speed matters to you, it is better not to use such processors at all, but the library should still provide a fallback for them. The Intel manual says that the presence of popcnt can be checked by calling cpuid with EAX = 01H and testing ECX.POPCNT [bit 23]. This is where stackoverflow-driven development comes to the rescue: a ready-made piece of code for calling cpuid from C# is easy to find.
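With such a cpuid helper in hand (named Cpuid here for illustration, returning the four raw registers), the check boils down to one bit test:

    // assumes a Cpuid(level) helper like the one from Stack Overflow,
    // returning { eax, ebx, ecx, edx }
    static bool HasPopCnt()
    {
        int[] regs = Cpuid(1);             // EAX = 01H: processor feature flags
        return (regs[2] & (1 << 23)) != 0; // ECX.POPCNT [bit 23]
    }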

It remains only to wrap it all up nicely: we make an attribute, [ReplaceWith("F3 48 0F B8 C1", Cpiud1ECX = 1 << 23)], which we attach to the fallback functions. It carries the code that should replace the given intrinsic, along with the availability condition for the required instruction.
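Put together, a fallback in the library ends up looking roughly like this (the attribute is quoted from above; the method body is an illustrative software fallback of my own):

    [MethodImpl(MethodImplOptions.NoInlining)]
    [ReplaceWith("F3 48 0F B8 C1", Cpiud1ECX = 1 << 23)]  // popcnt rax, rcx
    static ulong PopCnt64(ulong value)
    {
        // software fallback, used when cpuid(1) says popcnt is unavailable
        ulong count = 0;
        while (value != 0) { value &= value - 1; count++; }
        return count;
    }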

Benchmarks


The benchmarks are fairly primitive: we generate an array of pseudo-random numbers, then run each method over it 100 million times. To eliminate the influence of the loop and the surrounding scaffolding, we first measure the time of an empty loop. The special case of "sparse" boards was tested separately: quite often a board holds no more than 8 pieces or marked squares, and on such boards the chosen fallback should run faster.
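The measurement loop has roughly this shape (a sketch with names of my own; note the direct call, so that the patched call site is what actually gets measured):

    static void TestPopCnt64(ulong[] data, int passes)
    {
        ulong sum = 0;
        var sw = System.Diagnostics.Stopwatch.StartNew();
        for (int pass = 0; pass < passes; pass++)
            for (int i = 0; i < data.Length; i++)
                sum += Intrinsic.PopCnt64(data[i]); // direct call: this call site gets patched
        sw.Stop();
        // the sum is printed so the JIT cannot eliminate the work
        Console.WriteLine("{0}ms, sum = {1}", sw.ElapsedMilliseconds, sum);
    }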
I tested on the two processors I had at hand: a laptop i7 and the good old Q6600. Here are the detailed results.
i7-3667U
 Testing Rng ...
 78ms, sum = 369736839

 Testing BitScanForward ...
 123ms, sum = 99768073

 Testing FallBack.BitScanForward ...
 239ms, sum = 99768073

 Testing PopCnt64 ...
 106ms, sum = -1092098543

 Testing FallBack.PopCnt64 ...
 3143ms, sum = -1092098543

 Testing PopCnt32 ...
 106ms, sum = 1598980778

 Testing FallBack.PopCnt32 ...
 2139ms, sum = 1598980778

 Testing PopCnt64 on sparse data ...
 106ms, sum = 758117720

 Testing FallBack.PopCnt64 on sparse data ...
 952ms, sum = 758117720

Q6600
 Patch status: Ok
 Feature PopCnt is not supported by this CPU
 Feature PopCnt is not supported by this CPU

 Testing Rng ...
 92ms, sum = 369736839

 Testing BitScanForward ...
 154ms, sum = 99768073

 Testing FallBack.BitScanForward ...
 323ms, sum = 99768073

 Testing PopCnt64 ...
 3832ms, sum = -1092098543

 Testing FallBack.PopCnt64 ...
 3583ms, sum = -1092098543

 Testing PopCnt32 ...
 2414ms, sum = 1598980778

 Testing FallBack.PopCnt32 ...
 3249ms, sum = 1598980778

 Testing PopCnt64 on sparse data ...
 1662ms, sum = 758117720

 Testing FallBack.PopCnt64 on sparse data ...
 1076ms, sum = 758117720


Summary table of results; the time of the "idle" loop has already been subtracted:
                              i7-3667U               Q6600
    bsf                       45ms                   62ms
    bsf fallback              161ms (~3.6x slower)   231ms (~3.7x slower)
    popcnt64                  28ms                   N/A
    popcnt64 fallback         3065ms (~100x slower)  3491ms
    popcnt32                  28ms                   N/A
    popcnt32 fallback         2061ms (~70x slower)   3157ms
    sparse popcnt64           28ms                   N/A
    sparse popcnt64 fallback  874ms (~30x slower)    984ms

As we can see, our fallback for the bsf/bsr instructions is already quite fast, so the gain from the intrinsic is small, and likely imperceptible next to the other parts of the algorithm. For popcnt the picture is much better: on a Core i5/i7 it speeds up by tens of times.
Applied to my complete algorithm, I got a speedup of only about 15%, because plenty of code remains that does more than just count bits. As they say, YMMV; measure on your own code.

Where to get it


The code is available on GitHub.
NuGet: Install-Package NetIntrinsics

Source: https://habr.com/ru/post/239619/

