Coprocessor programming in C#? Yes!

Probably everyone knows that the FPU coprocessor exists. To find out how to write code for it, read on. The FPU (floating point unit) is the part of the CPU designed specifically to work with floating-point data, in other words with the float and double types. It became part of the processor itself with the Intel 486DX (thanks for the correction), and that was a long time ago. Since then it has been the unit that evaluates mathematical expressions, or rather their assembly-language form. In other words, the compiler turns program code not only into ordinary instructions such as mov and sub, but also into fld, fstp, fsub, fadd and so on when the calculation involves doubles. As you can see, FPU instructions start with the prefix "f", which makes code intended for it easy to spot. Information on the FPU is easy to find on the Internet by searching for its name; I also recommend the "Processors" section of wasm.ru. The coprocessor is a very interesting thing, and programming it is interesting, I would even say exciting. I don't know about you, but I was delighted when I managed to "conjure" some code, giving commands directly to the processor without intermediaries such as compilers, the CLR and so on. Why "conjure"? More on this later.
I borrowed the term "conjure" from the author of the wonderful "Spell Code" series of articles on the site, which I recommend you read after this article.
Now I will show you how to write a simple code spell for the FPU. I must warn you right away: although C# comes into play at the end, the spell itself requires C++.
Suppose we need to calculate the following expression: result = arg1 - arg2 + arg3.
There are several ways to write this code. To keep things easy to follow, I will show one first and another a little later.
So, the first option looks like this:

fld [arg1]
fld [arg2]
fsubp
fld [arg3]
faddp
fstp [result]
ret

Now I will explain. In the square brackets we must put the addresses of the variables arg1, arg2, arg3 and result.
The fld instruction pushes onto the top of the stack (the FPU works with a stack that has some peculiarities of its own) the value of the double variable whose address follows the instruction. fsubp subtracts the value on top of the stack from the value one position below it, writes the result in place of the value being subtracted from, and frees the top of the stack, so the result ends up on top. faddp works like fsubp, but adds the values instead of subtracting. fstp pops the double value off the top of the stack and stores it in the cell at the given address. Finally, the ret instruction is intuitive: it ends the function and returns control to the function that called it. To make it clearer, I'll show the work of our code in pictures:
[Figure: the code at work]
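The same steps can also be replayed in C++ on a toy model of the FPU stack. This is only a sketch under loud assumptions: FpuStackModel and run_model are names invented for this illustration, the model keeps its values in a std::vector rather than in the eight real FPU registers, and it ignores everything the real unit does about extended precision and exceptions.

```cpp
#include <cassert>
#include <vector>

// Toy model of the x87 register stack: st.back() plays the role of ST(0).
// Hypothetical helper for illustration only; not how the hardware is built.
struct FpuStackModel {
    std::vector<double> st;

    // fld: push a value onto the top of the stack.
    void fld(double value) { st.push_back(value); }

    // fsubp: ST(1) = ST(1) - ST(0), then pop ST(0).
    void fsubp() {
        assert(st.size() >= 2);
        double top = st.back();
        st.pop_back();
        st.back() -= top;
    }

    // faddp: ST(1) = ST(1) + ST(0), then pop ST(0).
    void faddp() {
        assert(st.size() >= 2);
        double top = st.back();
        st.pop_back();
        st.back() += top;
    }

    // fstp: take the value from the top of the stack and pop it.
    double fstp() {
        assert(!st.empty());
        double top = st.back();
        st.pop_back();
        return top;
    }
};

// Replays the listing: result = arg1 - arg2 + arg3.
double run_model(double arg1, double arg2, double arg3) {
    FpuStackModel fpu;
    fpu.fld(arg1);   // stack: arg1
    fpu.fld(arg2);   // stack: arg1, arg2
    fpu.fsubp();     // stack: arg1 - arg2
    fpu.fld(arg3);   // stack: arg1 - arg2, arg3
    fpu.faddp();     // stack: arg1 - arg2 + arg3
    return fpu.fstp();
}
```

Calling run_model(1.0, -2.0, 2.0) walks through 1 - (-2) + 2 and returns 5.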

The result is written to a memory cell, from which we can pick it up. I hope the instructions are clear. Now let's see how to create such code from a C++ program.

#include <windows.h>

// Note: 32-bit x86 only, because the generated code embeds 32-bit absolute addresses.
double ExecuteMagic(double arg1, double arg2, double arg3)
{
    short* code;
    short* code_cursor;
    short* code_end;
    double* data;
    double* data_cursor;

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD region_size = si.dwAllocationGranularity;

    // One region: the first half is for the code, the second half for the data.
    code = (short*)VirtualAlloc(NULL, region_size * 2, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (code == NULL)
        return 0.0; // allocation failed
    code_cursor = code;
    code_end = (short*)((char*)code + region_size);
    data = (double*)code_end;
    data_cursor = data;

    *data_cursor = arg1;
    *code_cursor++ = (short)0x05DDu; // fld
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg1
    code_cursor = (short*)((char*)code_cursor + sizeof(int)); // skip the 4-byte address

    *data_cursor = arg2;
    *code_cursor++ = (short)0x05DDu; // fld
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg2
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0xE9DEu; // fsubp

    *data_cursor = arg3;
    *code_cursor++ = (short)0x05DDu; // fld
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg3
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0xC1DEu; // faddp

    double* result = data_cursor;
    *code_cursor++ = (short)0x1DDDu; // fstp
    *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of result
    code_cursor = (short*)((char*)code_cursor + sizeof(int));

    *code_cursor++ = (short)0x90C3u; // ret, followed by a nop as padding

    void (*function)() = (void (*)())code;
    function(); // e.g. 1 - (-2) + 2 = 5

    double value = *result;
    VirtualFree(code, 0, MEM_RELEASE); // release the region once the result is read
    return value;
}

Now let's look at the tastiest part. We use the VirtualAlloc function to allocate memory for our code; the size is twice the value of SYSTEM_INFO.dwAllocationGranularity, the granularity with which the system allocates virtual memory. Note the arguments the function takes, in particular PAGE_EXECUTE_READWRITE: this flag makes the newly allocated region accessible not only for reading and writing but also for code execution, i.e. we can transfer control to this region and the processor will fetch its next instructions from there.
We give the first half of the allocated region to the code and the second half to the data, a kind of code segment and data segment. All that remains is to fill these segments with the necessary values. To fill the code part, we simply write opcodes (processor instructions) into it in hexadecimal. Let's go through everything in order.
The FLD instruction has the opcode DD /0. By the way, the opcode values and their mnemonics are listed in the processor architecture documentation. FSTP also has the opcode DD, but with /3. The number after the slash is an opcode extension stored in the ModR/M byte. Here is a table of ModR/M byte values [http://www.sandpile.org/ia32/opc_rm32.htm] (inquiring minds can figure all of this out if they are interested, believe me). The same byte also encodes the kind of operand, because FLD and FSTP can work with different operands, i.e. memory cells or processor registers. We need the operand to be the address of a double, so in that table we look up the value for [sdword]. For FLD this value is 05h, for FSTP it is 1Dh. Appending these values to the opcode gives FLD = DD05h and FSTP = DD1Dh. The FSUBP instruction has the opcode DE /5; again we consult the table, look up the value for ST(1) (the FPU stack element just below the top) and see that it is E9h, i.e. FSUBP = DEE9h. FADDP, like FSUBP, has the opcode DE, but with /0, which gives C1h for ST(1), i.e. FADDP = DEC1h. The RET instruction has the one-byte opcode C3h; the code above writes it as C390h, where 90h is a NOP that only pads it to two bytes.
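The table lookups can be double-checked by assembling the ModR/M byte from its three fields. A minimal sketch, assuming 32-bit addressing; modrm and the constant names are invented for this illustration and do not come from any library.

```cpp
#include <cstdint>

// ModR/M layout: mod (2 bits) | reg or opcode extension (3 bits) | r/m (3 bits).
// Hypothetical helper, named here only for illustration.
constexpr std::uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
    return static_cast<std::uint8_t>(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

// mod = 00, r/m = 101: a bare 32-bit address follows the ModR/M byte (32-bit mode).
constexpr std::uint8_t FLD_M64  = modrm(0, 0, 5);   // DD /0
constexpr std::uint8_t FSTP_M64 = modrm(0, 3, 5);   // DD /3

// mod = 11: r/m selects a register of the FPU stack, here ST(1).
constexpr std::uint8_t FADDP_ST1 = modrm(3, 0, 1);  // DE /0
constexpr std::uint8_t FSUBP_ST1 = modrm(3, 5, 1);  // DE /5

// The values agree with the ones read from the table in the text.
static_assert(FLD_M64 == 0x05 && FSTP_M64 == 0x1D, "memory operand forms");
static_assert(FADDP_ST1 == 0xC1 && FSUBP_ST1 == 0xE9, "register operand forms");
```

The checks run at compile time, so a wrong field value shows up as a build error rather than as a crash inside generated code.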
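Note that x86 is little-endian: when a multi-byte value is written to memory, its least significant byte goes first. The processor itself reads instruction bytes in order of increasing address, so when a two-byte instruction is written as a single 16-bit value, its bytes must be swapped in the constant: FLD = DD05h must be written not as 0xDD05 but as 0x05DD. This is important!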
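The byte swap is easy to get wrong, so here is a small sketch of what happens to a 16-bit constant. store_u16_le and pack_opcode are hypothetical helpers written for this illustration; the first spells out the layout x86 uses for a 16-bit store, the second builds the constant from the bytes in the order the processor will read them.

```cpp
#include <cassert>
#include <cstdint>

// Writes a 16-bit value the way x86 stores it: low byte at the lower address.
void store_u16_le(std::uint8_t* dst, std::uint16_t value) {
    dst[0] = static_cast<std::uint8_t>(value & 0xFF);
    dst[1] = static_cast<std::uint8_t>(value >> 8);
}

// Packs two instruction bytes, given in execution order, into the 16-bit
// constant that lays them out in that order when stored on x86.
constexpr std::uint16_t pack_opcode(std::uint8_t first, std::uint8_t second) {
    return static_cast<std::uint16_t>(first | (second << 8));
}
```

With these, pack_opcode(0xDD, 0x05) gives 0x05DD, and storing 0x05DD leaves the bytes DD 05 in memory, which is exactly the FLD encoding.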
Well, that is basically all there is to the opcodes. The C++ code above shows how to fill the array with instructions. We first write the instruction and then, if it needs one, the address of the cell. Note that on 32-bit systems an address is 4 bytes (32 bits) long, so after writing an address into the code array the pointer must be advanced by 4 bytes rather than by 2 as for an instruction. For the same reason this code works only in a 32-bit process.
The culmination of this miracle is executing the stored code. How do we execute code from our array? A function pointer comes to the rescue, and here the C++ language helps out. We create a pointer to a function that takes no parameters and returns void, then assign it the address of the beginning of the code array. That's all! We call through the function pointer and get the result of a program that was built right in memory; the processor did exactly what we told it to in our code array.
Let me remind you that this is only the first way of passing parameters and returning the result. The second way is to create a pointer to a function of type double (void), so that instead of the result being written to memory for us to fetch, the dynamically generated function returns it directly. To do this, simply change the code to:
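The "instruction first, then the address" rule can also be written as a byte-wise emitter, which avoids the swapped 16-bit constants altogether. This is a sketch, not the code from the article: the Emitter class and build_program are invented for this illustration, the addresses are plain 32-bit numbers, and nothing here allocates executable memory or runs the result.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical byte-wise emitter: appends bytes in the order the CPU reads them.
class Emitter {
public:
    void byte(std::uint8_t b) { bytes_.push_back(b); }

    // 32-bit operand (an absolute address here), least significant byte first.
    void u32(std::uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8)
            byte(static_cast<std::uint8_t>(v >> shift));
    }

    void fld_m64(std::uint32_t addr)  { byte(0xDD); byte(0x05); u32(addr); }
    void fstp_m64(std::uint32_t addr) { byte(0xDD); byte(0x1D); u32(addr); }
    void fsubp() { byte(0xDE); byte(0xE9); }
    void faddp() { byte(0xDE); byte(0xC1); }
    void ret()   { byte(0xC3); }

    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
};

// Machine code for result = arg1 - arg2 + arg3; each parameter is the
// 32-bit address of a double (meaningful in a 32-bit x86 process only).
std::vector<std::uint8_t> build_program(std::uint32_t arg1, std::uint32_t arg2,
                                        std::uint32_t arg3, std::uint32_t result) {
    Emitter e;
    e.fld_m64(arg1);
    e.fld_m64(arg2);
    e.fsubp();
    e.fld_m64(arg3);
    e.faddp();
    e.fstp_m64(result);
    e.ret();
    return e.bytes();
}
```

Each memory instruction takes 6 bytes (2 for the instruction, 4 for the address) and each stack-only instruction takes 2, so the whole program comes to 29 bytes including the single-byte ret.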

fld [arg1]
fld [arg2]
fsubp
fld [arg3]
faddp
// fstp [result]
ret

That is, we simply leave the result on the top of the stack, and the call through our function pointer returns it from there. This works because the usual calling conventions on 32-bit x86 return floating-point values in ST(0). It's simple.

By the middle of the article the reader is asking: "And what about C#??? It's all C++ and assembler, incomprehensible numbers...". Fair enough, but be patient :).

So, we all know that from C# we can call functions written in C++, Delphi and so on.
This can be done with the extern keyword and the [DllImport("*.dll")] attribute.
There is also a simpler option. The .NET platform developers managed to make managed and unmanaged code get along. So we simply create a new class in C++ (C++/CLI) that implements the code generation, the code spell, using the technique described above. Then we reference this library from a project that uses managed C# code and use it freely. That's exactly what I did. How glad I was when the result was not long in coming! :)

Here is what I did:

#include <windows.h>
#pragma once

using namespace System;

namespace smallcodelib
{
    public ref class CodeMagics
    {
    public:
        // Note: 32-bit x86 only, because the generated code embeds 32-bit absolute addresses.
        static double ExecuteMagic(double arg1, double arg2, double arg3)
        {
            short* code;
            short* code_cursor;
            short* code_end;
            double* data;
            double* data_cursor;

            SYSTEM_INFO si;
            GetSystemInfo(&si);
            DWORD region_size = si.dwAllocationGranularity;

            // One region: the first half is for the code, the second half for the data.
            code = (short*)VirtualAlloc(NULL, region_size * 2, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (code == NULL)
                return 0.0; // allocation failed
            code_cursor = code;
            code_end = (short*)((char*)code + region_size);
            data = (double*)code_end;
            data_cursor = data;

            *data_cursor = arg1;
            *code_cursor++ = (short)0x05DDu; // fld
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg1
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *data_cursor = arg2;
            *code_cursor++ = (short)0x05DDu; // fld
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg2
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0xE9DEu; // fsubp

            *data_cursor = arg3;
            *code_cursor++ = (short)0x05DDu; // fld
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of arg3
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0xC1DEu; // faddp

            double* result = data_cursor;
            *code_cursor++ = (short)0x1DDDu; // fstp
            *(int*)code_cursor = (int)(INT_PTR)data_cursor++; // address of result
            code_cursor = (short*)((char*)code_cursor + sizeof(int));

            *code_cursor++ = (short)0x90C3u; // ret, followed by a nop as padding

            void (*function)() = (void (*)())code;
            function(); // e.g. 1 - (-2) + 2 = 5

            double value = *result;
            VirtualFree(code, 0, MEM_RELEASE); // release the region once the result is read
            return value;
        }
    };
}

That is the C++ side. And here is the caller:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;
using smallcodelib;

namespace test_smallcodelib
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(" ! (* )");
            // Entering "*" ends the loop.
            while (!Console.ReadLine().Equals("*"))
            {
                double arg1;
                double arg2;
                double arg3;

                Console.Write("arg1?: ");
                arg1 = Convert.ToDouble(Console.ReadLine());
                Console.Write("arg2?: ");
                arg2 = Convert.ToDouble(Console.ReadLine());
                Console.Write("arg3?: ");
                arg3 = Convert.ToDouble(Console.ReadLine());

                double result = CodeMagics.ExecuteMagic(arg1, arg2, arg3);
                Console.WriteLine(String.Format("Result of arg1 - arg2 + arg3 = {0}", result));
            }
        }
    }
}

And this part is already C#!

Check it out! Everything works!

Clearly most of the code is in C++, but if interested people have the talent and the desire to toil in this area, they can write a C++ wrapper that generates such code dynamically and use that wrapper from C#, feeding it the necessary variables and parameters, and so on. The result could be a pretty interesting thing.

A few more words.
This article is about programming the coprocessor, but in fact you can write whatever your heart desires; for that you need to study memory and processor architecture and the instruction set. Technologically savvy programmers who know what SSE is (and it is almost at version 5 already) can write code that uses all the latest processor innovations and, most pleasantly, use it from C#. The only limit is your imagination =). Good luck in your endeavors!

I want to express my deep gratitude to my friend Peter Kankovsky, who helped me with all of this back in the day! He has his own wiki site, where he, his colleagues and friends discuss various ways to optimize code and more. [http://www.strchr.com/]

UPD: Here is a simple example of the same principle of generating native code, but entirely in C#. Thanks to lastmsu for the tip about Marshal.GetDelegateForFunctionPointer().

Thank you for your attention! Good luck!

Source: https://habr.com/ru/post/79883/
