
Intel® Graphics Technology. Part II: “offloading” computations to the graphics



We continue our discussion of Intel® Graphics Technology, namely, what we have at our disposal for writing code: the offload and offload_attribute pragmas for offloading, the target(gfx) and target(gfx_kernel) attributes, the __GFX__ and __INTEL_OFFLOAD macros, intrinsics, and a set of API functions for asynchronous offload. That is all we need to be happy. I almost forgot: of course, we also need the Intel compiler and the magic option /Qoffload.



But first things first. One of the main ideas is that existing code running on the CPU can be modified relatively easily to run on the graphics integrated into the processor.



The easiest way to show this is with a simple example of summing two arrays:



void vector_add(float *a, float *b, float *c){
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    return;
}


With Intel® Cilk™ Plus technology we can easily make it parallel by replacing the for loop with cilk_for:


void vector_add(float *a, float *b, float *c){
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    return;
}


In the next step we offload the computation to the graphics using the #pragma offload directive in synchronous mode:



void vector_add(float *a, float *b, float *c){
    #pragma offload target(gfx) pin(a, b, c:length(N))
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    return;
}


Or we create a kernel for asynchronous execution using the __declspec(target(gfx_kernel)) specifier before the function:



__declspec(target(gfx_kernel))
void vector_add(float *a, float *b, float *c){
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    return;
}


By the way, the letters GFX turn up everywhere, which is a hint that we are working with the graphics integrated into the processor (GFX - Graphics), not with a GPU, a term usually reserved for discrete graphics.



As you may have guessed, this approach has a number of peculiarities. First of all, everything works only with cilk_for loops. Obviously the code needs a parallel version in any case, but for now only the loop mechanism from Cilk is supported, so OpenMP, for example, is out of luck. Keep in mind as well that the graphics do not handle 64-bit floats and integers very well, especially in hardware, so you should not expect high performance from such operations.



There are two main modes of computing on the graphics: synchronous and asynchronous. The first is implemented with compiler directives, the second with a set of API functions; in the latter case, to perform the offload, the function (kernel) declared for it has to be put into the execution queue.



Synchronous mode

It is implemented with the #pragma offload target(gfx) directive placed in front of the cilk_for loop we are interested in.

In a real application that loop may well contain a call to some function, in which case that function must also be declared with __declspec(target(gfx)).
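For example, a helper function called from the offloaded loop could be marked like this (saxpy_element and saxpy are illustrative names made up for this sketch, not taken from the original example):

__declspec(target(gfx))                      // also compiled for the graphics
float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

void saxpy(float a, float *x, float *y, float *z) {
    #pragma offload target(gfx) pin(x, y, z:length(N))
    cilk_for (int i = 0; i < N; i++)
        z[i] = saxpy_element(a, x[i], y[i]); // calling the helper inside the offloaded loop
}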

The synchronicity means that the host (CPU) thread executing the code waits until the computation on the graphics has finished. At the same time, the compiler generates code for both the host and the graphics, which gives greater flexibility when running on different hardware: if offload is not supported, all of the code runs on the CPU. We already talked about how this is implemented in the first post.

The directive accepts a number of clauses that control how data is passed between the host and the graphics.
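As a hedged illustration of how such clauses combine (in, out and inout here follow the generic #pragma offload syntax for copying data to and from the target; check your compiler version's documentation for the exact set supported with target(gfx)):

// Assumed clause semantics, following the generic offload syntax:
//   in(...)    - copy the data to the graphics before the loop
//   out(...)   - copy the results back to the host after the loop
//   inout(...) - copy in both directions
//   pin(...)   - do not copy at all, share the (non-swappable) pages
#pragma offload target(gfx) in(a, b:length(N)) out(c:length(N))
cilk_for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];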



An important note: using pin can significantly reduce the overhead of offload. Instead of copying the data back and forth, we arrange access to physical memory that is available to both the host (CPU) and the integrated graphics. If the amount of data is small, though, we will not see much of a gain.

Since the OS does not know that the processor graphics is using that memory, the obvious solution was to make the pages in question non-swappable, to avoid unpleasant surprises. So be careful and do not pin too much: otherwise you end up with a lot of pages that cannot be swapped out, and the overall performance of the system will certainly not benefit from that.



In our example of summing two arrays we simply use the pin(a, b, c:length(N)) clause:



 #pragma offload target(gfx) pin(a, b, c:length(N)) 


That is, the arrays are not copied into the graphics memory but remain accessible in shared memory, and the corresponding pages are not swapped out until we are done.

By the way, the /Qoffload- option makes the compiler ignore the pragmas, in case we suddenly get tired of offload. And nobody has cancelled good old #ifdef, a technique that is still perfectly relevant:



#ifdef __INTEL_OFFLOAD
    cout << "\nThis program is built with __INTEL_OFFLOAD.\n"
         << "The target(gfx) code will be executed on target if it is available\n";
#else
    cout << "\nThis program is built without __INTEL_OFFLOAD\n"
         << "The target(gfx) code will be executed on CPU only.\n";
#endif


Asynchronous mode

Now let us look at the other offload mode, which is based on API functions. The graphics have their own execution queue, and all we need to do is create kernels (gfx_kernel) and put them into that queue. A kernel is created with the __declspec(target(gfx_kernel)) specifier before a function. When a host thread submits a kernel to the queue, it continues executing; if needed, it can wait for the computation on the graphics to finish with the _GFX_wait() function.



In the synchronous mode, every time we enter the offload region the memory is pinned (if we chose not to copy, of course), and the pinning ends when we leave the loop. This happens implicitly and requires no extra constructs. As a result, if the offload itself sits inside a loop, we pay a very large overhead. In the asynchronous case we can state explicitly, via the API functions, when pinning should start and when it should end.
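To make the difference concrete, here is a sketch of the problematic synchronous pattern (iterations is just an illustrative host-side loop count, not something from the original example):

// Synchronous offload inside a host loop: each iteration implicitly
// pins a, b and c before the offloaded region and unpins them after it,
// so the pinning overhead is paid on every pass through the loop.
for (int iter = 0; iter < iterations; iter++) {
    #pragma offload target(gfx) pin(a, b, c:length(N))
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

In the asynchronous variant shown further below, the same arrays are shared once with _GFX_share, reused by every enqueued kernel, and unshared once at the end.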



In addition, in asynchronous mode the compiler does not generate code for both the host and the graphics, so a host-only implementation is something we have to take care of ourselves.
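One possible way to organize that, sketched on the assumption that a plain CPU version of the kernel is kept next to the graphics kernel (vec_add_cpu is a name made up for this sketch):

// CPU fallback: an ordinary Cilk Plus function, no offload involved.
void vec_add_cpu(float *c, float *a, float *b, int n) {
    cilk_for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// ... later, at the call site (the _GFX_share/_GFX_unshare calls from the
// full example below are assumed to surround the offload branch):
#ifdef __INTEL_OFFLOAD
    _GFX_enqueue("vec_add", c, a, b, TOTALSIZE);   // run on the graphics
    _GFX_wait();
#else
    vec_add_cpu(c, a, b, TOTALSIZE);               // run on the host
#endif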



Here is the code for computing the sum of the arrays in asynchronous mode (the asynchronous version of the vec_add kernel was shown above):



float *a = new float[TOTALSIZE];
float *b = new float[TOTALSIZE];
float *c = new float[TOTALSIZE];
float *d = new float[TOTALSIZE];
a[0:TOTALSIZE] = 1;
b[0:TOTALSIZE] = 1;
c[0:TOTALSIZE] = 0;
d[0:TOTALSIZE] = 0;

_GFX_share(a, sizeof(float)*TOTALSIZE);
_GFX_share(b, sizeof(float)*TOTALSIZE);
_GFX_share(c, sizeof(float)*TOTALSIZE);
_GFX_share(d, sizeof(float)*TOTALSIZE);

_GFX_enqueue("vec_add", c, a, b, TOTALSIZE);
_GFX_enqueue("vec_add", d, c, a, TOTALSIZE);

_GFX_wait();

_GFX_unshare(a);
_GFX_unshare(b);
_GFX_unshare(c);
_GFX_unshare(d);


So, we declare and initialize four arrays. With the _GFX_share function we say explicitly that this memory (the starting address and length in bytes are passed as parameters) has to be pinned, that is, we will use memory shared between the CPU and the graphics. After that we enqueue the kernel we need, vec_add, which is defined with __declspec(target(gfx_kernel)) and, as always, uses a cilk_for loop. The host thread puts the second invocation of vec_add with new parameters into the queue without waiting for the first one to finish. With _GFX_wait we wait for all the kernels in the queue to complete, and at the end we explicitly stop pinning the memory with _GFX_unshare.



Do not forget that to use the API functions we need the header file gfx_rt.h, and to use cilk_for we need to include cilk/cilk.h.

An interesting point: out of the box the compiler could not find gfx_rt.h, so I had to add the path to its folder (C:\Program Files (x86)\Intel\Composer XE 2015\compiler\include\gfx in my case) to the project settings by hand.
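With that path set, the top of the asynchronous example boils down to roughly these two includes (assuming the gfx folder itself is on the include path, as described above):

#include <cilk/cilk.h>   // cilk_for
#include <gfx_rt.h>      // _GFX_share, _GFX_enqueue, _GFX_wait, _GFX_unshare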



I also found one interesting option that I did not mention in the previous post when talking about code generation. If we know in advance what hardware we are going to run on, we can tell the compiler explicitly with the /Qgpu-arch option. So far there are only two choices: /Qgpu-arch:ivybridge or /Qgpu-arch:haswell. As a result, the linker will invoke the compiler to translate the code from the vISA architecture to the one we need, and we save on JIT compilation at run time.
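On Windows the whole thing boils down to a command line along these lines (icl is the Intel compiler driver; treat the exact spelling as an assumption and check your compiler version):

icl /Qoffload /Qgpu-arch:haswell vector_add.cpp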



And finally, an important note about how offload works on Windows 7 (and DirectX 9): it is essential that the display is active, otherwise nothing will work. Windows 8 does not have this limitation.



And remember that we are talking about graphics integrated into the processor. The constructs described here do not work with discrete graphics: for that, use OpenCL.

Source: https://habr.com/ru/post/250545/


