
Hello, dear Habr community.
In previous articles we looked at OpenCL as a whole, then delved into the details of the standard and examined the ideas this technology is built on:
OpenCL. What is it and why is it needed? (if there is CUDA)
OpenCL. Technology details
Now it is time to try this technology in practice.
Preparation
So, for work we need:
the standard specification, an SDK (AMD or NVidia) and, optionally, some OpenCL literature, for example, from here.
If you install the Nvidia Computing SDK, you automatically get all the necessary documents. As a bonus, you also get many interesting example programs (30 of them in the latest SDK release). These examples make it easy to learn how to use OpenCL correctly, how to use several OpenCL devices at once, how to combine OpenCL with OpenGL (this interaction is specified by the standard!) and so on.
The OpenCL compiler is built into the driver, so the choice of IDE is not limited in any way, and I will not describe setting up any particular one. All you have to do is set the paths to the headers and libraries installed by the SDK.
Let's go.
Let's write a simple program that sums two vectors. Such a program is among the SDK examples for both CUDA and OpenCL, but ours will differ slightly (we remove the error-code checks at each step and simplify a little, leaving only the essence).
GPU part
Let's start with the most interesting and, perhaps, the simplest part of this example: the code that will be executed on the GPU.
The OpenCL syntax for writing kernels is nothing special and differs only slightly from that of CUDA: it is the good old C with some modifications.
Create a vectorAdd.cl file - our kernel will be located here.
__kernel void VectorAdd(__global const float * a, __global const float * b, __global float * c, int iNumElements)
{
    // get index into global data array
    int iGID = get_global_id(0);

    // bound check (equivalent to the limit on a 'for' loop in standard/serial C code)
    if (iGID >= iNumElements)
    {
        return;
    }

    // add the vector elements
    c[iGID] = a[iGID] + b[iGID];
}
We get our index in the global index space and add the vector elements at that index; if the index is beyond the vector size, nothing is done.
Everything looks easy and simple: a kernel is an ordinary function whose declaration is preceded by the __kernel keyword (two underscores), and the rest is like in C: the return type, the function name, the parameters (when declaring parameters, you must also specify the __global, __local or __private modifiers).
Kernels are written in C with a number of extensions and restrictions. A brief overview of the restrictions can be found here; the standard covers them in full, and the OpenCL Programming Guide can also be useful.
Among the language extensions: 2D and 3D “image” data types, and vector data types (ints, floats and so on) with dimensions from 2 to 4.
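As a quick illustration of the vector types, a kernel could add four floats at a time. This is a sketch, not part of the original example; the float4 type and its component-wise operators are defined by the standard:

```c
// OpenCL C sketch: vector types are operated on component-wise.
__kernel void VectorAdd4(__global const float4 * a, __global const float4 * b, __global float4 * c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   // adds all four components at once
}
```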
When declaring variables, you must specify the memory region where they should live: __global, __local or __private. If no modifier is specified, private memory is used.
If the kernel needs to call other functions (ones not visible to the host), you can define them in the same file, but without the __kernel qualifier.
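For example, a device-only helper could look like this (a sketch, not taken from the SDK):

```c
// OpenCL C sketch: a helper without the __kernel qualifier is callable
// from kernels but cannot be launched from the host.
float add2(float x, float y)
{
    return x + y;
}

__kernel void VectorAdd2(__global const float * a, __global const float * b, __global float * c, int n)
{
    int i = get_global_id(0);
    if (i < n)
        c[i] = add2(a[i], b[i]);
}
```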
Host part
We have created the simplest kernel. Now let's figure out how to run it on a video card.
The host part of the program will also be simple, limited to running the kernel.
Functions for working with the kernel are provided by the OpenCL API; these are C functions. C++ bindings and their documentation can be downloaded here.
The manual for all API functions is in the same document where the OpenCL standard is described.
Any kernel needs a context to execute in, so first we create the context object.
cl_context cxGPUContext = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, &ciErr1);
The first parameter is a list of context properties and their values. NULL means using default implementation-defined properties.
With the second parameter we tell the system that we want to work with GPUs; such a device can also be used by a 3D API, for example OpenGL. (The list of possible values of this parameter is also in the standard's specification.)
The following two parameters are needed to register the call-back function that will be called by OpenCL in case of errors in the context.
The last parameter is to return an error code. May be NULL.
In fact, this is not the only way to create a context, but this article does not aim to describe every function of the OpenCL API; this way is simply more convenient for us.
Next, we select the device (I have only one in the system, but for the future let our program use a device with the maximum number of FLOPS).
cl_device_id cdDevice = oclGetMaxFlopsDev(cxGPUContext);
I note that the initialization process differs between the various SDK examples; this may be intentional, to show that there is more than one way to perform these actions and to make the developer dig into the specification. For example, here we selected the device with the most FLOPS, but we could use the clGetContextInfo function to get the list of all devices associated with the context (see the original VectorAdd example).
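For reference, that alternative looks roughly like this (a sketch with error checks omitted; only the variable names from this article are assumed):

```c
// Ask for the size of the device list first, then fetch the list itself.
size_t nbytes;
clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &nbytes);
cl_device_id *devices = (cl_device_id *)malloc(nbytes);
clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, nbytes, devices, NULL);
cl_device_id cdDevice = devices[0];   // e.g. simply take the first device
free(devices);
```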
The device is chosen and initialized.
Now we attach a command queue to it.
cl_command_queue cqCommandQue = clCreateCommandQueue(cxGPUContext, cdDevice, 0, &ciErrNum);
The only interesting parameter here is the 0. It is actually a bit-field of command-queue properties: whether commands may be executed out of order and whether command profiling is enabled.
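If we did want those properties, the call might look like this (a sketch; both flags are defined by the standard):

```c
// A queue with out-of-order execution and profiling enabled.
cl_command_queue q = clCreateCommandQueue(cxGPUContext, cdDevice,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE,
    &ciErrNum);
```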
Everything is ready to work with the device, we can send commands to the queue for execution.
Create memory objects through which memory areas on the device and host will be linked.
cl_mem cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, sizeof(cl_float) * szGlobalWorkSize, NULL, &ciErr1);
cl_mem cmDevSrcB = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, sizeof(cl_float) * szGlobalWorkSize, NULL, &ciErr2);
cl_mem cmDevDst = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, sizeof(cl_float) * szGlobalWorkSize, NULL, &ciErr2);
We specify the access type to the object's memory (from the device's point of view), the size of the memory region, and the host memory region to associate with the memory object (here NULL).
We could initialize the input data right away by passing, as the penultimate parameter, a pointer to the host memory that should be copied to the device.
But we will do this later, just before the launch of the kernel, in order not to occupy space on the device ahead of time.
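For comparison, initializing a buffer at creation time could look like this (a sketch; srcA is the host array used later in the article):

```c
// CL_MEM_COPY_HOST_PTR copies srcA into the buffer right away,
// making the later clEnqueueWriteBuffer for this buffer unnecessary.
cl_mem cmDevSrcA = clCreateBuffer(cxGPUContext,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * szGlobalWorkSize, srcA, &ciErr1);
```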
All the preparatory work is done; now we turn to the kernel itself. As you remember, the OpenCL compiler is built right into the platform, so OpenCL kernels are built at runtime (a kernel can be built both from source and from binaries).
Let's get started
Create a kernel from the file that we wrote earlier.
char *source = oclLoadProgSource(source_path, "", &program_length);
We received the program's source code as a char * string. source_path is the full path to the vectorAdd.cl file; it is followed by the “preamble”, usually a header or a list of defines. The last parameter receives the size of the output string.
ATTENTION! oclLoadProgSource is not an OpenCL API function; it lives in the helper library shipped with the Nvidia Computing SDK. Now we create the program object from the loaded sources; the functions from here on are OpenCL API.
cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&source, &program_length, &ciErrNum);
A program may consist of several files, each loaded into its own char * string; we pass an array of such strings to create the program. The second parameter is the size of this array, in our case 1.
After the array of strings comes an array of the lengths of those strings.
The remaining parameters need no special attention.
The program has been pieced together from the source files; now let's build it (compilation and linking).
clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
This builds the program for the devices associated with the context (the context itself is not passed, since it is already attached to the program object).
Here you can specify for which of the devices related to the context you need to build the program. You can also set compiler options and set up a call-back function to enable asynchronous compilation.
We have only one device so far, and we can afford a synchronous compilation. And we will not use compiler options yet.
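When a build fails, the compiler messages can be retrieved with clGetProgramBuildInfo. A sketch of the usual pattern (error checks omitted; variable names are the ones from this article):

```c
if (clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS)
{
    // Query the size of the build log, then fetch and print it.
    size_t log_size;
    clGetProgramBuildInfo(cpProgram, cdDevice, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size);
    clGetProgramBuildInfo(cpProgram, cdDevice, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
}
```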
All this is described in detail in the standard's specification. After the build, a ready executable is associated with our program object. Now, from the function with the __kernel qualifier (so far it is just a function), we need to create a kernel.
cl_kernel ckKernel = clCreateKernel(cpProgram, "VectorAdd", &ciErr1);
Everything is ready, all preparations have been completed and the moment of truth has arrived: the launch of the kernel for execution.
BUT! You must first set the input parameters with which our kernel will be called.
clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void *)&cmDevSrcA);
clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void *)&cmDevSrcB);
clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void *)&cmDevDst);
clSetKernelArg(ckKernel, 3, sizeof(cl_int), (void *)&iNumElements);
We specify the argument's index, its size and a pointer to its value.
Now everything really is ready. Work with the queue begins:
We copy the data to the device (asynchronously; the third argument controls this).
clEnqueueWriteBuffer(cqCommandQue, cmDevSrcA, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcA, 0, NULL, NULL);
clEnqueueWriteBuffer(cqCommandQue, cmDevSrcB, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcB, 0, NULL, NULL);
The most important part: enqueueing the kernel for execution.
clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL, &szGlobalWorkSize, NULL, 0, NULL, NULL);
With the first two parameters, everything is clear.
The third parameter is the dimension of the index space. Vector - one-dimensional.
It is followed by the offset into the index space, which in the current version of the standard must always be NULL.
szGlobalWorkSize indicates the size of the index space — this is the total number of work-items to be executed.
Group size is left to the discretion of the driver (NULL).
The following two parameters are used for synchronization when using out-of-order command execution. This is a list of events that must be completed before running this command (first comes the size of the list, then the list itself).
Through the last parameter, the event object is returned, signaling the completion of the command.
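One practical note on the sizes above: if we did pass an explicit work-group size, the standard requires the global size to be a multiple of it, so the SDK examples round szGlobalWorkSize up and rely on the kernel's bound check to discard the extra work-items. A self-contained sketch of such a helper (the name round_up is hypothetical, mirroring what the SDK's utility does):

```c
#include <stddef.h>

/* Round num_elements up to the nearest multiple of group_size. */
size_t round_up(size_t group_size, size_t num_elements)
{
    size_t remainder = num_elements % group_size;
    return (remainder == 0) ? num_elements
                            : num_elements + group_size - remainder;
}
```

With a group size of 256 and 1000 elements, the global size becomes 1024, and work-items 1000..1023 simply return from the kernel.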
It remains only to read back the result. Let's do it synchronously:
clEnqueueReadBuffer(cqCommandQue, cmDevDst, CL_TRUE, 0, sizeof(cl_float) * szGlobalWorkSize, dst, 0, NULL, NULL);
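The SDK examples typically finish by comparing the GPU result against a CPU reference (a "golden" check). A self-contained sketch of such a check (the helper name is hypothetical):

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every c[i] matches a[i] + b[i] within a small tolerance. */
int check_vector_add(const float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        if (fabsf((a[i] + b[i]) - c[i]) > 1e-5f)
            return 0;   /* mismatch found */
    }
    return 1;           /* all elements correct */
}
```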
Now it only remains to clean up: release the created memory objects, the program and the other OpenCL objects. This is straightforward and can be found in any SDK example.
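For completeness, a minimal cleanup sketch using the names from this article (release order is mostly a matter of taste, but the context should go last):

```c
clReleaseKernel(ckKernel);
clReleaseProgram(cpProgram);
clReleaseMemObject(cmDevSrcA);
clReleaseMemObject(cmDevSrcB);
clReleaseMemObject(cmDevDst);
clReleaseCommandQueue(cqCommandQue);
clReleaseContext(cxGPUContext);
free(source);   /* the string returned by oclLoadProgSource */
```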
Conclusion
The simplest program is ready.
Then you can study the specifications in depth, study the features of specific implementations of OpenCL, try various optimization options in compilers, read the documents produced by the suppliers of these implementations and, of course, write more complex programs.
There are many nuances and pitfalls in writing OpenCL programs; there is also an OpenCL application profiler from Nvidia that offers a number of interesting features. But all this is beyond the scope of this article and, if dear readers show interest, a separate article could cover the various subtleties of OpenCL applications and of OpenCL on Nvidia GPUs.
List of useful links
www.nvidia.com/object/cuda_opencl.html - here you can register for access to the Nvidia Computing SDK and download useful documents.
www.khronos.org/registry/cl - the page on the Khronos Group site. Specifications, header files, and so on.
developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx - AMD Stream SDK with OpenCL support.
I strongly recommend reading two Nvidia documents: the OpenCL Programming Guide and the OpenCL Best Practices Guide.