Heavy start
Hello! Some time ago I started digging into OpenCL from C#. I quickly ran into difficulties, not so much because of C# specifically, but because there is very little material on the topic in general. An introduction to OpenCL can be found here, and a simple but working OpenCL starter is described here. I don't want to offend the authors one bit, but all the articles I found in Russian (including on Habr) suffer from the same problem: very few examples. There is documentation, there is a lot of it, and, as is usual for good documentation, it is hard to read. In this article (and, if all goes well, in a series of articles) I will try to describe the area in more detail, from the point of view of someone who started digging into it from scratch. I think this approach will be useful to anyone who wants a quick start in high-performance computing.
Initially, I wanted to write a single mini-tutorial covering what OpenCL is, how it is organized, how to write code, and some recommendations based on my experience. But along the way I realized that even if I kept it brief, I would run up against the limits of article length; IMHO, an article should be of a size that is not hard to digest in one sitting. So in this article (the first one) I plan to describe how to get started with OpenCL, how to check that everything is configured correctly locally, and how to write the simplest program. Questions of the memory model, architecture and so on will be covered in the following articles.
Friends, I should say right away that my experience with OpenCL is, unfortunately, still far from guru/Yoda level, but I will do my best to answer questions. And if I don't know something, I will share resources and my view of how it should work.
What. Where. How.
OpenCL is a technology for parallel computing on various kinds of GPUs and CPUs. Parallel computing on the GPU has been widely promoted recently alongside CUDA, mainly through the efforts of Nvidia. The differences between OpenCL and CUDA have already been widely discussed.
OpenCL lets you work with both the CPU and the GPU, but I think it will be more interesting for us to focus on the GPU. To use the technology you need a reasonably modern graphics card. The main thing is to check that the device is functioning normally; just in case, I remind you that this can be done in the Device Manager.

If in this window you see any error marks or warnings, your direct route is to the website of your video card's manufacturer. Fresh drivers should solve problems with the hardware and, as a result, give access to the OpenCL facilities.
I originally planned to use OpenCL from C#. But I ran into the problem that all the existing frameworks, such as Cloo or OpenCLNet, are third-party: Khronos has nothing to do with them and therefore does not guarantee their stable operation. And we all remember the main problem: very few examples. So I would first like to present examples written in C++, and only then, having confirmed that OpenCL behaves as we expect, bolt on a proxy in the form of a C# framework.
So, to use OpenCL from C++ you need to find its API. To do this, open the environment variables and look for a variable with a scary name hinting at the manufacturer of your video card. In my case it is called "AMDAPPSDKROOT". Then look at what lies at the path it points to, and find the include\CL folder there.

By the way, next to the CL folder inside include there is usually a GL folder, which gives access to the famous graphics library.
Now we create a project in Visual Studio, add the include folder to the project properties (in my case $(AMDAPPSDKROOT)\include\) and off we go!
Infrastructure
Keep in mind that we will work with OpenCL not through an API but using an API. These two phrases sound almost identical, but they are not. Remember OpenGL, for example. There (in a simplified view) the workflow is: first we set up some common parameters, and then we call methods like "draw a sphere" or "change the light source parameters" directly from our code.
In OpenCL the scenario is different:
- Using the API, we get access to devices that support OpenCL. This part of the application is usually called the host;
- We write the code that will be executed on the device. This code is called the kernel. It knows nothing about the host; any host can invoke it;
- Using the API, we load the kernel code and launch its execution on the selected device.
As you can see, our application will need a fair amount of infrastructure. Let's build it!
Since in the previous step we prudently added the include folder, we can now simply include the header file cl.h, which gives access to the API. When including cl.h, it is worth adding a platform check:
```c
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
```
Now we need to choose the device on which our code will run and create a context in which our variables will live. Here is how:
```c
// Declarations added for completeness
cl_platform_id platform_id = NULL;
cl_device_id device_id = NULL;
cl_uint ret_num_platforms, ret_num_devices;
cl_context context = NULL;
cl_command_queue command_queue = NULL;
cl_int ret;

ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
                     &device_id, &ret_num_devices);
context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
```
Pay attention to the ret variable. It holds the status code returned by each call: if ret == 0 (CL_SUCCESS), the function completed correctly; otherwise an error occurred.
The CL_DEVICE_TYPE_DEFAULT constant also deserves attention: it requests the device used for OpenCL computations by default. Other constants can be used instead. For example:
- CL_DEVICE_TYPE_CPU - requests an available CPU.
- CL_DEVICE_TYPE_GPU - requests an available GPU.
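A common pattern, for example, is to ask for a GPU first and fall back to the CPU if none is present. A sketch, reusing the variables from the listing above (CL_DEVICE_NOT_FOUND is the status clGetDeviceIDs returns when no matching device exists):

```c
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1,
                     &device_id, &ret_num_devices);
if (ret == CL_DEVICE_NOT_FOUND) {
    // No GPU on this platform: fall back to the CPU
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_CPU, 1,
                         &device_id, &ret_num_devices);
}
```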
Kernel
Fine. The infrastructure is set up; now let's take on the kernel. A kernel is simply a function declaration that starts with the __kernel keyword. The syntax of the OpenCL programming language is based on the C99 standard, but it has a number of specific and very important changes. That will (I really hope) be a separate article. For now, the basics:
- Code that will be invoked from the host side for execution must begin with the keyword __kernel;
- A function with the keyword __kernel always returns void ;
- There are memory type qualifiers: __global, __local, __constant, __private, which determine in which memory variables are stored. If a variable has no qualifier, it is __private;
- "Communication" between the host and the kernel will be through the parameters of the kernel. In order for kernel to send something to the host via a parameter, the parameter must be with the __global qualifier (for now, we will only use __global);
- The kernel code is usually stored in a file with the .cl extension. In fact, such code can also be generated on the fly, which lets you bypass some limitations. But more about that another time :)
The simplest kernel example is shown below:
```c
__kernel void test(__global int* message)
{
    // id of the work-item that is currently executing
    int gid = get_global_id(0);
    // add the id to the corresponding element of the buffer
    message[gid] += gid;
}
```
What does this code do? First it gets the global id of the work-item that is currently executing. A work-item is a single unit of the work our kernel performs. Since we are dealing with parallel computation, a separate kernel instance runs for each work-item, and it knows nothing about the others. Nobody can guarantee the order in which the work-items will execute. (More on this in a separate article, as I keep repeating.) In our example the global id is essentially the index of an element in the array, because we will process each element of the array in its own work-item. The second line of the kernel hardly needs a comment :)
Forming the kernel
The next step is to compile what lies in the *.cl file. This is done as follows:
```c
#define MAX_SOURCE_SIZE (0x100000)  // upper bound for the kernel source size

cl_program program = NULL;
cl_kernel kernel = NULL;
FILE *fp;
const char fileName[] = "../forTest.cl";
size_t source_size;
char *source_str;

fp = fopen(fileName, "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);

program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
kernel = clCreateKernel(program, "test", &ret);
```
The cl_program and cl_kernel types are defined in cl.h. The script itself is quite simple: we load the file, create a program object from its source (clCreateProgramWithSource) and compile it (clBuildProgram). If the ret variable still contains 0, you did everything right, and it only remains to create the kernel itself. It is important that the name passed to clCreateKernel matches the kernel name in the .cl file; in our case this is "test".
Parameters
I have already mentioned that the kernel "communicates" with the host by reading and writing the parameters passed to it. In our case this is the message parameter. Parameters that let the host exchange data with the kernel like this are called buffers. Let's create such a buffer on the host side and pass it to the kernel via the API:
```c
cl_mem memobj = NULL;
int memLenth = 10;
cl_int* mem = (cl_int *)malloc(sizeof(cl_int) * memLenth);
// Initialize the host array so we do not upload garbage to the device
for (int i = 0; i < memLenth; i++)
    mem[i] = 0;

memobj = clCreateBuffer(context, CL_MEM_READ_WRITE,
                        memLenth * sizeof(cl_int), NULL, &ret);
ret = clEnqueueWriteBuffer(command_queue, memobj, CL_TRUE, 0,
                           memLenth * sizeof(cl_int), mem, 0, NULL, NULL);
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobj);
```
Note the CL_MEM_READ_WRITE constant: it means the kernel has both read and write access to the buffer. Constants such as CL_MEM_WRITE_ONLY and CL_MEM_READ_ONLY can also be used. The second argument of clSetKernelArg is also important: it is the index of the parameter. Here it is 0, since the message parameter comes first in the kernel signature. If it were second, we would write:
```c
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobj);
```
clEnqueueWriteBuffer writes data from the mem array to the memobj buffer.
Well, in general, everything is ready. It remains only to execute the kernel.
Execute the kernel
Let's go ahead and send the code for execution:
```c
size_t global_work_size[1] = { 10 };
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);
// Note: the element size must match the buffer type, cl_int
ret = clEnqueueReadBuffer(command_queue, memobj, CL_TRUE, 0,
                          memLenth * sizeof(cl_int), mem, 0, NULL, NULL);
```
global_work_size contains the number of work-items to create. As I said, each element of the array will be processed by its own work-item; we have 10 elements, so global_work_size contains 10.
clEnqueueNDRangeKernel should not raise any particular questions: it simply launches the given kernel the given number of times.
clEnqueueReadBuffer reads data from the memobj buffer into the mem array. The data in mem is our result!
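One thing the listings above leave out is cleanup. Once the results have been read back, the OpenCL objects should be released and the host memory freed. A typical teardown, using the variables created earlier, looks like this:

```c
// Release OpenCL objects and free host buffers
clFlush(command_queue);
clFinish(command_queue);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(memobj);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
free(mem);
free(source_str);
```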
Results and conclusions
Friends, this is how I picture getting started in OpenCL for a beginner. I look forward to your constructive feedback in the comments so that I can make updates in the future. I tried to be brief, but the volume still came out not small, so I can say there is material for another 2-3 articles.
Thank you, everyone who read to the end!