This article covers the basics of programming with OpenCL. OpenCL is a language for programming GPUs and CPUs whose structure is close to the C99 standard. It is developed by the Khronos Group, and full documentation is available on their website. To head off the objection "this is all trivial, it's enough to dig around the internet", I'll make a reservation at once: there is almost no information on this topic in the Russian-language internet, and on the western internet it exists in a rather fragmented state across a dozen sites. This article is a compilation of the basic principles that should simplify a novice programmer's life as much as possible and make it possible to use the computing power of a video card from the very first project. People who have already written two or three serious programs in OpenCL will find little of interest here. In a sense, the article is a continuation of my previous article.
Compiler
First of all, the question: where to write the code itself? As far as I know, under .NET there is not yet any tooling that lets you edit kernel code directly in the IDE, so we have to use third-party editors. AMD, NVIDIA and Intel ship them with their SDKs; for some reason I like Intel's the most. As another option, there are several editors written by enthusiasts; of these, my favourite is the editor that comes with OpenCLTemplate. It is worth noting that these are only editors: compilation of the code happens immediately before it runs on the GPU/CPU.
Device memory model

Before describing the language itself, I will give a brief description of the physical model of the device it interacts with. Language commands are executed by objects called work-items. Each work-item is independent of the others and can execute code in parallel with them. If one work-item wants to get data used or already processed by another work-item, it can do so through global memory. Global memory is quite slow, but it has a large capacity. To speed up calculations there is local memory; if you are familiar with CUDA, it is called "shared memory" there. It is much faster than global memory, but not every work-item can access a given piece of it: only the work-items of one group share the same local memory. These groups are called compute units or workgroups (the first name refers to the physical partitioning at the hardware level, the second to the logical partitioning at the program level). Depending on the device, each of these groups holds a different number of work-items (for example, 240 for the NVIDIA GT200 or 256 for the Radeon 5700 series). The number of such groups is limited to a fairly small number (30 for the NVIDIA GT200 or 9-10 for the Radeon 5700 series). There is also ultra-fast private memory, which each work-item accesses individually.
OpenCL device drivers automate the launching and scheduling of work-items and workgroups. For example, if we need to perform a million tasks and have only a thousand work-items, the driver will automatically hand each work-item the next task as soon as it finishes the previous one. Understanding the physical layer is only needed to have an idea of how processes can interact and how they access memory.
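The scheduling idea above can be sketched in plain C. This is only an illustration of the coverage guarantee, not the driver's actual algorithm: with W work-items and N tasks, it amounts to each work-item w handling tasks w, w + W, w + 2W, and so on.

```c
#include <assert.h>

/* Model: work-item w handles tasks w, w + W, w + 2W, ...
   until the task range [0, N) is exhausted. */
static int tasks_done_by(int w, int W, int N)
{
    int done = 0;
    for (int task = w; task < N; task += W)
        done++;
    return done;
}

/* Every task gets done exactly once across all W work-items. */
static int total_tasks_done(int W, int N)
{
    int total = 0;
    for (int w = 0; w < W; w++)
        total += tasks_done_by(w, W, N);
    return total;
}
```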
Basic features
Since the language is based on practically standard C99, I will consider only the features that distinguish OpenCL from it. Consider the code of the simplest kernel, which I cited in the previous article. This code adds two vectors, v1 and v2, putting the result in the first one.
__kernel void floatVectorSum(__global float * v1,
                             __global float * v2)
{
    int i = get_global_id(0);
    v1[i] = v1[i] + v2[i];
}
Declaring procedures
First of all, the mysterious "__kernel" catches the eye. Any procedure that we want to call from the host must be marked with this qualifier. If a procedure does not need to be called from the outside, it can be left unmarked.
Types of memory
The "__global" qualifier denotes memory allocated from the global address space of the device. It is fairly slow but roomy: on modern video cards it is measured in gigabytes. If you are running on a CPU, global memory is simply RAM.
Besides __global, there is "__local". Only the work-items of one workgroup can access it; about 8 kilobytes are allocated to each such group.
The fastest memory is "__private". This is memory that only an individual thread (work-item) can access. In total, about 32 registers of this memory are allocated per work-item.
The remaining memory types that can be declared when writing a kernel behave like "__global". First, there is "__constant", which can only be read. Second, there are "__read_only", "__write_only" and "__read_write", access qualifiers that are used only for images.
Process identifiers
Once launched on a video card, all processes are equivalent and execute identical code. But obviously we do not want the same action repeated many times: each process must do its own piece of the task. Process identifiers let each process find its place in the world around it. The simplest identifier is get_global_id(0). In the example above, it gives the index i of the vector element that this process should add. If we process not a one-dimensional vector but a two-dimensional image, we need to know the position of the process along two axes. This value could of course be computed by hand, but those are unnecessary operations; instead, for convenience, you can specify at launch time that you need a two-dimensional index space. Then each process can get both position identifiers: get_global_id(0) and get_global_id(1). The same can be done for three-dimensional space. The size of the space we work in is often needed as well: for example, for almost any image processing we need the image's width and height. The size of the space is obtained with get_global_size(i). In addition, there are identifiers of a process within its workgroup, get_local_id(i) and get_local_size(i), and identifiers of the group itself, get_group_id(i) and get_num_groups(i). Most of these are related to each other: num_groups * local_size = global_size, local_id + group_id * local_size = global_id, global_size % local_size = 0.
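The relationships between these identifiers can be checked with plain C arithmetic. The launch sizes below are hypothetical, chosen only to illustrate the identities:

```c
#include <assert.h>

/* Model of: global_id == group_id * local_size + local_id */
static int model_global_id(int group_id, int local_id, int local_size)
{
    return group_id * local_size + local_id;
}

/* Model of: num_groups == global_size / local_size,
   which is exact because global_size % local_size == 0. */
static int model_num_groups(int global_size, int local_size)
{
    return global_size / local_size;
}
```

For example, with a global size of 1024 and workgroups of 256, there are 4 groups, and local id 10 in group 3 corresponds to global id 778.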
Optimization of calculations
The developers of OpenCL and of video cards understood that the main purpose of their creation is to speed up heavy computations. To this end, a number of specialized features were added to the language that make it possible to gain speed on mathematical problems.
Built-in vectors
The very first such feature is vectors and vector math. In OpenCL you can declare 2-, 4-, 8- and 16-component vectors as variables: int2, int4, int8, int16, and likewise for float, double, char and all other scalar types. Vectors of the same dimension can be added, subtracted, divided and multiplied component-wise, and any vector can be divided or multiplied by a scalar:
uint4 sumall = (uint4)(1,1,1,1);
sumall += (uint4)(1,1,1,1);
sumall = sumall/2;
In addition, there are a number of functions optimized for vectors that work with them directly, such as the distance function and the cross product. For example:
float4 dir1 = (float4)(1, 1, 1, 0);
float4 dir2 = (float4)(1, 2, 3, 0);
float4 normal = cross(dir1, dir2);
You can also compose vectors by taking parts of one and another, as well as glue vectors together into larger ones:
int4 vi0 = (int4) -7;                    // broadcast: vi0 = (-7, -7, -7, -7)
int4 vi1 = (int4)(0, 1, 2, 3);
vi0.lo = vi1.hi;                         // components 2,3 of vi1 go into components 0,1 of vi0
int8 v8 = (int8)(vi0.s0123, vi1.s0123);  // glue two int4 into one int8
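The .lo/.hi swizzle semantics can be modeled in plain C with a small struct (int4_model here is an illustrative stand-in for the built-in int4, not an OpenCL type):

```c
#include <assert.h>

typedef struct { int s[4]; } int4_model;

/* Model of: int4 vi0 = (int4) -7;  -- broadcast a scalar */
static int4_model broadcast4(int v)
{
    int4_model r = {{ v, v, v, v }};
    return r;
}

/* Model of: vi0.lo = vi1.hi;  -- .lo is components 0,1 and
   .hi is components 2,3, so this copies s2,s3 of vi1 into s0,s1 of vi0. */
static int4_model assign_lo_from_hi(int4_model vi0, int4_model vi1)
{
    vi0.s[0] = vi1.s[2];
    vi0.s[1] = vi1.s[3];
    return vi0;
}
```

With vi0 = (-7,-7,-7,-7) and vi1 = (0,1,2,3), the assignment leaves vi0 equal to (2,3,-7,-7).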
Native functions
The next feature of OpenCL is its built-in function library. In addition to the standard math set, OpenCL has so-called native functions. These map directly onto hardware features of video cards and use approximate arithmetic. They are not advisable for high-precision calculations, but in image filtering the difference is imperceptible. Such functions include, for example, native_sin, native_cos and native_powr. I will not explain them in more detail here: there are many of them and their principles differ. If you need them, see the documentation.
Common Functions
In addition to native functions, the developers provide a set of so-called common functions: functions that occur frequently in image processing. For example, mad(a, b, c) = a * b + c and mix(a, b, c) = a + (b - a) * c. These execute faster than the corresponding sequences of mathematical operations.
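The definitions of these two common functions are simple enough to express as plain-C equivalents (these models compute the same values; the OpenCL built-ins may additionally use fused hardware instructions):

```c
#include <assert.h>

/* Model of OpenCL's mad(a, b, c) = a * b + c */
static float mad_model(float a, float b, float c)
{
    return a * b + c;
}

/* Model of OpenCL's mix(a, b, c) = a + (b - a) * c,
   i.e. linear interpolation between a and b by factor c. */
static float mix_model(float a, float b, float c)
{
    return a + (b - a) * c;
}
```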
Example
The site www.cmsoft.com.br has a wonderful example showing what can be gained by optimizing code with native and common functions:
kernel void regularFuncs()
{
for ( int i=0; i<5000; i++)
{
float a=1, b=2, c=3, d=4;
float e = a*b+c;
e = a*b+c*d;
e = sin(a);
e = cos(b);
e = a*b+c*d;
e = sin(a);
e = cos(b);
e = a*b+c*d;
e = sin(a);
e = cos(b);
float4 vec1 = (float4)(1, 2, 3, 0);
float4 vec2 = (float4)(-1, 3, 1, 0);
float4 vec = distance(vec1, vec2);
double x=1, y=2, z=3;
double resp = x*y+z;
}
}
kernel void nativeFuncs()
{
for ( int i=0; i<5000; i++)
{
float a=1, b=2, c=3, d=4;
float e = mad(a,b,c);
e = mad(a,b,c*d);
e = native_sin(a);
e = native_cos(b);
e = mad(a,b,c*d);
e = native_sin(a);
e = native_cos(b);
e = mad(a,b,c*d);
e = native_sin(a);
e = native_cos(b);
float4 vec1 = (float4)(1, 2, 3, 0);
float4 vec2 = (float4)(-1, 3, 1, 0);
float4 vec = fast_distance(vec1, vec2);
double x=1, y=2, z=3;
double resp = mad(x,y,z);
}
}
The second procedure (the one using optimization) runs 35 times faster.
Extensions
It is worth noting that OpenCL has a number of directives that enable various optional features. There are two reasons for this. First, historically not all devices supported these features. Second, these features can affect performance. A feature is usually enabled with the following command:
#pragma OPENCL EXTENSION <extension_name> : <behavior>
For example, the following commands enable byte-addressable stores and double-precision calculations:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Synchronization
Barriers
Synchronization is often needed in calculations, and it can be achieved in several ways. The first is barriers. A barrier is a command that stops a process until all other processes, or all processes of its workgroup, have reached it. Here are two examples:
kernel void localVarExample()
{
int i = get_global_id(0);
__local int x[10];
x[i] = i;
barrier(CLK_LOCAL_MEM_FENCE);
if (i>0) int y = x[i-1];
}
kernel void globalVarExample()
{
int i = get_global_id(0);
__global int x[10];
x[i] = i;
barrier(CLK_GLOBAL_MEM_FENCE);
if (i>0) int y = x[i-1];
}
In the first example, all processes of the workgroup wait at the barrier command; in the second, all processes of the OpenCL device wait.
Note a feature of this example: the declarations "__local int x[10];" and "__global int x[10];". They allocate a variable shared by the processes of one group, or by all processes, right during execution. (Be careful, though: strictly speaking, the OpenCL 1.x specification does not allow declaring __global variables inside a kernel body, so the second example may not compile on all implementations.)
Atomic operations
The second way to synchronize threads is atomic operations: functions that prevent simultaneous access to the same memory. Before using them, you need to enable the following extensions:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable
The simplest example of how this function works:
__kernel void test(global int * num)
{
atom_inc(&num[0]);
}
If we wrote "num++" instead of "atom_inc(&num[0]);", the result of the program would be unpredictable, since all processes would access the memory simultaneously and read the same value from it. In total there are 11 atomic operations: add, sub, xchg, inc, dec, cmpxchg, min, max, and, or, xor.
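One detail worth making explicit: each atomic function returns the OLD value that was in memory before the operation. The following plain-C, single-threaded model shows only that return-value contract (on a real device the point is that the read-modify-write happens indivisibly across concurrent work-items):

```c
#include <assert.h>

/* Model of atom_inc: increments *p and returns the old value. */
static int model_atom_inc(int *p)
{
    int old = *p;
    *p = old + 1;
    return old;
}

/* Model of atom_xchg: stores val into *p and returns the old value. */
static int model_atom_xchg(int *p, int val)
{
    int old = *p;
    *p = val;
    return old;
}
```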
Using these functions, it is not difficult to create a semaphore. (Keep in mind, though, that this kind of spin-waiting between work-items of different groups can hang on some GPUs, where workgroups are not guaranteed to be scheduled concurrently.)
void GetSemaphor(__global int * semaphor)
{
    int occupied = atom_xchg(semaphor, 1);
    while (occupied > 0)
    {
        occupied = atom_xchg(semaphor, 1);
    }
}
void ReleaseSemaphor(__global int * semaphor)
{
int prevVal = atom_xchg(semaphor, 0);
}
Work with images
The last thing I want to cover in this guide is working with images in OpenCL. The creators tried to make it require minimal effort from the user, which is very nice. Images can be loaded as the image2d_t and image3d_t types; the first are ordinary 2-D images, the second are three-dimensional. A loaded image must also be declared with one of the access qualifiers "__read_only", "__write_only" or "__read_write". Reading and writing image data is possible only through special functions: value = read_imageui(image, sampler, position) and write_imageui(image, position, value).
In my opinion, everything here is clear except the notion of a sampler. A sampler is the object that optimizes your access to the image. It has three parameters: normalized coords, addressing mode and filter mode. The first takes two values, CLK_NORMALIZED_COORDS_TRUE and CLK_NORMALIZED_COORDS_FALSE; as the name suggests, it indicates whether the input coordinates are normalized. The second says what to do when you try to read coordinates outside the borders of the image. The options are: mirror the image (CLK_ADDRESS_MIRRORED_REPEAT), take the nearest edge value (CLK_ADDRESS_CLAMP_TO_EDGE), take the border color (CLK_ADDRESS_CLAMP), or do nothing, with the user guaranteeing that this will not happen (CLK_ADDRESS_NONE). The third says what to do when the input coordinates are not integers: take the nearest sample (CLK_FILTER_NEAREST) or interpolate linearly (CLK_FILTER_LINEAR).
A brief example: we blur the image by replacing each pixel with the average value over its neighborhood:
__kernel void ImageDiff(__read_only image2d_t bmp1, __write_only image2d_t bmpOut)
{
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP_TO_EDGE |
CLK_FILTER_NEAREST;
int2 coords = (int2)(get_global_id(0), get_global_id(1));
uint4 sumall = (uint4)(0,0,0,0);
int sum = 0;
for ( int i=-10;i<11;i++)
for ( int j=-10;j<11;j++)
{
int2 newpol = (int2)(i,j)+coords;
sumall+= read_imageui(bmp1, smp, newpol);
sum++;
}
sumall = sumall/sum;
write_imageui(bmpOut, coords, sumall);
}
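The same averaging logic can be checked in plain C on a tiny grayscale image. This is a model of the kernel above, with a 1-pixel radius instead of 10 and clamp-to-edge addressing done by hand (the role CLK_ADDRESS_CLAMP_TO_EDGE plays in the kernel); integer division truncates, just like the uint4 division in the kernel:

```c
#include <assert.h>

#define W 4
#define H 4

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Average of the 3x3 neighborhood of (x, y), with out-of-bounds
   coordinates clamped to the image edge. */
static int avg_at(int img[H][W], int x, int y)
{
    int sum = 0, n = 0;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++) {
            sum += img[clampi(y + j, 0, H - 1)][clampi(x + i, 0, W - 1)];
            n++;
        }
    return sum / n;
}
```

For an interior pixel the result is an ordinary 9-point average; at a corner, the clamped reads repeat the edge pixels, which weights the corner more heavily, exactly as the sampler would.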
Useful links
Well, I think I have managed a short description. Now a few links for more detailed study, for those who need them.
The official website with documentation. A site with examples and clear descriptions. Good PDF files where the structure of OpenCL devices is nicely illustrated. There are also two presentations about OpenCL in Russian; they contain rather little information and lack link text, but the examples are good.
The first.
The second.