OpenCL: the wait is over, version 1.1 from nVidia is here, but what's new?

A bit of history, or waiting three years for what was promised

A little more than a year ago, the Khronos Group introduced OpenCL 1.1, and nVidia immediately boasted that it already had a pre-release driver supporting the new standard. That would have been fine, except that a pre-release is not a working tool (the official drivers have enough bugs, never mind a test version), so developers patiently waited for the final release. CUDA 4 came out, but OpenCL 1.1 still did not. Moreover, even the pre-release OpenCL 1.1 support was removed from the new drivers, i.e. you had to choose between the old driver with CUDA 3 + OpenCL 1.1 and the new driver with CUDA 4 + OpenCL 1.0. But today it finally happened! Developers received a letter stating that the final version is available in the official 280.13 driver; the driver itself is still in beta, but not for long.

So I decided to recall what is new and good in this version, and to share some comments on why each feature might be needed and which pitfalls you should know about.

What's new in OpenCL 1.1

Let's go through the Khronos Group list, and if I remember something that is not on it, I will add it.

Host-thread safety

Thread safety is a useful thing: now we don't need to think about which calls are safe to make and how to synchronize them. How relevant it is in practice is a good question. As long as a single card is used, it makes little difference. When there are several cards, each driven by its own thread, then yes, life becomes simpler.
There is, admittedly, one exception that is usually not mentioned: clSetKernelArg is not thread-safe when it is called on the same kernel object from several threads (for different kernel objects, call it as much as you like). Whether this was done for speed (I don't see much point) or because there is little reason to use the same object from different threads anyway, a fact is a fact, so be careful.

Sub-buffer object

In short, you can now split a buffer into several pieces and send each piece to the device that needs it. Again, this is useful with multiple video cards and it improves portability. Data integrity is also preserved, since you no longer need to create a separate buffer for each card. In general, you could manage without it, but it will come in handy.
A small nuance (quite logical): you cannot write to a sub-buffer and at the same time read from an overlapping sub-buffer or from the parent buffer; in that case the result is undefined.
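
To make the overlap rule concrete, here is a minimal sketch in plain C. It does not call OpenCL at all: region_t is a hypothetical stand-in that mirrors the origin and size fields of cl_buffer_region, and regions_overlap and region_fits are my own helpers, not part of any API. The idea is simply to check the byte ranges on the host before handing overlapping pieces to different devices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_buffer_region: a byte range in a parent buffer. */
typedef struct {
    size_t origin; /* offset from the start of the parent buffer, in bytes */
    size_t size;   /* length of the region, in bytes */
} region_t;

/* Returns 1 if the two byte ranges share at least one byte, 0 otherwise. */
static int regions_overlap(region_t a, region_t b)
{
    assert(a.size > 0 && b.size > 0); /* empty regions are not modelled */
    return a.origin < b.origin + b.size && b.origin < a.origin + a.size;
}

/* Returns 1 if the region lies entirely inside a parent of parent_size bytes. */
static int region_fits(region_t r, size_t parent_size)
{
    return r.size > 0 && r.origin <= parent_size
        && r.size <= parent_size - r.origin;
}
```

For example, the regions {0, 256} and {256, 256} of a 512-byte buffer do not overlap and can safely go to two different cards, while {128, 256} overlaps both of them.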

User events

The clCreateUserEvent and clSetUserEventStatus functions have appeared, which let you create your own events. Previously, everything could only be tied to events produced by OpenCL commands. Again, you could do without this (simply by enqueueing new commands once the event of interest has occurred), but it has become more convenient.

Event callbacks

The clSetEventCallback function has appeared, which registers user-defined functions to be called when the status of an event changes accordingly. You can register several callbacks, but the order in which they are called is undefined. I like this addition already: before, I had to check the status of the event every few ms; now it is convenient and, most importantly, that polling time is saved and the whole process is simpler. Somehow it reminds me of ActionScript :)

By the way, another useful function may be clSetMemObjectDestructorCallback, which registers a user-defined function that is called when a cl_mem object is deleted.

3-component vector

Previously, vectors came only in sizes 2, 4, 8 and 16, but we all live in 3-dimensional space, and three-component vectors are a popular thing. If you use a 4-component vector instead, you lose a whole register, which is a great waste. You can of course define your own type, but it will not be handled by the built-in geometric functions of OpenCL. That is not a disaster, you can write your own functions (whether the built-in ones are hardware-accelerated is still a mystery to me, I will have to check sometime), but native support is certainly more beautiful (more beautiful in my case = easier to maintain).

Global work-offset which enables kernels to operate on different portions of the NDRange

Honestly, I always thought this was already there; the parameter appears in the official documentation for both versions. If anyone knows what has changed here, please write.
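
For reference, here is how I picture the typical use of a global offset: one 1D index range is cut into consecutive pieces, and each piece is launched separately (for example on its own card) with its own offset and size. The sketch below is plain C with no OpenCL calls; launch_t and chunk_for are hypothetical names of mine, and the resulting pair is what I would expect to pass as the global offset and global size of one launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one launch: where it starts and how many work-items it has. */
typedef struct {
    size_t offset; /* first global id covered by this launch */
    size_t size;   /* number of work-items in this launch */
} launch_t;

/* Cuts a 1D range of `total` work-items into `parts` consecutive chunks
 * and returns chunk number `index`. The first (total % parts) chunks
 * get one extra work-item, so the chunks cover the range exactly. */
static launch_t chunk_for(size_t total, size_t parts, size_t index)
{
    assert(parts > 0 && index < parts);
    size_t base = total / parts;
    size_t extra = total % parts;
    launch_t l;
    l.size = base + (index < extra ? 1 : 0);
    l.offset = index * base + (index < extra ? index : extra);
    return l;
}
```

With 10 work-items and 3 chunks this gives the pairs (offset 0, size 4), (offset 4, size 3) and (offset 7, size 3), so a kernel does not need a hand-made "start index" argument.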

Read, write and copy a 1D, 2D or 3D rectangular region of a buffer object

The clEnqueueReadBufferRect, clEnqueueWriteBufferRect and clEnqueueCopyBufferRect functions have appeared, which let you copy a rectangular region of a buffer. I.e. if you store a matrix or an image (essentially the same matrix) in a buffer row by row but need only a part of it, you no longer have to read / copy / write the entire buffer or issue a command for every row: you simply specify the layout of the matrix and OpenCL does everything itself. Personally, I have never had to do this, but it should be useful for large matrices.
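
To show what "specifying the layout" means, here is a host-side model in plain C of what such a rectangular copy does. It is only a sketch: copy_rect and rect_offset are my own hypothetical helpers, not OpenCL functions, but the parameters follow the same idea of an origin, a region (width in bytes, rows, slices) and row / slice pitches, with the start offset computed as origin[2] * slice_pitch + origin[1] * row_pitch + origin[0].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the point `origin` (x in bytes, y in rows, z in slices). */
static size_t rect_offset(const size_t origin[3],
                          size_t row_pitch, size_t slice_pitch)
{
    return origin[2] * slice_pitch + origin[1] * row_pitch + origin[0];
}

/* Copies a box of region[0] bytes x region[1] rows x region[2] slices
 * from src to dst, one row at a time. Pitches are given in bytes. */
static void copy_rect(const unsigned char *src, unsigned char *dst,
                      const size_t src_origin[3], const size_t dst_origin[3],
                      const size_t region[3],
                      size_t src_row_pitch, size_t src_slice_pitch,
                      size_t dst_row_pitch, size_t dst_slice_pitch)
{
    assert(src != NULL && dst != NULL);
    size_t s0 = rect_offset(src_origin, src_row_pitch, src_slice_pitch);
    size_t d0 = rect_offset(dst_origin, dst_row_pitch, dst_slice_pitch);
    for (size_t z = 0; z < region[2]; ++z) {
        for (size_t y = 0; y < region[1]; ++y) {
            memcpy(dst + d0 + z * dst_slice_pitch + y * dst_row_pitch,
                   src + s0 + z * src_slice_pitch + y * src_row_pitch,
                   region[0]);
        }
    }
}
```

For a 4x4 byte matrix filled with 0..15, copying a 2x2 region from origin (1, 1, 0) into a tightly packed 2x2 destination yields the bytes 5, 6, 9, 10.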

Mirrored repeat addressing

If you store the image as a texture, you can use CLK_ADDRESS_MIRRORED_REPEAT so that when a coordinate goes past the border of the texture, the result is read as if the image were mirrored there. This works as many times as necessary: mirrored and unmirrored copies alternate.

Various other additions and improvements
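
The alternating reflections are easiest to see on integer pixel indices. The sketch below is a plain C model of the idea, not the sampler itself: the real addressing mode works on normalized floating-point coordinates inside the kernel, and mirrored_index is a hypothetical helper of mine that maps any integer index to the pixel a mirrored-repeat lookup would land on.

```c
#include <assert.h>

/* Maps an arbitrary integer pixel index onto [0, width) with
 * mirrored repetition: 0..width-1, then width-1..0, then again 0..width-1. */
static int mirrored_index(int i, int width)
{
    assert(width > 0);
    int period = 2 * width;   /* one forward copy plus one mirrored copy */
    int m = i % period;
    if (m < 0) {
        m += period;          /* C's % can be negative for negative i */
    }
    return m < width ? m : period - 1 - m;
}
```

For width 4 the indices 3, 4, 5, 6, 7, 8 map to 3, 3, 2, 1, 0, 0, and on the negative side -1 maps to 0 and -4 maps to 3.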

The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter of the clGetKernelWorkGroupInfo function should suggest a multiple for the work-group size that suits the given device. If I understand correctly, it gives us the warp size (so that a group always consists of several whole warps) and is intended for auto-tuning an application on different platforms. But I have not tried it yet, so I could be wrong.
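
Assuming the query really does return something like the warp size (say 32, which I have not verified), the auto-tuning I have in mind boils down to two small calculations. The sketch is plain C; pick_local_size and round_up_to_multiple are hypothetical helpers of mine, and the values passed in are assumed to come from the corresponding clGetKernelWorkGroupInfo queries.

```c
#include <assert.h>
#include <stddef.h>

/* Largest multiple of `preferred` that does not exceed `max_group`;
 * falls back to `max_group` when even one multiple does not fit. */
static size_t pick_local_size(size_t max_group, size_t preferred)
{
    assert(max_group > 0 && preferred > 0);
    if (preferred > max_group) {
        return max_group;
    }
    return (max_group / preferred) * preferred;
}

/* Rounds n up to the next multiple of `multiple`, so that the global
 * size is divisible by the chosen local size. */
static size_t round_up_to_multiple(size_t n, size_t multiple)
{
    assert(multiple > 0);
    size_t rem = n % multiple;
    return rem == 0 ? n : n + (multiple - rem);
}
```

For 1000 elements and a local size of 32 the global size becomes 1024, so the kernel has to ignore work-items whose global id is 1000 or more.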

The CL_CONTEXT_NUM_DEVICES parameter of the clGetContextInfo function tells us how many devices are in the context.

Another useful thing is the CL_VERSION_1_0 and CL_VERSION_1_1 macros, which let a program find out which version of OpenCL it is being built against.

Several extensions have become part of the standard:
cl_khr_byte_addressable_store,
cl_khr_global_int32_base_atomics,
cl_khr_global_int32_extended_atomics,
cl_khr_local_int32_base_atomics,
cl_khr_local_int32_extended_atomics.
The prefix of the built-in atomic operations has also changed from atom_ to atomic_.

There are also 2 new extensions:
cl_khr_gl_event and cl_khr_d3d10_sharing.

There are also several new built-in functions:
get_global_offset,
minmag,
maxmag,
async_work_group_strided_copy,
vec_step,
shuffle,
shuffle2.
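
Of these, minmag and maxmag are probably the least self-explanatory, so here is a scalar model in plain C of what they compute as I read the specification: the argument with the smaller (or larger) absolute value, falling back to the ordinary minimum (or maximum) when the magnitudes are equal. The names minmag_model and maxmag_model are mine, and NaN handling is deliberately left out.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static float mag(float v)
{
    return v < 0.0f ? -v : v;
}

/* x if |x| < |y|, y if |y| < |x|, otherwise the smaller of the two. */
static float minmag_model(float x, float y)
{
    assert(x == x && y == y); /* NaN inputs are not modelled */
    if (mag(x) < mag(y)) return x;
    if (mag(y) < mag(x)) return y;
    return x < y ? x : y;
}

/* x if |x| > |y|, y if |y| > |x|, otherwise the larger of the two. */
static float maxmag_model(float x, float y)
{
    assert(x == x && y == y); /* NaN inputs are not modelled */
    if (mag(x) > mag(y)) return x;
    if (mag(y) > mag(x)) return y;
    return x > y ? x : y;
}
```

So minmag_model(-1.0f, 3.0f) gives -1.0f, maxmag_model(-5.0f, 3.0f) gives -5.0f, and for the tie 2.0f / -2.0f the results are -2.0f and 2.0f respectively.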

I would describe all the updates in more detail, but I want to sleep. I hope someone finds this list useful.

PS: Once again I did not get around to real OpenCL examples and tests in this post. I promise to fix that later; I have a lot of ideas, but for now we will live without them :)

PS2: In the comments to my previous post, "OpenCL: versatility and high performance, or not so simple?", people asked about a book on OpenCL, and at the time I had nothing to answer. Well, today I discovered that the book "OpenCL Programming Guide" was published just a week ago. I have not read it yet, but at a quick glance it looks quite interesting.

Source: https://habr.com/ru/post/125687/
