
Using OpenCL in Python

In recent years, parallel computing has firmly entered everyday practice, in particular computing on the GPU.

There have been many articles on this topic, so I will limit myself to a superficial description of the technology. GPGPU is the use of GPUs for general-purpose computation, i.e. for tasks not directly related to rendering. A familiar example is Nvidia's PhysX library, which computes physics in some modern games. The technology pays off because GPUs are very good at running many threads in parallel; there must be a lot of threads, though, otherwise performance drops dramatically. Also, because of how GPU memory works, transferring data between RAM and video memory requires some care. The known implementations are CUDA (Nvidia, video cards only), OpenCL (Khronos Group, heterogeneous systems) and AMD FireStream. Only OpenCL will be discussed here.

So let's get down to practice. We will use Python as the main programming language. It is not particularly fast by itself, but it works fine as "glue": in many applications the heavy computation runs in OpenCL, while the Python code only "carries the ammunition". There is an excellent PyOpenCL library, which is what we will use.

Installation


But first of all, you need to install everything required to work with OpenCL. For Nvidia cards, OpenCL support ships with the CUDA SDK as well as with the official driver. As far as I know, nouveau has no support yet. For Windows there is an installer on the official website; for GNU/Linux you should install the necessary software from the repositories. On Arch Linux these are the cuda-toolkit, cuda-sdk, nvidia, nvidia-utils and nvidia-opencl packages. For Intel processors there is the very good Intel OpenCL SDK; AMD also has a native SDK.

The CUDA SDK includes a nice example, oclDeviceQuery, which prints detailed information about the available OpenCL devices. A very handy way to check that everything works.

The final chord is installing the libraries for Python: first NumPy, and second PyOpenCL itself.

Some comments on the practice


Actually, many of the important practical points were covered in a recent article, so I strongly advise you to read it as well. Here I will focus on the specifics of using PyOpenCL.

In general, it is worth recalling the main distinguishing features of GPUs. First, they demand a large number of threads; otherwise performance falls. That is, unlike CPUs, they prefer many lightweight threads to a few heavy ones. Second, you should be very careful when working with memory: the memory controller delivers the promised GB/s only when a large amount of data is transferred at a time. Keep in mind that porting an algorithm to a GPU is like going camping: you have to take everything with you, because from inside the GPU you can reach neither RAM nor disk, nor allocate a large new block of video memory. And third, conditional branches are handled very poorly, so they should be kept to a minimum.

With this in mind, remember that not all algorithms work well on the GPU. Image processing and mathematical modeling seem to be the best fits for video cards.

Initialization


Working with OpenCL begins with device initialization. The official documentation has an example of using create_some_context() to pick a default device with a minimum of code. I will not copy-paste it here, but it is well worth getting acquainted with it first.
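As a reminder of what that minimal path looks like, here is a sketch in the spirit of the classic vector-add example from the PyOpenCL documentation. The array sizes and kernel name here are my own choices, and the sketch falls back to plain NumPy when no OpenCL runtime or platform is available:

```python
import numpy as np

a = np.random.rand(50_000).astype(np.float32)
b = np.random.rand(50_000).astype(np.float32)

try:
    import pyopencl as cl

    # let PyOpenCL pick a default device without prompting
    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    res_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *res)
    {
        int gid = get_global_id(0);
        res[gid] = a[gid] + b[gid];
    }
    """).build()

    # one work-item per array element
    prg.add(queue, a.shape, None, a_buf, b_buf, res_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, res_buf)
except Exception:
    # no OpenCL runtime / device available -- compute on the host instead
    result = a + b
```

Either way, result ends up equal to the element-wise sum of a and b.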

Here I will consider a somewhat more complicated case: selecting a specific device and passing parameters of several different types. Note that arrays passed to a device (hereinafter "device" means an OpenCL device, i.e. the one the OpenCL code will run on) cannot be built-in lists or tuples; only NumPy arrays can be used. This is because the library expects an array with a single element type.
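This constraint is easy to see on the host side (the particle values below are made up for illustration): a list of tuples must first be converted into a single-dtype NumPy array before it can serve as a hostbuf:

```python
import numpy as np

# A Python list may mix element types; an OpenCL buffer cannot.
# PyOpenCL therefore expects a NumPy array with one fixed dtype:
particles = [(1.0, 2.0, 3.0, 0.0), (4.0, 5.0, 6.0, 0.0)]
arr = np.array(particles, dtype=np.float64)

# the result is one contiguous block of float64 values,
# ready to be handed to cl.Buffer via hostbuf=
print(arr.dtype)                  # float64
print(arr.nbytes)                 # 64 (2 rows * 4 components * 8 bytes)
print(arr.flags['C_CONTIGUOUS'])  # True
```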

Initialization Host Code
import pyopencl as cl
import numpy as np

set_simple_args(model_params)
set_device()
create_buffers(np.array(particles, dtype=np.float64))
program = cl.Program(context, open('worker.cl').read()).build()

def set_device():
    device = cl.get_platforms()[settings.DEVICE_NUM] \
        .get_devices(getattr(cl.device_type, settings.CL_DEVICE))[0]
    context = cl.Context(devices=[device], dev_type=None)
    cqueue = cl.CommandQueue(context)

def set_simple_args(mp):
    # wrap the "simple" arguments in NumPy scalar types,
    # so that OpenCL receives values of a known size
    central_size = np.float64(mp['central size'])
    dt = np.float64(mp['dt'])
    skip = np.int32(mp['skip'])
    q_e = np.float64(mp['e charge'])
    q_i = np.float64(mp['ion charge'])
    ion_m = np.float64(mp['ion mass'])

def create_buffers(particles):
    # particles -- the array of particle data to put into video memory
    mf = cl.mem_flags
    buf_particles_1 = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=particles)
    buf_particles_2 = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=particles)
    buf_charge = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=np.float64(0.0))
    buf_q_e = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=np.int32(0))
    buf_q_i = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=np.int32(0))

What is going on here? First, the "simple" arguments are set; they are passed to the kernel as constants. Next, the desired device is configured: settings.DEVICE_NUM is the device number, settings.CL_DEVICE is its type (GPU, CPU, or something more exotic). Then the video-memory buffers are created; this step is needed only for arrays and for variables whose values must be read back after the kernel has run. Finally, a program object is created (worker.cl holds the kernel source code).

It should be noted that on a compilation error an exception is raised and a corresponding message is printed. You can also turn on warnings and other chatter from the compiler by adding export PYOPENCL_COMPILER_OUTPUT=1 to your .bashrc. Worth mentioning here is the Intel compiler, which, unlike Nvidia's more Unix-way one, always explicitly reports how well it managed to vectorize everything.

Kernel startup


Everything is relatively simple here. Our arguments are passed as parameters to the kernel launch function (via __call__()). In this code, each new iteration is a new kernel launch.

Kernel startup host code
def iterate():
    program.dust_plasma(cqueue, [global_size], None,
                        buf_particles_1, buf_particles_2, buf_charge,
                        buf_q_e, buf_q_i,
                        q_e, q_i, ion_m,
                        central_size, outer_boundary, dt, skip)
    output_particles = empty_particles
    cur_charge = np.empty_like(np.float64(0.0))
    # copy the results from buf_particles_2 into output_particles
    cl.enqueue_copy(cqueue, output_particles, buf_particles_2)
    cl.enqueue_copy(cqueue, cur_charge, buf_charge)
    return output_particles, cur_charge

For ease of understanding, here is a snippet of the OpenCL kernel code (as you can see, it is something like C):

Kernel code snippet
__kernel void dust_plasma(__global double4* particles_1, // particle state, previous step
                          __global double4* particles_2, // particle state, next step
                          __global double* dust_charge,
                          __global int* charge_new_e,
                          __global int* charge_new_i,
                          __const double charge_e,
                          __const double charge_i,       // electron and ion charges
                          __const double ion_mass,       // ion mass
                          __const double central_size,   // size of the central particle
                          __const double outer_boundary, // outer boundary of the region
                          __const double dt,
                          __const unsigned skip)
{
    int id = get_global_id(0);
    // ...
    for(unsigned i = 0; i < skip; i++)
    {
        //printf("Iteration %i\n", i);
        params.previous = &particles_1;
        params.next = &particles_2;
        one_iteration_rk2(&params);

        params.previous = &particles_2;
        params.next = &particles_1;
        one_iteration_rk2(&params);
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
    // ...
}

Conclusion


It should be noted that the documentation is very sensible and convenient. If you have questions about the library, you can write to the pyopencl@tiker.net mailing list; the community is quite lively and responds quickly. The library is distributed under the liberal MIT license, which is very nice.

As for the OpenCL language itself, I believe no special manuals are needed to learn it. Just write as in C99, occasionally recalling the limitations (the compiler will prompt you if anything) and leafing through the specification. Perhaps the one thing worth knowing in advance is the built-in vector data types, such as float4 (a value consisting of four separate floats; working with one is faster than with four separate floats).
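The kernel above uses double4, the double-precision sibling of float4. As a rough host-side illustration (the array shape here is my own choice, not part of the article's code), a double4 occupies 4 * 8 = 32 bytes, so an array of N such values maps naturally onto an (N, 4) float64 NumPy array:

```python
import numpy as np

# hypothetical host-side mirror of a device array of N double4 values
n = 1000
particles = np.zeros((n, 4), dtype=np.float64)

# one row = one double4 = 4 components * 8 bytes = 32 bytes,
# so consecutive double4 elements sit 32 bytes apart in memory
print(particles[0].nbytes)  # 32
print(particles.strides)    # (32, 8)
```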

I would also like to thank Andreas Klöckner, the author of PyOpenCL, for the excellent library.

Please report typos and errors in the text via private message.

EDIT. Corrected typos.

The text of the article is distributed under the Creative Commons Attribution-ShareAlike 3.0 license , code snippets are MIT.

Source: https://habr.com/ru/post/146993/

