
Many people who have tried to get a “taste” of GPU computing with CUDA or OpenCL have been left unimpressed. Yes, the tests run, and simple examples show spectacular speedups, but when it comes to real-world algorithms, a good result is very hard to obtain.
How do you make this technology actually work?
In this article I have tried to sum up six months of wrestling with OpenCL under Mandriva Linux and Mac OS X 10.6 on complex string-search problems in bioinformatics. OpenCL was chosen because it is the “native” technology for the Mac (some Macs ship with AMD graphics cards, on which CUDA is not available even in theory), but the recommendations below are fairly universal and apply just as well to NVIDIA CUDA.
So, what does it take to put a graphics accelerator to work?
Code parallelism
1. Remember (and, if you never knew them, learn) the techniques of code refactoring. If your algorithm is at all atypical, be prepared to torment it for quite a while before it parallelizes properly. Throughout that torment the program's results must not change, so without a good test case you should not even begin. The test case must not run too long, because you will be launching it constantly, but not too quickly either, since you need to measure its speed reliably. About 20 seconds is optimal.
2. Massive parallelism on an accelerator implies that the algorithm takes data arrays as input and produces data arrays as output, and the size of those arrays cannot be smaller than the number of threads of the parallel algorithm. An algorithm that accelerates well usually looks like a for loop (or several such loops nested) in which no iteration depends on the results of the previous ones; a sketch of such a loop follows this list. Do you have an algorithm like that?
3. Parallel programming is no easy matter by itself, even without graphics accelerators. It is therefore highly recommended to parallelize your algorithm first with something simpler, for example OpenMP, where parallelism is switched on by a single directive (see the second sketch after this list). Just do not forget that if the loop body uses any buffer variables, in the parallel version they must either be replicated, one per iteration, or created inside the loop!
4. To avoid losing dozens of hours, you must be 100% certain that at least the parallel part of the program is completely free of memory errors. This can be checked, for example, with valgrind (a plain `valgrind ./your_program` run catches memory errors; its Helgrind tool can also catch threading errors in OpenMP code). It is better to catch everything in advance, before you move to the accelerator, where the tools are far scarcer.
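
To illustrate point 2, here is a minimal sketch (the function and data are invented for the example) of the kind of loop that maps well onto an accelerator: every iteration depends only on its own input elements and touches only its own output element.

```c
#include <math.h>
#include <stddef.h>

/* Invented example: each out[i] is computed from a[i] and b[i] alone,
 * never from other iterations, so the loop body can run for all i
 * in any order -- or all at once on an accelerator. */
void score_pairs(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = sqrtf(a[i] * a[i] + b[i] * b[i]);
}
```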
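And for point 3, a sketch of the same shape of loop parallelized with OpenMP (a made-up row-sum example; with gcc, compile with -fopenmp). Note that the accumulator variable lives inside the loop, so every iteration gets its own private copy:

```c
#include <stddef.h>

/* Made-up example: sum every row of a flat nrows x ncols matrix. */
void row_sums(const float *matrix, float *sums, int nrows, int ncols)
{
    /* A single directive is enough to turn the loop parallel. */
    #pragma omp parallel for
    for (int i = 0; i < nrows; ++i) {
        float acc = 0.0f;  /* declared inside the loop: private per iteration */
        for (int j = 0; j < ncols; ++j)
            acc += matrix[(size_t)i * ncols + j];
        sums[i] = acc;
    }
}
```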
Taking the accelerator's features into account
1. You have to understand that the accelerator works with its own memory, whose size is limited, and that transfers back and forth are quite expensive. The exact figures depend on the particular system, but this question will have to be kept in mind the whole time. Accelerators work best with algorithms that take in relatively little data (though never less than the number of threads!) and grind on it mathematically for a long time; the host-side sketch after this list shows what the transfers look like. Long, but not too long: many systems impose a limit on the maximum execution time of a thread!
2. Memory is copied back and forth only in contiguous regions. So, for example, the usual C multidimensional arrays organized as arrays of pointers to pointers are completely unsuitable; you must use linearly organized (flat) multidimensional arrays, as in the sketch below.
3. In the “inner” code executed on the accelerator there is no allocating or freeing of memory, no I/O, and recursion is impossible. I/O and temporary buffers have to be provided by the code running on the CPU.
4. The “ideal” algorithm for an accelerator is one where the same code runs over different data (SIMD) and the control flow does not depend on that data (for example, vector addition; see the kernel sketch below). Any branching, any loop whose number of iterations depends on the data, causes a significant slowdown.
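
As promised in points 1 and 2, here is a host-side sketch of what the transfers look like (error handling is omitted and the context and queue are assumed to be created elsewhere; all names here are invented). Note that the 2-D array is stored as a single contiguous block:

```c
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Sketch only: no error checking; ctx and queue created elsewhere. */
void roundtrip(cl_context ctx, cl_command_queue queue,
               size_t nrows, size_t ncols)
{
    size_t bytes = nrows * ncols * sizeof(float);

    /* A flat array: element (i, j) lives at host[i * ncols + j].
     * An array of row pointers could not be copied in one piece. */
    float *host = malloc(bytes);

    /* The accelerator has its own memory... */
    cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* ...and every crossing of the boundary has a price. */
    clEnqueueWriteBuffer(queue, dev, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    /* ... enqueue and run kernels here ... */
    clEnqueueReadBuffer(queue, dev, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

    clReleaseMemObject(dev);
    free(host);
}
```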
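And for points 3 and 4, the textbook vector-addition kernel in OpenCL C: the same code for every element, no branches, no data-dependent loops, and, indeed, no way to allocate memory, do I/O, or recurse:

```c
/* One thread per element: thread i computes exactly c[i]. */
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    size_t i = get_global_id(0);  /* this thread's position in the data */
    c[i] = a[i] + b[i];
}
```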
Some conclusions
So, most likely, to harness the full power of GPGPU you will have to rewrite your code with the above limitations in mind, and it is far from certain that it will work right away, or work out at all. Not every task parallelizes well by data, alas. But the result is worth it: porting molecular dynamics problems to GPGPU allowed NVIDIA's specialists to obtain very interesting results, which in the realities of our institute would mean computing not for a month on a supercomputer in another city, but for a day on a desktop machine. And that is worth the effort.
Used materials:
NVIDIA OpenCL
A series of articles on Habr
OpenCL for Mac OS X