There have already been articles on Habré about OpenCL, CUDA and GPGPU with performance comparisons, basic concepts and examples, so I'm not going to cover the basics and principles of operation here; I won't even show code. What I do want to describe are the real difficulties of using the GPU (its limitations and their consequences), why it is impossible to compare CPU and GPU performance head-to-head, and how "universal" OpenCL really is.
Foreword
My acquaintance with GPGPU began 1.5 years ago and continues as an active research project. Back then I had to choose between OpenCL and CUDA; at the time there was little difference between them, but my university started teaching a course on OpenCL, so I picked it. I'll just note that I wrote only for NVidia cards, so that is what I'll talk about (most often Fermi).
At this point there was a long paragraph about the history and state of affairs in GPU computing, but after the problems were described the post grew too long and the paragraph was cut severely (there is hope it will return in the next part). So let us proceed straight to why algorithms ported to the GPU do not always run fast, i.e. in practice give 0.5X-10X speedups instead of the promised 20X-100X over the CPU (otherwise every application would already be using it).

How slow is it?
So, we all know that the GPU architecture differs significantly from the CPU, but few people think about how large this difference is and how much it affects algorithm design for the GPU. A human, although a rather parallel system himself, is used to thinking about algorithms sequentially. For the past XX years processors have indulged us in this, and we are all accustomed to one instruction executing after another. We are also accustomed to the resources available to a program being practically unlimited (microcontrollers aside) and to data being available almost immediately. Almost all programming and optimization techniques are built on this. But on the GPU this does not work, and I want to try to describe the consequences of our habits.
Restriction one: 32 threads (a warp) always execute ONE instruction
If somewhere before this instruction there was a branch and the threads took different paths, the GPU will execute both branches sequentially.
Thus, an attempt to simplify the calculation for a particular case (when both a general solution and a short special-case solution are known) can lead not to a speedup (as it always does on the CPU), but to the execution times of the general and special cases being added together.
Another example: each thread chooses a different algorithm depending on its data. Say we need to compute the distance from a point to a geometric shape, and each thread gets a different shape and, accordingly, a different algorithm. As a result, the total time will be the sum of the execution times of the algorithms for all the objects.
And it turns out we compute just as sequentially as on the CPU (and with many nested branches the GPU computes much more than the CPU would), and in sequential execution the GPU is tens of times slower. Pay attention to how many if-s there are in your programs. If all 32 threads take the same path, everything is fine, but does that really happen often in every branch?
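The effect above can be shown with a toy Python model (the cycle counts are purely illustrative, and `warp_time` is a made-up helper, not a real GPU API): a divergent warp pays the sum of all branches any of its threads takes.

```python
# Toy model of warp divergence: a warp of 32 threads executes every
# branch taken by at least ONE of its threads, one branch after another,
# so the warp's time is the SUM of the costs of all taken branches.
WARP_SIZE = 32

def warp_time(branch_costs, branch_taken_by_thread):
    """branch_costs: cycles for each branch body (illustrative numbers).
    branch_taken_by_thread: which branch each of the 32 threads takes."""
    taken = set(branch_taken_by_thread)   # distinct paths in the warp
    return sum(branch_costs[b] for b in taken)  # serialized execution

costs = {"general": 100, "special": 10}

# All 32 threads take the cheap special case: the warp pays 10 cycles.
uniform = warp_time(costs, ["special"] * WARP_SIZE)

# A single thread falls into the general case: the warp now pays
# 10 + 100 = 110 cycles, not a weighted average.
divergent = warp_time(costs, ["special"] * 31 + ["general"])

print(uniform, divergent)  # 10 110
```

So one stray thread in a warp is enough to drag the whole warp through the expensive branch.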
The second limitation: every memory access reads 128 consecutive bytes, even if we only need 1 byte.
Moreover, a single thread can access at most 16 of those 128 bytes at a time.
The result: the memory bandwidth is over 150 GB/s, but only if all 128 bytes are actually used every time. If each thread must read one large structure weighing 40 bytes, then each thread must make 3 memory requests and transfer 3 * 128 bytes. And if the data for each thread lie in different places (the thread receives a pointer and then loads through it, a perfectly normal situation on the CPU, where memory is spent rationally), then the useful bandwidth is 40 * 32 / (128 * 3 * 32), that is, about 10% of the real figure.
And once again we are down to CPU-level memory bandwidth. You can of course point out that there is a cache, but it appeared only with Fermi and it is not that big, although it does help significantly. On the other hand, recall that on the first GPU generations, even when reading 128 bytes, if the accesses were out of order and/or shifted by as little as one byte, a separate memory request was issued for every thread.
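The ~10% figure from the text can be checked with simple arithmetic (constants taken straight from the article; the 150 GB/s peak is the author's round number):

```python
# Scattered 40-byte structures read by a warp of 32 threads:
# each access transfers a full 128-byte segment, and a single thread
# can pull at most 16 bytes per access, so 40 bytes need 3 accesses.
WARP_SIZE = 32
SEGMENT = 128                 # bytes transferred per memory transaction
MAX_PER_ACCESS = 16           # bytes one thread can load per access
STRUCT = 40                   # useful bytes each thread actually needs

accesses_per_thread = -(-STRUCT // MAX_PER_ACCESS)        # ceil(40/16) = 3
useful = STRUCT * WARP_SIZE                               # 1280 bytes
transferred = SEGMENT * accesses_per_thread * WARP_SIZE   # 12288 bytes
efficiency = useful / transferred                         # ~0.104

PEAK_GBPS = 150.0
print(accesses_per_thread, f"{efficiency:.1%}",
      f"{efficiency * PEAK_GBPS:.1f} GB/s effective")
```

About 10.4% of peak, i.e. roughly 15 GB/s of useful bandwidth out of 150.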
The third limitation: memory latency is approximately 800 cycles per request
And in the last example, for all the threads to obtain their data, 3 * 32 requests are needed... almost 80 thousand cycles... What should be done during this time? Execute other threads. And here new restrictions appear.
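The "almost 80 thousand cycles" estimate is just the previous example's request count times the latency, assuming the requests are fully serialized:

```python
# Worst case for the scattered-structure example: 32 threads make
# 3 requests each, ~800 cycles of latency per request, no overlap.
LATENCY_CYCLES = 800
requests = 3 * 32              # 96 requests for the whole warp
stall = requests * LATENCY_CYCLES
print(stall)  # 76800 cycles, i.e. "almost 80 thousand"
```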
Fourth restriction: 32K registers are shared among all active threads of a multiprocessor
At first that seems like a lot, but they are allocated to all active threads, not just the running ones (and they are allocated statically for the worst case: however many the worst branch needs is how many will be allocated). And to hide memory latency there should be 1536 active threads (try to calculate whether it is easy to hide the 80 thousand cycles from the last example), which leaves 21 registers per thread. Try to implement a complex algorithm and fit into 21 registers (these are not only your variables, but also intermediate results of operations, loop counters, etc., etc.). Alternatively, you can try to use fewer than fifteen hundred active threads, and then the following limitation appears.
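The 21-registers figure is simple integer division of the register file by the required thread count (numbers as given in the text):

```python
# Register budget per thread at full occupancy on Fermi (per the text).
REGISTER_FILE = 32 * 1024     # 32K registers per multiprocessor
ACTIVE_THREADS = 1536         # threads needed to hide memory latency

per_thread = REGISTER_FILE // ACTIVE_THREADS
print(per_thread)  # 21 registers for everything a thread does
```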
The fifth limitation: the Fermi thread scheduler can start threads only in groups of 512 (before Fermi it was simpler, around 128)
That is, only 3 options are available: 1536 threads if each uses no more than 21 registers, 1024 threads if no more than 32 registers are used, or 512 threads otherwise. A smaller number of threads means serious trouble hiding memory latency and a multiprocessor idling for thousands of cycles.
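All three options fall out of the same division once the 512-thread launch granularity is applied:

```python
# Occupancy options on Fermi per the text: thread counts come in
# multiples of 512, and the 32K register file is split evenly.
REGISTER_FILE = 32 * 1024
GROUP = 512

options = {g * GROUP: REGISTER_FILE // (g * GROUP) for g in (3, 2, 1)}
for threads, regs in options.items():
    print(f"{threads} threads -> {regs} registers each")
# 1536 threads -> 21 registers each
# 1024 threads -> 32 registers each
#  512 threads -> 64 registers each
```

Past 64 registers per thread there is no lower occupancy step left, which is exactly where the next restriction kicks in.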
And this is much worse than on the CPU. And the worst case happens when each thread uses more than 64 registers.
Restriction six: if a thread uses more than 64 registers, the excess is stored in global memory
I still cannot believe it is global memory rather than local, but that is what the documentation says. Which means additional memory requests. By the way, function calls use a stack, which also occupies registers (yes, functions are bad).
To fight register pressure and to optimize loads there is also shared memory. But it is only 16/48 KB, and it is divided among all active groups: if each group eats 25 KB, then no more than one group will run at a time, with all the ensuing consequences.
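The same integer-division logic applies to shared memory (using the 48 KB configuration mentioned in the text):

```python
# How many groups fit in shared memory if each group eats 25 KB?
SHARED_MEM = 48 * 1024        # bytes per multiprocessor (48 KB config)
PER_GROUP = 25 * 1024         # shared memory consumed by one group

concurrent_groups = SHARED_MEM // PER_GROUP
print(concurrent_groups)  # 1 -> only one group runs at a time
```

With the 16 KB configuration not even one such group would fit.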
Seventh restriction: every kernel launch is accompanied by a small delay
Actually, it is not so terrible here: the delay is measured in tens of microseconds. But if you launch kernels 1000 times, that is already tens of milliseconds, which for real-time computations (for example, rendering) immediately imposes a 15 FPS cap, even before counting the time of the calculations themselves.
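To reproduce the 15 FPS figure, assume a launch overhead of about 65 µs (my assumed value; the text only says "tens of microseconds"):

```python
# Frame-rate ceiling from kernel launch overhead alone.
LAUNCH_OVERHEAD_S = 65e-6     # ASSUMED ~65 us per launch (illustrative)
LAUNCHES_PER_FRAME = 1000

overhead_per_frame = LAUNCHES_PER_FRAME * LAUNCH_OVERHEAD_S  # 0.065 s
fps_cap = 1.0 / overhead_per_frame
print(f"{fps_cap:.1f} FPS at best")  # ~15.4 FPS before any actual work
```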
There should have been a conclusion, but it will be next time.
I got carried away, and the list has already grown too long. But you still need to remember synchronization, atomic operations, copying data to the device, load balancing across multiple cards (SLI does not work here), precision, special functions, buggy drivers, debugging, and much more. And about the real universality of OpenCL there is a lot to say as well. Okay, let's postpone all that until the following parts.
And, in general, developers do of course learn (with experience, after much agony) about many (but not all) of these constraints and try to optimize their code around them. But imagine how much time it takes to rework algorithms that were designed without the GPU in mind, and some algorithms cannot be reworked in principle. Still, I am slowly "comprehending Zen" and realizing that things are not so bad and you can get the promised teraflops; I promise to write about that too in the following parts of the OpenCL story.