A continuation of Article 1 and Article 2.
Below the cut I will describe the author's experience of using the GPU for computation, including in the course of building a bot for the AI mini cup. But really it is more of an essay about the GPU.
- Your name is magic...
- What's up, Joel?.. The magic goes away...
As a child, at the age when chemistry at school has not yet started or is only just beginning, the author was fascinated by the combustion reaction. His parents, as it turned out, did not stand in his way, and the Moscow wasteland near the house was occasionally lit up by the flashes of various children's experiments: rockets on homemade black powder, on sugar-nitrate caramel, and so on. Two events put a limit on these childhood fantasies: a nitroglycerin decomposition reaction in the home laboratory that left the ceiling sizzling with acid, and a trip to the juvenile room of the police after an attempt to obtain chemical reagents at one of the defense enterprises scattered around the Aviamotornaya district.
Then came a physics school with Yamaha MSX computers, a programmable MK calculator at home, and there was no time left for chemistry. The child's interests shifted to computers. Yet from his first acquaintance with a computer, what the author missed was the combustion reaction: his programs merely smoldered, there was no feeling of raw natural power. One could watch computations being optimized in games, but back then the author did not know that sin() could be replaced with a lookup table of its values; there was no Internet...
So the author was able to recover that feeling of joy from computing power, of clean burning, only in GPU computations.
There are several good articles on Habré about computing on the GPU, and there are plenty of examples on the Internet, so I simply decided to write, on a Saturday morning, about personal impressions and perhaps nudge someone else towards mass parallelism.
Let's start with the simple forms. Several frameworks support computing on the GPU, but the best known are NVIDIA CUDA and OpenCL. We will take CUDA, and that immediately narrows our choice of programming languages down to C++. There are libraries that connect to CUDA from other languages, for example Alea GPU for C#, but that is a topic for a separate review article.
Just as nobody ever managed to build a mass-produced car with a jet engine, even though some of its characteristics exceed those of an internal combustion engine, parallel computing cannot always be applied to real-world problems. The main prerequisite for parallel computing is a task containing some element of mass, of multiplicity. In our case of building a bot, that is the neural network (many neurons and connections) and the population of bots: computing the motion dynamics and collisions for each bot takes time, and with 300-1000 bots the CPU gives up and you watch your program slowly degrade, with long pauses between visualization frames, for example.
The best kind of multiplicity is when each element of the computation does not depend on the results of computations on other elements of the list. For example, even the simple task of sorting an array already requires all sorts of tricks, because the position of a number in the array depends on the other numbers and cannot be handled by a plain parallel loop. To simplify the formulation even further: the first sign of a promising mass task is that you do not need to change the position of an element in the array. You may freely compute on it, and read other elements while doing so, but do not move it. Something like a fairy tale: do not change the order of the elements, or the GPU turns into a pumpkin.
Modern programming languages have constructs that run in parallel on several CPU cores or logical threads, and they are widely used; but here the author wants to focus the reader's attention on mass parallelism, where the number of executing units runs into the hundreds or thousands.
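For contrast, here is a minimal sketch (not from the original bot code) of such a CPU-side construct, using the C++17 parallel algorithms; the runtime spreads the loop over a handful of cores, not thousands:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> data(1'000'000, 1.0f);
    // The runtime distributes this loop over the available CPU cores:
    // a handful of executors, not the hundreds or thousands of a GPU.
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [](float& x) { x = x * x + 1.0f; });
    return 0;
}
```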
And so the first element of parallel constructs appears: the parallel loop. For most tasks it will be enough; in a broad sense, it is the quintessence of parallel computing.
An example of such a main loop inside a CUDA kernel:
```cpp
int tid = blockIdx.x * blockDim.x + threadIdx.x; // global index of this thread
int threadN = gridDim.x * blockDim.x;            // total number of threads in the grid
for (int pos = tid; pos < numElements; pos += threadN)
{
    // Each thread processes the element at its own pos, without touching
    // the positions of the others. Note that the order of execution is
    // not guaranteed: the thread with pos = 1146 may well run before
    // the thread with pos = 956.
}
```
Much has been written in the CUDA documentation and reviews about GPU blocks, about the threads spawned in those blocks, and about how to spread a task across them. But if you have an array of data that clearly consists of mass elements, use the loop form above: visually it looks like an ordinary loop, which is pleasant, though unfortunately only in form, not in content.
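For completeness, a small sketch of how such a loop sits inside a whole kernel and how the kernel is launched from the host; the kernel name squareAll and the 128 x 256 launch configuration are illustrative assumptions, not code from the bot:

```cpp
#include <cuda_runtime.h>

// A toy kernel: square every element of the array in place.
__global__ void squareAll(float* data, int numElements)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int threadN = gridDim.x * blockDim.x;
    for (int pos = tid; pos < numElements; pos += threadN)
        data[pos] *= data[pos];
}

// Host side: thanks to the grid-stride loop, a fixed launch
// configuration covers an array of any size.
// squareAll<<<128, 256>>>(d_data, numElements);
```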
I think the reader can already see how quickly the class of tasks suited to mass parallel programming narrows. When it comes to games, 3D rendering engines, neural networks, video editing and other similar tasks, the field for independent work is already heavily trampled: there are big programs, small programs, frameworks and libraries, known and unknown, for all of them. So what remains of the topic is exactly this: to build your own small computational rocket. Not SpaceX or Roscosmos, but something home-made, yet fierce enough for your own computations.
(Image: the flame of a rocket.)
Now, about the tasks that the parallel loop in your hands cannot solve. The creators of CUDA at NVIDIA have already thought about this too.
There is the Thrust library, in places useful to the point of there being "no other option". By the way, I did not find a full review of it on Habré.
To understand how it works, we first need three sentences about how CUDA itself works. If you need more words, there is a link you can read.
CUDA working principles:
Computations on the GPU are performed by a program called a kernel, and you have to write it in the C language. The kernel, in turn, communicates only with GPU memory, so you have to load data from the main program into the video card's memory and unload the results back. Sophisticated algorithms in CUDA require mental flexibility.
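A minimal host-side sketch of these three principles, reusing the illustrative squareAll kernel from above:

```cpp
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_data(n, 2.0f);                 // array in the main program
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));             // allocate GPU memory
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                 // load data onto the GPU
    squareAll<<<128, 256>>>(d_data, n);                 // the kernel sees only GPU memory
    cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                 // unload the results back
    cudaFree(d_data);
    return 0;
}
```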
So, the Thrust library removes the routine and takes on some of the tasks that are "hard" in raw CUDA, such as summing arrays or sorting them. You no longer need to write a separate kernel, allocate pointers, and copy data through those pointers into GPU memory: the whole mystery happens before your eyes in the main program, at a speed only slightly below hand-written CUDA. The Thrust library itself is written on top of CUDA, so in terms of performance they are cut from the same cloth.
All you need to do in Thrust is create an array inside its library (thrust::host_vector / thrust::device_vector) that is compatible with ordinary arrays (std::vector). Of course it is not quite that simple, but the author's point is close to the truth. In essence there are two arrays: one on the GPU (device), the other in the main program (host).
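A minimal sketch of those two arrays (nothing here beyond the Thrust quick start):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
    thrust::host_vector<int> h(4, 7);       // the array in the main program (host)
    thrust::device_vector<int> d = h;       // one assignment copies it to the GPU
    d[0] = 42;                              // element access hides a cudaMemcpy
    thrust::host_vector<int> back = d;      // and one more copies it back
    return 0;
}
```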
An example (with X, Y, Z as device arrays) shows the simplicity of the syntax:
```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>

thrust::device_vector<int> X(10), Y(10), Z(10);

// initialize X to 0, 1, 2, 3, ...
thrust::sequence(X.begin(), X.end());

// compute Y = -X
thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

// fill Z with twos
thrust::fill(Z.begin(), Z.end(), 2);

// compute Y = X mod 2
thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

// replace all the ones in Y with tens
thrust::replace(Y.begin(), Y.end(), 1, 10);
```
You can see how innocuous it looks against the background of writing a CUDA kernel, and the set of functions in Thrust is large: from working with random numbers (which in raw CUDA is handled by the separate cuRAND library, preferably run as a separate kernel) to sorting, summation, and writing your own functors that come close to full kernels in functionality.
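As an illustration of the sorting and summation just mentioned, a short hedged sketch; both calls run on the GPU without a single hand-written kernel:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main()
{
    thrust::device_vector<int> v(1 << 20);
    thrust::sequence(v.begin(), v.end());                       // 0, 1, 2, ...
    thrust::sort(v.begin(), v.end(), thrust::greater<int>());   // sort descending, on the GPU
    int sum = thrust::reduce(v.begin(), v.end(), 0);            // sum, also on the GPU
    (void)sum;
    return 0;
}
```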
The author's experience with CUDA and C++ is short, about two months; before that, about a year of C#. That, of course, slightly contradicts the beginning of the article about his early acquaintance with computers, a physics and mathematics school, and applied mathematics as an education. Let's just say it happened that way. But the reason I am writing this article is not to show off what I have learned, but to say that C++ turned out to be a comfortable language. I used to be afraid of it, against the background of Habré articles in the spirit of "lambda functions → overloading of internal operators, let's redefine everything", but clearly the years of its development have led to quite friendly development environments (IDEs). In its latest versions the language, with its modern constructs, behaves almost as if it collected garbage by itself; I do not know how it was before. At least the programs the author wrote, using the simplest algorithmic constructs, ran computational algorithms for bots for days, with no memory leaks or other failures under high load. The same goes for CUDA: it seems difficult at first, but it rests on simple principles, and of course initializing arrays on the GPU is laborious when there are many of them, but afterwards you have your own little rocket, with smoke coming from the video card.
As a class of objects for practicing with the GPU, the author recommends cellular automata. At one time they enjoyed a surge of popularity and fashion, but then neural networks took over the minds of developers.
Up to the claim that "every quantity in physics, including time and space, is finite and discrete" — and what is that, if not a cellular automaton?
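To make the recommendation concrete, here is a minimal sketch of one step of Conway's Game of Life as a CUDA kernel, assuming a toroidal W x H grid of 0/1 cells; the name lifeStep and the memory layout are illustrative:

```cpp
// One step of Conway's Game of Life: one thread per cell, and no cell
// ever changes its position in the array, only its value -- exactly
// the kind of mass task the GPU likes.
__global__ void lifeStep(const unsigned char* in, unsigned char* out,
                         int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                alive += in[((y + dy + H) % H) * W + ((x + dx + W) % W)];

    unsigned char cell = in[y * W + x];
    out[y * W + x] = (alive == 3 || (cell && alive == 2)) ? 1 : 0;
}
```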
But it is beautiful when three simple formulas can create this (see the video in the original post).
If you would like to read about cellular automata on CUDA, say so in the comments; there is enough material collected for a small article.
And here is the original source on cellular automata (links to the sources are under the video).
The idea of writing an article after breakfast, in a single breath, seems to have worked out. Time for a second coffee. Have a nice weekend, reader.
Source: https://habr.com/ru/post/417757/