
CUDA: right off the bat

Many of you have seen my introduction to modern high-performance computing and performance evaluation; now I will continue the topic with a more detailed look at CUDA technology.
For those who did not catch the previous series: CUDA lets you write programs in C++ with special extensions and run them on nVidia video cards (8xxx and above). On the right tasks, a significant performance gain over conventional CPUs is achieved.
The achievable performance is 1 trillion or more operations per second on the GTX295.

NB: This article is a brief introduction - it is hardly possible to cover all the nuances of CUDA programming in one article :-)

About the hardware


CUDA works on video cards from the 8400GS and up. Different video cards have different capabilities. In general, if you see that a video card has, for example, 128 SP (Streaming Processors), it means there are 16 SIMD MPs (multiprocessors), each of which performs 8 operations simultaneously. One MP has 16 KB of shared memory and 8192 4-byte registers (GTX2xx-series cards have higher values). There are also 64 KB of constants common to all MPs; they are cached, and when a value is not in the cache there is a rather large delay (400-600 cycles). There is also the global memory of the video card, access to which is not cached, and texture memory (cached, with the cache optimized for 2D sampling). To use several video cards you need, first, to disable SLI in the driver, and second, to start a separate host thread for each video card and call cudaSetDevice() in it.
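As a quick illustration (my own sketch, not part of the original article), the per-card limits mentioned above can be queried through the CUDA runtime API, and a device is selected like this:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count); // how many CUDA-capable cards are visible
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("%d: %s, %d MPs, %u bytes shared per block, %d registers per block\n",
               i, prop.name, prop.multiProcessorCount,
               (unsigned)prop.sharedMemPerBlock, prop.regsPerBlock);
    }
    cudaSetDevice(0); // in a multi-card program, each host thread passes its own index here
    return 0;
}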

Where to begin?


The fastest way to learn to program in CUDA is to take a sample from the SDK, get it running, and then gradually modify it, keeping it working (that is actually how I wrote my BarsWF) :-)
To begin with, go to http://www.nvidia.com/object/cuda_get.html and download the SDK and Toolkit for your operating system and the required bitness (unfortunately, a 32-bit SDK and a 64-bit Toolkit, for example, cannot be mixed). It is also worth updating the video card driver to the latest version (CUDA is developing rapidly, so having the latest drivers is always useful, both to you and to the users of your programs). Here I will cover development under Windows in Visual Studio (2005; recently 2008 has become possible as well).
As an example, take the Mandelbrot sample from the SDK. The most important part is the .cu file; pay attention to its Custom Build Rule:
$(CUDA_BIN_PATH)\nvcc.exe -ccbin "$(VCInstallDir)bin" -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(DXSDK_DIR)\Include" -o $(ConfigurationName)\Mandelbrot_sm10.obj Mandelbrot_sm10.cu

You can use it in all your projects; just specify an absolute path (or an environment variable) instead of "../../common/inc".
nvcc is the great and terrible CUDA compiler. Its output is an object file that already includes the compiled program for the video card.
Pay attention to the interface declarations in Mandelbrot_kernel.h - this is where you declare the kernels that we are going to call from the main C++ program (there are usually not many of them, so this is not a problem).
Once you have managed to build and run the SDK sample, you can look at how a CUDA program differs from an ordinary one.
NB: If you add the -keep parameter, after compilation you will be able to find many interesting intermediate files.
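For example, appending it to the build rule above gives (abbreviated; the other options stay as they were):
$(CUDA_BIN_PATH)\nvcc.exe -ccbin "$(VCInstallDir)bin" -keep -c ... Mandelbrot_sm10.cu
Among the files kept you will find the .ptx with the intermediate GPU assembly.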

Defining functions


Functions in the .cu file may be preceded by the following "modifiers":
__device__ - the function is executed only on the video card. It cannot be called from a program running on the ordinary processor (the host).
__global__ - this function is the entry point of your computational kernel. It is executed on the video card and can be launched only from the host.
__host__ - executed on, and callable only from, the host (i.e. an ordinary C++ function). If you specify, for example, both __host__ and __device__ before a function, two versions of it will be compiled (not all combinations are allowed).
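To make the modifiers concrete, here is a minimal sketch (the function names are invented for illustration):

__device__ float square(float x) // callable only from code running on the video card
{
    return x * x;
}

__host__ __device__ float twice(float x) // compiled in two versions: for the CPU and for the GPU
{
    return 2.0f * x;
}

__global__ void my_kernel(float *data) // the kernel entry point, launched from the host
{
    data[threadIdx.x] = twice(square(data[threadIdx.x]));
}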

Defining data


__device__ - means that the variable resides in the global memory of the video card (i.e. the 512-1024 MB and more). By the standards of video card computing this is very slow memory (though still several times faster than the memory available to the central processor), so it is recommended to use it as little as possible. Data stored here persists between calls to different kernels. From the host side it can be written and read with:
cudaMemcpy(device_variable, host_variable, size, cudaMemcpyHostToDevice); // or cudaMemcpyDeviceToHost for the opposite direction

__constant__ - places the variable in constant memory. Note that values for constants must be loaded with the cudaMemcpyToSymbol function. Constants are visible from all threads, and the access speed is comparable to registers (when the value is in the cache).
__shared__ - places the variable in the shared memory of a thread block (i.e. its value is common to all threads of the block). This must be approached with caution: the compiler aggressively optimizes accesses here (which you can suppress with the volatile modifier), race conditions are possible, and you need to call __syncthreads() to be sure the data has actually been written. Shared memory is divided into banks, and when two threads simultaneously access the same bank, a bank conflict occurs and speed drops.

All local variables that you define inside a kernel or a __device__ function live in registers, which have the highest access speed.
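Putting these together, a sketch (all names are invented; error checking omitted):

__device__ float g_data[256]; // global memory: persists between kernel launches
__constant__ float c_coeff; // constant memory: cached, loaded from the host

__global__ void scale_kernel(void)
{
    __shared__ float tile[256]; // shared by all threads of one block
    int i = threadIdx.x; // a local variable - lives in a register
    tile[i] = g_data[i] * c_coeff;
    __syncthreads(); // make sure every thread has written its element
    g_data[i] = tile[255 - i]; // reading a neighbour's element is safe only after the sync
}

// On the host side, __constant__ and __device__ variables are addressed by symbol:
// cudaMemcpyToSymbol(c_coeff, &host_coeff, sizeof(float));
// cudaMemcpyFromSymbol(host_data, g_data, sizeof(host_data));

Launched as scale_kernel<<<1, 256>>>(); so that one block of 256 threads matches the array size.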

How does a thread know what to work on?


The main idea of CUDA is that you launch thousands upon thousands of threads to solve your problem, so don't be scared by what is written below :-)
Suppose you need to perform some operation on a 200x200 image. The image is split into 10x10 squares, and a thread is launched for every pixel of every square. It looks like this:
dim3 threads(10, 10); // the size of one square, 10*10 pixels
dim3 grid(20, 20); // how many squares are needed to cover the whole image

your_kernel<<<grid, threads>>>(image, 200, 200); // this line launches 40'000 threads (not all at once; roughly 200-2000 will run simultaneously)

Unlike Brook+ from AMD, where we specify up front which thread works on which data, CUDA works differently: the kernel launch parameters are the same for all threads, and each thread must figure out by itself which data to work on; to do that, it computes where in the image it is located. The magic variables blockDim, blockIdx and threadIdx help with this:
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;

ix and iy are the coordinates with which you can fetch the source data from the image array and write back the result of the work.
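For completeness, here is a sketch of what your_kernel from the example above might look like (the operation itself, inverting pixel brightness, is invented for illustration):

__global__ void your_kernel(unsigned char *image, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= width || iy >= height) return; // guard: the grid may overhang the image
    image[iy * width + ix] = 255 - image[iy * width + ix];
}

For a 200x200 image with 10x10 blocks the guard never fires, but it costs almost nothing and keeps the kernel correct for sizes that do not divide evenly.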

Optimization


A few words about how to keep your program from being very slow (writing a program that runs slower than on the CPU is much easier than writing one that runs 10 times faster :-))


It doesn't work?


First of all, read the documentation shipped with the SDK (NVIDIA_CUDA_Programming_Guide, CudaReferenceManual, ptx_isa). Then you can ask on the official forum - even nVidia developers often reply there, and there are many smart people in general. In Russian you can ask me, for example, on a forum where I answer :-) A lot of people also live at gpgpu.ru.

I hope this introduction helps those who decide to try programming for video cards. If you run into problems or questions, I will be glad to help. Ahead of us are introductions to Brook+ and SIMD x86.

The original is here: http://3.14.by/ru/read/cuda-crash-course-vvedenie

Source: https://habr.com/ru/post/50241/

