
Parallel programming with CUDA. Part 1: Introduction

Another article about CUDA - why?


There have already been quite a few good articles about CUDA on Habr - one, two, and others. However, a search for the phrase "CUDA scan" turned up only two articles that have nothing to do with the scan algorithm on the GPU - even though it is one of the most fundamental parallel algorithms. So, inspired by the Udacity course Intro to Parallel Programming that I recently watched, I decided to write a more complete series of articles on CUDA. I will say right away that the series is based on that course, and if you have the time, it will be far more useful to take it yourself.

Contents


At the moment, the following articles are planned:
Part 1: Introduction.
Part 2: GPU hardware and parallel communication patterns.
Part 3: Fundamental GPU algorithms: reduction (reduce), scan, and histogram.
Part 4: Fundamental GPU algorithms: compaction (compact), segmented scan, sorting. Practical applications of some algorithms.
Part 5: Optimizing GPU programs.
Part 6: Examples of parallelizing sequential algorithms.
Part 7: Additional parallel programming topics, dynamic parallelism.

Latency vs throughput



The first question everyone should ask before using a GPU to solve a problem: what is a GPU actually good for, and when is it worth using? To answer it, we need to define two concepts:
Latency - the time it takes to execute a single instruction/operation.
Throughput - the number of instructions/operations completed per unit of time.
A simple example: we have a car with a speed of 90 km/h and a capacity of 4 people, and a bus with a speed of 60 km/h and a capacity of 20 people. If we take transporting one person one kilometer as an operation, then the latency of the car is 3600/90 = 40 s - that is how long it takes one person to cover one kilometer - and its throughput is 4/40 = 0.1 operations/second; the latency of the bus is 3600/60 = 60 s, and its throughput is 20/60 ≈ 0.33 operations/second.
So the CPU is the car and the GPU is the bus: it has high latency but also high throughput. If, for your task, the latency of each individual operation matters less than the number of operations completed per second, it is worth considering the GPU.

Basic CUDA Concepts and Terms


So, let's sort out the CUDA terminology:

  • Device - the GPU. It plays the "subordinate" role: it does only what the CPU tells it to.
  • Host - the CPU. It plays the controlling role: it launches kernels on the device, allocates memory on the device, and moves data to/from the device's memory. And yes, using CUDA assumes that the device and the host each have their own separate memory.
  • Kernel - a task that the host launches for execution on the device.

When using CUDA, you simply write code in your favorite programming language (here is the list of supported languages besides C and C++), after which the CUDA compiler generates code separately for the host and separately for the device. A small caveat: device code may be written only in C with some "CUDA extensions".

The main stages of a CUDA program


  1. The host allocates the required amount of memory on the device.
  2. The host copies the data from its own memory into the device's memory.
  3. The host launches the kernel on the device.
  4. The device executes the kernel.
  5. The host copies the results from the device's memory back into its own memory.

Naturally, for the most efficient use of the GPU, the ratio of the time spent running kernels to the time spent on memory allocation and data transfers should be as large as possible.
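In code, these stages map onto the CUDA runtime API roughly as follows - a minimal sketch of my own with error checking omitted; my_kernel and the h_in/h_out host buffers are hypothetical, and the <<<...>>> launch syntax is explained below:

 float *d_in, *d_out;
 const size_t bytes = n * sizeof(float);
 cudaMalloc(&d_in, bytes);                                 // 1. allocate device memory
 cudaMalloc(&d_out, bytes);
 cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // 2. copy input: host -> device
 my_kernel<<<gridSize, blockSize>>>(d_in, d_out, n);       // 3. launch the kernel...
                                                           // 4. ...which the device executes
 cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // 5. copy results: device -> host
 cudaFree(d_in);
 cudaFree(d_out);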

Kernels


Let us look in more detail at the process of writing kernel code and launching it. An important principle: kernels are written as (practically) ordinary sequential programs - you will not see any creation or launching of threads in the kernel code itself. Instead, to organize parallel computation, the GPU launches a large number of copies of the same kernel in different threads - or rather, you yourself specify how many threads to launch. And yes, returning to the question of using the GPU efficiently: the more threads you launch (provided they all do useful work), the better.
Kernel code differs from ordinary sequential code in the following ways:
  1. Inside a kernel you can find out the "identifier", or, more simply, the position of the thread in which it is currently running - thanks to this position, the same kernel can work on different data depending on the thread running it (a minimal sketch follows this list). By the way, this organization of parallel computation is called SIMD (Single Instruction Multiple Data): many processors simultaneously perform the same operation, but on different data.
  2. In some cases, the kernel code needs to use various synchronization methods.
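As an illustration of the first point, here is a minimal sketch of my own (a hypothetical vector addition, not code from this article), launched in a single block for simplicity:

 // every thread runs the same kernel; threadIdx.x selects this thread's element
 __global__ void add_vectors(const float* a, const float* b, float* c)
 {
   const int i = threadIdx.x;  // this thread's position within the block
   c[i] = a[i] + b[i];         // same instruction, different data (SIMD)
 }

 // launched, for example, as add_vectors<<<1, n>>>(d_a, d_b, d_c);
 // (n must not exceed the per-block thread limit discussed below)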

How do we specify the number of threads in which the kernel will run? Since the GPU is, after all, a Graphics Processing Unit, this naturally influenced the CUDA model, namely the way the number of threads is specified:
  • First, you set the dimensions of the so-called grid, in 3D coordinates: grid_x, grid_y, grid_z. As a result, the grid consists of grid_x * grid_y * grid_z blocks.
  • Then you set the block dimensions in 3D coordinates: block_x, block_y, block_z. As a result, each block consists of block_x * block_y * block_z threads. In total we get grid_x * grid_y * grid_z * block_x * block_y * block_z threads. An important note: the maximum number of threads in one block is limited and depends on the GPU model - typical values are 512 (older models) and 1024 (newer models).
  • Inside the kernel, the variables threadIdx and blockIdx with fields x, y, z are available - they contain the 3D coordinates of the thread within its block and of the block within the grid, respectively. The variables blockDim and gridDim with the same fields are also available - the block and grid dimensions, respectively.

As you can see, this way of launching threads really suits the processing of 2D and 3D images: for example, if each pixel of a 2D or 3D image needs to be processed in a certain way, then after choosing the block dimensions (depending on the image size, the processing method, and the GPU model), the grid dimensions are chosen so that the whole image is covered, possibly with some excess - when the image dimensions do not divide evenly by the block dimensions.
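For example, here is a minimal sketch of my own for choosing launch dimensions with 16 x 16 blocks (W and H are hypothetical image dimensions):

 // ceiling division guarantees the grid covers the whole image,
 // even when W and H do not divide evenly by the block size
 const dim3 blockSize(16, 16, 1);
 const dim3 gridSize((W + blockSize.x - 1) / blockSize.x,
                     (H + blockSize.y - 1) / blockSize.y,
                     1);
 // threads of the edge blocks may fall outside the image, so the kernel
 // must check its coordinates against W and H (see the kernels below)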

Writing the first program on CUDA


Enough theory - time to write code. Instructions for installing and configuring CUDA on different operating systems: docs.nvidia.com/cuda/index.html. For ease of working with image files we will use OpenCV, and for comparing CPU and GPU performance, OpenMP.
The task is quite simple: converting a color image to grayscale. The grayscale brightness of a pixel pix is computed by the formula: Y = 0.299 * pix.R + 0.587 * pix.G + 0.114 * pix.B.
First, let's write the skeleton of the program:
main.cpp
 #include <chrono>
 #include <iostream>
 #include <cstring>
 #include <string>
 #include <opencv2/core/core.hpp>
 #include <opencv2/highgui/highgui.hpp>
 #include <opencv2/opencv.hpp>
 #include <vector_types.h>
 #include "openMP.hpp"
 #include "CUDA_wrappers.hpp"
 #include "common/image_helpers.hpp"

 using namespace cv;
 using namespace std;

 int main(int argc, char** argv)
 {
   using namespace std::chrono;

   if (argc != 2)
   {
     cout << " Usage: convert_to_grayscale imagefile" << endl;
     return -1;
   }

   Mat image, imageGray;
   uchar4 *imageArray;
   unsigned char *imageGrayArray;
   prepareImagePointers(argv[1], image, &imageArray, imageGray, &imageGrayArray, CV_8UC1);

   int numRows = image.rows, numCols = image.cols;

   auto start = system_clock::now();
   RGBtoGrayscaleOpenMP(imageArray, imageGrayArray, numRows, numCols);
   auto duration = duration_cast<milliseconds>(system_clock::now() - start);
   cout << "OpenMP time (ms):" << duration.count() << endl;

   memset(imageGrayArray, 0, sizeof(unsigned char)*numRows*numCols);

   RGBtoGrayscaleCUDA(imageArray, imageGrayArray, numRows, numCols);

   return 0;
 }


It's all pretty obvious: we read the image file, prepare pointers to the color and the grayscale image, run the OpenMP variant and then the CUDA variant, and measure the time. The prepareImagePointers function looks like this:
prepareImagePointers
 template <class T1, class T2>
 void prepareImagePointers(const char * const inputImageFileName,
                           cv::Mat& inputImage,
                           T1** inputImageArray,
                           cv::Mat& outputImage,
                           T2** outputImageArray,
                           const int outputImageType)
 {
   using namespace std;
   using namespace cv;

   inputImage = imread(inputImageFileName, IMREAD_COLOR);

   if (inputImage.empty())
   {
     cerr << "Couldn't open input file." << endl;
     exit(1);
   }

   // allocate memory for the output
   outputImage.create(inputImage.rows, inputImage.cols, outputImageType);

   // convert BGR to RGBA: afterwards pixel.x = R, .y = G, .z = B,
   // matching the grayscale formula above
   cvtColor(inputImage, inputImage, cv::COLOR_BGR2RGBA);

   *inputImageArray = (T1*)inputImage.ptr<char>(0);
   *outputImageArray = (T2*)outputImage.ptr<char>(0);
 }


I resorted to a little trick here: the fact is that we do very little work per pixel of the image, so with CUDA the above-mentioned ratio arises between the time needed for the useful operations and the time spent on allocating memory and copying data, and as a result the total CUDA time would come out larger than the OpenMP time - but we want to show that CUDA is faster :) Therefore, for CUDA only the time spent on performing the actual image conversion will be measured, without counting the memory operations. In my defense I will say that for a large class of tasks the useful time still dominates, and CUDA turns out faster even with the memory operations taken into account.
Next, write the code for the OpenMP version:
openMP.hpp
 #include <stdio.h>
 #include <omp.h>
 #include <vector_types.h>

 void RGBtoGrayscaleOpenMP(uchar4 *imageArray, unsigned char *imageGrayArray, int numRows, int numCols)
 {
   #pragma omp parallel for collapse(2)
   for (int i = 0; i < numRows; ++i)
   {
     for (int j = 0; j < numCols; ++j)
     {
       const uchar4 pixel = imageArray[i*numCols+j];
       imageGrayArray[i*numCols+j] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
     }
   }
 }


Everything is pretty straightforward: we just added the omp parallel for directive to single-threaded code - that is all the beauty and power of OpenMP. I tried playing with the schedule parameter, but the results only came out worse than without it.
Finally, on to CUDA. Here we will go into more detail. First, we need to allocate memory for the input data, move it from the CPU to the GPU, and allocate memory for the output:
 void RGBtoGrayscaleCUDA(const uchar4 * const h_imageRGBA,
                         unsigned char* const h_imageGray,
                         size_t numRows,
                         size_t numCols)
 {
   uchar4 *d_imageRGBA;
   unsigned char *d_imageGray;
   const size_t numPixels = numRows * numCols;

   cudaSetDevice(0);
   checkCudaErrors(cudaGetLastError());

   // allocate memory on the device for both input and output
   checkCudaErrors(cudaMalloc(&d_imageRGBA, sizeof(uchar4) * numPixels));
   checkCudaErrors(cudaMalloc(&d_imageGray, sizeof(unsigned char) * numPixels));

   // copy the input array to the GPU
   checkCudaErrors(cudaMemcpy(d_imageRGBA, h_imageRGBA, sizeof(uchar4) * numPixels, cudaMemcpyHostToDevice));


It is worth noting the variable naming convention in CUDA code: data residing on the CPU gets the prefix h_ (host), data on the GPU gets d_ (device). checkCudaErrors is a macro taken from the GitHub repository of the Udacity course. It looks like this:
 #include <cuda.h>

 #define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)

 template<typename T>
 void check(T err, const char* const func, const char* const file, const int line)
 {
   if (err != cudaSuccess)
   {
     std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
     std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
     exit(1);
   }
 }


cudaMalloc is the GPU analog of malloc; cudaMemcpy is the analog of memcpy, with an additional enum parameter specifying the kind of copy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.
Next, we need to set the grid and block dimensions and launch the kernel, not forgetting to measure the time:
   dim3 blockSize;
   dim3 gridSize;
   int threadNum;

   cudaEvent_t start, stop;
   cudaEventCreate(&start);
   cudaEventCreate(&stop);

   threadNum = 1024;
   blockSize = dim3(threadNum, 1, 1);
   gridSize = dim3(numCols/threadNum+1, numRows, 1);

   cudaEventRecord(start);
   rgba_to_grayscale_simple<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols);
   cudaEventRecord(stop);
   cudaEventSynchronize(stop);
   cudaDeviceSynchronize();
   checkCudaErrors(cudaGetLastError());

   float milliseconds = 0;
   cudaEventElapsedTime(&milliseconds, start, stop);
   std::cout << "CUDA time simple (ms): " << milliseconds << std::endl;


Note the kernel launch syntax: kernel_name<<<gridSize, blockSize>>>(...). The kernel code itself is also not very complicated:
rgba_to_grayscale_simple
 __global__ void rgba_to_grayscale_simple(const uchar4* const d_imageRGBA,
                                          unsigned char* const d_imageGray,
                                          int numRows,
                                          int numCols)
 {
   int y = blockDim.y*blockIdx.y + threadIdx.y;
   int x = blockDim.x*blockIdx.x + threadIdx.x;
   if (x >= numCols || y >= numRows)
     return;

   const int offset = y*numCols+x;
   const uchar4 pixel = d_imageRGBA[offset];
   d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
 }


Here we compute the y and x coordinates of the pixel being processed using the previously described threadIdx, blockIdx, and blockDim variables, and perform the conversion. Note the check if (x >= numCols || y >= numRows): since the image dimensions will not necessarily divide evenly by the block dimensions, some blocks may "stick out" beyond the image, which is why this check is necessary. Also, the kernel function must be marked with the __global__ specifier.
The last step is to copy the result back from the GPU to the CPU and free the allocated memory:
   checkCudaErrors(cudaMemcpy(h_imageGray, d_imageGray, sizeof(unsigned char) * numPixels, cudaMemcpyDeviceToHost));
   cudaFree(d_imageGray);
   cudaFree(d_imageRGBA);


By the way, CUDA allows the use of a C++ compiler for the host code, so you can easily write wrappers that free memory automatically.
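As a quick illustration (my own sketch, not code from this article), such a wrapper might look like this:

 // a minimal RAII sketch: device memory is freed automatically
 // when the object goes out of scope
 template <typename T>
 class DeviceBuffer
 {
 public:
   explicit DeviceBuffer(size_t count)
   {
     checkCudaErrors(cudaMalloc(&d_ptr, count * sizeof(T)));
   }
   ~DeviceBuffer() { cudaFree(d_ptr); }

   // forbid copying so the same pointer is never freed twice
   DeviceBuffer(const DeviceBuffer&) = delete;
   DeviceBuffer& operator=(const DeviceBuffer&) = delete;

   T* get() const { return d_ptr; }

 private:
   T* d_ptr = nullptr;
 };

 // usage: DeviceBuffer<unsigned char> d_gray(numPixels);
 // ...use d_gray.get() in cudaMemcpy and kernel calls; no explicit cudaFree needed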
So, we run it and measure (input image size: 10,109 × 4,542):
 OpenMP time (ms):45
 CUDA time simple (ms): 43.1941

Configuration of the machine on which the tests were conducted:
Processor: Intel® Core™ i7-3615QM CPU @ 2.30 GHz.
GPU: NVIDIA GeForce GT 650M, 1024 MB, 900 MHz.
RAM: DDR3, 2x4 GB, 1600 MHz.
OS: OS X 10.9.5.
Compiler: g++ (GCC) 4.9.2 20141029.
CUDA compiler: Cuda compilation tools, release 6.0, V6.0.1.
Supported OpenMP Version: OpenMP 4.0.

The result is somehow not very impressive :) And the problem is still the same: too little work is done per pixel. We launch thousands of threads, each of which finishes almost instantly. With the CPU this problem does not arise: OpenMP launches a relatively small number of threads (8 in my case) and divides the work among them equally, so the processor cores are busy nearly 100% of the time, whereas on the GPU we are, in effect, not using all of its power. The solution is fairly obvious: process several pixels per thread. The new, optimized kernel looks like this:
rgba_to_grayscale_optimized
 #define WARP_SIZE 32

 __global__ void rgba_to_grayscale_optimized(const uchar4* const d_imageRGBA,
                                             unsigned char* const d_imageGray,
                                             int numRows,
                                             int numCols,
                                             int elemsPerThread)
 {
   int y = blockDim.y*blockIdx.y + threadIdx.y;
   int x = blockDim.x*blockIdx.x + threadIdx.x;

   const int loop_start = (x/WARP_SIZE * WARP_SIZE)*(elemsPerThread-1)+x;
   for (int i=loop_start, j=0; j<elemsPerThread && i<numCols; i+=WARP_SIZE, ++j)
   {
     const int offset = y*numCols+i;
     const uchar4 pixel = d_imageRGBA[offset];
     d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
   }
 }


This is not as simple as the previous kernel. Now each thread processes elemsPerThread pixels - not consecutive ones, but spaced WARP_SIZE apart. What WARP_SIZE is, why it equals 32, and why it pays to process pixels in this strided way will be explained in detail in the following parts; for now I will only say that it gives more efficient memory access. Since each thread now processes elemsPerThread pixels spaced WARP_SIZE apart, the x-coordinate of its first pixel is computed from its position in the block by a slightly more complicated formula than before.
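To see the access pattern this formula produces, here is a small host-side sketch of my own (not part of the article's code) that prints the pixel indices processed by a few threads of the first block, assuming WARP_SIZE = 32 and elemsPerThread = 16:

 #include <cstdio>

 int main()
 {
   const int WARP_SIZE = 32, elemsPerThread = 16, numCols = 10109;
   const int sample[] = {0, 1, 31, 32};
   for (int k = 0; k < 4; ++k)
   {
     const int x = sample[k];
     // the kernel's index arithmetic, reproduced on the host
     const int loop_start = (x/WARP_SIZE * WARP_SIZE)*(elemsPerThread-1)+x;
     printf("thread x=%2d: pixels", x);
     for (int i = loop_start, j = 0; j < elemsPerThread && i < numCols; i += WARP_SIZE, ++j)
       printf(" %d", i);
     printf("\n");
   }
   return 0;
 }

Thread 0 touches pixels 0, 32, 64, ..., 480; thread 1 touches 1, 33, ..., 481; thread 31 touches 31, 63, ..., 511; thread 32 starts a new strip at 512. So threads 0-31 (one warp) together cover pixels 0-511, and at every loop iteration the warp reads 32 consecutive pixels.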
This kernel is launched as follows:
   threadNum = 128;
   const int elemsPerThread = 16;
   blockSize = dim3(threadNum, 1, 1);
   gridSize = dim3(numCols / (threadNum*elemsPerThread) + 1, numRows, 1);

   cudaEventRecord(start);
   rgba_to_grayscale_optimized<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols, elemsPerThread);
   cudaEventRecord(stop);
   cudaEventSynchronize(stop);
   cudaDeviceSynchronize();
   checkCudaErrors(cudaGetLastError());

   milliseconds = 0;
   cudaEventElapsedTime(&milliseconds, start, stop);
   std::cout << "CUDA time optimized (ms): " << milliseconds << std::endl;


The number of blocks on the x-coordinate is now calculated as numCols / (threadNum * elemsPerThread) + 1 instead of numCols / threadNum + 1 . Otherwise, everything remains the same.
Run:
 OpenMP time (ms):44
 CUDA time simple (ms): 53.1625
 CUDA time optimized (ms): 15.9273

We got a speedup of 2.76 times (again, not counting the time spent on memory operations) - for such a simple problem this is quite good. Yes, yes, this task is too simple - the CPU handles it quite well; as the second run shows, a naive GPU implementation can even lose to the CPU implementation in speed.
That's all for today. In the next part we will look at GPU hardware and the main parallel communication patterns.
All source code is available on bitbucket .

Source: https://habr.com/ru/post/245503/

