
Parallel programming with CUDA. Part 2: GPU hardware and parallel communication patterns

Contents



Part 1: Introduction.

Part 2: GPU hardware and parallel communication patterns.

Part 3: Fundamental GPU algorithms: reduce, scan and histogram.

Part 4: Fundamental GPU algorithms: compact, segmented scan, sorting. Practical applications of some algorithms.

Part 5: Optimization of GPU programs.

Part 6: Examples of parallelization of sequential algorithms.

Part 7: Additional parallel programming topics, dynamic parallelism.



Parallel communication patterns





What is parallel computing? Essentially, it is many threads solving a particular task together. The key word here is "together": to achieve cooperation, the threads need certain mechanisms for communicating with each other. In CUDA, communication happens through memory: threads can read input data, modify output data, or exchange "intermediate" results.

Depending on how the threads communicate through memory, different parallel communication patterns are distinguished.

In the previous part, converting a color image to grayscale was considered as a simple example of using CUDA. There, the intensity of each pixel of the grayscale output image was computed by the formula I = A * pix.R + B * pix.G + C * pix.B, where A, B, C are constants and pix is the corresponding pixel of the original image. Graphically, this process looks like this:



If we abstract away from the specific computation, we get the first parallel communication pattern, map: the same function is applied to each input element with index i, and the result is stored in the output array at the same index i. The map pattern is very efficient on the GPU and, moreover, is trivially expressed in CUDA: just launch one thread per input element (as was done in the previous part for the grayscale conversion). However, only a small share of problems can be solved using this pattern alone.
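As an illustration, here is a minimal sketch of a map kernel (the kernel name and the per-element function are my own, not from the original example):

// Map pattern: one thread per element; thread i reads in[i] and writes out[i].
__global__ void map_kernel(const float* in, float* out, const int n)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = 2.0f*in[i] + 1.0f; // any per-element function f(in[i])
}

// Launched with one thread per input element, for example:
// map_kernel<<<(n + 255)/256, 256>>>(d_in, d_out, n);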

If, however, several input elements are used to compute the output value with index i, the pattern is called gather, and it can look like this:



Or like this:



How efficiently this pattern can be implemented in CUDA depends on which input values are used to compute each output and on how many of them there are: the best case is when a small number of consecutive elements is used.

The reverse pattern, scatter - each input element affects several (or one) output elements - graphically looks like gather, but the meaning changes: now we start not from the output elements whose values we compute, but from the input elements that affect the values of certain output elements. Quite often the same problem can be solved with either the gather or the scatter pattern. For example, if we want to average every 3 neighboring input elements and write the result into the output array, we can:



As you have most likely guessed, with the scatter approach a synchronization problem arises, since several threads may attempt to modify the same output element at the same time.
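Here is a minimal sketch of both variants of the 3-element average (the kernel names are mine; the output array is assumed to be zero-initialized for the scatter version):

// Gather: each thread computes one OUTPUT element from 3 neighboring inputs.
// No synchronization is needed - every thread writes only its own element.
__global__ void average3_gather(const float* in, float* out, const int n)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= 1 && i < n - 1) // boundary elements skipped for brevity
    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// Scatter: each thread spreads one INPUT element over 3 neighboring outputs.
// Several threads update the same output cell, so atomicAdd is required to
// make sure no contribution is lost (float atomicAdd needs compute
// capability 2.0 or higher).
__global__ void average3_scatter(const float* in, float* out, const int n)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n)
    return;
  const float contribution = in[i] / 3.0f;
  if (i > 0) atomicAdd(&out[i - 1], contribution);
  atomicAdd(&out[i], contribution);
  if (i < n - 1) atomicAdd(&out[i + 1], contribution);
}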

It is also worth highlighting a special case of gather, the stencil pattern: here the input elements involved in computing an output element are restricted to its neighbors. For 2D/3D images, various kinds of stencils can be used, for example the two-dimensional von Neumann stencil:



or the two-dimensional Moore stencil:



Because of this restriction, the stencil pattern is usually implemented quite efficiently in CUDA: it is enough to launch one thread per output element, and that thread will read the input elements it needs (a small sketch follows the list below). With this organization of the computation, efficiency comes from two factors:

  1. All data needed by one thread is grouped in memory (for a one-dimensional array, one contiguous chunk of memory; in the 2D case, several chunks that are at the same distance from each other).
  2. The value of each input element is read several times, by neighboring threads (the exact number of reads depends on the chosen stencil), which makes it possible to "reuse" data via the means CUDA provides; this will be discussed later in the article.
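To make this concrete, here is a minimal sketch of a kernel for the two-dimensional von Neumann stencil mentioned above (the name and the averaging weights are mine; the boundary is handled by clamping):

// 2D von Neumann (5-point) stencil: one thread per output element.
// Each input element is read by up to five neighboring threads.
__global__ void von_neumann_stencil(const float* in, float* out,
                                    const int numRows, const int numCols)
{
  const int row = blockIdx.y*blockDim.y + threadIdx.y;
  const int col = blockIdx.x*blockDim.x + threadIdx.x;
  if (row >= numRows || col >= numCols)
    return;

  // clamp neighbor coordinates to the image boundary
  const int up    = max(row - 1, 0);
  const int down  = min(row + 1, numRows - 1);
  const int left  = max(col - 1, 0);
  const int right = min(col + 1, numCols - 1);

  // average of the element and its four von Neumann neighbors
  out[row*numCols + col] = 0.2f*(in[up*numCols + col] + in[down*numCols + col] +
                                 in[row*numCols + left] + in[row*numCols + right] +
                                 in[row*numCols + col]);
}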


GPU hardware



Consider the high-level structure of the GPU hardware: a GPU consists of a set of streaming multiprocessors (SMs), each of which contains a number of simple compute cores, registers, and its own shared memory, while all SMs have access to the common global memory of the device.





Compliance with CUDA model and GPU hardware. CUDA Warranties



According to the CUDA model, the programmer divides the task into blocks, and blocks into threads. How are these software entities mapped onto the hardware blocks described above? A thread block is always executed on a single SM in its entirety (it is never split between SMs), while one SM may execute several blocks; the scheduler is free to decide which blocks run on which SMs, in what order, and how many at a time.
Because of this, CUDA provides the following guarantees:

  1. All threads of one block are executed on the same SM, at the same time.
  2. All blocks of one kernel finish before any block of the next launched kernel starts executing.

CUDA does not guarantee that:

  1. Blocks will be executed in any particular order.
  2. Any particular number of blocks will run simultaneously.
  3. A particular block will be executed on a particular SM.

Synchronization



So, let's list the main synchronization mechanisms provided by CUDA:

  1. The barrier __syncthreads(): every thread of a block waits at the barrier until all threads of that block have reached it; it synchronizes threads within a single block only.
  2. The implicit barrier between kernel launches: as guaranteed above, all blocks of one kernel finish before the next launched kernel starts.
  3. Atomic operations (atomicAdd, atomicMin, atomicCAS and others): read-modify-write operations on a memory cell that are performed indivisibly. A small sketch of this mechanism is given below.
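As a tiny illustration of the last mechanism (a sketch; the kernel name is mine): many threads update a single counter, and atomicAdd makes each read-modify-write indivisible, so no increment is lost.

// Count how many input elements are positive. Many threads update the same
// counter; a plain "(*counter)++" would race, atomicAdd does not.
__global__ void count_positive(const float* in, int* counter, const int n)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n && in[i] > 0.0f)
    atomicAdd(counter, 1);
}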

Basic principles for the effective use of CUDA



  1. Maximize the arithmetic intensity of kernels: the ratio of computation performed to memory accesses made.
  2. Keep frequently used data in fast memory: registers are faster than shared memory, which in turn is much faster than global memory.
  3. Coalesce accesses to global memory: neighboring threads of a warp (a group of 32 threads executed together) should read neighboring memory cells, so that the hardware can merge the reads into a few transactions; a small sketch is given below.
  4. Avoid thread divergence: when threads of one warp take different branches, the branches are executed one after another.
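A small sketch of the coalescing point (the kernel names are mine): both kernels copy n elements, but the access pattern makes a large difference in memory throughput.

// Coalesced: thread i reads cell i, so neighboring threads of a warp touch
// neighboring cells and the hardware merges them into a few transactions.
__global__ void copy_coalesced(const float* in, float* out, const int n)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];
}

// Strided: thread i reads cell i*stride, so the accesses of one warp are
// scattered across memory and each of them may need its own transaction.
__global__ void copy_strided(const float* in, float* out,
                             const int n, const int stride)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i*stride < n)
    out[i] = in[i*stride];
}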

Writing a second CUDA program



Let's move on to practice. As an example illustrating the theory above, we will write a program that performs Gaussian blurring of an image. It works as follows: the values of the R, G, B channels of a pixel of the blurred output image are computed as a weighted sum of the R, G, B values of all pixels of the original image covered by a stencil:



The weights are computed from the two-dimensional Gaussian distribution, w(r, c) ~ exp(-(r^2 + c^2) / (2σ^2)), normalized so that the weights sum to 1 (see prepareFilter below); exactly how this is done is not too important for our task.

As the problem description suggests, the stencil pattern is a natural choice for implementing this algorithm, since each pixel of the output image is computed from the corresponding neighborhood of pixels of the original image.

Let's start with the skeleton of the program:

main.cpp
#include <chrono>
#include <cmath>
#include <cstring>
#include <iostream>
#include <string>

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/opencv.hpp>

#include <vector_types.h>

#include "openMP.hpp"
#include "CUDA_wrappers.hpp"
#include "common/image_helpers.hpp"

void prepareFilter(float **filter, int *filterWidth, float *filterSigma)
{
  static const int blurFilterWidth = 9;
  static const float blurFilterSigma = 2.;

  *filter = new float[blurFilterWidth * blurFilterWidth];
  *filterWidth = blurFilterWidth;
  *filterSigma = blurFilterSigma;

  //compute the unnormalized Gaussian weights
  float filterSum = 0.f;
  const int halfWidth = blurFilterWidth/2;
  for (int r = -halfWidth; r <= halfWidth; ++r)
  {
    for (int c = -halfWidth; c <= halfWidth; ++c)
    {
      float filterValue = expf( -(float)(c * c + r * r) /
                                (2.f * blurFilterSigma * blurFilterSigma));
      (*filter)[(r + halfWidth) * blurFilterWidth + c + halfWidth] = filterValue;
      filterSum += filterValue;
    }
  }

  //normalize the weights so that they sum to 1
  float normalizationFactor = 1.f / filterSum;
  for (int r = -halfWidth; r <= halfWidth; ++r)
  {
    for (int c = -halfWidth; c <= halfWidth; ++c)
    {
      (*filter)[(r + halfWidth) * blurFilterWidth + c + halfWidth] *= normalizationFactor;
    }
  }
}

void freeFilter(float *filter)
{
  delete[] filter;
}

int main( int argc, char** argv )
{
  using namespace cv;
  using namespace std;
  using namespace std::chrono;

  if( argc != 2)
  {
    cout <<" Usage: blur_image imagefile" << endl;
    return -1;
  }

  Mat image, blurredImage, referenceBlurredImage;
  uchar4 *imageArray, *blurredImageArray;
  prepareImagePointers(argv[1], image, &imageArray, blurredImage, &blurredImageArray, CV_8UC4);
  int numRows = image.rows, numCols = image.cols;

  float *filter;
  float filterSigma;
  int filterWidth;
  prepareFilter(&filter, &filterWidth, &filterSigma);
  cv::Size filterSize(filterWidth, filterWidth);

  auto start = system_clock::now();
  cv::GaussianBlur(image, referenceBlurredImage, filterSize, filterSigma, filterSigma, BORDER_REPLICATE);
  auto duration = duration_cast<milliseconds>(system_clock::now() - start);
  cout<<"OpenCV time (ms):" << duration.count() << endl;

  start = system_clock::now();
  BlurImageOpenMP(imageArray, blurredImageArray, numRows, numCols, filter, filterWidth);
  duration = duration_cast<milliseconds>(system_clock::now() - start);
  cout<<"OpenMP time (ms):" << duration.count() << endl;
  cout<<"OpenMP similarity:" << getEuclidianSimilarity(referenceBlurredImage, blurredImage) << endl;

  for (int i=0; i<4; ++i)
  {
    memset(blurredImageArray, 0, sizeof(uchar4)*numRows*numCols);

    start = system_clock::now();
    BlurImageCUDA(imageArray, blurredImageArray, numRows, numCols, filter, filterWidth);
    duration = duration_cast<milliseconds>(system_clock::now() - start);
    cout<<"CUDA time full (ms):" << duration.count() << endl;
    cout<<"CUDA similarity:" << getEuclidianSimilarity(referenceBlurredImage, blurredImage) << endl;
  }

  freeFilter(filter);

  return 0;
}




Step by step:

  1. We read the image file, prepare pointers to the original image and the resulting blurred image. The prepareImagePointers function remains the same; if necessary, you can view its source code on bitbucket.
  2. Prepare the Gaussian filter, that is, our set of weights. We also remember the filter parameters used, so that we can pass them to OpenCV and obtain a reference blurred image for verifying the correctness of our implementations.
  3. Call OpenCV's Gaussian blur, save the resulting reference image, and measure the time spent.
  4. Call the Gaussian blur function written using OpenMP, measure the time spent, and check the obtained result against the reference. The image similarity function getEuclidianSimilarity looks like this:

    getEuclidianSimilarity
     double getEuclidianSimilarity(const cv::Mat& a, const cv::Mat& b)
     {
       double errorL2 = cv::norm(a, b, cv::NORM_L2);
       double similarity = errorL2 / (double) (a.rows * a.cols);
       return similarity;
     }




    In fact, it computes the L2 norm of the difference between the two images (the square root of the sum of squared differences over all channels of all pixels), divided by the number of pixels.
  5. Call the CUDA version of the Gaussian blur 4 times, each time measuring the time spent and checking the obtained result against the reference. Why call it 4 times? Because the very first call spends extra time on initialization, so it is better to run several times and look at the timings of the subsequent calls.


OpenMP implementation of the algorithm:

openMP.hpp
#include <stdio.h>
#include <algorithm>
#include <omp.h>
#include <vector_types.h>
#include <vector_functions.h>

void BlurImageOpenMP(const uchar4 * const imageArray,
                     uchar4 * const blurredImageArray,
                     const long numRows, const long numCols,
                     const float * const filter,
                     const size_t filterWidth)
{
  using namespace std;

  const long halfWidth = filterWidth/2;

  //parallelize the two outer loops over the output pixels
  #pragma omp parallel for collapse(2)
  for (long row = 0; row < numRows; ++row)
  {
    for (long col = 0; col < numCols; ++col)
    {
      float resR=0.0f, resG=0.0f, resB=0.0f;
      for (long filterRow = -halfWidth; filterRow <= halfWidth; ++filterRow)
      {
        for (long filterCol = -halfWidth; filterCol <= halfWidth; ++filterCol)
        {
          //Find the global image position for this filter position
          //clamp to boundary of the image
          const long imageRow = min(max(row + filterRow, static_cast<long>(0)), numRows - 1);
          const long imageCol = min(max(col + filterCol, static_cast<long>(0)), numCols - 1);

          const uchar4 imagePixel = imageArray[imageRow*numCols+imageCol];
          const float filterValue = filter[(filterRow+halfWidth)*filterWidth+filterCol+halfWidth];

          resR += imagePixel.x*filterValue;
          resG += imagePixel.y*filterValue;
          resB += imagePixel.z*filterValue;
        }
      }
      blurredImageArray[row*numCols+col] = make_uchar4(resR, resG, resB, 255);
    }
  }
}




For all 3 channels of each pixel of the original image we compute the weighted sum described above and write the result to the corresponding position of the output image.

The CUDA version:

CUDA.cu
#include <iostream>
#include <cuda.h>

#include "CUDA_wrappers.hpp"
#include "common/CUDA_common.hpp"

__global__
void gaussian_blur(const uchar4* const d_image,
                   uchar4* const d_blurredImage,
                   const int numRows, const int numCols,
                   const float * const d_filter,
                   const int filterWidth)
{
  const int row = blockIdx.y*blockDim.y+threadIdx.y;
  const int col = blockIdx.x*blockDim.x+threadIdx.x;
  const int halfWidth = filterWidth/2;

  //copy the filter weights into the shared memory of the block;
  //every thread of the block must reach __syncthreads(), so the
  //out-of-image check is performed only after the barrier
  extern __shared__ float shared_filter[];
  if (threadIdx.y < filterWidth && threadIdx.x < filterWidth)
  {
    const int filterOff = threadIdx.y*filterWidth+threadIdx.x;
    shared_filter[filterOff] = d_filter[filterOff];
  }
  __syncthreads();

  if (col >= numCols || row >= numRows)
    return;

  float resR=0.0f, resG=0.0f, resB=0.0f;
  for (int filterRow = -halfWidth; filterRow <= halfWidth; ++filterRow)
  {
    for (int filterCol = -halfWidth; filterCol <= halfWidth; ++filterCol)
    {
      //Find the global image position for this filter position
      //clamp to boundary of the image
      const int imageRow = min(max(row + filterRow, 0), numRows - 1);
      const int imageCol = min(max(col + filterCol, 0), numCols - 1);

      const uchar4 imagePixel = d_image[imageRow*numCols+imageCol];
      const float filterValue = shared_filter[(filterRow+halfWidth)*filterWidth+filterCol+halfWidth];

      resR += imagePixel.x * filterValue;
      resG += imagePixel.y * filterValue;
      resB += imagePixel.z * filterValue;
    }
  }

  d_blurredImage[row*numCols+col] = make_uchar4(resR, resG, resB, 255);
}

void BlurImageCUDA(const uchar4 * const h_image,
                   uchar4 * const h_blurredImage,
                   const size_t numRows, const size_t numCols,
                   const float * const h_filter,
                   const size_t filterWidth)
{
  uchar4 *d_image, *d_blurredImage;

  cudaSetDevice(0);
  checkCudaErrors(cudaGetLastError());

  const size_t numPixels = numRows * numCols;
  const size_t imageSize = sizeof(uchar4) * numPixels;
  //allocate memory on the device for both input and output
  checkCudaErrors(cudaMalloc(&d_image, imageSize));
  checkCudaErrors(cudaMalloc(&d_blurredImage, imageSize));

  //copy input array to the GPU
  checkCudaErrors(cudaMemcpy(d_image, h_image, imageSize, cudaMemcpyHostToDevice));

  float *d_filter;
  const size_t filterSize = sizeof(float) * filterWidth * filterWidth;
  checkCudaErrors(cudaMalloc(&d_filter, filterSize));
  checkCudaErrors(cudaMemcpy(d_filter, h_filter, filterSize, cudaMemcpyHostToDevice));

  dim3 blockSize;
  dim3 gridSize;
  int threadNum;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  threadNum = 32;
  blockSize = dim3(threadNum, threadNum, 1);
  gridSize = dim3(numCols/threadNum+1, numRows/threadNum+1, 1);

  cudaEventRecord(start);
  //the 3rd launch parameter is the amount of shared memory per block
  gaussian_blur<<<gridSize, blockSize, filterSize>>>(d_image, d_blurredImage, numRows, numCols, d_filter, filterWidth);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaDeviceSynchronize();
  checkCudaErrors(cudaGetLastError());

  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);
  std::cout << "CUDA time kernel (ms): " << milliseconds << std::endl;

  checkCudaErrors(cudaMemcpy(h_blurredImage, d_blurredImage, sizeof(uchar4) * numPixels, cudaMemcpyDeviceToHost));

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  checkCudaErrors(cudaFree(d_filter));
  checkCudaErrors(cudaFree(d_image));
  checkCudaErrors(cudaFree(d_blurredImage));
}




  1. Allocate memory on the device for the original image, the output image, and the filter; copy the corresponding data from host memory to the allocated device memory.
  2. Call the kernel. Note the new, 3rd parameter of the kernel launch: <<<gridSize, blockSize, filterSize>>> - it sets the amount of shared memory allocated to each block. Shared memory could be used in two ways in this task: either copy the filter weights into the block's shared memory, or copy the "piece" of the image needed by this particular block, since it is exactly this data that several threads of the block need at once. The second option is somewhat more complicated - one has to take into account that the piece of the input image needed by each block is slightly larger than the block itself, because we perform a gather operation: each thread computes the value of one pixel of the output image using several neighboring pixels of the original image (a minimal sketch of this tiled variant is given after this list):



    Therefore, I settled on the first option, which means that each block needs exactly sizeof(float) * filterWidth * filterWidth bytes of shared memory to store all the filter values. Copying the filter weights from device memory to the block's shared memory looks like this:

      extern __shared__ float shared_filter[];
      if (threadIdx.y < filterWidth && threadIdx.x < filterWidth)
      {
        const int filterOff = threadIdx.y*filterWidth+threadIdx.x;
        shared_filter[filterOff] = d_filter[filterOff];
      }
      __syncthreads();




    Here __shared__ in the declaration of the filter weights array says that this data should be placed in the block's shared memory; extern means that the amount of memory to allocate is specified at the kernel launch site; __syncthreads is a barrier that guarantees that all filter weights have been copied into the block's shared memory before any thread of the block continues execution. After that, all reads of the filter weights are served from the block's shared memory.
  3. Copy the output image from device memory back to host memory and free the allocated device memory.
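For reference, here is a minimal sketch of the second option mentioned in point 2 (the kernel name is mine; this is not the variant measured below). The block first copies its tile of the image, including a "halo" of halfWidth pixels around it, into shared memory, and all reads of neighboring pixels are then served from there. The kernel must be launched with (blockDim.x + 2*halfWidth) * (blockDim.y + 2*halfWidth) * sizeof(uchar4) bytes as the third launch parameter:

__global__
void gaussian_blur_tiled(const uchar4* const d_image,
                         uchar4* const d_blurredImage,
                         const int numRows, const int numCols,
                         const float* const d_filter,
                         const int filterWidth)
{
  const int halfWidth = filterWidth/2;
  const int tileW = blockDim.x + 2*halfWidth;
  const int tileH = blockDim.y + 2*halfWidth;

  //dynamic shared memory: tileW*tileH*sizeof(uchar4) bytes, set at launch
  extern __shared__ uchar4 tile[];

  //top-left corner of the tile (including the halo) in image coordinates
  const int tileRow0 = (int)(blockIdx.y*blockDim.y) - halfWidth;
  const int tileCol0 = (int)(blockIdx.x*blockDim.x) - halfWidth;

  //the tile is larger than the block, so each thread copies several pixels;
  //coordinates are clamped to the image boundary
  for (int i = threadIdx.y*blockDim.x + threadIdx.x;
       i < tileW*tileH;
       i += blockDim.x*blockDim.y)
  {
    const int r = min(max(tileRow0 + i/tileW, 0), numRows - 1);
    const int c = min(max(tileCol0 + i%tileW, 0), numCols - 1);
    tile[i] = d_image[r*numCols + c];
  }
  __syncthreads();

  const int row = blockIdx.y*blockDim.y + threadIdx.y;
  const int col = blockIdx.x*blockDim.x + threadIdx.x;
  if (row >= numRows || col >= numCols)
    return;

  float resR = 0.0f, resG = 0.0f, resB = 0.0f;
  for (int filterRow = -halfWidth; filterRow <= halfWidth; ++filterRow)
  {
    for (int filterCol = -halfWidth; filterCol <= halfWidth; ++filterCol)
    {
      //the pixel (row, col) sits at tile position
      //(threadIdx.y + halfWidth, threadIdx.x + halfWidth)
      const uchar4 imagePixel = tile[(threadIdx.y + filterRow + halfWidth)*tileW +
                                     threadIdx.x + filterCol + halfWidth];
      const float filterValue = d_filter[(filterRow + halfWidth)*filterWidth +
                                         filterCol + halfWidth];
      resR += imagePixel.x * filterValue;
      resG += imagePixel.y * filterValue;
      resB += imagePixel.z * filterValue;
    }
  }

  d_blurredImage[row*numCols + col] = make_uchar4(resR, resG, resB, 255);
}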


We compile and run on an input image of size 557x313.
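A possible build line (just a sketch: it assumes nvcc on the PATH, the CUDA toolkit under /usr/local/cuda, and an OpenCV installation visible to pkg-config; exact paths and flags depend on your setup):

nvcc -O2 -c CUDA.cu -o CUDA.o
g++ -O2 -std=c++11 -fopenmp -I/usr/local/cuda/include main.cpp CUDA.o -o blur_image \
    `pkg-config --cflags --libs opencv` -L/usr/local/cuda/lib -lcudart

The program prints: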

 OpenCV time (ms):2
 OpenMP time (ms):11
 OpenMP similarity:0.00287131
 CUDA time kernel (ms): 2.93245
 CUDA time full (ms):32
 CUDA similarity:0.00287131
 CUDA time kernel (ms): 2.93402
 CUDA time full (ms):4
 CUDA similarity:0.00287131
 CUDA time kernel (ms): 2.93267
 CUDA time full (ms):4
 CUDA similarity:0.00287131
 CUDA time kernel (ms): 2.93312
 CUDA time full (ms):4
 CUDA similarity:0.00287131


As you can see, if we disregard the very first launch of the CUDA version, we got an almost 3-fold speedup compared to the OpenMP version, although we still did not catch up with the OpenCV version - which, by the way, uses OpenCL.

Configuration of the machine on which the tests were conducted:

Processor: Intel Core i7-3615QM CPU @ 2.30GHz.

GPU: NVIDIA GeForce GT 650M, 1024 MB, 900 MHz.

RAM: DDR3, 2x4 GB, 1600 MHz.

OS: OS X 10.9.5.

Compiler: g++ (GCC) 4.9.2 20141029.

CUDA compiler: CUDA compilation tools, release 6.0, V6.0.1.

Supported OpenMP version: OpenMP 4.0.



That's all for today; in the next part we will look at some fundamental GPU algorithms.

All source code is available on bitbucket.



Source: https://habr.com/ru/post/245523/


