
First steps with OpenCL, or a tale about how to run the same code on the GPU and CPU

So, almost a year has passed since my first post about programming video cards and the horror stories about how complicated it is. Now it's time to show that things are not so bad, to demonstrate how to use this strange thing called OpenCL, and even to exploit its main advantage: the ability to run the same code on different devices. I will also show how to get an order of magnitude more performance out of an ordinary processor almost for free.

Introduction


I think retelling Wikipedia's article on OpenCL makes little sense, but in a nutshell: OpenCL is a language (plus framework and platform) that allows you to run the same code on devices with different architectures, in particular on highly parallel processors like video cards and modern CPUs. The standard is based on C99 and is maintained by The Khronos Group; let's consider the educational preamble complete.

I will begin by showing a small piece of code and explaining what is happening there, at the same time talking about how OpenCL works.

First I will walk through some fairly trivial code; those not eager to see OpenCL magic can skip the first part (just read its last paragraph, where I describe the MathCalculations function — this is important). And if you already know OpenCL and just want the test results, go straight to part five, but still take a look at MathCalculations.
 int main(int argc, char* argv[])
 {
     GenerateTestData();
     PerformCalculationsOnHost();

     //Get all available platforms
     vector<cl::Platform> platforms;
     cl::Platform::get(&platforms);
     for (int iPlatform=0; iPlatform<platforms.size(); iPlatform++)
     {
         //Get all available devices on selected platform
         std::vector<cl::Device> devices;
         platforms[iPlatform].getDevices(CL_DEVICE_TYPE_ALL, &devices);

         //Perform test on each device
         for (int iDevice=0; iDevice<devices.size(); iDevice++)
         {
             try
             {
                 PerformTestOnDevice(devices[iDevice]);
             }
             catch (cl::Error error)
             {
                 std::cout << error.what() << "(" << error.err() << ")" << std::endl;
             }
             CheckResults();
         }
     }

     //Clean buffers
     delete[](pInputVector1);
     delete[](pInputVector2);
     delete[](pOutputVector);
     delete[](pOutputVectorHost);
     return 0;
 }


This is what main() of my small OpenCL test program looks like — more specifically, a program that computes some abstract mathematical expression we will get to later. So, let's figure out what is going on here, step by step.

Part One - Initialization of source data and the traditional method of computing


GenerateTestData(); does nothing extraordinary: it simply allocates memory for the input and output arrays and fills the input arrays with random data.
 void GenerateTestData()
 {
     pInputVector1 = new float[DATA_SIZE];
     pInputVector2 = new float[DATA_SIZE];
     pOutputVector = new float[DATA_SIZE];
     pOutputVectorHost = new float[DATA_SIZE];

     srand(time(NULL));
     for (int i=0; i<DATA_SIZE; i++)
     {
         pInputVector1[i] = rand() * 1000.0 / RAND_MAX;
         pInputVector2[i] = rand() * 1000.0 / RAND_MAX;
     }
 }


Next comes a slightly more interesting function:
 void PerformCalculationsOnHost()
 {
     cout << "Device: Host" << endl << endl;

     //Some performance measurement
     timeValues.clear();
     __int64 start_count;
     __int64 end_count;
     __int64 freq;
     QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

     for (int iTest=0; iTest<(TESTS_NUMBER/10); iTest++)
     {
         QueryPerformanceCounter((LARGE_INTEGER*)&start_count);
         for (int iJob=0; iJob<DATA_SIZE; iJob++)
         {
             //Check boundary conditions
             if (iJob >= DATA_SIZE) break;

             //Perform calculations
             pOutputVectorHost[iJob] = MathCalculations(pInputVector1[iJob], pInputVector2[iJob]);
         }
         QueryPerformanceCounter((LARGE_INTEGER*)&end_count);
         double time = 1000 * (double)(end_count - start_count) / (double)freq;
         timeValues.push_back(time);
     }

     //Accumulate with 0.0 (not 0) so the times are not truncated to integers
     hostPerformanceTimeMS = std::accumulate(timeValues.begin(), timeValues.end(), 0.0)/timeValues.size();

     PrintTimeStatistic();
 }


Its first loop
 for(int iTest=0; iTest<(TESTS_NUMBER/10); iTest++) 

exists to run the test several times and obtain a more accurate execution time. The time of each run is stored in the timeValues array, from which the average is then computed and stored in hostPerformanceTimeMS.

The second loop
 for(int iJob=0; iJob<DATA_SIZE; iJob++) 

sequentially performs some mathematical calculations on the elements of the input arrays and stores the results in the output array.

As you can see, there is nothing unusual in this code: it is compiled by an ordinary C++ compiler and executed sequentially on the CPU, like most of the code we all write every day. We need it so that we can later verify the results produced by OpenCL against it, and also to see what performance gain we get.

Right away you should look at MathCalculations and see that it is utterly boring:
 float MathCalculations(float a, float b)
 {
     float res = 0;
     res += a*a*0.315f + b*0.512f + 0.789f;
     res += a*a*0.15f + b*0.12f + 0.789f;
     res += a*a*0.35f + b*0.51f + 0.89f;
     res += a*a*0.31f + b*0.52f + 0.7f;
     res += a*a*0.4315f + b*0.512f + 0.4789f;
     res += a*a*0.515f + b*0.132f + 0.7859f;
     res += a*a*0.635f + b*0.521f + 0.89f;
     res += a*a*0.731f + b*0.152f + 0.7f;
     res += a*a*0.1315f + b*0.512f + 0.789f;
     res += a*a*0.115f + b*0.12f + 0.789f;
     res += a*a*0.135f + b*0.51f + 0.89f;
     res += a*a*0.131f + b*0.52f + 0.7f;
     res += a*a*0.14315f + b*0.512f + 0.4789f;
     res += a*a*0.1515f + b*0.132f + 0.7859f;
     res += a*a*0.1635f + b*0.521f + 0.89f;
     res += a*a*0.1731f + b*0.152f + 0.7f;
     return res;
 }


Actually, it has no particular meaning (and it obviously could be greatly simplified); it just serves as a stand-in for pure arithmetic. What matters is that it lives in a separate .cpp file and that it performs a lot of arithmetic operations — more on both points later.

Part Two - Initializing OpenCL


So, the patient readers have made it to this part and are delighted that the interesting stuff is starting, while the impatient ones cannot share that feeling — they skipped the previous paragraph :)

First, I should say that the OpenCL Runtime API is a C API, not a C++ one. There is nothing terribly wrong with that, except that for error checking you have to inspect the code returned by every function, which is inconvenient, and you also have to release allocated resources manually.
But there is also an official C++ wrapper (it can be found on the Khronos website): a set of classes corresponding to OpenCL objects, supporting reference counting and throwing exceptions on errors (exceptions must be enabled with #define __CL_ENABLE_EXCEPTIONS). It is this wrapper that I use in the test.

So first we get a list of available platforms:
 vector<cl::Platform> platforms;
 cl::Platform::get(&platforms);

A platform in OpenCL corresponds to a vendor: NVidia has one platform with its devices, Intel another, and so on. In my case, two platforms are available: NVidia and Intel.

One more little trick: the C++ wrapper can use either its own vector class (if you ask it to) or the one from the STL, so if you come across something like cl::vector in examples, don't be alarmed — it understands both.

After we have received the list of platforms, for each platform we get a list of available devices:
 std::vector<cl::Device> devices;
 platforms[iPlatform].getDevices(CL_DEVICE_TYPE_ALL, &devices);

The devices are what actually perform our calculations. A device can be a GPU, a CPU, or some special accelerator connected to the host — the system on which OpenCL runs. Instead of CL_DEVICE_TYPE_ALL you can pass CL_DEVICE_TYPE_GPU to get only video cards, or CL_DEVICE_TYPE_CPU to get only CPUs.

For each device found, I run the test described below, catching the exceptions OpenCL throws if something goes wrong; and if everything went well, CheckResults compares the results with those computed on the host in part one and gathers error statistics.

Part Three - Creating and Running the Kernel


Here we come to the most interesting part - calculations.
 void PerformTestOnDevice(cl::Device device)
 {
     cout << endl << "-------------------------------------------------" << endl;
     cout << "Device: " << device.getInfo<CL_DEVICE_NAME>() << endl << endl;

     //For the selected device create a context
     vector<cl::Device> contextDevices;
     contextDevices.push_back(device);
     cl::Context context(contextDevices);

     //Create a command queue for the selected device
     cl::CommandQueue queue(context, device);

     //Clean output buffer
     fill_n(pOutputVector, DATA_SIZE, 0);

     //Create memory buffers
     cl::Buffer clmInputVector1 = cl::Buffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pInputVector1);
     cl::Buffer clmInputVector2 = cl::Buffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pInputVector2);
     cl::Buffer clmOutputVector = cl::Buffer(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pOutputVector);

     //Load OpenCL source code
     std::ifstream sourceFile("OpenCLFile1.cl");
     std::string sourceCode(std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));

     //Build OpenCL program and make the kernel
     cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()+1));
     cl::Program program = cl::Program(context, source);
     program.build(contextDevices);
     cl::Kernel kernel(program, "TestKernel");

     //Set arguments to kernel
     int iArg = 0;
     kernel.setArg(iArg++, clmInputVector1);
     kernel.setArg(iArg++, clmInputVector2);
     kernel.setArg(iArg++, clmOutputVector);
     kernel.setArg(iArg++, DATA_SIZE);

     //Some performance measurement
     timeValues.clear();
     __int64 start_count;
     __int64 end_count;
     __int64 freq;
     QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

     //Run the kernel on specific ND range
     for (int iTest=0; iTest<TESTS_NUMBER; iTest++)
     {
         QueryPerformanceCounter((LARGE_INTEGER*)&start_count);
         queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(DATA_SIZE), cl::NDRange(128));
         queue.finish();
         QueryPerformanceCounter((LARGE_INTEGER*)&end_count);
         double time = 1000 * (double)(end_count - start_count) / (double)freq;
         timeValues.push_back(time);
     }
     PrintTimeStatistic();

     //Read the output buffer back to the host
     queue.enqueueReadBuffer(clmOutputVector, CL_TRUE, 0, DATA_SIZE * sizeof(float), pOutputVector);
 }


First we output the device name obtained in this way:
 device.getInfo<CL_DEVICE_NAME>() 

In the same way you can get information about the number of compute units, the frequency, the driver version, etc.

Then we create the context:
 vector<cl::Device> contextDevices;
 contextDevices.push_back(device);
 cl::Context context(contextDevices);

With contexts, things are not so simple... When creating a context, we pass the list of devices we want to include in it, but there is a restriction: a context can only contain devices from a single platform, i.e. you cannot create one context with both the GPU and the CPU (in the Intel/NVidia case). With multiple devices in one context, all buffers are synchronized across the devices automatically. On the one hand this simplifies multi-GPU support; on the other hand, you never know how and when the driver will synchronize, and data-transfer efficiency is critical for the very performance we are after. So I usually create a separate context for each device and distribute the data manually — that way it is always known what happens, where, and when.

The next step is to create a command queue for the device:
 cl::CommandQueue queue(context, device); 

This queue is tied to a specific device and, in theory, may be out-of-order, though in practice I have not observed such behavior. A device can have several queues, and commands from different queues can be synchronized with each other, but only within the same context.

Next we create buffers for the input and output vectors:
 //Create memory buffers
 cl::Buffer clmInputVector1 = cl::Buffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pInputVector1);
 cl::Buffer clmInputVector2 = cl::Buffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pInputVector2);
 cl::Buffer clmOutputVector = cl::Buffer(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, DATA_SIZE * sizeof(float), pOutputVector);

When creating a buffer, you specify the context (not a specific device), its size, and optionally — with the CL_MEM_COPY_HOST_PTR flag — a pointer to host data that will be copied into the buffer on creation. As I said earlier, the C++ wrapper uses reference counting, so unlike with the pure C API you don't need to delete the buffer manually.

Next, we need to create the kernel, whose code is stored in the file "OpenCLFile1.cl". To do this, we read the text from the file, create an OpenCL program from it, build it, and extract the kernel named "TestKernel", which you will see in the next part.
 cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()+1));
 cl::Program program = cl::Program(context, source);
 program.build(contextDevices);
 cl::Kernel kernel(program, "TestKernel");

When building, you specify which devices you plan to run on — in our case the single device selected for the test, although you can specify all of them at once. You can also pass build flags, but in this example we don't.

Next, we need to set the arguments that will be passed to the kernel. Unlike CUDA, you have to call a special function (a method, in the case of the C++ wrapper) for each argument and, if necessary, specify the argument's size.
 int iArg = 0;
 kernel.setArg(iArg++, clmInputVector1);
 kernel.setArg(iArg++, clmInputVector2);
 kernel.setArg(iArg++, clmOutputVector);
 kernel.setArg(iArg++, DATA_SIZE);

Now we come to the most important thing - starting the kernel:
 queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(DATA_SIZE), cl::NDRange(128)); 

queue.enqueueNDRangeKernel adds the kernel-launch command to the command queue and sets the number of elements to be processed as well as the group size. I will cover groups separately (in another article); for now I will only mention that the elements are always divided into groups and that performance can depend strongly on the group size. In our case the number of elements is DATA_SIZE and the group size is 128. During execution the kernel will be run DATA_SIZE times (in an unknown order and possibly concurrently), and each invocation is told which element it is processing.
enqueueNDRangeKernel is non-blocking, so after launching the kernel we have to wait for it to complete, which is what this call is for:
 queue.finish(); 

In fact, finish performs two tasks:
1) It submits all commands to the device (returning from enqueueNDRangeKernel guarantees that the driver received the command and queued it, but not that it was launched on the device — quite a long time can pass before the kernel actually starts).
2) It waits for all commands in the queue to complete.
If you only need the first part, there is a flush command (clFlush) that does not block but makes the driver start executing the queued commands.

After performing the calculations, we calculate the elapsed time and load the calculation results back to the host with the command:
 queue.enqueueReadBuffer(clmOutputVector, CL_TRUE, 0, DATA_SIZE * sizeof(float), pOutputVector); 

Depending on its second argument, enqueueReadBuffer can be blocking or non-blocking. In our case it is blocking, so there is no need to call finish separately. The syntax is simple: the first argument is the buffer to read from, the fourth is how many bytes to read, and the last is where to put the data. There is also a parameter specifying the offset from the start of the buffer, which you should use if you need to read from somewhere other than the beginning — address arithmetic on OpenCL buffers is not available on the host.

Part Four - OpenCL kernel code


And here we finally get to the place where we write some OpenCL code (although it is hard to call this little snippet real code — pure self-indulgence :)). This is what OpenCLFile1.cl looks like:
 #include "MathCode.cpp"

 __kernel void TestKernel(
     __global const float* pInputVector1,
     __global const float* pInputVector2,
     __global float* pOutputVectorHost,
     int elementsNumber)
 {
     //Get index into global data array
     int iJob = get_global_id(0);

     //Check boundary conditions
     if (iJob >= elementsNumber) return;

     //Perform calculations
     pOutputVectorHost[iJob] = MathCalculations(pInputVector1[iJob], pInputVector2[iJob]);
 }

So in order:
First of all, we include the MathCode.cpp file, which contains the mathematical function — the same one I asked you to pay attention to earlier, and the same one used for the traditional calculations on the host. As you can see, we don't even copy the code: the very same file with the math is used in both places.
Next comes the kernel itself, marked with the __kernel keyword. Some of its arguments are marked with __global, which indicates a buffer in the device's global memory — the ones we created in the host code.
In the kernel code we get the number of the element that needs to be processed:
 int iJob = get_global_id(0); 

The get_global_id parameter specifies the dimension, since the set of elements being processed can be organized as a 1-, 2-, or 3-dimensional range.
Then we check the boundary conditions:
 if (iJob >= elementsNumber) return; 

This must be done because the global work size must always be a multiple of the group size, and so it may exceed the actual number of elements to be processed.
And after the check comes the main part — the calculations, done exactly the same way as on the host:
 pOutputVectorHost[iJob] = MathCalculations(pInputVector1[iJob], pInputVector2[iJob]); 


Part Five - Testing and Performance Measurements


So it's time to start the application, evaluate the performance and draw some conclusions.

I ran the test on two machines and got interesting results:
Laptop (CPU: Intel® Core™ i7-820QM, GPU: NVidia Quadro FX 2800M):
 Host: 959.256 ms
 CPU:  82.4163 ms (13.106X faster than host)
 GPU:  9.90836 ms (109.014X faster than host)

Desktop (CPU: Intel® Core™ i7-2600, GPU: NVidia GeForce GTX 580):
 Host: 699.031 ms
 CPU:  27.7833 ms (25.159X faster than host)
 GPU:  2.06257 ms (338.897X faster than host)

Full results
 Device: Host
 Calculation time statistic: (20 runs)
 Med: 959.256 ms (1.12602X faster than host)
 Avg: 1080.15 ms
 Min: 933.554 ms
 Max: 1319.19 ms
 -------------------------------------------------
 Device: Quadro FX 2800M
 Calculation time statistic: (200 runs)
 Med: 9.90836 ms (109.014X faster than host)
 Avg: 10.7231 ms
 Min: 9.82841 ms
 Max: 135.924 ms
 Errors: avgRelAbsDiff = 5.25777e-008 maxRelAbsDiff = 5.83678e-007
 -------------------------------------------------
 Device: Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz
 Calculation time statistic: (200 runs)
 Med: 82.4163 ms (13.106X faster than host)
 Avg: 85.2226 ms
 Min: 79.4138 ms
 Max: 113.03 ms
 Errors: avgRelAbsDiff = 3.64332e-008 maxRelAbsDiff = 4.84797e-007

 Device: Host
 Calculation time statistic: (20 runs)
 Med: 699.031 ms (0.999956X faster than host)
 Avg: 699.1 ms
 Min: 691.544 ms
 Max: 715.233 ms
 -------------------------------------------------
 Device: GeForce GTX 580
 Calculation time statistic: (200 runs)
 Med: 2.06257 ms (338.897X faster than host)
 Avg: 2.4 ms
 Min: 2.03873 ms
 Max: 82.0514 ms
 Errors: avgRelAbsDiff = 3.50006e-008 maxRelAbsDiff = 4.92271e-007
 -------------------------------------------------
 Device: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
 Calculation time statistic: (200 runs)
 Med: 27.7833 ms (25.159X faster than host)
 Avg: 27.49 ms
 Min: 27.0154 ms
 Max: 35.8386 ms
 Errors: avgRelAbsDiff = 3.64377e-008 maxRelAbsDiff = 4.89584e-007



So, let's analyze the results — and the results, I must say, are very impressive. The GPU in the laptop is ~110X faster than the host, and the desktop GPU is ~340X faster; impressive indeed. Before you start throwing sneakers at me and saying such a comparison is unfair, I will admit there is some sleight of hand in it, but nothing more.

First, we do not account here for the time to copy data to the device and back. On the one hand this is wrong, since with the copy included things may not look so rosy. On the other hand, the copying can overlap with computation, or may not be needed at all if the data already lives on the device. In general it is not so simple and depends on the specific task.

Second, remember what the math code looked like? For those who skipped it: it is a lot of arithmetic on the same data, produced by simple copy-paste with tweaked coefficients. Initially it was simpler and took just one line, and when I started testing, the results were far less joyful — the GPU was only 4-5 times faster. Why do you think that was? (A rhetorical question, you don't have to think :)) Simple: we ran into memory performance. I hope I will eventually get around to an article on the relationship between memory and compute performance, but that is another story; in this article all that matters is that with this kernel we obtained a pure test of arithmetic performance.

Given these two points, we can say that for pure arithmetic the GPU really is hundreds of times faster than non-parallel CPU code, which broadly matches the difference in theoretical performance. (Measuring the real numbers and how well they match theory is, again, something I hope to cover in another article.)

But that GPUs compute fast we already knew; the interesting outcome of our test is that the CPU also runs the OpenCL code rather quickly — 13X and 25X faster, to be exact, than ordinary code compiled by MSVC10 with default settings. Let's figure out where these numbers come from.

Both processors have 4 physical and 8 logical cores, and OpenCL uses all of them, but the improvement is much larger than 4X. For that we should thank Intel, whose OpenCL implementation added automatic vectorization: without any code changes, OpenCL uses SSE or AVX, whichever is available. Since SSE registers are 128-bit and AVX works with 256 bits, performance should rise to 16X and 32X respectively. That is closer to the truth, but still not an exact match — and here we must remember such a joyful thing as TurboBoost. These processors run at 1.73GHz/3.06GHz (laptop) and 3.4GHz/3.8GHz (desktop), but in practice the laptop CPU's frequency constantly jumps between 1.73 and 2.8 while the machine heats up quite badly (a big stone should be thrown at Dell here for the poor cooling system), so we will hardly ever see 3.06GHz during the test. Also, remember that practical results always fall short of the theoretical maximum (the desktop should be faster still) — but as we can see, a 25X performance improvement can be had almost for free on the same hardware.

Conclusion


OpenCL, as we have seen, lets you write the code once and run it on very different devices — GPUs and CPUs alike — and get a substantial performance gain almost for free. That, to me, is its main advantage.

PS: To build and run the test yourself, you will need an OpenCL SDK; NVidia, Intel, and AMD each provide one.

PS2: With Ivy Bridge, the Intel OpenCL SDK also supports the Intel IGP, so the same code can run on integrated graphics as well. AMD supports OpenCL on its hardware too.

Source: https://habr.com/ru/post/146823/
