Use VTune Amplifier 2016 to analyze the HelloOpenCL application for GPU

VTune Amplifier 2016 can analyze OpenCL programs. This article shows how to use it for that purpose, and how to create a simple OpenCL program called HelloOpenCL using Microsoft Visual Studio and the Intel OpenCL code builder.

OpenCL is an open standard for parallel programming on heterogeneous systems, that is, systems that combine CPUs, GPUs, digital signal processors, FPGAs, and other devices. An OpenCL application usually contains two kinds of code: host code and device code (the “kernels”). The host side uses two types of APIs. Platform APIs query the available devices and their capabilities so that the application can select and initialize OpenCL devices. Runtime APIs configure and run kernels on the selected devices. You can use the Intel OpenCL code builder to develop the device code that runs in the OpenCL runtime. Different hardware vendors ship their own implementations of the OpenCL runtime, so make sure that the runtime matching your target device is installed.

VTune OpenCL analysis helps determine which OpenCL kernels take the most time and how often they are called. Copying data between different hardware components also takes time because of hardware context switching. The OpenCL memory read and write metrics in VTune help analyze the delays caused by memory access. The following sections walk through creating a simple HelloOpenCL program and analyzing it with VTune OpenCL analysis, including the new architecture diagram feature.

Run the first OpenCL program for the GPU - HelloOpenCL

Before you start developing the HelloOpenCL program, you need to install a few components. First, to build the kernel code and verify platform compatibility, download the Intel OpenCL code builder contained in the INDE package. Second, install an implementation of the OpenCL runtime on the target device. The Intel OpenCL implementation is included in the Intel graphics driver package. You can download the driver here. Visit this page for instructions and other download options.

After installing the Intel OpenCL code builder, you can check which OpenCL devices it supports. The test system used here is equipped with a 4th generation Intel® Core™ processor (Haswell).

Once you have confirmed that the environment supports the desired OpenCL devices, as shown in the figure above, you can create your first OpenCL program in Microsoft Visual Studio Professional 2013 using the installed HelloOpenCL template, or use the sample code included with this article. The sample asks the GPU device to add two two-dimensional buffers and write the sums to a two-dimensional output buffer. This pattern is common in standard image filters. Here is a sample of the HelloOpenCL code.
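The sample code itself is not reproduced in this text, so the following is only a minimal sketch of what such an Add kernel might compute. It assumes float elements stored row by row; the kernel source in ADD_KERNEL_SRC and the helper name add_2d_reference are hypothetical, not taken from the HelloOpenCL template. The C function runs on the CPU and mirrors the per-work-item arithmetic, which can be handy for checking GPU output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OpenCL C source for an "Add" kernel (an assumption, not the
   article's actual sample). Each work-item adds one element, addressed by
   its two-dimensional global ID. */
const char *const ADD_KERNEL_SRC =
    "__kernel void Add(__global const float *a,\n"
    "                  __global const float *b,\n"
    "                  __global float *c)\n"
    "{\n"
    "    const size_t x = get_global_id(0);\n"
    "    const size_t y = get_global_id(1);\n"
    "    const size_t width = get_global_size(0);\n"
    "    const size_t i = y * width + x;\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* CPU reference for the same computation. The two loops stand in for the
   2D global range; each iteration does the work of one work-item. */
void add_2d_reference(const float *a, const float *b, float *c,
                      size_t width, size_t height)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            const size_t i = y * width + x; /* row-major index */
            c[i] = a[i] + b[i];
        }
    }
}
```

Comparing the buffer read back from the GPU against the output of add_2d_reference is a simple way to confirm that the kernel ran correctly before profiling it.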

Profiling HelloOpenCL with VTune Amplifier 2016

After successfully building the HelloOpenCL program, you can run VTune to profile the application in the Visual Studio development environment. For a detailed description of the steps to configure OpenCL profiling in VTune, see the following figure.

Run VTune in the VS 2013 development environment.
Choose the Advanced Hotspots analysis type.
Select graphics memory access events.
Check the OpenCL software checkbox.

After VTune finishes collecting results, open the Graphics tab to see the analysis view shown below. The annotations that follow describe its main features.

VTune provides several grouping views for the function call list. For an OpenCL program running on the GPU, the Computing Task Purpose / * grouping makes it easier to assess the efficiency of the OpenCL API calls using OpenCL-aware metrics.
These annotations mark the OpenCL API calls that run on the CPU side, and show how much CPU time each call takes. clBuildProgram compiles the kernel source into a program that can run in the OpenCL runtime. clCreateKernel selects one kernel function from a previously built OpenCL program, which may contain several kernel functions. clEnqueueNDRangeKernel places a kernel function in an OpenCL command queue, from which the GPU picks it up and executes it.
This timeline of Intel® HD Graphics 4 ... shows that Add is the kernel function scheduled to run on the Intel GPU.
This highlights when the Add kernel actually executes on the GPU hardware. There is a gap between the time the kernel function is scheduled and the time it actually runs, caused by preparation work and context switching.
This is a new feature in VTune Amplifier 2016. As shown in the following figure, the architecture diagram overlays data transfer rates on a static block diagram of the GPU architecture, which shows how efficiently data moves between its units. The untyped memory read rate is twice the write rate, which matches the behavior of the HelloOpenCL application: the kernel reads two input buffers and writes one output buffer.
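The two-to-one ratio between reads and writes can be checked with simple arithmetic. The sketch below assumes an element-wise Add over a width by height buffer and ignores caching effects; the type and function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Expected memory traffic of one kernel launch, in bytes. */
typedef struct {
    size_t bytes_read;
    size_t bytes_written;
} KernelTraffic;

/* For an element-wise Add, every work-item reads one element from each of
   the two input buffers and writes one element to the output buffer. */
KernelTraffic add_kernel_traffic(size_t width, size_t height, size_t elem_size)
{
    const size_t elems = width * height;
    KernelTraffic t;
    t.bytes_read = 2 * elems * elem_size;
    t.bytes_written = elems * elem_size;
    return t;
}

/* Ratio of bytes read to bytes written. */
double read_write_ratio(KernelTraffic t)
{
    assert(t.bytes_written > 0);
    return (double)t.bytes_read / (double)t.bytes_written;
}
```

For any buffer size, read_write_ratio(add_kernel_traffic(w, h, s)) evaluates to 2.0, which is the relationship the architecture diagram reports for untyped memory reads and writes.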

The architecture diagram also lets you see how the buffers allocated by the HelloOpenCL application are served through the L3 cache. GPU utilization could be improved significantly, because the GPU is idle most of the time. In other words, the Intel OpenCL device has capacity for more complex tasks.
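Both the scheduling gap seen in the timeline and the idle time noted here can also be estimated from per-kernel timestamps. The sketch below assumes nanosecond timestamps such as those a host program can read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END) on a command queue created with CL_QUEUE_PROFILING_ENABLE. The struct and helper names are hypothetical, and the busy fraction assumes kernel executions do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Timestamps for one kernel command, in nanoseconds. */
typedef struct {
    uint64_t queued_ns; /* command placed in the queue by the host */
    uint64_t submit_ns; /* command submitted to the device */
    uint64_t start_ns;  /* kernel starts executing on the device */
    uint64_t end_ns;    /* kernel finishes executing */
} KernelTimestamps;

/* Delay between enqueueing the kernel and its actual start on the GPU. */
uint64_t launch_gap_ns(KernelTimestamps t)
{
    assert(t.start_ns >= t.queued_ns);
    return t.start_ns - t.queued_ns;
}

/* Time the kernel actually spends executing on the GPU. */
uint64_t execution_ns(KernelTimestamps t)
{
    assert(t.end_ns >= t.start_ns);
    return t.end_ns - t.start_ns;
}

/* Fraction of a wall-clock interval during which the GPU ran kernels,
   assuming the kernel executions do not overlap. */
double gpu_busy_fraction(const KernelTimestamps *runs, size_t count,
                         uint64_t wall_ns)
{
    assert(wall_ns > 0);
    uint64_t busy = 0;
    for (size_t i = 0; i < count; ++i) {
        busy += execution_ns(runs[i]);
    }
    return (double)busy / (double)wall_ns;
}
```

A small busy fraction together with a large launch gap would agree with what the VTune timeline shows for HelloOpenCL: the device spends more time waiting than computing.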
