
OpenCL. Technology details



Hello, dear Habr community.

The previous article about OpenCL gave an overview of this technology: what it can offer the user and what state it is in at the moment.
Now let us look at the technology more closely. We will try to understand how OpenCL represents a heterogeneous system, what means of interacting with a device it provides, and what approach to creating programs it offers.

OpenCL was conceived as a technology for creating applications that run in a heterogeneous environment. Moreover, it is designed to work comfortably even with devices that are still only planned, and with devices that no one has yet invented. To coordinate the work of all the devices in a heterogeneous system, there is always one "main" device that interacts with all the others through the OpenCL API. Such a device is called the "host"; it is defined outside of OpenCL.

OpenCL therefore proceeds from the most general assumptions about what an OpenCL-capable device is: since the device is meant to be used for computation, it has a certain "processor" in the broad sense of the word, something that can execute commands. Since OpenCL is designed for parallel computing, such a processor may have means of parallelism inside itself (for example, the several cores of a single CPU, or the several SPE processors in the Cell). An elementary way to increase parallel computing performance is to install several such processors in one device (as in multiprocessor PC motherboards and the like). And naturally, a heterogeneous system can contain several such OpenCL devices, generally speaking with different architectures.

In addition to computing resources, the device has some amount of memory. No requirements are imposed on this memory: it can be located on the device itself, or even be allocated in the host's RAM (as is done, for example, with integrated video cards).

That is all; no further assumptions about the device are made.

Such a broad concept of a device avoids imposing any restrictions on programs developed for OpenCL. The technology lets you develop both applications highly optimized for the specific architecture of a specific OpenCL device, and applications that show stable performance on all types of devices (assuming equivalent performance of those devices).

OpenCL provides the programmer with a low-level API through which he interacts with device resources. The OpenCL API can either be supported by the device directly or work through an intermediate API (as with NVIDIA, where OpenCL works on top of the CUDA Driver API supported by the devices); this depends on the specific implementation and is not described by the standard.

Let us consider how OpenCL achieves such versatility while remaining low-level.

What follows is a free translation of part of the OpenCL 1.0 specification, with some comments and additions of mine.

To describe the basic ideas of OpenCL, we will use a hierarchy of four models:

  - Platform Model
  - Execution Model
  - Memory Model
  - Programming Model


Platform Model.


The OpenCL platform consists of a host connected to devices that support OpenCL. Each OpenCL device consists of compute units (Compute Units, CU), which are in turn divided into one or more processing elements (Processing Elements, hereinafter PE).

An OpenCL application runs on the host according to the native models of the host platform. The application sends commands from the host to the devices to perform computations on the PEs. The PEs within one compute unit execute a single instruction stream either as SIMD blocks (one instruction is executed by all PEs at once; processing of the next instruction does not begin until all PEs finish the current one) or as SPMD blocks (each PE has its own program counter).
In other words, OpenCL processes commands coming from the host. The application is thus not rigidly tied to a particular OpenCL implementation, which can always be replaced without disturbing the program. Even if a device appears that does not fit the "OpenCL device" model, an OpenCL implementation can be created for it that translates host commands into a form more convenient for that device.

Execution Model.


The execution of an OpenCL program consists of two parts: the host part of the program and the kernels (with your permission I will keep using the English term, as it is more familiar to most of us) that run on an OpenCL device. The host part of the program defines the context in which the kernels execute and controls their execution.

The main part of the OpenCL execution model describes the execution of kernels. When a kernel is queued for execution, an index space is defined (NDRange; the definition is given below). An instance of the kernel is created for each index in this space. A kernel instance executed for a specific index is called a work item; it is identified by its point in the index space, that is, each work item is given a global ID. Every work item executes the same code, but the specific execution path (branching and so on) and the data it works on may differ.
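To make this concrete, here is a minimal kernel in OpenCL C (my own illustration, not part of the specification text): every work item runs the same body, but get_global_id(0) gives each instance its own point in the index space, so each one processes its own array element.

```c
// OpenCL C kernel source; it is compiled for the device at run time.
// Each work item adds one pair of elements, selected by its global ID.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);  // this instance's index in the NDRange
    c[i] = a[i] + b[i];
}
```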

Work items are organized into groups (work-groups). Groups provide a coarser partitioning of the index space. Each group is assigned a group ID of the same dimensionality as that used to address individual items; each item is also assigned a local ID that is unique within its group. Thus a work item can be addressed either by its global ID or by the combination of its group ID and local ID.

Work items in one group execute concurrently (in parallel) on the PEs of a single compute unit.
Here the unified device model is clearly visible: several PEs form a CU, several CUs form a device, several devices form a heterogeneous system.

The index space in OpenCL 1.0 is called NDRange and can be one-, two-, or three-dimensional. NDRange is an array of N integers specifying the extent of the space in each dimension.
The choice of NDRange dimensionality is a matter of convenience for a particular algorithm: when working with three-dimensional models it is convenient to index by three-dimensional coordinates, while for images or two-dimensional grids a dimensionality of 2 is more convenient. Four-dimensional objects are rare in our world, so the limit is 3. Besides, whatever one may say, the main target of OpenCL at the moment is the GPU. NVIDIA GPUs currently support index dimensionality up to 3 natively, so implementing a higher dimensionality would require tricks and complications in either the CUDA Driver API or the OpenCL implementation.

Execution context and command queues in the OpenCL execution model.


The host defines the context in which the kernels execute. The context includes the following resources:

  - devices: the set of OpenCL devices used by the host;
  - kernels: the OpenCL functions that run on the devices;
  - program objects: the source code and the executables that implement the kernels;
  - memory objects: memory regions visible to the host and to the OpenCL devices.

The context is created and managed through functions of the OpenCL API. To control the execution of kernels on devices, the host creates a data structure called a command queue. The host places commands into the queue, and the scheduler then dispatches them for execution on the devices within the given context.

Commands can be of the following types:

  - kernel execution commands: launch a kernel on the PEs of a device;
  - memory commands: transfer data to or from a memory object, or map and unmap memory objects;
  - synchronization commands: impose constraints on the order in which commands execute.


The command queue schedules commands for execution on the device; they execute asynchronously with respect to the host. Commands can execute relative to each other in two ways:

  - in-order: commands are launched in the order they appear in the queue and complete in that order; a command must finish before the next one begins;
  - out-of-order: commands are launched in order, but the next command does not wait for the previous ones to complete; any ordering constraints are set explicitly by the programmer through synchronization commands.

Multiple command queues can be associated with a single context. Such queues execute concurrently and independently, without any explicit means of synchronization between them.
Using command queues gives greater versatility and flexibility. Modern GPUs have their own scheduler, which decides what to execute, when, and on which compute units. Using a queue does not hamper the work of that scheduler, which maintains its own command queue.

Execution model: kernel categories.


OpenCL kernels fall into two categories:

  - OpenCL kernels: written in the OpenCL C language and compiled by the OpenCL compiler;
  - native kernels: functions accessed through a host function pointer, for example functions defined in the application code or exported by a library.

Memory Model.


A work item executing a kernel has access to four distinct types of memory:

  - global memory: available for reading and writing to all work items in all work-groups;
  - constant memory: a region of global memory that stays constant during kernel execution; it is allocated and initialized by the host;
  - local memory: memory local to a work-group and shared by its work items;
  - private memory: memory private to one work item, invisible to the others.

The specification defines four types of memory but, again, imposes no requirements on how memory is implemented in hardware. All four types may reside in global memory, with the separation performed at the driver level; or, on the contrary, there may be a hard separation of memory types dictated by the device architecture.

The existence of these types of memory is quite logical: a processor core has its own cache, the processor has a shared cache, and the device as a whole has some memory of its own.
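In kernel code the four types correspond directly to address-space qualifiers in OpenCL C; a small fragment for illustration (the argument names are my own):

```c
// Address-space qualifiers in OpenCL C map onto the four memory types.
__kernel void example(__global float *data,      // global memory
                      __constant float *coeffs,  // constant memory
                      __local float *scratch)    // local: shared by the work-group
{
    float tmp = data[get_global_id(0)];          // tmp lives in private memory
    scratch[get_local_id(0)] = tmp * coeffs[0];
}
```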

Programming Model.


The OpenCL execution model supports two programming models: data parallelism (Data Parallel) and task parallelism (Task Parallel); hybrids of the two are also supported. The primary model that defines the design of OpenCL is data parallelism.

Data-parallel programming model.


This model defines a computation as a sequence of instructions applied to a set of elements of a memory object. The index space associated with the OpenCL execution model defines the work items and how the data is distributed among them. In a strictly data-parallel model there is a strict one-to-one correspondence between a work item and the element of the memory object that the kernel processes in parallel. OpenCL implements a relaxed data-parallel model in which this strict one-to-one correspondence is not required.

OpenCL provides a hierarchical data-parallel model, with two ways of specifying the hierarchical division. In the explicit model the programmer specifies both the total number of items to execute in parallel and how those items are divided into work-groups. In the implicit model the programmer specifies only the total number of items to execute in parallel, and the division into work-groups is performed automatically.
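On the host side the difference between the two ways comes down to the local_work_size argument of clEnqueueNDRangeKernel (a sketch, assuming queue and kernel were created earlier):

```c
size_t global = 1024;          /* total number of work items             */
size_t local  = 128;           /* explicit model: group size fixed by us */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

/* Implicit model: local_work_size is NULL, and the OpenCL
   implementation divides the items into work-groups itself. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
```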

Task-parallel programming model.


In this model each kernel instance executes independently of any index space. Logically this is equivalent to executing the kernel on a compute unit (CU) with a work-group consisting of a single item. In this model users express parallelism in the following ways:

  - by using the vector data types implemented by the device;
  - by enqueueing multiple tasks;
  - by enqueueing native kernels developed in a programming model orthogonal to OpenCL.


The existence of two programming models is also a tribute to universality. For modern GPUs and the Cell the first model works well. But not every algorithm can be implemented efficiently within it, and a device may well appear whose architecture is inconvenient for the first model. In that case the second model makes it possible to write applications specific to such an architecture.

What does the OpenCL platform consist of?


The OpenCL platform allows applications to use the host and one or more OpenCL devices as a single heterogeneous parallel computer system. The platform consists of the following components:

  - the platform layer: lets the host discover OpenCL devices and their capabilities and create contexts;
  - the runtime: lets the host manipulate the created contexts (create command queues, memory objects, and so on);
  - the compiler: builds executable programs containing OpenCL kernels, written in a language based on a subset of ISO C99 with extensions for parallelism.

How does all this work?


In the next article I will examine in detail the process of creating an OpenCL application, using as an example one of the applications distributed with the NVIDIA Computing SDK. I will also give examples of optimizing OpenCL applications, based on NVIDIA's recommendations.

Now I will schematically describe the steps involved in creating such an application:
  1. Create a context for executing our program on a device.
  2. Choose the desired device (you can simply pick the device with the most FLOPS).
  3. Initialize the chosen device with the context we created.
  4. Create a command queue from the device ID and the context.
  5. Create a program from source code and the context,
    or from binary files and the context.
  6. Build the program.
  7. Create a kernel.
  8. Create memory objects for the input and output data.
  9. Enqueue a command to write the data from host memory to device memory.
  10. Enqueue a command to execute the kernel we created.
  11. Enqueue a command to read the results back from the device.
  12. Wait for the operations to complete.
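Sketched with the OpenCL 1.0 C API, the steps above look roughly like this (error handling omitted; kernel_src, N, host_in and host_out are assumed to be defined elsewhere, and the kernel name vec_add is my placeholder):

```c
#include <CL/cl.h>

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;                                    /* step 2: pick a device */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context ctx =                                        /* steps 1, 3            */
    clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);      /* step 4 */

cl_program prog =                                       /* step 5                */
    clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);     /* step 6: runtime build */
cl_kernel k = clCreateKernel(prog, "vec_add", NULL);    /* step 7                */

cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, NULL);
cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, NULL);

clEnqueueWriteBuffer(q, in, CL_TRUE, 0, N * sizeof(float), host_in, 0, NULL, NULL);
clSetKernelArg(k, 0, sizeof(cl_mem), &in);
clSetKernelArg(k, 1, sizeof(cl_mem), &out);
size_t global = N;
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);  /* step 10 */
clEnqueueReadBuffer(q, out, CL_TRUE, 0, N * sizeof(float), host_out, 0, NULL, NULL);
clFinish(q);                                            /* step 12: wait         */
```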

It is worth noting that the program is built at run time, essentially a JIT compile. The standard explains that this is done so that the program can be built for the chosen context; it also lets each vendor of an OpenCL implementation optimize the compiler for its device. However, a program can also be created from binaries: build it once on first launch and then reuse it; this possibility is also described in the standard. Be that as it may, good or bad, the compiler is built into the OpenCL platform.

Conclusion


The OpenCL model thus turns out to be very versatile while remaining low-level, allowing applications to be optimized for a specific architecture. It also provides portability when moving from one type of OpenCL device to another. The vendor of an OpenCL implementation can optimize the interaction of its device with the OpenCL API in every possible way, striving for more efficient use of the device's resources. Moreover, a properly written OpenCL application will remain efficient across generations of devices.




Source: https://habr.com/ru/post/72650/

