
AMD APP SDK: Intermediate Language (IL)

When the ATI Stream SDK was renamed to the AMD Accelerated Parallel Processing (APP) SDK, OpenCL became its main GPGPU programming language. However, few people realize that it is still possible to write code for ATI cards using another technology: the AMD Compute Abstraction Layer (CAL) and Intermediate Language (IL). CAL is used to write the host code that runs on the CPU and interacts with the GPU, while IL allows you to write code that runs directly on the GPU.

This article covers the IL technology: its scope, its limitations, and its advantages over OpenCL. Those interested, read on.

Introduction


For starters, here are some comparisons with the Nvidia CUDA SDK:
  1. High level programming language:
    • Nvidia: CUDA C++ extensions
    • AMD: OpenCL 1.1 or Compute Abstraction Layer (CAL)

  2. Low level programming language (pseudo assembler *):
    • Nvidia: Parallel Thread Execution (PTX)
    • AMD: Intermediate Language (IL)

  3. The ratio of “number of parrots per second” (an arbitrary performance unit; for example, hashes per second) to “GPU price”:
    • Nvidia: x
    • AMD: ~ 2x using CAL / IL bundle

* means that although the language looks like an assembler, it is still optimized by the compiler and translated into different machine code for different GPUs.
So how can you get such a performance gain?

Features of the AMD GPU architecture


If you read the Nvidia PTX specification and the AMD IL specification carefully, you will notice that operands in Nvidia PTX are one-component vectors (that is, simple n-bit registers), while AMD IL operands are 4-component vectors of n-bit registers. This becomes clearer if we consider the multiplication operation in both languages:

 # Nvidia PTX
 mul.u32 %r0, %r1, %r2

 # AMD IL
 umul r0.xyzw, r1.xyzw, r2.xyzw

Thus, in one operation (well, almost one) an AMD GPU can change up to four n-bit registers, while an Nvidia GPU changes only one n-bit register (within a single GPU thread). But OpenCL also allows you to declare multicomponent vectors and work with them! Then what is the difference, and why do we need this IL?

Difference from OpenCL


The whole difference comes down to the fact that it was either too difficult or technically impossible for the AMD APP SDK developers to create a compiler that translates code written to the OpenCL specification into AMD IL code. Hence the restrictions on support for the OpenCL standard.

It is worth noting that AMD IL allows you to use some cards from the Radeon HD 3000 Series, and even from the Radeon HD 2000 Series, for GPGPU computing! (To be precise, these are GPUs based on the R600, RV610, RV630 and RV670 chips.)
For brevity, we will refer to all GPUs starting with the Radeon HD 5000 Series (beginning with the Radeon HD 5700 chips) as Evergreen GPUs, because only these cards support some of the more interesting operations.

Before proceeding to explain the principles of writing AMD IL code, I would like to draw your attention to the following.

Features of working with memory


As I already mentioned, an AMD GPU works with 4-component vectors of n-bit registers, where n = 32 (how to work with 64-bit values is discussed further on). This imposes a basic restriction on memory: it can only be allocated in multiples of 16 bytes. Remember also that when loading data from memory, the minimum transfer size is again these 16 bytes. In other words, it does not matter whether you declare your memory as 4-component vectors of 1 byte (char4) or as 4-component vectors of 4 bytes (int4): a single memory exchange operation will always load 16 bytes.
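
A minimal sketch of what this means in IL (register numbers and addresses here are arbitrary): every index into g[] addresses one whole 16-byte vector, regardless of how you intend to interpret its bytes.

 ; each index into g[] addresses one 16-byte (4 x 32-bit) vector
 mov r0.xyzw, g[0]        ; loads bytes 0..15 into r0, even if only r0.x is needed
 mov r1.xyzw, g[1]        ; loads bytes 16..31; g[] is indexed in 16-byte units
 mov g[2].xyzw, r0.xyzw   ; writes are likewise performed 16 bytes at a time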

Further, unlike Nvidia GPUs, AMD GPUs allocate local memory in the global memory area (which means a very slow transfer rate), so forget about local memory; use registers and global memory.

And lastly: again unlike Nvidia GPUs, there is only one global memory working in read-write mode (hereinafter “g[]”), plus many different sources of texture memory (hereinafter “i0”, “i1”, etc.) and constant memory (hereinafter “cb0”, “cb1”, etc.) that are read-only.
A feature of constant memory is caching when all GPU threads access the same data area (in that case it works as fast as registers).
A feature of texture memory is read caching (8 KB, if memory serves, per stream processor) and the ability to address it with real (floating-point) coordinates. When going beyond the texture boundaries, you can either read the boundary element or wrap around and read from the beginning (the coordinate is taken modulo the width/height of the texture).
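
As a quick orientation, here is a hedged sketch of how these memory sources appear in an IL shader (the buffer sizes and formats are arbitrary; full declarations appear in the shaders below):

 dcl_cb cb0[2]    ; constant buffer cb0 with two 16-byte entries, read-only
 dcl_resource_id(0)_type(2d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
                  ; texture i0: 2D, unnormalized coordinates, uint components
 mov r0.xyzw, cb0[1]        ; cached read from constant memory
 mov g[0].xyzw, r0.xyzw     ; g[] is the only read-write memory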

Now let's get to the most interesting part:

Code structure for AMD IL


Work with registers


First, a small explanation of how data is exchanged between registers in operations.
In the output register, each vector component position may contain either the name of a component or the “_” sign, which means that the component will not be changed.
In each input register, each component position may contain the name of any of the four components, or “0”, or “1”. This determines whether an input register component or a constant participates in the operation for the corresponding component of the output register. Let me explain this with an example:

 # r0.x = r1.z
 # r0.y = r1.w
 # r0.w = r1.y
 mov r0.xy_w, r1.zwyy

 # r0.y = 1
 # r0.z = 0
 mov r0._yz_, r1.x100


Shaders


Code for an AMD GPU takes the form of shaders. You can run either a compute shader (Compute Shader, CS) or a pixel shader (Pixel Shader, PS). However, CS is supported only starting with the Radeon HD 4000 Series. The speed of the two is almost the same.

As is well known, the number of simultaneously launched threads on a GPU is determined by the launch parameters: the number of blocks and the number of threads per block. Each GPU multiprocessor (there are 8 or more of them) takes one block for execution, divides the requested number of threads per block into pieces (warps, a multiple of 32 threads), and hands each of its stream processors one warp to execute. Thus, the real number of simultaneously running threads is:

<multiprocessors_count> * <stream_processors_per_multiprocessor_count> * <warp_size>

That is why, for the fastest work, the threads within one warp must perform the same operation without branching; then the whole warp executes the operation in a single pass. For example, a hypothetical GPU with 10 multiprocessors, 16 stream processors per multiprocessor, and a warp size of 32 runs 10 * 16 * 32 = 5120 threads simultaneously.
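
A minimal sketch of what divergence looks like in IL (the condition and the operations are arbitrary): if the threads of one warp take different sides of the branch, the two sides execute one after the other instead of in a single pass.

 ; r0.x is assumed to hold a per-thread condition value
 if_logicalnz r0.x
     umul r1.xyzw, r1.xyzw, r2.xyzw   ; taken by some threads of the warp
 else
     iadd r1.xyzw, r1.xyzw, r2.xyzw   ; taken by the others
 endif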

To avoid considering a spherical horse in a vacuum, let us take a simple task: each thread computes its local identifier within the block (32 bits) and its global identifier (32 bits), reads constants (64 bits in total) from the command memory and from the data memory, and reads an element from the texture (128 bits). It writes all of this into the output memory, which requires 256 bits per thread.
Note: each texture row contains the data for the threads of one block.

Pixel shader


 il_ps_2_0
 ; constant buffer (cb0):
 ; cb0[0].x  - a constant to be read by every thread
 ; cb0[0].y  - the number of threads per block (the texture width)
 ; cb0[0].zw - not used in this example
 dcl_cb cb0[1]
 ; texture declaration (i0)
 ; coordinates are unnormalized (i.e. not mapped to floats from 0 to 1)
 ; all four components are treated as uint
 dcl_resource_id(0)_type(2d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
 ; declare the input pixel coordinates
 dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__
 ; global memory (g[]) needs no declaration
 ; literals, i.e. constants embedded in the code
 dcl_literal l0, 0xFFFFFFFF, 0xABCDEF01, 0x3F000000, 2
 ; convert the float pixel coordinates to integers
 ; r0.x - the x coordinate in i0, i.e. the local thread id within the block
 ; r0.y - the y coordinate in i0, i.e. the block id
 ftoi r0.xyzw, vWinCoord0.xyxy
 ; r0.z - the global thread id (uint): block id * threads per block + local id
 umad r0.__z_, r0.wwww, cb0[0].yyyy, r0.zzzz
 ; assemble the first output vector
 ftoi r1.x___, vWinCoord0.xxxx      ; local thread id
 mov r1._y__, r0.zzzz               ; global thread id
 mov r1.__z_, cb0[0].xxxx           ; constant from the data (constant) memory
 mov r1.___w, l0.yyyy               ; constant from the command memory (literal)
 ; compute the offset in g[]: each thread writes two 16-byte vectors
 umul r0.__z_, r0.zzzz, l0.wwww
 ; write the first vector to global memory
 mov g[r0.z+0].xyzw, r1.xyzw
 ; read an element from texture i0
 ; convert the coordinates to float and add 0.5 (l0.z) to hit the texel center
 itof r0.xy__, r0.xyyy
 add r0.xy__, r0.xyyy, l0.zzzz
 sample_resource(0)_sampler(0)_aoffimmi(0,0,0) r1, r0
 ; sample_resource(0) - read from texture i0
 ; _sampler(0) - use sampler #0
 ; _aoffimmi(0,0,0) - integer offsets along x, y, z
 ; e.g. to read the next element along x use _aoffimmi(1,0,0), along y - _aoffimmi(0,1,0)
 ; write the second vector to global memory
 mov g[r0.z+1].xyzw, r1.xyzw
 ; end of the main procedure
 endmain
 ; end of the shader
 end


Compute shader


The only difference is in how the thread identifiers are computed; the rest is the same.

 il_cs_2_0
 dcl_num_thread_per_group 64
 ; constant buffer (cb0):
 ; cb0[0].x   - a constant to be read by every thread
 ; cb0[0].yzw - not used in this example
 dcl_cb cb0[1]
 ; texture declaration (i0)
 ; coordinates are unnormalized (i.e. not mapped to floats from 0 to 1)
 ; all four components are treated as uint
 dcl_resource_id(0)_type(2d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
 ; global memory (g[]) needs no declaration
 ; literals, i.e. constants embedded in the code
 dcl_literal l0, 0xFFFFFFFF, 0xABCDEF01, 0x3F000000, 2
 ; the block id
 mov r0._y__, vThreadGrpIDFlat.xxxx
 ; the local thread id within the block
 mov r0.x___, vTidInGrpFlat.xxxx
 ; the global thread id
 mov r0.__z_, vAbsTidFlat.xxxx
 ; assemble the first output vector
 mov r1.x___, vTidInGrpFlat.xxxx    ; local thread id
 mov r1._y__, vAbsTidFlat.xxxx      ; global thread id
 mov r1.__z_, cb0[0].xxxx           ; constant from the data (constant) memory
 mov r1.___w, l0.yyyy               ; constant from the command memory (literal)
 ; compute the offset in g[]: each thread writes two 16-byte vectors
 umul r0.__z_, r0.zzzz, l0.wwww
 ; write the first vector to global memory
 mov g[r0.z+0].xyzw, r1.xyzw
 ; read an element from texture i0
 ; convert the coordinates to float and add 0.5 (l0.z) to hit the texel center
 itof r0.xy__, r0.xyyy
 add r0.xy__, r0.xyyy, l0.zzzz
 sample_resource(0)_sampler(0)_aoffimmi(0,0,0) r1, r0
 ; sample_resource(0) - read from texture i0
 ; _sampler(0) - use sampler #0
 ; _aoffimmi(0,0,0) - integer offsets along x, y, z
 ; e.g. to read the next element along x use _aoffimmi(1,0,0), along y - _aoffimmi(0,1,0)
 ; write the second vector to global memory
 mov g[r0.z+1].xyzw, r1.xyzw
 ; end of the main procedure
 endmain
 ; end of the shader
 end


Shader Differences


Besides being supported on different cards, the main difference between the two shader types is where the number of threads launched per block is stored. For a PS this value can be stored in memory, while for a CS it has to be hard-coded into the shader. In addition, it is easier for a CS to compute the thread identifiers.
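
To make this concrete, here is a minimal side-by-side sketch extracted from the examples above (register and buffer layout as assumed there):

 ; CS: the block size is fixed in the shader code
 il_cs_2_0
 dcl_num_thread_per_group 64

 ; PS: the block size arrives at run time, e.g. in cb0[0].y,
 ; and the thread ids must be reconstructed from the pixel coordinates
 il_ps_2_0
 dcl_cb cb0[1]
 ; global id = block id * block size + local id
 umad r0.__z_, r0.wwww, cb0[0].yyyy, r0.zzzz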

Conclusion


In this article I tried to show how to write simple AMD IL code that runs directly on the GPU. In conclusion, a few words about optimizing speed: the main points follow from what was said above. Avoid local memory in favor of registers and global memory, avoid divergent branching within a warp, and remember the 16-byte granularity of memory transfers.

How to transfer data to the card and read it back is described in the second part, about the AMD Compute Abstraction Layer (CAL).

Links for further reading


Source: https://habr.com/ru/post/138954/

