
AMD APP SDK: Compute Abstraction Layer (CAL)

In the first part I talked about the AMD Intermediate Language (IL). In this article, as you can guess from the title, we will discuss the second component: the AMD Compute Abstraction Layer (CAL). These two technologies are inseparable: it is impossible to use one without the other, so for what follows I recommend reading the first part.

I will try to highlight the main aspects of working with an AMD GPU at the top level, and describe the limitations of this technology and the problems you may run into along the way. If you are interested, read on.

In lieu of an introduction


When I first started programming for AMD GPUs, I was asked what I used for it. "ATI CAL," I replied. "Yes, ATI really is CAL," came the answer. (In Russian, CAL read with an "A" sounds like a rather unflattering word.)
In general, I do not know the canonical way to pronounce the abbreviation CAL, but I pronounce it with an "O" so as not to embarrass anyone.
For brevity, I will call the program described in the first part the kernel. By kernel I will mean both the source code of the program and the compiled binary that is loaded onto the GPU. I will not give the full text of a program that works with the GPU through AMD CAL; instead I will walk through the main points of the work.

To get started, we need two header files from the AMD APP SDK:
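 #include "cal.h"   // CAL runtime: devices, contexts, memory, kernel execution
 #include "calcl.h" // CAL compiler: compiling and linking IL kernels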

As you can see, unlike Nvidia CUDA, which offers both a Runtime API and a Driver API, AMD provides only a Driver-level API. For your application to build, do not forget to link against the corresponding libraries (aticalrt for the runtime and aticalcl for the compiler).

Most CAL functions return a value of type CALresult. A total of 11 return codes are defined; the most important for us is CAL_RESULT_OK, equal to 0, which indicates that the call completed successfully.
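Since almost every call returns a CALresult, it is convenient to check calls in one place. A minimal sketch of such a helper (my own convenience macro, not part of the SDK; calGetErrorString is described in the "Useful features" section below):

 #include <stdio.h>

 // print the code and the driver's text description of a failed CAL call
 #define CAL_CHECK( call )                                   \
     do {                                                    \
         CALresult r_ = (call);                              \
         if( r_ != CAL_RESULT_OK )                           \
             fprintf( stderr, "CAL error %d: %s\n",          \
                      (int)r_, calGetErrorString() );        \
     } while( 0 )

Usage: CAL_CHECK( calInit() );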

So let's go.

Driver initialization


Rule number 1: before starting work with the GPU, wash your hands and initialize the driver with the following call:
CALresult result = calInit(); 

Rule number 2: after finishing work with the GPU, do not forget to clean up after yourself and shut the driver down correctly. This is done with the following call:
 CALresult result = calShutdown(); 

These two calls must always come in pairs. There may be several such pairs in a program, but never work with the GPU outside of them: doing so can result in a hardware exception.
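A minimal skeleton that respects both rules might look like this:

 if( calInit() == CAL_RESULT_OK )
 {
     // ... all work with the GPU happens strictly between these two calls ...
     calShutdown();
 }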

Getting GPU info


Find out the number of supported GPUs (it can be less than the total number of AMD GPUs in the system):
 unsigned int deviceCount = 0;
 CALresult result = calDeviceGetCount( &deviceCount );

In this article I will indicate where the GPU identifier is used, but I will always work with the GPU under identifier 0. In general, this identifier takes values from 0 to (deviceCount - 1).

Find out information about the GPU:
 unsigned int deviceId = 0; // GPU identifier
 CALdeviceinfo deviceInfo;
 CALresult result = calDeviceGetInfo( &deviceInfo, deviceId );

 CALdeviceattribs deviceAttribs;
 deviceAttribs.struct_size = sizeof( deviceAttribs );
 result = calDeviceGetAttribs( &deviceAttribs, deviceId );

The most important thing in the CALdeviceinfo structure is the identifier of the GPU chip. It is called Device Kernel ISA here:
 typedef struct CALdeviceinfoRec {
     CALtarget target;            /**< Device Kernel ISA */
     CALuint maxResource1DWidth;  /**< Maximum resource 1D width */
     CALuint maxResource2DWidth;  /**< Maximum resource 2D width */
     CALuint maxResource2DHeight; /**< Maximum resource 2D height */
 } CALdeviceinfo;

The remaining fields of the structure determine the maximum texture dimensions (along each coordinate) that can be allocated on this GPU.

Much more interesting is the CALdeviceattribs structure, which describes the attributes of the GPU (I will show only a few of its fields):
 typedef struct CALdeviceattribsRec {
     CALtarget target;         /**< Asic identifier (same as Device Kernel ISA) */
     CALuint localRAM;         /**< amount of local GPU RAM in megabytes */
     CALuint wavefrontSize;    /**< wavefront size (the analogue of a warp: the number of threads executing physically in parallel) */
     CALuint numberOfSIMD;     /**< number of SIMD engines (multiprocessors) */
     CALboolean computeShader; /**< is Compute Shader supported */
     CALuint pitch_alignment;  /**< required memory alignment for calCreateRes allocations */
     /* other fields */
 } CALdeviceattribs;

Rule number 3: the field CALdeviceattribs.pitch_alignment is measured in memory elements, not in bytes. A memory element is a 1-, 2- or 4-component vector of 8-, 16- or 32-bit values.

And now let's take a closer look at the values the CALdeviceinfo.target field (aka CALdeviceattribs.target) can take:
 /** Device Kernel ISA */
 typedef enum CALtargetEnum {
     CAL_TARGET_600,       /**< R600 GPU ISA */
     CAL_TARGET_610,       /**< RV610 GPU ISA */
     CAL_TARGET_630,       /**< RV630 GPU ISA */
     CAL_TARGET_670,       /**< RV670 GPU ISA */
     CAL_TARGET_7XX,       /**< R700 class GPU ISA */
     CAL_TARGET_770,       /**< RV770 GPU ISA */
     CAL_TARGET_710,       /**< RV710 GPU ISA */
     CAL_TARGET_730,       /**< RV730 GPU ISA */
     CAL_TARGET_CYPRESS,   /**< CYPRESS GPU ISA */
     CAL_TARGET_JUNIPER,   /**< JUNIPER GPU ISA */
     CAL_TARGET_REDWOOD,   /**< REDWOOD GPU ISA */
     CAL_TARGET_CEDAR,     /**< CEDAR GPU ISA */
     CAL_TARGET_RESERVED0,
     CAL_TARGET_RESERVED1,
     CAL_TARGET_WRESTLER,  /**< WRESTLER GPU ISA */
     CAL_TARGET_CAYMAN,    /**< CAYMAN GPU ISA */
     CAL_TARGET_RESERVED2,
     CAL_TARGET_BARTS,     /**< BARTS GPU ISA */
 } CALtarget;

So this field identifies the chip on which the GPU is built. In other words, it is impossible to find out the GPU's marketing name (for example, Radeon HD 3850) through AMD CAL! Such a convenient technology... On the other hand, it was amusing to discover that, for example, the Radeon HD 5750 and the Radeon HD 6750 are in fact the same video card: they differ only slightly in memory frequency (within a few percent).

Another note: this list has no entry for the Evergreen GPUs that I mentioned in the first part. My guess is that the Evergreen family starts with the Cypress chip (CAL_TARGET_CYPRESS); everything before it belongs to the previous generation, without support for the newer goodies (cyclic shifts, operation flags and 64-bit operations).
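To make this concrete, here is a small sketch that enumerates all CAL devices and prints a few of the attributes discussed above (assuming calInit has already been called and <stdio.h> is included):

 unsigned int deviceCount = 0;
 calDeviceGetCount( &deviceCount );
 for( unsigned int id = 0; id < deviceCount; ++id )
 {
     CALdeviceattribs attribs;
     attribs.struct_size = sizeof( attribs );
     if( calDeviceGetAttribs( &attribs, id ) != CAL_RESULT_OK )
         continue;
     printf( "device %u: target=%d, RAM=%u MB, SIMDs=%u, wavefront=%u, CS=%s\n",
             id, (int)attribs.target, attribs.localRAM,
             attribs.numberOfSIMD, attribs.wavefrontSize,
             attribs.computeShader ? "yes" : "no" );
 }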

For further work, we need to create a device descriptor (device) with which we will interact with the GPU:
 unsigned int deviceId = 0; // GPU identifier
 CALdevice device;
 CALresult result = calDeviceOpen( &device, deviceId );
 CALcontext context;
 result = calCtxCreate( &context, device );

The context is what your application uses to work with a given GPU: all interaction with the GPU goes through it. As soon as you destroy the context, all allocated resources are considered freed, and all unfinished tasks on the GPU are forcibly terminated.

Do not forget about the pair calls after finishing work with the device:
 calCtxDestroy( context );
 calDeviceClose( device );

Calls must go in exactly this order, otherwise we get a hardware exception .

So, we have created a device and a context for it; now we can proceed to

Memory allocation


To work with memory you need to allocate a resource. According to the documentation, a resource can live in local memory (the memory of the stream processor itself) and in remote memory (system memory). As I understand it, remote memory is simply the host RAM, while local memory is the GPU's own onboard memory.

Why is remote memory needed if there is local memory? First, to share the same memory among several GPUs: remote memory can be allocated once and then used from multiple GPUs. Secondly, not all GPUs support direct access to their local memory (see "Gaining direct memory access" below).

 CALresource resource;
 unsigned int memoryWidth;
 unsigned int memoryHeight;
 CALformat memoryFormat;
 unsigned int flags;

 // allocation in remote (system) memory
 // 1D resource
 CALresult result = calResAllocRemote1D( &resource, &device, 1, memoryWidth, memoryFormat, flags );
 /* the second and third parameters allow sharing the resource among several GPUs:
    a pointer to an array of devices and the number of devices in that array
    (1 device in our case) */
 // 2D resource
 result = calResAllocRemote2D( &resource, &device, 1, memoryWidth, memoryHeight, memoryFormat, flags );

 // allocation in local (GPU) memory
 // 1D resource
 result = calResAllocLocal1D( &resource, device, memoryWidth, memoryFormat, flags );
 /* note that the device is now passed by value: a local resource
    is allocated on one specific GPU */
 // 2D resource
 result = calResAllocLocal2D( &resource, device, memoryWidth, memoryHeight, memoryFormat, flags );

The width and height of an allocated resource are measured in memory elements.
The element itself is described by the memoryFormat parameter:
 // not the complete list: the 16-bit formats, among others, are omitted here
 /** Data format representation */
 typedef enum CALformatEnum {
     CAL_FORMAT_UNORM_INT8_1,     /**< 1 component, normalized unsigned 8-bit integer value per component */
     CAL_FORMAT_UNORM_INT8_4,     /**< 4 component, normalized unsigned 8-bit integer value per component */
     CAL_FORMAT_UNORM_INT32_1,    /**< 1 component, normalized unsigned 32-bit integer value per component */
     CAL_FORMAT_UNORM_INT32_4,    /**< 4 component, normalized unsigned 32-bit integer value per component */
     CAL_FORMAT_SNORM_INT8_1,     /**< 1 component, normalized signed 8-bit integer value per component */
     CAL_FORMAT_SNORM_INT8_4,     /**< 4 component, normalized signed 8-bit integer value per component */
     CAL_FORMAT_SNORM_INT32_1,    /**< 1 component, normalized signed 32-bit integer value per component */
     CAL_FORMAT_SNORM_INT32_4,    /**< 4 component, normalized signed 32-bit integer value per component */
     CAL_FORMAT_UNSIGNED_INT8_1,  /**< 1 component, unnormalized unsigned 8-bit integer value per component */
     CAL_FORMAT_UNSIGNED_INT8_4,  /**< 4 component, unnormalized unsigned 8-bit integer value per component */
     CAL_FORMAT_SIGNED_INT8_1,    /**< 1 component, unnormalized signed 8-bit integer value per component */
     CAL_FORMAT_SIGNED_INT8_4,    /**< 4 component, unnormalized signed 8-bit integer value per component */
     CAL_FORMAT_UNSIGNED_INT32_1, /**< 1 component, unnormalized unsigned 32-bit integer value per component */
     CAL_FORMAT_UNSIGNED_INT32_4, /**< 4 component, unnormalized unsigned 32-bit integer value per component */
     CAL_FORMAT_SIGNED_INT32_1,   /**< 1 component, unnormalized signed 32-bit integer value per component */
     CAL_FORMAT_SIGNED_INT32_4,   /**< 4 component, unnormalized signed 32-bit integer value per component */
     CAL_FORMAT_UNORM_SHORT_565,  /**< 3 component, normalized 5-6-5 RGB image. */
     CAL_FORMAT_UNORM_SHORT_555,  /**< 4 component, normalized x-5-5-5 xRGB image */
     CAL_FORMAT_UNORM_INT10_3,    /**< 4 component, normalized x-10-10-10 xRGB */
     CAL_FORMAT_FLOAT32_1,        /**< A 1 component, 32-bit float value per component */
     CAL_FORMAT_FLOAT32_4,        /**< A 4 component, 32-bit float value per component */
     CAL_FORMAT_FLOAT64_1,        /**< A 1 component, 64-bit float value per component */
     CAL_FORMAT_FLOAT64_2,        /**< A 2 component, 64-bit float value per component */
 } CALformat;

It is a pity that on older (pre-Evergreen) video cards 64-bit operations can only be performed on float data...

Rule number 4: the element format describes only how the GPU will interpret the data in that element. Physically, an element always occupies 16 bytes of memory. For example, a 100-element buffer of CAL_FORMAT_FLOAT32_1 still takes up 100 * 16 bytes, with only the first 4 bytes of each element carrying data.

This becomes clear if we recall that in the first part we declared the resource as follows:
 dcl_resource_id(0)_type(2d,unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint) 

According to the AMD IL language specification, the fmtx-fmtw values are mandatory. That is, the following code (which might otherwise look like a way to declare a texture of 1-component elements) is invalid:
 dcl_resource_id(0)_type(2d,unnorm)_fmtx(uint) 

Rule number 5: respect the types that you declare in the kernel and when allocating a resource. If they do not match, you will not be able to bind the resource to the kernel.

Rule number 6: for constant memory, the element type must always be float.

Why this is so is unclear to me, since you can still load integer values from constant memory (which is exactly what we do in the example).

A couple more words about the flags that are needed when allocating memory:
 /** CAL resource allocation flags **/
 typedef enum CALresallocflagsEnum {
     CAL_RESALLOC_GLOBAL_BUFFER = 1, /**< used for global import/export buffer */
     CAL_RESALLOC_CACHEABLE     = 2, /**< cacheable memory? */
 } CALresallocflags;

I have never worked with the second flag and do not know when it gives an advantage; judging by the question mark in the authors' own comment, neither do they.
The first flag is needed to allocate the global buffer ("g[]").

Now let's apply the theory in practice. Keeping in mind the example described in the first part, we will also set the kernel launch parameters:
 unsigned int blocks = 4;   // 4 blocks
 unsigned int threads = 64; // 64 threads in each block

 // resource for the constant memory cb0
 CALresource constantResource;
 CALresult result = calResAllocLocal1D( &constantResource, device, 1, CAL_FORMAT_FLOAT32_4, 0 );

 // resource for the texture i0
 CALresource textureResource;
 result = calResAllocLocal2D( &textureResource, device, threads, blocks, CAL_FORMAT_UNSIGNED_INT32_4, 0 );

 // resource for the global buffer g[]
 CALresource globalResource;
 result = calResAllocLocal1D( &globalResource, device, threads * blocks, CAL_FORMAT_UNSIGNED_INT32_4, CAL_RESALLOC_GLOBAL_BUFFER );

Once the resources are no longer needed, they must be freed:
 calResFree( constantResource );
 calResFree( textureResource );
 calResFree( globalResource );

Copying memory


Gaining direct memory access


If the GPU supports mapping its memory (mapping GPU memory addresses into the process's address space), then we can obtain a pointer to this memory and work with it like any other memory:
 unsigned int pitch;
 unsigned char* mappedPointer;
 CALresult result = calResMap( (CALvoid**)&mappedPointer, &pitch, resource, 0 );
 // here we can work with the memory through mappedPointer

And after we finish working with the memory, the pointer must be released:
 CALresult result = calResUnmap( resource ); 

Rule number 7: always remember that alignment must be taken into account when working with GPU memory. This alignment is described by the pitch variable.

Rule number 8: pitch is measured in elements, not in bytes.

Why do you need to know about this alignment? Unlike RAM, GPU memory is not always a contiguous region; this is especially noticeable when working with textures. An example: if you want to work with a 100x100-element texture and calResMap() returns a pitch of 200, the GPU will actually operate on a 200x100 texture, and only the first 100 elements of each line will carry your data.

Copying to GPU memory taking into account the pitch value can be organized as follows:
 unsigned int pitch;
 unsigned char* mappedPointer;
 unsigned char* dataBuffer; // source data in host memory
 CALresult result = calResMap( (CALvoid**)&mappedPointer, &pitch, resource, 0 );

 unsigned int width;
 unsigned int height;
 unsigned int elementSize = 16; // an element always occupies 16 bytes

 if( pitch > width )
 {
     for( unsigned int index = 0; index < height; ++index )
     {
         memcpy( mappedPointer + index * pitch * elementSize,
                 dataBuffer + index * width * elementSize,
                 width * elementSize );
     }
 }
 else
 {
     memcpy( mappedPointer, dataBuffer, width * height * elementSize );
 }

Naturally, the data in dataBuffer must be prepared according to the element type, remembering that an element always occupies 16 bytes.
That is, for an element of format CAL_FORMAT_UNSIGNED_INT16_2, the in-memory byte layout looks like this:

 // w    - a word, 16 bits
 // wi.j - byte j of word i
 // x    - unused byte
 [ w0.0 | w0.1 | x | x ][ w1.0 | w1.1 | x | x ][ x | x | x | x ][ x | x | x | x ]
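Reading the results back from the GPU is the same loop in reverse; a sketch under the same assumptions (width, height and the 16-byte elementSize):

 unsigned int pitch;
 unsigned char* mappedPointer;
 unsigned char* dataBuffer; // destination in host memory, width * height * elementSize bytes
 CALresult result = calResMap( (CALvoid**)&mappedPointer, &pitch, resource, 0 );
 for( unsigned int index = 0; index < height; ++index )
 {
     memcpy( dataBuffer + index * width * elementSize,
             mappedPointer + index * pitch * elementSize,
             width * elementSize );
 }
 result = calResUnmap( resource );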

Copying data between resources


Data is copied not directly between resources but between their context memory handles. The copy operation is asynchronous, so to learn when it has completed, use a synchronization object of type CALevent:
 CALresource inputResource;
 CALresource outputResource;
 CALmem inputResourceMem;
 CALmem outputResourceMem;

 // map the resources into the context
 CALresult result = calCtxGetMem( &inputResourceMem, context, inputResource );
 result = calCtxGetMem( &outputResourceMem, context, outputResource );

 // start the copy
 CALevent syncEvent;
 result = calMemCopy( &syncEvent, context, inputResourceMem, outputResourceMem, 0 );
 // here we can do something useful while the copy is in progress

 // wait for the copy to complete
 while( calCtxIsEventDone( context, syncEvent ) == CAL_RESULT_PENDING );
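Putting the two previous sections together: a typical way to fill a local resource on a GPU without direct-access support is to go through a remote staging resource. A sketch, assuming stagingResource was allocated with calResAllocRemote1D and localResource with calResAllocLocal1D of the same format and size:

 CALmem stagingMem;
 CALmem localMem;
 calCtxGetMem( &stagingMem, context, stagingResource );
 calCtxGetMem( &localMem, context, localResource );

 // fill the staging resource here via calResMap / memcpy / calResUnmap (see above)

 CALevent copyEvent;
 calMemCopy( &copyEvent, context, stagingMem, localMem, 0 );
 while( calCtxIsEventDone( context, copyEvent ) == CAL_RESULT_PENDING );

 // release the context handles once they are no longer needed
 calCtxReleaseMem( context, stagingMem );
 calCtxReleaseMem( context, localMem );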

Compiling and loading the kernel on the GPU


"Koschei's death is at the tip of a needle, the needle is in an egg, the egg is in a duck, the duck is in a hare, the hare is in a chest..."

The process of loading a kernel onto a GPU goes like this: the source text is compiled into an object, then one or several objects are linked into an image, the image is loaded into a module on the GPU, and from the module we obtain a pointer to the kernel entry point (with this pointer we will later launch the kernel for execution).

And now, how it is implemented:
 const char* kernel; // source code of the kernel in IL

 // find out which GPU chip we are compiling for
 unsigned int deviceId = 0; // GPU identifier
 CALdeviceinfo deviceInfo;
 CALresult result = calDeviceGetInfo( &deviceInfo, deviceId );

 // compile the source into an object
 CALobject obj;
 result = calclCompile( &obj, CAL_LANGUAGE_IL, kernel, deviceInfo.target );

 // link the objects into an image
 CALimage image;
 result = calclLink( &image, &obj, 1 );
 // the second parameter is an array of objects, the third is their count

 // the object is no longer needed after linking, free it
 result = calclFreeObject( obj );

 // load the image into a module
 CALmodule module;
 result = calModuleLoad( &module, context, image );

 // get a pointer to the kernel entry point
 CALfunc function;
 result = calModuleGetEntry( &function, context, module, "main" );

Rule number 9: the entry point of the kernel is always the same, since after linking there is only one function left - the function "main".

That is, unlike Nvidia CUDA, there can be only one global main function in an AMD CAL kernel.

As you can see, the compiler can only handle the source code written in IL .

The image has to be loaded into a module because a module lives inside a particular GPU context. Consequently, the described compilation process must be repeated for each GPU (except when you have two identical GPUs: then it is enough to compile and link once, but the image still has to be loaded into a module for each card).
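For the multi-GPU case that means something like the following sketch (contexts[], contextCount and MAX_GPUS are hypothetical, all contexts belonging to GPUs with the same CALtarget):

 // one compiled image, but a separate module for every context
 CALmodule modules[MAX_GPUS];
 for( unsigned int i = 0; i < contextCount; ++i )
     calModuleLoad( &modules[i], contexts[i], image );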

I want to draw attention to the possibility of linking several objects; perhaps someone will find it useful. In my opinion, it fits the case of different implementations of the same subroutine: the implementations can be moved into different object files, since AMD IL has no preprocessor directives like #ifdef. A sketch of such a link follows.
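Here mainSource and helperSource are two hypothetical IL strings:

 CALobject objects[2];
 calclCompile( &objects[0], CAL_LANGUAGE_IL, mainSource,   deviceInfo.target );
 calclCompile( &objects[1], CAL_LANGUAGE_IL, helperSource, deviceInfo.target );

 // link both objects into a single image
 CALimage image;
 calclLink( &image, objects, 2 );
 calclFreeObject( objects[0] );
 calclFreeObject( objects[1] );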

After the kernel has finished executing on the GPU, the corresponding resources must be released:
 CALresult result = calclFreeImage( image );
 result = calModuleUnload( context, module );

Launching the kernel for execution


Setting kernel startup options


So, we have allocated resources, filled memory, and compiled the kernel. It remains to bind the resources to our kernel and launch it. To do this, we need to query the kernel for its parameters by name, and map each resource into the context.
 const char* memoryName; // name of the memory binding as it is declared in the kernel

 // get the kernel parameter by its name
 CALname kernelParameter;
 CALresult result = calModuleGetName( &kernelParameter, context, module, memoryName );
 // map the resource into the context
 CALmem resourceMem;
 result = calCtxGetMem( &resourceMem, context, resource );
 // bind the mapped resource to the kernel parameter
 result = calCtxSetMem( context, kernelParameter, resourceMem );

And now we do it within the framework of our example:
 CALname kernelParameter;
 CALmem resourceMem;

 // bind the constant memory cb0
 CALresult result = calModuleGetName( &kernelParameter, context, module, "cb0" );
 result = calCtxGetMem( &resourceMem, context, constantResource );
 result = calCtxSetMem( context, kernelParameter, resourceMem );

 // bind the texture i0
 result = calModuleGetName( &kernelParameter, context, module, "i0" );
 result = calCtxGetMem( &resourceMem, context, textureResource );
 result = calCtxSetMem( context, kernelParameter, resourceMem );

 // bind the global buffer g[]
 result = calModuleGetName( &kernelParameter, context, module, "g[]" );
 result = calCtxGetMem( &resourceMem, context, globalResource );
 result = calCtxSetMem( context, kernelParameter, resourceMem );

After the kernel finishes executing on the GPU, the resources must be unbound from the kernel. This can be done like this:
 CALname kernelParameter;

 // unbind the constant memory cb0
 CALresult result = calModuleGetName( &kernelParameter, context, module, "cb0" );
 result = calCtxSetMem( context, kernelParameter, 0 );

 // unbind the texture i0
 result = calModuleGetName( &kernelParameter, context, module, "i0" );
 result = calCtxSetMem( context, kernelParameter, 0 );

 // unbind the global buffer g[]
 result = calModuleGetName( &kernelParameter, context, module, "g[]" );
 result = calCtxSetMem( context, kernelParameter, 0 );

Now the kernel knows where to get its data. Only one small thing remains:

Kernel startup


As you remember, in the first part I mentioned the PS and CS shaders. Whether the latter is supported can be found out from the GPU attributes (see above).

PS Launch:
 unsigned int blocks = 4;   // 4 blocks
 unsigned int threads = 64; // 64 threads in each block

 CALdomain domain;
 domain.x = 0;
 domain.y = 0;
 domain.width = threads;
 domain.height = blocks;

 CALevent syncEvent;
 CALresult result = calCtxRunProgram( &syncEvent, context, function, &domain );
 while( calCtxIsEventDone( context, syncEvent ) == CAL_RESULT_PENDING );

Here, function is the kernel entry point that we obtained when loading the kernel onto the GPU (see "Compiling and loading the kernel on the GPU" above).

Rule number 10: a PS kernel does not know the threads value from within; it must be passed to it through memory (in our example this is done through the constant memory).
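In our example that means writing threads into the constant buffer before the launch. A sketch, assuming the kernel from the first part reads the count from the x component of cb0[0]:

 unsigned int pitch;
 unsigned int* constData;
 // cb0 was allocated as CAL_FORMAT_FLOAT32_4 (rule number 6), but we store
 // an integer bit pattern in it and read it back as uint inside the kernel
 calResMap( (CALvoid**)&constData, &pitch, constantResource, 0 );
 constData[0] = threads; // cb0[0].x
 calResUnmap( constantResource );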

CS Launch:
 unsigned int blocks = 4;   // 4 blocks
 unsigned int threads = 64; // 64 threads in each block

 CALprogramGrid programGrid;
 programGrid.func = function;
 programGrid.flags = 0;
 programGrid.gridBlock.width = threads;
 programGrid.gridBlock.height = 1;
 programGrid.gridBlock.depth = 1;
 programGrid.gridSize.width = blocks;
 programGrid.gridSize.height = 1;
 programGrid.gridSize.depth = 1;

 CALevent syncEvent;
 CALresult result = calCtxRunProgramGrid( &syncEvent, context, &programGrid );
 while( calCtxIsEventDone( context, syncEvent ) == CAL_RESULT_PENDING );

Rule number 11: the threads value must match the value hard-coded in the kernel source. The kernel will launch either way, but you can either run out of bounds of the allocated memory (by launching fewer threads than declared in the kernel) or leave part of the input data unprocessed (by launching more threads than declared).

Done! The kernel has run, and if everything went well, the processed data is sitting in the output buffer ("g[]"). It remains only to copy it back out (see the section "Copying memory" above).

Useful features


It remains only to mention a few functions that can be useful in everyday life.
 CALresult result;

 // query the current device status
 CALdevicestatus status;
 status.struct_size = sizeof( status ); // the structure is versioned, like CALdeviceattribs
 result = calDeviceGetStatus( &status, device );

 // force the accumulated command queue to be flushed to the GPU
 result = calCtxFlush( context );

 // get information about a loaded kernel (function obtained earlier via calModuleGetEntry)
 CALfunc function;
 CALfuncInfo functionInfo;
 result = calModuleGetFuncInfo( &functionInfo, context, module, function );
 /* the returned structure describes, among other things, the kernel's
    resource usage (for example, the number of registers used), which is
    handy when optimizing a kernel */

 // get the text description of the last error of the runtime library aticalrt.dll
 const char* errorString = calGetErrorString();

 // the same for the compiler library aticalcl.dll
 const char* compilerErrorString = calclGetErrorString();

Inter-Thread Synchronization


Unlike Nvidia CUDA, you do not need to perform additional actions with the context if you are working with a GPU from different threads. But there are still some limitations.

Rule number 12: the CAL compiler functions are not thread-safe. Within one application, only one thread at a time may work with the compiler, as the sketch below illustrates.
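A sketch assuming a POSIX build (use the equivalent primitive on Windows):

 #include <pthread.h>

 static pthread_mutex_t g_calclLock = PTHREAD_MUTEX_INITIALIZER;

 // only one thread at a time may enter the CAL compiler
 pthread_mutex_lock( &g_calclLock );
 CALresult result = calclCompile( &obj, CAL_LANGUAGE_IL, kernel, target );
 pthread_mutex_unlock( &g_calclLock );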

Rule number 13: the functions of the main CAL library that take a specific context/device descriptor are thread-safe. All other functions are not.

Rule number 14: only one application thread at a time may work with a given context.

Conclusion


I have tried to describe the AMD CAL and AMD IL technologies as accessibly as possible, so that anyone could write a simple application for an AMD GPU from scratch. The main thing is to always remember the golden rule: RTFM!

Hope you enjoyed reading.

Source: https://habr.com/ru/post/139049/

