📜 ⬆️ ⬇️

Analyzing a proprietary application on OpenCL, written using the AMD APP SDK

Consider the following situation: there is an application that uses an AMD GPU for its calculations. As a rule, the most demanding operations are made on the GPU. Therefore, if the application runs faster than its competitors, then you may want to know what algorithm is implemented in this program. But what if the program is proprietary and is distributed under a license prohibiting reverse engineering and disassembling?

In order not to violate the license, you can use one little trick left by the developers of the AMD APP SDK. However, in order for this trick to work, another condition must be met (in addition to the application developers use the specified SDK): the application must use OpenCL for computing on the GPU.

If you carefully read the AMD Accelerated Parallel Processing OpenCL Programming Guide (v1.3f) , then in the section "4.2.1 Intermediate Language and GPU Disassembly" you can find one remarkable environment variable: GPU_DUMP_DEVICE_KERNEL. It can take 3 values:

Experiment


Set the value of the environment variable GPU_DUMP_DEVICE_KERNEL = 3. As an experimental program, let's take an example from the AMD APP SDK - the binary search program BinarySearch.exe . This example is not the most interesting, since the kernel source file for the GPU is already nearby: BinarySearch_Kernels.cl . However, in real life programs do not store such valuable information in the clear, it is either encrypted or stored within the program.
')
So, after running BinarySearch.exe , core dump files appear next to the program.

Here is the original kernel written in OpenCL (file BinarySearch_Kernels.cl ):
__kernel void binarySearch( __global uint4 * outputArray, __const __global uint * sortedArray, const unsigned int findMe, const unsigned int globalLowerBound, const unsigned int globalUpperBound, const unsigned int subdivSize) { unsigned int tid = get_global_id(0); /* lower bound and upper bound are computed from segment and total search space for this pass * The total search space is known from global lower and upper bounds for this pass. */ unsigned int lowerBound = globalLowerBound + subdivSize * tid; unsigned int upperBound = lowerBound + subdivSize - 1; /* Then we find the elements at the two ends of the search space for this thread */ unsigned int lowerBoundElement = sortedArray[lowerBound]; unsigned int upperBoundElement = sortedArray[upperBound]; /* If the element to be found does not lie between them, then nothing left to do in this thread */ if( (lowerBoundElement > findMe) || (upperBoundElement < findMe)) { return; } else { /* However, if the element does lie between the lower and upper bounds of this thread's searchspace * we need to narrow down the search further in this search space */ /* The search space for this thread is marked in the output as being the total search space for the next pass */ outputArray[0].x = lowerBound; outputArray[0].y = upperBound; outputArray[0].w = 1; } } /*    */ 

Here is the generated dump of this kernel in the AMD IL language ( binarySearch_Juniper.il file):
 mdef(16383)_out(1)_in(2) mov r0, in0 mov r1, in1 div_zeroop(infinity) r0.x___, r0.x, r1.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[15] ; Constant buffer that holds ABI data dcl_literal l0, 0x00000004, 0x00000001, 0x00000002, 0x00000003 dcl_literal l1, 0x00FFFFFF, 0xFFFFFFFF, 0xFFFFFFFE, 0xFFFFFFFD dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE, 0x000000FF, 0xFFFFFFFC dcl_literal l3, 0x00000018, 0x00000010, 0x00000008, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0x00000000, 0x00000004, 0x00000008, 0x0000000C dcl_literal l6, 0x00000020, 0x00000020, 0x00000020, 0x00000020 dcl_literal l7, 0x00000018, 0x0000001F, 0x00000010, 0x0000001F dcl_literal l8, 0x80000000, 0x80000000, 0x80000000, 0x80000000 call 1024;$ endmain func 1024 ; __OpenCL_binarySearch_kernel mov r1013, cb0[8].x mov r1019, l1.0000 dcl_max_thread_per_group 256 dcl_raw_uav_id(11) dcl_arena_uav_id(8) mov r0.__z_, vThreadGrpIdFlat0.x mov r1022.xyz0, vTidInGrp0.xyz mov r1023.xyz0, vThreadGrpId0.xyz imad r1021.xyz0, r1023.xyzz, cb0[1].xyzz, r1022.xyzz iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.___w, r0.z ishl r1023.___w, r1023.w, l0.z mov r1018.x___, l0.0000 udiv r1024.xyz_, r1021.xyzz, cb0[10].xyzz imad r1025.xyz0, r1023.xyzz, cb0[10].xyzz, r1022.xyzz dcl_literal l13, 0x00000001, 0x00000001, 0x00000001, 0x00000001; f32:i32 1 dcl_literal l11, 0x00000002, 0x00000002, 0x00000002, 0x00000002; f32:i32 2 dcl_literal l12, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF; f32:i32 4294967295 dcl_cb cb1[6] ; Kernel arg setup: outputArray mov r1.x, cb1[0].x ; Kernel arg setup: sortedArray mov r1.y, cb1[1].x ; Kernel arg setup: findMe mov r1.z, cb1[2].x ; Kernel arg setup: globalLowerBound mov r1.w, cb1[3].x ; Kernel arg setup: globalUpperBound ; Kernel arg setup: subdivSize mov r2.y, cb1[5].x call 1029 ; binarySearch ret endfunc ; __OpenCL_binarySearch_kernel ;ARGSTART:__OpenCL_binarySearch_kernel ;version:2:0:88 ;device:juniper ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwregion:0 ;memory:hwlocal:0 ;pointer:outputArray:i32:1:1:0:uav:11:16:RW ;pointer:sortedArray:i32:1:1:16:uav:11:4:RO ;value:findMe:i32:1:1:32 ;value:globalLowerBound:i32:1:1:48 ;value:globalUpperBound:i32:1:1:64 ;value:subdivSize:i32:1:1:80 ;function:1:1029 ;uavid:11 ;privateid:1 ;ARGEND:__OpenCL_binarySearch_kernel func 1029 ; binarySearch ; @__OpenCL_binarySearch_kernel ; BB#0: ; %entry mov r65.x___, r2.y mov r65.__z_, r1.z mov r65.___w, r1.y mov r66, r1021.xyz0 mov r66.x___, r66.x000 imul r66.x___, r66.x, r65.x iadd r65._y__, r66.x, r1.w mov r66.x___, l11 ishl r66._y__, r65.y, r66.x iadd r66._y__, r65.w, r66.y mov r1010.x___, r66.y uav_raw_load_id(11)_cached r1011.x___, r1010.x mov r66._y__, r1011.x uge r66._y__, r65.z, r66.y if_logicalnz r66.y iadd r65.x___, r65.x, r65.y mov r66._y__, l12 iadd r65.x___, r65.x, r66.y ishl r66.x___, r65.x, r66.x iadd r65.___w, r65.w, r66.x mov r1010.x___, r65.w uav_raw_load_id(11)_cached r1011.x___, r1010.x mov r65.___w, r1011.x ult r65.__z_, r65.w, r65.z if_logicalnz r65.z else mov r1010.x___, r1.x uav_raw_load_id(11)_cached r1011, r1010 mov r66, r1011 iadd r66, r66.0yzw, r65.y000 iadd r66, r66.x0zw, r65.0x00 mov r65.x___, l13 iadd r66, r66.xyz0, r65.000x mov r1011, r66 mov r1010.x___, r1.x uav_raw_store_id(11) mem0, r1010.x, r1011 endif else endif ret endfunc ; binarySearch ;ARGSTART:binarySearch ;uniqueid:1029 ;ARGEND:binarySearch end 

Here is the generated disassembled ISA file ( binarySearch_Juniper.isa file):
 ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 1237488 iConstantsAvailable = 1237456 bConstantsAvailable = 1237520 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00000041 SCOption_R800_UAV_NONARRAY_FIXUP SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(12) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: LSHR R1.x, KC1[0].x, 2 t: MULLO_INT ____, R1.x, KC0[1].x 1 y: ADD_INT ____, R0.x, PS0 2 w: ADD_INT ____, PV1.y, KC0[6].x 3 t: MULLO_INT ____, PV2.w, KC1[5].x 4 y: ADD_INT R1.y, KC1[3].x, PS3 5 x: LSHL ____, PV4.y, 2 6 w: ADD_INT ____, KC1[1].x, PV5.x 7 y: LSHR R0.y, PV6.w, 2 01 TEX: ADDR(64) CNT(1) 8 VFETCH R0.x___, R0.y, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU_PUSH_BEFORE: ADDR(44) CNT(2) KCACHE0(CB1:0-15) 9 z: SETGE_UINT R0.z, KC0[2].x, R0.x 10 x: PREDNE_INT ____, R0.z, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 03 JUMP POP_CNT(1) ADDR(13) 04 ALU: ADDR(46) CNT(7) KCACHE0(CB1:0-15) 11 w: ADD_INT ____, KC0[5].x, R1.y 12 z: ADD_INT R1.z, -1, PV11.w 13 x: LSHL ____, PV12.z, 2 14 z: ADD_INT ____, KC0[1].x, PV13.x 15 y: LSHR R0.y, PV14.z, 2 05 TEX: ADDR(66) CNT(1) 16 VFETCH R0.x___, R0.y, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 06 ALU_PUSH_BEFORE: ADDR(53) CNT(2) KCACHE0(CB1:0-15) 17 w: SETGT_UINT R0.w, KC0[2].x, R0.x 18 x: PREDE_INT ____, R0.w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED 07 JUMP POP_CNT(2) ADDR(13) 08 ALU: ADDR(55) CNT(2) KCACHE0(CB1:0-15) 19 z: LSHR R0.z, KC0[0].x, 4 09 TEX: ADDR(68) CNT(1) 20 VFETCH R0, R0.z, fc175 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 10 ALU: ADDR(57) CNT(4) 21 x: MOV R0.x, R1.yy: MOV R0.y, R1.zw: MOV R0.w, (0x00000001, 1.401298464e-45f).x 11 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1], R0, ARRAY_SIZE(4) MARK VPM 12 POP (2) ADDR(13) 13 NOP NO_BARRIER END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 560;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; texResourceUsage[4] = 0x00000000 ; texResourceUsage[5] = 0x00000000 ; texResourceUsage[6] = 0x00000000 ; texResourceUsage[7] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[4] = 0x00000000 ; fetch4ResourceUsage[5] = 0x00000000 ; fetch4ResourceUsage[6] = 0x00000000 ; fetch4ResourceUsage[7] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ResourcesAffectAlphaOutput[4] = 0x00000000 ResourcesAffectAlphaOutput[5] = 0x00000000 ResourcesAffectAlphaOutput[6] = 0x00000000 ResourcesAffectAlphaOutput[7] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000102 SQ_PGM_RESOURCES:NUM_GPRS = 2 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x800 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true 

I do not know about you, but it would be unpleasant for me if my super algorithm for the GPU could be so easily pulled out of the program and analyzed. Especially if the whole essence of the program would be in this algorithm (smile).

Analysis of the situation


This behavior is typical only for the OpenCL compiler from AMD and only when running the application on an AMD GPU. If the Nvidia OpenCL compiler is installed on the system, then, naturally, no files are generated on the disk.

As you understand, this opportunity was left for the developers to analyze their OpenCL code. After all, the resulting files can be pushed into the profiler and see what operations will be a bottleneck in the program. However, if you do not know about this global variable, you can lose your intellectual property fairly quickly.

If you look closely at the generated binarySearch_Juniper.il file, then the hair can stand on end from this code: the original core for OpenCL can be rewritten in AMD IL language in 20 lines, but not in 100! This suggests that the applications written on OpenCL for AMD GPUs at the moment will not be as fast as applications using AMD IL technology to interact with the GPU.

How to understand written in the file binarySearch_Juniper.il , described here .
How you can use the binarySearch_Juniper.il file in your program is described here .

Source: https://habr.com/ru/post/139499/


All Articles