While optimizing the execution time of an algorithm built around an LDPC decoder, profiling revealed a hot function that calculates the following value:

sign(a) * sign(b) * min(|a|, |b|)

where a and b are integers. The number of calls ran into the millions, and the implementation was quite simple and unsophisticated:
int LLR(int a, int b)
{
    if (a > 0)
        return (b > 0) ? __min(a, b) : -__min(a, -b);
    else
        return (b > 0) ? -__min(-a, b) : __min(-a, -b);
}
The function essentially consists of three consecutive comparison operations. Taking compiler optimization into account, this gives two conditional jumps (if the numbers have different signs) or three (if they have the same sign) to obtain the result. Recalling the potential pipeline problems caused by a large number of conditional branches, it was decided to reduce their number or get rid of them entirely. To evaluate the performance, a small test project was written:

#include <Windows.h>
#include <tchar.h>
#include <stdlib.h>

static inline int LLR(int a, int b)
{
    if (a > 0)
        return (b > 0) ? __min(a, b) : -__min(a, -b);
    else
        return (b > 0) ? -__min(-a, b) : __min(-a, -b);
}

int _tmain(int argc, _TCHAR* argv[])
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    srand(0);
    int x(0);
    __int64 t1, t2;
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    for (size_t i = 0; i < MAXUINT >> 4; i++)
    {
        int a = rand() - RAND_MAX / 2;
        int b = rand() - RAND_MAX / 2;
        /* x += LLR(a, b); */
    }
    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    t2 -= t1;
    QueryPerformanceFrequency((LARGE_INTEGER*)&t1);
    _tprintf_s(_T("%f"), t2 / (t1 * 1.));
    return 0;
}
The build was done in MSVS 2008 in the Release configuration (default settings) for the x86 platform.
To begin with, comment out the call to the computational function in order to estimate the time of the loop with random-number generation alone (unfortunately, the QueryPerformanceCounter() call is itself rather slow and significantly distorts the result if placed inside the loop). Run the project several times to make sure the result is repeatable. On a machine with an Intel Core i7-3770 processor running at 3.4 GHz, the execution time averages 9.2 seconds. If we uncomment the call to the computational function, rebuild the project, and repeat the experiment, we get about 11.5 seconds. The increase in execution time is obvious, and that is what we will fight.
Let's try to get rid of the conditional statements. First, find out the sign of the product of a and b. Computing the product head-on is incorrect because of possible overflow. Since only the sign matters to us (that is, the value of the sign bit of the integer), it is appropriate to use the xor operation to determine the sign of the product of a and b. Next, we shift the result of a xor b right by 31 bits (leaving only the sign bit) and get 0 for a non-negative number and 1 for a negative one. Multiply this value by two, subtract it from one, and we get -1 for a negative number and 1 for a non-negative one:
int sign = 1 - 2*(((unsigned)(a^b))>>31);
Now we calculate the absolute values of a and b. Using the same method, we determine the sign coefficients of a and b and multiply by them:
int c;
c = 1 - 2*(((unsigned)a)>>31);
a *= c;
c = 1 - 2*(((unsigned)b)>>31);
b *= c;
Let's turn to the calculation of the minimum. Since a and b already hold non-negative values, the minimum can be computed from the sign of their difference:
int numbers[2], min;
numbers[0] = b;
numbers[1] = a;
a -= b;
c = ((unsigned)a)>>31;
min = numbers[c];
Let's put it all together and get the following function:

static inline int LLR_2(int a, int b)
{
    int sign, numbers[2];
    sign = 1 - 2*(((unsigned)(a^b))>>31);
    a *= 1 - 2*(((unsigned)a)>>31);
    b *= 1 - 2*(((unsigned)b)>>31);
    numbers[0] = b;
    numbers[1] = a;
    a -= b;
    return sign * numbers[((unsigned)a)>>31];
}
Replace the LLR() call with LLR_2() and see what happens. The assembly code now contains not a single conditional jump, but three integer multiplication instructions (imul) have appeared (the compiler replaces multiplication by 2 with a shift on its own). Running the test program, we get the following result: the execution time is 9.4 seconds! Comparing the net times of the two variants of calculating the desired value (2.1 and 0.2 seconds, respectively), we get roughly a tenfold* increase in the speed of the required operation.
Let's try to go a little further. The integer multiplication is needed only to change the sign of a number, which can be done directly. In particular, the absolute value of a can be computed as follows:
unsigned int mask[] = {0, (unsigned)-1};
unsigned int constant[] = {0, 1};
int c = ((unsigned)a)>>31;
a = (a^mask[c]) + constant[c];
Finally, we replace the right shift by 31 bits with a single left circular shift using the _rotl() function. Looking ahead: the compiler converts this call directly into a rol instruction without using call. Putting it all together again, we get the third variant:
static unsigned int mask[] = {0, (unsigned)-1};
static unsigned int constant[] = {0, 1};

static inline int LLR_3(int a, int b)
{
    int sign, c, numbers[2];
    sign = _rotl(a^b, 1) & 1;
    c = _rotl(a, 1) & 1;
    a = (a^mask[c]) + constant[c];
    c = _rotl(b, 1) & 1;
    b = (b^mask[c]) + constant[c];
    numbers[0] = b;
    numbers[1] = a;
    a -= b;
    c = _rotl(a, 1) & 1;
    return (numbers[c]^mask[sign]) + constant[sign];
}
Replace the LLR_2() call with LLR_3() and see that this gives no significant increase (the execution time is approximately the same 9.4 seconds, with a slight improvement in the third digit). It turns out that imul runs quite fast on modern processors!
Let's return to the algorithm that started it all. The described optimization of just one function reduced the processing time of a single data block from 160 to 140 seconds (when executed on the i7), which is more than 10% and a very good result. Several more similar functions are next in line...
And finally, I offer a variant of a function that determines the maximum of two 32-bit integers. It was written purely out of academic curiosity. Those interested may hold off on looking under the spoiler and try to implement it themselves. The variant without conditional jumps runs about three times faster than the standard __max() macro (when executed on the i7). Good luck optimizing your programs!
Hidden text

static inline int max(int a, int b)
{
    int numbers[2][2];
    numbers[0][0] = b; numbers[0][1] = a;
    numbers[1][0] = a; numbers[1][1] = b;
    int sign_a  = (unsigned)a>>31;
    int sign_b  = (unsigned)b>>31;
    int sign_ab = (unsigned)(a^b)>>31;
    int abs_a = (1 - 2*sign_a)*a;
    int abs_b = (1 - 2*sign_b)*b;
    int sign_a_b = (unsigned)(abs_a - abs_b)>>31;
    int c0 = sign_ab;
    int c1 = ((1^(sign_a_b)^(sign_a | sign_b)) & (1^c0)) | (sign_ab & sign_a);
    int result = numbers[c0][c1];
    return result;
}
* UPD. In the comments, user shodan gave an example in which rand() is removed from the timed loop and the data is read from a pre-generated array. In that case, the performance gain is approximately threefold. Apparently, this is due to more efficient pipeline operation when reading data from an array.