Just recently it has been a year since our first application appeared in the App Store. At first it was quite difficult to figure everything out, especially since before this I had not developed applications for Mac OS at all. A lot has been written this year. Unfortunately, I cannot name the applications we wrote (I do not remember them all, and management does not approve of such things), but I can tell you about several ways to optimize applications for this platform.
About half a year ago (or even more) I had to write an application whose main task was sound processing. A simple engine was written for it that did all the work. The application was released, and gradually this engine started to be used in other applications of the same kind. Recently, however, development of the second version of this program began. The requirements have grown, while the resources of older iPhones have not changed. That is where I had to start looking for ways to improve the already written code.
Compiler Settings (thumb)
The first thing that comes to mind is to squeeze everything we can out of the compiler. Perhaps the most important setting here is whether to compile the application in Thumb mode. When this mode is enabled, a reduced instruction set is used: the code is encoded more compactly, but it cannot use all of the processor's resources. In particular, the VFP floating-point unit cannot be used directly. In places where we perform operations on floating-point numbers, you may find code like this:
double prevTime = CFAbsoluteTimeGetCurrent();
{
    ...
}
double nextTime = CFAbsoluteTimeGetCurrent();
double dt = nextTime - prevTime;
printf("dt=%f", dt);
After compilation in Thumb mode, it will look something like this:
blx L_CFAbsoluteTimeGetCurrent$stub
mov r5, r1
blx L_CFAbsoluteTimeGetCurrent$stub
mov r3, r5
mov r2, r4
blx L___subdf3vfp$stub
ldr r6, L7
mov r2, r1
mov r1, r0
mov r0, r6
blx L_printf$stub
With Thumb mode disabled, the code looks like this:
bl L_CFAbsoluteTimeGetCurrent$stub
fmdrr d8, r0, r1
bl L_CFAbsoluteTimeGetCurrent$stub
fmdrr d6, r0, r1
ldr r0, L7
fsubd d7, d6, d8
fmrrd r1, r2, d7
bl L_printf$stub
As you can see, the difference is quite significant: there is no extra function call, and the floating-point subtraction is done in place on the VFP (fsubd) instead of through a library routine (__subdf3vfp). You will probably ask whether it actually works faster. The answer is, naturally, yes, it does. However, if your program does not perform heavy calculations, Thumb mode will do just fine. As a bonus, Thumb produces more compact code, which means that in theory the program will load faster.
By the way, Xcode lets you set compiler options for each file individually, so Thumb mode can be disabled (or, conversely, enabled) only for specific parts of the project, which is quite convenient.
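For reference, this is roughly where those switches live; the exact UI path differs between Xcode versions, so treat this as a pointer rather than a recipe:

GCC_THUMB_SUPPORT = NO     // target-wide "Compile for Thumb" build setting
-mno-thumb                 // per-file: add to the file's Additional Compiler Flags to build just that file as full ARM code
-mthumb                    // per-file: or, conversely, force Thumb for a single file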
Algorithm optimization
The next step in speeding up the calculations is to throw out as many floating-point operations as possible and instead turn the numbers into integers multiplied by a scale factor. Naturally, it is better to choose a factor that is a power of two, so that getting the real values back later is just a shift.
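A minimal sketch of what this means in code (the helper names and the 2^10 factor here are mine, not from the original project):

#include <stdint.h>

#define FIXED_SHIFT 10                        /* scale factor = 2^10 = 1024 */

/* float -> fixed point: multiply by the scale factor once, up front */
static inline int32_t to_fixed(float x)   { return (int32_t)(x * (1 << FIXED_SHIFT)); }

/* fixed point -> float: only needed at the edges (output, debugging) */
static inline float   to_float(int32_t x) { return (float)x / (1 << FIXED_SHIFT); }

/* multiplying two fixed-point values doubles the scale, so shift the
   product back down; the widening to 64 bits simply guards against
   overflow in this generic helper */
static inline int32_t fixed_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FIXED_SHIFT);
}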
So, we have forced the compiler to use all of the CPU's resources where it matters and, where possible, got rid of floating-point operations. Now let's take a look at the ARMv6 specification (for example, here). If you read the instruction descriptions carefully, you will find a lot of interesting commands there (many of which, by the way, are also unavailable in Thumb mode).
For example, suppose you need to implement a simple low-pass or high-pass filter. The algorithm ultimately boils down to evaluating the following formula:
tmp = b0*in0 + b1*in1 + b2*in2 - a1*out1 - a2*out2;
(b0, b1, b2, a1, a2 are constants for a given cut-off frequency)
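For reference, here is the plain floating-point loop the formula comes from; the function and variable names are mine, and the way the delay line is shifted is just the usual textbook form, not necessarily the original code:

/* Filter coefficients for the chosen cut-off frequency. */
static float b0, b1, b2, a1, a2;

/* One filter step: in0 is the current input sample,
   in1/in2 the two previous inputs, out1/out2 the two previous outputs. */
static float filter_step(float in0)
{
    static float in1, in2, out1, out2;

    float tmp = b0*in0 + b1*in1 + b2*in2 - a1*out1 - a2*out2;

    /* shift the delay lines for the next sample */
    in2 = in1;   in1  = in0;
    out2 = out1; out1 = tmp;

    return tmp;
}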
Now look at the description of the smlad instruction. It multiplies the two signed 16-bit halves of its operands, then sums both products together with the accumulator register you specify. The formula looks like this (bit ranges in square brackets):
result[0:31] = a[0:15]*b[0:15] + a[16:31]*b[16:31] + acc[0:31]
That is, the formula itself can be computed in just three operations. The only question left is how to use this instruction. I have plenty of assembler experience going back to the DOS days, and gcc has excellent support for inline assembly. So let's write a function that uses this command:
inline int SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
{
    int32_t result;
    /* smlad: result = x[0:15]*y[0:15] + x[16:31]*y[16:31] + addVal */
    asm volatile ( "smlad %0, %1, %2, %3"
        : "=r" (result)                     /* output */
        : "r" (x), "r" (y), "r" (addVal)    /* inputs */
    );
    return result;
}
By the way, for convenience you can also make a version of the function for the simulator; otherwise testing becomes inconvenient. I did it like this:
#if defined __arm__

/* Device build: use the real ARMv6 instructions. */

inline int SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
{
    int32_t result;
    /* smlad: x[0:15]*y[0:15] + x[16:31]*y[16:31] + addVal */
    asm volatile ( "smlad %0, %1, %2, %3"
        : "=r" (result)
        : "r" (x), "r" (y), "r" (addVal)
    );
    return result;
}

inline int SignedMultiplyAcc(int32_t x, int32_t y, int32_t addVal)
{
    int32_t result;
    /* mla: x*y + addVal */
    asm volatile ( "mla %0, %1, %2, %3"
        : "=r" (result)
        : "r" (x), "r" (y), "r" (addVal)
    );
    return result;
}

#else

/* Simulator build: plain C equivalents of the same operations. */

inline int SignedMultiplyAcc(int32_t x, int32_t y, int32_t addVal)
{
    return x * y + addVal;
}

inline int SignedMultiplyAccDual(int32_t x, int32_t y, int32_t addVal)
{
    int32_t result;
    result  = (int16_t)(x & 0xFFFF) * (int16_t)(y & 0xFFFF);  /* low halves  */
    result += (int16_t)(x >> 16)    * (int16_t)(y >> 16);     /* high halves */
    result += addVal;
    return result;
}

#endif
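For this to work, the coefficients and the filter state have to sit in memory as packed 16-bit fixed-point values, so that each pointer cast in the snippet below loads a pair of them at once; and since smlad only adds, either a1/a2 or the stored outputs must already carry the minus sign. The exact declarations are not shown in the original, so this is only my guess at the layout:

#define PARAMS_SHL_VAL 10   /* coefficients scaled by 2^10, matching the asr #10 below */

/* b0, b1, b2, a1, a2 as 16-bit fixed-point values; &fParamsHigh[1] cast
   to int32_t* picks up the pair (b1, b2), &fParamsHigh[3] the pair (a1, a2) */
int16_t fParamsHigh[5];

/* in0, in1, in2, out1, out2: current input, previous inputs, previous outputs */
int16_t fValsHigh[5];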
As a result, the calculation of our formula will look like this:
tmp = fParamsHigh[0]*fValsHigh[0];                                                         /* b0*in0 */
tmp = SignedMultiplyAccDual(*(int32_t *)&fParamsHigh[1], *(int32_t *)&fValsHigh[1], tmp);  /* b1, b2 terms */
tmp = SignedMultiplyAccDual(*(int32_t *)&fParamsHigh[3], *(int32_t *)&fValsHigh[3], tmp);  /* a1, a2 terms */
tmp = tmp >> PARAMS_SHL_VAL;                                                               /* drop the fixed-point scale */
Let's take a look at the disassembly:
ldrh r3, [r4, #196]
ldrh r0, [r4, #206]
ldr r2, [r4, #208]
smulbb r3, r3, r0
smlad r3, r1, r2, r3
ldr r1, [r4, #202]
ldr r2, [r4, #212]
smlad r3, r1, r2, r3
mov r3, r3, asr #10
Everything is beautiful and clear. As a friend of mine put it: "loaded it, crunched it, stored it, spat it out." It is better not to look at what was there before; it was sheer horror. In my program I had two channels, each with a delay effect. Each such effect needed two filters (one low-pass and one high-pass), four filters in total. After the optimization, Instruments shows that the program eats ~35% of processor time instead of ~45%. A pretty good result :)
By the way, while reading the documentation I was surprised to discover that there is no integer division instruction. After slightly modifying the linear interpolation algorithm accordingly (it is used for resampling on all active channels), the load dropped further, to about 30% :)
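The article does not show that modification, but the usual way to drop both the division and the floats from linear interpolation is to keep the read position in a fixed-point phase accumulator. Here is a sketch of that idea under my own names and assumptions, not the original code:

#include <stdint.h>

#define FRAC_BITS 12    /* 12 fractional bits keep (s1 - s0) * frac inside 32-bit range */

/* Linear-interpolation resampler with no division inside the loop.
   step = (srcRate / dstRate) * 2^FRAC_BITS is computed once outside;
   the caller must make sure src holds enough samples for the last read. */
void resample_linear(const int16_t *src, int16_t *dst, int dstLen, uint32_t step)
{
    uint32_t pos = 0;                                  /* fixed-point read position */
    for (int i = 0; i < dstLen; i++) {
        uint32_t idx  = pos >> FRAC_BITS;              /* integer sample index */
        int32_t  frac = pos & ((1 << FRAC_BITS) - 1);  /* fractional part      */
        int32_t  s0 = src[idx];
        int32_t  s1 = src[idx + 1];
        /* s0 + (s1 - s0) * frac, all in integer arithmetic */
        dst[i] = (int16_t)(s0 + (((s1 - s0) * frac) >> FRAC_BITS));
        pos += step;
    }
}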
So, a couple of simple and fairly obvious optimizations reduced the CPU load by about a third.
P.S. Everything was tested on an iPhone 3G.