Mixing colors correctly, or optimizing AlphaBlend

I am writing a multi-protocol messenger (alas, not multi-platform: Windows only for now), which so far supports only the TOX protocol. This article is not about the messenger, though, but about its interface, and more specifically about its central function: AlphaBlend. Yes, I decided to write my own ~~reinvented wheel~~ GUI. And what is a modern GUI without translucent elements and smooth curves? Hence the pressing need to composite images with translucency, i.e. alpha blending. Fortunately, Windows GDI has such a function: AlphaBlend. It works as it should and does what is needed. But I am an incorrigible reinventor of wheels, and I wondered whether I could write the same function, only faster. The results of my efforts follow.

Alpha blending theory

You most likely know this theory already, so I will not go into detail and will only note the main points.

So, we have two pixels: a source pixel and a destination pixel. They must be blended to produce the new destination pixel. Each pixel is represented by 4 bytes, A, R, G, B, where A is the opacity of the pixel (0 is completely transparent, 255 is completely opaque) and R, G, B are the color components. The classic blending formula is:

TGT_COLOR = TGT_COLOR * (1 - SRC_ALPHA) + SRC_COLOR * SRC_ALPHA

An important point! Note the 1 in the formula. In practice, the value 255 plays the role of one. That is, to apply the formula we must first divide each byte value by 255. As is easy to see, 255 and 256 are fairly close, and division by 256 is just an 8-bit right shift. Hence a common simplification: instead of the operation

 (X) * (A/255.0)

do the following:

  (X * A) >> 8

This works well (and, most importantly, much faster than an honest division), but for alpha blending the result is not quite correct: the resulting pixel comes out slightly darker. Below I will show how to do the calculation accurately without losing speed.
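The darkening is easy to confirm by brute force. The following C++ sketch compares the shift shortcut with a correctly rounded division by 255 over all byte pairs. The function names are made up for illustration and do not come from the article.

```cpp
#include <cassert>
#include <cstdint>

// Exact scaling of component x by alpha a (both 0..255), rounded to nearest.
inline uint32_t scale_exact(uint32_t x, uint32_t a) {
    assert(x <= 255 && a <= 255);
    return (x * a + 127) / 255;
}

// The common shortcut: divide by 256 (a shift) instead of by 255.
inline uint32_t scale_shift(uint32_t x, uint32_t a) {
    assert(x <= 255 && a <= 255);
    return (x * a) >> 8;
}

// Counts the (x, a) pairs for which the shortcut yields a darker value.
inline int count_darker_pairs() {
    int darker = 0;
    for (uint32_t x = 0; x < 256; ++x)
        for (uint32_t a = 0; a < 256; ++a)
            if (scale_shift(x, a) < scale_exact(x, a)) ++darker;
    return darker;
}
```

For instance, scale_shift(255, 255) yields 254 where scale_exact(255, 255) yields 255, so a fully opaque white pixel loses one step of brightness.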

Another important point! Look at the formula again. Its second term contains SRC_COLOR * SRC_ALPHA. 3D accelerators perform such multiplications by the millions and even billions without blinking an eye. But we are trying to solve the problem on the CPU, and superfluous multiplications (4 extra ones per pixel, to be precise) are not a good thing. Why superfluous? Because this multiplication can be done in advance by converting the source image. Such images even have a name: premultiplied. I do not know an established Russian term for it, but translated literally it is "pre-multiplied". And indeed, the GDI function AlphaBlend strictly requires a premultiplied source image. That is reasonable.

Well, that is it for the theory. In practice we will work with 32-bit color. One pixel is a 32-bit number whose 4 bytes, starting from the least significant, mean B(lue), G(reen), R(ed), A(lpha). Let's go.
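A minimal sketch of the one-time conversion to premultiplied form might look like this. It assumes alpha sits in the top byte of a 32-bit pixel with R, G, B below it; the function name premultiply and the rounding choice are illustrative assumptions, not taken from the article.

```cpp
#include <cassert>
#include <cstdint>

// Converts one straight-alpha pixel (A in bits 24..31, then R, G, B)
// to premultiplied form: every color component is scaled by A / 255.
inline uint32_t premultiply(uint32_t argb) {
    const uint32_t a = argb >> 24;
    auto mul = [a](uint32_t c) { return (c * a + 127) / 255; };  // rounded
    const uint32_t r = mul((argb >> 16) & 0xff);
    const uint32_t g = mul((argb >> 8) & 0xff);
    const uint32_t b = mul(argb & 0xff);
    assert(r <= a && g <= a && b <= a);  // premultiplied colors never exceed alpha
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

Running it once over the source image moves the SRC_COLOR * SRC_ALPHA multiplication out of the per-pixel blending loop.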
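The byte layout can be captured with a few accessors. The article uses the names ALPHA, RED, GREEN and BLUE without showing their definitions, so the versions below are assumptions; make_pixel is a hypothetical helper added for symmetry.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definitions of the accessors the article refers to.
inline uint32_t BLUE (uint32_t c) { return  c        & 0xff; }  // byte 0
inline uint32_t GREEN(uint32_t c) { return (c >> 8)  & 0xff; }  // byte 1
inline uint32_t RED  (uint32_t c) { return (c >> 16) & 0xff; }  // byte 2
inline uint32_t ALPHA(uint32_t c) { return  c >> 24; }          // byte 3

// Hypothetical helper: packs four components back into one pixel.
inline uint32_t make_pixel(uint32_t a, uint32_t r, uint32_t g, uint32_t b) {
    assert(a < 256 && r < 256 && g < 256 && b < 256);
    return b | (g << 8) | (r << 16) | (a << 24);
}
```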

First implementation

My first implementation was:

    uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    {
        uint8 ba = ALPHA(src);      // ALPHA extracts the top byte of the 32-bit value
        if (ba == 0) return dst;    // alpha == 0 means the premultiplied color == 0, so nothing changes
        float a = (float)((double)(ba) * (1.0 / 255.0));  // alpha as a fraction in 0..1
        float not_a = 1.0f - a;
        // the formula itself:
        uint B = lround(float(BLUE(dst)) * not_a) + BLUE(src);    // BLUE extracts byte 0 of the 32-bit value
        uint G = lround(float(GREEN(dst)) * not_a) + GREEN(src);  // GREEN extracts byte 1 of the 32-bit value
        uint R = lround(float(RED(dst)) * not_a) + RED(src);      // RED extracts byte 2 of the 32-bit value
        uint A = lround(float(ALPHA(dst)) * not_a) + ALPHA(src);
        return B | (G << 8) | (R << 16) | (A << 24);  // pack the components into a 32-bit pixel
    }

I agree, it does not look great. 4 floating-point multiplications (5, to be precise) and 4 roundings per pixel is too much. Not surprisingly, this monster is about 7 times slower than AlphaBlend.

Let's try to improve it by getting rid of the floating-point multiplications.

    uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    {
        uint not_a = 256 - ALPHA(src);
        return src + (((not_a * BLUEx256(dst)) >> 16) |
                      (((not_a * GREENx256(dst)) >> 8) & 0xff00) |
                      (((not_a * REDx256(dst))) & 0xff0000) |
                      (((not_a * ALPHAx256(dst)) << 8) & 0xff000000));
    }

Here, the functions BLUEx256, GREENx256, etc. return the corresponding component shifted left by 8 bits, i.e. multiplied by 256.

This function is notable in that it compensates for replacing the division by 255 with an 8-bit right shift. Did you notice? If not, be patient; I describe this in more detail below.

This implementation is about 3 times slower than AlphaBlend. Better already, but still very far from ideal.
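For experimenting with the integer variant, here is a self-contained sketch. The article does not show BLUEx256 and its siblings, so the definitions below are assumptions derived from the description "component multiplied by 256"; the function is renamed alphablend_pm_shift to mark it as an illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definitions: each returns its component multiplied by 256.
inline uint32_t BLUEx256 (uint32_t c) { return (c & 0xffu) << 8; }
inline uint32_t GREENx256(uint32_t c) { return  c & 0xff00u; }
inline uint32_t REDx256  (uint32_t c) { return (c >> 8)  & 0xff00u; }
inline uint32_t ALPHAx256(uint32_t c) { return (c >> 16) & 0xff00u; }

// Blends a premultiplied src pixel over dst using integer math only.
inline uint32_t alphablend_pm_shift(uint32_t dst, uint32_t src) {
    const uint32_t not_a = 256 - (src >> 24);  // 256, not 255: the compensation
    const uint32_t scaled =
        ((not_a * BLUEx256(dst)) >> 16) |
        (((not_a * GREENx256(dst)) >> 8) & 0x0000ff00u) |
        ((not_a * REDx256(dst)) & 0x00ff0000u) |
        (((not_a * ALPHAx256(dst)) << 8) & 0xff000000u);
    // With a premultiplied src, each channel sum fits in 8 bits,
    // so one 32-bit addition blends all four channels at once.
    return src + scaled;
}
```

As a quick check, blending the premultiplied half-transparent black 0x80000000 over opaque white 0xFFFFFFFF gives 0xFF7F7F7F, and a fully opaque source replaces the destination unchanged.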

Unexpected result

How can the previous function be improved? It seems we have done everything we could. Yet I managed to improve it in a way that surprised me. I tried it only to confirm that it would not help. It did help.
What if we move the byte-by-byte multiplication into a table? It will not be very large: only 65,536 bytes. Peanuts.

We get the following table:

 uint8 __declspec(align(256)) multbl[256][256];

Fill in:

    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
        {
            int k = i * j / 255;
            multbl[i][j] = (uint8)k;
        }

We try:

    uint32 ALPHABLEND_PM(uint32 dst, uint32 src)
    {
        uint8 not_a = 255 - ALPHA(src);
        return src + ((multbl[not_a][dst & 0xff]) |
                      (((uint)multbl[not_a][(dst >> 8) & 0xff]) << 8) |
                      (((uint)multbl[not_a][(dst >> 16) & 0xff]) << 16) |
                      (((uint)multbl[not_a][(dst >> 24) & 0xff]) << 24));
    }

Surprisingly, this function ran one and a half times faster than the previous implementation. There is one subtlety, though: the compiler (MSVC 2013 in my case) did a very good job with the memory operations. When I tried to write this function in bare assembly, doing everything (as it seemed to me) much better than the optimizer, I got a function that ran half as fast as this one. It was a fiasco. I never figured out exactly where I went wrong (apparently I failed to parallelize the operations properly), so I simply left this function to the optimizer.

So. There is nothing left to optimize; no more ideas come to mind. Yet AlphaBlend is still twice as fast. How did they do it? Is it time to retire?
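A portable, self-contained sketch of the table variant is given below. It swaps the MSVC-specific __declspec(align(256)) for a standard alignas with a smaller alignment and builds the table lazily on first use; both choices, as well as the names MulTable, multbl() and alphablend_pm_table, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// 64 KB table of truncated byte products: v[i][j] == i * j / 255.
struct MulTable {
    alignas(64) uint8_t v[256][256];
    MulTable() {
        for (int i = 0; i < 256; ++i)
            for (int j = 0; j < 256; ++j)
                v[i][j] = static_cast<uint8_t>(i * j / 255);
        assert(v[255][255] == 255 && v[0][255] == 0);
    }
};

// Built once, on first use.
inline const MulTable& multbl() {
    static const MulTable table;
    return table;
}

// Blends a premultiplied src pixel over dst with four table lookups.
inline uint32_t alphablend_pm_table(uint32_t dst, uint32_t src) {
    const uint8_t* row = multbl().v[255 - (src >> 24)];
    return src + (static_cast<uint32_t>(row[dst & 0xff]) |
                  (static_cast<uint32_t>(row[(dst >> 8) & 0xff]) << 8) |
                  (static_cast<uint32_t>(row[(dst >> 16) & 0xff]) << 16) |
                  (static_cast<uint32_t>(row[(dst >> 24) & 0xff]) << 24));
}
```

Fetching the row pointer once per pixel keeps the four lookups within a single 256-byte row of the table.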

On compensating for replacing division by 255 with a shift

There are many ways to divide by 255 quickly. I have come across this one:

 X/255 == (X+1+(X>>8)) >> 8

Not bad. It is faster than an honest division by 255, but still too cumbersome. I thought for a long time about how to divide by 255 quickly without losing quality or speed. How can the color degradation caused by the shift be compensated for?
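The identity above holds for truncating integer division over the whole range that a product of two bytes can take (0 to 65025). A small exhaustive check, with hypothetical function names, could look like this:

```cpp
#include <cassert>
#include <cstdint>

// Division by 255 without a divide instruction; valid for x in 0..65025.
inline uint32_t div255_fast(uint32_t x) {
    assert(x <= 255u * 255u);
    return (x + 1 + (x >> 8)) >> 8;
}

// Returns the number of inputs for which the shortcut disagrees with x / 255.
inline int count_div255_mismatches() {
    int bad = 0;
    for (uint32_t x = 0; x <= 255u * 255u; ++x)
        if (div255_fast(x) != x / 255) ++bad;
    return bad;
}
```

The shortcut costs two additions and two shifts per component, which is why it still feels heavy when applied four times per pixel.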

Suppose we have a color component equal to 0xff (255) and we have another component also equal to 0xff (255). Multiplying them, we get:

0xff * 0xff = 0xfe01. Shifting right by 8 bits gives 0xfe: the brightness of the component has dropped. Bad.
But what if we increase one of the components by 1 before multiplying?
0xff * 0x100 = 0xff00. Hmm, that looks right. Let's check the case where one of the components is 0:
0xff * 1 = 0x00ff; shifting right by 8 bits gives 0. Voila! For the values in between, the result stays within one unit of the exact quotient.
Now it is easy to spot the compensation in the second function: uint not_a = 256 - ALPHA(src);
Not 255 - A but 256 - A, i.e. the component is increased by 1 before the multiplication. The table-based multiplication needs no compensation, since every value in the table is computed as required.
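The compensation can also be checked exhaustively. The sketch below (function names are illustrative assumptions) measures how far the compensated shift strays from a correctly rounded division by 255 over all byte pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Compensated shortcut: bump one factor by 1, then shift instead of dividing.
inline uint32_t scale_compensated(uint32_t x, uint32_t a) {
    assert(x <= 255 && a <= 255);
    return (x * (a + 1)) >> 8;
}

// Largest absolute deviation from the rounded exact value over all (x, a).
inline int max_compensated_error() {
    int worst = 0;
    for (uint32_t x = 0; x < 256; ++x)
        for (uint32_t a = 0; a < 256; ++a) {
            const int exact = static_cast<int>((x * a + 127) / 255);
            const int err = std::abs(static_cast<int>(scale_compensated(x, a)) - exact);
            if (err > worst) worst = err;
        }
    return worst;
}
```

The two endpoints are exact: a factor of 255 returns the component unchanged and a factor of 0 returns 0, which is what removes the visible darkening of opaque pixels.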

Heavy Artillery - SSSE3 instructions

It is time to think about optimizing with SIMD. They say the Intel compiler can do this without human intervention. Maybe. But I doubt that the Intel compiler would beat AlphaBlend; at best it would match it. And I need something faster. So we open the instruction reference and go.

The first question to ask is which instruction set to optimize for. I suspect that AlphaBlend is optimized for MMX; otherwise I cannot explain its superiority over a pure x86 implementation. MMX is good, but it is from the last century. Today it is hard to find a computer without SSE4 support. And for SSE you can optimize without even bothering to check whether the instructions are supported: the probability that your program will run on anything older than a Pentium 3 is close to zero. I am, of course, talking about desktop applications. Exotic platforms are beyond the scope of this article.

I opted for SSSE3. This instruction set is widespread enough to be worth optimizing for, especially given the very convenient instructions it contains.

The most useful instruction, the one underlying all the optimizations, is pshufb (intrinsic _mm_shuffle_epi8). It is the reason SSSE3 was chosen. What is its strength? It lets you scatter the bytes of a source 16-byte register in any arbitrary order, or even discard (zero out) bytes that are not needed. That is, with a single instruction I can prepare everything required for the subsequent calculations. Another important instruction is pmulhuw (intrinsic _mm_mulhi_epu16): 8 multiplications and 8 right shifts by 16 bits at once. As if designed for alpha blending. That is, this instruction effectively computes 2 pixels at a time.
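To make the semantics of the two instructions concrete without requiring SSSE3 hardware, here is a scalar C++ model of both. It is only an illustration: the names pshufb_model, pmulhuw_model, widen_mask and scale_lanes are hypothetical, and the mask is an assumed example, not one of the article's undisclosed constants.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;
using Words8  = std::array<uint16_t, 8>;

// pshufb: each output byte is picked by the low 4 bits of the mask byte,
// or zeroed when the mask byte has its top bit set.
inline Bytes16 pshufb_model(const Bytes16& src, const Bytes16& mask) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
    return out;
}

// pmulhuw: per 16-bit lane, the high half of the unsigned 32-bit product.
inline Words8 pmulhuw_model(const Words8& a, const Words8& b) {
    Words8 out{};
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<uint16_t>((static_cast<uint32_t>(a[i]) * b[i]) >> 16);
    return out;
}

// Assumed example mask: moves each of the first 8 bytes into the high byte
// of a 16-bit lane, i.e. produces component << 8 for two pixels.
inline Bytes16 widen_mask() {
    Bytes16 m{};
    for (int i = 0; i < 8; ++i) { m[2 * i] = 0x80; m[2 * i + 1] = static_cast<uint8_t>(i); }
    return m;
}

// Reinterprets 16 bytes as 8 little-endian 16-bit lanes.
inline Words8 to_words(const Bytes16& b) {
    Words8 w{};
    for (int i = 0; i < 8; ++i)
        w[i] = static_cast<uint16_t>(b[2 * i] | (b[2 * i + 1] << 8));
    return w;
}

// Scales every lane by (256 - alpha): ((c << 8) * (256 - alpha)) >> 16.
inline Words8 scale_lanes(const Words8& widened, unsigned alpha) {
    assert(alpha <= 255);
    Words8 k{};
    k.fill(static_cast<uint16_t>(256 - alpha));
    return pmulhuw_model(widened, k);
}
```

Because the components arrive already shifted left by 8, the built-in shift by 16 in pmulhuw leaves exactly (c * (256 - A)) >> 8 in each lane, which is the compensated formula for 8 components in one instruction.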

Well, let's go:

Wall of asm code

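    lddqu   xmm5, [eax]             ; load 16 bytes into xmm5, i.e. 4 pixels of the premultiplied source
    movdqa  xmm6, xmm5              ; copy to xmm6 for the second pair of pixels
    ; step one:
    ; unpack the source components into 16-bit values
    pshufb  xmm5, preparesrcs_1
    pshufb  xmm6, preparesrcs_2
    ; now
    ; xmm5 holds the first 2 pixels, each component widened to 16 bits
    ; xmm6 holds the second 2 pixels
    ; step two: the first 2 of the 4 pixels
    ; we need 8 16-bit values of (256-A);
    ; they are built in xmm7
    movdqa  xmm2, xmm5              ; copy the 2 source pixels to xmm2
    pshufb  xmm2, preparealphas     ; replicate the alphas across the lanes: A0 A0 A0 A0 A1 A1 A1 A1
    movdqa  xmm7, sub256            ; xmm7 = 8 16-bit values of 256
    psubw   xmm7, xmm2              ; subtract the alphas: 256-A
    movdqu  xmm0, [edx]             ; load 4 destination pixels
    movdqa  xmm1, xmm0              ; copy to xmm1 for step 3
    pshufb  xmm0, preparetgtc_1     ; unpack 2 pixels into 16-bit components, shifted left by 8
    pmulhuw xmm0, xmm7              ; multiply: 2 pixels of 16-bit results
    paddw   xmm0, xmm5              ; add the premultiplied source components
    pshufb  xmm0, packcback_1       ; pack the result back into 8 bytes of xmm0
    ; step three: the same as step 2, for the remaining 2 pixels
    movdqa  xmm2, xmm6
    pshufb  xmm2, xmm3
    movdqa  xmm7, xmm4
    psubw   xmm7, xmm2
    pshufb  xmm1, preparetgtc_2
    pmulhuw xmm1, xmm7
    paddw   xmm1, xmm6
    pshufb  xmm1, packcback_2
    por     xmm0, xmm1              ; xmm0 now holds 4 finished pixels
    movdqu  [edx], xmm0             ; store them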

As you can see, the SIMD implementation blends 4 source pixels with 4 destination pixels at once. That is what SIMD is for. I leave outside the scope of this article the case where the number of pixels to blend is not a multiple of 4. Personally, I handle it with "single-pixel" calls to the C++ implementation.
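One possible shape for such a row loop is sketched below. It is an assumption about the structure, not the author's code: blend_four is a scalar stand-in for the 4-pixel SSSE3 kernel, and blend_one is a simple per-channel version of the compensated integer blend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Single-pixel premultiplied blend with the 256 - A compensation.
inline uint32_t blend_one(uint32_t dst, uint32_t src) {
    const uint32_t not_a = 256 - (src >> 24);
    uint32_t scaled = 0;
    for (int shift = 0; shift < 32; shift += 8)
        scaled |= ((((dst >> shift) & 0xff) * not_a) >> 8) << shift;
    return src + scaled;
}

// Scalar stand-in for the 4-pixel SIMD kernel.
inline void blend_four(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < 4; ++i) dst[i] = blend_one(dst[i], src[i]);
}

// Blends count pixels: full groups of 4 first, then the 0..3 leftover pixels.
inline void blend_row(uint32_t* dst, const uint32_t* src, std::size_t count) {
    assert(count == 0 || (dst != nullptr && src != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) blend_four(dst + i, src + i);
    for (; i < count; ++i) dst[i] = blend_one(dst[i], src[i]);
}
```

With a row of 7 pixels, for example, the first 4 go through the block kernel and the last 3 through the single-pixel path.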

Results

As a result, this SSSE3 implementation runs almost 4 times faster (3.78 times on my hardware) than the AlphaBlend function. That is a very good result. Many programmers (myself included) are skeptical of such "reinvented wheels". As a rule, the result is predictably worse than the work of a team of highly qualified specialists. I set out to write my own implementation of AlphaBlend without believing I could beat the guys from Microsoft. It was purely sporting interest, which nevertheless paid off.

But that is not all. In this article I gave the code for the simple case, where the source image is blended onto the destination as is. But if you have read the documentation for AlphaBlend, you may have noticed that the function can also multiply by a constant alpha (passed via its parameters). I wrote an SSSE3 implementation for this case too. An interesting result: AlphaBlend runs almost 2 times slower when the constant alpha is not 255, i.e. when the extra color multiplication is required. My implementation slows down by only 4%, which also sets it apart from Microsoft's creation.
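In scalar form, the constant-alpha case amounts to fading all four channels of the premultiplied source by the constant before the usual blend. The sketch below is an assumed illustration of that idea using the compensated shift; it is not the author's SSSE3 code, and the names scale_pixel, blend_pm and blend_pm_const_alpha are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scales all four channels of a pixel by k / 256, where k is in 1..256.
inline uint32_t scale_pixel(uint32_t px, uint32_t k) {
    assert(k >= 1 && k <= 256);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8)
        out |= ((((px >> shift) & 0xff) * k) >> 8) << shift;
    return out;
}

// Plain premultiplied blend: dst * (256 - A) / 256 + src.
inline uint32_t blend_pm(uint32_t dst, uint32_t src) {
    return src + scale_pixel(dst, 256 - (src >> 24));
}

// Blend with an extra constant alpha (0..255) applied to the whole source.
inline uint32_t blend_pm_const_alpha(uint32_t dst, uint32_t src, uint32_t const_alpha) {
    assert(const_alpha <= 255);
    const uint32_t faded = scale_pixel(src, const_alpha + 1);  // stays premultiplied
    return blend_pm(dst, faded);
}
```

A constant alpha of 255 leaves the source untouched, so the result matches the plain blend, while a constant alpha of 0 returns the destination unchanged.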

Links

The code in this article is given only to illustrate the principle of the SSSE3 optimization. I have not listed the values of the constants used. If you want to use the optimized AlphaBlend in your project, you will have to extract the working code directly from the Isotoxin source code (that is the name of my project).

The Isotoxin repository is on GitHub.
The file in which the function itself is located is here.

I apologize for not preparing working examples and not packaging everything as a separate library. If you really need this function and have trouble extracting it from my sources, send me a private message and I will explain in detail how to do it.

Source: https://habr.com/ru/post/273109/
