
The story of optimizing alpha_composite in Pillow 2.0

The second version of Pillow, the Python library for working with images, was recently released. As many know, it is a fork of the well-known PIL library, which, despite its considerable age, until recently remained the sanest way to work with images in Python. The Pillow authors have finally decided not only to maintain the old code, but also to add new features. One of these features is the alpha_composite() function.

This function blends two semi-transparent images. The task is a general case of compositing an image with an alpha channel onto an opaque background, which is probably familiar to many. You can read about blending translucent images on Wikipedia. I also once wrote a lengthy article about it.
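For reference, these are the formulas of the "over" operator described in that Wikipedia article, with all channels normalized to the range [0, 1] (the code below works with the same quantities scaled to 0..255):

outa = srca + dsta * (1 - srca)
outc = (srcc * srca + dstc * dsta * (1 - srca)) / outa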

Since then, I have managed to write a more efficient implementation for my own needs than the one given at the end of that article. And it turned out that this implementation was faster than alpha_composite() from the new version of Pillow, even though the latter is written in C. That flattered me, of course, but I still decided to try to improve Pillow's implementation.
First you need a test bench.

bench.py
from PIL import Image
from timeit import repeat
from image import paste_composite

im1 = Image.open('in1.png')
im1.load()
im2 = Image.open('in2.png')
im2.load()

print repeat(lambda: paste_composite(im1.copy(), im2), number=100)
print repeat(lambda: Image.alpha_composite(im1, im2), number=100)

out1 = im1.copy()
paste_composite(out1, im2)
out1.save('out1.png')

out2 = Image.alpha_composite(im1, im2)
out2.save('out2.png')

The script takes the files in1.png and in2.png from the current directory and blends them a hundred times, first with paste_composite(), then with alpha_composite(); the explicit load() calls force both images to be decoded up front, so that decoding does not skew the timings. I took the files from the previous article (one, two). On average, paste_composite() ran in 235 ms, while the original alpha_composite() took 400 ms.

I didn't have to guess long about why alpha_composite() is so slow. Looking at the implementation, it is clear that the function performs all calculations on floating-point numbers. Floating-point arithmetic makes it quite simple to describe a graphics algorithm, but it is always slower than integer arithmetic. The first thing I did was convert everything to integers:

typedef struct {
    UINT8 r;
    UINT8 g;
    UINT8 b;
    UINT8 a;
} rgba8;

Imaging
ImagingAlphaComposite(Imaging imDst, Imaging imSrc)
{
    Imaging imOut = ImagingNew(imDst->mode, imDst->xsize, imDst->ysize);

    for (int y = 0; y < imDst->ysize; y++) {
        rgba8* dst = (rgba8*) imDst->image[y];
        rgba8* src = (rgba8*) imSrc->image[y];
        rgba8* out = (rgba8*) imOut->image[y];

        for (int x = 0; x < imDst->xsize; x++) {
            if (src->a == 0) {
                /* fully transparent source: keep the destination pixel */
                *out = *dst;
            } else {
                UINT16 blend = dst->a * (255 - src->a);
                UINT8 outa = src->a + blend / 255;
                UINT8 coef1 = src->a * 255 / outa;
                UINT8 coef2 = blend / outa;
                out->r = (src->r * coef1 + dst->r * coef2) / 255;
                out->g = (src->g * coef1 + dst->g * coef2) / 255;
                out->b = (src->b * coef1 + dst->b * coef2) / 255;
                out->a = outa;
            }
            dst++; src++; out++;
        }
    }

    return imOut;
}

This immediately gave a 5.5× gain: the test now ran in 75 ms.

A good result, but now I had to check how good the resulting picture was. I could not think of a better reference than the output of Photoshop. To the eye, the two images were indistinguishable from the Photoshop result, so I had to find a way to visualize the differences more precisely.

diff.py
#!/usr/bin/env python
import sys
from PIL import Image, ImageMath

def chanel_diff(c1, c2):
    c = ImageMath.eval('127 + c1 - c2', c1=c1, c2=c2).convert('L')
    return c.point(lambda c: c if 126 <= c <= 128 else 127 + (c - 127) * 10)

im1 = Image.open(sys.argv[1])
im2 = Image.open(sys.argv[2])

diff = map(chanel_diff, im1.split(), im2.split())
Image.merge('RGB', diff[:-1]).save('diff.png', optimize=True)
diff[-1].convert('RGB').save('diff.alpha.png', optimize=True)

The script accepts two image file names as input (for example, python diff.py out1.png out2.png) and computes the difference between each pair of channels. First the plain difference is added to a gray background (value 127), so that deviations in both directions are visible. Then every deviation larger than one level is amplified tenfold, so that it can be seen with the naked eye. The result is written to two files: the RGB channels go to diff.png, and the alpha channel, as shades of gray, to diff.alpha.png.

It turned out that the difference between the alpha_composite() result and Photoshop was quite significant:


Looking carefully at the code again, I realized that I had forgotten about rounding: integer division simply drops the remainder. To get a rounded result, you need to add half of the divisor to the dividend.
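For division by 255, half of the divisor is 127:

round(a / 255) = (a + 127) / 255 (integer division)

For example, 200 / 255 truncates to 0, while (200 + 127) / 255 = 1, which matches round(200 / 255.0). With rounding in place, the inner loop became: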

UINT16 blend = dst->a * (255 - src->a);
UINT8 outa = src->a + (blend + 127) / 255;
UINT8 coef1 = src->a * 255 / outa;
UINT8 coef2 = blend / outa;
out->r = (src->r * coef1 + dst->r * coef2 + 127) / 255;
out->g = (src->g * coef1 + dst->g * coef2 + 127) / 255;
out->b = (src->b * coef1 + dst->b * coef2 + 127) / 255;
out->a = outa;

The running time grew to 94 ms, but the differences from the reference became much smaller. Still, they did not disappear entirely.

In principle, one could have stopped there: the remaining discrepancies are mostly concentrated in pixels with high transparency, where the value of a color channel is not that significant. But when I checked the original floating-point alpha_composite() just in case, it turned out to diverge from the reference hardly at all. Stopping there was not an option.

Obviously, the remaining inaccuracy is the result of the coefficients losing precision during division. We can keep extra significant bits in each variable to make the divisions more accurate. The variable blend is computed without any division, so nothing needs to be done with it. The variable outa is obtained by dividing blend by 255; in that calculation we shift the values of blend and src->a up by 4 bits, so outa ends up with 12 significant bits. Both coefficients are divided by outa, which is now 4 bits larger, and we want the coefficients themselves to carry 4 extra bits as well. As a result, their dividends are shifted up by 8 bits.

UINT16 blend = dst->a * (255 - src->a);                    // 16 bits max
UINT16 outa = (src->a << 4) + ((blend << 4) + 127) / 255;  // 12 bits
UINT16 coef1 = ((src->a * 255) << 8) / outa;               // 12 bits
UINT16 coef2 = (blend << 8) / outa;                        // 12 bits
out->r = ((src->r * coef1 + dst->r * coef2 + 0x7ff) / 255) >> 4;
out->g = ((src->g * coef1 + dst->g * coef2 + 0x7ff) / 255) >> 4;
out->b = ((src->b * coef1 + dst->b * coef2 + 0x7ff) / 255) >> 4;
out->a = (outa + 0x7) >> 4;

Of course, the result of all the calculations has to be shifted back down by those same 4 bits before being stored in the image pixels. This version runs slightly slower, 108 ms, but it has practically no pixels that differ from the reference by more than one level. With the quality goal reached, it was time to think about performance again.

Obviously, such code is far from optimal from the processor's point of view. Division is one of the most expensive operations, and here as many as 6 divisions are performed per pixel. It turned out that getting rid of most of them is quite simple. PIL itself, in the file Paste.c, describes a method of fast division with rounding:
(a + 127) / 255 ≈ ((a + 128) + ((a + 128) >> 8)) >> 8
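A quick standalone brute-force check (my own sketch, not part of Pillow) confirms that the shift-based form matches rounded division by 255 for every 16-bit dividend:

#include <stdio.h>
#include <stdint.h>

/* Standalone sketch: verify that the shift trick equals exact
   rounded division by 255 for every 16-bit dividend. */
int main(void) {
    uint32_t a, mismatches = 0;
    for (a = 0; a <= 0xffff; a++) {
        uint32_t exact = (a + 127) / 255;
        uint32_t approx = ((a + 128) + ((a + 128) >> 8)) >> 8;
        if (exact != approx) {
            printf("mismatch at %u: %u != %u\n", a, exact, approx);
            mismatches++;
        }
    }
    printf("%u mismatches\n", mismatches);
    return 0;
}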

The trick replaces one division with two shifts and one addition, which is faster on most processors. After folding one of those shifts into the already existing shift by 4 (and noting that (src->a << 8) - src->a is simply src->a * 255), I got the following code:

UINT16 blend = dst->a * (255 - src->a);
UINT16 outa = (src->a << 4) + (((blend << 4) + (blend >> 4) + 0x80) >> 8);
UINT16 coef1 = (((src->a << 8) - src->a) << 8) / outa;  // 12 bits
UINT16 coef2 = (blend << 8) / outa;                     // 12 bits
UINT32 tmpr = src->r * coef1 + dst->r * coef2 + 0x800;
out->r = ((tmpr >> 8) + tmpr) >> 12;
UINT32 tmpg = src->g * coef1 + dst->g * coef2 + 0x800;
out->g = ((tmpg >> 8) + tmpg) >> 12;
UINT32 tmpb = src->b * coef1 + dst->b * coef2 + 0x800;
out->b = ((tmpb >> 8) + tmpb) >> 12;
out->a = (outa + 0x7) >> 4;

It runs in just 65 ms and produces exactly the same result as the previous version.

Overall, the speed was improved by a factor of 6, and a pull request was sent to the repository. One could try to achieve an even closer match with the reference: for example, I managed to reproduce the alpha channel generated by Photoshop exactly, but at the cost of introducing more distortion into the color channels.

Big update

Thinking over what had been done once more, I decided it was foolish to treat the Photoshop algorithm as the standard: it is unknown what tasks it was designed for, and its implementation cannot even be inspected. The reference should be the algorithm operating on floats. I slightly modified the diff script so that it also prints some statistics.

diff.py
#!/usr/bin/env python
import sys
from PIL import Image, ImageMath

def chanel_diff(c1, c2):
    return ImageMath.eval('127 + c1 - c2', c1=c1, c2=c2).convert('L')

def highlight(c):
    return c.point(lambda c: c if 126 <= c <= 128 else 127 + (c - 127) * 10)

im1 = Image.open(sys.argv[1])
im2 = Image.open(sys.argv[2])

diff = map(chanel_diff, im1.split(), im2.split())

if len(sys.argv) >= 4:
    highlight(Image.merge('RGB', diff[:-1])).save('%s.png' % sys.argv[3])
    highlight(diff[-1]).convert('RGB').save('%s.alpha.png' % sys.argv[3])

def stats(ch):
    return sorted((c, n) for n, c in ch.getcolors())

for ch, stat in zip(['red ', 'grn ', 'blu ', 'alp '], map(stats, diff)):
    print ch, ' '.join('{}: {:>5}'.format(c, n) for c, n in stat)

I also prepared a much more serious test case: two 16-megapixel images in which every combination of source and destination pixel values occurs exactly once.

make_testcase.py
from PIL import Image

def prepare_test_images(dim):
    """Please be careful with dim > 32. The resulting images
    have dim ** 4 pixels each (i.e. 1 Mpx for dim = 32
    or 4 Gpx for dim = 256).
    """
    i1 = bytearray(dim ** 4 * 2)
    i2 = bytearray(dim ** 4 * 2)
    res = 255.0 / (dim - 1)
    rangedim = range(dim)
    pos = 0
    for l1 in rangedim:
        for l2 in rangedim:
            for a1 in rangedim:
                for a2 in rangedim:
                    i1[pos] = int(res * l1)
                    i1[pos + 1] = int(res * a1)
                    i2[pos] = int(res * l2)
                    i2[pos + 1] = int(res * a2)
                    pos += 2
        print '%s of %s' % (l1, dim)
    i1 = Image.frombytes('LA', (dim ** 2, dim ** 2), bytes(i1))
    i2 = Image.frombytes('LA', (dim ** 2, dim ** 2), bytes(i2))
    return i1.convert('RGBA'), i2.convert('RGBA')

im1, im2 = prepare_test_images(63)
im1.save('im1.png')
im2.save('im2.png')

After that, I decided to come at it from the other end: not to try to raise the accuracy of the integer algorithm, but to take the floating-point algorithm and gradually bring it to integers. This is what I started from:

double dsta = dst->a / 255.0;
double srca = src->a / 255.0;
double blend = dsta * (1.0 - srca);
double outa = srca + blend;
double coef1 = srca / outa;
double coef2 = 1.0 - coef1;
double tmpr = src->r * coef1 + dst->r * coef2;
out->r = (UINT8) (tmpr + 0.5);
double tmpg = src->g * coef1 + dst->g * coef2;
out->g = (UINT8) (tmpg + 0.5);
double tmpb = src->b * coef1 + dst->b * coef2;
out->b = (UINT8) (tmpb + 0.5);
out->a = (UINT8) (outa * 255.0 + 0.5);

And this is what I arrived at:

UINT16 blend = dst->a * (255 - src->a);
UINT16 outa255 = src->a * 255 + blend;
// Here we use 7 extra bits for precision.
// We could use more, but then we would not fit into 32 bits.
UINT16 coef1 = src->a * 255 * 255 * 128 / outa255;
UINT16 coef2 = 255 * 128 - coef1;

#define SHIFTFORDIV255(a)\
    ((((a) >> 8) + (a)) >> 8)

UINT32 tmpr = src->r * coef1 + dst->r * coef2 + (0x80 << 7);
out->r = SHIFTFORDIV255(tmpr) >> 7;
UINT32 tmpg = src->g * coef1 + dst->g * coef2 + (0x80 << 7);
out->g = SHIFTFORDIV255(tmpg) >> 7;
UINT32 tmpb = src->b * coef1 + dst->b * coef2 + (0x80 << 7);
out->b = SHIFTFORDIV255(tmpb) >> 7;
out->a = SHIFTFORDIV255(outa255 + 0x80);
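To see how closely the integer version tracks the floating-point reference, the two per-channel computations can be compared head to head. Below is a minimal standalone harness for that (my sketch, not part of Pillow; blend_int and blend_float are hypothetical helpers wrapping the two snippets above, and it samples random pixel pairs instead of enumerating all 2^32 combinations):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* blend_int mirrors the integer snippet above for one color channel. */
static uint8_t blend_int(uint8_t sc, uint8_t sa, uint8_t dc, uint8_t da) {
    uint16_t blend = da * (255 - sa);
    uint16_t outa255 = sa * 255 + blend;
    uint16_t coef1 = (uint32_t) sa * 255 * 255 * 128 / outa255;
    uint16_t coef2 = 255 * 128 - coef1;
    uint32_t tmp = sc * coef1 + dc * coef2 + (0x80 << 7);
    return (((tmp >> 8) + tmp) >> 8) >> 7;
}

/* blend_float mirrors the floating-point reference above. */
static uint8_t blend_float(uint8_t sc, uint8_t sa, uint8_t dc, uint8_t da) {
    double srca = sa / 255.0;
    double blend = (da / 255.0) * (1.0 - srca);
    double coef1 = srca / (srca + blend);
    return (uint8_t) (sc * coef1 + dc * (1.0 - coef1) + 0.5);
}

int main(void) {
    long bad = 0, n;
    srand(42);  /* fixed seed for reproducibility */
    for (n = 0; n < 10000000; n++) {
        uint8_t sc = rand() & 0xff, sa = rand() & 0xff;
        uint8_t dc = rand() & 0xff, da = rand() & 0xff;
        /* both fully transparent: outa is zero, the real code copies dst */
        if (sa == 0 && da == 0)
            continue;
        if (abs(blend_int(sc, sa, dc, da) - blend_float(sc, sa, dc, da)) > 1)
            bad++;
    }
    printf("samples off by more than one level: %ld\n", bad);
    return 0;
}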

Source: https://habr.com/ru/post/174005/

