Is it worth optimizing image processing in C ++ using SIMD?

SIMD and image processing

Image processing (here we consciously limit ourselves to only raster images and omit a wide class of vector images), as a rule, is a set of simple operations that are applied to each point of the image. If we take into account that the color channels that make up an image dot (pixel) are usually represented as integers of small dimension, then image processing is reduced to a huge number of one-type operations on 1-2 byte integers.

Performing such operations using general-purpose processor instructions is not entirely effective, as they are optimized for processing 4-8 byte numbers. In fact, when working with 1 byte numbers, their efficiency will drop 4-8 times. Processor manufacturers have long ago proposed a fairly effective solution to increase performance when working with such data - SIMD (Single instruction, multiple data) - sets of vector instructions that allow you to process a large amount of data in one operation. If we consider the x86 platform, then there are a large number of SIMD extensions on it: MMX, 3DNow !, SSE, SSE2, SSE3, SSEE3, SSE4.1, SSE4.2, AVX and AVX2. Most of these extensions are intended for vector processing of real numbers. If we touch on vector integer instructions, we can select the MMX, SSE2 and AVX2 command sets, which respectively contain operations for working with 8, 16, and 32-byte vectors. The MMX expansion first appeared in the Pentium MMX processor in 1997 and is now of historical interest, as all modern x86 processors support the more advanced SSE2 instruction set, which was first introduced in the Intel Pentium 4 processor, released in 2000. However, life does not stand still, and in 2013 the 4th generation iCore processor family came out with the support of AVX2 instructions, which will be widely distributed in the near future.

And so, we have some algorithm written in C ++ that performs image processing, the speed of which is critical for this application (such algorithms are most often found in video processing). Given the presence of SIMD processor extensions that allow you to potentially speed up the execution speed of the algorithm 10 or more times, it would be logical to try to use them in any way. And so the question is how easy it is to do? Below I will try to answer it by listing in detail all the pros and cons that the developer will have to face in the process of using SIMD processor extensions in his project.

pros

The use of SIMD instructions allows simultaneous processing of several initial data at once, thereby significantly reducing the total running time of the algorithm.
Extended instruction sets contain specialized commands (for example, finding the average or maximum number for a pair of numbers, adding or subtracting with saturation) that replace several general-purpose instructions. The use of these commands can give an additional very significant gain in speed of execution.
In order to enable SIMD processor expansion, contrary to common misconception, it is not necessary to use the assembler. All modern C ++ compilers support the so-called intrinsic functions that the compiler translates directly into commands for the CPU. Therefore, when using SIMD, you can completely remain within the framework of the C / C ++ syntax in the usual IDE.

Minuses

Loss of universality of the code that uses processor extensions. Indeed, there are a large number of processor architectures, each of which may contain (or maybe not) its own set of vector instructions. Therefore, either we have to limit the applicability of our program to any particular architecture, or write several implementations of the algorithm for each of the architectures we require.
Input and output data should be stored in memory in an orderly manner (preferably sequentially, although it is possible to extract every second or every fourth element quite efficiently). The tighter the input and output data lie, the fewer operations for reordering we will have to perform and the more efficient our algorithm will be. This requirement makes it necessary to revise data storage methods, for example, to store a color image as a set of separate color planes, rather than in a single plane with alternating color channels for each point (as in BGR-24).
It is desirable that the input and output data are aligned in memory (on the 16-byte boundary for SSE2, on the 32-byte boundary for AVX2), since reading and storing of the aligned data is much faster. If we control the creation of input and output data, then their alignment is achieved by using special memory allocators. Otherwise, we must either accept the loss of speed of the algorithms, or write two versions of the algorithm for the case of aligned or non-aligned data.
If during processing of images in general-purpose registers, the problem of overflowing numbers in the computation process is almost not encountered (we just use the standard type int, which is unlikely to overflow when working with 8-bit data), then when processing using vector instructions this problem needs to be constantly kept under control. If, for example, the algorithm may overflow 8-bit data in the calculation process, then we have to convert them to 16-bit, perform the calculation and do the inverse transformation. Of course, for this there are special vector instructions for packing and unpacking vectors, but all this slows down the algorithm, as well as the fact that vector instructions on 16-bit data process two times less data per 1 time, compared to 8-bit ones.
When processing vector instructions, there is no possibility of conditional execution of commands for each element. In some cases, conditional transitions can be emulated, for which you have to execute both branches of the code, and then combine the results of their work. This naturally affects the performance of the algorithm, and if the code is saturated with conditional transitions, then the meaning in SIMD optimization is lost.
When processing the image, points that are located on the borders are usually processed in a special way. For vector instructions, the case is complicated by the fact that in the boundary block of points a part of points is processed according to a general algorithm, and a part according to an algorithm for boundary points, which greatly complicates the development.
The image width is not always a multiple of the size of the vector. Therefore, for processing images with an arbitrary width, it is necessary to specially process the last block in a line, which can be only partially filled.
Not all mathematical operations that are available for scalar registers have their analogs for vector instructions. For example, there is no integer division. In these cases, you have to either refuse to vectorize, or try to emulate them using the available instructions, which leads to a decrease in efficiency and complexity of the code.
The readability of a code containing vector instructions is usually much inferior to the readability of a scalar code.
Debugging code containing vector instructions is much more complicated, since the programmer has to track the correctness of not separate scalar values, but whole vectors. And modern IDE, in spite of significant progress in this area, display vector data much worse than scalar ones.
Experience shows that writing a properly working algorithm using vector instructions is practically impossible without the presence of unit tests that cover all the characteristic cases in which it will be used.

As can be seen from the above, the use of SIMD processor extensions can give a big performance boost, but at the same time it is accompanied by a significant complication of the development process. Whether one is worth the other depends on the conditions of a particular project and is left to the discretion of the developer.

Afterword

The author of this article, by the nature of his work (developing algorithms for video analytics), has devoted quite a lot of time to the process of optimizing image processing algorithms in C ++ using various vector instructions (mainly SSE2 and AVX2). In particular, the result of this work was a library of image processing algorithms in C ++ with open source, which is available in the public domain . If readers find this topic interesting, then I am ready to write a few articles that will describe the optimization features with specific examples.

Source: https://habr.com/ru/post/204234/

All Articles

Is it worth optimizing image processing in C ++ using SIMD?

SIMD and image processing

pros

Minuses

Afterword

More articles: