
Foreword
Good day! Today I've decided to share a little secret with you: one of my favorite bicycles (as we call homegrown reinventions of the wheel).
I'll start from afar. For quite a long time I worked at a radio factory in Chelyabinsk, where we had (and, in fact, they still have, I'm just not there anymore) one mega-project: an optical-electronic module for guarding physical facilities. It is a hefty thing on a rotating mount, with three cameras for all occasions (a color one for daytime, a photosensitive black-and-white one for twilight, and a thermal imager for night observation). You take such a module, put it on a tower 50 meters high, and you can keep a territory within a radius of 4-5 kilometers under surveillance day and night. I won't go into the details, this post is not about that; those who are interested will find them.
Of course, there were plenty of interesting image-processing tasks. I want to tell you about one of them: namely, how to use massively parallel computing to compensate for camera shake in real time, or why SURF is not always the right tool. Welcome under the cut.
The essence of the problem
As I mentioned earlier, the modules are mounted on rather tall towers. Because of the strong wind at that height they constantly shake slightly, and the narrower the camera's viewing angle, the more noticeable it is. Meanwhile the operator has to watch the camera image, and for quite a long time (a whole shift). It is very hard on the eyes; it can quite literally make you nauseous. That is why one of the main requirements for such systems is a mechanism that minimizes the shaking of the picture. In some places gyro-stabilized platforms are used for this, but far from everywhere. Most often software stabilization is used, and most often it is based on SURF. But SURF is poorly suited for the job.
Why SURF and similar algorithms are a poor fit for this
In order:
1 - SURF is not intended for this. All we need to solve this problem is the offset of the new frame relative to the previous one, in pixels. The camera does not rotate in the image plane and does not move closer to or farther from the scene. SURF is better suited for tracking moving objects in a video sequence than for finding the displacement of the entire image.
2 - Speed. I will not go into details; I will only say that the time constraints are serious: the algorithm must find the offset between the last two frames in under 13 milliseconds, and the less the better. With SURF we did not even come close.
3 - Sensitivity to interference. It is the displacement of key points that gets analyzed, not of the entire frame. Because of matching errors there are problems with filtering out false matches, and those problems are quite difficult to solve. On a live picture this showed up as occasional "jerks" of the image in one direction or another.
My bicycle
By the time I started to solve this problem, I already had some experience with nVidia CUDA: I had rewritten a couple of video processing filters (brightness, contrast). The speedup compared to processing on the CPU pleased me immensely, and it was from that time that I got carried away by massively parallel computing and GPGPU in particular.
So I decided to use massively parallel computation to solve this problem. At first I looked for ready-made solutions and tried them out. I searched for a SURF implementation on CUDA but, as far as I remember, found nothing suitable. So I decided to invent my own.
By the way, along the way I came across a great article by comrade BigObfuscator, after which I understood which way I should dig.
I called the algorithm "Ratel Deshake", and it consists of three stages:
- preparation of the new frame
- conversion of the frame to the segment-integral representation
- search for the best match
All three stages lend themselves well to massively parallel processing. I implemented them on CUDA, but as far as I understand, all of this could also be implemented quite well on an FPGA, for example.
Preparing a new frame
At this stage we need to strip the input data of everything that does not interest us. More precisely, the color and, as far as possible, noise. For this I used conversion to black and white and contour extraction with a threshold. In code it looks something like this:
struct pixel { unsigned char R, G, B; };

struct pixel Image[width][height];             /* input color frame    */
unsigned int GSImage[width][height];           /* grayscale image      */
unsigned int preprocessedImage[width][height]; /* thresholded contours */

/* Grayscale conversion: a plain sum of the three channels is enough here. */
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++)
        GSImage[x][y] = Image[x][y].R + Image[x][y].G + Image[x][y].B;

/* Contour extraction: count the neighbours that are darker than the current
   pixel by more than the threshold (signed arithmetic, to avoid wrap-around). */
for (int x = 1; x < width - 1; x++) {
    for (int y = 1; y < height - 1; y++) {
        preprocessedImage[x][y] = 0;
        if ((int)GSImage[x][y] - (int)GSImage[x - 1][y] > threshold) preprocessedImage[x][y]++;
        if ((int)GSImage[x][y] - (int)GSImage[x][y - 1] > threshold) preprocessedImage[x][y]++;
        if ((int)GSImage[x][y] - (int)GSImage[x + 1][y] > threshold) preprocessedImage[x][y]++;
        if ((int)GSImage[x][y] - (int)GSImage[x][y + 1] > threshold) preprocessedImage[x][y]++;
    }
}
Converting the frame to the segment-integral representation
Perhaps the most interesting stage. First, a couple of points need to be clarified.
First, what is the integral representation of an image? You can understand it by reading this article, if you have not already done so via the link above. In short, it is a two-dimensional (in our case) array in which the element with coordinates X and Y is equal to the sum of all elements of the original array with coordinates [0..X, 0..Y].
Second, what I call the segment-integral representation. Because of the specifics of my algorithm, we need a two-dimensional array in which the element with coordinates X and Y is equal to the sum of all elements of the original array with coordinates [(X-N)..X, (Y-M)..Y], where N and M are arbitrary numbers that control the speed/accuracy trade-off of the algorithm. For example, I used the values 15 and 15, i.e. each element of the resulting array was the sum of 256 elements (a 16x16 square) of the original array. Yes, in essence this is a kind of image blur.
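As a side note, if the full integral representation has already been computed, a single element of the segment-integral representation with N = M = 15 can be obtained from just four lookups, the usual box-sum trick for integral images. A minimal sketch in the same pseudo-C style as the other listings (the function name is mine, purely for illustration):

/* Sum of the 16x16 square ending at (x, y), taken from a full integral image.
   Valid for x >= 16 and y >= 16, i.e. away from the top and left edges.      */
unsigned int segment_sum_16x16(unsigned int integral[width][height], int x, int y)
{
    return integral[x][y] - integral[x - 16][y]
         - integral[x][y - 16] + integral[x - 16][y - 16];
}

The pipeline below builds the whole segment-integral array directly in two passes instead, but away from the image edges the resulting values are the same.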
But back to the integral representation of the image. With a little thought, you can see that in a single thread it is optimally computed as follows:
unsigned int Image[width][height];          /* preprocessed frame          */
unsigned int processedImage[width][height]; /* its integral representation */

processedImage[0][0] = Image[0][0];

/* The first row and the first column are simple running sums. */
for (int x = 1; x < width; x++)
    processedImage[x][0] = Image[x][0] + processedImage[x - 1][0];
for (int y = 1; y < height; y++)
    processedImage[0][y] = Image[0][y] + processedImage[0][y - 1];

/* Everything else is built from the three already-computed neighbours. */
for (int x = 1; x < width; x++)
    for (int y = 1; y < height; y++)
        processedImage[x][y] = Image[x][y] + processedImage[x - 1][y]
                             + processedImage[x][y - 1] - processedImage[x - 1][y - 1];
But there is one problem: this part of the algorithm cannot be split into a large number of threads. Or can it? Of course it can. It is enough to split it into two separate passes, a horizontal one and a vertical one. Each of those passes then parallelizes perfectly, with one thread per row and one thread per column respectively. Like this:
unsigned int Image[width][height];           /* preprocessed frame               */
unsigned int horizontalImage[width][height]; /* intermediate, after the row pass */
unsigned int processedImage[width][height];  /* integral representation          */
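/* Roughly, the two passes themselves look like this (a single-threaded sketch;
   in the CUDA version every iteration of the outer loop can become its own
   thread). First pass: running sums along each row.                            */
for (int y = 0; y < height; y++) {
    horizontalImage[0][y] = Image[0][y];
    for (int x = 1; x < width; x++)
        horizontalImage[x][y] = horizontalImage[x - 1][y] + Image[x][y];
}

/* Second pass: running sums along each column of the intermediate array. */
for (int x = 0; x < width; x++) {
    processedImage[x][0] = horizontalImage[x][0];
    for (int y = 1; y < height; y++)
        processedImage[x][y] = processedImage[x][y - 1] + horizontalImage[x][y];
}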
And now the same thing, but for the segment-integral representation with N = M = 15:
unsigned int Image[width][height];           /* preprocessed frame               */
unsigned int horizontalImage[width][height]; /* intermediate, after the row pass */
unsigned int processedImage[width][height];  /* segment-integral representation  */
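/* The same two-pass idea, sketched with a sliding 16-pixel window instead of
   a full running sum; near the top and left edges the window is simply
   clamped to the border of the image.                                        */
for (int y = 0; y < height; y++) {
    unsigned int sum = 0;
    for (int x = 0; x < width; x++) {
        sum += Image[x][y];                    /* pixel entering the window */
        if (x >= 16) sum -= Image[x - 16][y];  /* pixel leaving the window  */
        horizontalImage[x][y] = sum;
    }
}

for (int x = 0; x < width; x++) {
    unsigned int sum = 0;
    for (int y = 0; y < height; y++) {
        sum += horizontalImage[x][y];
        if (y >= 16) sum -= horizontalImage[x][y - 16];
        processedImage[x][y] = sum;
    }
}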
So that's the whole little trick. In fact, you can parallelize even further by splitting the image into several separate parts, but that is a separate song and dance. In any case, after all this we have the segment-integral representation of the image. One more important thing needs to be done at this stage: besides choosing N and M, you need to choose how many "blocks" will be compared.
This point is also worth explaining in detail. The thing is that when searching for the best match, it is the match of the previous frame against the new one that is sought, not the other way around. After an iteration of the algorithm has finished, the data kept for the next iteration describes the previous frame, but not all of it: only the values of a certain number of blocks (16x16 in my case). This is another factor affecting the speed/accuracy trade-off: fewer blocks means more speed, more blocks means more accuracy.
Specifically, in my case the image size was 704x576 pixels. From the segment-integral representation of the previous frame I kept the values of only 1140 segments (38x30 non-overlapping blocks from the center of the image, 4560 bytes in total). It will be clearer as a picture:

Here we have a hypothetical 256x256-pixel image. For convenience it is divided into separate 16x16-pixel blocks (although the segment-integral representation actually contains far more blocks, and they overlap one another). In the center, an array of 10 by 10 blocks is selected. It is the values of these blocks that must be remembered for use in the next iteration of the algorithm. In code it looks something like this:
unsigned int IntegralImage[256][256]; /* segment-integral representation of the frame */
unsigned int NextIteration[10][10];   /* block values kept for the next iteration     */

/* Each saved value is the bottom-right element of one 16x16 block from the
   central 10x10 group of blocks (a three-block margin is left on each side). */
for (int x = 0; x < 10; x++)
    for (int y = 0; y < 10; y++)
        NextIteration[x][y] = IntegralImage[(x + 4) * 16 - 1][(y + 4) * 16 - 1];
Why we need this array will become clear in the next step.
Search for the best match
Almost there, the home stretch. If only one frame has been passed through the algorithm so far, this stage should be skipped (there is nothing to stabilize yet). If this is not the first frame, you should have the NextIteration array from the previous iteration. Since it is left over from the previous iteration, at this stage it will be called PrevIteration; do not let that confuse you.
So, we have the segment-integral representation of the new frame and an array of reference blocks from the previous one. All that remains at this stage is to find the position at which overlaying the blocks of the previous frame onto the segment-integral representation of the new frame gives the minimum difference.
An important note: there is little point in checking every possible position; it is better to cap the maximum expected deviation, in pixels, with a constant. For example, plus or minus 30 pixels in X and Y.
First of all, we need an array that holds the total difference at each particular offset. Here is an example in code for the 256x256 picture with the squares:
unsigned int IntegralImage[256][256]; /* segment-integral representation of the new frame */
unsigned int PrevIteration[10][10];   /* reference block values from the previous frame   */
unsigned int Differences[61][61];     /* total difference for every offset in [-30..30]   */

for (int X = 0; X < 61; X++)
    for (int Y = 0; Y < 61; Y++)
        Differences[X][Y] = 0;
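/* A rough sketch of the accumulation itself (my illustration of the idea, not
   the exact project code): for every candidate shift of the new frame, from
   -30 to +30 pixels on each axis, compare every reference block of the
   previous frame with the segment of the new frame that would contain the
   same content if that shift were correct, and sum the absolute differences. */
for (int X = 0; X < 61; X++) {
    for (int Y = 0; Y < 61; Y++) {
        int dx = 30 - X;   /* candidate shift of the new frame to the right */
        int dy = 30 - Y;   /* the same, for the vertical axis               */
        for (int x = 0; x < 10; x++) {
            for (int y = 0; y < 10; y++) {
                unsigned int newValue = IntegralImage[(x + 4) * 16 - 1 + dx][(y + 4) * 16 - 1 + dy];
                unsigned int oldValue = PrevIteration[x][y];
                Differences[X][Y] += (newValue > oldValue) ? newValue - oldValue
                                                           : oldValue - newValue;
            }
        }
    }
}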
That's all. It only remains to find the minimum element in the Differences array. If it is at [0, 30], the new frame is shifted 30 pixels to the right; if at [60, 30], then 30 pixels to the left.
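To finish the picture, here is a small sketch of that last step; the sign convention simply follows the interpretation above, and the variable names are mine:

/* Find the offset with the smallest accumulated difference. */
int bestX = 0, bestY = 0;
for (int X = 0; X < 61; X++)
    for (int Y = 0; Y < 61; Y++)
        if (Differences[X][Y] < Differences[bestX][bestY]) {
            bestX = X;
            bestY = Y;
        }

/* Convert the indices back into a shift of the new frame: index 30 means no
   shift, 0 means 30 pixels to the right, 60 means 30 pixels to the left,
   and the same logic applies to Y.                                          */
int shiftX = 30 - bestX;
int shiftY = 30 - bestY;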
Summary
In this article I have not said anything innovative or particularly profound, but it is a substantial piece of my experience on this topic, and I sincerely hope it will come in handy for someone. The article came out long and tangled, so surely there are obvious mistakes somewhere, or something I have not stated clearly enough; I will be glad of reviews, constructive criticism and any other feedback. Many thanks to those who made it to the end.
As for my implementation, I ended up with more or less working test software. On a video card of the GeForce GTX 560 level it coped with the assigned task: processing 75 frames per second at 704x576, and there was still a decent margin of time left. Of course, in this article I have laid out only the general principles; a lot here can be optimized. Unfortunately, it was decided not to use CUDA in the project, so I have not touched it for more than a year.
Once again I express my gratitude to comrade BigObfuscator for the excellent article, without which I would hardly have come up with anything this convoluted.
Source code
As I wrote above, I have not dealt with the implementation of this algorithm for more than a year. The source code exists, but it is so unsightly that I am ashamed to post it in its current form. If enough people are interested, I will shake the dust off the files, tidy everything up and publish it.
Happy coding and good ideas to everyone!