One of the important subtasks of video analytics is tracking objects in video. It is not so primitive that one has to descend to the pixel-by-pixel level, yet not so complicated as to unequivocally require a multilayer neural network for its solution. Tracking can be used as an end in itself, or as a part of other algorithms:
Counting unique people who entered a certain zone or crossed a line in the frame
Identifying typical routes of cars in a parking lot or of people in a store
Automatically panning a surveillance camera as the tracked object moves
Even without looking at the literature, I can say with confidence that the best way to solve the problem is to use neural networks. In general, one could stop writing right here, but it is not always possible to throw a pair of GTX 1080Ti at the task. If you are curious how to track objects in video in such cases, welcome under the cut. I will try not just to explain how the ASEF and MOSSE trackers work, but to lead you to the solution so that the formulas seem obvious.

To begin with, let's answer a preliminary question: why invent anything at all, when we can feed tensorflow a pack of videos and leave the computer alone for a couple of weeks? Neural network approaches have one serious drawback: even on modern video cards, it is hard to get a high frame rate out of them. This is not a problem if we analyze recorded video, but it puts a stick in the wheels if you want to work in real time. Suppose we want to process video from five cameras at 10 FPS. Even under such relatively mild conditions, the neural network must have an inference time of less than $1 / (5 \times 10) = 20$ milliseconds (assuming fully sequential processing). For comparison, YoloV3, a detector network with a relatively simple architecture, needs tens of milliseconds per image. Besides, solutions based on powerful graphics cards can simply be expensive.
This article assumes that you have already dealt with computer image processing, are familiar with the basic operations of linear algebra (convolution, norm, matrix inversion), and understand the Fourier transform in general terms.
Hereinafter:
$A \odot B$ denotes the elementwise multiplication of matrices $A$ and $B$
$A * B$ denotes the convolution of matrices $A$ and $B$
$\hat{A}$ denotes the frequency matrix of the fast Fourier transform applied to image $A$
$\|A\|^2$ denotes the sum of the squares of the elements of matrix $A$
Trivial solution
At first glance, the task of tracking a specific object does not seem so difficult.
Let us have $N$ consecutive video frames $I_1, I_2, \dots, I_N$ of size $m \times n$ pixels. On the opening frame of the video, a rectangle $R$ is circled around some object. It is required to find the location of this body in all the other frames.
Remove the noise from the images, then normalize each of them to the range from $-1$ to $1$, so that overall brightness changes do not affect the detection. Take the first frame of the video without markup, $I_2$. If $I_1$ and $I_2$ are adjacent frames of a video with a decent FPS, the desired object has most likely not moved far from its old position. Let's take advantage of this. Cut out of the place where the body was previously located a rectangle $F$ of size $w \times h$. "Dragging" $F$ across $I_2$, at each point we calculate the sum of squared differences:

$$D(i, j) = \sum_{k, l} \left( F(k, l) - I_2\!\left(i + k - \tfrac{w}{2},\; j + l - \tfrac{h}{2}\right) \right)^2$$

where, to calculate the difference at $(i, j)$, the center of $F$ is aligned with the element of $I_2$ at $(i, j)$, and the missing values are taken to be zero. After that we look for the minimum in the matrix $D$; its location minus the coordinates of the center of the old rectangle will be the displacement of the desired object on $I_2$.
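For illustration, a minimal sketch of this search in OpenCV (assuming `frame` and `template` are grayscale `float32` arrays; `cv2.TM_SQDIFF` is exactly the sum of squared differences):

```python
import cv2

# slide the template over the frame and minimize the sum of
# squared differences: this computes the matrix D from above
D = cv2.matchTemplate(frame, template, cv2.TM_SQDIFF)
_, _, min_loc, _ = cv2.minMaxLoc(D)
x, y = min_loc  # top-left corner of the best match; add half the
                # template size to get the center of the object
```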
In order for the abrupt transition to zero not to "ring" during detection, it is better to initially take a rectangle a little larger than necessary and smoothly decay the values of $F$ to zero closer to its boundary. To do this, multiply $F$ elementwise by a mask. In the case of square objects, the good old exponential $\exp\left(-\frac{(x - x_c)^2 + (y - y_c)^2}{2\sigma^2}\right)$ (where $(x_c, y_c)$ is the center of the window) will do, but in general it is better to take a two-dimensional Hanning window.
For $I_3$, the window is taken from the position predicted on frame $I_2$, and so on.
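A two-dimensional Hanning window is just the outer product of two one-dimensional ones; a quick numpy sketch:

```python
import numpy as np

def hanning2d(h, w):
    # smoothly decays to zero toward all four edges of the patch
    return np.outer(np.hanning(h), np.hanning(w))

patch = patch * hanning2d(*patch.shape)  # apply the mask elementwise
```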
Example
Instead of the $L_2$ norm, one can use the $L_1$ norm ($D = \sum |F - I_2|$) or minus the cross-correlation of the matrices ($D = -\sum F \odot I_2$). Both are considered slightly faster to compute than $L_2$, but they have their own quirks: $L_1$ is non-differentiable and less sensitive to large differences in pixel values, while cross-correlation can give false positives if the sample is low-contrast and there are very light or very dark areas in the image.
The normalized version of the metric does not have such a flaw:

$$D(i, j) = -\frac{\sum_{k, l} F(k, l)\, I_2\!\left(i + k - \tfrac{w}{2},\; j + l - \tfrac{h}{2}\right)}{\sqrt{\sum_{k, l} I_2\!\left(i + k - \tfrac{w}{2},\; j + l - \tfrac{h}{2}\right)^2}}$$

Here the "energy" of the image area under the window acts as a balancing factor in the denominator (the sum of the squares of the sample's own pixel values is the same for all window positions and has no practical significance here).
Even such a primitive algorithm does a pretty good job in the case of linear movement of objects (for example, a camera looking down at a conveyor belt). However, the simplicity and speed of the model come at a price; this tracking method has several drawbacks:
A simple linear motion of an object that does not change its appearance occurs infrequently. As a rule, bodies in the field of view of the camera undergo several classes of changes. In order of increasing complexity: size increase/decrease, rotations, affine transformations, projective transformations, nonlinear transformations, changes of the object itself. Even leaving aside object changes and nonlinear transformations, we would like the algorithm to recover after relatively simple rotations and changes in size. Obviously, the procedure above does not have this property. It will probably still give a noticeable response on the object, but it will be difficult to determine the exact location of the sample, and the track will be intermittent.
Example
We have shown the algorithm only one positive sample, and it is hard to say what response it will give if another similar object appears in the window. It is fine if the desired object is contrasting and has a rare structure, but what if we want to follow one car in a stream of other cars? The tracker may unpredictably jump from one car to another.
On each frame, we discard the entire history. It should probably be put to use somehow as well.
Moreover, we train on only one point of the image. It would be better if the tracker also gave a good response near the correct location of the object. This is a bit counterintuitive, but think about it: if the filter gives the best value at the exact location $(x, y)$ of the object and some random value at $(x + 1, y)$, then it is too sensitive to small details that can easily change. Conversely, if $(x, y)$ and $(x + 1, y)$ give approximately equally good values, the filter has "hooked onto" larger and, we hope, more permanent features.
With a naive implementation of the tracking procedure, for each pixel of the frame we multiply the entire window containing the tracked object by the corresponding part of that frame. The complexity of this operation is $O(m n w h)$. At such cost it is not very pleasant to track even objects 50 by 50 pixels in size. This problem is partially solved by reducing the size of the video, but when the image is reduced to less than roughly 240 pixels in width, even large significant details start to get lost, which makes the algorithm meaningless.
ASEF, MOSSE
Trivial approach++?
Let's roll up our sleeves and try to solve the problems above.
Let's augment the original image: apply some mild affine transformations to it; we can also add noise or change the gamma. Thus, instead of a single image, we get a micro-dataset of $P$ pictures. There are now many images, but still one window. So now we will not just cut a rectangle out of the image, but look for some filter $W$ that gives a good response on every training image. The task turns into a minimization problem:

$$\min_W \sum_{p=1}^{P} D_p$$

where $D_p$ is the sum of squared pixel differences between $W$ and the patch at the exact location of the object on the $p$-th synthetic image created from the frame that has the true markup.
In addition, you can sample rectangles far from the location of the tracked object and maximize the difference shown above for them (a sketch of the augmentation itself is shown below).
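A sketch of such augmentation with mild random affine transformations; the rotation and scale ranges here are arbitrary guesses to be tuned, and `(cx, cy)` is assumed to be the object center:

```python
import cv2
import numpy as np

def random_affine(image, center, max_angle=5.0, max_scale=0.05):
    # small random rotation and scaling about the object center
    angle = np.random.uniform(-max_angle, max_angle)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]),
                          borderMode=cv2.BORDER_REFLECT)

views = [random_affine(frame, (cx, cy)) for _ in range(16)]  # micro-dataset
```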
It is harder to formalize the wish that the filter give a good response at points near the exact location of the object. We know that applying the filter $W$ at the exact location $(x_p, y_p)$ with the $L_2$ metric should give 0, a bit more right next to it, and still more farther away. Moreover, we have no preferred direction: the response should be centrally symmetric with respect to $(x_p, y_p)$. It looks like we can mathematically express what the response of the filter applied to the reference images should look like! The exact form may differ depending on the chosen attenuation function, but everyone loves Gaussians, right? So let us suppose that $W$ applied to the $p$-th training image should ideally give the response

$$G_p(i, j) = 1 - \exp\left(-\frac{(i - x_p)^2 + (j - y_p)^2}{2\sigma^2}\right)$$

(zero at the center, growing toward the edges). The minimization problem then turns into:

$$\min_W \sum_{p=1}^{P} \left\| D_p - G_p \right\|^2$$

where $D_p$ now denotes the whole map of responses of $W$ dragged over the $p$-th image. We no longer minimize the response at a single point; we minimize the deviation of the whole response map from the desired one.
Wait a second... What have we done? That is on the order of $P \cdot m \cdot n$ equations with $w \cdot h$ variables to minimize, and each response map is expensive to compute. It seems we overdid it. Let's go back a bit.
The main trick
Of all the problems, the greatest difficulty is the computational complexity $O(m n w h)$. Can anything be done here, besides cleverly splitting one search window into several small ones, or searching on a low-resolution picture and then refining the position at high resolution?
It turns out you can! Calculus tells us that convolution in ordinary space corresponds to elementwise multiplication of Fourier transforms in the frequency domain. Apply the fast Fourier transform to the images, multiply their frequency matrices elementwise, and the result can be turned back into a matrix in $O(mn \log(mn))$, which is much faster than honestly convolving the matrices. Fourier! Who would have thought! In the age of tensorflow, it can still help us in computer vision.
(This, by the way, demonstrates a general mathematical principle: if you don't want to solve a problem in space $X$, transfer it to space $Y$, solve it there, and bring the solution back. The workaround is often shorter than the straight path.)
As shown above, we can use cross-correlation to localize the sample in the image. And cross-correlation is a convolution with a horizontally and vertically flipped $W$. Calculus suggests that in this case the image frequencies must be multiplied by the complex conjugate of the filter's frequency matrix:

$$\hat{G} = \hat{F} \odot \hat{W}^*$$

where $G$ is the ideal response function on the reference image (now a Gaussian with a peak at the center). Note that we minimized the $L_2$ metric but maximize the correlational one, so from now on, the bigger the response, the better.
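A sketch of cross-correlation through the frequency domain (the filter and the patch are assumed to have the same size):

```python
import numpy as np

def correlate(patch, filt):
    # convolution theorem: cross-correlation in the spatial domain is
    # elementwise multiplication by the conjugate spectrum in frequency
    G_hat = np.fft.fft2(patch) * np.conj(np.fft.fft2(filt))
    return np.real(np.fft.ifft2(G_hat))
```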
If we had one image, we would find the exact frequency matrix of the filter:

$$\hat{W}^* = \frac{\hat{G}}{\hat{F}}$$

where the right-hand side means elementwise division. But a little earlier we generated $P$ images from the source one. Can they be used in the Fourier approach? There is no filter whose frequencies would perfectly suit all the images at once, but you can get something pretty good. There are two ways to solve the problem:
You can find the set of exact filters and then average them into one. This is the way of the authors of Average of Synthetic Exact Filters (ASEF):

$$\hat{W}^*_{\mathrm{ASEF}} = \frac{1}{P} \sum_{p=1}^{P} \frac{\hat{G}_p}{\hat{F}_p}$$

Here we use the linearity property of the Fourier transform. By averaging the frequencies as shown above, we are, in effect, averaging the weights of several filters.
Or you can find filter frequencies that satisfy all the pictures on average, just as we did for $D$:

$$\min_{\hat{W}^*} \sum_{p=1}^{P} \left\| \hat{F}_p \odot \hat{W}^* - \hat{G}_p \right\|^2$$

To find the minimum, take the derivative with respect to the filter elements and set it to zero. This yields the Minimum Output Sum of Squared Error (MOSSE) filter:

$$\hat{W}^*_{\mathrm{MOSSE}} = \frac{\sum_{p=1}^{P} \hat{G}_p \odot \hat{F}_p^*}{\sum_{p=1}^{P} \hat{F}_p \odot \hat{F}_p^*}$$
Hmm, similar elements appear in both formulas. In fact, for $P = 1$ the formulas for ASEF and MOSSE coincide exactly: the exact filter for a single image can be represented as

$$\frac{\hat{G}_p}{\hat{F}_p} = \frac{\hat{G}_p \odot \hat{F}_p^*}{\hat{F}_p \odot \hat{F}_p^*}$$

Substitute this into the formula for ASEF and get:

$$\hat{W}^*_{\mathrm{ASEF}} = \frac{1}{P} \sum_{p=1}^{P} \frac{\hat{G}_p \odot \hat{F}_p^*}{\hat{F}_p \odot \hat{F}_p^*}$$
Aha! Now it is much easier to see that ASEF and MOSSE differ only in the way the filters are averaged! It is argued that MOSSE generates better filters than ASEF. That sounds logical: fitting to the whole pack of images at once should be better than averaging per-image filters.
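To make this concrete, here is a minimal numpy sketch of computing the MOSSE filter from a pack of already preprocessed patches; `gaussian_response` and the `eps` constant are illustrative assumptions (regularization is discussed below):

```python
import numpy as np

def gaussian_response(h, w, sigma=2.0):
    # desired response G: an all-zero matrix with a small
    # Gaussian peak in the center
    ys, xs = np.mgrid[:h, :w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2)
                  / (2 * sigma ** 2))

def train_filter(patches, G_hat, eps=1e-5):
    # accumulate the numerator and denominator over all patches,
    # then divide elementwise; the result is conj(W_hat)
    A = np.zeros_like(G_hat)
    B = np.zeros_like(G_hat)
    for f in patches:
        F_hat = np.fft.fft2(f)
        A += G_hat * np.conj(F_hat)
        B += F_hat * np.conj(F_hat)
    return A / (B + eps)
```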
After we have obtained $\hat{W}^*$, on a new frame we calculate the frequency-domain response $\hat{G} = \hat{F} \odot \hat{W}^*$, translate it into the spatial domain, and look for the maximum in the resulting matrix $G$. Wherever the maximum is, there is the new position of the object.
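The detection step then looks roughly like this, reusing `train_filter` and `gaussian_response` from the sketch above (`window` is assumed to be the preprocessed patch cut at the previous position):

```python
G = gaussian_response(*window.shape)
W_conj = train_filter(patches, np.fft.fft2(G))

# response on the new frame: the offset of the peak from the window
# center gives the displacement of the object
resp = np.real(np.fft.ifft2(np.fft.fft2(window) * W_conj))
dy, dx = np.unravel_index(np.argmax(resp), resp.shape)
dy -= window.shape[0] // 2
dx -= window.shape[1] // 2
```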
Additional points
The sum terms in the denominators of the formulas have an interesting physical meaning: $\hat{F}_p \odot \hat{F}_p^*$ is the energy spectrum of the rectangle with the object cut from the $p$-th image.
Pay attention to the interesting symmetry. To get the response, we multiplied the filter frequencies by the image frequencies; now, to get the filter frequencies, we multiply the response frequencies by the image frequencies (and normalize).
In real life, the division can run into zeros, so a regularizing constant $\epsilon$ is usually added to the denominator:

$$\hat{W}^* = \frac{\sum_{p} \hat{G}_p \odot \hat{F}_p^*}{\sum_{p} \hat{F}_p \odot \hat{F}_p^* + \epsilon}$$

It is argued that such regularization makes the filter pay more attention to the low frequencies, which improves its generalization ability.
When processing a real video, you usually want to keep the information about the tracked object gathered on the previous frames. When moving to the next frame, the filter can be updated rather than recomputed from scratch. The update formula for ASEF:

$$\hat{W}^*_t = \eta\, \hat{W}^*_{\mathrm{new}} + (1 - \eta)\, \hat{W}^*_{t-1}$$

For MOSSE, the numerator and the denominator have to be accumulated separately:

$$A_t = \eta\, \hat{G}_t \odot \hat{F}_t^* + (1 - \eta)\, A_{t-1}$$
$$B_t = \eta\, \hat{F}_t \odot \hat{F}_t^* + (1 - \eta)\, B_{t-1}$$
$$\hat{W}^*_t = \frac{A_t}{B_t + \epsilon}$$

where $\eta$ is the learning rate.
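In code, the MOSSE update might look like this; 0.125 is the learning rate suggested in the original paper, so treat it only as a starting point:

```python
import numpy as np

def update(A, B, F_hat, G_hat, lr=0.125):
    # exponential running average of the numerator and the
    # denominator, accumulated separately
    A = lr * (G_hat * np.conj(F_hat)) + (1 - lr) * A
    B = lr * (F_hat * np.conj(F_hat)) + (1 - lr) * B
    return A, B  # the filter is then W_conj = A / (B + eps)
```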
It is important to remember that correlation via the Fourier transform is not exactly the same as the honest computation described at the beginning of the article. When correlating via the FFT, the missing elements are not zeroed out but substituted from the opposite side, as if the image looped around from right to left and from bottom to top. But at the very beginning of the article we already decided to darken the edges, so this problem will not have a noticeable effect.
As mentioned above, cross-correlation has an unpleasant feature: in general, a light filter can give a strong response on light areas of the image, even if they do not match it in the contrasting places. The problems do not end there: even a single coinciding pixel with a strongly positive or strongly negative value can throw the filter off if the whole sample is low-contrast. To smooth out this effect, the preprocessing should include a nonlinear transformation of the image pixels that squeezes too-light and too-dark areas toward the middle. Matches in contrasting areas then contribute more strongly to the metric. The ASEF and MOSSE articles use the logarithm:

$$I' = \log(I + 1)$$

where $I$ holds pixels from 0 to 255. In my opinion, this is too harsh and ignores the problem of the strong response of a dark filter on black areas. A scheme that is symmetric about the middle of the range and squeezes both extremes works better.

Then comes normalization, after which most of the elements end up concentrated around zero.
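Put together, the preprocessing of a patch might look like this sketch (assuming raw grayscale values 0..255 on input; `hanning2d` is the edge mask from the earlier sketch):

```python
import numpy as np

def preprocess(patch, eps=1e-5):
    x = np.log1p(patch.astype(np.float64))  # squeeze the bright areas
    x = (x - x.mean()) / (x.std() + eps)    # concentrate values near zero
    return x * hanning2d(*x.shape)          # smooth the edges to zero
```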
How can this algorithm determine that the tracked object has disappeared from the frame? A more detailed analysis of the response obtained from the next frame helps here. The creators of MOSSE propose the PSR, Peak to Sidelobe Ratio. Let $g_{\max}$ be the maximum element of the response, corresponding to the new position of the object $(x_{\max}, y_{\max})$. Exclude from consideration a small square (say, $11 \times 11$) around this peak. Calculate the mean $\mu_{sl}$ and the standard deviation $\sigma_{sl}$ of the remaining pixels (the sidelobe). Then

$$PSR = \frac{g_{\max} - \mu_{sl}}{\sigma_{sl}}$$

If this value is above a certain threshold, the detection is considered successful. The threshold is usually taken somewhere between 3 and 10; on confident detections, the PSR usually stays above 20.
(Note that PSR here does not mean what it usually means in signal processing theory, so don't google it; nothing good will come of it.)
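A sketch of the PSR computation; the 11x11 exclusion window follows the MOSSE paper:

```python
import numpy as np

def psr(resp, peak_yx, exclude=11):
    # peak height against the statistics of the sidelobe: everything
    # outside a small square around the maximum
    y, x = peak_yx
    mask = np.ones(resp.shape, dtype=bool)
    r = exclude // 2
    mask[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = False
    sidelobe = resp[mask]
    return (resp[y, x] - sidelobe.mean()) / (sidelobe.std() + 1e-5)
```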
The algorithm is extremely simple. Performing the tracking procedure on a Core i7, on 320x400 pictures, using the OpenCV implementation takes from 0.5 to 2 milliseconds, depending on the size of the tracked object.
MOSSE algorithm
Let's put it all together.
General state:
Filter frequency matrix: $\hat{W}^*$
Auxiliary matrices for calculating the filter frequencies: $A$ and $B$
Frequency matrix of the desired ideal response: $\hat{G}$
Learning rate during tracking: $\eta$
Rectangle of the current position of the object: $R$
Number of transformations: $P$
Response threshold: $PSR_{\min}$
Auxiliary function: Training. Input: image $I$, current learning rate $\eta_{cur}$.
While $P$ transformations have not yet been accumulated:
Apply a small random affine transformation to the image, centered at the center of $R$.
Cut the rectangle $F_p$ with the object out of the image.
Apply the mask to it to smooth the edges.
Translate it into the frequency domain: $\hat{F}_p$.
If $\eta_{cur} = 1.0$, replace $A$ and $B$ with $\sum_p \hat{G} \odot \hat{F}_p^*$ and $\sum_p \hat{F}_p \odot \hat{F}_p^*$. Otherwise:
$$A \leftarrow \eta_{cur} \sum_p \hat{G} \odot \hat{F}_p^* + (1 - \eta_{cur})\, A$$
$$B \leftarrow \eta_{cur} \sum_p \hat{F}_p \odot \hat{F}_p^* + (1 - \eta_{cur})\, B$$
Calculate the filter frequencies: $\hat{W}^* = A / (B + \epsilon)$
Initialization. Input: image $I$, rectangle $R$ around the position of the object.
Prepare the desired response $G$. This is usually an all-zero matrix with a small Gaussian in the center.
Translate it into the frequency domain: $\hat{G}$.
Training: $I$, $1.0$.
Tracking. Input: image $I$.
Cut the rectangle $F$ out of $I$ at the previous position $R$ of the object.
Apply the mask to it to smooth the edges.
Translate it into the frequency domain: $\hat{F}$.
Translate $\hat{F} \odot \hat{W}^*$ into the spatial domain: $G_{resp}$.
Find the maximum in $G_{resp}$: $g_{\max}$ at $(x_{\max}, y_{\max})$.
Calculate the strength of the response: $PSR$.
If $PSR < PSR_{\min}$, exit with failure.
Update the position $R$. Adjust $R$ if it went beyond the image, or if it was decided that the object has grown or shrunk.
Training: $I$, $\eta$.
Return $R$. (A hypothetical glue function for one iteration of this loop is sketched below.)
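Here is a hypothetical sketch of one tracking iteration, tying together the `preprocess`, `update` and `psr` helpers from the earlier sketches; rectangle clipping, scale handling, and the re-augmentation inside Training are omitted for brevity:

```python
import numpy as np

def track_step(frame, rect, A, B, G_hat, lr=0.125, psr_min=8.0):
    x, y, w, h = rect
    f = preprocess(frame[y:y + h, x:x + w])
    resp = np.real(np.fft.ifft2(np.fft.fft2(f) * (A / (B + 1e-5))))
    py, px = np.unravel_index(np.argmax(resp), resp.shape)
    if psr(resp, (py, px)) < psr_min:
        return None, A, B                      # exit with failure
    rect = (x + px - w // 2, y + py - h // 2, w, h)
    x, y, w, h = rect
    f = preprocess(frame[y:y + h, x:x + w])
    A, B = update(A, B, np.fft.fft2(f), G_hat, lr)
    return rect, A, B
```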
Implementation details may vary. For example:
Preprocessing can be applied only to $F$ and not to the whole image.
$G$ can be re-created for each transformation of the image, varying the response function and its width.
Several different filters can be trained simultaneously on several scales of the object, to detect movement toward and away from the camera.
What it looks like
For starters, a few examples of the two-dimensional Fourier transform.
Some simple examples
Let me remind you that the result of the transform is complex-valued. The pictures below show triples: the image, the absolute values of its frequency matrix on a linear scale, and the absolute values on a logarithmic scale.
Vertical lines:
The picture changes from left to right in the same way at any vertical position. Moreover, the change is periodic, with a clear period and a clear pattern. That is why in the frequency pictures you see only frequencies along the horizontal axis.
Grid:
Note that besides the expected series of frequencies along both axes there are strange parasitic frequencies. They appear because, first, the picture is finite, whereas the Fourier transform decomposes into a nice sum only for an infinite periodic signal; second, one can see that the image does not form an exact period at its edges.
Slanted lines:
Both the frequencies corresponding to the main direction and the parasitic frequencies are again visible.
Slanted lines plus distortion:
Several characteristic directions are visible in the frequency picture, but intuitively reconstructing the image from them is already difficult.
For images of the real world, it is even harder to picture the image in your head from its frequencies:
(the frequencies in the center are masked so that they do not outshine the rest of the spectrum)
Now let us turn to a real example of the algorithm at work:
Pack of pictures
Image with marked object:
The cut-out and preprocessed object $F$, and its spectrum $\hat{F}$ on linear and logarithmic scales:
Desired response $G$:
Filter frequencies $\hat{W}$ on linear and logarithmic scales:
Note that the filter itself in the spatial domain is not involved anywhere in the algorithm; it can be computed purely out of curiosity. Also note that the filter looks like God knows what. One would expect the filter to be something similar to the original image, but this is not always true: a filter similar to the image itself would hardly give the desired Gaussian response.
The response from the next frame:
Although it is not as clean as the desired response, the maximum is easy to determine.
The same example with a narrower desired response:
Pack of pictures
Narrower:

Even narrower:
With a very narrow maximum, an eye is clearly visible on the filter instead of a black spot.
The filters for the previous three cases when 16 transformations of the input image are used for training:
Another pack of pictures
Wide maximum:
Average maximum:
Narrow maximum:
The more transformations, the less the filter clings to random elements. It is especially clear that the random black and white spots have disappeared from the middle of the filter. On the other hand, for narrow Gaussians, training on several pictures can work against you: look at the ringing that has formed in the filter around the eye.
If you want to see how this looks live, download from here my test repository with an implementation of MOSSE that outputs debug images. You can find more MOSSE variants on GitHub. Besides that, it is implemented in OpenCV.
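If you just want the stock version, recent OpenCV builds with the contrib modules ship a ready-made MOSSE tracker (in older 3.x builds the factory is `cv2.TrackerMOSSE_create`):

```python
import cv2

tracker = cv2.legacy.TrackerMOSSE_create()
tracker.init(first_frame, (x, y, w, h))  # bounding box of the object
ok, box = tracker.update(next_frame)     # ok == False when the track is lost
```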
Conclusion
Thank you for your attention, Habr readers. MOSSE and ASEF tracking are not the most complex algorithms in the world. That makes it all the easier not only to apply them effectively, but also to understand what their creators were guided by. I hope my explanation has helped you get into the researchers' heads and follow their train of thought. This can be useful: machine learning is not a static area of knowledge, and it has room for creativity and for research of your own. Try digging into some well-established algorithm: saw off unnecessary parts to speed it up, or add a couple so that it works better in your particular case. You'll like it!
This article was written with the support of the company DSSL.