
Optical trackers: ASEF and MOSSE

One of the important subtasks of video analytics is tracking objects in video. It is not so primitive that it has to be handled at the pixel-by-pixel level, yet not so complicated that it unequivocally requires a multilayer neural network. Tracking can be used both as an end in itself and as part of other algorithms:


Even without looking at the literature, I can say with confidence that the best way to solve the problem is to use neural networks. In principle, one could stop writing right there, but it is not always possible to throw a pair of GTX 1080Ti cards at the task. If you care how to track objects in video in such cases, welcome under the cut. I will try not just to explain how the ASEF and MOSSE trackers work, but to walk you to the solution so that the formulas seem obvious.

To begin with, let's answer a preliminary question: why invent anything at all when we can feed TensorFlow a pack of videos and leave the computer alone for a couple of weeks? Neural network approaches have one serious drawback: even on modern video cards it is hard to get a good frame rate out of them. This is not a problem if we analyze recorded video, but it puts a spoke in the wheels if we want to work in real time. Suppose we want to process video from five cameras at 10 FPS. Even under such relatively mild conditions, the neural network must have an inference time of less than $\frac{1000}{5 \times 10} = 20$ milliseconds (assuming no parallelism at all). For comparison, YoloV3, a detection network with a relatively simple architecture, processes an image in about 50 milliseconds. Besides, solutions based on powerful graphics cards can be simply expensive.

This article assumes that you have already dealt with computer image processing, are familiar with the basic operations of linear algebra (convolution, norm, matrix inversion), and understand the Fourier transform in general terms.
Hereinafter:


Trivial solution


At first glance, the task of tracking a specific item does not seem so difficult.

Let us have $T$ consecutive video frames $I_t$ of size $w \times h$ pixels. On the first frame of the video $I_0$, a rectangle $F_0$ of size $m \times n$ is drawn around some object. We need to find the location of this object on all the other frames $I_t$.

Remove the noise from the images, then normalize each of them to the range from -1 to 1 so that overall brightness changes do not affect detection. Take the first frame of the video without markup, $I_1$. If $I_0$ and $I_1$ are adjacent frames of a video with decent FPS, the desired object has most likely not moved far from its original position. Let's take advantage of this. Cut out of $I_1$ a rectangle $F_1$ at the place where the object was previously located. "Drag" $F_0$ across $F_1$ and at every position compute the sum of squared differences

$G_{L2}(i,j) = \| F_1(i,j) - F_0 \|^2, \quad i \in [0, m],\ j \in [0, n]$

where, to compute the difference in $G_{L2}(i,j)$, the center of $F_0$ is aligned with the element $(i,j)$ of $F_1$, and the missing values are taken to be zero. After that we look for the minimum in the matrix $G_{L2}$; its location minus the coordinates of the center of $F_1$ gives the displacement of the desired object on $I_1$.

To prevent the abrupt transition to zero from "ringing" during detection, it is better to initially take a rectangle a little larger than necessary and smoothly attenuate the values towards the borders of $F_0$ and $F_1$. To do this, each of the $F$ should be multiplied by a mask. In the case of square objects, the good old exponential will do: $A = \exp\left(-\frac{(x-i)^2 + (y-j)^2}{\sigma^2}\right)$ (where $(x, y)$ is the center of the window), but in general it is better to take a two-dimensional Hanning window.

For $I_2$, the window $F_2$ is taken from the position predicted on the frame $I_1$, and so on.
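Here is a minimal sketch of the search described above (my own Python/numpy illustration, not code from the article; the helper names and the blur/normalization parameters are arbitrary assumptions):

```python
import numpy as np
import cv2

def preprocess(patch):
    """Denoise, scale to [-1, 1] and fade the borders with a Hanning mask."""
    patch = cv2.GaussianBlur(patch.astype(np.float32), (3, 3), 0)
    span = patch.max() - patch.min()
    patch = 2.0 * (patch - patch.min()) / (span + 1e-5) - 1.0
    mask = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    return patch * mask

def ssd_search(template, search_window):
    """Drag the template over the search window and return the offset of the
    position with the minimum sum of squared differences (the G_L2 minimum)."""
    th, tw = template.shape
    sh, sw = search_window.shape
    g_l2 = np.empty((sh - th + 1, sw - tw + 1), dtype=np.float32)
    for i in range(g_l2.shape[0]):
        for j in range(g_l2.shape[1]):
            diff = search_window[i:i + th, j:j + tw] - template
            g_l2[i, j] = np.sum(diff * diff)
    return np.unravel_index(np.argmin(g_l2), g_l2.shape)
```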

Example


Instead of the $L2$ norm one can use $L1$ ($G_{L1}(i,j) = \|F_1(i,j) - F_0\|_1$) or minus the cross-correlation of the matrices ($G_{nCC}(i,j) = -\sum_{kl} F_{1,kl}(i,j)\, F_{0,kl},\ k \in [0, m],\ l \in [0, n]$). Both are slightly faster to compute than $L2$, but they have their own peculiarities. $L1$ is non-differentiable and less sensitive to large differences in pixel values. Cross-correlation can give false positives if the template has low contrast and there are very light or very dark areas in the image.

The $L2$ version of the metric does not have this flaw:

$G_{L2} = \sum (F_{1,kl}(i,j) - F_{0,kl})^2$

$= \sum (F_{1,kl}(i,j))^2 - 2 F_{1,kl}(i,j) F_{0,kl} + (F_{0,kl})^2$

$= \sum (F_{1,kl}(i,j))^2 - \sum 2 F_{1,kl}(i,j) F_{0,kl} + \sum (F_{0,kl})^2$

$= E_{F_1}(i,j) + 2 G_{nCC}(i,j) + E_{F_0}$

$E_{F_1}(i,j)$, the "energy" of the selected patch of $I_t$, acts as a balancing term ($E_{F_0}$, the sum of squared pixel values of the template, is the same for all window positions and has no practical significance here).
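A quick numerical sanity check of this expansion on random data (an illustration I added, not part of the original algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = rng.normal(size=(8, 8))     # template F0
f1 = rng.normal(size=(8, 8))     # image patch at some position (i, j)

ssd = np.sum((f1 - f0) ** 2)     # G_L2(i, j)
energy_f1 = np.sum(f1 ** 2)      # E_F1(i, j)
energy_f0 = np.sum(f0 ** 2)      # E_F0
neg_cc = -np.sum(f1 * f0)        # G_nCC(i, j), minus cross-correlation

assert np.allclose(ssd, energy_f1 + 2 * neg_cc + energy_f0)
```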

Even such a primitive algorithm does a pretty good job in the case of linear motion of objects (for example, a camera looking down at a conveyor belt). However, due to the simplicity of the model and performance considerations, this tracking method has several drawbacks:

  1. A simple linear motion of an object with no changes in its appearance is a rare case. As a rule, bodies in the camera's field of view can undergo several classes of changes. In order of increasing complexity: size increase/decrease, rotations, affine transformations, projective transformations, nonlinear transformations, changes of the object itself. Even leaving out object changes and nonlinear transformations, we would like the algorithm to recover after relatively simple rotations and size changes. Obviously, the procedure above does not have this property. Most likely, $F_0$ will still give a noticeable response on the object, but it will be difficult to determine the exact location of the template, and the track will be intermittent.

    Example

  2. We showed the algorithm only one positive sample, so it is hard to say what response $F_0$ will give if another similar object appears in the window. It's fine if the desired object is high-contrast and has a distinctive structure, but what if we want to follow a car in a stream of other cars? The tracker may unpredictably jump from one car to another.
  3. On each frame we discard all the history. It should probably be used somehow too.
  4. Moreover, we train only at a single point of the image. It would be better if the tracker also gave a good response near the correct location of the object. Slightly counterintuitive, but think about it: if the filter gives the best value at the exact location of the object $(x, y)$ and some random value at $(x+1, y+1)$, then it is too sensitive to small details that can easily change. Conversely, if the values at $(x, y)$ and $(x+1, y+1)$ are about equally good, the filter has "latched onto" larger and, we hope, more permanent features.
  5. With a naive implementation of the tracking procedure, for every pixel of the frame we multiply the entire window with the selected object by the corresponding part of the frame. The complexity of this operation is $O(m^2 n^2)$. With such complexity it is not much fun to track even objects of 50 by 50 pixels. The problem is partially solved by downscaling the video, but when the image is reduced to less than 240 pixels in width, even large significant details start to get lost, which makes the algorithm pointless.

ASEF, MOSSE


Trivial approach ++?


Let's roll up our sleeves and try to solve the problems listed above.

Let's augment the original image: apply several weak affine transformations to it. We can also add noise or change the gamma. Thus, instead of a single image, we get a micro-dataset of $P$ pictures (a sketch of this augmentation is shown a bit below). There are now many images, but still one window. So now we will not simply cut a rectangle out of the image, but look for some filter $W$ that gives a good response for every $I_p$. We turn the problem into a minimization problem:

$W : \min_W \| F_p - W \|^2, \quad p \in [1, P]$

where $\| F_p - W \|^2$ is the sum of squared pixel differences between $W$ and the patch at the true location of the object on the $p$-th synthetic image created from the frame with ground-truth markup.

In addition, one can sample rectangles far from the location of the tracked object and maximize the difference shown above for them.
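One possible way to generate such a micro-dataset of weak affine perturbations, sketched with OpenCV (the function name and the ranges of the perturbations are my assumptions):

```python
import numpy as np
import cv2

def augment(patch, count=8, rng=np.random.default_rng()):
    """Produce `count` slightly perturbed copies of the object patch."""
    h, w = patch.shape
    center = (w / 2.0, h / 2.0)
    samples = []
    for _ in range(count):
        angle = rng.uniform(-5.0, 5.0)        # small rotation, degrees
        scale = rng.uniform(0.95, 1.05)       # small zoom in / out
        m = cv2.getRotationMatrix2D(center, angle, scale)
        m[:, 2] += rng.uniform(-2.0, 2.0, 2)  # small shift along x and y
        samples.append(cv2.warpAffine(patch, m, (w, h),
                                      borderMode=cv2.BORDER_REFLECT))
    return samples
```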

The requirement that the filter give a good response at points near the exact location of the object is trickier. We know that at $(x, y)$ applying the filter with the $L2$ metric should give 0, a bit further away — more, far away — even more. Moreover, we have no preferred direction: the response should be centrally symmetric with respect to $(x, y)$. It looks like we can express mathematically what the response of the filter applied to the reference images should look like! The exact form may differ depending on the chosen attenuation function of the response, but everybody loves Gaussians, right? So let's say that $W$ applied to $F_p$ should ideally give the result $G_p = 1 - \exp\left(-\frac{(x-i)^2 + (y-j)^2}{\sigma^2}\right)$. The minimization problem then becomes:

$D_p(i,j) = \| F_p(i,j) - W \|^2$

$W : \min_W \| D_p(i,j) - G_p(i,j) \|^2, \quad p \in [1, P]$

Now we are minimizing not the response at a single point, but the deviation of the whole response map from the desired one.

Wait a second... Now we have $P \times m \times n$ equations with $m \times n$ unknowns to minimize. It seems we overdid it. Let's step back a bit.

The main trick


Of all the problems, the biggest one is the $O(m^2 n^2)$ complexity. Can anything be done about it, besides the tricky splitting of one search window into several small ones, or searching on a downscaled picture followed by refinement at full resolution?

It turns out something can! Mathematical analysis tells us that convolution of functions in ordinary space corresponds to multiplication of their Fourier transforms. Apply the fast Fourier transform to the images, multiply their frequencies element-wise, and then turn the result back into a matrix in $O(mn \log mn)$, which is much faster than convolving the matrices honestly. Fourier! Who would have thought! In the age of TensorFlow, it can still help us in computer vision.
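A sketch of the same correlation done through the FFT with numpy (my illustration; the conjugation of the template spectrum is what turns convolution into cross-correlation, as discussed just below):

```python
import numpy as np

def correlate_fft(window, template):
    """Circular cross-correlation of two equally sized patches via the FFT."""
    f_hat = np.fft.fft2(window)
    w_hat = np.fft.fft2(template)
    # Conjugating the template spectrum turns convolution into correlation.
    return np.real(np.fft.ifft2(f_hat * np.conj(w_hat)))

# The peak of the response marks the displacement of the object:
# response = correlate_fft(preprocess(f1), preprocess(f0))
# dy, dx = np.unravel_index(np.argmax(response), response.shape)
```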

(This, by the way, demonstrates a general mathematical principle: if you don't want to solve a problem in space $X$, transfer it to space $Y$, solve it there, and bring the solution back. The detour is often shorter than the straight path.)

As shown above, we can use cross-correlation to localize the template in the image. Cross-correlation, though, is a convolution with a horizontally and vertically flipped $W$. Mathematical analysis suggests that in this case we need to multiply the frequencies of $F$ by the complex conjugate of the frequency matrix of $W$:

$\hat{W}(\omega, \nu) = \mathcal{F}(W(x, y))$

$\hat{F}(\omega, \nu) = \mathcal{F}(F(x, y))$

$\hat{G}_{conv}(\omega, \nu) = \mathcal{F}(G_{conv}(x, y))$

$G_{conv} = F \otimes W \;\rightarrow\; \hat{G}_{conv} = \hat{F} \odot \hat{W}$

where $G_{conv} = \exp\left(-\frac{(x-i)^2 + (y-j)^2}{\sigma^2}\right)$ is the ideal response function on the reference image. Note that we minimized the $L2$ metric, but we maximize the correlation one, so from now on the larger the response, the better.
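Building such an ideal response is nearly a one-liner; a small sketch (the helper name and the default $\sigma$ are my assumptions):

```python
import numpy as np

def gaussian_response(height, width, cx, cy, sigma=2.0):
    """Desired response: a 2-D Gaussian with its peak at the object center."""
    y, x = np.mgrid[0:height, 0:width]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (sigma ** 2))
```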

If we had just one image, we could find the exact frequency matrix of the filter:

$\hat{W} = \frac{\hat{G}_{conv}}{\hat{F}}$

where the right-hand side means element-wise division. But a little earlier we generated $P$ images from the source frame. They can be put to work in the Fourier approach as well. There is no filter whose frequencies would perfectly suit all images, but we can get something reasonably good. There are two ways to solve the problem:

  1. You can find a set of exact filters and then average them into one. This is how the authors of Average of Synthetic Exact Filters (ASEF) proceed:

     $\hat{W} = \frac{1}{P} \sum_{p=1}^{P} \hat{W}_p = \frac{1}{P} \sum_{p=1}^{P} \frac{\hat{G}_p}{\hat{F}_p}$

    Here we use the linearity of the Fourier transform. By averaging the frequencies as shown above, we effectively average the weights of several filters.
  2. You can find filter frequencies that satisfy all the pictures on average, in the same $L2$ sense as before:

     $\hat{W} : \min_{\hat{W}} \sum_{p=1}^{P} \| \hat{F}_p \odot \hat{W} - \hat{G}_p \|^2$

    To find the minimum, we take the derivative with respect to the filter elements and set it to zero:

     $\frac{\partial}{\partial \hat{W}} \sum_{p=1}^{P} \| \hat{F}_p \odot \hat{W} - \hat{G}_p \|^2 = 0$

    An honest derivation can be found in Visual Object Tracking using Adaptive Correlation Filters, the paper that proposes Minimum Output Sum of Squared Error (MOSSE) filters. The result (here $\hat{F}_p^*$ denotes the element-wise complex conjugate):

     $\hat{W} = \frac{\sum_{p=1}^{P} \hat{G}_p \odot \hat{F}_p^*}{\sum_{p=1}^{P} \hat{F}_p \odot \hat{F}_p^*}$

Hmm, similar terms appear in both formulas. For $P = 1$ the formulas for ASEF and MOSSE coincide completely: $\hat{W}$ for a single image can be written as

$\hat{W} = \frac{\hat{G}_p}{\hat{F}_p} = \frac{\hat{G}_p \odot \hat{F}_p^*}{\hat{F}_p \odot \hat{F}_p^*}$

Substituting this into the ASEF formula gives

$\hat{W} = \frac{1}{P} \sum_{p=1}^{P} \frac{\hat{G}_p \odot \hat{F}_p^*}{\hat{F}_p \odot \hat{F}_p^*}$

Aha! Now it is much easier to see that ASEF and MOSSE differ only in the way the filters are averaged. MOSSE is claimed to produce better filters than ASEF. That sounds logical: fitting to the whole pack of images at once should be better than averaging per-image filters.
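Both averaging schemes fit in a few lines of numpy. A sketch under the assumption that `patches` holds the preprocessed augmented windows $F_p$ and `g_hat` the spectrum of the desired response; the small `eps` regularizer is my addition to avoid division by zero:

```python
import numpy as np

def asef_filter(patches, g_hat, eps=1e-5):
    """ASEF: average the exact per-image filters in the frequency domain."""
    w_hat = np.zeros_like(g_hat, dtype=np.complex128)
    for f in patches:
        f_hat = np.fft.fft2(f)
        w_hat += (g_hat * np.conj(f_hat)) / (f_hat * np.conj(f_hat) + eps)
    return w_hat / len(patches)

def mosse_filter(patches, g_hat, eps=1e-5):
    """MOSSE: fit a single filter to all the images at once (least squares)."""
    num = np.zeros_like(g_hat, dtype=np.complex128)
    den = np.zeros_like(g_hat, dtype=np.complex128)
    for f in patches:
        f_hat = np.fft.fft2(f)
        num += g_hat * np.conj(f_hat)
        den += f_hat * np.conj(f_hat)
    return num / (den + eps)
```

For a single patch both functions return the same filter, as noted above.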

After we obtain $\hat{W}$, we compute the response in the frequency domain, $\hat{G}_{conv} = \hat{F} \odot \hat{W}$, translate it into the spatial domain and look for the maximum in the resulting matrix $G$. Where the maximum is — there is the new position of the object.

Additional points



MOSSE algorithm


Let's put it all together.

General state:

Filter frequency matrix: $\hat{W}$
Auxiliary matrices for computing the filter frequencies: $A, B$
Frequency matrix of the desired ideal response: $\hat{G}$
Learning rate during tracking: $\eta$
Rectangle of the current object position: $R$
Number of transformations: $P$
Response threshold: $PSR_{thr}$

Auxiliary procedure: Training. Input: image $I$, current learning rate $\eta_{current}$

  1. $A_{new} := 0, B_{new} := 0$
  2. Until $P$ transformations have been accumulated:
    1. Apply a small random affine transformation to the image, centered at the center of $R$
    2. Cut the rectangle $F$ with the object out of the image
    3. Apply a mask to it to smooth the edges
    4. Translate $F$ into the frequency domain: $\hat{F}$
    5. $A_{new} = A_{new} + \hat{G} \odot \hat{F}^*$
    6. $B_{new} = B_{new} + \hat{F} \odot \hat{F}^*$

  3. If $\eta_{current} \geq 1.0$, replace $A$ and $B$ with $A_{new}$ and $B_{new}$. Otherwise:

    $B := \eta_{current} B_{new} + (1 - \eta_{current}) B$

    $A := \eta_{current} A_{new} + (1 - \eta_{current}) A$
  4. Compute the filter frequencies:

     $\hat{W} = \frac{A}{B}$

Initialization. Input: image $I$, a rectangle around the object position $R_{init}$

  1. $R := R_{init}$
  2. Prepare the desired response $G$. Usually this is an all-zero matrix with a small Gaussian at the center.
  3. Translate $G$ into the frequency domain: $\hat{G}$
  4. Training: $I$, $1.0$

Tracking. Input: image $I$

  1. Cut the rectangle $F$ out of $I$ at the previous known position of the object $R$
  2. Apply a mask to it to smooth the edges
  3. Translate $F$ into the frequency domain: $\hat{F}$
  4. $\hat{G}_{response} = \hat{W} \odot \hat{F}$
  5. Translate $\hat{G}_{response}$ into the spatial domain: $G_{response}$
  6. Find the maximum in $G_{response}$: $g_{max}, (x, y)$
  7. Compute the strength of the response (the peak-to-sidelobe ratio), $PSR := \frac{g_{max} - \mu_{sl}}{\sigma_{sl}}$, where $\mu_{sl}$ and $\sigma_{sl}$ are the mean and standard deviation of the response outside a small neighborhood of the peak
  8. If $PSR < PSR_{thr}$, exit with failure
  9. Update the position $R$. Adjust $R$ if it goes beyond the image or if it was decided that the object has grown/shrunk.
  10. Training: $I$, $\eta$
  11. Return $R$
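A sketch of the tracking step in these terms (my illustration: `w_hat` is the current $\hat{W}$, `preprocess` is the hypothetical helper from the earlier sketch, and the patch is assumed to have the size the filter was trained on):

```python
import numpy as np

def track_step(frame, rect, w_hat, psr_threshold=8.0):
    """One tracking iteration: correlate, find the peak, check the PSR."""
    x, y, w, h = rect
    patch = preprocess(frame[y:y + h, x:x + w])   # hypothetical helper from above
    f_hat = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(w_hat * f_hat))

    py, px = np.unravel_index(np.argmax(response), response.shape)
    peak = response[py, px]

    # Peak-to-sidelobe ratio: mask out a small area around the peak and
    # compare the peak against the mean/std of everything that is left.
    sidelobe = response.copy()
    sidelobe[max(0, py - 5):py + 6, max(0, px - 5):px + 6] = np.nan
    psr = (peak - np.nanmean(sidelobe)) / (np.nanstd(sidelobe) + 1e-5)
    if psr < psr_threshold:
        return None                               # tracking failure

    # The desired response was centered on the patch, so the displacement of
    # the peak from the patch center is the displacement of the object.
    dx, dy = px - w // 2, py - h // 2
    return (x + dx, y + dy, w, h)
```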

Implementation details may vary. For example,


What it looks like


For starters, a few examples of two-dimensional Fourier transform.

Some simple examples
Let me remind you that the result of the transform is complex-valued. The pictures below are shown in triples: "image — absolute values of the frequency domain on a linear scale — absolute values of the frequency domain on a logarithmic scale".
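Spectra like the ones below can be reproduced with a few lines of numpy and matplotlib (a sketch; "input.png" is a placeholder file name):

```python
import numpy as np
import matplotlib.pyplot as plt
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
spectrum = np.fft.fftshift(np.fft.fft2(img))   # zero frequency moved to the center

panels = [img, np.abs(spectrum), np.log1p(np.abs(spectrum))]
titles = ["image", "|F|", "log |F|"]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, panel, title in zip(axes, panels, titles):
    ax.imshow(panel, cmap="gray")
    ax.set_title(title)
    ax.axis("off")
plt.show()
```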

Vertical lines:





The picture changes from left to right in the same way for any vertical position. Moreover, the change is periodic, with a clear period and a clear pattern. That is why in the frequency pictures you see frequencies only along the axis $x = 0$.

Grid:





Note that along the axes $x = 0$ and $y = 0$ there are the expected series of frequencies, plus strange spurious frequencies. They appear because, firstly, the picture is finite, whereas the Fourier transform decomposes into a nice sum only for an infinite periodic signal. Secondly, it can be seen that the image does not form an exact period at its edges.

Slanted lines:





Both the frequencies corresponding to the main direction and the parasitic frequencies are again visible.

Inclined lines plus distortion:





Several characteristic directions are visible in the frequency image, but it is already hard to intuitively reconstruct the original picture from them.

For real-world images, it is even harder to picture the image in your head from its frequencies:





(the frequencies in the center are masked so that they do not "outshine" the rest of the spectrum)

Now let's move on to a real example of the tracker at work:

Pack of pictures
Image with marked object:



The cut-out and pre-processed object, its spectrum on linear and logarithmic scales ($F$, $|\hat{F}|$, $\log|\hat{F}|$):





Desired response ($G$):



Filter frequencies on linear and logarithmic scales ($|\hat{W}|$, $\log|\hat{W}|$):




The explicit filter weights (obtained without applying transformations to $F$):



Note that they do not participate anywhere in the algorithm — they can be computed purely out of curiosity. Also note that the filter looks like who knows what. One might expect the filter to look something like the original image, but this is far from always true. A filter similar to the image itself would hardly give the desired Gaussian response.

The response from the next frame:



Although it is not as clean as the desired response, the maximum is easy to find.

The same example with a narrower desired response:

Pack of pictures
Narrower:


W:


Even narrower:


W:


With a very narrow maximum, an eye is clearly visible on the filter instead of a black blob.

$W$ for the previous three cases of $G$, when 16 transformations of the input image are used for training:

Another pack of pictures
Wide maximum:



Medium maximum:



Narrow maximum:



The more transformations, the less the filter clings to random elements. It is especially clear that the random black and white specks have disappeared from the middle of $W$. On the other hand, for narrow Gaussians, training on several pictures can be a disadvantage: look at the "ringing" that has formed in the filter around the eye.

If you want to see how this looks live, you can download from here my test repository with an implementation of MOSSE that outputs debug images. You can find more MOSSE variants on GitHub. In addition, it is available in OpenCV.

Conclusion


Thank you for your attention, Habr readers. MOSSE and ASEF tracking are not the most complex algorithms in the world. That makes it all the easier not only to apply them effectively, but also to understand what guided their creators. I hope my explanation has helped you get into the researchers' heads and follow their train of thought. This can be useful: machine learning is not a static field of knowledge, and there is room in it for creativity and for research of your own. Try digging into some well-established algorithm: saw off unnecessary limbs to speed it up, or add a couple so that it works better in your particular case. You'll like it!

This article was written with the support of the company DSSL.

Source: https://habr.com/ru/post/421285/

