
How video analytics works


Recently I read an article in which the author uses a simple example to show how a motion recognition algorithm works. It reminded me of my own research into video stream analytics algorithms. Many people know about the great OpenCV project, an extensive cross-platform computer vision library containing many different algorithms. However, it is not so easy to understand. You can find many publications and examples of how and where to apply machine vision, but far fewer about how it actually works. And that is often exactly what is missing, especially when you are just starting to study the topic.

In this article I will talk about the architecture of video analytics.

The general scheme of video analysis is presented below.


The process is divided into several successive stages. At the output of each stage, the information about what is happening in the frame is supplemented with more and more detail. There may also be feedback between the stages so that the system can react more subtly to changes in the frame.
Let's consider the scheme in more detail.

What is a video stream


First, you need to decide what a video stream is. Although there are many video data formats, their essence boils down to one thing: a sequence of frames arriving at a certain rate per second. A frame is an image characterized by resolution and format (the number of bits per pixel and their interpretation: which bits are responsible for which color component). Inside the stream, frame compression can be used to reduce the amount of data transferred, but when displayed on the screen, frames are always expanded back to their original state. Analytics algorithms likewise always work with uncompressed frames.

Thus, a video stream is characterized by its frame rate and by the format and resolution of its frames.

It is important to note that the analytics always deals with only one frame at a time; frames are processed sequentially. In addition, when processing the next frame, it is important to know how much time has passed since the previous one. This value can be calculated from the frame rate, but a more practical approach is to attach a timestamp to each frame.
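
A minimal sketch of such a reading loop with OpenCV's Python bindings is shown below; the file name and the use of time.monotonic() as the timestamp source are my illustrative assumptions, not part of the original article.

```python
import time

import cv2

# "input.mp4" is a placeholder for any video source OpenCV can open.
cap = cv2.VideoCapture("input.mp4")
prev_ts = None
while True:
    ok, frame = cap.read()        # frames arrive already decompressed
    if not ok:
        break
    ts = time.monotonic()         # attach a timestamp to the frame
    if prev_ts is not None:
        dt = ts - prev_ts         # time elapsed since the previous frame
        # ... hand (frame, ts, dt) to the analytics pipeline ...
    prev_ts = ts
cap.release()
```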

Changing the size and format of the frame


The first stage is frame preparation. As a rule, the frame is significantly reduced in size. The reason is that every pixel of the image takes part in further processing, so the smaller the frame, the faster everything works. Naturally, some information in the frame is lost when it is downscaled. But this is not only non-critical, it is even useful: the objects that the analytics works with are generally large enough not to disappear from the frame when it is downscaled, while all sorts of "noise" related to camera quality, lighting and natural factors is reduced.

The resolution is changed by combining several pixels of the original image into one. How much of the information is kept depends on the way they are combined.

For example, suppose a 3x3 square of pixels in the original image must be converted into one pixel of the result. You can take the (normalized) sum of all 9 pixels, the sum of only the 4 corner pixels, or just the central pixel alone.

4 corner pixels:



The sum of all pixels:



Center pixel:



The results differ somewhat in speed and quality. And sometimes a method that loses more information gives a smoother picture than the one that uses all the pixels.
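
A minimal sketch of these three downscaling variants for a single-channel image, using NumPy; the function names are mine, and the sums are normalized into averages so the result stays in the 0 to 255 range:

```python
import numpy as np

def shrink_all(img: np.ndarray) -> np.ndarray:
    """Replace each 3x3 block with the average of all 9 pixels."""
    h, w = img.shape[0] // 3 * 3, img.shape[1] // 3 * 3
    blocks = img[:h, :w].astype(np.float32).reshape(h // 3, 3, w // 3, 3)
    return blocks.mean(axis=(1, 3)).astype(np.uint8)

def shrink_corners(img: np.ndarray) -> np.ndarray:
    """Replace each 3x3 block with the average of its 4 corner pixels."""
    h, w = img.shape[0] // 3 * 3, img.shape[1] // 3 * 3
    c = img[:h, :w].astype(np.float32)
    corners = c[0::3, 0::3] + c[0::3, 2::3] + c[2::3, 0::3] + c[2::3, 2::3]
    return (corners / 4).astype(np.uint8)

def shrink_center(img: np.ndarray) -> np.ndarray:
    """Replace each 3x3 block with its central pixel only."""
    h, w = img.shape[0] // 3 * 3, img.shape[1] // 3 * 3
    return img[1:h:3, 1:w:3].copy()
```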

Another action at this stage is changing the image format. Color images, as a rule, are not used, since color also increases the frame processing time. For example, RGB24 stores 3 bytes per pixel, while Y8 stores only one and is not much inferior to it in information content.

Y8 = (R + G + B) / 3.
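
A sketch of this conversion in NumPy; the function name is mine, and note that OpenCV's own cv2.cvtColor uses a weighted luma sum rather than the plain average:

```python
import numpy as np

def to_y8(rgb: np.ndarray) -> np.ndarray:
    """Convert an RGB24 frame (H x W x 3) to Y8 as (R + G + B) / 3."""
    return (rgb.astype(np.uint16).sum(axis=2) // 3).astype(np.uint8)

# The OpenCV equivalent, with ITU-R 601 channel weights:
# gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
```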

The result is the same image, but in grayscale:





Background model


This is the most important stage of processing. Its goal is to form the background of the scene and obtain the difference between the background and the new frame. The quality of the whole scheme depends on the algorithms of this stage: if an object is accepted as part of the background or, on the contrary, a piece of the background is singled out as an object, it will be difficult to correct this later.

In the simplest case, you can take a frame of an empty scene as the background:



Select a frame with the object:



If we convert these frames to Y8 and subtract the background from the frame with the object, we get the following:



For convenience, you can binarize the result: replace the value of every pixel greater than 0 with 255. As a result, we move from grayscale to a black-and-white image:



It seems all right: the object is separated from the background and has clear boundaries. But, first, the shadow of the object also stood out. And second, artifacts from image noise are visible at the top of the frame.
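
A minimal sketch of this naive subtraction and binarization with OpenCV; the file names are placeholders, and in practice the threshold would be raised above 0 to suppress at least some of the noise:

```python
import cv2

# Both images loaded directly as Y8 (grayscale); the file names are placeholders.
background = cv2.imread("empty_scene.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("scene_with_object.png", cv2.IMREAD_GRAYSCALE)

# Absolute per-pixel difference between the current frame and the background.
diff = cv2.absdiff(frame, background)

# Binarization: every pixel greater than 0 becomes 255.
_, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY)
```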

In practice this approach is no good. Any shadow, glare, or change in camera brightness will spoil the whole result. This is the whole complexity of the task: objects must be separated from the background while ignoring natural factors and image noise: glare, shadows from buildings and clouds, swaying branches, frame compression artifacts, and so on. Moreover, if you are looking for abandoned objects, then, on the contrary, they must not become part of the background.

There are many algorithms that solve these problems with varying efficiency, from simple background averaging to probabilistic models and machine learning. Many of them are available in OpenCV, and several approaches can be combined for an even better result. But the more complex the algorithm, the longer it takes to process each frame. With live video at 12.5 frames per second, the system has only 80 ms per frame. Therefore, the choice of the optimal solution depends on the task and the resources allocated for it.
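
For example, one of the ready-made background models in OpenCV is the Gaussian-mixture subtractor; a sketch of how it might be wired into the frame loop (the parameter values and the source name are illustrative):

```python
import cv2

# Gaussian-mixture background model shipped with OpenCV.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

cap = cv2.VideoCapture("input.mp4")   # placeholder source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Updates the background model and returns the foreground mask for this frame.
    fg_mask = subtractor.apply(frame)
cap.release()
```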

Zone formation


The difference frame has been formed. On it we see white objects on a black background:





Now we need to separate the objects from each other and form zones that combine the pixels belonging to each object:



This can be done using, for example, connected component labeling.

Here all the defects of the background model are immediately visible: the man at the top is split into several parts, and there are many artifacts and shadows of people. However, some of these shortcomings can be corrected at this stage. Knowing the area of an object, its height and width, and the density of its pixels, you can filter out spurious objects.

In the frame above, blue boxes mark objects that take part in further processing, and green ones mark filtered-out zones. There are also errors here: as you can see, the man at the top, split into several parts, was also filtered out because of his size. This problem can be solved, for example, by taking perspective into account.
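
A minimal sketch of labeling the binary difference frame and filtering the resulting zones with OpenCV; the mask file name and the thresholds are purely illustrative and would be tuned per scene:

```python
import cv2

# The binary difference frame: 0 for background, 255 for foreground (placeholder file).
mask = cv2.imread("diff_mask.png", cv2.IMREAD_GRAYSCALE)

num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)

zones = []
for i in range(1, num):                     # label 0 is the background itself
    x, y, w, h, area = stats[i]
    density = area / float(w * h)           # how densely the bounding box is filled
    # Illustrative thresholds: drop tiny or nearly empty zones.
    if area > 200 and w > 5 and h > 5 and density > 0.3:
        zones.append((x, y, w, h, tuple(centroids[i])))
```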

Other errors are also possible; for example, several objects can merge into one. So this stage leaves plenty of room for experiments.

Zone tracking


Finally, at the last stage, zones are turned into objects. This stage uses the results of processing the last few frames. The main task is to determine that a zone in two adjacent frames is the same object. The features used can be very diverse: size, pixel density, color characteristics, predicted direction of motion, etc. This is where the frame timestamps matter: they allow you to calculate the speed of an object and the distance it has traveled.
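
A toy sketch of matching zones to tracked objects by nearest centroid; the data layout, the distance threshold, and the greedy strategy are my illustrative assumptions, and real trackers use richer features than distance alone:

```python
import math

def match_zones(objects, zones, dt, max_dist=50.0):
    """Greedily match tracked objects to zone centroids from the new frame.

    objects: list of dicts with "centroid" and "speed"; zones: list of (x, y)
    centroids; dt: seconds since the previous frame (from the timestamps).
    """
    for obj in objects:
        best, best_d = None, max_dist
        for z in zones:
            d = math.dist(obj["centroid"], z)
            if d < best_d:
                best, best_d = z, d
        if best is not None:
            obj["speed"] = best_d / dt if dt > 0 else 0.0   # pixels per second
            obj["centroid"] = best
            zones.remove(best)                              # each zone is used once
    # Unmatched zones start new objects.
    return objects + [{"centroid": z, "speed": 0.0} for z in zones]
```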



At this stage you can correct occasional errors of the previous one. For example, merged objects can be separated by taking their movement history into account. On the other hand, new problems appear. The most important of them is the crossing of two objects; a special case is when a larger object shields a smaller one for a long time.

Taking objects into account in the background model


The architecture may include feedback loops that improve the work of earlier stages. The first thing that comes to mind is to use information about the objects in the scene when forming the background.

For example, an abandoned object can be recognized and kept out of the background. Or you can fight “ghosts”: if there was a person in the scene when the background was created, then when he leaves, a “ghost” object appears in his place. Realizing that the trajectory of a moving object begins at that spot, you can quickly remove the “ghost” from the background.

Result


The result of all the stages is a list of objects in the scene. Each object is characterized by its size, density, speed, trajectory, direction of movement, and other parameters.

This list is used for scene analytics. You can detect an object crossing a line or moving in the wrong direction, count the number of objects in a given zone, and detect loitering, falls, and many other events.
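
As one example, crossing a line can be detected from two consecutive positions of an object's centroid; a minimal sketch, with my simplification that the line is treated as infinite rather than as a segment:

```python
def side_of_line(p, a, b):
    """Sign of the cross product: which side of the line a-b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def crossed_line(prev_pos, cur_pos, a, b):
    """The object crossed the line a-b if its centroid changed sides between frames."""
    return side_of_line(prev_pos, a, b) * side_of_line(cur_pos, a, b) < 0
```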

Conclusion


Modern video analytics systems have achieved very good results, but they still rely on a complex multistage process. Moreover, knowing the theory does not always lead to a good practical result.

In my opinion, creating a good machine vision system is a very complex undertaking. Tuning the algorithms is laborious and time-consuming work in which the subtleties of the software implementation also get in the way, and it requires a lot of experimentation. And although OpenCV is invaluable here, it does not guarantee the result: the tools it contains still have to be used properly.
I hope this article helps you understand how it all works and which OpenCV tools you can use at which stages.

Source: https://habr.com/ru/post/271207/

