
Previously, we built all our intelligent modules on traditional video analysis algorithms (hereinafter, the "classics"). We knew about neural networks, of course, and tried them back in 2008, in particular for comparing images of people by clustering. But the results were not outstanding (partly because neural networks were still immature at the time), so we remained adherents of the "classics" of machine vision for many years. All the neural networks were in our own heads :)
With the advent of convolutional neural networks came the hope that they would perform well on video analysis tasks: first, by giving higher accuracy under the same conditions in which we used the previous algorithms, and second, by expanding the range of those working conditions.
On top of that, this development method seemed much more reliable: it quickly gives you a "go or no-go" answer. When you start working on a "classical" algorithm, you cannot immediately tell whether the path is right, or whether the problem can be solved that way at all, and it takes some (often significant) time to reach a result that can be assessed. For example, we spent about a month experimenting with a stereo attachment for a camera to implement a new visitor counter, but got nothing sensible out of it (see the article "The Birth of a Supernova: How New Functions Appear, Using 3D Visitor Counting as an Example"). With neural networks, everything is clearer: from a small sample of a few images you can already evaluate whether the approach will work. If not, change the sample and check again. Finding the right kind of data and approach is much faster; after that, you only need to improve the sample to get better results.
When the task arose of creating a detector of missing helmets (for details on how it appeared, see the article "Custdev in the development of video surveillance products"), we did not immediately see how to solve it with traditional methods. So we decided to check whether it could be done with neural networks, and whether modern neural networks are really as good as they say.
So, the detector's job comes down to finding a human head and determining whether there is a helmet on it. In the course of development, we tried to solve the problem in two ways.
The neural network does not know in advance that it must distinguish the two kinds of people precisely by the presence of a helmet. All it has is two sets of images (people with and without helmets), and it tries to find features by which the two sets can be told apart. It knows which picture belongs to which set, but not why, and it tunes its parameters so as to give the correct answer as often as possible.
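This kind of supervised two-class learning can be illustrated with a toy model. The sketch below is not the network from the article; it is a minimal logistic classifier trained on made-up two-feature data, where the only supervision is the set label, and the model itself discovers which feature separates the sets:

```python
import math
import random

random.seed(0)

def train_two_class(samples, labels, lr=0.5, epochs=200):
    """Fit a tiny logistic model by stochastic gradient descent.
    The model is told only which set each sample belongs to; it
    learns on its own which features distinguish the sets."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy data: feature 0 is irrelevant noise; feature 1 actually
# separates the two sets (think "helmet present" vs "absent").
set_one = [[random.random(), 1.0] for _ in range(20)]  # label 1
set_two = [[random.random(), 0.0] for _ in range(20)]  # label 0
w, b = train_two_class(set_one + set_two, [1] * 20 + [0] * 20)
print(predict(w, b, [0.3, 1.0]), predict(w, b, [0.3, 0.0]))
```

The model ends up relying on feature 1 and ignoring the noise feature, which is exactly the "find the distinguishing features yourself" behavior described above.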
Method 1. First, we fed the neural network images in which people appeared in the frame at full height. We immediately feared that we would not get sufficient accuracy, but if it worked, development would be as fast and simple as possible.

The fears were confirmed: the neural network could not really learn from such pictures; the accuracy of finding helmets on a new test set was about 70%. That was completely unacceptable for the module, but at the same time it proved that the problem can be solved with neural networks!
In general, the accuracy of the helmet detector is made up of two components: sensitivity (responsible for "catching" people without helmets) and the false-positive rate (responsible for mistakenly "catching" people who are wearing helmets). At a real facility where the detector will be used, most people wear helmets, so even a small false-positive rate turns into a large amount of incorrect data in the output.
As the initial accuracy benchmark we took: at least 60% sensitivity and no more than 3% false positives. These were, in fact, serious requirements.
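These two metrics are straightforward to compute from evaluation results. A small sketch (the counts are made up for illustration; "positive" here means "no helmet", the event the detector must catch):

```python
def detector_metrics(outcomes):
    """Compute sensitivity and false-positive rate for a no-helmet
    detector. `outcomes` is a list of (ground_truth, prediction)
    pairs, where True means "no helmet"."""
    tp = sum(1 for truth, pred in outcomes if truth and pred)
    fn = sum(1 for truth, pred in outcomes if truth and not pred)
    fp = sum(1 for truth, pred in outcomes if not truth and pred)
    tn = sum(1 for truth, pred in outcomes if not truth and not pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, false_positive_rate

# Hypothetical evaluation run: 10 people without helmets, 100 with.
outcomes = ([(True, True)] * 7 + [(True, False)] * 3       # 7 of 10 caught
            + [(False, True)] * 2 + [(False, False)] * 98)  # 2 false alarms
sens, fpr = detector_metrics(outcomes)
print(sens, fpr)  # 0.7 and 0.02
```

Note how the class imbalance mentioned above works against the detector: even 2 false alarms out of 100 helmeted people already produce as many wrong alerts as a fifth of all true violations.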
Training on images of people at full height, we did not achieve that accuracy. Perhaps the reason is that such pictures contain, besides the head with or without a helmet, many other elements that "distract" the neural network, leading it to take as essential features things that in fact are not.
Method 2. We decided that we could help the neural network by showing it not the whole person at full height, but only the head (with or without a helmet). To extract the head image, we applied a suitable classifier that we had written long ago for one of our other modules, and trained new convolutional neural networks on the results of its work.

By the way, practice showed that the number of layers and neurons in the neural network, and its parameters in general, are not that important. What matters most is the quality of the training sample. With a large and diverse sample there is a good chance of success; with a small one, the neural network will simply memorize the correct answers, acquire no ability to generalize, and fail on pictures it has not seen.
Our sample was medium-sized (several thousand pictures of heads with and without helmets) and included helmets of different colors and slightly different shapes. To improve the results and avoid overfitting, we had to work seriously on augmentation (artificially expanding the training set) and regularization (constraining the parameters of the neural network). As a result, accuracy on the test sample reached 85-88%. This is a good indicator, but to reduce errors further we added post-processing: the decision to raise a "no helmet" alarm for a person is made not on a single frame, but from the results of analyzing that person across several frames in a row.
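The multi-frame post-processing can be sketched as a simple voter. This is our illustrative reconstruction, not the exact scheme from the article; the window and threshold values are arbitrary:

```python
from collections import deque

class MultiFrameVoter:
    """Raise a "no helmet" alarm for a tracked person only when a
    majority of the last `window` per-frame verdicts agree, which
    suppresses one-off misclassifications on single frames."""
    def __init__(self, window=5, threshold=3):
        self.window = window
        self.threshold = threshold
        self.history = {}  # person_id -> recent per-frame verdicts

    def update(self, person_id, no_helmet_this_frame):
        verdicts = self.history.setdefault(
            person_id, deque(maxlen=self.window))
        verdicts.append(bool(no_helmet_this_frame))
        return sum(verdicts) >= self.threshold

voter = MultiFrameVoter(window=5, threshold=3)
frames = [False, True, False, True, True]  # noisy per-frame verdicts
alarms = [voter.update("worker_1", v) for v in frames]
print(alarms)  # alarm fires only on the last frame
```

A single noisy frame can no longer trigger an alarm; at least three of the last five verdicts must agree, which trades a little latency for a much lower false-positive rate.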
During testing we were also not very satisfied with the head detector itself, so we refined the heads found in the image ... also using a neural network. In fact, in both cases it is not one network but several, combined in a cascade for greater accuracy (but here we will call them simply "neural networks").
For our neural network we took a classical convolutional architecture that has proven itself in classification problems. We also tried different architectures, including the most modern and sophisticated ones, with a hundred layers and hundreds of millions of parameters. In general, making the network more complex did not improve the result. Our experience confirmed Vapnik-Chervonenkis theory: the complexity of the classifier must match the complexity of the problem. If the classifier is too complex, it will simply memorize all the answers and will not work; if it is too simple, it will not be able to learn.
We ended up with a fairly simple neural network for the relatively simple task of detecting helmets.
The second method proved the most effective. In the end, we solved the problem:
1) In 2.5 months we developed a working module that went out to the first sites for trial use. By our estimates, development with classical methods would have taken us at least six months.
2) To detect the absence of helmets we use two sets of neural networks trained on different data: the first finds people's heads in the frame, and the second determines whether a given head is wearing a helmet.
3) We exceeded the stated accuracy threshold: more than 60% sensitivity at 1.5% false positives.
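The two-stage scheme from point 2 can be sketched as follows. `find_heads` and `classify_helmet` are stand-ins for the two trained networks; the names and the toy frame format are ours, not from the article:

```python
def find_heads(frame):
    """Stub for the first network: returns head descriptions found
    in the frame. A real implementation would run a detector cascade
    over the image pixels."""
    return frame["heads"]  # toy frame carries precomputed heads

def classify_helmet(head):
    """Stub for the second network: True if this head wears a helmet."""
    return head["helmet"]

def detect_violations(frame):
    """Stage 1 finds heads, stage 2 checks each for a helmet;
    people without helmets are reported as violations."""
    return [head["id"] for head in find_heads(frame)
            if not classify_helmet(head)]

# Toy frame with two detected heads, one without a helmet.
frame = {"heads": [{"id": "a", "helmet": True},
                   {"id": "b", "helmet": False}]}
print(detect_violations(frame))  # ["b"]
```

Splitting detection and classification this way also lets each network be trained on its own dataset, which matches the two differently-trained sets of networks described above.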
Conclusion: neural networks can and even should be used for video analysis tasks, in particular for detecting the absence of a helmet on a person.
This first successful experience raises a legitimate question: will we now develop all video analysis modules using neural networks? So far it is hard to answer definitively.
There are modules that we currently see no point in moving to neural networks, because everything there is already solved well by classical methods. For example, visitor counting (especially in the new 3D implementation): on classical machine vision methods it works very well, reaching an accuracy of 98%, and it is not yet known whether neural networks would do as well. But for detecting smoke and fire, neural networks are exactly the right fit.
If we were to derive a criterion for the applicability of neural networks in video analysis, it could be formulated roughly like this: if it is clear in advance which features to use, you can get by with the "classics"; otherwise, it is worth trying neural networks.
In 3D counting there is a good feature: the distance to a point. In the abandoned-object detector, for example, a feature is also easy to find: a distinctive point on the boundary of the object, which you can track and compare, or a contour. But with fire it is unclear which features to take. Color? There is always something the same color as fire. Shape? Fire takes many forms. Flicker over time? It is unclear exactly what it should look like. Inventing features in advance here is a thankless job, so it is better to let the neural network do it.
But back to our task. It was solved, and the answer and the corresponding conclusions were obtained.
