
Another article about detecting workers without helmets using neural networks

Hi, Habr! My name is Vladimir, and I am a fourth-year student at KubSTU (unfortunately).


Some time ago, I came across an article about developing a CV system for detecting workers without helmets, and I decided to share my own experience in this field, gained during an internship at an industrial company in the summer of 2017. The theory and practice of OpenCV and TensorFlow applied to the task of detecting people and helmets: right under the cut.



Cover image: a real-time frame from a surveillance camera


Given: access to the video surveillance cameras at the employer's industrial facilities, plus 2-4 student interns (the number of people involved in the project changed during development).


Objective: to develop a prototype system that detects employees without helmets in real time. The choice of technology was left to our discretion. We settled on Python, as a language that lets you implement a first working prototype with minimal effort, and, initially, on OpenCV, as the computer vision library we had heard the most about.


OpenCV is a library that mostly implements classical computer vision methods, such as cascade classifiers. The essence of this approach is to build a so-called ensemble of weak classifiers, i.e. classifiers that are only slightly better than random guessing (accuracy just above 0.5). A single such classifier cannot produce a useful result on its own, but a combination of thousands of them can be extremely accurate.



An example of how an ensemble of weak classifiers can perform a fairly accurate classification. Source
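The effect can be illustrated with a quick back-of-the-envelope calculation. This is a simplified sketch (real cascade classifiers use weighted boosting, not plain majority voting): it computes the probability that a majority vote of independent weak classifiers, each barely better than a coin flip, gives the right answer.

```python
import math

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a majority vote of n independent weak classifiers,
    each correct with probability p_correct, produces the right answer."""
    # Sum the binomial probabilities of more than half being correct.
    return sum(
        math.comb(n_classifiers, k)
        * p_correct**k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(n_classifiers // 2 + 1, n_classifiers + 1)
    )

# Each weak classifier is only slightly better than chance...
single = 0.55
# ...but a large ensemble becomes very accurate.
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, single), 4))
```

With 1001 classifiers at 55% individual accuracy, the vote is right well over 99% of the time; this is the intuition behind combining thousands of weak classifiers.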


Obviously, the task of finding a person without a helmet comes down to the tasks of finding a person as such and... a helmet! Or its absence. Access to the video cameras allowed us to quickly assemble a first dataset of cropped photos of helmets and of the people themselves, both with and without helmets (they were found quickly enough), and later to grow it to 2,000+ photos.



Annotated images for training


At this stage, we ran into the first unpleasant feature of OpenCV: the official documentation was scattered and in places simply referred the reader to a book by one of the library's lead developers. Many parameter values had to be chosen experimentally.


The first run of a classifier trained on helmets detected them with roughly 60% accuracy and only isolated false positives! We felt we were on the right track. Detecting people turned out to be much harder: unlike helmets, people appeared in the frame from many angles and, in general, demanded far stronger generalization from the classifier. While I worked on refining the helmet classifier, as an alternative we tested classical CV object detection based on the Canny edge detection algorithm and counting of moving objects.


In parallel, we developed a subsystem for processing the classifier's output. The logic is simple: a frame is taken from a surveillance camera and passed to the classifier, and we check whether the numbers of recognized people and helmets match. If a person without a helmet is found, a record with the counts of recognized objects is written to the database, and the frame itself is saved for manual review. This solution had one more advantage: the frames saved due to classifier errors let us retrain it on exactly the data it could not handle.
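The logic above can be sketched like this. The function and table names are illustrative and the detectors are stubbed out; only the count-comparison and logging flow corresponds to the description.

```python
import sqlite3
from datetime import datetime

def process_frame(frame, detect_people, detect_helmets, db, save_frame):
    """Run both detectors on a frame; log and save it when counts differ."""
    people = detect_people(frame)
    helmets = detect_helmets(frame)
    if len(people) > len(helmets):          # someone is (apparently) bareheaded
        db.execute(
            "INSERT INTO violations (ts, people, helmets) VALUES (?, ?, ?)",
            (datetime.now().isoformat(), len(people), len(helmets)),
        )
        save_frame(frame)                   # keep the frame for manual review
        return True
    return False

# Minimal demo with stubbed detectors and an in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE violations (ts TEXT, people INT, helmets INT)")
saved = []
flagged = process_frame(
    "frame-001",
    detect_people=lambda f: ["p1", "p2"],   # two people detected
    detect_helmets=lambda f: ["h1"],        # but only one helmet
    db=db,
    save_frame=saved.append,
)
print(flagged, saved)
```

Note that comparing raw counts cannot tell *which* person is bareheaded; it only flags the frame, which is why the saved frames still need manual review.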


And then a new problem arose: the overwhelming majority of frames were saved because of recognition errors, not because of personnel without helmets. Training on the fresh data improved recognition somewhat; helmets were now recognized in ~75% of cases (with only occasional false positives), but overlapping human figures in the frame were counted correctly in only a little more than half of the cases. I convinced the project manager to give me a week to develop a neural network detector.


One of the things that makes neural networks convenient to work with, at least compared to cascade classifiers, is the end-to-end approach. To train the cascade classifier, in addition to the annotated images of helmets and people, we needed background images and images for validating the classifier, and the training process is controlled by many non-trivial parameters, not to mention the parameters of the classifier itself! With algorithms for counting moving objects and the like, the process gets even more complex: filters are pre-applied to the images, the background is subtracted, and so on. End-to-end learning requires the developer to supply "just" an annotated dataset and the parameters of the training model.


The TensorFlow ML framework and the tensorflow/models repository, which had appeared shortly before, met my requirements: it was fairly well documented, a working prototype could be written quickly (the most popular architectures work practically out of the box), and the functionality was fully suitable for further development if the prototype succeeded. After adapting an existing tutorial to our dataset using a 101-layer ResNet (the principles of convolutional neural networks have been covered on Habr many times; I will only refer to articles [1], [2]) pretrained on the COCO dataset (which includes photos of people), I immediately got more than 90% accuracy! That was a convincing argument for starting to develop a CNN-based helmet detector.
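For reference, adapting the tutorial mostly means editing the pipeline config that the tensorflow/models Object Detection API consumes. The fragment below shows roughly what the relevant parts look like; the class count, paths, and resizer values here are illustrative, not our exact settings.

```
model {
  faster_rcnn {
    num_classes: 1   # "person"; helmets were later handled by a separate SSD
    image_resizer {
      keep_aspect_ratio_resizer { min_dimension: 600 max_dimension: 1024 }
    }
    feature_extractor {
      type: "faster_rcnn_resnet101"
    }
  }
}
train_config {
  batch_size: 1
  # Start from the COCO-pretrained checkpoint instead of training from scratch.
  fine_tune_checkpoint: "faster_rcnn_resnet101_coco/model.ckpt"
  from_detection_checkpoint: true
}
```

Fine-tuning from the COCO checkpoint is what makes the "more than 90% accuracy right away" result possible: the network already knows what people look like.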



A CNN trained on a third-party dataset easily recognizes people standing close together, but produces a detection where none was expected at all :)


During training, TensorFlow models can generate checkpoint files that let you build and test the network at different stages, which is useful if something goes wrong during further training. The exported model is a directed computational graph whose input vertices are the input data (for images, the color values of each pixel) and whose final vertices are the recognition results.



In addition to the model itself, a checkpoint may contain metadata about the training process, which can be visualized with TensorBoard.



The long-awaited plot of the training error going down


After testing a number of architectures, ResNet-50 was chosen as offering the best balance between speed and recognition quality. After weighing the pros and cons, we decided to keep this pretrained network for people as it was, since it already gave acceptable results, and to train a simpler Single Shot Detector (SSD) network [3] on helmets; it was less accurate at recognizing people but delivered a satisfactory 90%+ on helmets. This seemingly illogical decision came down to the fact that additionally using the SSD slightly increased inference time but significantly reduced the time spent training and testing the network with different parameters and updated data (from several days to 20-30 hours on a GTX 1060 6GB), which sped up development iterations.
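One way to combine the outputs of the two networks is to match each detected helmet to a person whose bounding box contains it and flag the unmatched people. This is an illustrative assumption about the matching step, not the exact production logic.

```python
def box_contains(person, helmet):
    """True if the helmet box center lies inside the top half of the
    person box. Boxes are (x1, y1, x2, y2) with y growing downwards."""
    px1, py1, px2, py2 = person
    hx = (helmet[0] + helmet[2]) / 2
    hy = (helmet[1] + helmet[3]) / 2
    return px1 <= hx <= px2 and py1 <= hy <= py1 + (py2 - py1) / 2

def people_without_helmets(people, helmets):
    """Return the person boxes that no helmet box could be matched to."""
    return [
        person
        for person in people
        if not any(box_contains(person, h) for h in helmets)
    ]

people = [(0, 0, 50, 150), (100, 0, 150, 150)]   # two detected people
helmets = [(10, 5, 40, 30)]                       # one helmet, on the first person
print(people_without_helmets(people, helmets))
```

Requiring the helmet to sit in the top half of the person box is a crude head heuristic; it keeps a helmet held in someone's hands from counting as worn.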


Thus, several conclusions can be drawn: first, modern neural network frameworks have a low entry threshold (although, undoubtedly, using them effectively requires deep knowledge of machine learning) and are far more convenient and capable for image recognition tasks; second, students are useful for rapid prototyping and technology testing ;)


I will be glad to answer questions and hear constructive criticism in the comments.



Source: https://habr.com/ru/post/354092/

