This article is a brief overview of the capabilities of dnn, an OpenCV module designed for working with neural networks. If you are wondering what it is, what it can do and how fast it works, read on.
Perhaps many would agree that OpenCV is the best-known computer vision library. Over its long existence it has gained an extensive audience of users and has become the de facto standard in the field of computer vision. Many out-of-the-box algorithms, open source code, great support, a large community of users and developers, the ability to use the library from C, C++, Python (as well as Matlab, C#, Java) under various operating systems: this is far from a complete list of what keeps OpenCV relevant. But OpenCV does not stand still; functionality is constantly being added. And today I want to talk about its new capabilities in the field of deep learning.
Loading models created in any of three popular frameworks (Caffe, TensorFlow, Torch) and obtaining their results (predictions), fast operation on the CPU, support for the main layers of neural networks and, as always, cross-platform, open source code and support: this is what I am going to cover in this article.
First of all, I would like to introduce myself. My name is Alexander Rybnikov. I am an Intel engineer and I implement deep learning functionality in the OpenCV library.
A few words about how OpenCV is organized. The library is a set of modules, each of which is associated with a specific area of computer vision. There is a standard set of modules, a "must have" for any computer vision task; implementing well-known algorithms, these modules are well developed and tested. All of them live in the main OpenCV repository. There is also a repository with additional modules that implement experimental or new functionality. The requirements for experimental modules are, for obvious reasons, softer, and, as a rule, when one of these modules becomes sufficiently developed, mature and in demand, it can be moved to the main repository.
This article is about one of the modules that recently took an honorable place in the main repository: the dnn module (hereafter simply dnn).
Why yet another (the (N+1)-th) deep learning framework?
Why does OpenCV need deep learning at all? In recent years, in many areas deep learning has shown results far exceeding those of classical algorithms. This also applies to computer vision, where a host of problems is now solved with neural networks. In light of this, it seems logical to give OpenCV users the ability to work with neural networks.
Why write something of our own instead of using existing implementations? There are several reasons.
First, it makes the solution lightweight. Keeping only the ability to perform a forward pass through the network lets us simplify the code and speed up installation and build times.
Second, having our own implementation keeps external dependencies to a minimum. This simplifies the distribution of applications that use dnn. And if a project already uses the OpenCV library, adding deep network support to it is not difficult.
Also, a solution of our own can be made universal, not tied to any particular framework with its limitations and drawbacks. And with our own implementation, all avenues for optimizing and speeding up the code are open.
An in-house module for running deep networks also greatly simplifies the creation of hybrid algorithms that combine the speed of classical computer vision with the remarkable generalization ability of deep neural networks.
It is worth noting that the module is not, strictly speaking, a full-fledged deep learning framework. At the moment it only provides the ability to obtain the results of a network.
Main features
The main capability of dnn is, of course, loading and running neural networks (inference). The model can be created in any of three deep learning frameworks (Caffe, TensorFlow or Torch); the way it is loaded and used is the same regardless of where it was created.
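For illustration, here is a rough Python sketch of what loading looks like (the file names are placeholders; only the reader function depends on the source framework, everything after loading is identical):

import cv2 as cv

# Placeholder model files; only the loading call differs per source framework
net_caffe = cv.dnn.readNetFromCaffe('model.prototxt', 'model.caffemodel')
net_tf = cv.dnn.readNetFromTensorflow('graph.pb')
net_torch = cv.dnn.readNetFromTorch('model.t7')
# From here on the usage is the same regardless of the origin:
# net.setInput(blob); out = net.forward()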
By supporting three popular frameworks at once, we can easily combine the results of models loaded from them without having to recreate everything in one single framework.
When loading, models are converted into an internal representation that is close to the one used in Caffe. This happened for historical reasons: Caffe support was added first. However, there is no one-to-one correspondence between the representations.
All basic layers are supported: from basic (Convolution and Fully connected) to more specialized ones - more than 30 in total.
List of supported layers:
AbsVal
AveragePooling
Batch normalization
Concatenation
Convolution (with dilation)
Crop
DetectionOutput
Dropout
Eltwise
Flatten
FullConvolution
Fullyconnected
LRN
Lstm
MaxPooling
MaxUnpooling
MVN
NormalizeBBox
Padding
Permute
Power
PReLU
PriorBox
Relu
Rnn
Scale
Shift
Sigmoid
Slice
Softmax
Split
Tanh
If you have not found the layer you need in this list, do not despair. You can create a request to add support for the layer you are interested in (and our team will try to help you in the near future), or implement everything yourself and submit a pull request.
In addition to supporting individual layers, support for specific neural network architectures is also important. The module contains examples for classification (AlexNet, GoogLeNet, ResNet, SqueezeNet), segmentation (FCN, ENet) and object detection (SSD); many of these models have been tested on their original datasets, but more on that later.
Building OpenCV
If you are an experienced OpenCV user, feel free to skip this section. If not, I will try to explain as briefly as possible how to build working examples from the source code on Linux or Windows.
Brief build instructions
First you need to install git (or Git Bash for Windows), cmake (http://cmake.org) and a C++ compiler (Visual Studio for Windows, Xcode on Mac, clang or gcc for Linux). If you are going to use OpenCV from Python, you also need to install Python itself (a recent 2.7.x or 3.x will do) and the corresponding version of numpy.
Let's start by cloning the repository:
mkdir git && cd git
git clone https://github.com/opencv/opencv.git
On Windows, the repository can also be cloned using, for example, TortoiseGit or SmartGit. Next, let's generate the build files:
cd ..
mkdir build && cd build
cmake ../git/opencv -DBUILD_EXAMPLES=ON
(on Windows, here and below, replace cmake with the full path to the cmake executable, for example "C:\Program Files\CMake\bin\cmake.exe", or use the cmake GUI)
Now run the build itself:
make -j5                                    (Linux)
cmake --build . --config Release -- /m:5    (Windows)
After that dnn is ready to use.
The above instructions are quite brief, so I will also provide links to step-by-step instructions for installing OpenCV on Windows and Linux.
Usage examples
Following a good tradition, each OpenCV module includes usage examples. dnn is no exception; C++ and Python examples are available in the samples subdirectory of the source code repository. The examples are commented, and everything in them is quite straightforward.
Here is a brief example that performs image classification with the GoogLeNet model. In Python, a sketch of this example looks roughly as follows (the exact sample shipped with OpenCV differs in details, and the model and image file names below are placeholders):
import numpy as np
import cv2 as cv
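# What follows is a sketch rather than the exact sample from the repository;
# the model, class-list and image file names are placeholders.
net = cv.dnn.readNetFromCaffe('bvlc_googlenet.prototxt', 'bvlc_googlenet.caffemodel')
classes = [line.strip() for line in open('synset_words.txt')]

# Read the image and turn it into a 4D blob: resize so that the smaller side
# becomes 224, crop the central 224x224 part and subtract the per-channel mean
img = cv.imread('space_shuttle.jpg')
blob = cv.dnn.blobFromImage(img, 1.0, (224, 224), (104, 117, 123), swapRB=False, crop=True)

# Forward pass: the result is a 1x1000 vector of class probabilities
net.setInput(blob)
out = net.forward().flatten()

# Print the names of the 5 classes with the highest probability
for i in np.argsort(out)[::-1][:5]:
    print('%s: %.4f' % (classes[i], out[i]))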
This code loads a picture, performs a little preprocessing and obtains the network output for the image. The preprocessing consists of scaling the image so that its smaller side equals 224, cropping the central part and subtracting the mean value from the elements of each channel. These operations are necessary because the model was trained on images of this size (224 x 224) with exactly this preprocessing.
The output tensor is interpreted as a vector of probabilities of the image belonging to each class, and the names of the 5 classes with the highest probabilities are printed to the console.
Looks easy, right? If you write the same thing in C++, the code gets a bit longer. However, the most important things, the function names and the logic of working with the module, remain the same.
Accuracy
How do we know that one trained model is better than another? We need to compare quality metrics for both. Very often the fight at the top of the model leaderboards is over fractions of a percent of quality. Since dnn reads and converts models from various frameworks into its internal representation, the question arises of whether quality is preserved after conversion: has the model been "spoiled" by loading? Without answering this question, that is, without testing, it is hard to speak of using dnn seriously.
I tested models from the available examples for different frameworks and different tasks: AlexNet (Caffe), GoogLeNet (Caffe), GoogLeNet (TensorFlow), ResNet-50 (Caffe) and SqueezeNet v1.1 (Caffe) for object classification; FCN (Caffe) and ENet (Torch) for semantic segmentation. The results are shown in Tables 1 and 2.
| Model (source framework) | Published acc@top-5 | Measured acc@top-5 in the source framework | Measured acc@top-5 in dnn | Average per-element difference between the output tensors of the framework and dnn | Maximum difference between the output tensors of the framework and dnn |
|---|---|---|---|---|---|
| AlexNet (Caffe) | 80.2% | 79.1% | 79.1% | 6.5E-10 | 3.01E-06 |
| GoogLeNet (Caffe) | 88.9% | 88.5% | 88.5% | 1.18E-09 | 1.33E-05 |
| GoogLeNet (TensorFlow) | - | 89.4% | 89.4% | 1.84E-09 | 1.47E-05 |
| ResNet-50 (Caffe) | 92.2% | 91.8% | 91.8% | 8.73E-10 | 4.29E-06 |
| SqueezeNet v1.1 (Caffe) | 80.3% | 80.4% | 80.4% | 1.91E-09 | 6.77E-06 |
Table 1. Quality assessment results for the classification task. Measurements were performed on the ImageNet 2012 validation set (ILSVRC2012 val, 50,000 examples).
| Model (source framework) | Published mean IoU | Measured mean IoU in the source framework | Measured mean IoU in dnn | Average per-element difference between the output tensors of the framework and dnn | Maximum difference between the output tensors of the framework and dnn |
|---|---|---|---|---|---|
| FCN (Caffe) | 65.5% | 60.402874% | 60.402879% | 3.1E-7 | 1.53E-5 |
| ENet (Torch) | 58.3% | 59.1368% | 59.1369% | 3.2E-5 | 1.20 |
Table 2. Quality assessment results for the semantic segmentation task. The large maximum difference for ENet is explained below.
The results for FCN were computed on the validation set of the PASCAL VOC 2012 segmentation challenge (736 examples). The results for ENet were computed on the Cityscapes validation set (500 examples).
A few words about the meaning of these numbers. For classification problems, the generally accepted quality metric is accuracy over the top-5 network responses (accuracy@top-5, [1]): if the correct answer is among the 5 responses with the highest confidence, the response is counted as correct. Accuracy is then the ratio of the number of correct answers to the number of examples. Measuring this way partly accounts for imperfect labeling, for example when the labeled object is far from the center of the frame.
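As a small illustration (not the actual test script), accuracy@top-5 can be computed from a matrix of network outputs like this:

import numpy as np

def accuracy_top5(probs, labels):
    # probs: (N, num_classes) network outputs, labels: (N,) ground-truth class ids
    top5 = np.argsort(probs, axis=1)[:, -5:]        # 5 most confident classes per example
    hits = (top5 == labels[:, None]).any(axis=1)    # correct if the true label is among them
    return hits.mean()                              # ratio of correct answers to examples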
For semantic segmentation, several metrics are used: pixel accuracy and mean intersection over union (mean IoU) [5]. Pixel accuracy is the ratio of correctly classified pixels to all pixels. Mean IoU is a more involved characteristic: it is the class-averaged ratio of correctly labeled pixels of a class to the union, that is, the number of pixels belonging to the class plus the number of pixels predicted as that class, minus the correctly labeled pixels themselves.
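Similarly, a toy sketch of computing mean IoU from a confusion matrix (again, just an illustration of the metric, not the evaluation code used here):

import numpy as np

def mean_iou(conf):
    # conf[i, j] is the number of pixels of true class i predicted as class j
    tp = np.diag(conf).astype(np.float64)               # correctly labeled pixels per class
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp    # ground truth + predicted - intersection
    return (tp / np.maximum(union, 1)).mean()           # average of the per-class IoU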
The tables show that for both the classification and segmentation tasks there is no difference in accuracy between running a model in its original framework and in dnn. This remarkable fact means that the module can be used safely, without fear of unpredictable results. All testing scripts are also available here, so you can verify the results for yourself.
The difference between the published numbers and those obtained in my experiments can be explained by the fact that the authors of the models performed all computations on the GPU, while I used CPU implementations. It was also noticed that different libraries can decode JPEG in slightly different ways; this could have affected the results for FCN, since the PASCAL VOC 2012 dataset consists of images in this format and semantic segmentation models are quite sensitive to changes in the distribution of the input data.
As you may have noticed, Table 2 shows an anomalously large maximum difference between the dnn and Torch outputs for the ENet model. This fact interested me too, and below I briefly describe its cause.
Why is there a big difference between dnn and Torch for ENet?
The ENet model uses several MaxPooling operations. This operation selects the maximum element in the neighborhood of each position and writes that maximum value to the output tensor, while also passing on the indices of the selected elements. These indices are later used by an operation that is, in a sense, the inverse of MaxPooling: MaxUnpooling. This operation writes elements of its input tensor to the output positions given by those indices. This is where the large error arises: in some neighborhood, the MaxPooling operation selects an element with a "wrong" index. The difference between the correct Torch output and the dnn output for that layer is within computational error (about 1e-7), but the differing indices point to neighboring elements of the neighborhood; that is, due to a tiny fluctuation a neighboring element became slightly larger than the element with the correct index. The result of MaxUnpooling, however, depends not only on the output of the previous layer but also on the indices of the corresponding MaxPooling operation, which sits much earlier (near the beginning of the model's computational graph). As a result, MaxUnpooling writes an element with the correct value to the wrong position, and the error accumulates.
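A toy numpy illustration of this mechanism (not the actual layer code) shows how a difference of the order of 1e-7 in the pooled values can turn into a difference of the order of the values themselves after unpooling:

import numpy as np

# One pooling window in which the two largest inputs are almost equal
window_torch = np.array([0.1, 0.5000001, 0.5000000, 0.2])
window_dnn = np.array([0.1, 0.5000000, 0.5000001, 0.2])      # perturbed by ~1e-7

print(window_torch.max() - window_dnn.max())                 # pooled values differ by ~1e-7
print(window_torch.argmax(), window_dnn.argmax())            # but the stored indices differ: 1 vs 2

# MaxUnpooling writes the value back to whichever index was stored,
# so the two outputs now differ by the whole value, not by 1e-7
out_torch = np.zeros(4); out_torch[window_torch.argmax()] = window_torch.max()
out_dnn = np.zeros(4); out_dnn[window_dnn.argmax()] = window_dnn.max()
print(np.abs(out_torch - out_dnn).max())                     # ~0.5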
Unfortunately, this error cannot be eliminated: its root cause most likely lies in slightly different implementations of the algorithms used during training and during inference, and not in a bug in the implementation.
To be fair, the average per-element error of the output tensor remains low (see Table 2), meaning that wrong indices are quite rare. Moreover, this error does not lead to any deterioration in model quality, as the numbers in the same Table 2 show.
Performance
One of the goals we set when developing dnn was decent performance of the module on various architectures. Not long ago the module was optimized for the CPU, and as a result dnn now shows good speed.
I measured the running time of the different models; the results are in Table 3.
| Model (source framework) | Image resolution | Source framework, CPU (acceleration library); memory consumption | dnn, CPU (speedup relative to the source framework); memory consumption |
|---|---|---|---|
| AlexNet (Caffe) | 227x227 | 23.7 ms (MKL); 945 MB | 14.7 ms (1.6x); 713 MB |
| GoogLeNet (Caffe) | 224x224 | 44.6 ms (MKL); 197 MB | 20.1 ms (2.2x); 172 MB |
| ResNet-50 (Caffe) | 224x224 | 70.2 ms (MKL); 386 MB | 58.8 ms (1.2x); 224 MB |
| SqueezeNet v1.1 (Caffe) | 227x227 | 12.4 ms (MKL); 113 MB | 5.3 ms (2.3x); 38 MB |
| GoogLeNet (TensorFlow) | 224x224 | 17.9 ms (Eigen); 310 MB | 21.1 ms (0.8x); 135 MB |
| FCN (Caffe) | various (500x350 on average) | 3873.6 ms (MKL); 4453 MB | 1229.8 ms (3.1x); 1332 MB |
| ENet (Torch) | 1024x512 | 1105.0 ms; 828 MB | 218.7 ms (5.1x); 190 MB |
Table 3. Running time measurements for various models. Experiments were conducted on an Intel Core i7-6700K.
Timings were averaged over 50 runs and measured as follows: for dnn, OpenCV's built-in timer was used; for Caffe, the caffe time utility; for Torch and TensorFlow, their existing timing functions.
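For reference, here is a minimal sketch of such a measurement with OpenCV's tick counter (the file names are the same placeholders as in the classification example above; the actual benchmarking harness is more involved):

import cv2 as cv

net = cv.dnn.readNetFromCaffe('bvlc_googlenet.prototxt', 'bvlc_googlenet.caffemodel')
blob = cv.dnn.blobFromImage(cv.imread('space_shuttle.jpg'), 1.0, (224, 224), (104, 117, 123))

runs = 50
start = cv.getTickCount()
for _ in range(runs):
    net.setInput(blob)
    net.forward()
elapsed_ms = (cv.getTickCount() - start) / cv.getTickFrequency() * 1000.0 / runs
print('average forward pass: %.1f ms' % elapsed_ms)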
As Table 3 shows, in most cases dnn exceeds the performance of the original frameworks. Up-to-date performance figures for OpenCV dnn on various models, compared with other frameworks, can also be found here.
Future plans
Deep learning has taken a significant place in computer vision, and accordingly we have big plans for developing this functionality in OpenCV. They concern improving usability, reworking the internal architecture of the module and improving performance.
In improving usability we are guided primarily by the wishes of the users themselves: we strive to add the functionality that developers and researchers need in real-world tasks. The plans also include adding network visualization and expanding the set of supported layers.
As for performance, despite the many optimizations already done, we still have ideas on how to improve the results. One of them is lowering the precision of the computations, a procedure called quantization. Roughly speaking, part of the significant digits of the inputs and the layer weights is thrown away before computing convolutions (fp32 → fp16), or scaling factors are computed that map the range of the input numbers to an int or short range. This increases speed (thanks to faster integer operations), though accuracy may suffer slightly. However, publications and experiments in this area show that even fairly aggressive quantization in certain cases does not lead to a noticeable drop in quality.
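As a toy sketch of the second variant, symmetric quantization of a tensor to int8 with a single scale factor might look like this (dnn's own quantization, when it appears, may work differently):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                          # largest value maps to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(1000).astype(np.float32)                 # e.g. a layer's weights
q, scale = quantize_int8(w)
w_restored = q.astype(np.float32) * scale
print(np.abs(w - w_restored).max())                          # quantization error, about scale / 2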
Parallel execution of layers is another optimization idea. In the current implementation only one layer runs at a time, and each layer parallelizes its computations as much as it can. However, in some cases the computation graph can also be parallelized at the level of the layers themselves. This could potentially give each thread more work and thereby reduce overhead.
Something quite interesting is also being prepared for the release. I think few have heard of the Halide programming language. It is not Turing-complete: some constructs cannot be expressed in it, and perhaps that is why it is not popular. However, this drawback is also its advantage: code written in it can be automatically turned into code highly optimized for different hardware: CPU, GPU, DSP. And you do not need to be an optimization guru; a special compiler does everything for you. Even now, Halide lets some models run faster: for example, semantic segmentation with the ENet model runs at 25 fps at a resolution of 512x256 on an Intel Core i7-6700K (versus 22 fps for dnn without Halide). And, best of all, without rewriting any code you can use the integrated GPU and gain an extra couple of frames per second.
In fact, we have high hopes for Halide. Thanks to its unique properties it will allow speedups without requiring any extra effort from the user. We want to make sure that when Halide is used with OpenCV, the user does not have to install additional software; the "out of the box" principle should be preserved. And, as our experiments show, we have every chance of achieving this.
Conclusion
dnn already has everything it needs to be useful, and every day more users are discovering its capabilities. Still, there is work to do. I will continue working on the module, extending its capabilities and improving its functionality. I hope this article was interesting and useful for you.
If you have questions, suggestions or problems, or you would like to contribute by submitting a pull request, welcome to the github repository, as well as to our forum, where my colleagues and I will try to help you. If neither of these works for you, you can find additional ways to get in touch on our website. I am always glad to cooperate and welcome constructive comments and suggestions. Thanks for your attention!
P.S. I express my deep gratitude to my colleagues for their help with this work and with writing this article.
Links
1. ImageNet Classification with Deep Convolutional Neural Networks
2. Going Deeper with Convolutions
3. Deep Residual Learning for Image Recognition
4. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
5. Fully Convolutional Networks for Semantic Segmentation
6. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
7. SSD: Single Shot MultiBox Detector
8. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
9. OpenCV on GitHub
10. OpenCV official website
11. OpenCV forum
12. Halide
13. Caffe
14. TensorFlow
15. Torch