
Face Detection with Movidius Neural Compute Stick

Not so long ago the Movidius Neural Compute Stick (NCS) was released: a hardware accelerator for neural networks with a USB interface. I was interested in its potential use in robotics, so I bought one and decided to run some kind of neural network on it. However, most of the existing examples for the NCS solve the image classification problem, and I wanted to try something else, namely face detection. In this post I would like to share the experience gained during that experiment.

All the code can be found on GitHub.



More about NCS


The Neural Compute Stick is a device designed to accelerate neural networks (mainly convolutional ones) at inference time. The idea is that the NCS can be attached to a robot or a drone and run neural networks where there are not enough computational resources to do so otherwise. For example, the NCS can be connected to a Raspberry Pi.
The framework for this device, known as the NCSDK, includes Python and C++ APIs as well as several useful utilities that let you compile a neural network into a format the NCS understands, measure the time spent in each layer, and test the network. The starting point can be a pre-trained neural network in Caffe or TensorFlow format.

Model selection


So, we want to solve the face detection problem. There are two quite popular families of neural network architectures for detection tasks: Fast-RCNN / Faster-RCNN and YOLO. I did not want to train my own model at this stage, so I decided to look for a ready-made one.

The difficulty is that the NCSDK does not support everything available in Caffe and TensorFlow, so an arbitrary architecture may simply fail to compile. For example, not all layer types are supported, and the architecture must have exactly one input layer (a full list of restrictions and supported layers for Caffe can be found here). The first face detection model I managed to find (a Faster-RCNN) violated both of these requirements.

Then I stumbled upon a trained model with the YOLO architecture. The only problem was that the network was in Darknet format; the architecture itself looked suitable for the NCS, so the idea was to convert the network to Caffe format.

Model conversion


To convert the model, I decided to use this project, which can convert between the Darknet, PyTorch and Caffe formats.

I run the converter inside a Docker container: this trick appeared because the converter did not get along with the version of Caffe installed by the NCSDK, and I did not want to touch the system configuration:

 sudo docker run -v `pwd`:/workspace/data \
   -u `id -u` -ti dlconverter:latest bash -c \
   "python ./pytorch-caffe-darknet-convert/darknet2caffe.py \
   ./data/yolo-face.cfg ./data/yolo-face_final.weights \
   ./data/yolo-face.prototxt ./data/yolo-face.caffemodel"

Something interesting happens: the model is converted, but a warning is printed that the Crop, Dropout and Detection layers could not be recognized, so the converter skipped them. Can we do without these layers? It turns out we can. If you look closely at the Darknet code, you will notice the following:

The Crop layer is needed only during training. It performs data augmentation: rotating the image by random angles and cutting random fragments out of it. At inference time it is not required.

Dropout is a little more interesting. A Dropout layer is also needed mainly during training (it zeroes the outputs of neurons with probability p) in order to avoid overfitting and improve the model's ability to generalize. At inference time it can be removed, but then the neuron outputs have to be rescaled so that the expected value at the inputs of the next layer does not change and the model's behaviour is preserved. If you look at the Darknet code, you can see that its Dropout layer not only zeroes some outputs during training but also rescales the surviving ones (dividing them by 1 − p), so the Dropout layer can simply be removed at inference without any extra scaling.
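
A minimal sketch of this "inverted dropout" idea (not the Darknet source itself): since the surviving activations are already divided by 1 − p during training, the expected value of each output equals its input, and the inference pass needs no dropout layer at all.

    // Sketch of inverted dropout as used during Darknet-style training.
    // Expected output: x * (1 - p) * 1 / (1 - p) = x, so at inference
    // the layer is simply skipped (identity).
    #include <random>
    #include <vector>

    void dropout_forward_train(std::vector<float>& activations, float p,
                               std::mt19937& rng) {
        std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
        const float scale = 1.0f / (1.0f - p);
        for (float& a : activations) {
            a = (uniform(rng) < p) ? 0.0f : a * scale;
        }
    }
    // At inference: no call at all -- the activations pass through unchanged.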

As for the Detection layer, it is the last one: it converts the outputs of the penultimate layer into a more readable form and also computes the loss function during training. We do not need to compute the loss, but converting the result into a readable form is useful. In the end, I decided to simply take the function for the last layer straight from Darknet (after editing it a little), and from there I also took the NMS function (non-maximum suppression, which removes redundant bounding boxes). Both live in the files detection_layer.c and detection_layer.h.
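
For reference, here is a minimal sketch of what non-maximum suppression does (this is not Darknet's implementation, just the idea): candidate boxes are sorted by score, and any box that overlaps an already accepted box too strongly is discarded. Since there is only one class here (faces), no per-class handling is needed.

    #include <algorithm>
    #include <vector>

    struct Box { float x, y, w, h, score; };   // centre, size, confidence

    // Intersection over union of two boxes given by centre/width/height.
    static float iou(const Box& a, const Box& b) {
        float x1 = std::max(a.x - a.w / 2, b.x - b.w / 2);
        float y1 = std::max(a.y - a.h / 2, b.y - b.h / 2);
        float x2 = std::min(a.x + a.w / 2, b.x + b.w / 2);
        float y2 = std::min(a.y + a.h / 2, b.y + b.h / 2);
        float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
        float uni = a.w * a.h + b.w * b.h - inter;
        return uni > 0 ? inter / uni : 0.0f;
    }

    std::vector<Box> nms(std::vector<Box> boxes, float iou_threshold) {
        std::sort(boxes.begin(), boxes.end(),
                  [](const Box& a, const Box& b) { return a.score > b.score; });
        std::vector<Box> kept;
        for (const Box& c : boxes) {
            bool suppressed = false;
            for (const Box& k : kept)
                if (iou(c, k) > iou_threshold) { suppressed = true; break; }
            if (!suppressed) kept.push_back(c);
        }
        return kept;
    }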

Here it is worth making a remark about what the penultimate layer produces. In the YOLO (You Only Look Once) architecture, the image is divided into an n × n grid of cells (in this case n = 11), and for each cell 5m + c values are predicted, where m = 2 is the number of bounding boxes per cell and c = 1 is the number of classes. The values themselves are: the coordinates, width, height and confidence for each of the m boxes (five values per box), plus the probability that an object of each class is present in the cell. In total this gives n × n × (5m + c) = 1331 values. The Detection layer merely separates the different kinds of data and puts them into a structured form.
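
As a rough sketch of what that decoding step looks like: the assumption below is that, as in Darknet's detection layer, the flat output vector stores all class probabilities first, then all box confidences, then all box coordinates; the exact layout and coordinate transforms should be checked against detection_layer.c, since the project itself simply reuses that code.

    #include <vector>

    struct Detection { float x, y, w, h, score; };

    // Decode the 11 x 11 x (5*2 + 1) = 1331 raw outputs into candidate boxes.
    std::vector<Detection> decode(const float* out) {
        const int S = 11, B = 2, C = 1;    // grid size, boxes per cell, classes
        std::vector<Detection> dets;
        for (int cell = 0; cell < S * S; ++cell) {
            int row = cell / S, col = cell % S;
            float class_prob = out[cell * C];                // single "face" class
            for (int b = 0; b < B; ++b) {
                float conf = out[S * S * C + cell * B + b];  // box confidence
                const float* box = &out[S * S * (C + B) + (cell * B + b) * 4];
                Detection d;
                d.x = (box[0] + col) / S;    // box centre, relative to the image
                d.y = (box[1] + row) / S;
                d.w = box[2];                // relative width and height
                d.h = box[3];
                d.score = conf * class_prob; // objectness times class probability
                dets.push_back(d);
            }
        }
        return dets;
    }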

One problem remains: the converter generates a .prototxt file with an input layer description that the NCSDK cannot parse. The difference is purely cosmetic: the converter writes the size of the input layer in this format:

 input_dim: x
 input_dim: y
 input_dim: z
 input_dim: w

while the mvNCCompile utility, which compiles a neural network into a file the NCS understands, wants to see this format:

 input_shape {
   dim: x
   dim: y
   dim: z
   dim: w
 }

The Python script utils/fix_proto_input_format.py solves this problem, so you do not have to edit the file by hand.

Compiling the model


Now that the model has been translated into the Caffe format, you can compile it. This is done quite simply:

 mvNCCompile -s 12 -o graph -w yolo-face.caffemodel yolo-face-fix.prototxt 

This command should generate a binary graph file, which is a graph of computations in a format understood by NCS.

Image preprocessing


It is important to preprocess the images correctly before feeding them into the computation graph, otherwise the neural network will not work as intended. As input data for the network I will use webcam frames captured with OpenCV.

Judging by the code of the Darknet demo, before feeding an image to the network you need to resize it to 448 × 448 (without preserving the aspect ratio), normalize every pixel to the range [0, 1] and flip the channel order from BGR to RGB. In general, BGR is the convention in Caffe and OpenCV while Darknet expects RGB, but the converter knows nothing about this, so the channels still have to be swapped.
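
A minimal sketch of this preprocessing step, assuming OpenCV frames (the function name and the flat float buffer are illustrative, not the project's actual code):

    #include <opencv2/opencv.hpp>
    #include <cstring>
    #include <vector>

    // Resize to 448x448, swap BGR -> RGB, scale pixels to [0, 1] and flatten.
    std::vector<float> preprocess(const cv::Mat& bgr_frame) {
        cv::Mat resized, rgb, float_img;
        cv::resize(bgr_frame, resized, cv::Size(448, 448));   // aspect ratio ignored
        cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);        // Darknet expects RGB
        rgb.convertTo(float_img, CV_32FC3, 1.0 / 255.0);      // normalize to [0, 1]

        // Flatten into a contiguous float buffer; it will later be converted
        // to fp16 before being sent to the NCS.
        std::vector<float> tensor(float_img.total() * float_img.channels());
        std::memcpy(tensor.data(), float_img.ptr<float>(0),
                    tensor.size() * sizeof(float));
        return tensor;
    }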

Communicating with the NCS and loading data


Here it is worth noting that I use C++ rather than Python, because I have robotics applications in mind and believe that C++ gives better speed. This creates an extra difficulty: the computation graph takes its input and returns its output in fp16 format (16-bit floating point numbers), which C++ does not support out of the box. In the NCSDK examples this problem is solved with the floattofp16 and fp16tofloat functions borrowed from NumPy, so I use the same solution.

In order to start working with the NCS, you need to perform a number of actions: find the device and get its name (mvncGetDeviceName), open it (mvncOpenDevice), read the compiled graph file into memory, and allocate the computation graph on the device (mvncAllocateGraph).

The mvncLoadTensor and mvncGetResult functions are used to load the data and fetch the result. Here you need to remember to convert the input data to fp16 and the result back to float.

To finish working with the NCS, you need to free the resources allocated for the computation graph (mvncDeallocateGraph) and close the device (mvncCloseDevice). The whole sequence, condensed, is sketched below.
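
A condensed sketch of the raw calls, assuming the NCSDK v1 C API from mvnc.h; error handling and the float/fp16 conversion are left out, and the commented-out inference part assumes an input buffer already produced by the preprocessing step and floattofp16.

    #include <mvnc.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        char dev_name[100];
        void *dev_handle = nullptr, *graph_handle = nullptr;

        // 1. Find and open the device.
        if (mvncGetDeviceName(0, dev_name, sizeof(dev_name)) != MVNC_OK) return 1;
        mvncOpenDevice(dev_name, &dev_handle);

        // 2. Read the compiled "graph" file and allocate it on the stick.
        FILE *f = fopen("graph", "rb");
        fseek(f, 0, SEEK_END);
        long graph_size = ftell(f);
        fseek(f, 0, SEEK_SET);
        void *graph_data = malloc(graph_size);
        fread(graph_data, 1, graph_size, f);
        fclose(f);
        mvncAllocateGraph(dev_handle, &graph_handle, graph_data,
                          (unsigned int)graph_size);

        // 3. Inference: load an fp16 input tensor, then fetch the fp16 result.
        // mvncLoadTensor(graph_handle, input_fp16, input_size_bytes, nullptr);
        // void *result_fp16; unsigned int result_size; void *user_param;
        // mvncGetResult(graph_handle, &result_fp16, &result_size, &user_param);

        // 4. Clean up.
        mvncDeallocateGraph(graph_handle);
        free(graph_data);
        mvncCloseDevice(dev_handle);
        return 0;
    }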

Since interacting with NCS requires quite a lot of actions, I wrote a wrapper class that (besides the constructor and destructor) has only two functions: load_file to initialize the device and the graph, and load_tensor to load the data and get the result.

Profiler


The NCSDK includes a useful utility that lets you not only measure the running time of each layer, but also generate a diagram of the computation graph showing the characteristics of every element (incidentally, you do not even have to pass the layer weights):

 mvNCProfile yolo-face-fix.prototxt -w yolo-face.caffemodel -s 12 

What happened in the end


The final result is as follows:

The original model (.cfg and .weights) is converted into Caffe format (.prototxt and .caffemodel) with the converter, the input format in the .prototxt file is fixed with the Python script, and then the model is compiled into a graph file (the convert and graph targets in the Makefile).

In the program itself, each captured frame is preprocessed, converted to fp16 and loaded into the computation graph. The result is converted back from fp16 to float and passed to a function that emulates the last Detection layer, after which non-maximum suppression is applied.

The demo proudly delivers 4.5 frames per second, which is not much. The problem, apparently, is that this architecture is tuned more for accuracy than for speed. The speed could be increased significantly by using a "mobile" architecture such as Tiny YOLO, but for that you would have to find a new model or train your own. Still, this example shows that a neural network from Darknet can be compiled and run on the Neural Compute Stick.

Source: https://habr.com/ru/post/347438/

