Yesterday I published an article about machine learning and NVIDIA DIGITS. As promised, today's article is about why not everything is so rosy, plus an example of detecting objects in a frame with DIGITS.
NVIDIA raised a wave of PR around the DetectNet network developed and integrated into DIGITS. The network is positioned as a solution for finding identical / similar objects in an image.

What it is
At the beginning of the year I mentioned the curious Yolo network several times. In general, everyone I spoke to treated it rather negatively, saying that Faster-RCNN is much faster and simpler. But NVIDIA engineers were inspired by it and assembled their own Caffe network, calling it DetectNet.
The principle of the network is the same as in Yolo. For an input image of size (N*a × N*a) the network output is an N*N*5 array: for each a*a region of the original image it stores 5 values, namely whether an object is present there and its bounding box.
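To make this layout clearer, here is a toy sketch of how such an output could be decoded. The cell size, grid size, threshold, and value order below are illustrative assumptions, not the actual DIGITS blob layout:

import numpy as np

a = 16                     # cell size (stride) in pixels, an assumed value
N = 40                     # grid cells per side, so the input image is N*a x N*a
out = np.zeros((N, N, 5))  # per cell: [object presence, x1, y1, x2, y2]

out[12, 20] = [0.9, 310, 180, 370, 210]   # pretend one cell reported an object

# a cell "fires" if its presence/coverage value exceeds a threshold
ys, xs = np.where(out[:, :, 0] > 0.5)
for y, x in zip(ys, xs):
    presence = out[y, x, 0]
    x1, y1, x2, y2 = out[y, x, 1:]
    print('cell (%d, %d): %.2f, box (%d, %d, %d, %d)' % (x, y, presence, x1, y1, x2, y2))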
Pluses of the network:
- It is fast. I got 10-20 ms per frame, while Faster-RCNN took 100-150 ms.
- It is easy to train and tune. With Faster-RCNN you had to fiddle around for a long time.
There is one minus: there are solutions with better detection quality.
A few words before I start the story
Unlike category recognition, which I wrote about yesterday, object detection is implemented poorly. It is not user friendly. Most of this article will be about how to get this miracle running. Unfortunately, that kills the original idea of DIGITS: that you can do something without understanding the logic of the system and its mathematics.
But once you do get it running, it is convenient to use.
What we will recognize
A couple of years ago we had a completely crazy idea involving license plates, which grew into a whole series of articles. Among other things, there was a decent database of photos that we posted publicly.
I decided to reuse part of those developments and detect the plates again, this time through DIGITS. So that is what we will use.
The properly labeled base I had was very small (it was collected for other purposes), but it was enough for training.
Let's go
Selecting “New Dataset -> Images -> Object Detection” in the main menu takes us to the dataset creation menu. Here you must specify:
- Training image folder - the folder with training images
- Training label folder - the folder with text label files for those images
- Validation image folder - the folder with validation images
- Validation label folder - the folder with text label files for them
- Pad image - if an image is smaller than the size specified here, it will be padded with a black background. If it is larger, dataset creation will fail ¯\_(ツ)_/¯
- Resize image - the size to which images are resized
- Minimum box size - it is best to set this value; it is the minimum object size used during validation
There is one difficulty: how do you create the text label file that describes an image?
The example on GitHub from NVIDIA in the official DIGITS repository is silent about this, mentioning only that the format is the same as in KITTI. I was somewhat surprised by this approach to users of a supposedly ready-made, out-of-the-box framework. But ok. I went, downloaded the KITTI database and its docs, and read them. The file format:
Car 0.00 0 1.95 96.59 181.90 405.06 371.40 1.52 1.61 3.59 -3.49 1.62 7.68 1.53
Car 0.00 0 1.24 730.55 186.66 1028.77 371.36 1.51 1.65 4.28 2.61 1.69 8.27 1.53
Car 0.00 0 1.77 401.35 177.13 508.22 249.68 1.48 1.64 3.95 -3.52 1.59 16.82 1.57
File description: each line describes one object with the fields type, truncated, occluded, alpha, bbox (left, top, right, bottom), dimensions, location, and rotation_y.
Naturally, most of these parameters are not needed here. In practice you can fill in only the “bbox” fields; the rest will not be used anyway.
As it turned out, there was also a second tutorial for DIGITS where the file format was actually documented. But it was not in the DIGITS repository ¯\_(ツ)_/¯
It confirmed that my guesses about what to use were correct.
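For generating such label files from your own markup, here is a minimal sketch of one way to do it. The helper name, the file paths, and the choice to zero out everything except the class and bbox are my assumptions; DIGITS expects one .txt per image with the same base name as the image:

def write_kitti_label(txt_path, boxes, class_name='Car'):
    # boxes: list of (left, top, right, bottom) in pixels
    with open(txt_path, 'w') as f:
        for left, top, right, bottom in boxes:
            fields = [class_name, 0.0, 0, 0.0,     # type, truncated, occluded, alpha
                      left, top, right, bottom,    # bbox - the only part that matters here
                      0.0, 0.0, 0.0,               # dimensions (h, w, l), unused
                      0.0, 0.0, 0.0,               # location (x, y, z), unused
                      0.0]                         # rotation_y, unused
            f.write(' '.join(str(v) for v in fields) + '\n')

# one .txt per image, with the same base name as the image file:
write_kitti_label('labels/frame_0001.txt', [(96.59, 181.90, 405.06, 371.40)])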

Let's start training
Great. The dataset is built, we start training. For training you need to set the same settings as specified in the example:
- Set Subtract Mean to None
- Set the base learning rate to 0.0001
- Use the ADAM solver
- Choose your dataset
- Select the “Custom Network” tab and copy into it the text from the file "/caffe-caffe-0.15/examples/kitti/detectnet_network.prototxt" (this is in NVIDIA's caffe fork, of course)
- It is also recommended to download a pre-trained GoogleNet model here and specify it in “Pretrained model(s)”
Also, I did the following: in the copied “detectnet_network.prototxt” I replaced all occurrences of the image size “1248, 352” with the image size from my dataset. Without this, training crashed. And naturally, not a word about this in either tutorial... ¯\_(ツ)_/¯
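A quick way to make that substitution before pasting the text into “Custom Network” is a sketch like the one below. The output file name is a placeholder, and 1024x640 is the size used in the param_str further down; a blind string replace can also touch unrelated numbers, so check the result:

with open('detectnet_network.prototxt') as f:
    proto = f.read()

# substitute both dimensions of the example resolution with your own
proto = proto.replace('1248', '1024').replace('352', '640')

with open('detectnet_network_custom.prototxt', 'w') as f:
    f.write(proto)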
The loss curve falls, training is going. But... the accuracy graph sits at zero. What?!
Neither of the two tutorials I had found answered this question. I went digging into the description of the network. Where to dig was clear right away: since the loss falls, training works, so the error is somewhere in the validation pipeline. And indeed, the network configuration contains this block:
layer {
  name: "cluster"
  type: "Python"
  bottom: "coverage"
  bottom: "bboxes"
  top: "bbox-list"
  python_param {
    module: "caffe.layers.detectnet.clustering"
    layer: "ClusterDetections"
    param_str: "1024, 640, 16, 0.05, 1, 0.02, 5, 1"
  }
}
It looks suspicious. Opening the description of the clustering layer, you can find a comment:
It becomes clear that these are thresholds. I plugged in three random numbers there without digging into the details. Training went on, and validation accuracy started to grow. After about 5 hours it reached some reasonable values.

But here is the bummer: even with successful training, plates were not being detected in 100% of the pictures. I had to dig in and understand what this layer means.
The layer merges the obtained hypotheses into a single decision. The main tool here is the OpenCV function “cv.groupRectangles”, which merges groups of rectangles into one rectangle. As you remember, the network is structured so that there should be many positive responses in the vicinity of an object; these need to be collected into a single decision. The merging algorithm has a bunch of parameters (a small sketch of how groupRectangles behaves follows the list):
- gridbox_cvg_threshold (0.05) - the object detection threshold, essentially the confidence that we found a plate. The smaller it is, the more detections.
- gridbox_rect_threshold (1) - how many detectors must fire for the decision “there is a plate”
- gridbox_rect_eps (0.02) - how much the sizes of the rectangles may differ for them to still be merged into one hypothesis
- min_height - the minimum height of an object
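To get a feel for the grouping itself, here is a small sketch of what cv2.groupRectangles does with its threshold and eps parameters. The rectangles and values here are illustrative, not the ones from the param_str:

import cv2

rects = [
    [100, 100, 50, 20],   # x, y, w, h: three almost identical hypotheses
    [102, 101, 51, 21],
    [ 98,  99, 49, 19],
    [400, 300, 60, 25],   # a lone hypothesis
]
# second argument (groupThreshold=2): clusters with 2 or fewer rectangles are rejected;
# third argument (eps=0.2): how much rectangles may differ and still be merged
grouped, weights = cv2.groupRectangles(rects, 2, 0.2)
print(grouped)   # the three overlapping boxes collapse into one, the lone one is dropped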
Now it is enough just to tune these so that everything works. And now for the humor: it turns out there was also a third tutorial where part of all this is described.
But not all of it ¯\_(ツ)_/¯
What is the result
As a result, you can see what the network detected:
It works well. At first glance, better than the Haar cascade we used before. But it immediately became clear that the small training base (~1500 frames) makes itself felt. The base did not include dirty plates => they are not detected. The base did not include plates seen at a strong angle => they are not detected. Too large / too small plates were not covered either. Well, you get the idea. In short, you need to not be lazy and properly label about 5 thousand plates.
During recognition you can look at amusing pictures of the activation maps (1, 2, 3). You can see that at each successive layer the plate stands out more and more clearly.
How to run it
A pleasant moment: the result can be run with ~20 lines of code, and you get a ready-made plate detector:
import numpy as np
import sys
caffe_root = '../'
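The original snippet is truncated here, so below is my own minimal sketch of what such a ~20-line detector could look like. The deploy/weights file names, the 1024x640 input size, and the 'data' / 'bbox-list' blob names are assumptions based on the standard DetectNet deploy layout with the "cluster" layer shown above, not the author's exact code:

import numpy as np
import sys
import cv2

caffe_root = '../'
sys.path.insert(0, caffe_root + 'python')
import caffe

caffe.set_mode_gpu()
# the deploy prototxt must contain the same "cluster" python layer as above
net = caffe.Net('deploy.prototxt', 'snapshot_iter_XXXX.caffemodel', caffe.TEST)

img = cv2.imread('frame.jpg')                       # BGR, same as during training
img = cv2.resize(img, (1024, 640))                  # the size from param_str
data = img.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW, no mean subtraction
net.blobs['data'].reshape(1, *data.shape)
net.blobs['data'].data[...] = data

out = net.forward()
# the ClusterDetections layer produces 'bbox-list';
# each row is roughly [x1, y1, x2, y2, confidence], all-zero rows are padding
for x1, y1, x2, y2, conf in out['bbox-list'][0]:
    if conf > 0:
        print(x1, y1, x2, y2, conf)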
Here I posted the network definition file and the weights of the trained network, if anyone needs them.