
License plate recognition is still the best-selling computer vision solution. Hundreds, if not thousands, of products have competed in this market over the past 20-25 years. This is partly why convolutional neural networks (CNNs) have not yet displaced the earlier algorithmic approaches on the market.
But the experience of recent years shows that CNN-based algorithms make it possible to build reliable and flexible production solutions. There is one more convenience: with this approach, you can keep improving the reliability of a deployed solution by retraining it on newly collected data. In addition, such algorithms map very well onto GPUs, which are much more energy-efficient than conventional processors. And NVidia's Jetson TX platform consumes very little power by the standards of modern computing devices. A visual comparison of this "energy superiority":

Of course, the superiority of the Jetson TX1 over the Intel Core i7 here is exaggerated: there are always side tasks (capturing images from the camera, working with memory), and it is not practical to move all computation to the GPU. And yet, it looks tempting.
It is easy to estimate the power budget for the complete system:

So, even for an autonomous installation powered by sun and wind, you can run 3-4 cameras with infrared illuminators and 2 Jetsons.
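As a rough back-of-the-envelope illustration of such a budget (the wattages below are our assumptions for a sketch, not figures from the article's table):

```python
# Rough power budget for an autonomous recognition post.
# All wattages are assumed ballpark values, for illustration only.
JETSON_TX1_W = 10      # typical load per module
CAMERA_W = 4           # one IP camera
IR_ILLUMINATOR_W = 8   # one infrared source

def station_power(cameras: int, jetsons: int) -> int:
    """Total draw of a post with N cameras (each paired with an IR source)."""
    return cameras * (CAMERA_W + IR_ILLUMINATOR_W) + jetsons * JETSON_TX1_W

total = station_power(cameras=4, jetsons=2)
print(f"Estimated draw: {total} W")  # 4*(4+8) + 2*10 = 68 W
```

Under these assumptions the whole post stays well within what a modest solar-plus-wind setup can supply.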
That makes it even more attractive! So the task was to do the recognition on the Jetson TX1, and do it well.
We owe a big thank-you to the NVidia team in Russia: with their help, everything worked out. They gave us a Jetson TX1 for the experiments and promised to let us test the TX2.
License Plate Search Algorithm
The search for the boundaries and corners of license plates in the photographs consists of three stages, each implemented by a separately trained CNN.
Determining the position (center) of the license plate:

The output layer of the convolutional network is a probability field: the likelihood that the center of a plate is located at a given point. The output looks like these nice Gaussians. Of course, false positives are possible here. Moreover, the thresholds are chosen to minimize the probability of missing a plate (at the cost of more false positives).
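A minimal sketch of how plate-center candidates could be picked from such a probability field: threshold it, then keep local maxima. The threshold and neighborhood size here are illustrative assumptions, not the article's actual values:

```python
import numpy as np

def find_centers(heatmap: np.ndarray, thresh: float = 0.3, radius: int = 2):
    """Return (row, col) peaks: points above `thresh` that are the
    maximum within their (2*radius+1)^2 neighborhood."""
    peaks = []
    h, w = heatmap.shape
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            if v < thresh:
                continue
            window = heatmap[max(0, y - radius):y + radius + 1,
                             max(0, x - radius):x + radius + 1]
            if v >= window.max():
                peaks.append((y, x))
    return peaks

# Toy heatmap with one Gaussian-like bump around (4, 4).
hm = np.zeros((9, 9))
hm[4, 4] = 1.0
hm[4, 3] = hm[4, 5] = hm[3, 4] = hm[5, 4] = 0.6
print(find_centers(hm))  # [(4, 4)]
```

A low threshold deliberately lets false positives through; the later stages are expected to filter them out.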
Over the past couple of years, tens of thousands of assorted plates have been uploaded to our freely accessible server. With such a base, we managed to train a much more reliable detector than the Haar cascade we used previously.
Estimating the dimensions of the license plate:

At this stage we also screen out some of the false positives.
Searching for the best homography, the transformation that brings the license plate into its familiar canonical view:
Here two more convolutional networks are at work: one detects the borders of the plate, the other selects the most plausible hypothesis.
The result of the first:

And the second one selects the best homography:

The result is a normalized image of the license plate, which then has to be recognized. Below are a few examples of such normalized plates (some of them are hard to read even by eye):
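For illustration, a homography that maps the four detected plate corners onto a canonical rectangle can be estimated with the standard direct linear transform (DLT); this is a generic sketch with made-up corner coordinates, not the article's code:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve for a 3x3 homography H such that H @ (x, y, 1) ~ (u, v, 1)
    for the 4 given point correspondences (direct linear transform)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null-space vector of A (last right-singular vector) holds H.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)

# Skewed plate corners -> canonical 128x32 rectangle (illustrative numbers).
src = [(10, 20), (150, 35), (148, 70), (8, 60)]
dst = [(0, 0), (128, 0), (128, 32), (0, 32)]
H = homography_from_corners(src, dst)

# Sanity check: the first corner should map to (0, 0).
p = H @ np.array([10.0, 20.0, 1.0])
print(p[:2] / p[2])
```

In practice the warp itself would be done with something like OpenCV's `warpPerspective`; the point here is only the corner-to-rectangle mapping.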

This multi-level approach proved very reliable. At each stage, the probability of a miss is minimized, and the convolutional network at each subsequent level learns not only to perform its main function but also to reject false positives from the previous one. Thus, the inaccuracy of each CNN is compensated in the next step.
To make all this work, we used 2 types of convolutional networks:
1) A standard classification architecture, similar to VGG:

2) An architecture intended for segmentation, described in more detail in a previous article.

Of course, these architectures had to be lightened somewhat to run on the Jetson TX1. In addition, several tricks with loss functions and output layers helped improve generalization when training on the modest dataset we had available.
License Plate Recognition Algorithm
To be honest, it seemed that the problems of text recognition had been thoroughly solved years ago by modern deep learning algorithms. But somehow they have not.
Here is a relatively recent overview of existing problems and solutions.
On Habr there was recently an article about using LSTM and CNN for recognizing text in passports. But the described approach requires manual annotation of the entire dataset (the boundaries between characters). And we had a sizable dataset with a different kind of annotation (image + text):

After all, we had the previous-generation algorithm, which often recognized the plate quite successfully (though sometimes making mistakes). So we needed an approach that would let us train the neural network without any information about the position of each character. Such an approach is all the more valuable because it makes it much easier to grow the training sample automatically. In a system where a human checks and corrects recognition results (for example, before sending a fine), an extension of the training set is generated automatically. And even without corrections, retraining helps once enough recognized plates have accumulated.
Any convolutional network has a built-in structure in which the same kernels (convolutions) are applied at different positions of the image. This yields a substantial reduction in the number of weights.
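The saving is easy to quantify. For an input of illustrative size, a 3x3 convolutional layer needs orders of magnitude fewer weights than a fully connected layer producing the same number of outputs (all sizes below are assumptions chosen for the arithmetic):

```python
H, W = 32, 64              # input resolution (illustrative)
in_ch, out_ch, k = 1, 16, 3  # channels and kernel size

# Convolution: the same k x k kernels are shared across all positions.
conv_weights = out_ch * in_ch * k * k + out_ch  # weights + biases

# Fully connected layer producing the same out_ch x H x W outputs,
# with every output connected to every input pixel.
fc_weights = (H * W * in_ch) * (H * W * out_ch) + H * W * out_ch

print(conv_weights)   # 160
print(fc_weights)     # 67141632
print(fc_weights // conv_weights)  # weight sharing saves a factor of ~420000
```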
You can look at this property of convolutional networks from another angle, as suggested by AlexeyR (read here). In short, the task of generalization is inextricably linked to the task of transforming input information into a context in which it is more familiar or better described. Armed with (or inspired by) this concept, we will try to solve this seemingly simple task of recognizing text in an image with the annotation we have (no character positions, just the list of characters).
Let's build a small context-dependent area of 36 minicolumns (more):

The image shows only 10 of the transformations. Here is what the responses to each of the contexts should look like for the picture above.

It should be noted that far from the most efficient coding inside the minicolumns was used, but with only 22 concepts this is not a big problem.
Convolutional networks are constructed so conveniently that these contextual transformations are obtained automatically at the output of any convolutional layer. This means there is no need to shift the input images: it is enough to take the output of any convolutional layer, apply several transformations already implemented as layers in Caffe, and run SGD training.
And as a result of training, we get a "slice" across the 36 minicolumns:

Let us pick out the local maxima and read off, for each one, what was recognized and in which context. Then we collect everything in order from left to right: A902YT190
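The final read-off step can be sketched as follows: given per-position character scores, keep the local score maxima above a threshold and concatenate them left to right. The alphabet is the 22 symbols used on Russian plates; the threshold and toy scores are illustrative, and the article's actual minicolumn coding differs:

```python
import numpy as np

ALPHABET = "0123456789ABEKMHOPCTYX"  # 22 concepts: digits + plate letters

def read_plate(scores: np.ndarray, thresh: float = 0.5) -> str:
    """scores[i, j] = confidence that position i shows ALPHABET[j].
    Keep positions whose best score is a local maximum along the
    position axis and above `thresh`, then read left to right."""
    best = scores.max(axis=1)
    chars = []
    for i, s in enumerate(best):
        left = best[i - 1] if i > 0 else -1.0
        right = best[i + 1] if i + 1 < len(best) else -1.0
        if s > thresh and s >= left and s >= right:
            chars.append(ALPHABET[scores[i].argmax()])
    return "".join(chars)

# Toy example: 7 positions with peaks spelling "A90" at positions 1, 3, 5.
scores = np.full((7, len(ALPHABET)), 0.1)
scores[1, ALPHABET.index("A")] = 0.9
scores[3, ALPHABET.index("9")] = 0.8
scores[5, ALPHABET.index("0")] = 0.85
print(read_plate(scores))  # A90
```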
This problem was solved with Caffe and a convolutional network architecture. But if we had to handle more complex and non-obvious transformations (scale, rotations in space, perspective), we would need more computing resources and considerably more sophisticated algorithms.
In addition, we had to sweat quite a bit over the region codes, whose image quality often leaves much to be desired. But that is such a long story that we don't want to include it in this already long article.
Still, one thing is worth mentioning separately: if regions are left aside, any plates with a uniform font can be recognized by the same network. It is enough to provide a sufficiently large base of examples.
Applications
Of course, the most obvious application is traffic control. But there are a few more that we have encountered. We wanted to make a fairly universal algorithm, so we had to think through the kinds of use cases that exist:
1) Traffic Control (application 1)

- High resolution cameras: 3-8MP
- IR illuminator
- Predictable orientation and scale
- Possible manual adjustment after installation
- Up to 10-20 numbers per frame
2) Traffic control at the checkpoint (application 2)

- Low resolution cameras (0.3MP - 1MP)
- IR illuminator
- Predictable plate size and location
- Possible manual adjustment after installation
- Only one number at a time in the frame
3) Handheld photos (application 3)

- High resolution camera
- No IR illumination
- Unpredictable plate size and position in the frame

It turned out that the last application is the most expensive in terms of computing resources, mainly because the plate scale is poorly predictable. The simplest is the second: there is no more than one plate in the frame, and its scale varies only slightly. The first application also works well and sits in the middle in terms of computational cost. The Jetson TX1 copes with all of these situations; only the last one does not run in real time, taking about 1 second per frame in unoptimized code.
Jetson TX1 per-frame time budget

The performance tests are preliminary; further optimization is still possible, especially for specific applications.
And, of course, if you run the same algorithm on a 1080/Titan video card, you can get almost a tenfold speedup. For the third application that is already ~100 ms per image.
Telegram Bot
An article on Habr is no good if it does not end with the story of writing a Telegram bot! So, of course, we made one on the Jetson TX1 so that you can send a picture and test the algorithm. The bot's name is Recognitor. It is now running on the Jetson TX1 demo board:

The Jetson TX1 recognizing images at home (notice the fan: it turns on every time a new image arrives).

It turned out that a Telegram bot is an excellent support tool when working with computer vision programs. For example, you can use it to organize manual verification of recognition results. The Telegram API provides an excellent feature set for this. ZlodeiBaal has already used a Telegram bot here before.
It's simple: you attach an image to a message (just one! Telegram seems to allow sending several, but only the last will be analyzed). An image is returned with the detected plates framed, plus a few lines listing all the probable plate numbers found. Each line also shows a percentage: the conditional probability that the plate was actually recognized. Less than 50% means something is wrong with the plate (part of it did not fit into the frame, or it is not a plate at all).
If you want to leave us a comment on a specific photo, just write it as text, without attaching a picture. It will be saved in the file log.txt, which we will very likely read.
Keep in mind that the algorithm currently runs in test mode and expects only standard Russian plates or yellow taxi plates (no transit plates, trailers, etc.).
Training Set
We had about 25,000 images at our disposal. Most of the dataset consists of phone snapshots with arbitrary plate position and scale:

Several thousand pictures were taken from checkpoints:

About a thousand shots are the standard view from traffic-enforcement cameras.

Most of the dataset is based on the sample that we made openly available a long time ago:
yadi.sk/d/EAfnQ947criHW
yadi.sk/d/0H2AipxrcrXqy
yadi.sk/d/U41QZ8v7cpJ6R
Eventually
We assembled a whole stack of license plate recognition algorithms based on convolutional neural networks for the Jetson TX1, fast enough to work with real-time video.
Recognition reliability has improved significantly relative to the previous algorithm. It is difficult to cite specific figures, since they vary greatly depending on the application, but the difference is clearly visible to the naked eye: try Recognitor.
And since all the algorithms can be retrained, you can easily extend the applicability of the solution, up to changing the objects of recognition.