I travel to other countries from time to time, and the language barrier often becomes a serious obstacle. In countries that speak Germanic languages I can more or less get by, but in places like China, Israel, or the Arab countries, a trip without a local companion turns into a puzzling quest. It is impossible to make sense of the local bus and train schedules, and in small cities street signs are rarely duplicated in English. As for choosing something to eat from a menu in an unintelligible language, that is practically like walking through a minefield.
Since I am an iOS developer, I thought: why not write an app where you point the camera at a sign, a schedule, or a menu and immediately get a translation into Russian?
A quick search of the App Store showed that there are only one or two similar apps, and Russian is not among the supported languages. So the path is open, and it is worth trying to write such an app. One caveat: I am not talking about apps that photograph black text on a white sheet of paper and then digitize and translate it; those are a dime a dozen. I mean an app that can pick out text in a natural image. For example, in a photo of a bus it has to locate the text on the route plate and translate it so the user can understand where the bus is going. Or, a question that really matters to me, the menu: I very much want to know what I am ordering to eat.
The main task in the app is detecting and localizing the text, then extracting and binarizing it to feed into an OCR engine such as Tesseract. While algorithms for detecting text in scanned documents have long been known and reach 99% accuracy, detecting text of arbitrary size in photographs is still an active area of research. All the more interesting the task will be, I thought, and set about studying the algorithms.
Naturally, there is no universal algorithm for finding arbitrary text in arbitrary images; in practice, different algorithms are used for different tasks, plus heuristics. To begin with, let's formalize the problem: for our purposes we need to find text that contrasts sufficiently with the surrounding background, is oriented roughly horizontally with a skew of no more than 20 degrees, and may come in various sizes and colors.
After reviewing the algorithms, I began to reinvent the wheel in my implementation. I decided to write everything myself, without OpenCV, for a deeper immersion in the subject, basing the approach on the so-called edge-based method. Here is what I ended up with.
First, we get the image from the phone camera in BGRA format.
We convert it to grayscale and build a Gaussian pyramid of images. At each level of the pyramid we look for text of a particular size: at the lowest level we detect fonts from about k to 2k−1 pixels tall, at the next from 2k to 4k−1, and so on. Ideally we would use 4 levels in the pyramid, but remember that we are running on an iPhone, not a quad-core i7, so we limit ourselves to 3.
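The grayscale conversion and pyramid can be sketched in pure Python as follows. This is a simplified illustration, not the actual iOS implementation: the function names are my own, and a 2×2 box average stands in for the proper Gaussian blur a real pyramid would use before downsampling.

```python
def bgra_to_gray(pixels):
    """Convert rows of (B, G, R, A) tuples to grayscale with the standard luma weights."""
    return [[int(0.114 * b + 0.587 * g + 0.299 * r) for (b, g, r, a) in row]
            for row in pixels]

def downsample(gray):
    """Halve the image in each dimension (box average approximates the Gaussian blur)."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            row.append((gray[y][x] + gray[y][x + 1]
                        + gray[y + 1][x] + gray[y + 1][x + 1]) // 4)
        out.append(row)
    return out

def build_pyramid(gray, levels=3):
    """Original image plus successively halved copies; each level catches one font-size band."""
    pyramid = [gray]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```

Because each level halves both dimensions, a font of height 2k at level 0 appears with height k at level 1, which is why the same k to 2k−1 detector can be reused at every level.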
We apply the Sobel operator to highlight vertical edges, then filter the result, simply removing segments that are too short, to cut off the noise.
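A minimal sketch of this step, under my own assumptions about thresholds (the original post does not give exact values): the horizontal-gradient Sobel kernel Gx responds to vertical boundaries, and short vertical runs of edge pixels are discarded as noise.

```python
def sobel_vertical(gray):
    """Horizontal gradient magnitude (Sobel Gx), which highlights vertical edges."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            out[y][x] = abs(gx)
    return out

def filter_short_segments(edges, threshold=128, min_len=3):
    """Keep only vertical runs of strong edge pixels at least min_len tall."""
    h, w = len(edges), len(edges[0])
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        y = 0
        while y < h:
            if edges[y][x] >= threshold:
                start = y
                while y < h and edges[y][x] >= threshold:
                    y += 1
                if y - start >= min_len:  # long enough to be a character stroke
                    for yy in range(start, y):
                        out[yy][x] = 255
            else:
                y += 1
    return out
```

Character strokes produce tall vertical gradients, so requiring a minimum run length cheaply suppresses isolated noise pixels before the more expensive morphology step.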
To the vertical edges found by the Sobel operator we apply a morphological closing: horizontally by the width of the font, vertically by 5 pixels. We filter the result again, keeping only regions whose height fits the font range we are looking for, from k to 2k−1 pixels, and that are at least 3 characters long. We get this result.
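The closing (dilation followed by erosion) with a rectangular structuring element can be sketched like this; the naive nested loops are for clarity, assuming my own function names, and a production version would be far more optimized.

```python
def _window(binary, y, x, rx, ry):
    """Pixels of the rectangular structuring-element window centered at (y, x)."""
    h, w = len(binary), len(binary[0])
    return [binary[yy][xx]
            for yy in range(max(0, y - ry), min(h, y + ry + 1))
            for xx in range(max(0, x - rx), min(w, x + rx + 1))]

def dilate(binary, sx, sy):
    """Set a pixel if any pixel in its sx-by-sy neighborhood is set."""
    h, w = len(binary), len(binary[0])
    rx, ry = sx // 2, sy // 2
    return [[255 if any(_window(binary, y, x, rx, ry)) else 0
             for x in range(w)] for y in range(h)]

def erode(binary, sx, sy):
    """Set a pixel only if all pixels in its sx-by-sy neighborhood are set."""
    h, w = len(binary), len(binary[0])
    rx, ry = sx // 2, sy // 2
    return [[255 if all(_window(binary, y, x, rx, ry)) else 0
             for x in range(w)] for y in range(h)]

def close_edges(binary, font_width, sy=5):
    """Closing: glue neighboring character edges into one solid text blob."""
    return erode(dilate(binary, font_width, sy), font_width, sy)
```

Because the element is as wide as the expected font, the gaps between adjacent character strokes get filled, merging a word into a single connected region whose bounding box can then be checked against the k to 2k−1 height range.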
We perform the same operations at the next level of the pyramid.
Then we combine all the results into one, perform adaptive binarization within the selected regions, and end up with an image like this. It is already quite suitable for further recognition in OCR. You can see that the largest font went undetected, because one more level of the Gaussian pyramid is missing.
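One common form of adaptive binarization, which I sketch here as an illustration (the post does not specify which variant was used), thresholds each pixel against the mean of its local window minus a small constant, so text stays readable even under uneven lighting.

```python
def adaptive_binarize(gray, window=15, c=10):
    """Per-pixel threshold: dark text becomes 0, background becomes 255."""
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            # pixel noticeably darker than its surroundings -> text
            out[y][x] = 0 if gray[y][x] < mean - c else 255
    return out
```

This naive version recomputes the window mean for every pixel, O(window²) per pixel; a real-time implementation on a phone would use an integral image to get each window sum in constant time.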
Below are examples of the algorithm on more complex images; clearly, some further refinement is needed.
Processing a 640×480 image on an iPhone 5 takes about 0.3 s.
P.S. I will answer questions in the comments; please report grammatical errors in a private message.



