
We have recently written quite often on this blog about our recognition technologies that run on mobile devices and recognize photos taken with their cameras. Now we are moving on and learning to work not with photos but with a video stream. Today we want to explain in more detail what this means and where text recognition from a video stream can be useful in everyday life.
By the way, we are currently expanding the team that is building a product for text recognition from a video stream on smartphones. If you are an Android or iOS developer with experience writing high-load applications and you want to develop new technologies with us, hurry to respond to the vacancy.
About video streaming and recognition
To begin with, let's define which video stream we are working with.
A video stream is a sequence of frames received from the device's camera, in other words, what we see on the smartphone screen when we launch the standard Camera application.
Photo of the menu fragment

Frames from a video stream with a menu fragment
If you compare a single frame of a video stream with a photo, you can see that its quality is lower: the frame has a lower resolution, the image is often out of focus and blurred, and digital noise is present. Such defects are not surprising: we are not robots and cannot hold the phone absolutely still :), and besides, the lighting is often insufficient because the flash is not firing. This makes recognizing a video stream much harder than recognizing a photo, but the benefits to be gained are worth the effort:
Improving the quality of recognition
We have already talked about our technologies that make it possible to quickly check whether a photo is suitable for recognition. With a video stream you can go even further and immediately pick from the stream the frames best suited for processing. Yes, many frames are of worse quality than a photo, but some may turn out better, and there are plenty to choose from.
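As a rough illustration of picking the most suitable frame, here is a minimal Python sketch. It assumes frames arrive as 2D lists of grayscale values and uses the variance of a simple Laplacian as a sharpness score. Both the frame representation and the scoring method are assumptions made for this example, not a description of the actual product.

```python
from statistics import pvariance


def laplacian_variance(frame):
    """Sharpness score: variance of a 4-neighbour Laplacian over interior pixels.

    `frame` is assumed to be a 2D list of grayscale values, at least 3x3.
    Blurred frames have weak edges, so their score is low.
    """
    height, width = len(frame), len(frame[0])
    responses = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def pick_sharpest(frames):
    """Return the frame with the highest sharpness score."""
    return max(frames, key=laplacian_variance)
```

In a real application the score would be computed on a downscaled frame so that the selection itself stays cheap.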
In addition, a significant share of the errors in recognition results is caused by random noise, glare, defocus, etc. These defects do not repeat from frame to frame, so a character that was recognized incorrectly in one frame may be recognized correctly in the next. By aggregating text from several frames in this way, you can improve quality and achieve even better recognition results than for a photo.
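The idea of aggregating text from several frames can be sketched as a per-character majority vote. This is only an illustration under a strong assumption: the readings of the same text line are already aligned character by character, and readings of a different length are simply discarded. A real system would need proper alignment, and the function name `vote_text` is invented for this example.

```python
from collections import Counter


def vote_text(readings):
    """Combine several readings of the same line by per-character majority vote.

    Assumption: readings are aligned position by position. Readings whose
    length differs from the most common length are ignored.
    """
    if not readings:
        return ""
    common_length = Counter(len(r) for r in readings).most_common(1)[0][0]
    aligned = [r for r in readings if len(r) == common_length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*aligned)
    )
```

For example, the readings `"he1lo"`, `"hello"` and `"hellO"` each contain at most one random error, and the vote recovers the correct word from them.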
More convenient applications
Recognition from a video stream requires almost no action from the user. There is no need to press a "take a picture" button or to make sure that all the necessary text is in focus in the photo. Simply point the camera at the text and recognition starts automatically. In addition, since text processing is performed directly on the frames, the results can be displayed instantly, for example, on top of the original text (to make the results easier to verify)

or, conversely, the frame can be supplemented with special pictures, captions, etc. depending on the recognized text, i.e. you can create so-called augmented reality applications. In addition, when recognizing from a video stream, the image is not saved anywhere and does not clutter the device's memory.
All this brings joy to the user and makes the application convenient.
How can video stream recognition be useful?
In our future articles, we will describe in more detail how best to work with the video stream, but for now let's see where this technology can be applied.
Keyboard Alternative
One of the most obvious scenarios is replacing keyboard text input with input from a video stream. Probably everyone has been in a situation where they are holding a booklet with the e-mail address they need and have to type it into the browser by hand. It would be much more convenient and faster to just point the phone's camera at it.
To provide real-time processing and keep up with the constantly changing scene in the viewfinder, recognition has to be very fast. Achieving this takes a number of different measures. First, all of the device's processor cores should be put to work through parallel recognition: you can either recognize several frames taken with a slight delay in parallel, or split the processing of one frame into several stages executed in different worker threads. In addition, to increase processing speed, one has to use the various hardware acceleration facilities available on the device.
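The first option, recognizing several frames in parallel, can be sketched in Python with a thread pool. The `recognize_frame` function here is a hypothetical stand-in for a real recognition call; the sketch assumes that the real engine is native code that releases the interpreter lock while it works, otherwise a process pool would be needed to actually load several cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def recognize_frame(frame):
    """Hypothetical stand-in for a real OCR call: here it just upper-cases text."""
    return frame.upper()


def recognize_frames_parallel(frames, recognize=recognize_frame, workers=None):
    """Recognize several frames concurrently, one task per frame.

    Results are returned in the same order as the input frames.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize, frames))
```

Keeping the results in frame order makes it straightforward to combine them afterwards, for instance when aggregating text across frames.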
Instant Translator
Another task is translation. When you arrive in an exotic country, it is hard to find your way on the streets and in restaurants, since all the text on signs, in menus, etc. is in an unfamiliar language, and the locals may simply not speak English. For example, here is what you can see on the streets and in restaurants of a Chinese city:


A translator application can of course help in such a situation, but only if you manage to type all the necessary characters correctly. It is more convenient if the application lets you take a picture of the text and then recognizes and translates it automatically. However, fitting an entire restaurant menu into one photo is problematic, so you have to take several pictures. With text recognition from a video stream, you can translate instantly: the user points the camera at the text, the application recognizes and translates it behind the scenes, and on the screen replaces the source text with its translation, like this:

That is, you get a kind of augmented reality in which everything is written in a language you understand.
The need to recognize and translate text on objects around us (for example, on signs surrounded by tree foliage, or in a menu with pictures of the dishes next to the text) leads to additional difficulties, namely the need to determine where in the frame the text actually is. The methods used in FineReader for analyzing binarized images, which are tuned for processing text documents, are not suitable here. After binarization (conversion to a black-and-white image), ordinary objects often start to look like text: for example, the windows of buildings can form a whole line:

Binarized image
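To make the problem concrete, here is a minimal Python sketch of binarization with a fixed global threshold. This is an assumption for illustration only: the binarization actually used in FineReader is not described here and is certainly more sophisticated. The helper `count_dark_runs` is also invented for this example.

```python
def binarize(gray, threshold=128):
    """Convert a 2D list of grayscale values to pure black (0) and white (255)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray]


def count_dark_runs(binary_row):
    """Count consecutive runs of black pixels in one row of a binarized image."""
    runs, previous = 0, 255
    for pixel in binary_row:
        if pixel == 0 and previous != 0:
            runs += 1
        previous = pixel
    return runs
```

After binarization, a row of dark windows on a light wall and a row of dark letters on a light page both turn into a sequence of black runs separated by white gaps, which is why simple document-oriented analysis can mistake one for the other.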
To process frames of arbitrary content, one has to resort to more complex algorithms and use a special object classifier trained on sets of images of signboards and street signs. The classifier makes it possible to tell whether an image contains letters and lines of text, separating them from clutter. Such a mechanism is well described in the article.
Data capture
Another application of recognition from a video stream is extracting data from documents (identity documents, payment slips, etc.). Currently, almost all banks offer payment of housing and utility bills in their mobile applications. To make such a payment, you currently have to retype by hand long strings of digits (subscriber code, personal account number, etc.) in which it is easy to make a mistake. Automatic recognition of the data needed for the payment would simplify and speed up this process.
"Smart Camera"
One of the most interesting scenarios can be called a “smart camera”. This is the usual Camera application found on every smartphone, extended with video stream recognition. When you point such a camera at a business card, for example, it automatically creates the corresponding contact in the phone book; at a QR code, it opens the link in the browser; at an invitation to an event, it creates an event in the calendar, and so on. Users often photograph part of a document when they need to remember the information written there; a smart camera recognizes the text before the user has time to press the “take a picture” button and offers to save the recognized data as a note or reminder.
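A very simplified Python sketch of how a smart camera might choose an action from recognized text is given below. The action names and the regular expressions are assumptions made for illustration; a real application would need much more robust detection of business cards, links and invitations.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_PATTERN = re.compile(r"https?://\S+")


def suggest_action(recognized_text):
    """Map recognized text to a hypothetical action name."""
    if EMAIL_PATTERN.search(recognized_text):
        return "create_contact"   # looks like a business card
    if URL_PATTERN.search(recognized_text):
        return "open_link"        # looks like a link
    return "save_note"            # fall back to saving the text as a note
```

The fallback branch matches the scenario in which the user simply wants to remember what is written on a piece of a document.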
But keep in mind that creating a “smart camera” involves additional difficulties. Running a process as demanding as recognition continuously will quickly drain the device's battery. And practice shows that recognition is needed for no more than 10-20% of the total time the camera is in use: photos of cats and selfies are still more popular than photos of documents, business cards, and barcodes combined. Therefore, the power consumption of the camera application has to be regulated somehow. For example, you can use a special fast text detector that analyzes the frame and answers whether it contains text, and then run recognition only on the frames with text.
That is all we wanted to tell you today. If you have ideas about where else recognition from a video stream could be useful, let's discuss them in the comments.
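The gating idea can be sketched in a few lines of Python. Both `has_text` (the fast detector) and `recognize` (the expensive recognition step) are hypothetical callables supplied by the caller; the sketch only shows the control flow, not how either of them works.

```python
def process_stream(frames, has_text, recognize):
    """Run expensive recognition only on frames the fast detector accepts.

    `has_text` and `recognize` are hypothetical callables: a cheap check
    for the presence of text and a full recognition step.
    """
    results = []
    for frame in frames:
        if has_text(frame):          # cheap check on every frame
            results.append(recognize(frame))  # costly step only when needed
    return results
```

If only 10-20% of frames contain text, the expensive step is skipped for the vast majority of frames, and the power budget is spent mostly on the cheap detector.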
Olga Titova,
product department for developers