This post was inspired by an interview our European office gave to a computer magazine about ABBYY FineReader and recognition technologies. Among the questions was something like this:
What were the main challenges to overcome when developing the software? Were there any particularly knotty problems?
In response, one just wants to deliver the official tirade about how different images can be: fuzzy photos, low resolution, dirty paper, ornate typefaces... In short, even knowing nothing, or almost nothing, about our technology, you could say something plausible.
And here is something to think about. From the standpoint of task complexity, poor image quality and decorative fonts are not actually that interesting: we could have said much the same thing five years ago, ten years ago, twenty. Progress is undeniable, of course; for most of our releases the well-known columnist and our old friend Sergey Golubitsky managed to find exactly those images that sat on the "front line" of our technology, so that the new version of FineReader processed them almost perfectly where the old one stumbled.
But to talk about the difficulties we still face, a small metaphor helps. Imagine what difficulties you would run into solving the following problem: a handwritten formula asking you to extract a fifth root.
Hmm... Most likely, most normal people will decide that the hard part is extracting the fifth root. Ask them "what difficulties arise in solving this problem" and they will want to tell you how to extract fifth roots by expanding the number into a series and using vector operations... But for an "artificial mind", extracting the root is almost the easiest part of the job (see the sketch after this list). The real difficulties are different:
- recognizing the handwritten digits;
- parsing the formula into its parts;
- in general, somehow guessing what needs to be done at all.
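To back up the claim that the arithmetic itself is trivial for a machine, here is a minimal sketch of our own (nothing to do with FineReader's code): a few Newton iterations extract a fifth root in a handful of lines.

```python
def fifth_root(a, iterations=50):
    """Extract the fifth root of a positive number via Newton's
    method on f(x) = x**5 - a. Trivial for a machine, which is
    exactly the point of the metaphor above."""
    x = max(a, 1.0)                      # safe starting guess for a > 0
    for _ in range(iterations):
        x = x - (x**5 - a) / (5 * x**4)  # Newton step
    return x

print(fifth_root(243.0))                 # -> 3.0, since 3**5 = 243
```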
Notice that in this example, the easier a part of the task is for a human, the harder it is for the "artificial mind". You can't put that in marketing materials, you can't say it in an interview... "We now make fewer mistakes parsing the formula"? "Hee-hee, any fool can parse a formula; what you're doing is trivial."
And the last point hides a special snag. How do you even figure out what needs to be done? And who can promise that we will ever have to solve this particular problem again?
This brings us to the connection between the metaphor and recognition tasks. The very phrase "recognition problem" is a trap, because once the task of recognizing something has been set, it is already a well-formulated question. And as we know, asking the right question is already an essential part of solving a problem.
As an example, consider correcting skew in photographs of text. At first glance it does not look complicated: find what appear to be lines of text, determine the angle of inclination, rotate the image, and that's it.
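To make the naive version concrete, here is a minimal sketch (our illustration, assuming NumPy and SciPy; FineReader's actual algorithm is of course different): try candidate angles and keep the one at which the horizontal ink profile is sharpest, meaning rows of ink separate cleanly from rows of white.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-10.0, 10.5, 0.5)):
    """Return the candidate angle (degrees) at which text rows align
    best, measured as the variance of the row-ink profile: aligned
    lines give sharp peaks, skewed ones a smeared profile.
    `binary` is a 2-D array with text pixels == 1."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)    # ink per pixel row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Rotating by the negative of the returned angle "fixes" the page, as long as there is exactly one text direction, which is precisely the assumption the next paragraph breaks.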
The trouble is that one image often contains text running in several directions. Which of them matters to the person who set the task? In the latest versions of FineReader, the skew-correction algorithm estimates how many distinct text-line directions can be found, decides which of them is the most informative, and rotates the document so that the recognizer can read the "main" text. Unfortunately, what it considers "main" is sometimes not what the user had in mind.
Besides skew, an image may also have perspective distortion, and then no single rotation solves the problem. Photographs of two-page documents, passports for example, may have a different perspective distortion on each page. Curved lines on a magazine spread can arise both from the curvature of the sheet and from the "artistic" layout; in the latter case, should we straighten them out and try to recognize them at all? At this point you can no longer get by without more complex document models, and to decide which model fits a given image you need an appropriate classifier. We have already learned to fight perspective distortion; for the remaining cases the road to perfection is still long.
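For the simple single-page case, the standard remedy is a homography: once the four page corners are found (the genuinely hard part, which we gloss over here), OpenCV can flatten the quadrilateral in two calls. A hedged sketch, with the corner array assumed to come from some detector:

```python
import cv2
import numpy as np

def unwarp_page(image, corners):
    """Map a perspective-distorted page, given its four corners in the
    order top-left, top-right, bottom-right, bottom-left (a 4x2 float
    array), onto an upright rectangle. Corner detection is assumed done."""
    tl, tr, br, bl = corners
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]],
                   dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
    return cv2.warpPerspective(image, matrix, (width, height))
```

The catch, as the paragraph above says, is everything this sketch takes for granted: one flat page, four known corners, no curvature.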
Next comes page segmentation: picking out columns of text, pictures, tables... How do you tell text from a picture? Nothing could be easier, there are over 9000 articles on the subject at our service. So we implement it, everything works fine, we run it, and we see an image like this:
In earlier versions of FineReader, our clever text classifier happily reported a pile of well-organized text, and the table analyzer happily drew a very nicely segmented table over it... But the person translating a database tutorial needs none of that: in this case what's wanted is simply a screenshot of MS Access as an example, and nobody was going to recognize the table or somehow use its data. In recent versions we taught FineReader not to touch the contents of the screenshot and to leave it as a picture.
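To give a flavor of what those "over 9000" articles propose, here is a toy connected-component heuristic (assuming OpenCV; our illustration, not FineReader's classifier): text regions break into many components of similar height, while pictures break into a few irregular blobs.

```python
import cv2
import numpy as np

def looks_like_text(region):
    """Crude text-vs-picture test on a binarized region (text pixels
    == 255): text yields many connected components with consistent
    heights; pictures yield few components or erratic heights."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(region, connectivity=8)
    heights = stats[1:, cv2.CC_STAT_HEIGHT]    # label 0 is the background
    if len(heights) < 5:                       # too few blobs to be text
        return False
    spread = np.std(heights) / (np.mean(heights) + 1e-6)
    return spread < 0.5                        # uniform heights => text
```

A screenshot of a table passes this test with flying colors, which is exactly why the engine had to learn about screenshots as a separate kind of object.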
Or here is another surprise:
Clearly, this is text. Several blocks of text. But even a not-very-attentive reader will notice that there is something else here.
Properly, of course, this diagram should be taken apart: all the text recognized, the frames drawn with vector commands. But bad luck: we don't know in advance what format the result will be saved in and which commands will be available to us. So it is often more reasonable to save the diagram as a picture. Yes, it is text, and yes, we keep it as a picture. At which point the artificial intelligence falls into complete confusion, gives up on these humans, and goes off to routinely compute fifth roots.
The question is: what do we do about it?
What matters here is not even the answer to that question. Even if we figure it out, we will ask ourselves another one:
How often will we have to solve problems like this? Yes, the same question as in the formula example. And the answer is not always simple.
For us, it turned out that among page-segmentation problems, the share caused by our inadequate handling of screenshots and diagrams was quite large. So in the new versions we "explained" to our engine that such objects exist, how they are structured, and how they differ from tables, text blocks, and plain pictures. Had we decided the problem was rare and "exotic", we would most likely have done nothing, which would have enraged everyone who processes pages full of screenshots and diagrams.
Do you think such problems arise only with unusual objects? Not at all: within the same segmentation task there are plenty of examples involving nothing but text. Say, most of the currently popular page-segmentation algorithms (a side question: would it be interesting if we wrote a small review of them?) hold that if two groups of text sit at a respectful distance from each other, they belong in different blocks. Logical, damn it: don't merge two columns into one block. "You're most welcome," the universe tells us:
"Well...", says the developer, "let's make sure the numbers of a numbered list don't fall off... Say, a lone column of digits will try to attach itself to the normal text nearby." "Attaboy!", says the universe:
Here we cannot glue the numbers (they are line numbers) to the text without destroying its coherence and its usefulness, the very thing the user launched our program for. But never mind: we spat on our hands, and in the next version we coped with this too.
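For reference, here is a toy version of the whitespace rule that these counterexamples keep defeating (our sketch, assuming NumPy; real layout analyzers are far more involved): split the page wherever a run of blank pixel columns is wide enough.

```python
import numpy as np

def split_columns(binary, min_gap=30):
    """Split a binarized page into vertical blocks wherever a run of
    blank pixel columns is at least `min_gap` wide: the textbook rule
    'text groups far apart belong to different blocks'."""
    has_ink = binary.sum(axis=0) > 0       # which pixel columns contain ink
    blocks, start, blank_run = [], None, 0
    for x, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = x                  # a new block begins here
            blank_run = 0
        else:
            blank_run += 1
            if start is not None and blank_run >= min_gap:
                blocks.append((start, x - blank_run + 1))  # close the block
                start = None
    if start is not None:
        blocks.append((start, len(has_ink)))
    return blocks                          # list of (left, right) ranges
```

A lone column of line numbers sits at exactly such a distance from the code it numbers, so this rule dutifully tears it off into a separate block.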
So what is all this about? It seems that one of the challenges for us, the makers of a recognizer, is the variability and, as a consequence, the under-specification of the tasks before us. One might even argue that this is one of the important differences between artificial and natural intelligence: an under-specified task is a familiar environment for the natural mind and a curse for the scientist and the engineer. Coming back to the recognition system, one of its directions of development is precisely a more thorough and verified formulation of the problem, where most of the subtasks have either long been solved or pose no particular difficulty. The only open question for us is which of these tasks our users most need solved at the moment. And it so happens that these are not "program features" but the development of the technology itself.
(c) logicview