
ABBYY Labs - what's new?

Time to sum up the first ABBYY Labs project. As a reminder, the goal of the project is to let students work on problems that are closer to real-world ones than those in the usual curriculum, and at the same time to “immerse” them in the appropriate environment: a working IT company where actual development takes place.

Students of the Faculty of Innovation and High Technologies at MIPT were tasked with recognizing formulas and exporting them to TeX (the initial formulation of the problem is here).

What did they do?

ABBYY Mobile OCR Engine for Windows was used to recognize the characters themselves. It is, of course, a great tool, but it is not at all tailored to formulas. The students split into two groups: one worked on analysis, the other on recognition and export.
Analysis

In OCR, image analysis is the process of finding blocks that contain different kinds of information from the OCR point of view. “There is a picture here and text there,” says the analyzer. The original engine does not see formulas: it simply has no such concept. This is what the students had to try to fix.

To control the quality of their work, the students used a clever trick. For a large number of documents in TeX format, they obtained the location of every formula on each page. How? They took the sources of mathematical articles and compiled them with a TeX compiler into images; these were the images to be recognized. Then they “spoiled” the same sources: the background of the formulas was made black, and the text outside the formulas was made white. These were compiled into images as well. The spoiled images contained only the black rectangles of the “formulas”, which are easy to detect, and this gave the “true” coordinates. After that, the original, “unspoiled” images were run through the analyzer, and the coordinates of each formula rectangle it found were compared with the reliably known coordinates of the corresponding black rectangle. And there you have a ready-made automatic quality control system!
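The comparison step can be illustrated with a minimal sketch. It assumes the “spoiled” page has already been turned into a grid of 0/1 pixels (1 = black); the function names, the box format and the deviation metric are illustrative choices, not the students' actual code.

```python
from collections import deque

def find_black_boxes(mask):
    """mask: list of rows, each a list of 0/1 values (1 = black pixel).
    Returns one bounding box (left, top, right, bottom), inclusive,
    per 4-connected black region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill over one black region.
            left, top, right, bottom = x, y, x, y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

def max_edge_deviation(found, truth):
    """Largest absolute difference between corresponding edges of two boxes."""
    return max(abs(f - t) for f, t in zip(found, truth))

# A tiny "spoiled" page with one black formula rectangle.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
truth = find_black_boxes(page)[0]
analyzer_box = (0, 1, 3, 3)  # hypothetical analyzer output for the same formula
print(truth, max_edge_deviation(analyzer_box, truth))
```

The smaller the deviation over the whole document set, the better the analyzer; any other box-similarity measure could be substituted.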

Selecting a rectangle that contains exactly the formula is the hardest part of the project. To solve it, the students identified features with weights (that is, the significance of each feature); varying the weights can improve (or degrade :) the quality of analysis. The features were of several kinds:
• Vocabulary features looked for mathematical functions such as sin, lim, f(x) in the previously recognized image.
• Geometric features looked for “sticks” that are too wide or too tall: fraction bars, integrals, parts of matrices.
• Text-parameter features tried to use line spacing (if a formula has crept into the text, the line spacing is larger).
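To make the idea of weighted features more concrete, here is a rough sketch of how three such features could be combined into a single score. The word list, the thresholds and the weights are all assumptions made up for illustration; the post does not give the real ones.

```python
import re

# Assumed vocabulary of math function names; the real list is not given in the post.
MATH_WORDS = {"sin", "cos", "lim", "log"}

def vocabulary_feature(text):
    """1.0 if the recognized text contains a known math function name."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return 1.0 if any(token in MATH_WORDS for token in tokens) else 0.0

def geometric_feature(boxes, line_height):
    """1.0 if any element is unusually wide or tall for the line.
    boxes: (left, top, right, bottom); the factors 3 and 1.5 are arbitrary."""
    for left, top, right, bottom in boxes:
        width, height = right - left, bottom - top
        if width > 3 * line_height or height > 1.5 * line_height:
            return 1.0
    return 0.0

def spacing_feature(gap, typical_gap):
    """1.0 if the gap to the neighbouring line is clearly larger than usual."""
    return 1.0 if gap > 1.3 * typical_gap else 0.0

def formula_score(features, weights):
    """Weighted sum of feature values; a higher score means 'more formula-like'."""
    return sum(w * f for w, f in zip(weights, features))

features = (
    vocabulary_feature("lim x"),
    geometric_feature([(0, 0, 100, 2)], line_height=20),  # a long fraction bar
    spacing_feature(gap=10, typical_gap=10),
)
weights = (0.5, 0.3, 0.2)  # hypothetical weights
print(features, formula_score(features, weights))
```

A block would then be treated as a formula when its score exceeds some threshold, which is itself one more parameter to tune.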

What came out of it:
• These are the line rectangles found by the regular Mobile OCR Engine (which, as a reminder, is not adapted to formula recognition):
• And this is the formula rectangle found by the analyzer that is aware of the “formula” object:
Surely there is a set of feature weights that gives the best results. With a ready-made quality control system at hand, finding it is a matter of time, because the search for the best weights can be largely automated.
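One simple way such a search could be automated is a brute-force grid search over weight combinations, scored against the known answers. The sketch below assumes three features per block, a fixed decision threshold and a handful of made-up labelled samples; none of this comes from the actual project.

```python
from itertools import product

def accuracy(weights, threshold, samples):
    """Share of samples where 'score >= threshold' matches the true label.
    samples: list of (feature_values, is_formula) pairs."""
    correct = 0
    for features, is_formula in samples:
        score = sum(w * f for w, f in zip(weights, features))
        if (score >= threshold) == is_formula:
            correct += 1
    return correct / len(samples)

def grid_search(samples, steps=(0.0, 0.5, 1.0), threshold=1.0):
    """Try every combination of weights from 'steps' and keep the best one."""
    best = max(product(steps, repeat=3),
               key=lambda weights: accuracy(weights, threshold, samples))
    return best, accuracy(best, threshold, samples)

# Made-up labelled blocks: (vocabulary, geometric, spacing) -> is it a formula?
samples = [
    ((1, 1, 0), True),
    ((1, 0, 0), False),
    ((0, 1, 1), True),
    ((0, 0, 1), False),
]
best_weights, best_accuracy = grid_search(samples)
print(best_weights, best_accuracy)
```

With more features or finer steps the grid grows quickly, so a random or gradient-free search would be a natural replacement.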

Recognition

One of the tasks was to “finish” recognition: not character recognition as such, but obtaining data about the relative position of characters. A regular recognizer is oriented toward lines of text, whereas a mathematical formula can be “multi-storey” and push a line apart. Moreover, even the trivial two “floors” of a fraction can leave an ordinary recognizer stumped: most likely it will pull a jumble of “-” characters (pieces of the horizontal fraction bar), digits and letters out of the text, call all of it superscripts and subscripts, and mix them up. The students managed to improve recognition of fractions, powers, superscripts and subscripts: not only indices to the upper or lower right of a symbol, but also the limits in a summation formula.
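As an illustration of what “relative position” can mean in practice, here is a toy heuristic that classifies one symbol relative to another using only their bounding boxes. The thresholds and labels are assumptions for the sake of the example and say nothing about how the students actually solved it.

```python
def relative_position(base, other):
    """Classify 'other' relative to 'base'.
    Boxes are (left, top, right, bottom) with y growing downward."""
    base_center = (base[1] + base[3]) / 2
    other_center = (other[1] + other[3]) / 2
    base_height = base[3] - base[1]
    # Vertical shift of 'other' measured in heights of the base symbol.
    offset = (other_center - base_center) / base_height
    overlaps_horizontally = other[0] < base[2] and other[2] > base[0]
    if overlaps_horizontally:
        # Directly above or below the base, as with the limits of a sum.
        if offset < -0.5:
            return "upper limit"
        if offset > 0.5:
            return "lower limit"
        return "overlap"
    if offset < -0.25:
        return "superscript"
    if offset > 0.25:
        return "subscript"
    return "inline"

sigma = (0, 10, 10, 30)  # hypothetical box of a summation sign
print(relative_position(sigma, (2, 0, 8, 6)))      # small box above the sign
print(relative_position(sigma, (11, 5, 16, 15)))   # small box to the upper right
print(relative_position(sigma, (11, 12, 20, 28)))  # same height, to the right
```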

Export

Recognition produces a tree of operators and operands. The TeX string is built while traversing this tree. There were no problems with this part.
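A tree traversal of this kind might look roughly like the sketch below. The Node class and the set of node kinds ("symbol", "row", "frac", "sup", "sub") are hypothetical; the post does not describe the real tree format.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "symbol", "row", "frac", "sup" or "sub"
    value: str = ""                # used only by "symbol" nodes
    children: list = field(default_factory=list)

def to_tex(node):
    """Recursively convert an operator/operand tree into a TeX string."""
    parts = [to_tex(child) for child in node.children]
    if node.kind == "symbol":
        return node.value
    if node.kind == "row":
        return " ".join(parts)
    if node.kind == "frac":
        return "\\frac{" + parts[0] + "}{" + parts[1] + "}"
    if node.kind == "sup":
        return parts[0] + "^{" + parts[1] + "}"
    if node.kind == "sub":
        return parts[0] + "_{" + parts[1] + "}"
    raise ValueError("unknown node kind: " + node.kind)

def sym(value):
    return Node("symbol", value)

# (x^2 + 1) over y_i
tree = Node("frac", children=[
    Node("row", children=[Node("sup", children=[sym("x"), sym("2")]), sym("+"), sym("1")]),
    Node("sub", children=[sym("y"), sym("i")]),
])
print(to_tex(tree))
```

Because each node only needs the strings of its children, supporting a new operator means adding one more branch to the traversal.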




Impressions


The members of the Math OCR group (those very students) are satisfied with the results. They mastered the development tools and gained programming experience that, in their opinion, differs from what they get in the usual educational process. In our view, the students turned out to be a well-coordinated team, and this improved the quality of their work.

This, of course, is not the end of the ABBYY Labs project. Expect news next academic year!

Source: https://habr.com/ru/post/136506/

