Recognition of document images using the roulette algorithm

V.A. Malykh, D.L. Sholomov, V.V. Arlazarov

To achieve good quality recognition of critical fields on forms, you must use additional information. Often, for this purpose, a check digit or other redundant information is specially entered into the format of the recognized field.

This article proposes a universal "roulette" algorithm for recognizing fields with a check function.
The article also presents the results of practical testing of the proposed algorithm and, in addition, a general classification of the verification algorithms.

Introduction

In the modern formulation, the problem of recognition is, first of all, for the so-called business forms. That is, commercial property documents, primarily financial ones. An example of a business form is the waybill, which is one of the main types of documents used in trade.
')
Business forms are characterized by uneven information located in different fields of the form. First of all, the important fields are the fields of amounts, account numbers, etc. An example of a critical field is the passport number in a form that uses passport information.

Various methods are used to improve the recognition of critical form fields. In particular, methods are used with the introduction of additional redundant information in the data. A well-known example of such a method from the field of information theory is Hamming codes [ 3 ]. A number of methods in the field of text recognition are proposed in [ 1 ].

As applied to the recognition problem, there is a class of fields containing additional information in its structure, which can serve to verify the recognition correctness. And also to correct errors, if such a task is set.

It is possible to divide the use of additional information conditionally into two types - corrective and rejection checks. A rejection check is characterized by the use of predefined values for matching (for example, a widespread dictionary check). In this case, in the absence of the value obtained in recognition in the dictionary, we make a decision on the incorrectness of recognition.

A corrective check is different from a rejection in that we can try to restore an incorrectly recognized value.

For each character, there are recognition alternatives. You can check the value by replacing one (or several) characters with its alternative. Such a method applied to a value without control data is much less effective - because we are actually trying to guess what was recognized incorrectly. Due to the fact that the probability of error, first of all, depends on the symbol itself, it is impossible to draw an unequivocal conclusion as to which of the symbols was recognized incorrectly on the basis of general considerations. On the other hand, having control information, we can verify the correctness of replacing a symbol with its alternative.

Due to the fact that the control value algorithm is chosen so that the close values of the main data correspond to the substantially different control data, and taking into account the small probability of error, we come to what can restore the original data with a large degree of confidence.

Such additional information can be expressed in any form, but the so-called checksums are most common.

Mathematical formulation of the problem

The formulation of the recognition problem in the most general form is given, for example, in [ 4 ]. The article uses a narrow formulation of the problem from [ 5 ]. The problem of recognition with correction is reduced to the enumeration of the elements of the vector of alternatives

for each character

from the word

. For each set

where

vector element

corresponding to

to th recognizable character which we will name interpretation, its transformation to linear sequence which is exposed to the corresponding is made

The total number of possible interpretations is given by the formula

where

- the number of characters in the word

.

Already for 2 variants for each character a word 15 characters long, this formula gives 32768 interpretation variants, which, with a rather complex verification function

, can lead to long delays in recognition. But, as experience of practical application shows, most of the words are recognized when checking one variant for each character, i.e. for the word length of characters you need to consider only about 15 recognition options.

Adjustment algorithm

Algorithm proposed for rejection and / or recovery of data with control values.

Due to the fact that the probability of an error in any symbol is the same, the algorithm does not distinguish between the control and the ordinary bits. The algorithm successively replaces the alternatives, combining them for all symbols, until the combination of alternatives satisfies the test used. Due to the complexity of the check digit check algorithm, it is possible to significantly reduce the likelihood of incorrect recognition.

The principle of the algorithm is reduced to a sequential enumeration of the interpretation of words

and applying checks to them

. When describing an algorithm using a pseudo-code word

labeled RecognitionResult, and the verification function

labeled Test.

Roulette(Test, RecognitionResult) CharCounter[RecognitionResult.Length] for i = 0 to RecognitionResult.Length charCounter[i] = 0 while Test(RecognitionResult) is not true if UpdateCounter(CharCounter, RecognitionResult, 0) is not true break else for i = 0 to RecognitionResult.Length //    //       RecognitionResult.Char[i].Alt[0] = RecognitionResult.Char[i].Alt[CharCounter[i]]

The algorithm uses the helper function UpdateCounter, with the help of which the iteration of interpretations is directly performed:

 UpdateCounter(CharCounter, RecognitionResult, i) if CharCounter[i] < RecognitionResult.Char[i].Length) CharCounter[i] = CharCounter[i] + 1 return true else if i < RecognitionResult.Length - 1 CharCounter[i] = 0 return UpdateCounter(CharCounter, RecognitionResult, i+1) else return false

The algorithm, called Roulette, takes two parameters as input:

the recognition result presented as a vector of vectors of alternatives of character recognition, i.e. for each of the symbols there is a vector of alternatives; it is labeled RecognitionResult;
a check function that accepts a single value as input and outputs a binary result of passing; labeled Test.

RecognitionResult contains a Char array and a Char array length, specified as a Length. The Char array contains in each of its elements an array of Alt character recognition alternatives. Each element of the Char array contains the length of the Alt alternatives array, specified as Length.

It is assumed that the test function Test accepts only the RecognitionResult recognition result and performs a check on the default values of the elements of the Char array in it.

Practical application example

Consider an example of a checksum for an TIN and its corrective check.

The algorithm for calculating the checksum is as follows (for a 12-digit code):

Step 1. The check number n2 is the remainder of dividing by 11 the sum of the digits of the number multiplied by the corresponding coefficients from table 1 (from the row “calculation of the check number n2”). If the remainder is 10, then n2 = 0.

Step 2. The check number n1 is the remainder of dividing by 11 the sum of the digits of the number multiplied by the corresponding coefficients from table 1 (from the row “calculation of the check number n1”). If the remainder is 10, then n1 = 0.

Table 1.

	k12	k11	k10	k9	k8	k7	k6	k5	k4	k3	k2
calculation of the check number n2 for 12 digits TIN	7	2	four	ten	3	five	9	four	6	eight
calculation of the control number n1 for 12 digits TIN	3	7	2	four	ten	3	five	9	four	6	eight
calculation of the control number n1 for 10 digits TIN			2	four	ten	3	five	9	four	6	eight

In the case of a 10-digit TIN, there is one check digit, in the case of a 12-digit one, two check digits. The coefficients for calculating the check digit are presented in Table 1.

An example of an image entering the recognition algorithm:

Image 1.

In image 1, two potential recognition problems can be identified (the list is non-exhaustive):

the third position "3" is interpreted as "5", in which the top line has disappeared;
The eighth position “7” is interpreted as “2” with an incomplete connecting loop.

Let the first of the described problems be realized - there was an error in recognizing the third position. Additionally, we assume that all positions except the third one are recognized uniquely. Accordingly, we have an alternative recognition in third position.

Make a check for the recognized value: "5253000796"

From which it is clear that the checksum has not converged.

Now we try to replace the characters with their alternatives, that is, according to our assumption, now we check the value “5233000796”

Thus, we were able to correct the incorrectly recognized value using the check digit in it.

Practical application of the algorithm

The results of the work have been applied in the industrial system Cognitive Forms, where the described algorithm is used for additional testing on the types of TIN, OGRN and SNILS fields in various real-life business forms. The Cognitive Forms system is described in article [ 2 ].

Additionally, testing was performed on a stand with a volume of 480 images (hospital sheets), where the recognition of fields of the types described above was checked. The test results are presented in table 2.

Table 2.

Total number of fields	3840
The number of correctly recognized fields without connected algorithm (b / a)	3510 (91.41%)
Number of incorrectly recognized fields b / a	330 (8.59%)
Of them wrong no doubt	171 (4.45%)
Number of correctly recognized fields with connected algorithm (c / a)	3576 (93.12%)
Number of incorrectly recognized fields with / a	264 (6.88%)
Of them wrong no doubt	55 (1.43%)

The study shows that using the described algorithm, we managed to improve the quality of recognition of these fields by 20% (from 91.41% to 93.12%) and, moreover, more than tripled (from 4.45% to 1.43%) to reduce the number of incorrectly recognized fields.

Conclusion

In the future, it is planned to improve field recognition using this algorithm, as well as expand its application to other types of fields.

Literature:

1. Sholomov D.L. Syntax methods of context processing in text recognition tasks. - M., 2007.

2. Arlazarov V.V., Postnikov V.V., Sholomov D.L. Cognitive Forms - a system for mass input of structured documents. // "Management of information flows" Collection of works of the Institute of System Analysis of the Russian Academy of Sciences. / M., URSS, 2002.

3. Peterson W., Weldon E. Codes that correct errors: Trans. from English M .: Mir, 1976.

4. Khaikin S. Neural networks. - M .: Williams, 2006.

5. Arlazarov V.V. Structuring of visual representations of the information environment and methods for determining the reliability of recognition. - M., 2004.

One of the authors of the article V.Malykh is present at Habré - madrugado

Source: https://habr.com/ru/post/209830/

All Articles