
Field-level OCR. What is it for?

We have already announced ABBYY Cloud OCR SDK. It is steadily gaining popularity: the other day the service recognized its millionth page. That seems like a good reason to improve the OCR literacy of current and future users. So, let's begin.

Today we will talk about the two types of recognition: Full Page OCR and Field-level OCR. These approaches differ not only in price; there are fundamental differences in what they are for. Unfortunately, not all new OCR developers understand these differences and are forced to learn from their own mistakes. Moreover, many large, well-known players in the Data Capture market still use a single-pass algorithm where a multi-pass one would serve better (that is, Full Page OCR instead of Field-level OCR). The reasons are mundane: the application was written many years ago, and redesigning the architecture and the UI and retraining partners would be too expensive. They pay for it with limitations in recognition quality.

The need for Field-level OCR arises the moment we want to extract some useful data from a document. In the industry this processing scenario is called Data Capture. For example, we want to extract from an invoice the name of the counterparty company, the amount to be paid, and the date by which payment must be made. Such documents contain plenty of amounts and dates, and it is important not to miss the right one. Just imagine the consequences of paying, say, the supplier's phone number instead of the invoice amount.

Usually the processing of such documents goes like this: the document type is determined, the required fields are located on it, their contents are recognized, and the results are verified and exported.

Depending on the type of documents being processed, various methods of locating the fields may be used. In the simplest case, machine-readable forms whose layouts literally "line up when held to the light", the coordinates of the fields are known in advance; all that remains is to align the scanned sheet with the template, using various anchor marks such as the black squares printed on the form.
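
In code, field extraction for such a fixed form can be sketched roughly as below. The field names, coordinates, and the Pillow-based cropping are my own illustration, not part of any particular product, and the page is assumed to be already aligned to the template by its anchor marks.

```python
from PIL import Image

# Field coordinates (left, top, right, bottom) in pixels, taken from the
# form template. The names and numbers are invented for illustration.
FORM_TEMPLATE = {
    "last_name":  (120, 310, 760, 360),
    "birth_date": (120, 420, 400, 470),
}

def cut_fields(page_path):
    """Crop each field region from a page already aligned to the template."""
    page = Image.open(page_path)
    return {name: page.crop(box) for name, box in FORM_TEMPLATE.items()}

# Each cropped field image is then sent to field-level OCR on its own.
fields = cut_fields("registered_form.png")
```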


In the case of invoices, this method no longer works. Here more sophisticated techniques are applied: the system searches for keywords such as "Invoice #" or "Due date:" and, starting from them, defines zones on the document where the required fields should be found. Prior knowledge of how this particular company's invoices are usually laid out is also used.
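
Very roughly, the keyword approach looks like this: take the words and coordinates produced by the full-page pass, find the keyword, and define a zone next to it. The word list and offsets below are invented purely for illustration.

```python
# Words from a full-page OCR pass: (text, left, top, right, bottom).
# The values are invented for illustration.
words = [
    ("Invoice", 40, 50, 120, 70), ("#", 125, 50, 135, 70),
    ("Due", 40, 120, 80, 140), ("date:", 85, 120, 140, 140),
    ("10/05/2013", 150, 120, 260, 140),
]

def zone_right_of(keyword, words, width=300):
    """Return a search zone to the right of the keyword's bounding box."""
    for text, left, top, right, bottom in words:
        if text.lower().startswith(keyword.lower()):
            return (right + 5, top - 5, right + width, bottom + 5)
    return None

due_date_zone = zone_right_of("date:", words)
# This zone (plus knowledge of how this vendor's invoices are laid out)
# is what gets sent to field-level OCR on the second pass.
```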



It is at this moment, when we have already determined the locations of the fields (or, as in the most advanced systems like FlexiCapture, several possible hypotheses about their locations), that it is time to decide how we will extract the text of those fields.

Suppose we have located the fields correctly, and the Total Amount field really contains the amount, not a phone number. We still have every chance of misplacing the decimal point in "$1,000.00", or simply missing it and paying one hundred thousand dollars instead of one thousand. That is why the extraction of such data is taken very seriously, and the question "what percentage of characters is misrecognized" gives way to the question "how much time does manual verification take". The quality of the technology here is measured not in the number of incorrect characters but in the amount of manual labor required to correct the system's recognition errors. Moreover, this depends not only on recognition accuracy itself, but also on the engine's ability to correctly say "I am not sure about this character", but that is a topic for a separate article.
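
As a toy illustration of how "percentage of errors" turns into "time for manual verification": a field is routed to an operator if it fails a format rule or if any of its characters were recognized with low confidence. The format rule, the threshold, and the data structures below are assumptions made up for this sketch.

```python
import re

# A strict format rule for dollar amounts like "$1,000.00".
AMOUNT_RE = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")

def needs_verification(text, char_confidences, threshold=0.85):
    """Send the field to an operator if the format check fails or any
    character was recognized with low confidence."""
    if not AMOUNT_RE.match(text):
        return True
    return any(conf < threshold for conf in char_confidences)

# "$1,000.00" read cleanly goes straight to the results;
# "$1.000.00" or a shaky comma lands in the manual verification queue.
```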

As I said, there is a class of applications that use only one pass. Full-page recognition happens already at the stage of determining the document type. Its text serves as the basis for deciding what type the document is (for example, by searching for keywords typical of that type), then for finding reference elements and determining the locations of the fields, and finally the same text is used to extract the data from the fields.

In our opinion, this is exactly the moment when it makes sense to recognize the specific fields once more, this time with tailored recognition parameters. By significantly narrowing the set of possible values, we can significantly improve both recognition accuracy and the detection of uncertainly recognized characters. This, in turn, has a positive effect on the system's main metric: the amount of manual labor needed to enter and process documents.

In particular, knowing which field we are about to recognize, we can set the following parameters (a rough sketch in code follows the list):

· Set a regular expression or a special language for dates, amounts, bank details, and so on.

· Set a dictionary of possible values, if the field has a limited set of text values (company name, item name, etc.).

· Set text segmentation parameters (one line or several, monospaced text or not, etc.).

· Specify the text type, if the field is printed in a special font such as OCR-A, etc.

· Apply special image processing settings (for example, if the text is printed in a different color, or if an organization's stamp or a signature is often placed over it).
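
Put together, the settings for a second, field-level pass might look something like this. The parameter names and values below are purely illustrative; they are not the actual API of ABBYY Cloud OCR SDK or any other product.

```python
# Hypothetical field-level recognition settings for an invoice.
# Regions are (left, top, right, bottom) boxes found on the first pass.
field_settings = {
    "due_date": {
        "region": (145, 115, 440, 145),
        "reg_exp": r"\d{2}/\d{2}/\d{4}",          # only dates are allowed
        "one_text_line": True,
    },
    "total_amount": {
        "region": (900, 1180, 1180, 1215),
        "reg_exp": r"\$\d{1,3}(,\d{3})*\.\d{2}",  # only $X,XXX.XX amounts
        "letter_set": "0123456789$,.",            # no letters at all
        "one_text_line": True,
    },
    "vendor_name": {
        "region": (40, 30, 600, 80),
        "dictionary": ["AJ Auto Detailing Inc.", "ACME Corp."],  # known vendors
    },
}
```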

In essence, all this tuning amounts to disabling the various kinds of automation that any high-quality OCR engine has and that almost always work correctly. But as I said, in the Data Capture scenario the quality level of "almost always works correctly" is not enough. That "almost" is why recognition results are first checked by automatic rules and then manually by operators, sometimes in several stages, since people, unfortunately, also only "almost never" make mistakes.

Let's look at an example: here is a scanned cash register receipt.



If you take a closer look, you will see that it is a real nightmare for OCR: letters stuck together and not fully printed.



And yet modern OCR technology can handle it. If you simply open it in ABBYY FineReader 11 and recognize it with the default settings, you will see that even this horror yields something usable. Overall the engine coped well, although it made quite a few mistakes. In particular, in the part that interests us most, the table with items and prices, one of the values was recognized as "$ o.ce" instead of "$ 0.00".



Why, you may ask? After all, it is obvious that this is "$ 0.00"! But why is it obvious to you? Because when you look at the image, you understand that this is a cash register receipt, that there is a table in the middle of it, and that the third column of that table contains numbers, not gibberish. How could the OCR program know this? After all, the left column may well contain strings of gibberish like "5UAMM575", so why not the third one too? The program does not go shopping, and with default settings it must be equally prepared for receipts, newspapers, and magazine articles with glamorous layouts. With a two-pass algorithm we determine that this is, broadly speaking, a receipt; in addition, we can determine its structure and remember that on receipts from AJ Auto Detailing Inc. the third column always contains amounts in the $XXX.XX format and can contain no letters. Thus, by re-recognizing only these fields with the appropriate restrictions (for example, a regular expression), we are guaranteed to get rid of errors of this kind.
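
In a real field-level pass such a restriction is applied inside the recognizer itself, but the idea can be illustrated by filtering recognition hypotheses with a $XXX.XX pattern. The hypotheses and confidences below are invented.

```python
import re

PRICE_RE = re.compile(r"^\$\d{1,3}\.\d{2}$")

# Suppose the engine returns several hypotheses for the field, with
# confidences. These values are invented for illustration.
hypotheses = [("$ o.ce", 0.41), ("$ 0.00", 0.38), ("$ O.OO", 0.21)]

def best_valid(hypotheses):
    """Pick the most confident hypothesis that matches the price format."""
    for text, _conf in sorted(hypotheses, key=lambda h: -h[1]):
        if PRICE_RE.match(text.replace(" ", "")):
            return text
    return None  # nothing valid: send the field to manual verification

print(best_valid(hypotheses))  # -> "$ 0.00"
```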

With a single-pass algorithm, we have to rely entirely on automation and use the broadest possible image processing settings, since we need to recognize not only these fields but all the rest of the text, for every possible type of document, all in one go. This is a serious limiting factor, which developers try to compensate for with cleverer post-processing and validation rules, a more efficient manual verification process, and so on. All of that matters too, of course, but if there is an opportunity to drastically reduce the number of errors right away, why not use it?

If, besides the prices, we are also interested in the item names, then by attaching the dictionary of item names available for this supplier we can re-run those fields as well, greatly improving the quality of the data. Why not attach the dictionary right away, on the first pass? Because at that point we do not even know the supplier, which means we would have to attach the dictionaries of all suppliers at once, and that can only make the result worse.
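
The effect of a per-supplier dictionary can be imitated by snapping noisy OCR output to the closest known item name, for example with fuzzy matching from the Python standard library. The item names are invented.

```python
import difflib

# Item names known for this particular supplier (invented for illustration).
supplier_items = ["Full Detail", "Wax & Polish", "Interior Shampoo"]

def snap_to_dictionary(recognized, items, cutoff=0.6):
    """Snap a noisy OCR result to the closest known item name, if any."""
    match = difflib.get_close_matches(recognized, items, n=1, cutoff=cutoff)
    return match[0] if match else recognized

print(snap_to_dictionary("Ful1 Deta1l", supplier_items))  # -> "Full Detail"
```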

By the way, processing time is the argument usually given in this dispute to explain why the second pass is not used. In fact, this argument works in favor of two-pass recognition, and here is why. During re-recognition only a negligibly small fraction of the original document's area is processed, and much of the automation is switched off in favor of explicitly specified settings. If recognition runs on a local server, the time of the additional pass is vanishingly small compared with the time of the first pass. Meanwhile, in the two-pass scenario the first pass is used only for extracting reference text and for classification, both of which are tolerant to recognition errors thanks to fuzzy matching algorithms. That text does not go directly into the results, which means the fast recognition mode can be used instead of the thorough one, cutting the time of the first pass by a factor of two or more.
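
A back-of-the-envelope calculation makes the point; the numbers below are assumptions for illustration, not measurements.

```python
# Assumed, illustrative numbers -- not measurements.
full_page_thorough = 4.0   # seconds to recognize a page in thorough mode
full_page_fast = 2.0       # seconds in fast mode (roughly twice as fast)
fields_share = 0.03        # fields cover about 3% of the page area
field_overhead = 2.0       # thorough field pass is slower per unit of area

one_pass = full_page_thorough
two_pass = full_page_fast + full_page_thorough * fields_share * field_overhead

print(one_pass, two_pass)  # 4.0 vs ~2.24: the two-pass variant is even faster
```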

In the case of recognition in the cloud, we do indeed have to make two calls to the server instead of one. But even then, the argument about achieving higher quality still stands.

Original article in English on the SDK team blog.

Andrei Isaev
Director of the Developer Products Department

Source: https://habr.com/ru/post/173571/

