
Classification of documents according to their appearance and content

Today we will explain how and why you can use classifiers to separate and sort documents by type.

At ABBYY, alongside programmers, linguists, analysts and other useful people, there are many classifiers. Of course, classifiers are not people but algorithms, yet they do work without which high-quality text recognition would be impossible. They are indispensable at many different stages of document processing, from finding the regions of an image that contain text to recognizing individual characters in lines of text.

But the classifiers' work does not end there. They can also process batches of documents and sort them into “piles” by document type.
Imagine the work of agents at a car insurance company. For each accident, an agent must go to the scene to verify the details of the client and the insured event. As a rule, the agent is armed with a camera and photographs the car, the CTP (compulsory third-party liability) insurance policy, a receipt, a certificate from the traffic police and other necessary documents.

Then, back at the office, the data from the camera is uploaded to the insurance company's system and sent for processing. The agent connects the camera to the system, which takes all the pictures and, using the classifier, automatically identifies and labels them: this is a photo of the car, this is a photo of the receipt, this is a certificate, and so on. The photos of documents are then sent straight to the people responsible for processing them.

Using the classifier here makes the system a little smarter: when photographing a document, the agent does not have to state separately that it is a document or say what kind it is.

Or another case.

You bought a ticket to Ibiza and, damn it, you have to go get a visa. You will be asked for a stack of documents a centimeter thick showing that you are squeaky clean and have no plans to suddenly become an illegal migrant. Then the clerk at the window will spend a long time shuffling through your documents and putting them in the correct order, so that they can be passed on in the right sequence to the consulate, where your visa will be issued.
With a classifier, you can take the whole stack, put it in a sheet-fed scanner, scan it quickly, and get a packet that is already split by document type (application, bank certificate, employment certificate, passport copy, etc.) and sorted, a pleasure to look at and process. The result is shorter waiting times for customers and lower processing costs: advantages all round.
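
The sorting step in this scenario can be sketched in a few lines of Python. This is only an illustration: `sort_packet`, `REQUIRED_ORDER`, the type labels and the `classify` callable are hypothetical names standing in for a trained classifier, not part of any ABBYY API.

```python
from collections import defaultdict

# Hypothetical order in which the packet should be assembled.
REQUIRED_ORDER = ["application", "passport_copy", "bank_certificate", "employment_certificate"]


def sort_packet(pages, classify, required_order=REQUIRED_ORDER):
    """Group scanned pages by predicted type and return them in the required order.

    `pages` is any sequence of page objects; `classify` is a callable that
    returns a type label for one page (a stand-in for a trained classifier).
    Pages of unrecognized types are returned separately for manual review.
    """
    by_type = defaultdict(list)
    for page in pages:
        by_type[classify(page)].append(page)

    # Pages of known types, in the order the packet requires.
    ordered = [page for doc_type in required_order for page in by_type.get(doc_type, [])]
    # Everything else is set aside instead of being silently dropped.
    unknown = [page for doc_type, group in by_type.items()
               if doc_type not in required_order for page in group]
    return ordered, unknown
```

Returning the unrecognized pages separately is a design choice of this sketch: a human can then look only at the few pages the classifier could not place.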

And the last example. Imagine that the camera app on your phone becomes a little smarter. You photograph something, and the phone analyzes the image and suggests actions based on what is in the frame. If it is a picture of your pet, you are prompted to add it to your library or share it with friends. If it is a business card, you are offered to save the contact details to your address book. And if you photograph a receipt and it turns out to be from a restaurant, the phone suggests how much of a tip to leave and records the amount in your expense-tracking app.

This is not a scene from the future but something available today, in the latest version of our technologies.

So, if you need to find a particular document in a set or, conversely, work out what kind of document is in front of you, you can use a classifier that separates documents by type.

Now a few words about how the classifier works.

For all this to work, the classifier must first be trained. You assemble a small database of documents representing each type you are going to identify. If documents of the type you want to recognize are very similar to one another, for example copies of the same questionnaire filled in by different people, a single document of that type is enough for training.

You train the classifier on this database. Then you take a second database, check the classifier's performance on it, and, if you are satisfied with the results, send the classifier “into battle”.
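
The check on a second database might look roughly like the following Python sketch. The names `evaluate` and `ready_for_production`, the `classify` callable and the 0.99 acceptance threshold are all assumptions made for illustration; the right threshold depends on your own task.

```python
def evaluate(classify, labeled_docs):
    """Return overall accuracy and per-type error counts on a held-out set.

    `classify` maps a document to a predicted type label;
    `labeled_docs` is a list of (document, true_type) pairs that were
    NOT used for training.
    """
    if not labeled_docs:
        raise ValueError("validation set is empty")
    errors_by_type = {}
    correct = 0
    for doc, true_type in labeled_docs:
        if classify(doc) == true_type:
            correct += 1
        else:
            # Count the mistake against the type the document really has.
            errors_by_type[true_type] = errors_by_type.get(true_type, 0) + 1
    return correct / len(labeled_docs), errors_by_type


def ready_for_production(classify, labeled_docs, min_accuracy=0.99):
    """Decide whether to deploy; the default threshold is an illustrative assumption."""
    accuracy, _ = evaluate(classify, labeled_docs)
    return accuracy >= min_accuracy
```

The per-type error counts show which document types need more training samples when the overall result is not good enough.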

During both training and operation, the classifier uses a set of features that help separate documents of one type from documents of another. All features can be divided into graphical and textual.

Graphical features are good at separating groups of documents that look very different from each other. Roughly speaking, if you can tell a document's type when looking at it from too far away to read the text, graphical features will work well.

For instance, graphical features are good at separating documents with continuous text from those without it, such as letters from payment receipts. They look at the size of the image, the density of color in different parts of it, and other characteristic elements such as vertical and horizontal lines.
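
To make this concrete, here is a minimal Python sketch of such features computed from a grayscale image given as a plain list of pixel rows. The function name, the 2x2 grid and both thresholds are assumptions chosen for illustration; they are not the features the actual product uses.

```python
def graphic_features(image, grid=2, dark_threshold=128, min_line_fraction=0.5):
    """Compute simple layout features from a grayscale image.

    `image` is a list of rows, each a list of 0-255 intensities (0 = black).
    Returns a dict with the aspect ratio, the fraction of dark pixels in each
    cell of a grid x grid partition, and counts of mostly-dark rows and
    columns. All thresholds are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    dark = [[pixel < dark_threshold for pixel in row] for row in image]

    features = {"aspect_ratio": width / height}

    # Ink density in each cell of the grid.
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [dark[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features[f"density_{gy}_{gx}"] = sum(cell) / len(cell) if cell else 0.0

    # A pixel row or column counts as a line if at least min_line_fraction
    # of its pixels are dark (so a thick line is counted once per pixel row).
    features["horizontal_lines"] = sum(
        1 for row in dark if sum(row) >= min_line_fraction * width)
    features["vertical_lines"] = sum(
        1 for x in range(width)
        if sum(dark[y][x] for y in range(height)) >= min_line_fraction * height)
    return features
```

A form with ruled tables would typically score high on the line counts, while a letter would show fairly even ink density and few lines.
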
If the documents look alike, or one group cannot be told apart from another without reading the text, textual features help. They are very similar to those used in spam filters and use characteristic words to determine whether a document belongs to a particular type. Textual features make it easy to separate letters from contracts, or receipts from business cards.
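
A classic spam-filter technique of this kind is naive Bayes over word counts. The sketch below is a generic textbook version in Python, shown only to illustrate the idea of characteristic words; the class and method names are hypothetical, and nothing here claims to be the algorithm used in the product.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing.

    One extra vocabulary slot is reserved for words never seen in training.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # type -> word -> count
        self.doc_counts = Counter()              # type -> number of training docs
        self.vocabulary = set()

    def train(self, text, doc_type):
        words = tokenize(text)
        self.word_counts[doc_type].update(words)
        self.doc_counts[doc_type] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable type, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_type, best_score = None, -math.inf
        for doc_type, n_docs in self.doc_counts.items():
            counts = self.word_counts[doc_type]
            denom = sum(counts.values()) + len(self.vocabulary) + 1
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_type, best_score = doc_type, score
        return best_type
```

Training on one sample letter and one sample receipt is already enough for this toy version to route a text containing "sincerely yours" to letters and one containing "payment total" to receipts.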

Textual features also help separate documents of a similar type that differ in the value of one or several fields. For example, receipts from McDonald's and Teremok look very similar, but if you treat them as text, the differences are obvious.
As a result, for each training set the classifier gives more weight to those textual or graphical features that best separate the documents in that set by type.
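
One simple way to picture such weighting is a Fisher-style ratio: a feature is useful if its average value differs a lot between types but varies little within each type. The Python sketch below is an assumption-laden illustration of that idea (the function name, the input format and the normalization are all hypothetical), not a description of how the product computes its weights.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def feature_weights(samples, epsilon=1e-9):
    """Weight each feature by how well it separates the document types.

    `samples` is a list of (features, doc_type) pairs, where `features` is a
    dict of numeric values sharing the same keys. The score for a feature is
    the variance of its per-type means divided by its average within-type
    variance (a Fisher-style ratio); scores are normalized to sum to 1.
    """
    by_type = defaultdict(list)
    for features, doc_type in samples:
        by_type[doc_type].append(features)

    names = list(samples[0][0].keys())
    scores = {}
    for name in names:
        # Values of this feature, grouped by document type.
        groups = [[f[name] for f in docs] for docs in by_type.values()]
        between = pvariance([fmean(g) for g in groups])
        within = fmean([pvariance(g) for g in groups])
        scores[name] = between / (within + epsilon)

    total = sum(scores.values())
    if total == 0:
        # No feature separates the types; fall back to equal weights.
        return {name: 1 / len(names) for name in names}
    return {name: score / total for name, score in scores.items()}
```

With a training set where receipts have many ruled lines and letters have almost none, while ink density is the same for both, nearly all the weight goes to the line count.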

In our own tests and in our clients' tests, the document type classifier performs quite well. It can learn from as little as one image per type and classifies documents at up to 120 pages per minute per processor core with an error rate below 1%. We really like it. We hope you will like it too once you see it working in real-world conditions.

All the scenarios described in this article can be implemented using the classifier available in ABBYY FineReader Engine 11. If you have other scenarios in which a classifier could help solve your problem, please contact us. We will try to help.

Vasily Panfyorov
Products for Developers Department

Source: https://habr.com/ru/post/181398/

