Cognitive PDF/A: a technology for digitizing text documents for publication on the Internet and long-term archiving

Hi Habr!

We continue our series of posts on the optical character recognition (OCR, ICR) and document understanding technologies developed by Cognitive Technologies. Today's post is about Cognitive PDF/A, a technology for digitizing text documents.

In business, paper documents often have to be scanned for subsequent mailing or archival storage. With high-quality scanning, the resulting images are often quite large. For example, an A4 document scanned in color at a resolution of 300 DPI takes about 25 MB uncompressed. Files this large are inefficient to keep in electronic archives, so technologies for compressing the resulting images are attracting more and more interest. A single classical image compression method (JPEG, RLE, Deflate, etc.) is a poor fit, because in the general case a document can contain both monochrome text and full-color graphic areas. Lossless algorithms that are effective for monochrome text are ineffective for full-color graphics, while lossy compression achieves high ratios on color images but badly distorts text (Fig. 1). Therefore, images of this type are usually compressed with a combined approach.
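The 25 MB figure can be checked with a short calculation. This is only a sketch: it assumes 24-bit RGB (3 bytes per pixel) and the standard A4 dimensions of 210 x 297 mm, neither of which is stated explicitly above.

```python
MM_PER_INCH = 25.4

def raw_scan_size_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Uncompressed size of a scan: pixel count times bytes per pixel."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel

# A4 page (assumed 210 x 297 mm), 300 DPI, 24-bit color (assumed)
a4_bytes = raw_scan_size_bytes(210, 297, 300)
print(f"{a4_bytes} bytes = {a4_bytes / 2**20:.1f} MiB")
```

Under these assumptions the page is 2480 x 3508 pixels, which gives roughly 26 million bytes, in line with the figure quoted above.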

Structural compression of document images

We illustrate the idea of structural compression with the image of a magazine page (Fig. 2). A typical magazine page can contain a background image, one or more text blocks, graphic elements (photographs, charts, tables, etc.), and some marks. The main idea of structural compression for images of this kind is to identify the structural blocks, combine them into layers (i.e., "split" the image into text, graphic, and other layers), and compress each layer in the most appropriate way. Thus, the magazine page in Fig. 2 is split into four layers: the background, the black text area, the blue text area, and the photo area. To preserve maximum quality, text layers should be compressed with lossless algorithms (for example, CCITT Group 4), while for a photo it is acceptable to use lossy methods (JPEG). The central part of any structural compression algorithm is the method that splits the original image into text and graphic layers.
This approach became popular relatively recently. The DjVu format can rightly be considered one example of a format that implements the idea of structural compression.
To compress color images, DjVu uses a special technology that divides the original image into three layers: a foreground, a background, and a black-and-white (one-bit) mask. The mask is stored at the resolution of the source file; it contains the text and other sharp details. The resolution of the background, which holds the illustrations and the page texture, is reduced to save space. The foreground contains the color information for the details that are not in the background; its resolution is reduced even further. The background and foreground are then compressed with a wavelet transform (the IW44 algorithm), and the mask is compressed with the JB2 algorithm.
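The three-layer model described above can be illustrated by the way a page is put back together from its layers. The sketch below is not DjVu code: it skips the IW44 and JB2 decoding entirely, and the function name, the nearest-neighbor upsampling, and the scale factors are assumptions made for illustration only.

```python
WHITE, GREY, BLUE = (255, 255, 255), (128, 128, 128), (0, 0, 255)

def compose_layers(mask, foreground, background, fg_scale, bg_scale):
    """Rebuild a page from a full-resolution 1-bit mask and two
    reduced-resolution color layers (nearest-neighbor upsampling)."""
    page = []
    for y, mask_row in enumerate(mask):
        row = []
        for x, is_detail in enumerate(mask_row):
            if is_detail:
                # Mask pixel set: take the color from the foreground layer.
                row.append(foreground[y // fg_scale][x // fg_scale])
            else:
                # Mask pixel clear: take the color from the background layer.
                row.append(background[y // bg_scale][x // bg_scale])
        page.append(row)
    return page

mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
foreground = [[BLUE]]                         # 1x1, i.e. 4x reduced
background = [[WHITE, WHITE], [GREY, GREY]]   # 2x2, i.e. 2x reduced
page = compose_layers(mask, foreground, background, fg_scale=4, bg_scale=2)
```

The mask keeps the sharp shapes at full resolution, so the color layers can be stored much more coarsely without blurring the text edges.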

Despite its high compression ratios on document images, DjVu has a significant drawback: the format is currently not standardized, which makes it difficult to use as a basis for electronic archives. In addition, applying the same layering scheme to all types of documents is not always justified and can sometimes distort a document significantly. It is also worth noting that the format has no means of ensuring the security and confidentiality of documents.

Cognitive PDF/A Technology

Below we describe Cognitive PDF/A, a technology for converting paper documents into electronic form, and the digitization process it defines (Fig. 3).
The first processing stage is the layering of the original image, which produces two new images. The first contains the areas of the original image that correspond to textual information (the text layer), and the second contains the areas that correspond to graphic elements (the graphic layer).
By design of the algorithm, the text layer should contain nothing but text blocks. Consequently, the text-layer image can be recognized by external OCR systems without any preprocessing.

The final step is to package the resulting layers and the recognized text into PDF/A. The graphic and text layers are each compressed in an appropriate way, and the recognized text is packed so as to make searching and copying information in the document as convenient as possible.
Thus, Cognitive PDF/A consists of three main parts: the layering of the original image, the recognition of the text layer with an OCR system, and the compact packaging of the resulting layers and the recognized text into a PDF/A file. Let us consider these parts in more detail.

Splitting algorithm

Different types of documents have different characteristics. For example, financial documents typically carry seals, signatures, and stamps; journal articles may have a complex multicolored background; and books often include full-color graphic elements. Therefore, Cognitive PDF/A provides a separate layering scheme for each type of document. The best scheme can be chosen with algorithms that identify the document type in advance. Below, as examples, we consider layering schemes for two important document types: a book page and an office document.
A book page usually contains black text on a white background and, possibly, graphic elements: drawings, charts, graphs, etc. (Fig. 4).

In books, text areas and graphics typically do not overlap. Another key feature of book layout is the use of fonts of similar linear sizes. Based on these characteristics, we construct a layering scheme for a book page image.
Step 1. We binarize the original image, converting it to monochrome (Fig. 5a). Since the image mainly contains black text on a white background, binarization should not noticeably affect the areas containing text.
Step 2. Using morphological filtering, we merge each word into a single connected component. Let w and h denote the characteristic width and height of a character. Note that the distance between letters within a word is comparable to the stroke thickness, while the distance between words is close to the character width. Therefore, we can "glue" each word into a separate connected component by applying an opening with a window (Fig. 5b).

Step 3. We build a histogram of the heights of the connected components (Fig. 6). Since all the text on a page is printed in approximately the same font size, the connected components corresponding to words form one or several distinct maxima in the histogram. By analyzing the histogram, we can therefore estimate the characteristic font size h_font of the text on the page and, accordingly, select the image regions that correspond to text (the regions of connected components whose height is about h_font).
Knowing where the text is located in the original image, we construct the layering mask and then apply it to obtain the graphic and text layers (Fig. 7).
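Step 3 can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the components are 4-connected, the dominant height is taken as the single most frequent value in the histogram, and the `tolerance` parameter (how far a component's height may deviate from h_font) is hypothetical.

```python
from collections import Counter, deque

def component_boxes(ink):
    """Bounding boxes (top, left, bottom, right) of 4-connected ink components."""
    h, w = len(ink), len(ink[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not ink[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            top, left, bottom, right = y0, x0, y0, x0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and ink[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def select_text_boxes(boxes, tolerance=1):
    """Estimate h_font as the histogram peak and keep boxes of about that height."""
    heights = Counter(b[2] - b[0] + 1 for b in boxes)
    h_font = heights.most_common(1)[0][0]
    words = [b for b in boxes if abs((b[2] - b[0] + 1) - h_font) <= tolerance]
    return h_font, words

# Toy page after word merging: three "words" 2 px high and one "picture" 5 px high.
rows = [
    "###.###.....",
    "###.###.....",
    "............",
    "###..#######",
    "###..#######",
    ".....#######",
    ".....#######",
    ".....#######",
]
ink = [[c == "#" for c in row] for row in rows]
h_font, words = select_text_boxes(component_boxes(ink))
print(h_font, words)
```

The boxes returned in `words` mark the text regions; filling them in a blank image would give a layering mask of the kind described above.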
Since fast morphological filtering algorithms with a rectangular window are used to select text blocks, it is very important that the text blocks be aligned with the image axes. Therefore, the image is deskewed before the morphological operations are applied.
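Steps 1 and 2 (binarization and word merging) can be sketched as follows. The code uses a naive rectangular-window filter rather than a fast algorithm, and the threshold of 128 and the 3 x 1 window are assumptions chosen for the toy example; the text does not give the actual window size. White is the high value here, so an opening (erosion followed by dilation) removes white gaps narrower than the window.

```python
def binarize(gray, threshold=128):
    """Step 1: map grey levels to 1 (white) or 0 (black ink)."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray]

def rank_filter(img, win_w, win_h, pick):
    """Apply pick (min or max) over a rectangular window clipped at the borders."""
    h, w = len(img), len(img[0])
    rx, ry = win_w // 2, win_h // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - ry), min(h, y + ry + 1))
                      for i in range(max(0, x - rx), min(w, x + rx + 1))]
            row.append(pick(window))
        out.append(row)
    return out

def opening(img, win_w, win_h):
    """Step 2: erosion (min filter) followed by dilation (max filter)."""
    eroded = rank_filter(img, win_w, win_h, min)
    return rank_filter(eroded, win_w, win_h, max)

def show(img):
    return ["".join("." if v else "#" for v in row) for row in img]

# One scan line: two "words" of two letters each, letters 1 px apart, words 4 px apart.
gray = [[255, 0, 0, 255, 0, 0, 255, 255, 255, 255, 0, 0, 255, 0, 0, 255]]
binary = binarize(gray)
merged = opening(binary, win_w=3, win_h=1)
print(show(binary)[0])   # .##.##....##.##.
print(show(merged)[0])   # ######....######
```

The one-pixel gaps between letters disappear while the four-pixel gap between words survives, so each word becomes one connected component. The single white pixels at the left and right edges are also swallowed, which is a side effect of clipping the window at the image border in this simplified filter.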
For color images of financial documents (invoices, receipts, contracts, etc.), the features of a book page described above do not hold, since graphic elements (seals, signatures, handwritten notes) are often superimposed on text blocks (Fig. 8). It is therefore unreasonable to use the algorithm described above for layering. Instead, we build a layering scheme based on the color characteristics of the image. The color saturation of black text and of a white background is close to zero, while for blue seals and signatures it is high. Taking this property into account, we construct the following layering scheme.
Step 1. We build a histogram of color saturation (Fig. 9), i.e. the dependence y = log N_x, where N_x is the number of image pixels whose saturation equals x.

Step 2. Note that two classes are clearly distinguishable in the histogram: the first is formed by pixels with low color saturation, the second by pixels with high saturation. Pixels of the first class make up the image areas corresponding to the background and the black text; pixels of the second class make up the graphic part of the image. We find the threshold t* separating the two classes using Otsu's method.
Step 3. We layer the original image as follows: a pixel (x, y) of the original image belongs to the text layer (Fig. 10a) if its color saturation is below the threshold, s(x, y) < t*; otherwise, the pixel (x, y) belongs to the graphic layer (Fig. 10b).
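The three steps above can be sketched in plain Python. Several details are assumptions rather than facts from the text: saturation is taken from the HSV model via the standard `colorsys` module, it is quantized to 256 levels, and pixels that do not belong to a layer are filled with white. The logarithm in y = log N_x only affects how the histogram is displayed, so the sketch works with the raw counts N_x.

```python
import colorsys

LEVELS = 256            # assumed quantization of saturation: x = 0..255
WHITE = (255, 255, 255)

def saturation_level(rgb):
    """HSV saturation of an 8-bit RGB pixel, quantized to 0..LEVELS-1."""
    r, g, b = (c / 255 for c in rgb)
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return min(int(s * LEVELS), LEVELS - 1)

def histogram(levels):
    """Step 1: N_x, the number of pixels whose saturation level equals x."""
    counts = [0] * LEVELS
    for row in levels:
        for x in row:
            counts[x] += 1
    return counts

def otsu_threshold(counts):
    """Step 2: threshold t* that maximizes the between-class variance.
    Class 0 holds levels below t*, class 1 holds the rest."""
    total = sum(counts)
    total_sum = sum(level * n for level, n in enumerate(counts))
    best_t, best_score = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(1, len(counts)):
        w0 += counts[t - 1]
        sum0 += (t - 1) * counts[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (total_sum - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t

def split_layers(image):
    """Step 3: s(x, y) < t* goes to the text layer, everything else to the graphic layer."""
    levels = [[saturation_level(p) for p in row] for row in image]
    t_star = otsu_threshold(histogram(levels))
    text = [[p if s < t_star else WHITE for p, s in zip(prow, srow)]
            for prow, srow in zip(image, levels)]
    graphic = [[WHITE if s < t_star else p for p, s in zip(prow, srow)]
               for prow, srow in zip(image, levels)]
    return t_star, text, graphic

# Toy row: black text, white paper, dark grey text, two bluish seal pixels.
image = [[(0, 0, 0), (255, 255, 255), (40, 40, 40), (20, 40, 200), (30, 60, 220)]]
t_star, text_layer, graphic_layer = split_layers(image)
```

In this toy row the grey-scale pixels have zero saturation and stay in the text layer, while the two bluish pixels end up in the graphic layer.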

Text layer recognition

As a result of the layering, we obtain text-layer images that can be recognized by external OCR systems without any preprocessing.
The software implementation of Cognitive PDF/A uses the open-source CuneiForm OCR system as its OCR module.

Compression and packaging in PDF/A

The resulting text and graphic layers, together with the recognized text, are packed into the PDF/A format. This format is defined by the ISO 19005-1:2005 standard, is based on the PDF 1.4 specification from Adobe Systems Inc., and is designed specifically for long-term archival storage of electronic documents. Although PDF/A is a subset of the PDF format, it differs in a number of ways because of the requirements placed on it as a long-term storage format for electronic documents. For example, PDF/A requires:

Embedding of all fonts used, including the fonts from the "standard PDF fonts" list.
Embedding of a color profile if the PDF/A file contains images. A color profile is a file that describes how an output device (monitor, printer, etc.) should reproduce color. Importantly, the embedded color profile must be device-independent.
Metadata indicating the version of the format used, the title of the document, the list of authors, a brief description, the dates of creation and last modification of the document file, and keywords for search. The PDF/A specification also fixes the metadata format: the Adobe Extensible Metadata Platform (XMP).
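To give an idea of what such metadata looks like, here is a minimal sketch that builds an XMP fragment with the PDF/A identification fields and a title. It is not the metadata writer of Cognitive PDF/A: the helper names, the document title, and the conformance level "B" are assumptions for illustration, and a real XMP packet also needs the surrounding `<?xpacket ...?>` wrapper and the remaining fields listed above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP, Dublin Core and the PDF/A identification schema.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

def q(prefix, name):
    """Qualified tag name in ElementTree's {uri}name notation."""
    return f"{{{NS[prefix]}}}{name}"

def build_xmp(title, part="1", conformance="B"):
    """Build a minimal XMP fragment: PDF/A part, conformance level and title."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    ET.SubElement(desc, q("pdfaid", "part")).text = part
    ET.SubElement(desc, q("pdfaid", "conformance")).text = conformance
    # dc:title is a language alternative: rdf:Alt with an x-default entry.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title
    return ET.tostring(xmpmeta, encoding="unicode")

xmp = build_xmp("Scanned invoice")  # hypothetical document title
print(xmp)
```

The `pdfaid:part` value of 1 corresponds to PDF/A-1, the part of the standard referred to above.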

To increase the compression ratio, the graphic and text layers are compressed in different ways. Because of the nature of its content, the graphic layer is downsampled to a resolution of 100 DPI and encoded with JPEG. The text layer carries the main information of the document, so it is kept at the original resolution and encoded with the lossless CCITT Group 4 algorithm.
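A rough calculation shows how much the layering alone saves before any entropy coding is applied. The sketch assumes an A4 page (210 x 297 mm) scanned at 300 DPI, a 1-bit text layer with byte-aligned rows, and a 24-bit graphic layer at 100 DPI; the actual CCITT Group 4 and JPEG output will be considerably smaller than these raw sizes.

```python
MM_PER_INCH = 25.4

def size_in_pixels(width_mm, height_mm, dpi):
    """Page dimensions in pixels at the given resolution."""
    return round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi)

def raw_layer_bytes(width_px, height_px, bits_per_pixel):
    """Raw (pre-compression) layer size with rows padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px

text_w, text_h = size_in_pixels(210, 297, 300)         # text layer: original resolution
graphic_w, graphic_h = size_in_pixels(210, 297, 100)   # graphic layer: 100 DPI

text_bytes = raw_layer_bytes(text_w, text_h, 1)             # 1 bit per pixel
graphic_bytes = raw_layer_bytes(graphic_w, graphic_h, 24)   # 24-bit color
full_bytes = raw_layer_bytes(text_w, text_h, 24)            # original color scan

print(text_bytes, graphic_bytes, full_bytes)
print(f"layers / original = {(text_bytes + graphic_bytes) / full_bytes:.2f}")
```

Under these assumptions the two raw layers together take about 4 MB against about 26 MB for the original color scan, and each of them is then compressed further by a codec suited to its content.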

Experimental Results

Evaluating the effectiveness of the technology automatically is almost impossible: it is not enough to compare only the sizes of the output files; the quality of the resulting "compact electronic document" must be compared as well. Therefore, in practice the evaluation is done visually (that is, "by eye").
Figure 11 shows several test images that were compressed with JPEG (at a compression level that preserves readability), DjVu, and Cognitive PDF/A. The comparison (see the table) shows that Cognitive PDF/A outperforms JPEG in compression ratio by an order of magnitude but loses to DjVu. This difference in size can be explained by the fact that, in addition to the useful information (the images themselves and the recognized text), a PDF/A file also contains auxiliary data needed for long-term storage. Although DjVu files are smaller, the quality of DjVu-compressed office documents is lower, which is especially noticeable in the areas of seals and signatures.

The images and results presented in the table can be downloaded at: yadi.sk/d/7us8gghADHVrg

The full article is published in:
Usilin S.A., Nikolaev D.P., Postnikov V.V. Cognitive PDF/A - technology of digitization of text documents for publication on the Internet and long-term archival storage // Proceedings of the Institute for System Analysis of the RAS. Technologies of programming and data storage / ed. Arlazarov V.L., Yemelyanov N.E. M.: LENAND, 2009. Vol. 45. P. 159-173.

Recommended literature

Vatolin D., Ratushnyak A., Smirnov M., Yukin V. Data Compression Methods. Image compression algorithms. - M .: Dialog-MEPI, 2002. - 99 p.
Gonzalez R., Woods R. Digital Image Processing. - M .: Technosphere, 2005. - 1072 p.
Lizardtech DjVu Reference, www.lizardtech.com
Kuroptev A.V., Nikolaev D.P., Postnikov V.V., Usilin S.A. Selecting graphic primitives and text blocks on document images using morphological operations // Proceedings of the 51st scientific conference of MIPT. Modern problems of fundamental and applied sciences. Part 9. Innovation and high technology. - M .: MIPT, 2008, p. 29 - 31.
M. van Herk. A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels // Pattern Recognition Letters. - 1992. - p. 517 - 521.
ISO 19005-1:2005. Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF 1.4 (PDF/A-1).
Adobe Systems Incorporated. Extensible Metadata Platform (XMP) Specification, www.adobe.com

Source: https://habr.com/ru/post/203580/
