
How MRC technology reduces the size of PDF documents

The PDF format has long been established as a way to store documents that are not meant to be edited. All PDF files fall into two classes. The first consists of documents that were created digitally and then converted to PDF. The manual for almost any device is most likely such a file. Internally it contains text and graphics plus formatting commands that describe how the elements should be placed on the page.

The second class consists of documents obtained by scanning paper originals. They can be run through ABBYY FineReader, which turns them into the first type, or simply saved to PDF as images. The latter often makes sense when you want to preserve the original look of the document. Although ABBYY FineReader recognizes documents quite well, recognition errors do occur and some important elements on the page are missed, so the result generally looks somewhat different from the original document.

Therefore, it often makes sense to save the original scan in the PDF as an image and place the recognized text underneath it, so that the document can be found by keywords and its text can be copied and pasted. There is just one catch: such PDF files are rather large, half a megabyte per page or more. Accordingly, scanning a medium-sized textbook on mathematical analysis yields a file of about 200 megabytes.


This size comes from the fact that, inside a scanned PDF, bitmap images are compressed with conventional image codecs: JPEG, JPEG2000, LZW or ZIP. Accordingly, such pages cannot take up less space than ordinary JPEG files would. To reduce the size, people usually resort to all sorts of tricks, such as lowering the resolution or sharply lowering the compression quality, and as a result the quality of the text in such PDFs suffers.

The alternative is to give up PDF and save everything in DjVu. The file comes out fairly small, but in practice not every recipient can easily read it: Adobe Acrobat is installed on far more computers than a DjVu viewer.
This is where PDF MRC technology (MRC stands for “Mixed Raster Content”) comes to the rescue; it is Adobe's answer to the DjVu format. It is still an ordinary PDF, but it borrows many ideas from DjVu and can be read by all popular PDF readers. With MRC, the page size is reduced by a factor of 4 while the quality of the scanned image is maintained. This is achieved by splitting the image into layers and compressing each layer with the most appropriate codec. The text is compressed with the JBIG2 codec, and everything else is compressed with JPEG, JPEG2000 or ZIP at varying quality.

How is MRC organized inside a PDF? Let us start with a simple example and then gradually make it more complicated.
Suppose we have a scan of a white page with black text, for example a page from a book.

Scan, JPEG, 1.2 MB



The only useful information here is the letters; everything else can be ignored. First we find all the text on the page, and the logical way to do that is to run FineReader and recognize the page. Then we put all the text found into a separate layer and compress it with the JBIG2 codec. We get 50 kilobytes per page versus 400 for JPEG and 200 for the black-and-white fax codec CCITT Group 4.

JBIG2 is designed specifically for compressing text. As it works, it groups visually similar letter images into clusters. One such cluster might be, for example, all the letters 'a' printed in the same font at the same size. Slightly different letters 'a', for example distorted by scanning or printed in a different font, fall into other clusters. The result is a dictionary that stores each recurring letter shape only once. After that, only the position of each letter on the page is recorded. The result is very compact.
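
The dictionary idea can be illustrated with a toy sketch. This is not the real JBIG2 algorithm or any library's API: the bitmaps, the pixel-difference measure and the `max_diff` threshold are all assumptions chosen for illustration. Each glyph is matched against the shapes already in the dictionary, and only its position and dictionary index are stored.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # tiny bitmap: rows of '#' (ink) and '.' (paper)


def hamming(a: Glyph, b: Glyph) -> int:
    """Count differing pixels between two bitmaps of the same size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def build_dictionary(glyphs, max_diff=1):
    """glyphs: list of (x, y, bitmap). Returns (dictionary, placements)."""
    dictionary: List[Glyph] = []
    placements: List[Tuple[int, int, int]] = []  # (x, y, dictionary index)
    for x, y, bitmap in glyphs:
        for index, symbol in enumerate(dictionary):
            same_size = (len(symbol) == len(bitmap)
                         and len(symbol[0]) == len(bitmap[0]))
            if same_size and hamming(symbol, bitmap) <= max_diff:
                break  # close enough: reuse the existing dictionary entry
        else:
            dictionary.append(bitmap)  # new shape: start a new cluster
            index = len(dictionary) - 1
        placements.append((x, y, index))
    return dictionary, placements


# Hypothetical 3x3 "letters": a clean shape, a noisy copy of it, and another shape.
A = ("###", "#.#", "###")
A_NOISY = ("###", "#.#", "##.")  # one pixel lost during scanning
B = ("#..", "#..", "###")

dictionary, placements = build_dictionary(
    [(0, 0, A), (4, 0, B), (8, 0, A_NOISY), (12, 0, A)]
)
print(len(dictionary))  # 2 shapes stored for 4 glyphs on the page
print(placements)
```

Four glyphs on the page are stored as two bitmaps plus four small placement records, which is where the saving comes from on a real page with thousands of letters.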

JBIG2, 50 Kb. The PDF with additional information is 80 Kb.



Now let us complicate the task. Suppose the page has an uneven background that we do not want to lose.

Tiff, 500 Kb



For this we already need two layers. The first is still the text, compressed with JBIG2. The second layer receives everything that remains of the original image after the letters are cut out and the holes they leave are painted over. The second layer can be compressed quite heavily with JPEG, since it usually contains no particularly valuable information.
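
A minimal sketch of this split, under simplifying assumptions: the page is a small grayscale grid, text is found with a plain brightness threshold (the article relies on recognition in FineReader instead), and the holes are painted with the mean paper brightness. The function name `split_layers` and the threshold are hypothetical.

```python
def split_layers(page, threshold=128):
    """Split a grayscale page into a 1-bit text mask and a hole-free background."""
    # 1 marks a dark pixel that we treat as text, 0 marks paper.
    mask = [[1 if pixel < threshold else 0 for pixel in row] for row in page]

    # Average brightness of everything that is not text.
    paper = [pixel
             for row, mask_row in zip(page, mask)
             for pixel, bit in zip(row, mask_row) if not bit]
    fill = round(sum(paper) / len(paper)) if paper else 255

    # Paint over the holes left by the letters so the background stays smooth.
    background = [[fill if bit else pixel for pixel, bit in zip(row, mask_row)]
                  for row, mask_row in zip(page, mask)]
    return mask, background


page = [
    [200, 200, 200],
    [200,  10, 220],
    [180,  20, 200],
]
mask, background = split_layers(page)
print(mask)        # text layer, a candidate for JBIG2
print(background)  # smooth layer, a candidate for strong JPEG compression
```

Painting the holes matters because sharp black letters are exactly what JPEG handles badly; once they are gone, the background contains only smooth variations and tolerates strong compression.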

The resulting PDF is 35 Kb, versus the 190 Kb we would get by simply compressing the entire image with JPEG.

Text, JBIG2, 18 Kb



Background, 11 Kb, JPEG



Final MRC PDF, 35 Kb



The next complication. So far we have dealt only with black text. Now suppose the page contains colored text.

Tiff, 700 Kb



As before, we compress the text with the black-and-white JBIG2 codec, but under the colored letters we place a so-called color mask: another layer that shows through the “slots” formed by the letters. This layer contains few colors and compresses very well, for example with ZIP.
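
How a viewer might put the layers back together can be sketched as follows. This is an illustrative model only, assuming all three layers have the same dimensions; the function name `compose` and the sample colors are hypothetical. Wherever the 1-bit text layer is set, the pixel comes from the color mask; everywhere else it comes from the background.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def compose(mask: List[List[int]],
            color_mask: List[List[RGB]],
            background: List[List[RGB]]) -> List[List[RGB]]:
    """Rebuild the page: color mask where the text bit is 1, background elsewhere."""
    return [
        [fg if bit else bg for bit, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, color_mask, background)
    ]


RED, BLUE, PAPER = (200, 0, 0), (0, 0, 200), (250, 245, 230)

mask = [[1, 0, 1],
        [1, 0, 1]]
# The color mask is painted in flat blocks; it only has to be right under the letters.
color_mask = [[RED, RED, BLUE],
              [RED, RED, BLUE]]
background = [[PAPER] * 3 for _ in range(2)]

page = compose(mask, color_mask, background)
print(page)
```

Because the color mask is visible only through the letters, it can consist of large flat areas of a single color, which is why a general-purpose codec such as ZIP packs it so well.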

Text, JBIG2, 11 Kb



Color mask, ZIP, 3 Kb



Text + color mask look like this:



Background, JPEG, 40 Kb



It is important not to overdo the background compression: text that was not recognized as text can end up in the background. If the background is compressed too heavily, such text will be hard to read.

Final MRC PDF, 60 Kb



So we already have three layers: the text, the color mask that colors the text, and the background. It remains to deal with elements that are neither text nor background, such as pictures or photos. Nothing special can be done with them, so we simply add them to the background and compress it with high-quality JPEG or JPEG 2000.

Tiff, 600 Kb



Text JBIG2, 25 Kb



Color mask, ZIP, 5 Kb



Background, JPEG, 40 Kb



The PDF MRC is ready. It contains several layers, each holding different parts of the picture and compressed with the most appropriate codec.
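
As a quick check of where the bytes go, the sizes quoted in the captions of this last example can be added up. The dictionary layout below is just a convenient way to hold those numbers, not a description of the actual PDF structure.

```python
# Layer sizes in Kb, taken from the captions of this example.
layers = {
    "text": ("JBIG2", 25),
    "color mask": ("ZIP", 5),
    "background with pictures": ("JPEG", 40),
}
final_pdf_kb = 72     # size of the finished MRC PDF
source_tiff_kb = 600  # size of the source scan

payload_kb = sum(size for _, size in layers.values())
overhead_kb = final_pdf_kb - payload_kb
ratio = source_tiff_kb / final_pdf_kb

for name, (codec, size) in layers.items():
    print(f"{name:26s} {codec:6s} {size:3d} Kb")
print(f"layers: {payload_kb} Kb, PDF overhead: {overhead_kb} Kb")
print(f"source TIFF is {ratio:.1f} times larger than the MRC PDF")
```

Note that the ratio here is measured against the source TIFF, so it is not directly comparable with figures that use a whole-page JPEG as the baseline.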

Final MRC PDF, 72 Kb



Of course, there are images whose size does not benefit from MRC. For example, it makes no sense to compress a landscape photo this way: the result will not be smaller than a plain JPEG. The same goes for text printed on a background with many small details.

An image for which PDF MRC gives no benefit



However, for many of the documents we encounter in everyday life, MRC gives excellent results.

And finally, a few examples of PDF MRC files, which can be produced with ABBYY FineReader, ABBYY FineReader Engine or ABBYY Recognition Server:

PDF, JPEG    PDF, MRC
524 Kb       218 Kb
618 Kb       175 Kb
412 Kb       113 Kb


So we get files 2 to 6 times smaller at the same quality, and this is not the limit. PDF MRC is still a very young technology and continues to develop rapidly. Improvements can be expected both in quality and in size reduction.
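
For the three sample files listed above, the ratios can be worked out directly; the snippet only does the arithmetic on the sizes from the table.

```python
# (PDF with JPEG, PDF with MRC) sizes in Kb, from the table above.
samples_kb = [(524, 218), (618, 175), (412, 113)]

ratios = [round(jpeg / mrc, 1) for jpeg, mrc in samples_kb]
print(ratios)  # [2.4, 3.5, 3.6]
```

These three files land in the lower half of the quoted range; the earlier page with an uneven background (35 Kb versus 190 Kb) comes out at roughly 5.4 times.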

All PDF examples in this article were produced with ABBYY FineReader Engine 10 at default settings.

Vasily Panfyorov,
Product Development Department

Source: https://habr.com/ru/post/119790/

