PDF from a programmer's point of view

I deal with PDF not only as a user but, above all, as a software developer who knows how to read and write it (you may have come across the ABBYY products that work with PDF: ABBYY FineReader and ABBYY PDF Transformer). I assume you have read the article at habrahabr.ru/company/abbyy/blog/105006, so here I write only about some features and limitations of PDF that are of more interest to advanced users. I will not go into complicated technical details, so programmers who want to learn how to read or write PDF had better go straight to version 1.7 of the specification at www.adobe.com/devnet/pdf/pdf_reference_archive.html :)

The purpose and features of PDF

PDF was originally conceived by Adobe at the start of the 1990s as an “electronic hard copy” of page-structured documents: something that can be viewed and printed in a form identical to the original on different machines and platforms, but that is not meant to be edited. This definition sets PDF apart from most other formats for storing and distributing human-readable documents. Over the years PDF has evolved a great deal and is now a container for a wide variety of content (text, vector and raster graphics, interactive elements, forms, audio, video, annotations of various kinds), but its original purpose is still the source of both its capabilities and its numerous restrictions.

Text document formats (DOC, RTF, DOCX and so on), for example, are focused mainly on editing documents rather than viewing them. A document created by a sensible user :) responds logically to inserting, replacing or deleting text, images and tables in different places, to changing the page size and margins, to reformatting text fragments of any size, and to similar actions. HTML web pages are less focused on editing (although they allow it), but if the author knows what he is doing, they display well not only on their creator's monitor but also on devices with completely different screens and means of user interaction.
PDF goes its own way: it is most widely used as a “parasite” format, in which documents are not created by a person from scratch but are usually generated from other formats by deep machine processing that discards many or even all of the details not needed to display the document in fixed form. The most common way to get a PDF is to print to a virtual PDF printer from any application that has a “Print” command in its menu.

A PDF printer translates GDI (“graphics device interface”) commands, which draw characters, lines, curves, rectangles, bitmaps and other geometric primitives, into the corresponding PDF commands and saves them to a file. The number and size of the pages that were printed are, of course, preserved as well.
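As a rough illustration of this translation, here is a minimal Python sketch that turns three drawing primitives into PDF page-drawing operators. The three functions are hypothetical stand-ins for the calls a printer driver receives, not a real GDI or driver API, and a real driver also has to handle coordinate conversion, fonts and encodings, all of which this sketch ignores.

```python
def line_to_pdf(x0, y0, x1, y1, width):
    """Set the line width (w), move to the start (m), add a segment (l), stroke (S)."""
    return f"{width:g} w {x0:g} {y0:g} m {x1:g} {y1:g} l S"


def filled_rect_to_pdf(x, y, w, h):
    """Append a rectangle to the path (re) and fill it (f)."""
    return f"{x:g} {y:g} {w:g} {h:g} re f"


def text_to_pdf(font_name, size, x, y, text):
    """Show a string at (x, y) inside a BT ... ET text object."""
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"BT /{font_name} {size:g} Tf {x:g} {y:g} Td ({escaped}) Tj ET"


# A word and the line drawn under it become two unrelated commands.
page_commands = [
    text_to_pdf("F1", 12, 72, 705, "Hello"),
    line_to_pdf(72, 703, 100, 703, 0.5),
]
```

Nothing in `page_commands` records that the line belongs to the word; each primitive is written out on its own.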

Such a transformation can convey very accurately the appearance of what was sent to the printer (lines and characters, for example, stay sharp at any zoom level and are stored quite compactly), but it completely ignores the structure of the document it came from. For example, PDF provides no dedicated command or character attribute for underlining a word or other piece of text. Instead, the characters are output separately (in groups that usually do not even coincide with words or lines), and lines or thin rectangles of the desired thickness and color are drawn separately at the right places on the page. A table, which a person perceives as an integral set of cells, is for an application that displays PDF just a chaotic collection of characters and lines that happened to form something a person sees as a table. Hyperlinks, which in the source document could be used both for navigation within the document and for jumping to Web addresses, cease to be a means of navigation when printed and leave behind only colored and/or underlined text. In short, it is all imitation and fakery. Below I will call such PDFs “vector” (as consisting of vector commands, which include the drawing of characters).
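To make the underlining example concrete, here is a minimal Python sketch of how a converter might guess which text runs are underlined by matching them against thin rectangles drawn nearby. The `TextRun` and `Rect` types, the bottom-left coordinate origin and all tolerance values are assumptions chosen for illustration; a real converter would need to tune them and handle rotated or skewed text.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    text: str
    x0: float        # left edge of the run
    x1: float        # right edge of the run
    baseline: float  # y coordinate of the text baseline
    size: float      # font size


@dataclass
class Rect:
    x0: float
    y0: float  # bottom edge
    x1: float
    y1: float  # top edge


def is_underline(run, rect, max_thickness_ratio=0.15, max_drop_ratio=0.3):
    """Heuristic: a thin rectangle just below the baseline that spans the run."""
    thickness = rect.y1 - rect.y0
    if thickness <= 0 or thickness > run.size * max_thickness_ratio:
        return False
    drop = run.baseline - rect.y1  # how far below the baseline the rectangle starts
    if drop < 0 or drop > run.size * max_drop_ratio:
        return False
    overlap = min(run.x1, rect.x1) - max(run.x0, rect.x0)
    return overlap >= 0.8 * (run.x1 - run.x0)


def underlined_runs(runs, rects):
    """Return the text of every run that has a matching underline rectangle."""
    return [run.text for run in runs if any(is_underline(run, r) for r in rects)]
```

With two runs, "Hello" at x 100 to 140 and "world" at x 145 to 185 on the same baseline, and one thin rectangle under the first, `underlined_runs` reports only "Hello". The point is that the underline has to be inferred from geometry, because the file itself never states it.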

Another way to obtain PDF documents, which has become especially popular in recent years, is to convert scanned paper pages into them. Most scanners and multifunction devices can now deliver their result as a “raster” PDF. The “simulated printing” described above is not needed here: the device driver or utility builds the PDF pages itself, placing the required raster image on each of them. Fortunately, the set of image formats that can be used in PDF covers most needs. Such “raster” PDF documents take up more space and look worse than “vector” ones.

Some modern applications (including the OpenOffice suite, recent versions of Microsoft Office, ABBYY FineReader and ABBYY PDF Transformer) can create PDF on their own, using a much larger arsenal of tools than PDF printers have, because they know much more about the source document than needs to be sent to a printer. This makes it possible, for example, to preserve hyperlinks as such (and not merely as colored and/or underlined text), or to describe some elements of the document's structure so that it can be reflowed and displayed on small, low-resolution screens. Documents that carry such structural information are called “tagged” PDF. According to Adobe, tagging, which appeared with Acrobat 5, is meant to cover up the most glaring flaws of earlier PDF versions. For untagged documents, for example, correct copying of text fragments to the Windows clipboard (the familiar copy-paste) is not guaranteed. Even today not every newly created PDF is tagged, whether because of the limited capabilities of the generating programs (or because users do not know which checkbox to turn on), or simply because such PDFs are larger, which matters when disk space has to be saved in large archives.

Converting PDF documents to other formats

The desire to edit the contents of a PDF document, or to convert it to another, preferably editable format (either for immediate editing or for storage with the ability to search and edit “someday”), arises for various reasons. The simplest means of extracting text is provided by any application that displays PDF. I mean the usual copy-paste, which works rather primitively: as a rule, character and paragraph formatting is lost, and tables and the complex layout of the PDF document are ignored. There are applications that allow spot editing of PDFs without converting them to other formats, but their arsenal of editing tools is very limited and simply cannot be compared with familiar word processors :) In the expensive Adobe Acrobat, the only kind of editing that works for many documents is “annotation”: there are tools for adding comments, highlighting text with a marker, striking text through, and so on. Yes, more advanced editing does seem to be there, but have you by any chance met the amusing message “There is no available system font. You can’t add or delete text using the currently selected font.” after an innocent attempt to remove a character or a word from a “good”, “vector” PDF document in Acrobat? And have you tried replacing a fragment of a line with a longer one, sadly watching the tail of the line creep off to the right? If not, then your love for Adobe products is yet to come! For tasks that are simple and familiar in word processors, such as replacing every occurrence of the word “MS” with “Microsoft” in a few seconds, with the text and columns reflowing accordingly, this kind of “editing” is no use at all.

It is no coincidence that a whole sector has formed in the software industry that produces conversion tools with better functionality. From what is written above (and especially below) it should become clear how difficult this task is. Most users who have not read this piece do not think so, which is exactly why I am writing it :)

Major problems when converting PDF to other formats

Discussions of PDF-related questions often use the notion of a “text layer”. Many users intuitively assume that PDF files contain dedicated parts in which all the necessary characteristics of the text are described in a logical and understandable way, whether the text is visible or is invisible but can be searched and selected with the mouse. I want to reveal a terrible secret to you (probably at the risk of soon catching a bullet from a killer sent by the authors of PDF and their marketing department): there is no text layer in this sense in PDF! In fact, each page has one general stream of drawing commands in which commands of different kinds are mixed in arbitrary order: setting clipping regions; changing the current line width, color and dash pattern; changing the coordinate system; changing the font; drawing lines and curves (with the current attributes); showing a group of characters with the current attributes and given “glyph numbers” (a glyph is a description of a character's image, without regard to its other characteristics); drawing raster images; and so on. In other words, even the dedicated text commands are just one of many drawing tools and are not separated into streams of their own.
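The mixing described above can be seen by classifying the operators of a small page stream. The Python sketch below is a deliberately simplified tokenizer, not a real PDF parser: it ignores comments, dictionaries, inline images and nested parentheses, covers only a handful of operators, and uses a made-up sample stream. The operator names themselves (`q`, `Q`, `cm`, `w`, `d`, `RG`, `rg`, `W`, `m`, `l`, `c`, `re`, `n`, `S`, `f`, `BT`, `ET`, `Tf`, `Td`, `Tj`, `TJ`, `Do`) are standard PDF content stream operators.

```python
import re
from itertools import groupby

# Broad category for a small subset of PDF content stream operators.
OPERATOR_KIND = {
    "q": "state", "Q": "state", "cm": "state", "w": "state", "d": "state",
    "RG": "state", "rg": "state",
    "W": "clip",
    "m": "path", "l": "path", "c": "path", "re": "path", "n": "path",
    "S": "path", "f": "path",
    "BT": "text", "ET": "text", "Tf": "text", "Td": "text", "Tj": "text", "TJ": "text",
    "Do": "xobject",  # paints an external object, for example a raster image
}

# Literal strings, hex strings, array brackets, then any other bare token.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f]*>|[\[\]]|[^\s\[\]()<>]+")


def kinds_in_order(stream):
    """Return operator categories in stream order, merging adjacent repeats."""
    kinds = [OPERATOR_KIND[tok] for tok in TOKEN.findall(stream) if tok in OPERATOR_KIND]
    return [kind for kind, _ in groupby(kinds)]


# Hypothetical page stream: one word is split in two by an image in between.
SAMPLE = """
q 0.5 w 0 0 1 RG 72 700 m 200 700 l S Q
BT /F1 12 Tf 72 705 Td (Hel) Tj ET
q 100 0 0 50 72 600 cm /Im1 Do Q
BT /F1 12 Tf 90 705 Td <006C006F> Tj ET
"""
```

Calling `kinds_in_order(SAMPLE)` shows text operators appearing twice, separated by graphics state changes and an image, which is the situation a converter has to untangle before it can speak of "the text" of a page.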

What is worse is something else: even within a single PDF page, a (too) wide range of ways to depict text can be used. Letters may appear as part of a raster image, for example in logos (recognizing them is a pure OCR task, a job for that same ABBYY FineReader), as the result of drawing Bezier curves, or as the output of the dedicated text commands. This last case is the best one for processing, but even here generally accepted character codes from Unicode or another encoding are not necessarily given, because special fonts containing only the subset of characters actually used can be written to the PDF file, and the characters can be referenced by entirely arbitrary “glyph numbers” rather than by codes. That is, it is not always easy even to find the characters in the right place and to determine their codes! Formatting, including the choice of a similar font when an exact equivalent is missing, is trickier still.
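The glyph number problem can be sketched in a few lines of Python. PDF lets a font carry an optional table that maps its glyph codes to Unicode (the ToUnicode CMap), but when that table is missing or incomplete the converter is left with bare numbers. The mapping and glyph numbers below are invented for illustration, and the function name is hypothetical.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character for unknown glyphs


def decode_glyphs(glyph_ids, to_unicode):
    """Map glyph numbers to characters; collect the numbers that have no mapping."""
    chars = []
    missing = []
    for gid in glyph_ids:
        ch = to_unicode.get(gid)
        if ch is None:
            missing.append(gid)
            ch = REPLACEMENT
        chars.append(ch)
    return "".join(chars), missing


# A subset font that numbers its glyphs in order of first use (made-up data).
to_unicode = {1: "P", 2: "D", 3: "F"}
text, unknown = decode_glyphs([1, 2, 3, 7], to_unicode)
```

Here glyph 7 has no entry, so `unknown` contains it and the text ends with a replacement character. For such glyphs a converter has to fall back on other evidence, such as the glyph's shape, which brings it back to recognition.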

Even when the presence of characters and their codes have been established one way or another, the order in which they are output to the page very often does not match the order in which they are placed and read on it. On a two-column page, for example, the text output commands for the right and left columns can be mixed arbitrarily. Such a page has to be divided into regions that each contain logically connected text, which is also a task that OCR applications have been solving for many years. Structural information from tagged PDFs helps somewhat, but often, even for PDFs produced today, this information is either absent (as when printing through a PDF printer) or not complete enough.
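A minimal Python sketch of restoring reading order on a two-column page follows. It assumes that the column boundary is already known to be the middle of the page and that y grows upward; finding the columns in the first place is the hard part of real layout analysis and is not attempted here. The `Fragment` type and the sample coordinates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline; larger values are higher on the page


def reading_order(fragments, page_width):
    """Read the left column top to bottom, then the right column."""
    middle = page_width / 2
    left = [f for f in fragments if f.x < middle]
    right = [f for f in fragments if f.x >= middle]

    def top_to_bottom(fragment):
        return (-fragment.y, fragment.x)

    ordered = sorted(left, key=top_to_bottom) + sorted(right, key=top_to_bottom)
    return [f.text for f in ordered]


# Fragments in the order a page stream might happen to output them.
fragments = [
    Fragment("R1", 320, 700),
    Fragment("L2", 72, 686),
    Fragment("L1", 72, 700),
    Fragment("R2", 320, 686),
]
```

`reading_order(fragments, 612)` puts the left column before the right one regardless of the output order. A heading that spans both columns, or a table, would already break this simple rule.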

Once we have decided that certain places on the page contain coherent text (and have even worked out where it is grouped into tables, which is a very nontrivial task!), and have found which characters form which lines, we need to turn these lines into paragraphs and into the higher-level elements familiar to users of both word processors and HTML: columns, tables, sidebars. PDF usually contains no data about paragraph formatting, so all these characteristics also have to be computed, just as in recognition. If we ignore text elements more complex than lines or paragraphs and simply put everything into small text frames, we get a document that looks like the real one but is almost impossible to edit. Remember the task of replacing the word “MS” with “Microsoft” throughout the document? It is a very good test of editability. For an editable document it is important that text can flow from one region to another in the cases where it should, and those cases still have to be told apart from the ones where it should not.
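One of the simplest possible ways to compute paragraphs from lines is to look at the vertical gaps between baselines, as in the Python sketch below. It assumes the lines are already in reading order, that the normal line spacing (`leading`) is known, and that a noticeably larger gap starts a new paragraph. Real converters also have to consider indents, alignment, font changes and hyphenation; the function name and the tolerance value are illustrative assumptions.

```python
def group_into_paragraphs(lines, leading, tolerance=0.3):
    """Group (baseline_y, text) pairs, listed top to bottom, into paragraphs.

    A new paragraph starts when the gap to the previous baseline exceeds
    the normal line spacing by more than the given tolerance.
    """
    paragraphs = []
    current = []
    previous_y = None
    for y, text in lines:
        if previous_y is not None and (previous_y - y) > leading * (1 + tolerance):
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        previous_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Baselines 14 units apart, with one larger gap of 26 units (made-up data).
lines = [
    (700, "PDF stores"),
    (686, "positions only."),
    (660, "Paragraphs are"),
    (646, "inferred."),
]
```

`group_into_paragraphs(lines, leading=14)` yields two paragraphs. Once lines are joined into paragraphs like this, replacing “MS” with “Microsoft” can reflow the text instead of overflowing a fixed frame.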

Only after doing all of this can the contents of a PDF be turned into a file in an editable format that looks like the original and is convenient to work with. Of course, over the years many smart people at different companies have learned to solve each of these tasks well or even perfectly, but I have yet to see an ideal solution to the problem as a whole. But we are working on it :)

Vyacheslav Sapronenko SlaSapro
OCR

Source: https://habr.com/ru/post/108459/
