Extract Fonts from PDF

Let me say up front that there is no better source of information on the format than the multi-megabyte PDF Reference from the Adobe site. For those who write in C++ there is a ready-made solution, Xpdf; on Linux it is the most full-featured replacement for Adobe products. Russian-language materials on this topic are superficial and good only for a first acquaintance, not for practical work. I assume you are already familiar with them, or better still with the PDF Reference. I decided to describe a specific, simplified example of extracting TrueType fonts from a PDF file, because this question comes up often on the web and remains unanswered. I know of only one program that does this, and it is buggy and comes without source code. A reminder: using the extracted fonts is not always legal; in general you may only use an embedded font to display the text of the document it came from.

Anyone who has looked into the question knows that a PDF consists of a header, a body, a cross-reference table (XRef) and a trailer. All elements except the header can be scattered in parts and in several copies throughout the document. First you need to read the XRef table; I recommend wrapping it in a class of its own. To find the address of the table, read the file from the end until you meet the %%EOF marker, then keep reading backwards to the startxref keyword. The number that follows this keyword is what we need.
Here is an example of the end of a file:

startxref
173
%%EOF
The number 173 is the offset from the beginning of the file to the beginning of the first XRef table we read. Moving to this point, we see something like this:
xref
7628 42
0000000016 00000 n
0000001195 00000 f
etc.
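
The backwards search for the table offset described above can be sketched in Python as follows. This is only an illustration: the function name find_startxref is my own, and it assumes the whole file has already been read into memory as bytes.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Search from the end of the file for the %%EOF marker.
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker found")
    # Keep going backwards to the startxref keyword that precedes it.
    kw = data.rfind(b"startxref", 0, eof)
    if kw == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the decimal number between the keyword and %%EOF.
    return int(data[kw + len(b"startxref"):eof].split()[0])


sample = b"%PDF-1.4\n...body...\nstartxref\n173\n%%EOF\n"
print(find_startxref(sample))  # 173
```

For large files it is enough to read only the last kilobyte or so before searching, since both markers sit at the very end.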

Ignore 7628 for now (it is the number of the first object described in this part of the table; in this example that object holds the page count, among many other things). And 42 is the number of records in this part of the table. The rest is simple: read the first field into a 10-byte buffer, skip the space, read the second field into a 5-byte buffer, skip another space and read a single character. Repeat 42 times (each record is exactly 20 bytes including the end-of-line marker). Converted to integers, the two fields are the offset from the beginning of the file to the object and the object's generation number. The last character is interpreted as follows: n means the object is in use, f means it is free. As I said, though, the XRef table may have continuations elsewhere in the file. How do we find them? The table is always followed by the trailer keyword. When you meet it, look for the /Prev key in the trailer: if it is present, it is followed by the offset of the next table to read.
/Prev 4025745

In this way we read all the tables if there is more than one. Reading can stop when the next trailer has no /Prev key. Another sign of the last table is that it begins with the entry 0000000000 65535 f. Note that we read the tables in reverse order: the last one we read is the first that was written when the document was created, and the first one we read was added by the most recent edit.
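
Reading one table and then walking the /Prev chain could look roughly like the sketch below. The names parse_xref_section and read_all_xref are hypothetical. The code assumes a classic text XRef table (not the compressed cross-reference streams found in newer files) and looks for /Prev with a plain regular expression between the trailer and startxref keywords instead of parsing the trailer dictionary properly.

```python
import re

SUBSECTION = re.compile(rb"\s*(\d+)\s+(\d+)\s*")       # e.g. "7628 42"
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s*")     # e.g. "0000000016 00000 n"


def parse_xref_section(data: bytes, offset: int):
    """Parse the table at offset; return (entries, prev_offset or None).

    entries maps object number -> (file offset, generation, in_use).
    """
    if data[offset:offset + 4] != b"xref":
        raise ValueError("expected the xref keyword")
    pos = offset + 4
    entries = {}
    while True:
        head = SUBSECTION.match(data, pos)
        if not head:
            break  # no further subsection header, the trailer comes next
        first, count = int(head.group(1)), int(head.group(2))
        pos = head.end()
        for i in range(count):
            rec = ENTRY.match(data, pos)
            if not rec:
                raise ValueError("malformed xref entry")
            entries[first + i] = (int(rec.group(1)), int(rec.group(2)),
                                  rec.group(3) == b"n")
            pos = rec.end()
    # Simplification: scan the text between "trailer" and "startxref".
    start = data.find(b"trailer", pos)
    end = data.find(b"startxref", start) if start != -1 else -1
    trailer = data[start:end] if end != -1 else b""
    prev = re.search(rb"/Prev\s+(\d+)", trailer)
    return entries, (int(prev.group(1)) if prev else None)


def read_all_xref(data: bytes, start: int):
    """Follow the /Prev chain from the newest table back to the oldest."""
    table, seen = {}, set()
    offset = start
    while offset is not None and offset not in seen:
        seen.add(offset)  # guard against a corrupt file with a /Prev loop
        entries, offset = parse_xref_section(data, offset)
        for number, record in entries.items():
            # Tables are read newest first, so the first record seen wins.
            table.setdefault(number, record)
    return table
```

Because the newest table is read first, setdefault keeps the most recent record for each object and ignores older ones for the same number.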

With this data we can navigate to any indirect object in the document. True, there are also direct objects, whose addresses are not listed in the XRef, but more on that later. Now we can iterate over the document's objects, check their type and do with them whatever our heart desires. An object looks like this:

7626 0 obj
object content
endobj
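
As a rough sketch, reading the object that starts at an offset taken from the XRef table might look like this. The function name read_object is made up for the example. Note the shortcut: it simply searches for the next endobj, which can misfire if those bytes happen to occur inside binary stream data, so a more careful reader would rely on the stream's /Length instead.

```python
import re

OBJ_HEADER = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj\b")


def read_object(data: bytes, offset: int):
    """Return (number, generation, body) for the object at offset."""
    head = OBJ_HEADER.match(data, offset)
    if not head:
        raise ValueError("no object header at offset %d" % offset)
    # Naive assumption: the first endobj after the header closes this object.
    end = data.find(b"endobj", head.end())
    if end == -1:
        raise ValueError("missing endobj")
    body = data[head.end():end].strip()
    return int(head.group(1)), int(head.group(2)), body


pdf = b"%PDF-1.4\n7626 0 obj\n<< /Type /Font >>\nendobj\n"
print(read_object(pdf, 9))  # (7626, 0, b'<< /Type /Font >>')
```

The returned number and generation can then be compared with the XRef record that led to this offset.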

7626 is the number (name) of the object, and 0 is the generation number, which must match the value in the XRef table for this object. As I understand it, the generation number grows when an object is deleted and its number is later reused. We are going to look for fonts, and for that we need to read the object's dictionary, which is a sequence of tokens enclosed in << ... >>. The dictionary elements have a structure such as:

/FirstChar 32

where the word after the slash is the key and what follows the space is the value. When parsing, remember that a value may contain any data with any nesting, including other dictionaries. So recursion is your friend, although you can get by without it if the task is only to extract fonts. A value may also include nested or non-nested elements of the following types:

(...) - text strings
<...> - hex strings
[...] - arrays
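
To show what the recursive approach might look like, here is a small sketch of a dictionary parser in Python. All names (TOKEN, parse_value, parse_dictionary) are my own. It is deliberately simplified: values are left as raw bytes, text strings with nested parentheses are not supported, and malformed input is not handled gracefully.

```python
import re

TOKEN = re.compile(rb"\s*(" + rb"|".join([
    rb"<<|>>",                   # dictionary delimiters
    rb"\[|\]",                   # array delimiters
    rb"/[^\s/<>\[\]()]*",        # names such as /Type
    rb"\((?:\\.|[^\\()])*\)",    # text strings without nested parentheses
    rb"<[0-9A-Fa-f\s]*>",        # hex strings
    rb"[^\s/<>\[\]()]+",         # numbers and keywords such as R or true
]) + rb")")


def parse_value(tokens, i):
    """Parse the value starting at tokens[i]; return (value, next index)."""
    tok = tokens[i]
    if tok == b"<<":                      # nested dictionary: recurse
        result = {}
        i += 1
        while tokens[i] != b">>":
            key = tokens[i]
            result[key], i = parse_value(tokens, i + 1)
        return result, i + 1
    if tok == b"[":                       # array: recurse on each item
        items = []
        i += 1
        while tokens[i] != b"]":
            item, i = parse_value(tokens, i)
            items.append(item)
        return items, i + 1
    # Three tokens "N G R" form a reference to another object.
    if (tok.isdigit() and i + 2 < len(tokens)
            and tokens[i + 1].isdigit() and tokens[i + 2] == b"R"):
        return (int(tok), int(tokens[i + 1])), i + 3
    return tok, i + 1                     # name, number or string as raw bytes


def parse_dictionary(data: bytes):
    """Parse data that starts with a << ... >> dictionary."""
    value, _ = parse_value(TOKEN.findall(data), 0)
    return value


font = parse_dictionary(
    b"<< /Type /Font /FirstChar 32 /Widths [600 600] /FontDescriptor 1675 0 R >>")
print(font[b"/FontDescriptor"])  # (1675, 0)
```

References come back as (number, generation) tuples, so they are easy to tell apart from plain values.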

A simple value continues up to the next slash or line feed. To identify a font object you need to find this combination in the dictionary:

/Type /Font
Now we filter TrueType fonts by the presence of this sequence in the dictionary:
/Subtype /TrueType
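
Without a full parser, the two checks above can be approximated with regular expressions. This is a sketch under that assumption; the function name is_truetype_font is hypothetical, and the search does not respect nesting, so a match inside a nested dictionary would also count.

```python
import re

# The lookahead keeps /Font from matching longer names like /FontDescriptor.
FONT_TYPE = re.compile(rb"/Type\s*/Font(?![A-Za-z0-9])")
TRUETYPE = re.compile(rb"/Subtype\s*/TrueType(?![A-Za-z0-9])")


def is_truetype_font(dictionary: bytes) -> bool:
    """Check an object's dictionary text for /Type /Font and /Subtype /TrueType."""
    return bool(FONT_TYPE.search(dictionary) and TRUETYPE.search(dictionary))


print(is_truetype_font(b"<< /Type /Font /Subtype /TrueType /FirstChar 32 >>"))  # True
print(is_truetype_font(b"<< /Type /FontDescriptor /FontFile2 1676 0 R >>"))     # False
```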

The rest of the keys can be ignored, because we only want to extract the fonts. But most likely we will not find the font itself in this object, only a set of keys we have no use for. We do need one of them:

/FontDescriptor 1675 0 R

If there is no such key, the font is external and not embedded in the document. Here 1675 is the number of the object, next comes its generation number, and the symbol R indicates that this is a reference. We have already read the XRef table, so we can get to the font data by looking up the offset of object number 1675. However, this variant is also possible:

/FontDescriptor << dictionary and/or font data >>
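
Telling the two variants apart can be sketched like this. The helper find_reference is a made-up name: it returns the object number and generation when the key holds a reference, and None when the key is missing or holds a direct value such as an inline dictionary.

```python
import re


def find_reference(dictionary: bytes, key: bytes):
    """Return (number, generation) if /key is followed by 'N G R', else None."""
    pattern = (rb"/" + re.escape(key)
               + rb"\s+(\d+)\s+(\d+)\s+R(?![A-Za-z0-9])")
    found = re.search(pattern, dictionary)
    return (int(found.group(1)), int(found.group(2))) if found else None


font = b"<< /Type /Font /Subtype /TrueType /FontDescriptor 1675 0 R >>"
print(find_reference(font, b"FontDescriptor"))  # (1675, 0)

inline = b"<< /Type /Font /FontDescriptor << /Type /FontDescriptor >> >>"
print(find_reference(inline, b"FontDescriptor"))  # None
```

The same helper works for any other key whose value is a reference.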

Assume that we have followed the reference, or are already looking at the direct object. Its dictionary should contain this key:

/Type /FontDescriptor

This object also holds a lot of useful information about the font, but once again the font itself is not there. Not my fault: address all complaints to Adobe. We need this key:

/FontFile2 1676 0 R

A familiar construction. Go to the next object. If we did everything right, it is a stream object. It consists of a stream dictionary and binary data enclosed between the stream ... endstream keywords. I must say here that the presence of binary data rules out ready-made text parsers. I tried many and had to write my own from scratch. The binary data can be read in one go, because the stream dictionary contains the /Length key with the stream length. If you try to save the extracted stream to a file with a .ttf extension, the system will declare that it is in no way a font. That is right: it has to be decompressed first.

The font is usually compressed with zip (deflate); to be sure, check that the /Filter key has the value /FlateDecode. If we work in Delphi, the standard ZLib unit will do. The buffer size for the decompressed data can be taken from the stream dictionary's /Length1 key. You should also know that a font embedded in a document often contains only the glyphs that are actually used in the document.
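
The last two steps, cutting out the stream and decompressing it, might be sketched in Python with the standard zlib module. The function name extract_font_stream is hypothetical. The sketch assumes that /Length is written as a plain number (not a reference to another object) and that FlateDecode is the only filter applied.

```python
import re
import zlib


def extract_font_stream(obj_body: bytes) -> bytes:
    """Return the decoded font data from the body of a stream object."""
    start = obj_body.find(b"stream")
    if start == -1:
        raise ValueError("not a stream object")
    header = obj_body[:start]  # the stream dictionary
    length = re.search(rb"/Length\s+(\d+)(\s+\d+\s+R)?", header)
    if not length or length.group(2):
        raise ValueError("this sketch needs a direct /Length value")
    # The data begins after the end-of-line that follows the stream keyword.
    pos = start + len(b"stream")
    if obj_body[pos:pos + 2] == b"\r\n":
        pos += 2
    elif obj_body[pos:pos + 1] == b"\n":
        pos += 1
    raw = obj_body[pos:pos + int(length.group(1))]
    if re.search(rb"/Filter\s*\[?\s*/FlateDecode", header):
        raw = zlib.decompress(raw)
    # /Length1 gives the size of the decompressed font; use it as a check.
    length1 = re.search(rb"/Length1\s+(\d+)", header)
    if length1 and int(length1.group(1)) != len(raw):
        raise ValueError("decoded size does not match /Length1")
    return raw
```

The returned bytes can then be written to a file opened in binary mode, for example with open("font.ttf", "wb").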

I think that with these outlines you can take a hex viewer in one hand, the PDF Reference in the other and build your own Acrobat Reader.

Source: https://habr.com/ru/post/116025/
