(First line \ First line \nSecond line with brackets \(\))
is displayed as:
First line First line
Second line with brackets ()
Any character can also be written as an octal code (\053), as a separate two-digit hex code (<2B>), or as a sequence of hex codes (<54776F20>). For example, the following lines are equivalent:
(Two + two = four.)
(Two \053 two \075 four.)
(Two <2B> two <3D> four.)
(<54776F202B2074776F203D20> four.)
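The escapes described above can be undone with a few lines of PHP. This is a minimal sketch, not the article's code: the function names decodeLiteralString and decodeHexString are hypothetical, and the first one expects only the text between the outer parentheses.

```php
<?php
// Hypothetical helper: decode the inside of a PDF literal string,
// i.e. the text between the outer parentheses.
function decodeLiteralString(string $body): string
{
    // Single-character escapes and the bytes they stand for.
    $named = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];

    $decoded = preg_replace_callback(
        // A backslash followed by 1-3 octal digits, a line break, or any other character.
        '/\\\\([0-7]{1,3}|\r\n|\r|\n|.)/s',
        function (array $m) use ($named): string {
            $escape = $m[1];
            if (preg_match('/^[0-7]+$/', $escape)) {
                return chr(octdec($escape) & 0xFF); // octal code, e.g. \053
            }
            if ($escape === "\r\n" || $escape === "\r" || $escape === "\n") {
                return ''; // backslash + line break: the line simply continues
            }
            return $named[$escape] ?? $escape; // unknown escape: drop the backslash
        },
        $body
    );

    return $decoded ?? $body;
}

// Hypothetical helper: decode the inside of a hex string, i.e. the text between < and >.
function decodeHexString(string $hex): string
{
    $hex = preg_replace('/\s+/', '', $hex);
    if (!ctype_xdigit($hex)) {
        return '';
    }
    if (strlen($hex) % 2 === 1) {
        $hex .= '0'; // assumption: an odd final digit is padded with 0
    }
    return hex2bin($hex);
}

echo decodeLiteralString('Two \053 two \075 four.'), "\n"; // prints "Two + two = four."
```

Both functions work on a single string; finding where strings start and end inside a content stream is a separate step.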
An array is written in square brackets, for example [(Hello,)10(world!)]. Arrays sometimes contain text strings.
A dictionary is enclosed in << and >>:
<< /Length 681 /Filter /FlateDecode >>
In PHP terms it would look roughly like this:
$dictionary = array(
    "Length" => "681",
    "Filter" => true,
    "FlateDecode" => true
);
Streams are delimited by the keywords stream and endstream. Any binary data, whether compressed text, an image, or an embedded font, is stored as a stream. A stream always sits inside an object (described just below) and is characterized, at a minimum, by its length (the /Length N entry in the dictionary) and very often by its compression method (for example, /Filter /FlateDecode). PDF supports quite a few stream filters (including the encryption filter /Crypt), but we are interested in only three: the most commonly used Flate (zlib/deflate compression, the algorithm behind gzip), and the rarer ASCII Hex (the data written as a hexadecimal string terminated by the character >) and ASCII 85 (an encoding in which each 4 bytes of the source data are written as 5 characters from ! to u
in the ASCII table).
An object is delimited by the keywords obj and endobj. Each object has an ID within the document by which it can be referenced. We are interested first of all in objects that contain streams (do not forget the main task), which almost always carry a set of additional options in the form of a dictionary. Here is a typical example of an object inside a PDF file (with uncompressed stream content):
2 0 obj
<< /Length 9 2 R >>
stream
BT
/F1 12 Tf
72 712 Td
(A short text stream.) Tj
ET
endstream
endobj
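To make the object layout concrete, here is a minimal sketch that splits one object into its dictionary options and its raw stream. The function names are hypothetical, and the dictionary parser is deliberately naive: it assumes a flat dictionary, treats every /Name as a separate key, and stores either the text that follows it or true.

```php
<?php
// Hypothetical helper: read the options of a flat << ... >> dictionary.
// Nested dictionaries and array values are NOT handled in this sketch.
function parseObjectDictionarySketch(string $object): array
{
    if (!preg_match('/<<(.*?)>>/s', $object, $m)) {
        return [];
    }

    $options = [];
    // Every "/" starts a new name, so splitting on it yields one entry per name.
    foreach (explode('/', $m[1]) as $entry) {
        $entry = trim(preg_replace('/\s+/', ' ', $entry));
        if ($entry === '') {
            continue;
        }
        $parts = explode(' ', $entry, 2);
        // "Length 681" becomes Length => "681"; a bare name becomes name => true.
        $options[$parts[0]] = $parts[1] ?? true;
    }
    return $options;
}

// Hypothetical helper: return the bytes between "stream" and "endstream".
// Surrounding whitespace is trimmed, which is fine for text but could
// damage binary data that happens to start or end with whitespace bytes.
function extractStreamSketch(string $object): ?string
{
    if (!preg_match('/stream\s+(.*?)\s*endstream/s', $object, $m)) {
        return null;
    }
    return $m[1];
}

$object = "2 0 obj\n<< /Length 681 /Filter /FlateDecode >>\nstream\nRAW\nendstream\nendobj";
print_r(parseObjectDictionarySketch($object));
```

With this naive splitting, /Filter /FlateDecode ends up as two flags set to true rather than as a key with a value, which matches the PHP array shown for the dictionary earlier.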
Filters can also be combined (/Filter /FlateDecode /ASCIIHexDecode).
Now we need a real example. Here it is: the poem “Parus” by Mikhail Yuryevich Lermontov in PDF format (the document was created on Acrobat.com from the odt file used in the previous article).
From the dictionary of its stream object we learn the stream length (/Length 681) and that the stream is compressed (/Filter) with gzip (/FlateDecode). That is already enough information to decompress the data stream, and gzuncompress will do it:
0.1 w
q 0 -0.1 612.1 792.1 re W* n
q 0 0 0 RG 0 0 0 rg
BT
2 Tr 0.59999 w
56.8 716.6 Td /F1 18 Tf
[<01> 17 <02> 10 <03> 10 <04> 17 <05>] TJ
ET
Q
q 0 0 0 rg
BT
56.8 682.5 Td /F1 11 Tf
[<06> 9 <07> 11 <08> 6 <07> 11 <07> 11 <09> 13 <0A> 4 <0B> 14 <0C> 11 <0D> 11 <0E> 9 <0F> 9 <0A> 4 <10> 11 <11> 10 <12> 23 <13> 6 <10> 11 <14> 10 <10> 11 <15>] TJ
ET
... a lot of text ...
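The three filters mentioned earlier can be undone as sketched below. This is an illustration under stated assumptions, not the article's implementation: all function names are hypothetical, the zlib extension is assumed to be loaded (for gzuncompress), 64-bit integers are assumed, and the filters are passed in the order in which they must be decoded.

```php
<?php
// All function names below are hypothetical, chosen for this sketch.

// ASCII Hex: pairs of hex digits, whitespace ignored, ">" ends the data.
function asciiHexDecodeSketch(string $data): string
{
    $end = strpos($data, '>');
    if ($end !== false) {
        $data = substr($data, 0, $end);
    }
    $hex = preg_replace('/[^0-9A-Fa-f]/', '', $data);
    if (strlen($hex) % 2 === 1) {
        $hex .= '0'; // an odd final digit is padded with 0
    }
    return $hex === '' ? '' : hex2bin($hex);
}

// Combine up to five base-85 digits into one integer.
function ascii85GroupValue(array $digits): int
{
    $value = 0;
    foreach ($digits as $digit) {
        $value = $value * 85 + $digit;
    }
    return $value;
}

// ASCII 85: five characters from "!" to "u" encode four bytes, "~>" ends the data.
function ascii85DecodeSketch(string $data): string
{
    $end = strpos($data, '~>');
    if ($end !== false) {
        $data = substr($data, 0, $end);
    }
    $out = '';
    $group = [];
    foreach (str_split($data) as $char) {
        if ($char === 'z' && $group === []) {
            $out .= "\0\0\0\0"; // "z" is shorthand for four zero bytes
            continue;
        }
        $digit = ord($char) - 33;
        if ($digit < 0 || $digit > 84) {
            continue; // whitespace and anything outside "!".."u" is skipped
        }
        $group[] = $digit;
        if (count($group) === 5) {
            $out .= pack('N', ascii85GroupValue($group));
            $group = [];
        }
    }
    // A final partial group of n characters yields n - 1 bytes.
    $n = count($group);
    if ($n > 1) {
        $padded = array_pad($group, 5, 84);
        $out .= substr(pack('N', ascii85GroupValue($padded)), 0, $n - 1);
    }
    return $out;
}

// Apply the listed filters one after another; returns "" if inflating fails.
function decodeStreamSketch(string $stream, array $filters): string
{
    foreach ($filters as $filter) {
        if ($filter === 'ASCIIHexDecode') {
            $stream = asciiHexDecodeSketch($stream);
        } elseif ($filter === 'ASCII85Decode') {
            $stream = ascii85DecodeSketch($stream);
        } elseif ($filter === 'FlateDecode') {
            $inflated = @gzuncompress($stream);
            if ($inflated === false) {
                return '';
            }
            $stream = $inflated;
        }
    }
    return $stream;
}

// Round trip: compress a small content stream, hex-encode it, then decode it again.
$packed = bin2hex(gzcompress('BT (Hi) Tj ET')) . '>';
echo decodeStreamSketch($packed, ['ASCIIHexDecode', 'FlateDecode']), "\n"; // prints "BT (Hi) Tj ET"
```

Unknown filter names are silently skipped here; a stricter version could treat them as an error instead.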
Text in the stream is enclosed between the markers BT (beginning of text) and ET (end of text). Inside such a block the text itself is output by operators such as TJ (display text taking into account individual character positioning). These operators come after a text string or an array of strings, as in this case ([<01>17<02>10<03>10<04>17<05>]TJ).
The fragment above therefore contains two text lines:
1. <01> 17 <02> 10 <03> 10 <04> 17 <05>
2. <06> 9 <07> 11 <08> 6 <07> 11 <07> 11 <09> 13 <0A> 4 <0B> 14 <0C> 11 <0D> 11 <0E> 9 <0F> 9 <0A> 4 <10> 11 <11> 10 <12> 23 <13> 6 <10> 11 <14> 10 <10> 11 <15>
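Pulling the character codes out of such TJ arrays takes one regular expression per level. The sketch below is an illustration only: extractTJCodes is a hypothetical name, literal strings in parentheses are ignored, and a ] inside a literal string would confuse the outer pattern.

```php
<?php
// Hypothetical helper: return, for every [...] TJ array in a content stream,
// the list of hex codes it contains. The positioning numbers between the
// codes are dropped.
function extractTJCodes(string $content): array
{
    $lines = [];
    if (preg_match_all('/\[(.*?)\]\s*TJ/s', $content, $arrays)) {
        foreach ($arrays[1] as $array) {
            preg_match_all('/<\s*([0-9A-Fa-f]+)\s*>/', $array, $codes);
            $lines[] = array_map('strtoupper', $codes[1]);
        }
    }
    return $lines;
}

$content = 'BT 56.8 716.6 Td /F1 18 Tf [<01> 17 <02> 10 <03> 10 <04> 17 <05>] TJ ET';
print_r(extractTJCodes($content));
```

For the title line of the example this yields a single list with the codes 01, 02, 03, 04 and 05.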
The word ПАРУС is encoded as 01 02 03 04 05, and Белеет as 06 07 08 07 07 09 ...
Another stream in the same file holds the following character map:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> endcodespacerange
45 beginbfchar
<01> <041F>
<02> <0410>
<03> <0420>
<04> <0423>
<05> <0421>
<06> <0411>
<07> <0435>
<08> <043B>
<09> <0442>
... many lines of transformations ...
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Do <01>, <02> and so on look familiar? No wonder: we saw them a little earlier in the text lines. Suppose we have to replace 01 with 041F; let's look at what is hiding behind this number. Hooray! &#x041F; = П! We have found the transformation of one character into another; now let's turn to the documentation and find out a little more.
The transformation between beginbfchar and endbfchar is the simplest one. It maps one code to another. For example, above we learned that 01 hides the code of the character П. But this is only a special case of this transformation: a code can be mapped to a whole string of up to 512 hex digits (that is, up to 128 Unicode characters).
The second transformation sits between beginbfrange and endbfrange. It works not with individual characters but with ranges of them, and it comes in two forms:
- <0000> <005E> <0020>: we take the range from 0000 to 005E, and each of its values is mapped to a value in the interval from 0020 to 007E. Noticed the principle? 0000 is converted to 0020, 0001 to 0021, 0002 to 0022, and so on;
- <005F> <0061> [<00660066> <00660069> <00660066006C>]: each value from 005F to 0061 (the only one in between is 0060) is replaced with the corresponding element of the array in square brackets: 005F is replaced with 0066 0066 (i.e., ff), 0060 with fi, and 0061 with ffl.
You can get the code with comments on GitHub.
function pdf2text($filename) {
    // Read the PDF file into a string. file_get_contents() is binary-safe,
    // so the binary streams inside the file survive intact.
    $infile = @file_get_contents($filename);
    if (empty($infile))
        return "";

    // First pass. We need to get all the text data from the file.
    // At this stage we collect only the "dirty" data: with positioning,
    // hex inserts and so on.
    $transformations = array();
    $texts = array();

    // First, get a list of all the objects in the PDF file.
    preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
    $objects = @$objects[1];

    // Walk through what was found: besides text we may come across plenty of
    // other things that are not always "tasty", for example fonts.
    // The helper functions called below are part of the full source on GitHub.
    for ($i = 0; $i < count($objects); $i++) {
        $currentObject = $objects[$i];

        // Check whether the current object contains a data stream; it is
        // almost always compressed with gzip.
        if (preg_match("#stream(.*)endstream#ismU", $currentObject, $stream)) {
            $stream = ltrim($stream[1]);

            // Read the options of this object. We are only interested in text
            // data, so we filter minimally to speed things up.
            $options = getObjectOptions($currentObject);
            if (!(empty($options["Length1"]) && empty($options["Type"]) && empty($options["Subtype"])))
                continue;

            // So we have "possible" text; decode it from its binary
            // representation. After this step we deal only with plain text.
            $data = getDecodedStream($stream, $options);
            if (strlen($data)) {
                // Now look for a text container in the current stream.
                // If one is found, its "dirty" text is added to the texts
                // collected so far.
                if (preg_match_all("#BT(.*)ET#ismU", $data, $textContainers)) {
                    $textContainers = @$textContainers[1];
                    getDirtyTexts($texts, $textContainers);
                // Otherwise, try to find character transformations,
                // which will be used in the second pass.
                } else
                    getCharTransformations($transformations, $data);
            }
        }
    }

    // Once the initial parsing of the PDF document is finished, process the
    // collected text blocks using the character transformations and return
    // the result.
    return getTextUsingTransformations($texts, $transformations);
}
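The listing above hands the character transformations to getCharTransformations and getTextUsingTransformations, which are not reproduced here. As a rough illustration of the bfchar and bfrange rules described earlier, the following sketch builds a lookup table and applies it. All function names are hypothetical, this is not the code from GitHub, surrogate pairs are not handled, and in the first bfrange form the start value is treated as a single code.

```php
<?php
// Hypothetical helpers for this sketch; they are not the article's functions.

// Encode one Unicode code point as UTF-8.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Turn a hex string of 4-digit units such as "00660066" into UTF-8 ("ff").
function hexUnitsToUtf8(string $hex): string
{
    $out = '';
    foreach (str_split($hex, 4) as $unit) {
        $out .= codePointToUtf8(hexdec($unit));
    }
    return $out;
}

// Build a table "source code => UTF-8 text" from bfchar and bfrange sections.
function buildCharMap(string $cmap): array
{
    $h = '<\s*([0-9A-Fa-f]+)\s*>'; // one <hex> token
    $map = [];

    if (preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections)) {
        foreach ($sections[1] as $section) {
            preg_match_all("/$h\s*$h/", $section, $pairs, PREG_SET_ORDER);
            foreach ($pairs as $pair) {
                $map[hexdec($pair[1])] = hexUnitsToUtf8($pair[2]);
            }
        }
    }

    if (preg_match_all('/beginbfrange(.*?)endbfrange/s', $cmap, $sections)) {
        foreach ($sections[1] as $section) {
            preg_match_all("/$h\s*$h\s*(?:$h|\[(.*?)\])/s", $section, $rows, PREG_SET_ORDER);
            foreach ($rows as $row) {
                $from = hexdec($row[1]);
                $to = hexdec($row[2]);
                if ($row[3] !== '') {
                    // Form 1: <from> <to> <start>, consecutive target codes.
                    $start = hexdec($row[3]);
                    for ($code = $from; $code <= $to; $code++) {
                        $map[$code] = codePointToUtf8($start + $code - $from);
                    }
                } else {
                    // Form 2: <from> <to> [<a> <b> ...], one array item per code.
                    preg_match_all("/$h/", $row[4], $items);
                    foreach ($items[1] as $i => $hex) {
                        if ($from + $i <= $to) {
                            $map[$from + $i] = hexUnitsToUtf8($hex);
                        }
                    }
                }
            }
        }
    }
    return $map;
}

// Replace every hex code with its mapped text; unknown codes are dropped.
function decodeCodes(array $codes, array $map): string
{
    $text = '';
    foreach ($codes as $code) {
        $text .= $map[hexdec($code)] ?? '';
    }
    return $text;
}

$cmap = "5 beginbfchar <01> <041F> <02> <0410> <03> <0420> <04> <0423> <05> <0421> endbfchar";
echo decodeCodes(['01', '02', '03', '04', '05'], buildCharMap($cmap)), "\n"; // prints "ПАРУС"
```

Keying the table by the numeric value of the source code means that <01> and <0001> resolve to the same entry, which is a simplification that suits one-byte codes like the ones in the example.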
Of course, one could simply call a ready-made utility:
$content = shell_exec('/usr/local/bin/pdftotext ' . escapeshellarg($filename) . ' -');
But in this case the task was to read PDF on any platform.
Source: https://habr.com/ru/post/69568/