
A saga of batch converting pdf to text

Last year I took on what looked, at first glance, like a simple job: build a system for batch processing files that contain a 12-column table and export the data to a database. It would all have been fine, except that the files turned out to be pdf documents, and the customer insisted that no other format could be provided.

[Image: a sample of that pdf. The structure is preserved in the file, but all the data has been scrubbed.]

Despite warnings from knowledgeable people (and they did not warn in vain), I took the job and lived through the following adventure.


The search for software for batch processing pdf files began. The work was done on linux, so the first contender was pdftotext from the poppler package, the tool most often used for this purpose.

pdftotext


The tool is easy to call from the command line:

$ pdftotext .pdf .txt


But the result of the conversion was very strange:

Extract from the registry
deals

on the basis of bidding
ZAO St. Petersburg
International
Commodity - raw exchange »

SET-VRS form

Date of bidding: 01/01/1900
Section: Section “Petroleum Products” CJSC SPbMTSB
Member: LLC
Participant code: 000000000000
Name and code of the client: OJSC / 000000000000
Tool code: 0000000001
Tool Description: Gasoline Premium Euro 95 type II class B,
Number
deals

Time
deals

Application number

Purchase / Transaction Price (per Quantity
sale¹
one ton)
lots

01-01

01-02

01-03

S

01-05

5

02-01

02-02

02-03

S

02-05

1

Total to 0000000001

and so on…

Worse still, the layout of the output changed from file to file. Writing a parser to pull the table data out of text like this was out of the question.

So a second candidate appeared on the horizon, this time a proprietary one: Adobe Acrobat Reader for linux.

Acrobat Reader


Here I decided to start by testing its “File - Save as Text” feature in graphical mode, without digging into the command line. That turned out to be the right call: the result of converting my pdf in this program did not inspire optimism either:

Participant code:
Section:
Tool code:
Tool code:
000000000000
Ltd
0000000001
0000000002
00,000,000.11 00,000.12
0,000,000.21 0,000.22
01-12
02-12
03-12
04-12
ZAO St. Petersburg
International
Commodity - raw exchange »
Extract from the registry
deals
on the basis of bidding
Date of bidding: 01/01/1900
SET-VRS form
Name and code of the client: OJSC / 000000000000
Participant:
Section “Oil products” CJSC “SPIMTSB”
Tool Description: Gasoline Premium Euro 95 type II class B,
Code
trader
Time
deals
Number
deals
Application number
Total to 0000000001
Tool Description: Diesel fuel winter Z-0.2-35.
Code
trader
Time
deals
Number
deals
Application number
Total to 0000000002

and so on…

Parsing output like this is not appealing either.

The third and, as it turned out, final candidate was pdfedit.

pdfedit


In every repository it is packaged as a graphical program that can also run scripts from the command line. As with Acrobat Reader, the first pdf-to-text conversion was run from the graphical “File - Save as Text” menu, and the result was very pleasing:

 
  ZAO St. Petersburg Extract from the Register Form SET-VRS 
  International deals 
  Commodity and Raw Materials Exchange "at the end of the auction 
                               Date of bidding: 01/01/1900 
  Section: Section "Petroleum Products" CJSC SPbMTSB 
  Member: LLC 
  Participant code: 000000000000 
  Name and code of the client: OJSC / 000000000000 
  Tool code: 0000000001 
  Tool Description: Gasoline Premium Euro 95 type II class B, 
    Number Time Application Number 
                      Purchase / Transaction Price (per Quantity 
                                                    VAT Exchange Name 
                                                                                  Code 
                                  CRC transaction amount 
   deals sale 
                        one ton) 
                               lots (18%) collecting counterparty 
                                                                               trader 
  01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 PeSee 
                                                                             01-12 
  02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee 
                                                                             02-12 
   Total to 0000000001 00,000,000.11 00,000.12 
  Tool code: 0000000002 
  Tool Description: Diesel fuel winter Z-0.2-35. 
    Number Time Application Number 
                      Purchase / Transaction Price (per Quantity 
                                                    VAT Exchange Name 
                                                                                  Code 
                                  CRC transaction amount 
   deals sale 
                        one ton) 
                               lots (18%) collecting counterparty 
                                                                               trader 
  03-01 03-02 03-03 S 03-05 1 03-07 03-08 03-09 03-10 PuSE 
                                                                             03-12 
  04-01 04-02 04-03 S 04-05 1 04-07 04-08 04-09 04-10 PgSee 
                                                                             04-12 
   Total to 0000000002 0,000,000.21 0,000.22 
  SPbMTSB Broker Copy of an electronic document 
        ¹ B - purchase, S - sale.  1/3 
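
One wrinkle in the listing above is that the twelfth cell of each row (the trader code) wraps onto a line of its own. A minimal Python sketch of gluing such rows back together follows. It assumes that every data row starts with a token shaped like the "01-01" placeholders and that a complete row has 12 whitespace-separated fields. Both are guesses based on the scrubbed sample, and join_wrapped_rows is a made-up helper name, not part of pdfedit.

```python
import re

# Assumption: a data row starts with a deal number shaped like the "01-01"
# placeholders in the sample; real files may need a different pattern.
ROW_START = re.compile(r"^\d{2}-\d{2}\b")


def join_wrapped_rows(text, expected_fields=12):
    """Merge each short data row with the continuation line that follows it."""
    rows = []
    pending = None  # fields of a row still waiting for its wrapped cell
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if pending is not None:
            if len(pending) + len(fields) == expected_fields:
                rows.append(pending + fields)
                pending = None
                continue
            pending = None  # not a continuation after all: drop the partial row
        if ROW_START.match(fields[0]):
            if len(fields) == expected_fields:
                rows.append(fields)
            elif len(fields) < expected_fields:
                pending = fields
    return rows
```

Header and "Total to ..." lines are ignored because they neither start a row nor complete one. Real counterparty names may contain spaces, in which case plain whitespace splitting is not enough and column offsets would have to be used instead.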
 



The output was very accurate, and parsing it would be easy and pleasant. But no such luck: the road to the end of this epic was a thorny one. In the stock pdfedit build, the “save as text” function could not be called from the command line:

 
  $ pdfedit -console 
  PDFedit 0.4.5 
  Usage: 
   pdfedit -console [function name] [function parameter(s)] 
  The function name is case insensitive; an unambiguous part of it also works. 
  The remaining parameters are passed to the called function. 
  Available functions: 
   Delinearizator 
    Description: Delinearize input file 
    Parameters: [input file] [output file] 
   Flattener 
    Description: Flatten input file (remove all revisions except the last one) 
    Parameters: [input file] [output file]  
 



The console help listed only two utilities, neither related to text conversion. Further study of the documentation on the wiki led to a ready-made conversion script, savealltext.qs, but the catch was that it had to be run from the graphical menu. So I had to study the documentation in detail:

API for creating scripts in pdfedit
create pdfedit script wiki
examples of ready-made scripts

Studying this material, plus a step-by-step analysis of /usr/share/pdfedit/delinearize.qs, the console script that ships with the standard build, led me to write my own savealltext.qs:

/** Print help for savealltext, preceded by an optional message */
function savealltext_help(msg) {
    if (msg) print(msg);
    print(tr("Usage:"));
    print("savealltext [" + tr("input file") + "] [" + tr("output file") + "]");
    print(" " + tr("Input file must exist"));
    print(" " + tr("Output file must not exist"));
    exit(1);
}

/** Report an error and exit */
function savealltext_fail(err) {
    print(tr("savealltext failed!"));
    print(err);
    exit(2);
}

/** Collect the text of every page of pdf file p and save it to file f */
function saveAsText_save(p, f) {
    document = loadPdf(p);
    qs = "";
    pages = document.getPageCount();
    for (i = 1; i <= pages; i++) {
        pg = document.getPage(i);
        text = pg.getText();
        qs += text;
        qs += "\n";
    }
    saveFile(f, qs);
}

p = parameters();
if (p.length != 2) {
    savealltext_help("savealltext " + tr("is expecting two parameters"));
}
inFile = p[0];
outFile = p[1];
if (!exists(inFile)) savealltext_fail(tr("Input file '%1' does not exist").arg(inFile));
if (exists(outFile)) savealltext_fail(tr("Output file '%1' already exist").arg(outFile));
if (inFile == outFile) savealltext_fail(tr("Input and output files must be different"));
// saveAsText_save returns nothing, so just call it and report the conversion
saveAsText_save(inFile, outFile);
print(tr("savealltext") + " :" + inFile + " -> " + outFile);

Once placed in the /usr/share/pdfedit/ directory, the script can be invoked from the command line and performs the required conversion:

 
  $ pdfedit -console 
  PDFedit 0.4.5-20111014140242 
  Usage: 
   pdfedit -console [function name] [function parameter(s)] 
  The function name is case insensitive; an unambiguous part of it also works. 
  The remaining parameters are passed to the called function. 
  Available functions: 
   Delinearizator 
    Description: Delinearize input file 
    Parameters: [input file] [output file] 
   Flattener 
    Description: Flatten input file (remove all revisions except the last one) 
    Parameters: [input file] [output file] 
   savealltext 
    Description: savealltext input file 
    Parameters: [input file] [output file] 
 



$ pdfedit -console savealltext .pdf .txt

Everything seemed fine, but more pitfalls surfaced: it turned out that in console mode pdfedit takes text only from a default area of 612x792 points, which at 72 dpi is a portrait page (8.5 x 11 inches; the source comments call it A4). The program refused to rotate the extraction area, even though the code contains the relevant instructions from the rotate_text_fix.patch patch.
The search for the definition of this tricky "default area" led me to the project's source code, to the file src/kernel/displayparams.h. Of all the places to put it!

static const int DEFAULT_PAGE_RX = 612; /**< Default A4 width on a device with 72 horizontal dpi. */
static const int DEFAULT_PAGE_RY = 792; /**< Default A4 height on a device with 72 vertical dpi. */

I replaced both values with 1584 points (roughly half a Whatman sheet at the same resolution) and rebuilt the project.
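
As a quick sanity check of these numbers, here is a small Python sketch that converts pdf points (1/72 inch) to millimetres. The helper name points_to_mm is made up for illustration; the only inputs are the two constants from displayparams.h and the replacement value.

```python
POINTS_PER_INCH = 72   # pdf user-space units per inch
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Convert a length in pdf points (1/72 inch) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


sizes = {
    "pdfedit default": (612, 792),
    "patched value": (1584, 1584),
}
for label, (width, height) in sizes.items():
    print(f"{label}: {points_to_mm(width):.1f} x {points_to_mm(height):.1f} mm")
```

This prints 215.9 x 279.4 mm for the default area and 558.8 x 558.8 mm for the patched one. An A4 page in landscape is 297 mm wide (about 842 points), so with the default anything to the right of the 612-point mark falls outside the extraction area, while the patched square covers the whole page in either orientation.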

With that, one might say, the torment ended: the pdfs were processed from the terminal across the full width of their A4 landscape pages.
Based on this experience I wrote build rules for pdfedit packages with savealltext for ArchLinux and uploaded them to the AUR:
pdfedit release
pdfedit version from cvs

ps:


The tale is quick to tell, but the deed is slow to do: I did not like that pdfedit needed qt3 on the system to run, even in console mode:

$ pdfedit -console
pdfedit: error while loading shared libraries: libqt-mt.so.3

This led me to look at the project's configure options for building from source, and the following recipe turned up:

configure --disable-gui --enable-pdfedit-core-dev --enable-tools

As a result, the graphical pdfedit binary is not built at all; instead the build produces a whole set of useful utilities, including pdf_to_text, which does the conversion I need using the same algorithms as the savealltext.qs script:

$ pdf_to_text -file .pdf > .txt

I turned this build of the project into a PKGBUILD for ArchLinux as well and uploaded it to the AUR.
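
Since the whole point was batch processing, here is a minimal Python sketch of a driver that runs such a converter over a directory of pdfs. It assumes the tool is invoked as pdf_to_text -file <input> and writes the text to standard output, as in the command quoted in this article. The function names plan_jobs and convert_all and the directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def plan_jobs(src_dir, dst_dir):
    """Pair every *.pdf in src_dir with a .txt path in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [(pdf, dst_dir / (pdf.stem + ".txt"))
            for pdf in sorted(src_dir.glob("*.pdf"))]


def convert_all(src_dir, dst_dir, tool=("pdf_to_text", "-file")):
    """Run the converter on each pdf and return the inputs that failed."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf, txt in plan_jobs(src_dir, dst_dir):
        # The tool is assumed to print the text to stdout, so redirect it.
        with open(txt, "wb") as out:
            result = subprocess.run([*tool, str(pdf)], stdout=out)
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Something like convert_all("incoming", "text") would then convert a whole folder and return the files worth a second look. Keeping the command in the tool parameter makes it easy to swap in another converter that also writes to stdout.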

upd1:


Thank you all for the advice! The discussion brought up several more interesting ways to convert pdf to text.
zeliboba suggested passing the -layout option to pdftotext:
$ pdftotext -layout .pdf .txt

It is no exaggeration to say that, for the project files from this article, this approach converts better than pdfedit:

 
  ZAO St. Petersburg Extract from the Register Form SET-VRS 
  International deals 
  Commodity and Raw Materials Exchange "at the end of the auction 

                                                                          Date of bidding: 01/01/1900 

  Section: Section "Petroleum Products" CJSC SPbMTSB 
  Member: LLC 
  Participant code: 000000000000 
  Name and code of the client: OJSC / 000000000000 

  Tool code: 0000000001 
  Tool Description: Gasoline Premium Euro 95 type II class B, 

   Number Time Order number Purchase / Transaction price (per Quantity VAT Exchange Name Code 
                                                                                   CRC transaction amount 
   transactions sale ¹ one ton) of lots (18%) collection of trader’s counterparty 

  01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 Pay 01-12 
  02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee 02-12 

   Total to 0000000001 00,000,000.11 00,000.12 
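
With -layout each deal sits on a single line, so turning the text into records becomes simple. Below is a minimal Python sketch, not the parser used in the project. The column names are hypothetical, guessed from the table header in the sample, and the code assumes that no cell contains spaces and that commas in amounts are thousands separators.

```python
from decimal import Decimal

# Hypothetical column names, guessed from the table header in the sample.
COLUMNS = [
    "deal_no", "deal_time", "order_no", "side", "price_per_ton", "lots",
    "amount", "vat", "exchange_fee", "clearing_fee", "counterparty",
    "trader_code",
]


def parse_rows(text):
    """Return one dict per line that looks like a 12-field deal row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: the side column holds "B" or "S", as the footnote says.
        if len(fields) == len(COLUMNS) and fields[3] in ("B", "S"):
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def parse_amount(text):
    """Turn a string such as '1,234.50' into a Decimal."""
    return Decimal(text.replace(",", ""))
```

The check on the side column matters: one of the header lines in the sample also happens to split into 12 tokens, and without the check it would be taken for a deal.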
 



On the other hand, this does not hold for every pdf:
“Today I experimented with various pdf files, and it turned out that neither pdfedit nor pdftotext is the outright winner:
pdftotext did better on a series of horizontally elongated tables and on single-column journals.
pdfedit did better on multi-column magazines (for example, linuxformat) and on vertically oriented tables.
In some cases both programs gave the same result (for example, the Zeingaus magazine).”

kornerz suggested converting pdf to svg with inkscape, first splitting the pdf file into pages (for example with pdftk):

$pdftk file.pdf burst

$inkscape -z -f pg_0001.pdf -l output_page1.svg

Source: https://habr.com/ru/post/130601/

