A saga of batch-converting PDF to text

Last year I took on what looked at first glance like a simple job: build a system for batch-processing files that contain a 12-column table, and export the table data to a database. It would have been fine, except that the files turned out to be PDF documents, and the customer insisted that no other format could be provided.

A sample of such a PDF: the structure of the file is preserved, but all the data has been scrubbed.

Well, despite the warnings of knowledgeable people (and they did not warn in vain), I took the job and lived through the following adventure:

The search for software to batch-process PDF files began. The work was done on Linux, so the first contender was pdftotext from the poppler package, the tool most often used for this purpose.

pdftotext

The program is easy to call from the command line and takes the following form:

$ pdftotext .pdf .txt
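Since the goal is batch processing, the same call can be wrapped in a loop. The following is only a minimal sketch: the function name convert_all and the idea of writing each .txt next to its .pdf are assumptions of this example, not something taken from the original project.

```shell
# convert_all DIR: convert every DIR/*.pdf to a .txt file next to it.
# Hypothetical helper; returns non-zero if at least one conversion failed.
convert_all() {
    dir=$1
    status=0
    for pdf in "$dir"/*.pdf; do
        # If DIR holds no PDFs the glob stays literal, so skip it.
        [ -e "$pdf" ] || continue
        txt="${pdf%.pdf}.txt"
        if ! pdftotext "$pdf" "$txt"; then
            echo "conversion failed: $pdf" >&2
            status=1
        fi
    done
    return "$status"
}
```

It would be called as, for example, `convert_all ./incoming`. Quoting "$pdf" and "$txt" keeps file names with spaces intact.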

But the result of the conversion was very strange:

    Extract from the registry
    deals
    on the basis of bidding
    ZAO St. Petersburg
    International
    Commodity - raw exchange »
    SET-VRS form
    Date of bidding: 01/01/1900
    Section: Section “Petroleum Products” CJSC SPbMTSB
    Member: LLC
    Participant code: 000000000000
    Name and code of the client: OJSC / 000000000000
    Tool code: 0000000001
    Tool Description: Gasoline Premium Euro 95 type II class B,
    Number
    deals
    Time
    deals
    Application number
    Purchase / Transaction Price (per Quantity
    sale¹
    one ton)
    lots
    01-01
    01-02
    01-03
    S
    01-05
    5
    02-01
    02-02
    02-03
    S
    02-05
    1
    Total to 0000000001
and so on…

Worse still, the output changed from file to file; in this form it was, of course, impossible to write a parser to pull the data out of the tables.

So a second candidate appeared on the horizon, this time a proprietary one: Adobe Acrobat Reader for Linux.

Acrobat Reader

Here I decided to start by testing its “File - Save as Text” feature in graphical mode, without digging into the command line. That turned out to be the right call: the results of converting my PDF in this program did not inspire optimism either:

    Participant code:
    Section:
    Tool code:
    Tool code:
    000000000000
    Ltd
    0000000001
    0000000002
    00,000,000.11 00,000.12
    0,000,000.21 0,000.22
    01-12
    02-12
    03-12
    04-12
    ZAO St. Petersburg
    International
    Commodity - raw exchange »
    Extract from the registry
    deals
    on the basis of bidding
    Date of bidding: 01/01/1900
    SET-VRS form
    Name and code of the client: OJSC / 000000000000
    Participant:
    Section “Oil products” CJSC “SPIMTSB”
    Tool Description: Gasoline Premium Euro 95 type II class B,
    Code
    trader
    Time
    deals
    Number
    deals
    Application number
    Total to 0000000001
    Tool Description: Diesel fuel winter Z-0.2-35.
    Code
    trader
    Time
    deals
    Number
    deals
    Application number
    Total to 0000000002
and so on…

Parsing output like this is not an appealing prospect either.

The third and, as it turned out, last candidate was pdfedit.

pdfedit

Every repository ships it as a graphical program that can also run scripts from the command line. As with Acrobat Reader, the first PDF-to-text conversion was run from the graphical interface, via the “save as text” menu item, and the result was very pleasing:

    ZAO St. Petersburg Extract from the Register Form SET-VRS
    International deals
    Commodity and Raw Materials Exchange "at the end of the auction
    Date of bidding: 01/01/1900
    Section: Section "Petroleum Products" CJSC SPbMTSB
    Member: LLC
    Participant code: 000000000000
    Name and code of the client: OJSC / 000000000000
    Tool code: 0000000001
    Tool Description: Gasoline Premium Euro 95 type II class B,
    Number Time Application Number
    Purchase / Transaction Price (per Quantity
    VAT Exchange Name
    Code
    CRC transaction amount
    deals sale
    one ton)
    lots (18%) collecting counterparty
    trader
    01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 PeSee
    01-12
    02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee
    02-12
    Total to 0000000001 00,000,000.11 00,000.12
    Tool code: 0000000002
    Tool Description: Diesel fuel winter Z-0.2-35.
    Number Time Application Number
    Purchase / Transaction Price (per Quantity
    VAT Exchange Name
    Code
    CRC transaction amount
    deals sale
    one ton)
    lots (18%) collecting counterparty
    trader
    03-01 03-02 03-03 S 03-05 1 03-07 03-08 03-09 03-10 PuSE
    03-12
    04-01 04-02 04-03 S 04-05 1 04-07 04-08 04-09 04-10 PgSee
    04-12
    Total to 0000000002 0,000,000.21 0,000.22
    SPbMTSB Broker Copy of an electronic document
    ¹ B - purchase, S - sale.
    1/3

The output was very regular: parsing it is easy and pleasant. But no such luck, the road to the end of this epic was thorny. In the stock build of pdfedit, the “save as text” function could not be called from the command line:

    $ pdfedit -console
    PDFedit 0.4.5
    Usage:
    pdfedit -console [function name] [function parameter(s)]
    The function name is the name of the function to invoke (case insensitive) or an unambiguous part of it.
    The remaining parameters are passed to the called function.
    Available functions:
    Delinearizator
    Description: Delinearize input file
    Parameters: [input file] [output file]
    Flattener
    Description: Flatten input file (remove all revisions except the last one)
    Parameters: [input file] [output file]

The console help listed two utilities, neither of which had anything to do with converting to text. Further study of the documentation on the wiki turned up a complete conversion script, savealltext.qs, but the catch was that it had to be run from the graphical menu. As a result, I had to study the documentation in detail:

API for creating scripts in pdfedit
create pdfedit script wiki
examples of ready-made scripts

Digesting this material, plus a step-by-step analysis of /usr/share/pdfedit/delinearize.qs, the console script shipped in the stock build, led me to write my own savealltext.qs:

    /** Print help for savealltext */
    function savealltext_help() {
        print(tr("Usage:"));
        print("savealltext [" + tr("input file") + "] [" + tr("output file") + "]");
        print(" " + tr("Input file must exist"));
        print(" " + tr("Output file must not exist"));
        exit(1);
    }

    /** Report an error and exit */
    function savealltext_fail(err) {
        print(tr("savealltext failed!"));
        print(err);
        exit(2);
    }

    /** Write the text of every page of PDF file p to text file f */
    function saveAsText_save(p, f) {
        document = loadPdf(p);
        qs = "";
        pages = document.getPageCount();
        // Pages are numbered from 1
        for (i = 1; i <= pages; i++) {
            pg = document.getPage(i);
            text = pg.getText();
            qs += text;
            qs += "\n";
        }
        saveFile(f, qs);
    }

    p = parameters();
    if (p.length != 2) {
        savealltext_help("savealltext " + tr("is expecting two parameters"));
    }
    inFile = p[0];
    outFile = p[1];
    if (!exists(inFile)) savealltext_fail(tr("Input file '%1' does not exist").arg(inFile));
    if (exists(outFile)) savealltext_fail(tr("Output file '%1' already exist").arg(outFile));
    if (inFile == outFile) savealltext_fail(tr("Input and output files must be different"));
    // saveAsText_save() returns nothing, so the report below is always printed
    saveAsText_save(inFile, outFile);
    print(tr("savealltext") + " :" + inFile + " -> " + outFile);

Once placed in the /usr/share/pdfedit/ directory, the script can be called from the command line and performs the required conversion:

    $ pdfedit -console
    PDFedit 0.4.5-20111014140242
    Usage:
    pdfedit -console [function name] [function parameter(s)]
    The function name is the name of the function to invoke (case insensitive) or an unambiguous part of it.
    The remaining parameters are passed to the called function.
    Available functions:
    Delinearizator
    Description: Delinearize input file
    Parameters: [input file] [output file]
    Flattener
    Description: Flatten input file (remove all revisions except the last one)
    Parameters: [input file] [output file]
    savealltext
    Description: savealltext input file
    Parameters: [input file] [output file]

$ pdfedit -console savealltext .pdf .txt

Everything seemed fine, but more pitfalls kept surfacing: it turned out that in console mode pdfedit takes text only from a default area of 612x792 pixels, which at 72 dpi corresponds to a portrait page (the source comments call it A4; strictly speaking, 612x792 points is US Letter). The program refused to rotate the scan area, even though the code contained the corresponding instructions from the rotate_text_fix.patch patch.
The hunt for the definition of this tricky “default area” led me to the project's source code, to the file src/kernel/displayparams.h. What a place to put it!

static const int DEFAULT_PAGE_RX = 612; /**< Default A4 width on a device with 72 horizontal dpi. */
static const int DEFAULT_PAGE_RY = 792; /**< Default A4 height on a device with 72 vertical dpi. */

I replaced both values with 1584 points (which corresponds to half a Whatman sheet at the same resolution) and rebuilt the project.
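To see what these numbers mean on paper, the sizes can be converted from points to millimetres, given that one point is 1/72 inch and one inch is 25.4 mm. This is a small sketch only; the helper name pt_to_mm is invented for the example.

```shell
# pt_to_mm POINTS: convert points (1/72 inch) to millimetres.
# Hypothetical helper; LC_ALL=C keeps the decimal separator a dot.
pt_to_mm() {
    LC_ALL=C awk -v pt="$1" 'BEGIN { printf "%.1f\n", pt / 72 * 25.4 }'
}

pt_to_mm 612     # DEFAULT_PAGE_RX -> 215.9
pt_to_mm 792     # DEFAULT_PAGE_RY -> 279.4
pt_to_mm 1584    # the enlarged value -> 558.8
```

The defaults come out at 215.9 x 279.4 mm, a portrait page only 215.9 mm wide, while a landscape A4 sheet is 297 mm wide. That is consistent with text outside the default area being missed, and 558.8 mm in both directions covers the landscape sheet with room to spare.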

With that, one might say, the torment ended: the PDFs were processed from the terminal in all their A4 landscape glory.
Following this ordeal, I wrote build rules for pdfedit packages with savealltext for Arch Linux and uploaded them to the AUR:
pdfedit release
pdfedit version from CVS

ps:

A tale is quickly told, but a deed is not quickly done. I did not like that pdfedit, even in console mode, required qt3 to be installed on the system:

    $ pdfedit -console
    pdfedit: error while loading shared libraries: libqt-mt.so.3
This led me to look at the project's configure options for building from source, and the following recipe was found here:

configure --disable-gui --enable-pdfedit-core-dev --enable-tools

As a result, the graphical pdfedit binary is not built at all, but a whole set of useful utilities is, including pdf_to_text, which performs the conversion I need using the same algorithms as the savealltext.qs script:

$ pdf_to_text -file .pdf >.txt

The results of this build were also turned into a PKGBUILD for Arch Linux and published in the AUR.

upd1:

Thank you all for the advice! The discussion brought up several more interesting ways to convert PDF to text:
zeliboba suggested passing the -layout option to pdftotext

$ pdftotext -layout .pdf .txt

It is no exaggeration to say that, for the files from this project, this method converts better than pdfedit:

    ZAO St. Petersburg Extract from the Register Form SET-VRS
    International deals
    Commodity and Raw Materials Exchange "at the end of the auction
    Date of bidding: 01/01/1900
    Section: Section "Petroleum Products" CJSC SPbMTSB
    Member: LLC
    Participant code: 000000000000
    Name and code of the client: OJSC / 000000000000
    Tool code: 0000000001
    Tool Description: Gasoline Premium Euro 95 type II class B,
    Number Time Order number Purchase / Transaction price (per Quantity VAT Exchange Name Code
    CRC transaction amount
    transactions sale ¹ one ton) of lots (18%) collection of trader’s counterparty
    01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 Pay 01-12
    02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee 02-12
    Total to 0000000001 00,000,000.11 00,000.12

On the other hand, this does not hold for every PDF:
“Today I experimented with various PDF files, and it turned out that there is still no outright winner between pdfedit and pdftotext. For example:
pdftotext did better on horizontally elongated tables and single-column magazines.
pdfedit did better on multi-column magazines (for example, linuxformat) and vertically oriented tables.
In some cases both programs gave the same result (for example, the Zeingaus magazine).”
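Once a converter emits one table row per line, as in the -layout sample earlier in this update, pulling out the data rows can be quite short. The sketch below is illustrative only: the name rows_to_tsv is invented, and it assumes that every data row has exactly 12 whitespace-separated cells with no spaces inside a cell, which holds for the scrubbed sample but may well fail on real counterparty names.

```shell
# rows_to_tsv [FILE...]: keep only lines with exactly 12 fields and
# print them tab-separated. Hypothetical helper; reads stdin if no FILE.
rows_to_tsv() {
    # Assigning $1 to itself makes awk rebuild the line with OFS (a tab).
    awk -v OFS='\t' 'NF == 12 { $1 = $1; print }' "$@"
}
```

Headers, totals and footers are dropped simply because they do not have 12 fields. A more robust parser would probably cut columns by character position instead, since -layout tries to preserve the physical layout.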

kornerz suggested converting PDF to SVG with Inkscape, first splitting the PDF file into pages (for example, with pdftk)

$ pdftk file.pdf burst

$ inkscape -z -f pg_0001.pdf -l output_page1.svg
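For a multi-page file the two steps can be chained. This is a rough sketch: the function name pdf_to_svg_pages is invented, it relies on pdftk's default pg_NNNN.pdf naming for burst output, and it reuses the Inkscape flags exactly as quoted in the suggestion (newer Inkscape releases changed their command-line options, so these flags may need adjusting).

```shell
# pdf_to_svg_pages FILE: split FILE into single pages in the current
# directory, then convert each page to SVG. Hypothetical helper.
pdf_to_svg_pages() {
    pdftk "$1" burst || return 1
    for page in pg_*.pdf; do
        # Nothing was produced: the glob stayed literal.
        [ -e "$page" ] || continue
        inkscape -z -f "$page" -l "${page%.pdf}.svg" || return 1
    done
}
```

It is best run in an empty working directory, because any pg_*.pdf files left over from an earlier run would be converted as well.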

Source: https://habr.com/ru/post/130601/
