Last year I took on a job that looked simple at first glance: build a system for batch-processing files that contain a 12-column table, and export the table data to a database. It would all have been fine, except that the files turned out to be PDF documents, and the customer insisted that he could not supply any other format.
Here is a sample of such a PDF: the structure of the file is preserved, but all the data has been scrubbed.
Despite the warnings of knowledgeable people (and they did not warn me in vain), I took the job and went through the following adventure:
The search for software to batch-process PDF files began. The work was done on Linux, so the first contender was pdftotext from the poppler package, the tool most often used for this purpose.
pdftotext
It is easy to call from the command line, and the invocation looks like this:
$ pdftotext .pdf .txt
But the result of the conversion was very strange:
Extract from the registry
deals
on the basis of bidding
ZAO St. Petersburg
International
Commodity - raw exchange »
SET-VRS form
Date of bidding: 01/01/1900
Section: Section “Petroleum Products” CJSC SPbMTSB
Member: LLC
Participant code: 000000000000
Name and code of the client: OJSC / 000000000000
Tool code: 0000000001
Tool Description: Gasoline Premium Euro 95 type II class B,
Number
deals
Time
deals
Application number
Purchase / Transaction Price (per Quantity
sale¹
one ton)
lots
01-01
01-02
01-03
S
01-05
5
02-01
02-02
02-03
S
02-05
1
Total to 0000000001
and so on…
Worse, the output layout changed from file to file. In this form it was clearly impossible to write a parser to pull the data out of the tables.
So a second candidate appeared on the horizon, this time a proprietary one: Adobe Acrobat Reader for Linux.
Acrobat Reader
For a start, I decided to test its "File - Save as Text" feature in graphical mode, without digging into the command line. That turned out to be the right call: the result of converting my PDF in this program did not inspire optimism either:
Participant code:
Section:
Tool code:
Tool code:
000000000000
Ltd
0000000001
0000000002
00,000,000.11 00,000.12
0,000,000.21 0,000.22
01-12
02-12
03-12
04-12
ZAO St. Petersburg
International
Commodity - raw exchange »
Extract from the registry
deals
on the basis of bidding
Date of bidding: 01/01/1900
SET-VRS form
Name and code of the client: OJSC / 000000000000
Participant:
Section “Oil products” CJSC “SPIMTSB”
Tool Description: Gasoline Premium Euro 95 type II class B,
Code
trader
Time
deals
Number
deals
Application number
Total to 0000000001
Tool Description: Diesel fuel winter Z-0.2-35.
Code
trader
Time
deals
Number
deals
Application number
Total to 0000000002
and so on…
Parsing output like this is not an appealing prospect either.
The third and, as it turned out, last candidate was pdfedit.
pdfedit
In all repositories it is packaged as a graphical program that can also run scripts from the command line. As with Acrobat Reader, the first PDF-to-text conversion was done from the GUI via "File - Save as Text", and the result was very pleasing:
ZAO St. Petersburg Extract from the Register Form SET-VRS
International deals
Commodity and Raw Materials Exchange "at the end of the auction
Date of bidding: 01/01/1900
Section: Section "Petroleum Products" CJSC SPbMTSB
Member: LLC
Participant code: 000000000000
Name and code of the client: OJSC / 000000000000
Tool code: 0000000001
Tool Description: Gasoline Premium Euro 95 type II class B,
Number Time Application Number
Purchase / Transaction Price (per Quantity
VAT Exchange Name
Code
CRC transaction amount
deals sale
one ton)
lots (18%) collecting counterparty
trader
01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 PeSee
01-12
02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee
02-12
Total to 0000000001 00,000,000.11 00,000.12
Tool code: 0000000002
Tool Description: Diesel fuel winter Z-0.2-35.
Number Time Application Number
Purchase / Transaction Price (per Quantity
VAT Exchange Name
Code
CRC transaction amount
deals sale
one ton)
lots (18%) collecting counterparty
trader
03-01 03-02 03-03 S 03-05 1 03-07 03-08 03-09 03-10 PuSE
03-12
04-01 04-02 04-03 S 04-05 1 04-07 04-08 04-09 04-10 PgSee
04-12
Total to 0000000002 0,000,000.21 0,000.22
SPbMTSB Broker Copy of an electronic document
¹ B - purchase, S - sale. 1/3
The output turned out very clean, and parsing it would be easy and pleasant. But no such luck: the road to the end of this saga was thorny. In the stock build of pdfedit, the "save as text" function could not be called from the command line:
$ pdfedit -console
PDFedit 0.4.5
Usage:
pdfedit -console [function name] [function parameter(s)]
Function name is the name of the function to invoke (case insensitive) or an unambiguous part of it.
The remaining parameters are passed to the invoked function.
Available functions:
Delinearizator
Description: Delinearize input file
Parameters: [input file] [output file]
Flattener
Description: Flatten input file (remove all revisions except the last one)
Parameters: [input file] [output file]
The console help listed only two utilities, neither of them related to text conversion. Further study of the documentation on the wiki led me to a ready-made conversion script, savealltext.qs, but the trouble was that it had to be run from the graphical menu. As a result, I had to study the documentation in detail:
the API for writing pdfedit scripts
the wiki page on creating pdfedit scripts
examples of ready-made scripts
Poring over this material, plus a step-by-step analysis of /usr/share/pdfedit/delinearize.qs (the console script shipped with the stock build), led me to write my own savealltext.qs:
/** Print help for savealltext, optionally preceded by an error message */
function savealltext_help(msg) {
 if (msg) print(msg);
 print(tr("Usage:"));
 print("savealltext ["+tr("input file")+"] ["+tr("output file")+"]");
 print(" "+tr("Input file must exist"));
 print(" "+tr("Output file must not exist"));
 exit(1);
}

/** Report a failure and stop with a non-zero exit code */
function savealltext_fail(err) {
 print(tr("savealltext failed!"));
 print(err);
 exit(2);
}

/** Extract the text of every page of PDF p and write it to file f */
function saveAsText_save(p,f) {
 document=loadPdf(p);
 qs="";
 pages=document.getPageCount();
 // pages are numbered starting from 1
 for (i=1;i<=pages;i++) {
  pg=document.getPage(i);
  text=pg.getText();
  qs+=text;
  qs+="\n";
 }
 saveFile(f,qs);
}

// Entry point: check the parameters, then convert
p=parameters();
if (p.length!=2) {
 savealltext_help("savealltext "+tr("is expecting two parameters"));
}
inFile=p[0];
outFile=p[1];
if (!exists(inFile)) savealltext_fail(tr("Input file '%1' does not exist").arg(inFile));
if (exists(outFile)) savealltext_fail(tr("Output file '%1' already exist").arg(outFile));
if (inFile==outFile) savealltext_fail(tr("Input and output files must be different"));
saveAsText_save(inFile,outFile);
print(tr("savealltext")+" :"+inFile+" -> "+outFile);
Once placed in the /usr/share/pdfedit/ directory, the script can be invoked from the command line and performs the required conversion:
$ pdfedit -console
PDFedit 0.4.5-20111014140242
Usage:
pdfedit -console [function name] [function parameter(s)]
Function name is the name of the function to invoke (case insensitive) or an unambiguous part of it.
The remaining parameters are passed to the invoked function.
Available functions:
Delinearizator
Description: Delinearize input file
Parameters: [input file] [output file]
Flattener
Description: Flatten input file (remove all revisions except the last one)
Parameters: [input file] [output file]
savealltext
Description: savealltext input file
Parameters: [input file] [output file]
$ pdfedit -console savealltext .pdf .txt
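Since the original goal was batch processing, a call like this can be wrapped in a loop over a directory. The following is only a sketch in Python, not the script used in the project: the function names and the convention of writing the .txt next to the .pdf are illustrative assumptions, and the only external command is the pdfedit -console savealltext invocation shown above. Files that already have a .txt neighbour are skipped, because savealltext.qs refuses to overwrite existing output.

```python
import subprocess
from pathlib import Path


def savealltext_command(pdf_path):
    """Build the pdfedit call for one file; the output goes next to the input."""
    pdf_path = Path(pdf_path)
    return ["pdfedit", "-console", "savealltext",
            str(pdf_path), str(pdf_path.with_suffix(".txt"))]


def convert_directory(directory, run=subprocess.run):
    """Convert every *.pdf in `directory`, skipping files already converted.

    `run` is injectable so the loop can be dry-run without pdfedit installed.
    """
    converted = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        if pdf_path.with_suffix(".txt").exists():
            continue  # savealltext.qs fails if the output file exists
        run(savealltext_command(pdf_path), check=True)
        converted.append(pdf_path.name)
    return converted
```

Calling convert_directory with the path of a directory returns the names of the files it converted. With check=True the loop stops at the first file pdfedit fails on, which makes a broken document easy to spot.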
Everything seemed fine, but more pitfalls surfaced: it turned out that in console mode pdfedit takes text only from a default area of 612x792 points (pixels at 72 dpi). That is a portrait page of 8.5x11 inches, strictly speaking US Letter rather than A4, whereas my documents were A4 in landscape orientation. The program refused to rotate the scan area, despite the corresponding instructions in the code from the rotate_text_fix.patch patch.
The search for the definition of this tricky "default area" led me to the project source code, to the file src/kernel/displayparams.h. What a place to put it!
static const int DEFAULT_PAGE_RX = 612; /**< Default A4 width on a device with 72 horizontal dpi. */
static const int DEFAULT_PAGE_RY = 792; /**< Default A4 height on a device with 72 vertical dpi. */
I replaced both values with 1584 points (22 inches at the same resolution, roughly half a sheet of Whatman paper) and rebuilt the project.
With that, one might say, the torment ended: the PDFs were processed from the terminal in all their landscape A4 glory.
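The numbers are easy to check, since a point is 1/72 of an inch. The sketch below (the helper names are hypothetical and not part of pdfedit) converts between points and millimetres. The default 612x792 area comes out as 215.9x279.4 mm, a landscape A4 sheet (297x210 mm) is about 842x595 points, and 1584 points is 558.8 mm. A landscape page is therefore wider than the default area, which would explain why it did not fit.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Convert a length in points (1/72 inch) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def mm_to_points(mm):
    """Convert a length in millimetres to points (1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# Default area hard-coded in displayparams.h
default_area_mm = (points_to_mm(612), points_to_mm(792))
# A4 in landscape orientation, 297 x 210 mm
a4_landscape_pt = (mm_to_points(297), mm_to_points(210))
# Side of the enlarged square area used after the change
enlarged_side_mm = points_to_mm(1584)

print([round(v, 1) for v in default_area_mm])
print([round(v) for v in a4_landscape_pt])
print(round(enlarged_side_mm, 1))
```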
After all this suffering, I wrote PKGBUILDs for pdfedit packages with savealltext for Arch Linux and uploaded them to the AUR:
pdfedit release
pdfedit version from CVS
A tale is quick to tell, but the deed is slow to do: I did not like that pdfedit, even in console mode, needed Qt3 installed on the system to run:
$ pdfedit -console
pdfedit: error while loading shared libraries: libqt-mt.so.3
This led me to look at the configure options available when building from source, where I found the following recipe:
configure --disable-gui --enable-pdfedit-core-dev --enable-tools
As a result, the graphical pdfedit binary is not built at all, but the build produces a whole set of useful utilities, including pdf_to_text, which performs the conversion I need using the same algorithms as the savealltext.qs script:
$ pdf_to_text -file .pdf >.txt
The results of this build were also turned into a PKGBUILD for Arch Linux and published in the AUR.
upd1:
Thank you all for the advice! The discussion brought up several more interesting ways to convert PDF to text:
zeliboba suggested using the -layout option of pdftotext:
$ pdftotext -layout .pdf .txt
It is no exaggeration to say that, for the project files from this article, this method converts better than pdfedit:
ZAO St. Petersburg Extract from the Register Form SET-VRS
International deals
Commodity and Raw Materials Exchange "at the end of the auction
Date of bidding: 01/01/1900
Section: Section "Petroleum Products" CJSC SPbMTSB
Member: LLC
Participant code: 000000000000
Name and code of the client: OJSC / 000000000000
Tool code: 0000000001
Tool Description: Gasoline Premium Euro 95 type II class B,
Number Time Order number Purchase / Transaction price (per Quantity VAT Exchange Name Code
CRC transaction amount
transactions sale ¹ one ton) of lots (18%) collection of trader’s counterparty
01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 Pay 01-12
02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee 02-12
Total to 0000000001 00,000,000.11 00,000.12
On the other hand, this does not hold for all PDFs:
"Today I experimented with various PDF files, and it turned out that there is still no absolute champion between pdfedit and pdftotext. For example:
pdftotext did better at converting a series of horizontally elongated tables and single-column journals.
pdfedit worked better with multi-column magazines (for example, Linux Format) and vertically oriented tables.
In some cases both programs produced the same result (for example, the Zeingaus magazine)."
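For files where the -layout output looks like the sample above, each deal ends up on a single line, so the rows can be picked out with a plain whitespace split. Below is a minimal sketch in Python under two assumptions taken only from the scrubbed sample: a deal row always has exactly 12 whitespace-separated fields with the B/S flag in the fourth position, and a "Tool code:" line precedes the rows of each instrument. The function name is hypothetical, and a counterparty name containing spaces would break the field count, so real data may need a stricter pattern.

```python
import re

# Matches the line that introduces a new instrument, e.g. "Tool code: 0000000001"
INSTRUMENT_RE = re.compile(r"^\s*Tool code:\s*(\d+)\s*$")


def parse_deals(text):
    """Yield (instrument_code, fields) for every line that looks like a deal row."""
    instrument = None
    for line in text.splitlines():
        match = INSTRUMENT_RE.match(line)
        if match:
            instrument = match.group(1)
            continue
        fields = line.split()
        # Assumed row shape: 12 columns, 4th column is B (purchase) or S (sale)
        if len(fields) == 12 and fields[3] in ("B", "S"):
            yield instrument, fields


sample = """\
Tool code: 0000000001
Tool Description: Gasoline Premium Euro 95 type II class B,
01-01 01-02 01-03 S 01-05 5 01-07 01-08 01-09 01-10 Pay 01-12
02-01 02-02 02-03 S 02-05 1 02-07 02-08 02-09 02-10 PbSee 02-12
Total to 0000000001 00,000,000.11 00,000.12
"""

rows = list(parse_deals(sample))
for instrument, fields in rows:
    print(instrument, fields)
```

Header, description and "Total to" lines are dropped simply because they do not have the assumed row shape. Each yielded row carries the instrument code, so it can be written to the database as one record.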
kornerz suggested converting PDF to SVG with Inkscape, first splitting the PDF file into pages (for example, with pdftk):
$ pdftk file.pdf burst
$ inkscape -z -f pg_0001.pdf -l output_page1.svg
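For a multi-page document the second command has to be repeated for every page file. The sketch below only builds the command lines, using exactly the flags from the two commands above. The function names are hypothetical, and naming each SVG after its page file is an assumption of this sketch. Newer Inkscape releases use different command-line options, so check inkscape --help before relying on these flags.

```python
from pathlib import Path


def burst_command(pdf_file):
    """Split a PDF into single-page files, as in the pdftk call above."""
    return ["pdftk", str(pdf_file), "burst"]


def inkscape_command(page_pdf):
    """Mirror the inkscape call above; the SVG is named after the page file."""
    page_pdf = Path(page_pdf)
    return ["inkscape", "-z", "-f", str(page_pdf),
            "-l", str(page_pdf.with_suffix(".svg"))]


def svg_commands(page_files):
    """One inkscape command per page file, in page order."""
    return [inkscape_command(p) for p in sorted(page_files)]
```

Each returned list can be passed to subprocess.run, first the burst command, then one Inkscape command per pg_*.pdf file found in the working directory.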