
Parse it!

Some time ago I had to do a small piece of research: find the best PDF parser implemented in Java.

A little about the project. It is an internal messaging system in which files can be attached to messages. It also has a search that must cover the contents of the attachments, most of which are PDFs.
The mechanism itself is quite simple: when a message is sent, its attachments are parsed and the extracted text is added to the index.
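The article does not show the project's index, so the following is only a minimal sketch of the idea, assuming a plain in-memory inverted index. The class `AttachmentIndex` and its methods are hypothetical names introduced for illustration; the real system may work quite differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory inverted index: word -> ids of attachments containing it.
class AttachmentIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Called once per attachment with the text produced by the PDF parser.
    void add(String attachmentId, String extractedText) {
        String lower = extractedText.toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or a decimal digit.
        for (String token : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(attachmentId);
            }
        }
    }

    // Returns the ids of all attachments whose text contains the given word.
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

With an index like this, layout details of the extracted text do not matter, only that whole words survive extraction, which is why the quality bar in the comparison is lower than it would be for displaying the text.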

For a long time, documents were parsed with the PDFBOX library, and nobody was happy with it: it was slow and prone to failures.
In the end, four libraries were selected for the comparison: PDFBOX, JPod, iText and Acrobat.

Acrobat was eliminated almost immediately, because it turned out that the library had not been maintained for several years. The statistics for it remain, though, so I will publish them too.
The libraries had to be compared on two criteria: speed and the quality of the results.
I'll warn you right away: the libraries were tested on internal documents that fall under a certain security level, so I can't even give their names. I can say only that the content of the files was very diverse: text, tables, pictures, scans, and so on. File sizes also vary widely, so we can expect a reasonably objective assessment.
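The article does not say how the timings were taken, so here is only a sketch of one way to do it. `ParseTimer` is a hypothetical helper, and wrapping each library call in a `Callable` is an assumption; any failure, including an out-of-memory error, is reported as "E".

```java
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical timing helper; each library's extraction call is wrapped in a Callable.
class ParseTimer {
    // Runs one extraction and returns the elapsed time as mm:ss.SSS, or "E" on any failure.
    static String time(Callable<String> extraction) {
        long start = System.nanoTime();
        try {
            extraction.call();
        } catch (Throwable t) {
            // Deliberately broad: crashes such as OutOfMemoryError also count as "E".
            return "E";
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return format(elapsedMillis);
    }

    // Formats a duration in milliseconds as minutes:seconds.milliseconds.
    static String format(long millis) {
        return String.format(Locale.ROOT, "%02d:%02d.%03d",
                millis / 60_000, (millis / 1_000) % 60, millis % 1_000);
    }
}
```

A single run per file is sensitive to JVM warm-up and disk caching, so repeating each measurement and comparing the results would give steadier numbers.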

Time assessment:

| File size | PDFBOX    | Acrobat   | JPOD      | iText     |
|-----------|-----------|-----------|-----------|-----------|
| 74.1 KB   | 00:02.591 | 00:03.155 | 00:01.683 | 00:00.963 |
| 257.5 KB  | 00:01.680 | 00:03.191 | 00:00.116 | 00:00.78  |
| 1.6 MB    | 00:05.805 | 00:02.884 | 00:02.532 | 00:02.79  |
| 28.1 MB   | E         | 01:10.983 | 00:43.815 | E         |
| 13.6 MB   | 00:05.218 | 00:04.331 | 00:00.599 | 00:00.77  |
| 1.9 MB    | 00:02.782 | 00:14.506 | 00:00.608 | 00:00.707 |
| 1.6 MB    | 00:06.182 | 00:02.98  | 00:00.906 | 00:02.413 |
| 8.9 MB    | 00:05.98  | 00:03.894 | 00:00.680 | 00:00.647 |
| 2.4 MB    | 00:14.15  | 00:07.893 | 00:02.826 | E         |
| 604.7 KB  | 00:03.342 | 00:04.721 | 00:00.551 | 00:01.222 |
| 100.6 KB  | 00:01.819 | 00:04.212 | 00:00.84  | 00:00.456 |
| 1.6 MB    | 00:05.633 | 00:03.99  | 00:00.883 | 00:02.18  |
| 10.3 MB   | 00:22.311 | 00:22.145 | 00:27.663 | E         |
| 1.9 MB    | 00:06.943 | 00:14.736 | 00:01.200 | E         |
| 2.1 MB    | 00:02.573 | E         | 00:00.498 | 00:00.475 |
| 111.0 KB  | 00:01.956 | 00:02.846 | 00:00.705 | 00:00.300 |
| 814.3 KB  | 00:02.552 | 00:04.221 | 00:00.306 | 00:00.900 |
| 2.0 MB    | 00:06.319 | 00:07.128 | 00:01.821 | 00:02.796 |
| 338.7 KB  | 00:01.950 | 00:03.684 | 00:00.79  | 00:00.415 |
| 12.9 MB   | 00:15.932 | 00:13.628 | 00:04.989 | E         |
| 7.3 MB    | E         | 00:17.275 | 00:16.377 | E         |
| 97.2 MB   | 00:27.291 | 00:01.994 | 00:05.739 | E         |
| 5.2 MB    | 00:07.773 | 00:11.108 | 00:01.964 | E         |

Total:

| Result                   | PDFBOX | Acrobat | JPOD | iText |
|--------------------------|--------|---------|------|-------|
| Best                     | 0      | 2       | 14   | 7     |
| Middle                   | 12     | 7       | 8    | 8     |
| Worst (including errors) | 11     | 14      | 1    | 8     |
| Errors                   | 2      | 1       | 0    | 8     |

In the original table, the worst and best times were highlighted in red and green respectively. The letter "E" marks a run that failed outright: the process crashed because of a buffer overflow or some other error.
In this comparison the clear winner was JPod. Its complete lack of parsing errors was a pleasant surprise.

Quality assessment:

The quality assessment was rather subjective and used only three categories: Best, Middle and Worst. There is also an Empty score, given when the parser crashed or simply found no text in the document.
The extracted text was compared with the original for similarity, but not very strictly, because the text was needed for the index, not for display.

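| File size | PDFBOX | Acrobat | JPOD   | iText  |
|-----------|--------|---------|--------|--------|
| 74.1 KB   | Best   | Middle  | Best   | Best   |
| 257.5 KB  | Best   | Middle  | Best   | Empty  |
| 1.6 MB    | Best   | Empty   | Empty  | Worst  |
| 28.1 MB   | Empty  | Middle  | Best   | Empty  |
| 13.6 MB   | Empty  | Empty   | Empty  | Empty  |
| 1.9 MB    | Best   | Middle  | Worst  | Middle |
| 1.6 MB    | Best   | Empty   | Worst  | Worst  |
| 8.9 MB    | Best   | Middle  | Middle | Best   |
| 2.4 MB    | Best   | Middle  | Worst  | Empty  |
| 604.7 KB  | Best   | Middle  | Middle | Middle |
| 100.6 KB  | Best   | Middle  | Best   | Empty  |
| 1.6 MB    | Best   | Empty   | Worst  | Worst  |
| 10.3 MB   | Best   | Best    | Best   | Empty  |
| 1.9 MB    | Best   | Best    | Best   | Empty  |
| 2.1 MB    | Best   | Best    | Best   | Empty  |
| 111.0 KB  | Best   | Best    | Best   | Best   |
| 814.3 KB  | Best   | Worst   | Best   | Best   |
| 2.0 MB    | Best   | Middle  | Best   | Best   |
| 338.7 KB  | Middle | Middle  | Best   | Empty  |
| 12.9 MB   | Best   | Best    | Best   | Empty  |
| 7.3 MB    | Empty  | Best    | Best   | Empty  |
| 97.2 MB   | Best   | Best    | Middle | Empty  |
| 5.2 MB    | Best   | Best    | Best   | Empty  |

Total:

| Result                  | PDFBOX | Acrobat | JPOD | iText |
|-------------------------|--------|---------|------|-------|
| Best                    | 19     | 8       | 14   | 5     |
| Middle                  | 1      | 10      | 3    | 2     |
| Worst (including empty) | 3      | 5       | 6    | 16    |
| Empty                   | 3      | 4       | 2    | 13    |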


Parsing peculiarities:
Acrobat very often outputs the whole text as a single line. Spaces between words and sentences are preserved, so in principle this is not critical for indexing.
iText does not understand non-English characters. The test used documents in English, German and French, and all the umlauts were lost: question marks came out in their place. Perhaps this can be configured somewhere, but the other libraries handled such characters without any extra effort.
PDFBOX gave no cause for complaint on quality.
JPod is fine too, except for one peculiarity that I had to tinker with for quite some time: in some cases the document is parsed, completely or partially, one letter per line. For the index, such output is useless.
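The one-letter-per-line output can be spotted automatically before the text reaches the index. The following is a minimal sketch of such a check, not something the article describes: the class `LetterPerLineCheck`, its method names and the threshold value are all assumptions.

```java
// Hypothetical sanity check for extracted text; not part of JPod or any other library.
class LetterPerLineCheck {
    // Fraction of non-blank lines that contain exactly one non-whitespace character.
    static double singleCharLineRatio(String text) {
        int nonBlank = 0;
        int single = 0;
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            nonBlank++;
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                single++;
            }
        }
        return nonBlank == 0 ? 0.0 : (double) single / nonBlank;
    }

    // Flags text where single-character lines dominate; the threshold is a guess to tune.
    static boolean looksBroken(String text, double threshold) {
        return singleCharLineRatio(text) > threshold;
    }
}
```

For example, `LetterPerLineCheck.looksBroken(text, 0.5)` would flag a document in which more than half of the non-blank lines hold a single character, so it can be re-parsed or reported instead of being indexed as noise.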

As a result, JPod was declared the winner, despite its habit of parsing text one letter per line.
That was something we had to deal with.

Part two. Inside JPod.


It took a lot of time: digging through the JPod sources, writing emails, reading the project forum. Eventually it turned out that this behaviour of the parser is caused by page orientation: portrait pages are parsed normally, landscape pages are not. Attempts to tune the parameters yielded nothing, and the class properties responsible for page orientation had no effect.
At one point I decided to simply strip all font handling out of the classes, since fonts are not needed for text indexing anyway. It helped, because the text blocks were being computed incorrectly and the fonts were the cause.
I would have stopped there, but Egiptyanin insisted on seeing it through to the end. From then on I barely took part.
A solution was found and is used in the following form: the affine transformation matrix was overridden, so that a static matrix is set instead of the dynamic one. The CSPlainTextExtractor class is used instead of CSTextExtractor. The new class looks like this:

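import de.intarsys.pdf.content.text.CSTextExtractor;

public class CSPlainTextExtractor extends CSTextExtractor {
    // Ignore the matrix requested by the content stream and always set the identity matrix.
    @Override
    public void textSetTransform(float a, float b, float c, float d, float e, float f) {
        super.textSetTransform(1, 0, 0, 1, 0, 0);
    }
}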

Of course, this is not a panacea: very rarely the parser fails to insert line breaks where they are needed, but this does not matter for indexing.
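To see why pinning the matrix changes the layout, it helps to recall how a PDF text matrix [a b c d e f] maps coordinates: x' = a*x + c*y + e and y' = b*x + d*y + f. The sketch below uses `java.awt.geom.AffineTransform` from the JDK, not JPod. The class `TextMatrixDemo` and the sample 90-degree rotation are illustrative assumptions, since the article does not say which matrices the landscape pages actually contained.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Illustration only: maps a text-space point through a PDF-style matrix [a b c d e f].
class TextMatrixDemo {
    static Point2D apply(double a, double b, double c, double d, double e, double f,
                         double x, double y) {
        // AffineTransform takes its arguments in the same order as the PDF matrix:
        // (m00, m10, m01, m11, m02, m12) = (a, b, c, d, e, f).
        AffineTransform matrix = new AffineTransform(a, b, c, d, e, f);
        return matrix.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // A glyph 10 units along the baseline, under a 90-degree rotation [0 1 -1 0 0 0]:
        // the advance moves from the x axis to the y axis.
        System.out.println(apply(0, 1, -1, 0, 0, 0, 10, 0));
        // The same glyph under the identity matrix [1 0 0 1 0 0] stays on the x axis.
        System.out.println(apply(1, 0, 0, 1, 0, 0, 10, 0));
    }
}
```

Under the rotation, consecutive glyphs of one word differ in y rather than in x, so an extractor that groups glyphs into lines by their y coordinate could put each letter on its own line. That is one plausible reading of the behaviour described above, not a confirmed diagnosis; forcing the identity matrix keeps the glyph advance horizontal.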

That's all, really. Thank you for your attention =)

P.S. This is my first more or less serious article, so I hope for objective criticism.

Upd. Particularly attentive readers found inconsistencies in the tables; they have been fixed.

Source: https://habr.com/ru/post/57076/

