Parse it!

Some time ago I had to do a small piece of research: find the best PDF parser implemented in Java.

A little about the project. It is an internal messaging system in which files can be attached to messages. It also has a search that must cover the contents of the attachments, and most of those attachments are PDFs.
The mechanism itself is quite simple: when a message is sent, the attachment is parsed and the extracted text is added to the index.
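
The "parse on send, then index" step can be sketched roughly as follows. This is only an illustration: the article does not describe the real index, so the AttachmentIndex class, its methods and the message ids are all hypothetical, and the text is assumed to have been extracted from the PDF already.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory index; the real project's index is not described in the article.
class AttachmentIndex {
    // word -> ids of the messages whose attachments contain that word
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Called once per attachment, with text already extracted by a PDF parser.
    void add(String messageId, String extractedText) {
        // Split on anything that is not a letter or a digit (Unicode-aware).
        for (String token : extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(messageId);
            }
        }
    }

    // Returns the ids of messages whose attachments contain the word (case-insensitive).
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        AttachmentIndex index = new AttachmentIndex();
        index.add("msg-1", "Quarterly report: Größe der Tabelle");
        index.add("msg-2", "Meeting notes, no report attached");
        System.out.println(index.search("report"));
        System.out.println(index.search("größe"));
    }
}
```

With an index like this, word order and line breaks in the extracted text do not matter, which is why the quality of the parser's output is judged fairly loosely later on.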

For a long time the documents were parsed with the PDFBOX library, and nobody was happy with it: it was slow and error-prone.
In the end, four libraries were selected for the comparison: PDFBOX, JPod, iText and Acrobat.

Acrobat was eliminated almost immediately, because it turned out that this library had not been maintained for several years. The statistics for it remain, though, so I am publishing them as well.
The libraries were compared on two criteria: speed and the quality of the extracted text.
A warning up front: the libraries were tested on internal documents that fall under a certain security level, so I cannot even give file names. I can only say that the content was very diverse: text, tables, pictures, scans, and so on. The file sizes also vary widely, so the assessment should be reasonably objective.
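
The speed measurement can be sketched as a small harness like the one below. The article does not show the actual test code, so the ParseTimer class and the extractor function are hypothetical stand-ins; a real run would pass in a call to PDFBOX, JPod or iText. A failed run is recorded as "E", as in the table that follows.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical timing harness; not the code used for the measurements in the article.
class ParseTimer {
    // Formats elapsed milliseconds as minutes:seconds.milliseconds.
    static String format(long millis) {
        return String.format(Locale.ROOT, "%02d:%02d.%03d",
                millis / 60_000, (millis / 1000) % 60, millis % 1000);
    }

    // Runs one extractor on one file and returns the elapsed time, or "E" on any failure.
    static String time(Function<Path, String> extractor, Path file) {
        long start = System.nanoTime();
        try {
            extractor.apply(file);
        } catch (Throwable t) { // also catches OutOfMemoryError and StackOverflowError
            return "E";
        }
        return format((System.nanoTime() - start) / 1_000_000);
    }

    public static void main(String[] args) {
        Path file = Paths.get("sample.pdf"); // placeholder path, never opened here
        System.out.println(time(p -> "extracted text", file));
        System.out.println(time(p -> { throw new IllegalStateException("parser crashed"); }, file));
    }
}
```

A single run per file is noisy (JIT warm-up, disk cache), so numbers obtained this way are good for a rough ranking rather than a precise benchmark.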

Time assessment (minutes:seconds):

File size	PDFBOX	Acrobat	JPOD	iText
74.1 KB	00:02.591	00:03.155	00:01.683	00:00.963
257.5 KB	00:01.680	00:03.191	00:00.116	00:00.78
1.6 MB	00:05.805	00:02.884	00:02.532	00:02.79
28.1 MB	E	01:10.983	00:43.815	E
13.6 MB	00:05.218	00:04.331	00:00.599	00:00.77
1.9 MB	00:02.782	00:14.506	00:00.608	00:00.707
1.6 MB	00:06.182	00:02.98	00:00.906	00:02.413
8.9 MB	00:05.98	00:03.894	00:00.680	00:00.647
2.4 MB	00:14.15	00:07.893	00:02.826	E
604.7 KB	00:03.342	00:04.721	00:00.551	00:01.222
100.6 KB	00:01.819	00:04.212	00:00.84	00:00.456
1.6 MB	00:05.633	00:03.99	00:00.883	00:02.18
10.3 MB	00:22.311	00:22.145	00:27.663	E
1.9 MB	00:06.943	00:14.736	00:01.200	E
2.1 MB	00:02.573	E	00:00.498	00:00.475
111.0 KB	00:01.956	00:02.846	00:00.705	00:00.300
814.3 KB	00:02.552	00:04.221	00:00.306	00:00.900
2.0 MB	00:06.319	00:07.128	00:01.821	00:02.796
338.7 KB	00:01.950	00:03.684	00:00.79	00:00.415
12.9 MB	00:15.932	00:13.628	00:04.989	E
7.3 MB	E	00:17.275	00:16.377	E
97.2 MB	00:27.291	00:01.994	00:05.739	E
5.2 MB	00:07.773	00:11.108	00:01.964	E

Total:
Best	0	2	14	7
Middle	12	7	8	8
Worst (including errors)	11	14	1	8

Errors	2	1	0	8

The worst and best times are highlighted in red and green respectively. The letter "E" means the parse failed: the process hung, crashed with a buffer overflow, or hit some other error.
On speed, the clear winner was JPod. I was also pleased that it parsed every file without errors.
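
The article does not spell out how a row is turned into Best, Middle and Worst marks. The sketch below uses a rule that appears consistent with the totals above: the fastest finisher is Best, every error and the slowest finisher are Worst, and the rest are Middle. The TimeRanking class is hypothetical, and a shortened cell such as 00.78 is read here as 0.78 seconds, which is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper that ranks the cells of one row of the timing table.
class TimeRanking {
    // Parses "mm:ss.SSS" (spaces tolerated) into milliseconds; "E" becomes -1.
    static long parseMillis(String cell) {
        String s = cell.replace(" ", "");
        if (s.equals("E")) {
            return -1;
        }
        String[] parts = s.split(":");
        double seconds = Integer.parseInt(parts[0]) * 60 + Double.parseDouble(parts[1]);
        return Math.round(seconds * 1000);
    }

    // "Best" for the fastest finisher, "Worst" for errors and for the slowest
    // finisher, "Middle" for everything else.
    static String[] rank(String... cells) {
        long[] ms = Arrays.stream(cells).mapToLong(TimeRanking::parseMillis).toArray();
        long min = Arrays.stream(ms).filter(v -> v >= 0).min().orElse(-1);
        long max = Arrays.stream(ms).filter(v -> v >= 0).max().orElse(-1);
        String[] marks = new String[ms.length];
        for (int i = 0; i < ms.length; i++) {
            if (ms[i] < 0) {
                marks[i] = "Worst";
            } else if (ms[i] == min) {
                marks[i] = "Best";
            } else if (ms[i] == max) {
                marks[i] = "Worst";
            } else {
                marks[i] = "Middle";
            }
        }
        return marks;
    }

    public static void main(String[] args) {
        // Columns in table order: PDFBOX, Acrobat, JPOD, iText.
        System.out.println(Arrays.toString(rank("00:02.591", "00:03.155", "00:01.683", "00:00.963")));
        System.out.println(Arrays.toString(rank("E", "01:10.983", "00:43.815", "E")));
    }
}
```

Counting the marks per column over all rows gives the Best, Middle and Worst totals; counting the -1 values gives the Errors line.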

Quality assessment:

The quality assessment was rather subjective and used only three grades: Best, Middle and Worst. There is also an Empty grade, given when the parser crashed or simply found no text in the document.
I assessed how closely the extracted text matched the original, but not very strictly, because the text was needed for the index, not for display.

file size	PDFBOX	Acrobat	JPOD	iText
74.1 KB	Best	Middle	Best	Best
257.5 KB	Best	Middle	Best	Empty
1.6 MB	Best	Empty	Empty	Worst
28.1 MB	Empty	Middle	Best	Empty
13.6 MB	Empty	Empty	Empty	Empty
1.9 MB	Best	Middle	Worst	Middle
1.6 MB	Best	Empty	Worst	Worst
8.9 MB	Best	Middle	Middle	Best
2.4 MB	Best	Middle	Worst	Empty
604.7 KB	Best	Middle	Middle	Middle
100.6 KB	Best	Middle	Best	Empty
1.6 MB	Best	Empty	Worst	Worst
10.3 MB	Best	Best	Best	Empty
1.9 MB	Best	Best	Best	Empty
2.1 MB	Best	Best	Best	Empty
111.0 KB	Best	Best	Best	Best
814.3 KB	Best	Worst	Best	Best
2.0 MB	Best	Middle	Best	Best
338.7 KB	Middle	Middle	Best	Empty
12.9 MB	Best	Best	Best	Empty
7.3 MB	Empty	Best	Best	Empty
97.2 MB	Best	Best	Middle	Empty
5.2 MB	Best	Best	Best	Empty

Total:
Best	19	8	14	5
Middle	1	10	3	2
Worst (including empty)	3	5	6	16

Empty	3	4	2	13

Parsing characteristics:
Acrobat very often returns the text as a single line. Spaces between words and sentences are preserved, so in principle this is not critical for indexing.
iText does not understand "non-English" characters. The test set included documents in English, German and French, and all the umlauts were lost: I got question marks in their place. Perhaps this is configurable somewhere, but the other libraries handled such characters without any extra effort.
PDFBOX gave no cause for complaint on quality.
JPod is fine too, except for one quirk that took quite some time to sort out. In some cases a document is parsed, completely or partially, one letter per line, and such output is useless for the index.
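
The one-letter-per-line symptom is easy to detect automatically, which helps when checking many files. The heuristic below is only a sketch: the LetterPerLineCheck class and the 0.5 threshold are assumptions of mine, not something taken from the original test setup.

```java
// Hypothetical sanity check for extracted text.
class LetterPerLineCheck {
    // Fraction of non-blank lines that consist of exactly one character.
    static double singleCharLineRatio(String text) {
        int lines = 0;
        int single = 0;
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            lines++;
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                single++;
            }
        }
        return lines == 0 ? 0.0 : (double) single / lines;
    }

    // Assumed threshold: flag the output if more than half of the lines are single letters.
    static boolean looksDegenerate(String text) {
        return singleCharLineRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksDegenerate("P\nD\nF\n"));
        System.out.println(looksDegenerate("A normal line of text\nand another one"));
    }
}
```

A ratio rather than a yes/no test is used because, as noted above, a document may be affected only partially.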

As a result, JPod was declared the winner, despite its habit of sometimes emitting one letter per line.
That quirk had to be dealt with.

Part two. Inside JPod.

I spent a lot of time digging through the JPod sources, writing emails and reading the project forum. In the end it turned out that this behavior is caused by the orientation of the document pages: portrait pages are parsed normally, landscape pages are not. Attempts to tune the parameters yielded nothing, and the class properties responsible for page orientation turned out to be useless.
At one point I decided simply to remove all font handling from the classes, since fonts are not needed for text indexing anyway. It helped, because the text blocks were being calculated incorrectly, and the fonts were the cause.
I would have stopped there, but Egiptyanin insisted on seeing it through to the end. From then on I hardly took part.
A solution was found, and it is used in this form: the affine transformation matrix was overridden, with a static matrix set in place of the dynamic one. The CSPlainTextExtractor class is used instead of CSTextExtractor. The new class looks like this:

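import de.intarsys.pdf.content.text.CSTextExtractor;

// Text extractor that ignores the text matrix supplied by the document.
public class CSPlainTextExtractor extends CSTextExtractor {
    @Override
    public void textSetTransform(float a, float b, float c, float d, float e, float f) {
        // Always pass the identity matrix, so page rotation no longer affects text layout.
        super.textSetTransform(1, 0, 0, 1, 0, 0);
    }
}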

Of course, this is not a panacea: very occasionally the parser fails to insert line breaks where they are needed, but that does not matter for indexing.
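
To see why pinning the matrix to the identity can help with landscape pages, here is a small standalone illustration using the standard java.awt.geom.AffineTransform class rather than JPod itself. It assumes, and this is only my guess about the mechanism, that an extractor groups glyphs into lines by their y coordinate. The TextMatrixDemo class and the sample coordinates are made up for the example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Standalone illustration; it does not use or reproduce JPod internals.
class TextMatrixDemo {
    // Maps a point from text space to page space using PDF-style matrix entries a..f:
    // x' = a*x + c*y + e, y' = b*x + d*y + f.
    static Point2D place(float a, float b, float c, float d, float e, float f,
                         double x, double y) {
        AffineTransform matrix = new AffineTransform(a, b, c, d, e, f);
        return matrix.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Three glyph origins advancing along the text baseline.
        double[] advances = {0, 10, 20};
        for (double x : advances) {
            Point2D rotated = place(0, 1, -1, 0, 0, 0, x, 0);  // 90-degree rotation
            Point2D pinned = place(1, 0, 0, 1, 0, 0, x, 0);    // identity, as in the override
            System.out.println("rotated: " + rotated + "   identity: " + pinned);
        }
    }
}
```

Under the rotation every glyph gets a different y value, so an extractor that starts a new line whenever y changes would put each letter on its own line. With the identity matrix the glyphs share one y value and advance along x, which matches the behavior the override restores.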

That is all, really. Thank you for your attention =)

P.S. This is my first more or less serious article, so I am hoping for objective criticism.

Upd. Particularly attentive readers found inconsistencies in the tables; these have been fixed.

Source: https://habr.com/ru/post/57076/
