iText: extracting text from PDF

Greetings, Habr readers!

Recently I faced a task: learn to pull text out of a PDF while remembering its position on the page. And, of course, a task that looked simple at first turned out to have pitfalls. How did I solve it in the end? The answer is below the cut.

A bit about the PDF format

PDF (Portable Document Format) is a popular cross-platform document format based on a subset of the PostScript language. Its main purpose is to display a document identically on different operating systems, devices, and so on.
The first idea was simply to ~~reinvent the wheel~~ do it myself: open the PDF and pull the text out. While trying this, I realized that the inside of a PDF is not arranged very nicely, and I ran into several facts that seriously complicate the task:

words can be split into pieces illogically. For example, the word "algorithms" may be drawn, roughly speaking, in three parts: show "al", show "gorit", show "hms"
the lines of the text, and the words within a line, may not appear in the order in which we read them
in some documents spaces are set explicitly (that is, there are text-showing commands containing ' '), while in others they exist only because adjacent words are drawn at some distance from each other
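
To make the first two pitfalls concrete, here is a minimal sketch in plain Java with no PDF library involved. The class FragmentOrdering and its Fragment type are hypothetical names invented for this illustration, and the coordinates are made up. It assumes each fragment already carries the position at which it was drawn, and that fragments on the same line share exactly the same y value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of any PDF library.
final class FragmentOrdering {

    // A piece of text plus the position where the document asked to draw it.
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Sorts fragments top to bottom (in PDF the y axis grows upward),
    // then left to right, and glues the pieces of each line together.
    static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));

        StringBuilder out = new StringBuilder();
        Float currentY = null;
        for (Fragment f : sorted) {
            // A change in y means a new line has started.
            if (currentY != null && f.y != currentY) {
                out.append('\n');
            }
            out.append(f.text);
            currentY = f.y;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments arrive in "drawing" order, which is not reading order.
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(72f, 680f, "second line"));
        fragments.add(new Fragment(95f, 700f, "hms"));
        fragments.add(new Fragment(72f, 700f, "al"));
        fragments.add(new Fragment(80f, 700f, "gorit"));
        System.out.println(inReadingOrder(fragments));
    }
}
```

Sorting by position is the same idea that the coordinate-aware strategy later in the article relies on; the sketch just isolates it from the library.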

So the desire to parse PDF by hand vanished instantly.
P.S. All of this involuntarily brought a quote to mind:

Those who love sausage and respect the law had better not watch either being made

Then, having played with several libraries (pdfminer, pdfbox), I settled on iText.

A little about iText

iText is a Java library for working with PDF (there is also a C# version, iTextSharp). Starting from version 5.0.0 it is freely distributed under the AGPL license (which obliges you to give users access to the source code), but there is also a commercial version. It comes with good documentation. For those who want to get to know the library better, I recommend "iText in Action", a book by the library's creator.

A simple way to pull text from a PDF

This code extracts text from a PDF well, but gives no information about where the text is located in the document.

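    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    import java.io.IOException;

    public class SimpleTextExtractor {
        public static void main(String[] args) throws IOException {
            // We assume the program receives one argument: the name of the PDF file
            PdfReader reader = new PdfReader(args[0]);

            // Keep in mind that page numbering in a PDF starts from one
            for (int i = 1; i <= reader.getNumberOfPages(); ++i) {
                TextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                String text = PdfTextExtractor.getTextFromPage(reader, i, strategy);
                System.out.println(text);
            }

            // Clean up after ourselves
            reader.close();
        }
    }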

Now let's go through everything in order.

PdfReader is the class that reads a PDF. It can be constructed not only from a file name, but also from an InputStream, a URL, or a RandomAccessFileOrArray.

TextExtractionStrategy is an interface that defines a text extraction strategy. More about it below.

SimpleTextExtractionStrategy is a class that implements TextExtractionStrategy. Despite its name, it pulls text out of a PDF very nicely and copes with a changing page layout, for example when the text first runs in two columns and then switches to normal full-width lines.

PdfTextExtractor is a utility class with only static methods: two getTextFromPage overloads that differ in one thing, whether the text extraction strategy is passed explicitly or not.

Extracting text together with its coordinates

To do this, we need to look at the TextExtractionStrategy interface, namely at these two methods:

 public void renderText(TextRenderInfo renderInfo)

- during a getTextFromPage call, this method is invoked for every command that displays text. TextRenderInfo holds all the information we need: the text, the font, and the coordinates.

 public String getResultantText()

- this method is called just before getTextFromPage finishes, and its result is returned to the caller.

As an example, let's learn to extract, in the simplest possible way, pairs of the form <y-coordinate of the line, text of the line> for every line on the page.

Interface implementation:

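    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;
    // Assumption: Pair is javafx.util.Pair; any pair class with getKey()/getValue() will do
    import javafx.util.Pair;

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;

    public class TextExtractionStrategyImpl implements TextExtractionStrategy {
        // y-coordinate of a line -> (x-coordinate of a fragment -> fragment text)
        private TreeMap<Float, TreeMap<Float, String>> textMap;

        public TextExtractionStrategyImpl() {
            // reverseOrder: y grows upward in PDF, so larger y means higher on the page
            textMap = new TreeMap<Float, TreeMap<Float, String>>(Collections.reverseOrder());
        }

        @Override
        public String getResultantText() {
            StringBuilder stringBuilder = new StringBuilder();
            // Walk through the lines from top to bottom
            for (Map.Entry<Float, TreeMap<Float, String>> stringMap : textMap.entrySet()) {
                // Walk through the fragments of a line from left to right
                for (Map.Entry<Float, String> entry : stringMap.getValue().entrySet()) {
                    stringBuilder.append(entry.getValue());
                }
                stringBuilder.append('\n');
            }
            return stringBuilder.toString();
        }

        @Override
        public void beginTextBlock() {}

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            // Coordinates of the start of the fragment's baseline
            Float x = renderInfo.getBaseline().getStartPoint().get(Vector.I1);
            Float y = renderInfo.getBaseline().getStartPoint().get(Vector.I2);

            // If no line with this y-coordinate exists yet, create it
            if (!textMap.containsKey(y)) {
                textMap.put(y, new TreeMap<Float, String>());
            }
            textMap.get(y).put(x, renderInfo.getText());
        }

        @Override
        public void endTextBlock() {}

        @Override
        public void renderImage(ImageRenderInfo imageRenderInfo) {}

        // Returns the lines of the page together with their y-coordinates
        ArrayList<Pair<Float, String>> getStringsWithCoordinates() {
            ArrayList<Pair<Float, String>> result = new ArrayList<Pair<Float, String>>();
            for (Map.Entry<Float, TreeMap<Float, String>> stringMap : textMap.entrySet()) {
                StringBuilder stringBuilder = new StringBuilder();
                for (Map.Entry<Float, String> entry : stringMap.getValue().entrySet()) {
                    stringBuilder.append(entry.getValue());
                }
                result.add(new Pair<Float, String>(stringMap.getKey(), stringBuilder.toString()));
            }
            return result;
        }
    }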
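
One thing worth keeping in mind about this implementation: a line is identified by the exact float value of y, so two fragments whose baselines differ by a fraction of a unit end up on different "lines". Below is a minimal sketch in plain Java of one possible way to soften this, by attaching a fragment to an existing line when its y is within a tolerance. The class name TolerantLineGrouping and the tolerance value of 2 units are assumptions made for this illustration, not something taken from iText.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of iText.
final class TolerantLineGrouping {

    // Assumed tolerance in page units; a real document may need a different value.
    static final float Y_TOLERANCE = 2.0f;

    // y of a line -> (x of a fragment -> fragment text), lines sorted top to bottom.
    private final TreeMap<Float, TreeMap<Float, String>> lines =
            new TreeMap<>(Collections.<Float>reverseOrder());

    // Adds a fragment to an existing line if one is close enough, else starts a new line.
    void add(float x, float y, String text) {
        Float key = null;
        for (Float existing : lines.keySet()) {
            if (Math.abs(existing - y) <= Y_TOLERANCE) {
                key = existing;
                break;
            }
        }
        if (key == null) {
            key = y;
            lines.put(key, new TreeMap<>());
        }
        lines.get(key).put(x, text);
    }

    // Returns the text of each line, top to bottom, fragments joined left to right.
    List<String> toLines() {
        List<String> result = new ArrayList<>();
        for (TreeMap<Float, String> line : lines.values()) {
            StringBuilder sb = new StringBuilder();
            for (String piece : line.values()) {
                sb.append(piece);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        TolerantLineGrouping grouping = new TolerantLineGrouping();
        grouping.add(110f, 700.4f, "world");
        grouping.add(72f, 700.0f, "Hello ");
        grouping.add(72f, 680f, "next");
        for (String line : grouping.toLines()) {
            System.out.println(line);
        }
    }
}
```

The linear scan over existing lines is deliberately naive; a page rarely has enough lines for that to matter, and it keeps the sketch easy to follow.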

And the main code looks like this:

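    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    // Assumption: Pair is javafx.util.Pair; any pair class with getKey()/getValue() will do
    import javafx.util.Pair;

    import java.io.IOException;

    public class TextExtractor {
        public static void main(String[] args) throws IOException {
            PdfReader reader = new PdfReader(args[0]);

            for (int i = 1; i <= reader.getNumberOfPages(); ++i) {
                TextExtractionStrategyImpl strategy = new TextExtractionStrategyImpl();

                // The returned string is ignored: the strategy itself collects what we need
                PdfTextExtractor.getTextFromPage(reader, i, strategy);

                System.out.println("Page : " + i);
                for (Pair<Float, String> pair : strategy.getStringsWithCoordinates()) {
                    System.out.println(pair.getKey().toString() + " " + pair.getValue());
                }
            }

            reader.close();
        }
    }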

Notes

Of course, for good text extraction you need to add all sorts of extras: correct handling of text in several columns, handling of spaces that are not given explicitly, and so on. I don't want to go into such details within this article.

I would also like to note that this is only a small part of what the library can do. With it you can create documents and add text and images (including watermarks) to existing ones.
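
As a rough illustration of one of those extras, here is a minimal sketch in plain Java of how implicit spaces could be restored: the fragments of one line are sorted by x, and a space is inserted wherever the gap between the end of one fragment and the start of the next exceeds a threshold. The names GapSpacing, Chunk and minGap are hypothetical, and the threshold in the example is made up. In iText 5 the end of a fragment should be available through getBaseline().getEndPoint(), and TextRenderInfo.getSingleSpaceWidth() looks like a reasonable basis for the threshold, but treat that wiring as an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of iText.
final class GapSpacing {

    // A text fragment of a single line with the x range it occupies.
    static final class Chunk {
        final float startX;
        final float endX;
        final String text;

        Chunk(float startX, float endX, String text) {
            this.startX = startX;
            this.endX = endX;
            this.text = text;
        }
    }

    // Joins the chunks of one line left to right, inserting a space
    // where the horizontal gap between neighbours is larger than minGap.
    static String joinLine(List<Chunk> chunks, float minGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.startX));

        StringBuilder line = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            boolean gapIsWide = previous != null && c.startX - previous.endX > minGap;
            boolean spaceAlreadyThere = endsWithSpace(line) || c.text.startsWith(" ");
            if (gapIsWide && !spaceAlreadyThere) {
                line.append(' ');
            }
            line.append(c.text);
            previous = c;
        }
        return line.toString();
    }

    private static boolean endsWithSpace(StringBuilder sb) {
        return sb.length() > 0 && sb.charAt(sb.length() - 1) == ' ';
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(95f, 120f, "world"));
        chunks.add(new Chunk(72f, 90f, "Hello"));
        chunks.add(new Chunk(120.2f, 125f, "!"));
        System.out.println(joinLine(chunks, 2f));
    }
}
```

A fixed threshold is the simplest choice; tying it to the font's space width would adapt better to different font sizes.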

And here is the link to the repository (oh, that AGPL)

Source: https://habr.com/ru/post/225647/
