
How to easily convert a web page to PDF?


I was surprised to find that the Java hub has almost no information on working with PDF documents, so, drawing on personal experience, I would like to use a servlet as an example to show how easily any web page can be turned into a PDF document.

Preamble:
Let's write a simple servlet that fetches a web page we specify over HTTP and generates a complete PDF document from it.

Used libraries:
- Flying Saucer (flying-saucer-pdf), which renders XHTML and CSS to PDF
- HtmlCleaner (htmlcleaner), which turns arbitrary HTML into well-formed XML

Library descriptions for Maven configuration (pom.xml):
<dependency>
    <groupId>org.xhtmlrenderer</groupId>
    <artifactId>flying-saucer-pdf</artifactId>
    <version>9.0.4</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.6.1</version>
</dependency>


Forming the page:
One of the most important steps is forming the page, because the parameters of the future PDF document are set by the page itself, through CSS.

Consider the layout:
page.jsp
<%@ page import="java.util.Date" %>
<%@ page import="java.text.SimpleDateFormat" %>
<%@ page contentType="text/html;charset=UTF-8" language="java" %>
<html>
<head>
    <title></title>
    <style>
        @font-face {
            font-family: "HabraFont";
            src: url(http://localhost:8080/resources/fonts/tahoma.ttf);
            -fs-pdf-font-embed: embed;
            -fs-pdf-font-encoding: Identity-H;
        }

        @page {
            margin: 0px;
            padding: 0px;
            size: A4 portrait;
        }

        @media print {
            .new_page {
                page-break-after: always;
            }
        }

        body {
            background-image: url(http://localhost:8080/resources/images/background.png);
        }

        body * {
            padding: 0;
            margin: 0;
        }

        * {
            font-family: HabraFont;
        }

        #block {
            width: 90%;
            margin: auto;
            background-color: white;
            border: dashed #dbdbdb 1px;
        }

        #logo {
            margin-top: 5px;
            width: 100%;
            text-align: center;
            border-bottom: dashed #dbdbdb 1px;
        }

        #content {
            padding-left: 10px;
        }
    </style>
</head>
<body>
<div id="block">
    <div id="logo"><img src="http://localhost:8080/resources/images/habra-logo.png"></div>
    <div id="content">
        <%-- SimpleDateFormat is not thread-safe, so a new instance is created for each request --%>
        Hello, Habr! Current time: <%=new SimpleDateFormat("HH:mm:ss").format(new Date())%>
        <div class="new_page"> </div>
        New page!
    </div>
</div>
</body>
</html>


Here I want to dwell on a few points. The most important one first: all paths must be absolute! Images, styles, font addresses and so on: write absolute paths for everything. Now let's go through the CSS at-rules (the ones that start with the @ symbol).
@font-face is a rule that tells our PDF generator which font to use and where to get it. The problem is that the library that generates the PDF document does not ship with Cyrillic fonts. That is why you have to declare this way ALL fonts used on your page, even standard ones such as Arial, Verdana or Tahoma; otherwise you risk not seeing Cyrillic text in your document.
Note the properties "-fs-pdf-font-embed: embed;" and "-fs-pdf-font-encoding: Identity-H;": they are required, so do not forget to add them.
@page is the rule that sets the margins and the size of the PDF page. Note that if you specify the A3 page size (and practice shows this is often necessary, because the page does not fit the document in width), the user does not have to print the document on A3 paper: when printing, all the content is simply scaled proportionally to the desired format (most often A4). In other words, do not take the size property too literally, but keep in mind that it can play a key role for you.
@media is a rule that lets you define CSS rules for a particular type of device, in our case "print". Inside this rule we created a class, new_page, after which our PDF generator starts a new page.
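Since every font used on the page needs its own @font-face rule with the same two properties, the rules can be generated instead of written by hand. The following is only a minimal sketch: the class name FontFaceCss, the method build, and the arial.ttf URL are illustrative assumptions and are not part of the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original project): builds one @font-face
// rule per font, always adding the two properties described in the article.
final class FontFaceCss {

    /** Keys are font-family names, values are absolute URLs of the font files. */
    static String build(Map<String, String> fontUrlsByFamily) {
        StringBuilder css = new StringBuilder();
        for (Map.Entry<String, String> font : fontUrlsByFamily.entrySet()) {
            css.append("@font-face {\n")
               .append("    font-family: \"").append(font.getKey()).append("\";\n")
               .append("    src: url(").append(font.getValue()).append(");\n")
               .append("    -fs-pdf-font-embed: embed;\n")
               .append("    -fs-pdf-font-encoding: Identity-H;\n")
               .append("}\n");
        }
        return css.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fonts = new LinkedHashMap<>();
        fonts.put("HabraFont", "http://localhost:8080/resources/fonts/tahoma.ttf");
        // Assumed path, for illustration only
        fonts.put("Arial", "http://localhost:8080/resources/fonts/arial.ttf");
        System.out.print(build(fonts));
    }
}
```

The generated text could be placed inside the style element of the page, so that a newly added font cannot accidentally miss the embedding properties.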
Servlet:
Now we will write a servlet that will return the generated PDF document to us:
PdfServlet.java
package ru.habrahabr.web_to_pdf.servlets;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.xhtmlrenderer.pdf.ITextRenderer;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Date: 31.03.2014
 * Time: 9:33
 *
 * @author Ruslan Molchanov (ruslanys@gmail.com)
 */
public class PdfServlet extends HttpServlet {

    private static final String PAGE_TO_PARSE = "http://localhost:8080/page.jsp";
    private static final String CHARSET = "UTF-8";
    private static final int TIMEOUT_MS = 30000;

    @Override
    protected void service(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        try {
            resp.setContentType("application/pdf");
            byte[] pdfDoc = performPdfDocument(PAGE_TO_PARSE);
            resp.setContentLength(pdfDoc.length);
            resp.getOutputStream().write(pdfDoc);
        } catch (Exception ex) {
            resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();
            out.write("<strong>Something wrong</strong><br /><br />");
            ex.printStackTrace(out);
            ex.printStackTrace();
        }
    }

    /**
     * Prepares the PDF document.
     *
     * @param path path to the page
     * @return the PDF document as a byte array
     * @throws Exception
     */
    private byte[] performPdfDocument(String path) throws Exception {
        // Get the HTML code of the page
        String html = getHtml(path);

        // Buffer that will hold the cleaned-up HTML code
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Clean up the HTML code
        /* HtmlCleaner turns the page into well-formed XML, which Flying Saucer needs as input */
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setCharset(CHARSET);
        TagNode node = cleaner.clean(html);
        new PrettyXmlSerializer(props).writeToStream(node, out);

        // Create the PDF from the prepared HTML code
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(new String(out.toByteArray(), CHARSET));
        renderer.layout();

        /* At this point the PDF could just as well be written to a file, but since
         * this servlet returns the document to the user, we collect it into a byte array */
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        renderer.createPDF(outputStream);

        // Finish the work and release the buffers
        renderer.finishPDF();
        out.flush();
        out.close();

        byte[] result = outputStream.toByteArray();
        outputStream.close();
        return result;
    }

    private String getHtml(String path) throws IOException {
        HttpURLConnection connection = openConnection(path);

        // Normally, 3xx is a redirect; follow one that was not followed automatically
        int status = connection.getResponseCode();
        if (HttpURLConnection.HTTP_MOVED_TEMP == status
                || HttpURLConnection.HTTP_MOVED_PERM == status
                || HttpURLConnection.HTTP_SEE_OTHER == status) {
            // Get the redirect URL from the "Location" header and open a new connection
            String newUrl = connection.getHeaderField("Location");
            connection.disconnect();
            connection = openConnection(newUrl);
        }

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), CHARSET))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while (null != (line = in.readLine())) {
                sb.append(line).append("\n");
            }
            return sb.toString().trim();
        }
    }

    private HttpURLConnection openConnection(String path) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(path).openConnection();
        connection.setInstanceFollowRedirects(true);
        // Timeouts only take effect if they are set before the connection is established
        connection.setConnectTimeout(TIMEOUT_MS);
        connection.setReadTimeout(TIMEOUT_MS);
        return connection;
    }

    @Override
    public String getServletInfo() {
        return "A servlet that generates and returns a PDF file";
    }
}


By the way, you do not have to write a servlet for this: you can move the same logic into a console application that saves PDF documents to files. As you may have noticed, nothing in the servlet needs to be configured or changed (except the path to the page and, possibly, the encoding), so preparing a PDF document is very simple and happens entirely in the view.
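For the console variant, the only new part is writing the byte array to disk instead of to the HTTP response. The sketch below covers just that half and uses only the JDK; the class name PdfFileSaver and the default file name are assumptions, and the placeholder bytes stand in for what performPdfDocument would return (they are not a valid PDF).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not from the original project): saves generated PDF bytes to a file.
final class PdfFileSaver {

    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Writes the document to the target path, creating parent directories if needed. */
    static Path save(byte[] pdfDoc, Path target) throws IOException {
        if (!looksLikePdf(pdfDoc)) {
            throw new IllegalArgumentException("Data does not start with the %PDF- marker");
        }
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, pdfDoc);
    }

    /** PDF files begin with the "%PDF-" marker, so this catches e.g. an HTML error page. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < PDF_MARKER.length; i++) {
            if (data[i] != PDF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes; a real tool would use the result of performPdfDocument(...)
        byte[] pdfDoc = "%PDF-1.4\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        Path target = Paths.get(args.length > 0 ? args[0] : "web-to-pdf-example.pdf");
        Path saved = save(pdfDoc, target);
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

The marker check is a design choice of this sketch rather than something the article requires; it simply avoids saving an error page under a .pdf name.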

In the end, you should get something like this PDF document: github.com/ruslanys/example-web-to-pdf/blob/master/web-to-pdf-example.pdf
I added some more content to my document (I used Habr's main page) and got the following document: github.com/ruslanys/sample-html-to-pdf/blob/master/web-to-pdf-habra.pdf

Link to sources: github.com/ruslanys/sample-html-to-pdf

P.S. In principle, based on this example you can write a whole service that creates a PDF document from any page address. The only thing you need to do is bring the page's HTML in line with our rules: first of all, rewrite all relative paths to absolute ones (fortunately, that is not difficult), and decide by some logic what the dimensions of the document should be.
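Rewriting relative paths to absolute ones could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the original project: the class name AbsoluteUrlRewriter is made up, only double-quoted src and href attributes are handled, and url(...) references inside CSS are left untouched. Resolution against the page address is done with java.net.URI.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original project): makes src/href values absolute.
final class AbsoluteUrlRewriter {

    // Simplification: matches only double-quoted src="..." and href="..." attributes
    private static final Pattern URL_ATTRIBUTE =
            Pattern.compile("\\b(src|href)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Resolves every matched attribute value against the address of the page. */
    static String toAbsolute(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher matcher = URL_ATTRIBUTE.matcher(html);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = matcher.group(2);
            String resolved;
            try {
                // Absolute URLs come back unchanged, relative ones are resolved against the base
                resolved = base.resolve(value).toString();
            } catch (IllegalArgumentException ex) {
                // Not a valid URI (for example, it contains spaces): leave it as it was
                resolved = value;
            }
            String attribute = matcher.group(1) + "=\"" + resolved + "\"";
            matcher.appendReplacement(result, Matcher.quoteReplacement(attribute));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<img src=\"images/logo.png\"><link href=\"/resources/style.css\">";
        System.out.println(toAbsolute(html, "http://localhost:8080/docs/page.jsp"));
    }
}
```

In the servlet, such a step would fit between fetching the HTML and passing it to HtmlCleaner; for real pages a proper HTML parser would be more robust than a regular expression.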

Source: https://habr.com/ru/post/217561/

