
In one of my projects I needed to generate contract documents for clients automatically. The contract is a legal document about 10 pages long and serves as a template: the data of a specific client is substituted in the right places.
Task
The primary requirements were:
In a complex, styled doc or docx document, print the necessary information in the marked places.
Later they were refined and expanded:
- In a complex, styled docx document, print the data in the marked places.
- The output markup should be similar to scriptlets: ${}, <%%>, <%=%>.
- The data for output can be an object, so it must be possible to access its fields.
- Use one of the scripting languages for output: Groovy or JavaScript.
- It must be possible to display lists of objects in tables, with a field in each cell.
Available solutions
It turned out that the available products (I am talking about the Java platform) do not solve this problem. Below is a brief overview:
JasperReports
As a template it uses an XML markup file, *.jrxml. The markup file plus the data (both from a database and from Map parameters) is passed to the processor, which produces any of the following formats: PDF, XML, HTML, CSV, XLS, RTF, TXT.
What did not suit me:
- It is not WYSIWYG, even with iReport, a visual tool for creating jrxml files.
- You need to know the JasperReports API well to create and style a complex template.
- It does not output the format I need. PDF is possible, but I want to be able to edit the output afterwards.
Docx4j
A tool for manipulating the parts of docx, pptx, and xlsx documents through a Java API.
What did not suit me:
- The Docx4j documentation has no matching use case. There is only a brief reference to the XMLUtils.unmarshallFromTemplate function, which performs simple substitutions.
- Output of repeating content is implemented through XML data sources using XPath.
Apache POI
A tool for manipulating the parts of doc, ppt, and xls documents through a Java API. It was originally created to extract data from documents in these formats.
What did not suit me:
- It offers no ready-made solution to this task.
Solution
It was interesting :)
1. The content of the document is stored as XML, compressed in a zip archive. Unpacking and repacking turned out to be difficult because the standard zip implementation in JDK 6 does not let you specify the encoding explicitly (apparently, of file names). Archiving produced a broken docx. I had to use the Groovy AntBuilder wrapper with the corresponding parameter when packing the content.
2. Any text entered into MS Word can be split into pieces by the program and placed into different groups of XML tags. So I first had to clean the template of the generated XML padding. I used regular expressions for this; they seemed faster to me than a SAX parser (although I did not measure performance).
3. I decided to use Groovy as the scripting language because of its simplicity, its Java nature, and its built-in template processor. It brought interesting difficulties of its own. It turned out that even in a small 10-page document you can easily hit the limit on the length of a string between two scriptlets. I had to replace all the text between scriptlets with UUID strings, run the Groovy template processor, and only at the output replace the UUID strings with the original XML pieces.
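Step 1 above can be sketched in plain Java, assuming Java 7 or later, where ZipInputStream and ZipOutputStream accept an explicit Charset. This is not what the project does (it uses AntBuilder, as described); the class DocxZipSketch and its methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: unpacks and repacks a docx with an explicit entry-name encoding.
class DocxZipSketch {

    /** Reads every entry of a docx (zip) archive into memory, keyed by entry name. */
    static Map<String, byte[]> unpack(byte[] docxBytes) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(
                new ByteArrayInputStream(docxBytes), StandardCharsets.UTF_8)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                while ((read = in.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                parts.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return parts;
    }

    /** Writes the parts back into a zip archive, encoding entry names as UTF-8. */
    static byte[] pack(Map<String, byte[]> parts) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(result, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, byte[]> part : parts.entrySet()) {
                out.putNextEntry(new ZipEntry(part.getKey()));
                out.write(part.getValue());
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result.toByteArray();
    }
}
```

With these two methods, processing a template comes down to unpacking, replacing the bytes stored under "word/document.xml", and packing again.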
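The cleaning in step 2 might look roughly like the following. It is a minimal sketch, not the project's actual expressions: it handles only ${...} scriptlets, assumes the scriptlet body itself contains no "}" or "<", and the class name ScriptletCleanerSketch is made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical cleaner: removes XML tags that Word inserted inside a ${...} scriptlet.
class ScriptletCleanerSketch {

    // "$", optional tags, "{", then the shortest run of anything up to the first "}".
    private static final Pattern SCRIPTLET =
            Pattern.compile("\\$(?:<[^>]+>)*\\{.*?\\}", Pattern.DOTALL);

    private static final Pattern XML_TAG = Pattern.compile("<[^>]+>");

    static String clean(String documentXml) {
        Matcher m = SCRIPTLET.matcher(documentXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop every tag inside the matched span so the scriptlet becomes contiguous.
            String joined = XML_TAG.matcher(m.group()).replaceAll("");
            // quoteReplacement keeps "$" in the scriptlet from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because a scriptlet that Word has split starts inside one text run and ends inside another, the tags removed between its first and last character close the first run and open the last one, so the surrounding XML remains balanced.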
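The UUID trick from step 3 can be illustrated as follows. This is a sketch under assumptions: the scriptlets are already in plain form (not XML-escaped), the class PlaceholderSketch and its methods are hypothetical, and the call to the Groovy template processor between the two steps is left out.

```java
import java.util.Map;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: swaps static text for short UUID keys and back again.
class PlaceholderSketch {

    // Matches ${...}, <%...%> and <%=...%> scriptlets.
    private static final Pattern SCRIPTLET =
            Pattern.compile("\\$\\{.*?\\}|<%.*?%>", Pattern.DOTALL);

    /** Replaces each chunk of text between scriptlets with a UUID; fills chunks with key -> text. */
    static String shorten(String template, Map<String, String> chunks) {
        StringBuilder out = new StringBuilder();
        Matcher m = SCRIPTLET.matcher(template);
        int last = 0;
        while (m.find()) {
            out.append(stash(template.substring(last, m.start()), chunks));
            out.append(m.group()); // scriptlets are kept as they are
            last = m.end();
        }
        out.append(stash(template.substring(last), chunks));
        return out.toString();
    }

    private static String stash(String chunk, Map<String, String> chunks) {
        if (chunk.isEmpty()) {
            return "";
        }
        String key = UUID.randomUUID().toString();
        chunks.put(key, chunk);
        return key;
    }

    /** Puts the original chunks back into the output of the template processor. */
    static String restore(String rendered, Map<String, String> chunks) {
        String result = rendered;
        for (Map.Entry<String, String> chunk : chunks.entrySet()) {
            result = result.replace(chunk.getKey(), chunk.getValue());
        }
        return result;
    }
}
```

The template processor then only ever sees 36-character strings between scriptlets, however long the original XML pieces were.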
Having overcome these difficulties, I tried the project in real life. It turned out well!
I created an English-language project site and published it.
Project address:
snowindy.github.com/scriptlet4docx
API example
// Values that the scriptlets in the template can refer to
HashMap<String, Object> params = new HashMap<String, Object>();
params.put("name", "John");
params.put("sirname", "Smith");
// Read the template and write the processed document
DocxTemplater docxTemplater = new DocxTemplater(new File("path_to_docx_template/template.docx"));
docxTemplater.process(new File("path_to_result_docx/result.docx"), params);
I could stop here and refer you to the project website ...
But to make things clearer, I will translate its most interesting section, “Scriptlet types explanation”.
Scriptlet types in detail
Disclaimer:
When processing a template, the Groovy template engine translates all scriptlets into Groovy code and outputs the text between scriptlets like this: out.print('template_text')
${data}
Equivalent to printing data to out:
out.print(data)
<%= data %>
Equivalent to printing data to out:
out.print(data)
<% any_code %>
Executes the code inside the scriptlet and prints nothing by itself. Can be used for conditional output:
<% if (cond) { %> text printed if "cond == true" <% } else { %> text printed if "cond != true" <% } %>
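To make the translation described in the disclaimer concrete, here is a toy translator that turns a template into script text. It only illustrates the idea and is not the Groovy template engine's implementation; the class ScriptletTranslationSketch is hypothetical, and nested braces inside ${...} are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of how scriptlets and static text could map to script statements.
class ScriptletTranslationSketch {

    // Group 1: ${expr}; group 2: <%= expr %>; group 3: <% code %>.
    private static final Pattern SCRIPTLET = Pattern.compile(
            "\\$\\{(.*?)\\}|<%=(.*?)%>|<%(.*?)%>", Pattern.DOTALL);

    static String toScript(String template) {
        StringBuilder script = new StringBuilder();
        Matcher m = SCRIPTLET.matcher(template);
        int last = 0;
        while (m.find()) {
            appendText(script, template.substring(last, m.start()));
            if (m.group(1) != null) {
                script.append("out.print(").append(m.group(1).trim()).append(");\n");
            } else if (m.group(2) != null) {
                script.append("out.print(").append(m.group(2).trim()).append(");\n");
            } else {
                // Plain code scriptlet: copied as is, prints nothing by itself.
                script.append(m.group(3).trim()).append("\n");
            }
            last = m.end();
        }
        appendText(script, template.substring(last));
        return script.toString();
    }

    // Static text becomes a print statement with a single-quoted string literal.
    private static void appendText(StringBuilder script, String text) {
        if (text.isEmpty()) {
            return;
        }
        String escaped = text.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n");
        script.append("out.print('").append(escaped).append("');\n");
    }
}
```

Every piece of static text ends up as one string literal in the generated script, which is why long stretches of XML between two scriptlets can become a problem.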
$[@listVar.field]
This is the most interesting type of scriptlet! It is used to display lists of objects in a table inside a docx document. It must be used within a table cell.
For example, we have a list of Person objects. Each object has two fields: 'name' and 'address'. We want to display the list as a two-column table.
- Create the 'personList' parameter in the input values map; it refers to the list of objects.
- Create a table with two columns and one row in the docx document.
- Enter $[@person.name] in the first cell and $[@person.address] in the second.
- Everything is ready: the personList list will be printed in the table.
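The effect of the steps above can be sketched as row expansion: the single template row is repeated once per list element. This is only an illustration, not the project's code. It assumes each element is a Map rather than an arbitrary object, it does not XML-escape the values, and TableRowSketch and expandRow are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical expansion of one table row per list element.
class TableRowSketch {

    // $[@item.field] placeholder; group 1 is the item name, group 2 the field name.
    private static final Pattern CELL = Pattern.compile("\\$\\[@(\\w+)\\.(\\w+)\\]");

    static String expandRow(String rowTemplate, List<? extends Map<String, ?>> items) {
        StringBuilder rows = new StringBuilder();
        for (Map<String, ?> item : items) {
            Matcher m = CELL.matcher(rowTemplate);
            StringBuffer row = new StringBuffer();
            while (m.find()) {
                // Look the field up in the current element and print it into the cell.
                Object value = item.get(m.group(2));
                m.appendReplacement(row, Matcher.quoteReplacement(String.valueOf(value)));
            }
            m.appendTail(row);
            rows.append(row);
        }
        return rows.toString();
    }
}
```

For a list of two elements the single row in the template becomes two rows in the result, each with the 'name' and 'address' values of its own element.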
Live sample template
It is available here: link.
Project development
If I really invented a new approach to processing docx templates, I would like to popularize it.
The project still has room to grow:
- Full caching
- Scriptlet support in lists
- A streaming API
I would be glad to get advice on promoting the project to an English-speaking audience!