Beautiful print to PDF from Django

This article is the result of several years of experiments, so it is a long read. But perhaps it will save someone the many months of stepping on the rakes described below.
Strictly speaking, this is not even about Django, but about printing regulated documents from Python using template engines.
For those too lazy to read further, I will say right away: the problem has not been completely solved. But a more or less working solution has emerged.

1. Task

The user enters data in a web form.
The server inserts this data into a print form template.
The server returns the result to the user in a form suitable for printing.

2. Limitations

Forms are either “soft” (where precision is not critical, for example a contract or an invoice) or “hard” (maximum precision, machine-scanned, for example a migrant notification or an application for the simplified tax system (STS, Form 26.2-1)).
At the same time, even “soft” forms should print as close as possible to what the author intended (if I specified 1 cm margins, the user should get a document with margins of exactly 1 cm) and, above all, respect page breaks (see Forms 11001, 21001, etc.).
Required: minimal effort to convert the source material (as a rule, an .xls or .doc taken from "Consultant" or "Garant").
Because this is a web application, responsiveness and reliability are highly desirable, which in turn makes working with native Python libraries highly desirable.
It should be possible to deploy the whole setup on rented hosting (ideally GAE).
The ability to edit templates visually is desirable.
A quick preview of the template is desirable (and, even better, of the result too).

The first step is choosing the output format. After weighing it from various points of view (cross-platform support, guaranteed results, convertibility), the choice fell on PDF.
Next: the input formats and how to convert them.

3. Soft forms

ODF

This covers the Open Document Format family: ODS, ODT and others.
Everything is very simple here:

Edit the template in LibreOffice (leaving room for the data).
Somehow fill in the fields in Django.
Somehow get a PDF.

Room for the data: either add user-defined fields to the document, or insert {{django}} {{tags_django}} directly into the text. In the first case, filling those fields from Python later is probably possible, but I cannot picture how (or rather, everything I can picture looks extremely convoluted). So we simply place the tags as text.
In that case filling in the fields is trivial: we simply feed the template to the Django template engine (poking around inside the template with Python libraries is left to the Gentoo crowd :-). And to avoid unzipping and zipping documents on every request, they are saved as *.fodX (Flat X), a single uncompressed XML file. The template is fed to the engine as XML.
Getting the PDF, with no alternatives, goes through LibreOffice: feed a LibreOffice daemon (libreofficed, found somewhere among the Ubuntu folks), or use unoconv, or launch LO in daemon mode by hand. All of these options are about the same.
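
As a rough illustration of the conversion step, here is a minimal sketch that shells out to LibreOffice in headless mode. Note that it swaps in a simpler technique than the daemon setups mentioned above: it starts a fresh process per document, which is the slowest variant. The function names and the assumption that the binary is called `soffice` and is on the PATH are mine, not part of the original setup.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, soffice="soffice"):
    """Return the argv list for a one-shot headless conversion to PDF."""
    return [
        soffice,                  # assumed binary name; may differ per distro
        "--headless",             # run without a GUI
        "--convert-to", "pdf",    # target format
        "--outdir", str(outdir),  # directory where the PDF is written
        str(source),
    ]


def convert_to_pdf(source, outdir, timeout=120):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(source, outdir), check=True, timeout=timeout)
    return Path(outdir) / (Path(source).stem + ".pdf")
```

The generous timeout reflects how slow this route can be; a long-running daemon avoids the startup cost but not the conversion cost itself.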

Advantages

You can directly use documents lying around on the Internet (as a rule, from “Consultant”, in Microsoft Office formats).
Editing templates is no problem.
Neither is previewing them.
Possibly, getting the PDF through Google Docs; I have not tried it yet. I am sure it would be blazing fast (and I have no doubt it would be wrong: try uploading that same form 21001 from Consultant, which is available on the tax service site, to Google Docs).

Disadvantages

Sometimes, while you are writing a template, LibreOffice spontaneously mangles tags, inserting things like span lang = "en-GB" inside {{..}}. You then have to restore everything by hand.
Simply fantastic server resource consumption: 100% CPU (of one core only, however many there are), hundreds of megabytes of RAM, and up to a minute or more to produce a PDF (form 21001 takes 50 seconds on a P4-3.0). It is Java, after all.
Pulls in an enormous number of packages (Fedora, CentOS).
Requires at least some X server (Xvfb, for example).
Some hosting providers will probably let you deploy LibreOffice, but I strongly doubt that nic.ru, for example, will. GAE is out of the question.
No preview of the result.
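
The tag mangling from the first item in this list can be partly undone automatically. The sketch below is my own illustration, not something the original setup used: it strips any XML elements that ended up inside `{{ ... }}` or `{% ... %}`. It assumes the injected markup is balanced inside the tag (an opening span with its closing span), and it does not handle markup that splits the braces themselves or template expressions that legitimately contain both `<` and `>`.

```python
import re

# A Django variable or block tag, possibly with XML markup injected inside it.
TAG_RE = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)
# Any XML element (opening, closing or self-closing).
XML_RE = re.compile(r"<[^>]+>")


def repair_tags(xml_text):
    """Remove XML elements that LibreOffice inserted inside template tags."""
    return TAG_RE.sub(lambda m: XML_RE.sub("", m.group(0)), xml_text)
```

Running this over the flat XML file before rendering would catch the common case; anything it misses still has to be fixed by hand.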

Summary

Suitable as a last-resort fallback. But only as a last resort.

HTML

Here, editing templates (by hand) and the template engine (Django's) are straightforward. One small but crucial question remains: how to get the PDF? Quickly, efficiently, with page breaks where they belong. This is where most of the experiments took place.
Numerous experiments with pure-Python HTML renderers (such as PISA and its ancestors, descendants and forks) led to one important (IMHO) conclusion: to get a guaranteed result, use a ready-made HTML engine. Of which, as we all know, there are just 4 (decent ones). Of those, as many as 2 can be used on Linux: Gecko and WebKit. Calling Gecko from Python is most likely possible, but a) it needs a running X server (as with LibreOffice) and b) I did not find a [semi-]finished recipe.
That leaves WebKit:

PyQt4 > Qt > WebKit > QPrinter (such as this). Native (although it drags a lot along with it) and fast, but it does not honor page breaks. It also needs special tricks with DPI and ZoomFactor.
GTK > WebKit > GTK printer (like this). Native and fast, but it does not honor page breaks either.
A specially patched WebKit, wkhtmltopdf, used either as an external binary (the option currently in use) or through a native Python binding (in progress, with some minor problems). Native (with the binding), fast, honors page-break, and the result is guaranteed.
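
For the external-binary variant of the last option, a minimal sketch might look like the following. It assumes a `wkhtmltopdf` binary on the PATH; the function names and default values are illustrative, not taken from the original project. The HTML is passed on stdin and the PDF read from stdout, so no temporary files are needed.

```python
import subprocess


def build_wkhtmltopdf_command(margin="10mm", page_size="A4", binary="wkhtmltopdf"):
    """Return argv for wkhtmltopdf reading HTML from stdin, writing PDF to stdout."""
    return [
        binary,
        "--quiet",
        "--page-size", page_size,
        "--margin-top", margin,
        "--margin-bottom", margin,
        "--margin-left", margin,
        "--margin-right", margin,
        "-",  # input: stdin
        "-",  # output: stdout
    ]


def html_to_pdf(html, **options):
    """Render an HTML string to PDF bytes using the external binary."""
    result = subprocess.run(
        build_wkhtmltopdf_command(**options),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


# Forced page breaks are ordinary CSS placed in the template itself.
PAGE_BREAK = '<div style="page-break-after: always"></div>'
```

Setting the margins explicitly on the command line is what makes the "exactly 1 cm" requirement from the limitations achievable, for example `html_to_pdf(html, margin="1cm")`.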

Advantages

In theory, visual editing is possible.
Instant preview (in the same HTML form) of both the template and the result.
Very fast conversion to PDF.
Pure Python API for conversion to PDF (this is the “in progress” part).

Disadvantages

High-quality HTML still has to be written by hand.
Complex forms (such as 21001) have to be written or drawn yourself, because what is available online is a horrible .xls.
Because the library or binary is compiled for Linux, it will not work on that same nic.ru (FreeBSD) without workarounds. GAE is again out of the question.

Summary

The main option for “soft” documents. Still, the search continues for a high-quality pure-Python HTML renderer: without Flash, JS and other bells and whistles, but with solid CSS support.

Maybe

For the future, TeX, LaTeX, LyX and DocBook formats are under consideration, but so far they offer no advantages (especially for “almost soft” forms like that same 21001).

4. Hard forms

Here everything is much sadder. Especially given that a visual editor is highly desirable here.
In addition, the vast majority (if not all) of the “hard” forms in the Russian Federation use “squares”: the text is split into letters, and each letter goes into its own square (example).
Let's skip the first things that come to mind (like “overlay the text on a TIFF”) and go straight to the finalists.
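
Whatever output format is chosen, the “squares” layout described above needs the value split into one character per cell first. A minimal sketch of that step, with a hypothetical helper name, assuming each field has a fixed number of cells:

```python
def to_cells(text, cell_count, fill=" "):
    """Split text into exactly cell_count one-character cells.

    Raises ValueError when the text does not fit instead of truncating it:
    a silently clipped value on an official form is worse than an error.
    """
    if len(text) > cell_count:
        raise ValueError(f"{text!r} needs {len(text)} cells, the field has {cell_count}")
    # Pad on the right so that every cell of the field gets a value.
    return list(text.ljust(cell_count, fill))
```

For example, `to_cells("IVAN", 6)` yields four letters followed by two blank cells, which a template can then loop over.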

RML

A Reportlab product (yes, python-reportlab is theirs): ordinary XML that lets you work miracles with PDF. Because the well-known python-trml2pdf is already RIP (as its developer honestly wrote to me), I had to take trml2pdf and extend it a bit, since it does not support many interesting RML features, and my religion forbids me from buying (let alone cracking) the commercial rml2pdf.
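
To give a feel for the format, here is a sketch that generates a small RML document from Python: a row of squares drawn at exact coordinates, plus a frame holding ordinary flowing text. The element names (`pageGraphics`, `rect`, `drawCentredString`, `frame`, `story`, `para`) are RML as I understand it; the helper names, coordinates and sizes are made up for illustration, and the output has not been validated against rml2pdf or trml2pdf.

```python
from xml.sax.saxutils import escape


def rml_cells(text, x, y, size=14, count=None):
    """Return RML drawing operations for a row of squares, one character each.

    Coordinates and sizes are in points, measured from the bottom-left corner.
    """
    count = len(text) if count is None else count
    ops = []
    for i in range(count):
        left = x + i * size
        ops.append(
            f'<rect x="{left}" y="{y}" width="{size}" height="{size}" stroke="1" fill="0"/>'
        )
        if i < len(text) and text[i].strip():
            # Centre the character horizontally inside its square.
            ops.append(
                f'<drawCentredString x="{left + size / 2}" y="{y + 3}">'
                f"{escape(text[i])}</drawCentredString>"
            )
    return "\n".join(ops)


def rml_document(cells, body):
    """Wrap fixed graphics and one flowing paragraph into an RML document."""
    return f"""<?xml version="1.0" encoding="utf-8"?>
<document filename="form.pdf">
  <template pageSize="(595, 842)">
    <pageTemplate id="main">
      <pageGraphics>
{cells}
      </pageGraphics>
      <frame id="body" x1="57" y1="57" width="481" height="600"/>
    </pageTemplate>
  </template>
  <stylesheet></stylesheet>
  <story>
    <para>{escape(body)}</para>
  </story>
</document>"""
```

Generating the repetitive drawing operations in code, rather than writing them by hand, is one way to keep the exact-positioning part manageable.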

Advantages

Native.
Fast.
Flexible.
There should be no problems with hosting (in theory), even on GAE (I have not tried it).

Disadvantages

Strictly manual work.
Very annoying syntax when you need to mix precise positioning (“graphics”) with “soft” text (“flowables”) (hence, apparently, the lack of a visual editor).
No preview, of either the template or the result.

Summary

A fallback option for exact forms (especially simple ones).

PDF forms

Everything is very simple here: the source is a PDF, and the final result is a PDF.

Take the original PDF form in your left hand,
and XFDF (plain XML), rendered by the built-in Django template engine, in your right;
merge them (populate) into a new PDF (“flattened”)
and give it to the user.
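
To show what the XFDF half of this looks like, here is a minimal sketch that builds it from a dictionary of field values. It swaps in the standard library's ElementTree instead of a Django template, mainly so that escaping is handled automatically; the function name and field names are hypothetical.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(values):
    """Serialize a {field name: value} mapping as an XFDF document (bytes)."""
    ET.register_namespace("", XFDF_NS)  # emit XFDF as the default namespace
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", name=name)
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = str(value)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The field names must match the names of the form fields in the source PDF exactly, otherwise the values are simply not filled in.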

There is only one problem: step 3.
So far I have not found a native, correctly working Python API for PDF forms (poppler can already do some of it, but it still needs a lot of work), so the only acceptable option is iText. Whether through pdftk or a homegrown wrapper is a matter of taste.
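
With pdftk, the merge and flatten step comes down to a single command. A minimal sketch, assuming a `pdftk` binary on the PATH; the function names and file paths are placeholders:

```python
import subprocess


def build_pdftk_command(form_pdf, xfdf_path, output_pdf, binary="pdftk"):
    """Return argv that fills form_pdf from an XFDF file and flattens the result."""
    return [
        binary,
        str(form_pdf),
        "fill_form", str(xfdf_path),  # populate the form fields
        "output", str(output_pdf),
        "flatten",                    # make the filled fields non-editable
    ]


def fill_form(form_pdf, xfdf_path, output_pdf):
    """Run pdftk; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftk_command(form_pdf, xfdf_path, output_pdf), check=True)
```

This is exactly the "external application instead of a Python API" trade-off: simple to call, but every document costs a process launch.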

Advantages

You can turn anything into a PDF form (how is a separate topic).
You can even edit it (likewise).
An absolutely guaranteed result.
“Squares” are built into PDF (comb fields).
Most likely no problems with hosting (possibly even with GAE); I have not tried.

Disadvantages

It calls an external application instead of a Python API.
Java again.

Summary

The main option for exact print forms.

5. General summary

Where things stand today:

“Soft” forms: html | webkit, but through the rather heavy, bloated and not very portable webkittox library (and the search continues).
“Hard” forms: PDF forms, but through ~~a crutch to~~ an external Java library (and I keep wrestling with poppler).
ODF and RML as the respective fallback options.

P.S. You can see how it all works here: without ODF and RML for now, but they are provided for.

Source: https://habr.com/ru/post/148612/
