
Generating PDFs by the barrel

Background


Habr has repeatedly covered various tools and methods for taking screenshots of web pages.

I want to share my own reinvented wheel for creating PDFs with Python and Qt, extended and improved so that several projects can use it centrally.

Initially, the generation was run from a PHP script, like this:
<?php
// local file
exec('xvfb-run python2 html2pdf.py file:///tmp/in.html /tmp/out.pdf');
// URL
exec('xvfb-run python2 html2pdf.py http://habrahabr.ru /tmp/habr.pdf');
?>

That was enough, and everything was fine...

How it all began


However, xvfb-run creates display :99 on every launch, so with several parallel tasks the processes squabbled in the logs, although somehow they still worked.

Thanks to xpra, the xvfb-run wrapper no longer has to be started each time: a single virtual X server can be reused, the squabbling stopped, and the overhead went down:

[user@rdesk ~]$ xpra start :99

After that, the script could be started like this:
<?php
// local file
exec('DISPLAY=:99 python2 html2pdf.py file:///tmp/in.html /tmp/out.pdf');
// URL
exec('DISPLAY=:99 python2 html2pdf.py http://habrahabr.ru /tmp/habr.pdf');
?>

Below is the code of html2pdf.py, the application that creates the browser, loads the HTML into it, and prints it to a PDF file.

It is an almost verbatim copy-paste of code found on the internet:
#!/usr/bin/env python2
# -*- coding: UTF-8 -*-
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
import sys

# Check the command-line arguments
if len(sys.argv) != 3:
    print "USAGE app.py URL FILE"
    sys.exit()

# Conversion function
def html2pdf(f_url, f_name):
    # Create the Qt application
    app = QApplication(sys.argv)
    # Create the "browser"
    web = QWebView()
    # Load the URL into it
    web.load(QUrl(f_url))
    # Create the printer
    printer = QPrinter()
    # Set the page size
    printer.setPageSize(QPrinter.A4)
    # Set the output format
    printer.setOutputFormat(QPrinter.PdfFormat)
    # Set the output file name
    printer.setOutputFileName(f_name)

    # Print the "browser" contents to PDF and quit
    def convertIt():
        web.print_(printer)
        QApplication.exit()

    # Once the page has finished loading, "print" it to PDF
    QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)
    sys.exit(app.exec_())

html2pdf(sys.argv[1], sys.argv[2])


The solution worked well enough but had an obvious drawback: it could only handle documents on the local machine. Scaling meant creating a complete copy of the environment. Meanwhile, the number of documents grew, and so did resource consumption.
A centralized solution was needed.

Adding some meat


Concept


Transport: RabbitMQ was already deployed, so it made sense to use what was available.

Document exchange: the source HTML is transferred to the render server, and the resulting PDF is returned.

Why HTML rather than simply "follow the link": I did not find a way to detect that a page has finished loading when a lot of JavaScript is pulling in dynamic content, so loading indicators ended up visible in the resulting PDF.

As it turned out later, following links was not needed anyway. Some documents simply have no external URL. For example, take documents A and B, each consisting of 3 sections: there are 2 links to the complete documents, but only Ap1 and Bp2 need to be sent for rendering. Later, additional print styles were added, turning a document into a "print version" on the fly.

A file system, NFS, and the like were rejected immediately as storage for intermediate files (they add steps to every client deployment).
The obvious choice was a key-value store. memcached would have been almost perfect, if not for one thing: it loses all records on restart.
The choice fell on its sibling, Redis.
It is simple, compact, fast, and scalable; the database is stored on disk; and it has tasty features such as vm/swap.

The guiding principle is "explicit is better than implicit", so I used an end-to-end document ID both in the database and in the Rabbit queue:

Task ID: Q_app1_1314422323.65, where:
Q stands for Queue
app1 is the source project ID
1314422323.65 is the Unix timestamp with a fractional part

Result ID: R_app1_1314422323.65, where:
R stands for Result
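
The ID scheme above can be sketched in a few lines. This is a minimal illustration in Python 3 (the article's own code targets Python 2); the function names make_task_id and result_id_for are hypothetical and do not come from the original project.

```python
import time


def make_task_id(project, now=None):
    """Build a task ID such as 'Q_app1_1314422323.65'."""
    if now is None:
        now = time.time()
    # Two decimal places mirror the example ID in the article.
    return "Q_%s_%.2f" % (project, now)


def result_id_for(task_id):
    """Map 'Q_<project>_<timestamp>' to 'R_<project>_<timestamp>'."""
    prefix, rest = task_id.split("_", 1)
    if prefix != "Q":
        raise ValueError("not a task ID: %r" % task_id)
    return "R_" + rest


task_id = make_task_id("app1", now=1314422323.65)
print(task_id)
print(result_id_for(task_id))
```

Splitting only on the first underscore keeps the mapping valid even if a project ID itself contains an underscore, which is an assumption the original scheme does not state either way.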

Architecture


The route, step by step:
  1. A request to create a PDF arrives; PHP writes the HTML document to the database as binary data and generates an ID.
  2. Once the write has succeeded and the document's existence has been verified, the ID is put on the Rabbit queue.
  3. The renderer receives the new ID.
  4. The renderer fetches the HTML document from the database by ID.
  5. The document is processed (rendered).
  6. The PDF is written to the database under the modified ID, and the source HTML document is removed from the database.
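
The six steps can be walked through with in-memory stand-ins. In this Python 3 sketch, a dict plays the role of Redis and queue.Queue plays the role of the RabbitMQ queue; submit, process_one, and fake_render are hypothetical names, and fake_render only marks the place where the real worker would run html2pdf.py.

```python
import queue

db = {}                # stand-in for Redis: key -> bytes
tasks = queue.Queue()  # stand-in for the RabbitMQ queue


def submit(project, html, timestamp):
    """Steps 1-2: store the HTML, verify it exists, then enqueue the ID."""
    task_id = "Q_%s_%.2f" % (project, timestamp)
    db[task_id] = html
    if task_id not in db:
        raise RuntimeError("HTML was not stored")
    tasks.put(task_id)
    return task_id


def fake_render(html):
    """Placeholder for step 5; the real worker runs html2pdf.py."""
    return b"%PDF-placeholder " + html


def process_one():
    """Steps 3-6: take an ID, load the HTML, render, store the PDF, clean up."""
    task_id = tasks.get()
    html = db[task_id]
    pdf = fake_render(html)
    result_id = "R_" + task_id.split("_", 1)[1]
    db[result_id] = pdf
    del db[task_id]
    tasks.task_done()
    return result_id


task_id = submit("app1", b"<html><body>Hello</body></html>", 1314422323.65)
result_id = process_one()
print(result_id, sorted(db))
```

Only the ID travels through the queue; the document bodies stay in the store, which keeps queue messages small regardless of HTML size.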

Notification that generation has finished is not implemented.
Projects query the database themselves and check for the result:
DB.EXISTS('R_app1_1314422323.65')
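
Since there is no completion callback, a client has to poll. The following Python 3 sketch shows one way to do that with a timeout; wait_for_pdf and FakeStore are hypothetical names. The poller only needs an object with exists(key) and get(key) methods, which a redis-py client provides, so the in-memory FakeStore is used here purely to keep the example self-contained.

```python
import time


class FakeStore:
    """In-memory stand-in exposing the two calls the poller needs."""

    def __init__(self):
        self.data = {}

    def exists(self, key):
        return key in self.data

    def get(self, key):
        return self.data.get(key)


def wait_for_pdf(db, task_id, timeout=30.0, interval=0.5):
    """Poll for the R_ key matching a Q_ task ID; return bytes or None."""
    result_id = "R_" + task_id.split("_", 1)[1]
    deadline = time.monotonic() + timeout
    while True:
        if db.exists(result_id):
            return db.get(result_id)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


store = FakeStore()
store.data["R_app1_1314422323.65"] = b"%PDF-placeholder"
print(wait_for_pdf(store, "Q_app1_1314422323.65", timeout=1.0, interval=0.1))
```

The timeout and interval values are illustrative; with documents averaging several seconds each, a polling interval well under a second mostly adds load on the store.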

Implementation

The code is abridged for readability.
#!/usr/bin/env python2
# -*- coding: UTF-8 -*-
import pika, os, time, threading, logging, redis, Queue

RBT_HOST = 'rabbit.myhost.ru'
RBT_QE = 'pdf.render'
RDS_HOST = 'redis.myhost.ru'
LOG = 'watcher.log'
MAX_THREADS = 4

# Logging setup
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)-15s - %(threadName)-10s - %(message)s',
                    filename=LOG
                    )

def render(msg_id):
    # Temporary file names
    output_file = '/tmp/' + msg_id + '.pdf'
    input_file = '/tmp/' + msg_id + '.html'

    # Fetch the HTML from the database
    logging.debug('[R] Loading HTML from DB...')
    dbcon_r = redis.Redis(RDS_HOST, port=6379, db=0)
    bq = dbcon_r.get(msg_id)
    logging.debug('[R] HTML loaded...')

    # Write the HTML to a temporary file
    logging.debug('[R] Write tmp HTML...')
    fin1 = open(input_file, "wb")
    fin1.write(bq)
    fin1.close()
    logging.debug('[R] HTML written...')

    # Build the render command
    command = 'DISPLAY=:99 python2 ./html2pdf.py %s %s' % (
        'file://' + input_file,
        output_file
    )

    # Run the renderer and time it
    t_start = time.time()
    sys_output = int(os.system(command))
    t_finish = time.time()

    # Sizes of the input and output files, in KB
    i_size = str(os.path.getsize(input_file)/1024)
    o_size = str(os.path.getsize(output_file)/1024)

    # Build the log message
    dbg_mesg = '[R] Render [msg.id:' + msg_id + '] ' +\
               '[rend.time:' + str(t_finish-t_start) + 'sec]' + \
               '[in.fle:' + input_file + '(' + i_size+ 'kb)]' +\
               '[ou.fle:' + output_file + '(' + o_size + 'kb)]'

    # Write it to the log
    logging.debug(dbg_mesg)

    # Read the PDF
    logging.debug('[R] Loading PDF...')
    fin = open(output_file, "rb")
    binary_data = fin.read()
    fin.close()
    logging.debug('[R] PDF loaded...')

    # Change the ID prefix from Q_ to R_
    msg_out = msg_id.split('_')
    msg = 'R_' + msg_out[1] + '_' + msg_out[2]

    # Write the PDF to the database
    logging.debug('[R] Write PDF 2 DB...')
    dbcon_r.set(msg, binary_data)
    logging.debug('[R] PDF committed...')

    # Clean up (database record, temporary files)
    logging.debug('[R] DEL db record: ' + msg_id)
    dbcon_r.delete(msg_id)
    logging.debug('[R] DEL tmp: ' + output_file)
    os.remove(output_file)
    logging.debug('[R] DEL tmp: ' + input_file)
    os.remove(input_file)

    logging.debug('[R] Render done')

    # Return the result
    if not sys_output:
        return True, output_file
    return False, sys_output

def catcher(q):
    '''Worker loop: take task IDs from the queue and render them'''
    while True:
        try:
            item = q.get()  # Blocks until a task is available
        except Queue.Empty:
            break
        logging.debug('Queue send task to render: ' + item)
        render(item)  # Render the document
        q.task_done()  # Mark the task as done

# Start
logging.debug('Daemon START')

# Task queue shared by the worker threads
TQ = Queue.Queue()

logging.debug('Starting threads...')
# Start the worker threads
for i in xrange(MAX_THREADS):
    wrkr_T = threading.Thread(target = catcher, args=(TQ,))
    wrkr_T.daemon = True
    wrkr_T.start()
    logging.debug('Thread: ' + str(i) + ' started')

logging.debug('Start Consuming...')
# Connect to Rabbit, declare the queue, and hand incoming IDs to the workers
try:
    connection = pika.BlockingConnection(pika.ConnectionParameters(host = RBT_HOST))
    channel = connection.channel()
    channel.queue_declare(queue = RBT_QE)

    def callback(ch, method, properties, body):
        TQ.put(body)
        logging.debug('Consumer got task: ' + body)

    channel.basic_consume(callback, queue = RBT_QE, no_ack = True)
    channel.start_consuming()
except KeyboardInterrupt:
    logging.debug('Daemon END')
    print '\nApp terminated!'
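
The worker passes msg_id, which arrives from the queue, into a shell command string. One possible hardening, shown here as a Python 3 sketch rather than as the author's approach, is to pass the arguments as a list and set DISPLAY through the environment, so that no shell parses the ID. The names build_render_call and run_render are hypothetical; the paths, script name, and display number are taken from the listing above.

```python
import os
import subprocess


def build_render_call(msg_id, tmp_dir="/tmp", display=":99"):
    """Return (argv, env) for rendering one document without a shell."""
    input_file = "%s/%s.html" % (tmp_dir, msg_id)
    output_file = "%s/%s.pdf" % (tmp_dir, msg_id)
    argv = ["python2", "./html2pdf.py", "file://" + input_file, output_file]
    # Copy the current environment and point the renderer at the virtual X server.
    env = dict(os.environ, DISPLAY=display)
    return argv, env


def run_render(msg_id):
    """Run the renderer; True when html2pdf.py exits with status 0."""
    argv, env = build_render_call(msg_id)
    return subprocess.run(argv, env=env).returncode == 0


argv, env = build_render_call("Q_app1_1314422323.65")
print(argv)
print(env["DISPLAY"])
```

run_render is defined but not called here, because it needs html2pdf.py and a running X display to do anything useful.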


Some statistics


On hardware that is average by modern standards, a ProLiant DL360 G5 (8 cores E5410 @ 2.33GHz, 16GB RAM), the following results were obtained:
8 threads, load average (LA) 120
Source HTML from 10 KB to 5 MB in size
~5000 generations per minute
Average time per document: 5 seconds

An interesting (linear) relationship emerged between the size of the source HTML and the memory needed to process it:
1 MB of HTML = ~17 MB of RAM
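
The ratio gives a quick rule of thumb for sizing a render host. The Python 3 sketch below simply applies the reported figure; estimate_ram_mb is a hypothetical helper, and the estimate covers only the per-document cost, not the baseline footprint of the Qt process.

```python
RAM_MB_PER_HTML_MB = 17  # ratio reported in the article


def estimate_ram_mb(html_mb):
    """Estimate the RAM (in MB) needed to render an HTML document of html_mb MB."""
    return html_mb * RAM_MB_PER_HTML_MB


for size_mb in (1, 5, 370):
    print("%s MB of HTML -> ~%s MB of RAM" % (size_mb, estimate_ram_mb(size_mb)))
```

For a 370 MB document the rule predicts roughly 6.1 GB, about half of the ~12 GB reported for the stress test, so the linear ratio is probably best trusted only within the measured 10 KB to 5 MB range.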

"Stress endurance test" with an HTML size of 370Mb
To be honest, I expected WebKit to crash; as it turned out, I was wrong.
The document was processed without errors and produced a PDF of ~28,000 pages. The minor detail, of course, is that it took ~50 hours and ~12 GB of RAM (:

Links


Redis + Python
Rabbit + Pika
xpra
GitHub Code

Source: https://habr.com/ru/post/128078/

