
How to build a web service that converts Excel, Word, TXT and other files to PDF in "as I see it" mode

Where did the silly habit of keeping bookkeeping, accounting and financial documents in MS Excel format come from? Why store and send documents meant for printing as spreadsheets when there is a dedicated format for that, PDF? Yet every accounting program exports documents to MS Excel first and only then offers alternative ways of saving them. Hence the problem: let the accountant save documents however he likes, but the client must receive them as PDF, looking exactly as the accountant laid them out in the MS Excel template.
Our accounting program was the free Aircraft: Accounting. (To tell the truth, it can save documents to PDF directly, but if Excel was requested, then Excel it is.)
The accountant uploads an XLS file to a specific directory on disk; we have to pick it up from there, convert it to PDF and save the result to another directory. All of this has to work as a web service: the client should see their PDF documents in the "Personal Account" section of the site and be able to download them.

Solutions:

My first idea was to find a ready-made library for PHP or Perl and convert files on the fly.
Libraries for working with the Excel format do exist, for example PHPExcel, PHPExcelReader and Spreadsheet::ParseExcel.
They work well, but they do exactly what they were designed for: finding data in an Excel spreadsheet and operating on it.
We need something quite different: the Excel sheet rendered as it would be printed, with all the pictures of stamps and signatures, and with font and cell formatting preserved.

The second solution is a virtual printer. The idea is to open the file in a suitable program and print it, not to a real printer but to a virtual one that writes to a file instead of paper: first PostScript (.ps), which is then turned into a PDF.
Since the system also has to work as a web service, I chose Linux with Apache as the platform, and the free OpenOffice.org 3.4 as the program that can open all MS Office files.
So, here is what we do:

Install OpenOffice. The PyODConverter installation guide tells you to install specifically the headless build of OpenOffice.org 2.4, but I simply installed OpenOffice.org 3.4 from the repository and everything worked.
After installation I tried to start the program, but the system refused, saying it wanted a Java Runtime Environment. Is one really needed? It turned out not. Besides, launching the whole OpenOffice application just to send a file to a virtual printer is pointless: the program has an excellent built-in PDF converter that is easy to call from the command line.

This is done as follows.

Create a converter to PDF

Create a bash file, for example named converter.sh:

<code>
<source lang="bash">
#!/bin/bash

# Check where OpenOffice.org and Python are installed.
# Correct the paths manually if they are different on your system.

OOFFICE=`ls /usr/bin/openoffice.org3 /usr/bin/ooffice /usr/lib/openoffice/program/soffice 2>/dev/null | head -n 1`
OOOPYTHON=`ls /opt/openoffice.org*/program/python /usr/bin/python 2>/dev/null | head -n 1`

if [ ! -x "$OOFFICE" ]
then
  echo "Could not auto-detect OpenOffice.org binary"
  exit 1
fi

if [ ! -x "$OOOPYTHON" ]
then
  echo "Could not auto-detect OpenOffice.org Python"
  exit 1
fi

echo "Detected OpenOffice.org binary: $OOFFICE"
echo "Detected OpenOffice.org python: $OOOPYTHON"

# Reference: http://wiki.services.openoffice.org/wiki/Using_Python_on_Linux
# If you use the OpenOffice.org that comes with Fedora or Ubuntu, uncomment the following line:
# export PYTHONPATH="/usr/lib/openoffice.org/program"

# To simulate an environment without an X server for testing, uncomment the next line.
#unset DISPLAY

# Kill any running OpenOffice.org processes.
killall -u `whoami` -q soffice

# Important: the next command downloads the Python script that drives OpenOffice
# during conversion. If all goes well, DocumentConverter.py appears in the current
# directory after converter.sh runs.
# If the download fails, fetch DocumentConverter.py manually from the address below
# and put it in the same directory as this script.
# Check that the port specified in that script is 8100.
test -f DocumentConverter.py || wget http://www.artofsolving.com/files/DocumentConverter.py

# Start OpenOffice.org in listening mode on TCP port 8100.
"$OOFFICE" "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &

# Wait for a few seconds.
sleep 5s

# Conversions are listed here as: <source file> <target file>.
# The nice part is that the target can be not only PDF but any format
# supported by OpenOffice.org.
#
# From MS PowerPoint to Flash
"$OOOPYTHON" DocumentConverter.py sample.ppt sample.swf

# From Excel to PDF
"$OOOPYTHON" DocumentConverter.py sample.xls sample.pdf

# Close OpenOffice.org.
killall -u `whoami` soffice

# ----------------------------------------------------

</source>
</code>
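The fixed five-second pause in the script is only a guess at how long OpenOffice needs to start listening. As a rough sketch (not part of the original setup), the wait could instead poll the TCP port until it accepts connections. The function name `wait_for_port` and the timeout values are assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.25):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would run something like `wait_for_port("localhost", 8100)` after launching the office process and start converting only if it returns True.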


Make converter.sh executable (chmod 755).
Put an Excel file named sample.xls in the same directory as converter.sh and run the script; sample.pdf will appear in the directory, an exact copy of the Excel file as it would be printed.
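The task described at the start involves a whole upload directory rather than a single sample.xls. The sketch below shows one possible way to pair each uploaded file with its target PDF and call DocumentConverter.py for each pair. The directory arguments `inbox` and `outbox`, the extension list and both function names are hypothetical, and OpenOffice is assumed to be already listening as in converter.sh.

```python
import subprocess
from pathlib import Path

# Assumed set of source formats; adjust to what is actually uploaded.
SOURCE_EXTENSIONS = {".xls", ".doc", ".docx", ".rtf", ".txt", ".odt"}


def plan_conversions(inbox, outbox):
    """List (source, target) pairs for files in inbox lacking an up-to-date PDF in outbox."""
    pairs = []
    for src in sorted(Path(inbox).iterdir()):
        if not src.is_file() or src.suffix.lower() not in SOURCE_EXTENSIONS:
            continue
        dst = Path(outbox) / (src.stem + ".pdf")
        # Convert when the PDF is missing or older than the source file.
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            pairs.append((src, dst))
    return pairs


def convert_all(inbox, outbox, python="python", converter="DocumentConverter.py"):
    """Run the converter once per pending file, in the form: python DocumentConverter.py in out."""
    for src, dst in plan_conversions(inbox, outbox):
        # Arguments are passed as a list, so file names are never interpreted by a shell.
        subprocess.run([python, converter, str(src), str(dst)], check=True)
```

Keeping the planning step separate from the subprocess call makes it possible to check which files would be converted without starting OpenOffice.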

Half the job is done. Now the conversion has to be organized as a web service.
You could, of course, drop converter.sh into Apache's /cgi-bin/ directory and run it directly, but this is where security concerns begin. The security of this service deserves great care, because we are going to hand unknown files to the script and let it write arbitrary output to disk.
The right approach is to place converter.sh above Apache's DOCUMENT_ROOT and reach it through an intermediary script, for example in Perl, which lives in /cgi-bin/ and carefully checks every parameter passed to converter.sh.
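The article suggests Perl for the intermediary script. Purely as an illustration of the kind of checks such a script might perform, here is a sketch in Python. The naming policy, the extension list and the function `resolve_upload` are assumptions, not the author's actual rules.

```python
import re
from pathlib import Path

# Hypothetical policy: the name starts with a letter, digit or underscore,
# contains only word characters, dots and hyphens, and ends in a known extension.
SAFE_NAME = re.compile(r"\w[\w.-]{0,99}\.(xls|doc|docx|rtf|txt|odt)", re.IGNORECASE)


def resolve_upload(name, inbox):
    """Return the absolute path of `name` inside `inbox`, or raise ValueError."""
    if not SAFE_NAME.fullmatch(name):
        raise ValueError("rejected file name: %r" % name)
    base = Path(inbox).resolve()
    candidate = (base / name).resolve()
    # After resolving symlinks the file must still sit directly in the upload directory.
    if candidate.parent != base:
        raise ValueError("path escapes the upload directory: %r" % name)
    return candidate
```

Only a path returned by such a check would be handed on to the converter, which rules out slashes, shell metacharacters and unexpected file types in the request.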

Problems found after the conversion service went live:
It turned out that Linux does not handle file names containing Russian letters in windows-1251 encoding. What can be done about it:
1. Persuade the accountant to save files with Latin names (difficult)
2. Convert the file name to UTF-8 when the file is uploaded to the directory (quite feasible)
3. Implement the same service on Windows with Apache (worth a try)
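Option 2 can be sketched as follows, assuming the upload directory is on a Linux file system where names are stored as raw bytes. Both function names are hypothetical. Names that are already valid UTF-8 are left alone, and anything else is treated as windows-1251.

```python
import os


def to_utf8_name(raw):
    """Decode a file name given as bytes: keep valid UTF-8, otherwise read it as windows-1251."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in windows-1251 are replaced rather than raising an error.
        return raw.decode("cp1251", errors="replace")


def rename_to_utf8(directory):
    """Rename every entry in `directory` whose name is not valid UTF-8."""
    directory = os.fsencode(directory)
    for raw in os.listdir(directory):  # bytes path in, bytes names out
        fixed = to_utf8_name(raw).encode("utf-8")
        if fixed != raw:
            os.rename(os.path.join(directory, raw), os.path.join(directory, fixed))
```

The heuristic can misfire on the rare windows-1251 name whose bytes also happen to form valid UTF-8, so it is a starting point rather than a guarantee.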

The converter turned out to be just as easy to implement under Windows.

Here is what I did:
1. Downloaded and installed LibreOffice 3.4
2. Downloaded PyODConverter
3. Saved the DocumentConverter.py file in a working directory, for example C:\test\
4. Set the constant DEFAULT_OPENOFFICE_PORT = 8100 in DocumentConverter.py (a different port was specified by default)
5. Put a test file for conversion, test.xls, into C:\test\

Now we start the conversion process.
First, start LibreOffice in headless mode. At the command prompt (cmd), enter:

"C:\Program Files\LibreOffice.org 3.4\program\soffice.exe" -headless -nologo -norestore -accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager

It is important that the port here matches DEFAULT_OPENOFFICE_PORT.
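To avoid a silent mismatch between the two port settings, the port could be read from DocumentConverter.py and the -accept option built from it. This is only a sketch: it assumes the constant is assigned at the start of a line as a plain integer, and both helper names are made up for illustration.

```python
import re


def read_default_port(converter_source):
    """Extract the integer assigned to DEFAULT_OPENOFFICE_PORT in the script's source text."""
    match = re.search(r"^DEFAULT_OPENOFFICE_PORT\s*=\s*(\d+)", converter_source, re.MULTILINE)
    if match is None:
        raise ValueError("DEFAULT_OPENOFFICE_PORT not found")
    return int(match.group(1))


def accept_argument(port, host="localhost"):
    """Build the -accept option in the same form as the soffice command line."""
    return "-accept=socket,host=%s,port=%d;urp;StarOffice.ServiceManager" % (host, port)
```

A launcher script could read c:\test\DocumentConverter.py, pass its text to `read_default_port`, and use the result of `accept_argument` when starting soffice.exe.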

Now we run the conversion:
"C:\Program Files\LibreOffice.org 3.4\program\python" c:\test\DocumentConverter.py c:\test\test.xls c:\test\test.pdf

The converted PDF file then appears in the c:\test\ directory.

So the converter for xls, doc, docx, rtf, txt, odt, ott, sxw, stw, html and xml files is ready; in general, it handles everything OpenOffice can open.

Source: https://habr.com/ru/post/148800/

