
In the corporate sector, the task of automatically converting documents from one format to another comes up from time to time, as does the task of processing and modifying them programmatically. It would seem there is no problem: for sane formats, full-featured libraries were written long ago, so pick Perl or Python, whichever you prefer, and go ahead.
But, to the great regret of system administrators and developers of business applications, a huge share of document workflow currently relies on closed formats that are hard to parse and modify. Let's be frank: we are talking about doc, xls and the like, and in many respects about docx, xlsx and their kin as well. What to do with such files, especially if you do not have a spare Windows machine with the latest version of Office installed, is entirely unclear. Of course, if you have Windows, Visual Studio and working C# skills, parsing Microsoft documents will cause fewer problems. But then there will be problems with ODF. Besides, you often want to save the result as PDF so that it cannot be easily changed.
Fortunately, there is a fairly universal way to work with almost any common document format on any platform. It is discussed below.
No doubt everyone knows about OpenOffice and its progressive fork, LibreOffice. The latest versions of these packages do an excellent job with Microsoft documents, at least much better than many free parsing libraries.
But not many people know that OpenOffice, and of course LibreOffice, have an API that lets you work with documents directly from Python. In particular, with this API you can easily convert documents from one format to another.
Thus, to parse any document, it is enough to convert it to the corresponding ODF format, make all the necessary changes using your favorite programming language, and then, if necessary, convert the result to PDF or to MS Office 2003 format (doc, xls).
Another scenario: you have a pile of documents in editable formats (doc, docx, odt) and need to make PDFs out of them. The same approach lets you do this conversion automatically without any trouble. Or you use standardized ODF for internal workflow, but your partners have not even caught up with docx yet. No problem: LibreOffice will convert ODF to MS Office format automatically.
In general, there are many scenarios for using the LibreOffice API, so the range of tasks it can solve is very wide.
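To illustrate the "convert to ODF, then process it with your favorite language" idea: an ODF file is a ZIP archive, and the main text of the document lives in its content.xml member. The following is only a minimal sketch in Python using the standard library; the function names odt_paragraphs and make_sample_odt are hypothetical, and the sample document is built in memory as a stand-in for a real .odt file produced by LibreOffice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODF "text" namespace, as defined by the OpenDocument specification.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(odt_file):
    """Return the text of every <text:p> element in an ODF text document.

    odt_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() also collects text nested in spans, links and so on.
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odt():
    """Build a tiny in-memory stand-in for an .odt file (illustration only)."""
    content = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<office:document-content "
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Title page</text:p>"
        "<text:p>Hello, <text:span>ODF</text:span>!</text:p>"
        "</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for paragraph in odt_paragraphs(make_sample_odt()):
        print(paragraph)
```

For real editing work a dedicated ODF library is usually more comfortable than raw XML, but the sketch shows why ODF is so much easier to handle than the binary Microsoft formats.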
What is required
This article describes using LibreOffice on Ubuntu, although with slight modifications all the instructions carry over to other Linux distributions and to OpenOffice, as well as to Windows and macOS.
All you need is an installed LibreOffice and Python, as well as basic scripting skills.
Actually, the bash conversion script itself looks like this:
This script can be called from another wrapper script for batch processing a large number of files.
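A comparable conversion step can also be sketched in Python. This is not the bash script from the article but a rough illustration under stated assumptions: a reasonably recent LibreOffice whose soffice binary is on PATH and accepts the --headless, --convert-to and --outdir options. The function names build_convert_command and convert are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_format="pdf", outdir=".", binary="soffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, target_format="pdf", outdir="."):
    """Run the conversion and return the path where the result is expected.

    Assumes target_format is a plain extension such as "pdf" or "odt".
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(build_convert_command(source, target_format, outdir), check=True)
    return Path(outdir) / (Path(source).stem + "." + target_format)


if __name__ == "__main__":
    # Only prints the command; call convert() to actually run LibreOffice.
    print(" ".join(build_convert_command("report.doc", "pdf", "/tmp/out")))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command before anything is executed.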
For example, I had to add title pages of the same format to a large number of doc documents and save the result as PDF. To add the title pages, I used a Perl script and the OpenOffice::OODoc library (available in Ubuntu as the libopenoffice-oodoc-perl package). The result was the following batch processing script:
Now it is enough to run
find /my/doc/path -type f -iname "*.doc" -exec ./convert.sh {} \;
and the output is a set of PDF files with nice title pages.
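If you would rather drive the batch run from Python than from find, the same traversal can be sketched roughly as follows. The helper names find_documents and run_batch are hypothetical, and ./convert.sh is assumed to be the conversion script called in the find command above.

```python
import subprocess
from pathlib import Path


def find_documents(root, extension=".doc"):
    """Return regular files under root whose suffix equals extension, ignoring case."""
    wanted = extension.lower()
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == wanted
    )


def run_batch(root, script="./convert.sh"):
    """Call the conversion script once per document, much like find -exec."""
    for document in find_documents(root):
        # check=False: keep going even if one document fails to convert.
        subprocess.run([script, str(document)], check=False)


if __name__ == "__main__":
    for document in find_documents("."):
        print(document)
```

Calling run_batch("/my/doc/path") would then process every .doc file in the tree, including files whose extension is written in upper case.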
Other features
Using the described technique, you can not only convert between various document formats but also export to image files such as JPEG or PNG. To do this, install ImageMagick, convert the document to PDF with the script described above, and then use ImageMagick to convert the PDF to the desired image format:
convert sample.pdf sample.png
convert sample.pdf sample.jpg
convert sample.pdf sample.tif
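The same ImageMagick call can be wrapped in Python when the image export is part of a larger script. This is a minimal sketch under assumptions: the ImageMagick convert binary is on PATH (it typically relies on Ghostscript to read PDFs), and the function name build_image_command and the default density of 150 are illustrative choices, not something the article prescribes.

```python
import subprocess
from pathlib import Path


def build_image_command(pdf_path, image_format="png", density=150):
    """Build an ImageMagick command that rasterizes a PDF into an image file."""
    pdf_path = Path(pdf_path)
    target = pdf_path.with_suffix("." + image_format)
    # -density goes before the input file so that it affects PDF rasterization.
    return ["convert", "-density", str(density), str(pdf_path), str(target)]


def pdf_to_image(pdf_path, image_format="png", density=150):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_image_command(pdf_path, image_format, density), check=True)


if __name__ == "__main__":
    # Only prints the commands; call pdf_to_image() to actually run ImageMagick.
    for fmt in ("png", "jpg", "tif"):
        print(" ".join(build_image_command("sample.pdf", fmt)))
```

For a multi-page PDF, ImageMagick usually writes one numbered image per page for formats such as PNG or JPEG, so expect several output files rather than one.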
A little more information about automatic document conversion using LibreOffice or OpenOffice can be found here:
http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
The Python API for LibreOffice mentioned above (which, by the way, is called PyUNO) can be used to edit documents directly from Python, although this is often not very convenient. You can read more in this Habr topic.
UPD: As pointed out in the comments, kind people have simplified document conversion with OpenOffice (LibreOffice) by writing a wrapper script, unoconv. This utility does exactly the same thing, in exactly the same way, as the scripts described above. But it will certainly be more convenient in most cases, provided it runs properly on your system.