Text at any cost: DOCX and ODT

Recently I faced the task of extracting plain text from various document formats, be it Microsoft Word documents or PDF. The task was solved, and for a somewhat wider list of input formats than originally required. So with this article I open a series of publications on reading text from the following file types: DOC, DOCX, RTF, ODT and PDF, using PHP without third-party utilities.

To begin with, I will answer a quite reasonable question: "Why is this necessary at all?" True, the plain text extracted from, say, a Word document is a rather jumbled mess. But this "mess" is quite enough to build, for example, an index for searching an extensive repository of office documents.

Another quite reasonable question: "Why not use third-party utilities such as antiword or xpdf, or, as a last resort, OLE under Windows?" Those were the conditions set for the task. Besides, OLE is very slow, even though the problem can be solved with it.

Today, as a starter, I'll talk about two formats that are fairly simple for the task at hand: Office Open XML, better known as Microsoft's DOCX, and OpenDocument Format, also known as ODT, from the ODF Alliance.

To get started, let's look inside a couple of files. We see literally the following (DOCX in the background, ODT in the foreground):

The most important thing we see here is the two characters PK at the very beginning of the data. This means that both files are ordinary zip archives renamed to .docx / .odt. Open one as an archive, for example with Ctrl+PageDown in Total Commander, and you will see a quite reasonable structure (ODT on the left, DOCX on the right):
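
The signature check described above can also be done in code. The following is only a minimal sketch: the function name looksLikeZip is hypothetical and not part of the original article, and it assumes PHP 7+ for the type declarations. It compares the first four bytes of a file with the ZIP local-file-header signature "PK\x03\x04".

```php
<?php
// Hypothetical helper (not from the original article): checks whether a file
// starts with the ZIP local-file-header signature "PK\x03\x04".
function looksLikeZip(string $filename): bool
{
    // Open in binary mode; suppress the warning for missing/unreadable files.
    $handle = @fopen($filename, 'rb');
    if ($handle === false) {
        return false;
    }
    // Read only the first four bytes, which is all the check needs.
    $signature = fread($handle, 4);
    fclose($handle);
    return $signature === "PK\x03\x04";
}
```

A passing check only says that the file is probably a zip archive; whether it is really a DOCX or ODT document still depends on the entries found inside.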

So, the data files we need are content.xml in ODT and word/document.xml in DOCX. To read the text data from them, we will write some simple code:

    function odt2text($filename) {
        return getTextFromZippedXML($filename, "content.xml");
    }

    function docx2text($filename) {
        return getTextFromZippedXML($filename, "word/document.xml");
    }

    function getTextFromZippedXML($archiveFile, $contentFile) {
        // Create a ZipArchive instance ...
        $zip = new ZipArchive;
        // ... and try to open the file. open() returns true on success and an
        // integer error code on failure, so the result is compared strictly.
        if ($zip->open($archiveFile) === true) {
            // If successful, look for the data file in the archive
            if (($index = $zip->locateName($contentFile)) !== false) {
                // If we find it, read it into a string
                $content = $zip->getFromIndex($index);
                // Close the zip archive, we don't need it anymore
                $zip->close();
                if ($content === false || $content === '') {
                    return "";
                }
                // Load the XML, substituting entities and processing XInclude
                // where possible; errors and warnings are swallowed.
                // loadXML() is an instance method, so create the object first.
                $xml = new DOMDocument();
                if (!$xml->loadXML($content, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    return "";
                }
                // Then return the data without XML formatting tags
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }
        // If something went wrong, return an empty string
        return "";
    }
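
One reason the result looks like a "mess" is that strip_tags() removes the markup without leaving any separator, so neighbouring paragraphs usually end up glued together. The sketch below is a possible variation, not the article's method: the function name xmlToPlainText is hypothetical, it assumes PHP 7+ for the type declarations, and it assumes that paragraphs are closed by </w:p> in DOCX and by </text:p> or </text:h> in ODT. It adds a newline after each closing paragraph tag before stripping and then decodes XML entities such as &amp;.

```php
<?php
// Hypothetical variation (not from the original article): keeps paragraph
// boundaries by adding a newline after each closing paragraph tag before
// the markup is stripped.
function xmlToPlainText(string $xml): string
{
    // Assumed paragraph closers: </w:p> in DOCX, </text:p> and </text:h> in ODT.
    $xml = str_replace(
        array('</w:p>', '</text:p>', '</text:h>'),
        array("</w:p>\n", "</text:p>\n", "</text:h>\n"),
        $xml
    );
    // Remove the tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}
```

It works on the XML string itself, so it could be applied to the string read from the archive in place of the plain strip_tags() call.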

Just 30 or so lines, and we get the text data from the file. The code works under PHP 5.2+ and requires php_zip.dll under Windows or the --enable-zip configure option under Linux. If ZipArchive cannot be used (an old PHP version or missing libraries), the PclZip library, which reads zip files in pure PHP without relying on system facilities, may well be a suitable replacement.
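
Whether ZipArchive is available can be checked at run time before choosing how to read the archive. This is a minimal sketch under stated assumptions: the function name detectZipBackend is hypothetical, PHP 7+ is assumed for the return type, and PclZip is only detected if its class has already been loaded (for example via require_once) or can be autoloaded.

```php
<?php
// Hypothetical helper (not from the original article): reports which ZIP
// backend could be used to read DOCX/ODT files.
function detectZipBackend(): string
{
    // The zip extension provides the ZipArchive class.
    if (extension_loaded('zip') && class_exists('ZipArchive')) {
        return 'ZipArchive';
    }
    // Fallback: the PclZip library, if its class is loaded or autoloadable.
    if (class_exists('PclZip')) {
        return 'PclZip';
    }
    return 'none';
}
```

The calling code can then branch on the returned string and, for 'none', report that no way of reading zip archives is installed.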

Note that this code is only a starting point for solving text-reading problems. After the series of articles under the slogan "Text at any cost", I will try to describe the principles and implementation of reading formatted text.

Next time I will talk about reading text from a PDF without the help of xpdf. That is harder, but quite feasible in PHP.

Source: https://habr.com/ru/post/69417/
