PK characters at the beginning of the data. This means that both files are zip archives renamed to .docx / .odt. Open them, for example, with Ctrl+PageDown in Total Commander, and you will see a quite reasonable structure (odt on the left, docx on the right):

Just 30 or so lines, and we get the text data from the file. The code works under PHP 5.2+ and requires
function odt2text($filename) {
    return getTextFromZippedXML($filename, "content.xml");
}

function docx2text($filename) {
    return getTextFromZippedXML($filename, "word/document.xml");
}

function getTextFromZippedXML($archiveFile, $contentFile) {
    // Create a ZipArchive instance...
    $zip = new ZipArchive;
    // ...and try to open the zip file. open() returns true on success and an
    // error code on failure, so compare strictly against true.
    if ($zip->open($archiveFile) === true) {
        // If successful, look for the data file in the archive
        if (($index = $zip->locateName($contentFile)) !== false) {
            // If we find it, read it into a string
            $content = $zip->getFromIndex($index);
            // Close the zip archive, we don't need it anymore
            $zip->close();
            // Load the XML, substituting entities and, if possible, including
            // other files; swallow errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($content, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Then return the data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // If something went wrong, return an empty string
    return "";
}
php_zip.dll under Windows or the --enable-zip configure switch under Linux. If ZipArchive cannot be used (an old PHP version or missing libraries), the PclZip library, which reads zip files without the corresponding system facilities, may well do the job.

Source: https://habr.com/ru/post/69417/
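
Before handing a file to the extraction functions, two cheap checks can save a confusing failure: whether the file actually starts with the PK signature described at the beginning, and whether the ZipArchive class is available at all. The sketch below is only an illustration; the helper names hasZipSignature and zipArchiveAvailable are hypothetical and are not part of the original code.

```php
<?php
// Hypothetical helper: returns true if the file starts with the two-byte
// "PK" signature that zip archives (and therefore .docx / .odt) begin with.
function hasZipSignature($filename) {
    $handle = @fopen($filename, 'rb');
    if ($handle === false) {
        return false; // unreadable or missing file
    }
    $signature = fread($handle, 2);
    fclose($handle);
    return $signature === 'PK';
}

// Hypothetical helper: returns true if the zip extension is loaded,
// i.e. the ZipArchive class can be used.
function zipArchiveAvailable() {
    return class_exists('ZipArchive');
}

// Demo on two temporary files: one that begins like a zip archive
// and one that is plain text.
$zipLike = tempnam(sys_get_temp_dir(), 'zip');
file_put_contents($zipLike, "PK\x03\x04rest of the archive");
$plain = tempnam(sys_get_temp_dir(), 'txt');
file_put_contents($plain, "just some text");

var_dump(hasZipSignature($zipLike));
var_dump(hasZipSignature($plain));
var_dump(zipArchiveAvailable());

unlink($zipLike);
unlink($plain);
```

The signature check only says that the file looks like a zip archive; it does not guarantee that the archive is intact or that it contains content.xml or word/document.xml, so the empty-string result of the extraction functions still has to be handled.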