
Parsing MS Office files

Recently I was given a task: extract some information from MS Office files (.xls, .doc) for further processing. In practice, this meant pulling out the text contained in each document.



For .xls, I quickly found the PhpExcelReader project, and there is not much more to say: look at its code and search the web. I can only offer a few lines of code to help:
$reader = new Spreadsheet_Excel_Reader();
$reader->setUTFEncoder('iconv');
$reader->setOutputEncoding('UTF-8');
$reader->read($this->filename);

$text = "";

// Only the first sheet of the workbook is read here.
if ($reader->sheets && count($reader->sheets))
{
    $sheet = $reader->sheets[0];

    if (isset($sheet['cells']))
    {
        // Each row is an array of cell values; join them with spaces.
        foreach ($sheet['cells'] as $row)
        {
            $text .= implode(' ', $row) . "\n";
        }
    }
}
echo $text;
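
The snippet above reads only the first sheet. If text from every sheet is needed, the same loop can be wrapped in a small function. This is a minimal sketch, assuming that each entry of $reader->sheets has the same shape as the first one (an optional 'cells' array of rows); the name sheetsToText is hypothetical and not part of PhpExcelReader:

```php
<?php
/**
 * Flatten every sheet of a parsed workbook into plain text.
 * Hypothetical helper, not part of PhpExcelReader.
 *
 * @param array $sheets List of sheets shaped like $reader->sheets:
 *                      each sheet may contain a 'cells' array of rows.
 */
function sheetsToText(array $sheets): string
{
    $text = '';
    foreach ($sheets as $sheet) {
        if (empty($sheet['cells'])) {
            continue; // sheet without cells: nothing to collect
        }
        foreach ($sheet['cells'] as $row) {
            // One output line per row, cells separated by spaces.
            $text .= implode(' ', $row) . "\n";
        }
    }
    return $text;
}
```

With the reader from the snippet above, the call would be echo sheetsToText($reader->sheets);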




With .doc files it turned out to be somewhat harder at first: I could not find a free PHP parser that does not rely on COM. (I did not find a paid one either, although I was looking for a free one anyway. By the way, if Habr readers know of such a project, you are welcome to comment.)

I was about to give up when I decided to look at a .doc file with the console utility less. less complained that "catdoc is not installed". I took heart, typed sudo apt-get install catdoc, and voilà: I had a console viewer for Word documents in my hands. After that, all that remained was to write:
/**
 * @note The catdoc program must be installed and reside within $PATH!
 */
echo shell_exec('catdoc ' . escapeshellarg($this->filename));
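
Since the one-liner above silently prints nothing when catdoc is missing, it may be worth checking for the program first. The following is a rough sketch, assuming a POSIX shell (for command -v) and Unix-style quoting from escapeshellarg; the function names catdocCommand and docToText are hypothetical:

```php
<?php
/**
 * Build the shell command that runs catdoc on one file (hypothetical helper).
 */
function catdocCommand(string $filename): string
{
    return 'catdoc ' . escapeshellarg($filename);
}

/**
 * Return the text of a .doc file, or null when catdoc is not found,
 * the file is not readable, or the command yields no output.
 */
function docToText(string $filename): ?string
{
    // `command -v` prints the program's path, or nothing if it is absent.
    $path = shell_exec('command -v catdoc');
    if (!is_string($path) || trim($path) === '') {
        return null;
    }
    if (!is_readable($filename)) {
        return null;
    }
    $output = shell_exec(catdocCommand($filename));
    return is_string($output) ? $output : null;
}
```

A caller can then tell a failed extraction apart from a real result by checking whether docToText($this->filename) returned null.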



Source: https://habr.com/ru/post/45375/

