We disassemble xlsx in PHP without ready libraries

First of all, I will describe the problem that made us return to the question sucked in from all sides for the thousandth time: stupid managers - without consulting programmers - promised the customer to upload data to the site from xls (x).

Everything would be fine, but the customer’s hoster gives 64mb of memory for execution and he didn’t want to say that the Excel client has 10-15mb of files without formatting at all, that when downloading it, PHPExel eats (on the test server) something around 500mb of memory.
The solution under the cut hardly reached 5mb.

Preconditions:
1. There is an Excel document sheet so in 10-20 with data about the goods in the online catalog. Each sheet has a cap - “name”, “price”, etc. + extra charge characteristics in 40 columns - and the actual data in the amount of "y-Excel-centimeter-scroller";
2. no CSV can be used. All the data from the customer is already in Exel and he is not going to re-save them ... they promised here and that's all;
3. Spreadsheet_Excel_Writer is folded due to non-universality, although many good things have been written about it. I am waiting for comments on memory tests;
4. surprisingly, Google did not offer universal solutions. Really nobody faced such problem on PHP * nix, I was surprised.

Decision:
After going through the different ways politely provided by Google, they decided to read the specifications (ehh, my father taught me ...). Seeing there keywords based on Open XML and using ZIP compression quickly called the customer and transferred the conversation to the mainstream xlsx and only: “Well, you understand! 21st century after all! Why do we grab at the old! You need to stand with one foot in the future! ”
')
Further, the algorithm is as follows: take the file, unpack it and carefully look at the result.
A complete inventory will have to be spent at leisure, but now we are most interested in the contents of the [xl] directory, specifically - / xl / worksheets / and the file /xl/sharedStrings.xml.
In the file /xl/workbook.xml is the description of the sheets, but since The task of collecting the names of the sheets did not stand, I will skip this point. If necessary, to understand it is not difficult.

/xl/sharedStrings.xml

... <si> <t></t> </si> <si> <t></t> </si> <si> <t></t> </si> <si> <t>URL</t> </si> <si> <t>!</t> </si> <si> <t>!</t> </si> <si> <t>~1, 220-240 , 50 </t> </si> ...

and so on in the same vein. Represents textual data in the cells of the original document. From all sheets! For now, just collect this data into an array.

  $xml = simplexml_load_file(PATH . '/upload/xls_data/xl/sharedStrings.xml'); $sharedStringsArr = array(); foreach ($xml->children() as $item) { $sharedStringsArr[] = (string)$item->t; }

/ xl / worksheets /
This is a directory with files of type "sheet1.xml" with a description of these sheets. Specifically, in each file we are interested in the content and its children <row ...>.

 ... <sheetData> ... <row r="1" spans="1:43" ht="48.75" customHeight="1" x14ac:dyDescent="0.2"> <cr="A1" s="1" t="s"> <v>0</v> </c> <cr="B1" s="1" t="s"> <v>1</v> </c> <cr="C1" s="2" t="s"> <v>2</v> </c> <cr="E2" s="12"> <v>2</v> </c> <cr="F2" s="12"/> .... </row> <row r="2" spans="1:43" ht="13.5" customHeight="1" x14ac:dyDescent="0.2"> ... </sheetData> ...

Using the method of comparisons and experiments, it was found out that the [t = "s"] attribute of the cell (judging by all type = string) is an indication that we take the value from the file sharedStrings.xml. Pointer - value - the element number from $ sharedStringsArr. If not specified, take the value itself for the value of the cell.

We collect:

  $handle = @opendir(PATH . '/upload/xls_data/xl/worksheets'); $out = array(); while ($file = @readdir($handle)) { //      /xl/worksheets/ if ($file != "." && $file != ".." && $file != '_rels') { $xml = simplexml_load_file(PATH . '/upload/xls_data/xl/worksheets/' . $file); //   $row = 0; foreach ($xml->sheetData->row as $item) { $out[$file][$row] = array(); //    $cell = 0; foreach ($item as $child) { $attr = $child->attributes(); $value = isset($child->v)? (string)$child->v:false; $out[$file][$row][$cell] = isset($attr['t']) ? $sharedStringsArr[$value] : $value; $cell++; } $row++; } } } var_dump($out);

At the output we get a multidimensional array, with which you can already work freely, and you can immediately pour data into the database - this is a personal matter for everyone.

Finally, I will say that I did not really understand the xlsx specification, but only completed the task with specific xlsx documents. Somewhere after all, formulas and images should be written (t = "i"?). When I face such a task, I will certainly describe it, but for now I present an undemanding system algorithm for collecting text data from xslx. I hope it will be in demand, because in search of this I have not met.

PS Just placing labels came across Working with large Excel files . Habrit was necessary, but not google - I would save a lot of time.

UPD:
It just turned out that the empty cell can be represented by the absence of the <v> parameter in <c>, or by the absence of the <c> itself. It is necessary to verify the attribute “r”.

  <cr="A1" s="1" t="s"/> <cr="B1" s="1" t="s"> <v>1</v> </c> <!--   1--> <cr="D1" s="2" t="s"> <v>2</v> </c> <cr="E1" s="12"/>

Correct as possible.

Source: https://habr.com/ru/post/140352/

All Articles

We disassemble xlsx in PHP without ready libraries

More articles: