
Methods of working with "heavy" XML


At work I was asked to investigate the best way to parse a bulky XML file (more than 100 MB). I would like to share the results with the community.

Consider the basic methods of working with XML:
1. SimpleXML
2. DOM
3. xml_parser (SAX)
4. XMLReader

SimpleXML


Cons: very slow; it loads the entire file into memory, and the tree is built as a separate array.
Pros: simple to use and works out of the box (it requires the libxml library, which is installed on almost all servers).

SimpleXML example

<?php
// Load the whole document into memory and walk the <price> nodes.
$xml = simplexml_load_file("price.xml");
echo "<table border='1'>\n";
foreach ($xml->xpath('/DocumentElement/price') as $product) {
?>
<tr>
    <td><?php echo $product->name; ?></td>
    <td><?php echo $product->company; ?></td>
    <td><?php echo $product->city; ?></td>
    <td><?php echo $product->amount; ?></td>
</tr>
<?php
}
echo "</table>\n";


DOM


Cons: very slow and, like SimpleXML, it loads the entire file into memory.
Pros: the output is an ordinary DOM, which is very easy to work with.

DOM usage example

<?php
// Load the whole document, then read three child elements of every <book>.
$doc = new DOMDocument();
$doc->load('books.xml');
$books = $doc->getElementsByTagName("book");
foreach ($books as $book) {
    $authors = $book->getElementsByTagName("author");
    $author = $authors->item(0)->nodeValue;
    $publishers = $book->getElementsByTagName("publisher");
    $publisher = $publishers->item(0)->nodeValue;
    $titles = $book->getElementsByTagName("title");
    $title = $titles->item(0)->nodeValue;
    echo "$title - $author - $publisher\n";
}


xml_parser and XMLReader


Neither of the first two methods suits us, because both work with the whole file at once: our files are 20-30 MB, and while processing them some blocks build up chains (arrays) of more than 100 MB.

Both of these methods read the file as a stream, piece by piece, which is ideal for the task.

The difference between xml_parser and XMLReader is that with the former you have to write your own functions that respond to the start and end of each tag.

Simply put, xml_parser works on two triggers: tag opened and tag closed. It does not care what happens in between, what data is used, and so on. To use it, you register the two triggers by pointing them at your handler functions.

xml_parser example

<?php
class Simple_Parser
{
    public $parser;
    public $error_code;
    public $error_string;
    public $current_line;
    public $current_column;
    public $data = array();   // node currently being filled
    public $datas = array();  // stack of parent nodes

    public function parse($data)
    {
        $this->parser = xml_parser_create('UTF-8');
        xml_set_object($this->parser, $this);
        xml_parser_set_option($this->parser, XML_OPTION_SKIP_WHITE, 1);
        // Register the two triggers and the character data handler.
        xml_set_element_handler($this->parser, 'tag_open', 'tag_close');
        xml_set_character_data_handler($this->parser, 'cdata');
        if (!xml_parse($this->parser, $data)) {
            // On failure, drop the partial tree and record where parsing stopped.
            $this->data = array();
            $this->error_code = xml_get_error_code($this->parser);
            $this->error_string = xml_error_string($this->error_code);
            $this->current_line = xml_get_current_line_number($this->parser);
            $this->current_column = xml_get_current_column_number($this->parser);
        } else {
            $this->data = $this->data['child'];
        }
        xml_parser_free($this->parser);
    }

    // Trigger 1: a tag was opened. Add a child node and descend into it.
    public function tag_open($parser, $tag, $attribs)
    {
        $this->data['child'][$tag][] = array('data' => '', 'attribs' => $attribs, 'child' => array());
        $this->datas[] =& $this->data;
        $this->data =& $this->data['child'][$tag][count($this->data['child'][$tag]) - 1];
    }

    // Text between tags is appended to the current node.
    public function cdata($parser, $cdata)
    {
        $this->data['data'] .= $cdata;
    }

    // Trigger 2: a tag was closed. Go back up to the parent node.
    public function tag_close($parser, $tag)
    {
        $this->data =& $this->datas[count($this->datas) - 1];
        array_pop($this->datas);
    }
}

$xml_parser = new Simple_Parser;
$xml_parser->parse('<foo><bar>test</bar></foo>');
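The example above parses a short string that is already in memory. For a large file, the same two triggers can be fed chunk by chunk, so that only one chunk is held at a time. What follows is a minimal sketch of that idea, not the code used in the study: the function name count_tags_streaming, the chunk size and the choice to simply count tags are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: counts elements by tag name while streaming a file
// through xml_parse() in fixed-size chunks.
function count_tags_streaming(string $path, int $chunkSize = 8192): array
{
    $counts = [];
    $parser = xml_parser_create('UTF-8');
    // Keep tag names as written instead of upper-casing them (the default).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // "Tag opened" trigger: count the tag.
        function ($parser, $tag, $attribs) use (&$counts) {
            $counts[$tag] = ($counts[$tag] ?? 0) + 1;
        },
        // "Tag closed" trigger: nothing to do in this sketch.
        function ($parser, $tag) {
        }
    );

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            // The third argument tells the parser whether this is the last chunk.
            if (!xml_parse($parser, $chunk, feof($handle))) {
                throw new RuntimeException(sprintf(
                    'XML error "%s" at line %d',
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)
                ));
            }
        }
    } finally {
        fclose($handle);
    }
    return $counts;
}
```

Calling count_tags_streaming('example.xml') would return an array such as tag name => number of occurrences. Memory use stays roughly flat here because the handlers keep only counters, not the parsed tree.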


With XMLReader everything is simpler. First, it is a class. All the triggers are already defined as constants (there are 17 of them), and reading is done with the read() method, which advances to the next node in the document. At each step the reader object exposes the node type (the equivalent of a trigger), the tag name, and its value. XMLReader also works well with tag attributes.

An example of using XMLReader

<?php
class StoreXMLReader
{
    private $reader;
    private $tag;

    // If $ignoreDepth == 1, only the first level is parsed; otherwise the 2nd level is parsed too.
    private function parseBlock($name, $ignoreDepth = 1)
    {
        if ($this->reader->name == $name && $this->reader->nodeType == XMLReader::ELEMENT) {
            $result = array();
            while (!($this->reader->name == $name && $this->reader->nodeType == XMLReader::END_ELEMENT)) {
                //echo $this->reader->name. ' - '.$this->reader->nodeType." - ".$this->reader->depth."\n";
                switch ($this->reader->nodeType) {
                    case XMLReader::ELEMENT:
                        if ($this->reader->depth > 3 && !$ignoreDepth) {
                            // Nested block: collect its sub-blocks recursively.
                            $result[$nodeName] = (isset($result[$nodeName]) ? $result[$nodeName] : array());
                            while (!($this->reader->name == $nodeName && $this->reader->nodeType == XMLReader::END_ELEMENT)) {
                                $resultSubBlock = $this->parseBlock($this->reader->name, 1);
                                if (!empty($resultSubBlock)) {
                                    $result[$nodeName][] = $resultSubBlock;
                                }
                                unset($resultSubBlock);
                                $this->reader->read();
                            }
                        }
                        $nodeName = $this->reader->name;
                        if ($this->reader->hasAttributes) {
                            $attributeCount = $this->reader->attributeCount;
                            for ($i = 0; $i < $attributeCount; $i++) {
                                $this->reader->moveToAttributeNo($i);
                                $result['attr'][$this->reader->name] = $this->reader->value;
                            }
                            $this->reader->moveToElement();
                        }
                        break;
                    case XMLReader::TEXT:
                    case XMLReader::CDATA:
                        $result[$nodeName] = $this->reader->value;
                        $this->reader->read();
                        break;
                }
                $this->reader->read();
            }
            return $result;
        }
    }

    public function parse($filename)
    {
        if (!$filename) {
            return array();
        }
        $this->reader = new XMLReader();
        $this->reader->open($filename);
        // begin reading the XML
        while ($this->reader->read()) {
            if ($this->reader->name == 'store_categories') {
                // read blocks until the closing tag is found
                while (!($this->reader->name == 'store_categories' && $this->reader->nodeType == XMLReader::END_ELEMENT)) {
                    $store_category = $this->parseBlock('store_category');
                    /* Do some code */
                    $this->reader->read();
                }
                $this->reader->read();
            }
        } // while
    } // func
}

$xmlr = new StoreXMLReader();
$r = $xmlr->parse('example.xml');
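For comparison, the read() loop can also be reduced to a few lines when the records are flat. The sketch below is an illustration under assumptions, not part of the study: the function name read_records is hypothetical, and it assumes each record element contains only simple child elements with text (as in the generated example.xml). It yields one record at a time, so the full list never has to sit in memory.

```php
<?php
// Hypothetical helper: yields one array per <$recordTag> element, keyed by
// child element name. Attributes of the record are stored under '@name'.
function read_records(string $path, string $recordTag): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    $current = null; // record being filled, or null when outside a record
    $field = null;   // name of the child element currently being read
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                if ($reader->name === $recordTag) {
                    $current = [];
                    while ($reader->moveToNextAttribute()) {
                        $current['@' . $reader->name] = $reader->value;
                    }
                    $reader->moveToElement();
                    // A self-closing record produces no END_ELEMENT node.
                    if ($reader->isEmptyElement) {
                        yield $current;
                        $current = null;
                    }
                } elseif ($current !== null) {
                    $field = $reader->name;
                }
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                if ($current !== null && $field !== null) {
                    $current[$field] = ($current[$field] ?? '') . $reader->value;
                }
                break;
            case XMLReader::END_ELEMENT:
                if ($current !== null && $reader->name === $recordTag) {
                    yield $current;
                    $current = null;
                }
                $field = null;
                break;
        }
    }
    $reader->close();
}
```

With the generated file, a loop such as foreach (read_records('example.xml', 'product') as $product) { ... } would receive each product as a small array with the keys id and name.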


Performance test


Code for generating example.xml

<?php
// Start from an empty file, because the flushes below append to it.
file_put_contents('example.xml', '');

$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('shop');
for ($i = 0; $i <= 1000000; ++$i) {
    $productId = uniqid();
    $xmlWriter->startElement('product');
    $xmlWriter->writeElement('id', $productId);
    $xmlWriter->writeElement('name', 'Some product name. ID:' . $productId);
    $xmlWriter->endElement();
    // Flush XML in memory to file every 1000 iterations
    if (0 == $i % 1000) {
        file_put_contents('example.xml', $xmlWriter->flush(true), FILE_APPEND);
    }
}
$xmlWriter->endElement();
// Final flush to make sure we haven't missed anything
file_put_contents('example.xml', $xmlWriter->flush(true), FILE_APPEND);


Test results (reading without parsing data)

Characteristics of the test environment
Ubuntu 16.04.1 LTS
PHP 7.0.15
Intel® Core™ i5-3550 CPU @ 3.30GHz, 16 GB RAM, 256 GB SSD

Method      Runtime (19 MB)   Runtime (190 MB)
SimpleXML   0.46 s            4.56 s
DOM         0.52 s            4.09 s
xml_parse   0.22 s            2.25 s
XMLReader   0.26 s            2.18 s
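The benchmark script itself is not included in the article. A measurement of this kind ("reading without parsing data") could be set up roughly as in the sketch below; the function names time_task and benchmark, the chunk size, and the use of microtime() are assumptions for illustration, not the author's actual harness, so the numbers it produces will not necessarily match the table.

```php
<?php
// Hypothetical harness: returns the wall-clock seconds a callable takes.
function time_task(callable $task): float
{
    $start = microtime(true);
    $task();
    return microtime(true) - $start;
}

// Reads $file once with each of the four methods, without processing the data.
function benchmark(string $file): array
{
    $methods = [
        'SimpleXML' => function () use ($file) {
            simplexml_load_file($file);
        },
        'DOM' => function () use ($file) {
            $doc = new DOMDocument();
            $doc->load($file);
        },
        'xml_parse' => function () use ($file) {
            $parser = xml_parser_create('UTF-8');
            $handle = fopen($file, 'rb');
            while (!feof($handle)) {
                // Feed the parser chunk by chunk; no handlers are registered.
                xml_parse($parser, fread($handle, 65536), feof($handle));
            }
            fclose($handle);
        },
        'XMLReader' => function () use ($file) {
            $reader = new XMLReader();
            $reader->open($file);
            while ($reader->read()) {
                // Visit every node and do nothing with it.
            }
            $reader->close();
        },
    ];

    $results = [];
    foreach ($methods as $name => $task) {
        $results[$name] = time_task($task);
    }
    return $results;
}
```

Running benchmark('example.xml') returns the elapsed seconds per method. To compare memory use as well, it would be safer to run each method in a separate process, since PHP's peak memory counter covers the whole script.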

P.S. Tips and comments are welcome. Please don't judge too harshly.

Source: https://habr.com/ru/post/330240/

