
XML parsing tools in PHP

I first needed to parse XML with PHP back in 2005. When I tried to write a simple script that loads an XML file into an array, I ran into a serious problem: PHP had no decent built-in tools or binary libraries for working with XML. As PHP evolved, several different technologies were used to parse XML, and they are discussed below.

First, I’ll give a summary table of compatibility of PHP tools and XML libraries.

image
SAX (Simple API for XML) turned out to be the most compatible: it is supported by the Expat library, which is available in all versions of PHP from 4 onward. Its capabilities and usability, however, left a sharply negative impression: XML cannot be modified, and the code is extremely cumbersome and complex, with many places for potential errors.



DOMXML is a terrible thing: it existed only as an additional experimental extension for PHP 4. It is not included in PHP 5, because PHP 5 ships with a more versatile tool by default, DOM (the standard W3C DOM Level 3). DOM is the best documented (in English, by both PHP and the W3C) and the most complete of the options, but it is not included in PHP 4, because it was finished only by the beginning of 2006. If the choice is between DOM and PHP 4, the answer is definitely DOM, since today PHP 5 is available from any self-respecting hosting provider. Moreover, a developer can still write PHP 4 compatible code, since PHP 4 has a basic DOM that supports some of the basic features of the new one.
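Since availability differs between PHP versions, a script can check at run time which of these tools it has. The following is a minimal sketch, not part of the original article; the function name availableXmlTools is made up for illustration, and it only probes for one well-known function or class per extension.

```php
<?php
// Report which of the XML tools discussed in this article are available.
// The function name is illustrative, not part of PHP.
function availableXmlTools()
{
    return array(
        // Expat-based SAX parser (xml_parser_create and friends)
        'sax'    => function_exists('xml_parser_create'),
        // W3C DOM extension bundled with PHP 5
        'dom'    => class_exists('DOMDocument'),
        // Legacy DOM XML extension from PHP 4
        'domxml' => function_exists('domxml_open_mem'),
    );
}

foreach (availableXmlTools() as $tool => $available) {
    echo $tool, ': ', $available ? 'yes' : 'no', "\n";
}
```

Code that has to run on both PHP 4 and PHP 5 could branch on the result and pick the best parser that is present.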

There are also additional XML-RPC libraries, but they are experimental, which speaks for itself: they are unlikely to be fully tested and debugged before 2009.

At that time (autumn 2007) there was no useful literature on the subject in RuNet, and all developers used SAX (often their own libraries built on top of SAX) or DOMXML. Very few people had heard of DOM, and those who had refused to use it in favor of the older and less standard, but more familiar, DOMXML. As a result, existing web solutions that used XML were poorly implemented and hard to port. Choosing the new, convenient, W3C-approved DOM was the only correct decision. In terms of compatibility and familiarity, the DOM in PHP is the same as the DOM in JavaScript.
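To illustrate what DOM offers that SAX does not, here is a short sketch of modifying a document in place. It is not from the original article: the sample XML is an assumption based on the element names used by the parsing code in this article (links, linksblock, page, link), and the id attribute and the URL are made up.

```php
<?php
// Sample document; the structure is assumed for illustration.
$source = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<links><linksblock id="1"><page>/index.html</page></linksblock></links>';

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($source);

// Find the first <linksblock> and append a new <link> child to it.
$block = $doc->getElementsByTagName('linksblock')->item(0);
$link  = $doc->createElement('link', 'http://example.com/');
$block->appendChild($link);

// Change an attribute in place.
$block->setAttribute('id', '2');

// Serialize the modified tree back to XML.
$result = $doc->saveXML();
echo $result;
```

The method names (getElementsByTagName, createElement, appendChild, setAttribute) are the same ones a JavaScript developer uses in the browser.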

Let us compare the performance of SAX (PHP 4) and DOM (PHP 5). We measure the time it takes to parse the following XML file.

image

SAX parsing algorithm
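// Create the SAX parser that will process the XML document.
$parser = xml_parser_create();
// Register the handlers for XML events:
// - opening and closing element tags
xml_set_element_handler($parser, 'saxStartElement', 'saxEndElement');
// - character data between the tags
xml_set_character_data_handler($parser, 'saxCharacterData');
// Turn off case folding. With case folding on, the parser converts
// element names to upper case, and they would no longer match the
// lower-case names that the handlers compare against.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
// Read the whole XML file into a string.
$xml = join('', file($link_file));

// The parser state is kept in $GLOBALS['sax'], because the handlers
// are plain functions and have no other way to share data.
$GLOBALS['sax']['links'] = array();           // result: the data read from the XML
$GLOBALS['sax']['current_linksblock'] = null; // the block currently being read
$GLOBALS['sax']['page_r'] = 0;                // 1 while inside a page element
$GLOBALS['sax']['page_i'] = -1;               // index of the current page entry
$GLOBALS['sax']['link_r'] = 0;                // 1 while inside a link element
$GLOBALS['sax']['link_i'] = -1;               // index of the current link entry
$GLOBALS['sax']['index'] = null;              // not used by the handlers below

// Parse the document and free the parser on success.
if (xml_parse($parser, $xml, true))
    xml_parser_free($parser);
//else
// xml_parse() returned FALSE, so the document is not well-formed.
// Report the error text and the line number:
// die(sprintf('AOW - XML error: %s at line %d',
// xml_error_string(xml_get_error_code($parser)),
// xml_get_current_line_number($parser)));
// The parsed data is now in $GLOBALS['sax']['links'].
// dbg() is the author's debug-output helper; it is not shown here.
dbg($GLOBALS['sax']['links'], "results");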

//-------------------------------------------------------------------------------------
// Handler functions that the parser calls
// for events in the XML document.
//-------------------------------------------------------------------------------------
// Called at the start of an XML element.
// Parameters:
// - the SAX parser
// - the name of the XML element
// - an array of the element's attributes
function saxStartElement($parser, $name, $attrs) {
    switch ($name) {
        case 'links':
            // The root element links has been opened.
            // Reset the array that collects the data read from the XML.
            $GLOBALS['sax']['links'] = array();
            break;
        case 'linksblock':
            // A new linksblock has been opened.
            // Start a fresh $GLOBALS['sax']['current_linksblock'] and reset the counters.
            $GLOBALS['sax']['current_linksblock'] = array("page" => array(), "link" => array());
            $GLOBALS['sax']['page_r'] = 0;
            $GLOBALS['sax']['link_r'] = 0;
            $GLOBALS['sax']['page_i'] = -1;
            $GLOBALS['sax']['link_i'] = -1;
            // Save the block's attributes, e.g. random.
            if (isset($attrs))
                $GLOBALS['sax']['current_linksblock']['attributes'] = $attrs;
            break;
        case 'page':
            // Entering a page element: set the flag and start a new empty entry.
            $GLOBALS['sax']['page_r'] = 1;
            $GLOBALS['sax']['page_i']++;
            $GLOBALS['sax']['current_linksblock']['page'][$GLOBALS['sax']['page_i']] = "";
            break;
        case 'link':
            // Entering a link element: set the flag and start a new empty entry.
            $GLOBALS['sax']['link_r'] = 1;
            $GLOBALS['sax']['link_i']++;
            $GLOBALS['sax']['current_linksblock']['link'][$GLOBALS['sax']['link_i']] = "";
            break;
    }
}
//-------------------------------------------------------------------------------------
// Called at the end of an XML element.
// Parameters:
// - the SAX parser
// - the name of the XML element
function saxEndElement($parser, $name) {
    if ((is_array($GLOBALS['sax']['current_linksblock'])) && ($name == 'linksblock')) {
        // The block is complete: append $GLOBALS['sax']['current_linksblock']
        // to the result array and clear it.
        $GLOBALS['sax']['links'][] = $GLOBALS['sax']['current_linksblock'];
        $GLOBALS['sax']['current_linksblock'] = null;
    } elseif ($name == 'page') {
        $GLOBALS['sax']['page_r'] = 0;
    } elseif ($name == 'link') {
        $GLOBALS['sax']['link_r'] = 0;
    }
}

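// Called for character data between tags.
// Parameters:
// - the SAX parser
// - a chunk of character data from the XML
function saxCharacterData($parser, $data) {
    // Text is only of interest while a linksblock is being read.
    // The parser may deliver the text of one element in several
    // chunks, so each chunk is appended to the current entry.
    if (is_array($GLOBALS['sax']['current_linksblock'])) {
        // Inside a page element: append to the current page entry.
        if ($GLOBALS['sax']['page_r']) {
            $GLOBALS['sax']['current_linksblock']['page'][$GLOBALS['sax']['page_i']] .= iconv("UTF-8", "windows-1251", $data);
        } elseif ($GLOBALS['sax']['link_r']) {
            // Inside a link element: append to the current link entry.
            $GLOBALS['sax']['current_linksblock']['link'][$GLOBALS['sax']['link_i']] .= iconv("UTF-8", "windows-1251", $data);
        }
    }
}
//-------------------------------------------------------------------------------------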


The disadvantages of this XML parsing method are obvious: cumbersome, unreadable code, and the need to use global variables.
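The global variables, at least, are not forced by SAX itself. In PHP 5 the handlers can be methods of an object, so the parser state can live in its properties. The following is a sketch under that assumption, not the author's code: the class name LinksSaxReader, its property names, and the sample XML are made up, and the iconv conversion from the listing above is left out for brevity.

```php
<?php
// Collects linksblock elements without global variables: the parser
// state lives in the object's properties. All names are illustrative.
class LinksSaxReader
{
    public $links = array();   // finished blocks
    private $block = null;     // block currently being read
    private $field = null;     // 'page' or 'link' while inside that element

    public function parse($xml)
    {
        $parser = xml_parser_create();
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
        // Array callables bind the handlers to this object.
        xml_set_element_handler($parser, array($this, 'start'), array($this, 'end'));
        xml_set_character_data_handler($parser, array($this, 'text'));
        $ok = xml_parse($parser, $xml, true);
        // Before PHP 8 the parser is a resource and is freed explicitly.
        if (PHP_VERSION_ID < 80000) {
            xml_parser_free($parser);
        }
        return (bool) $ok;
    }

    public function start($parser, $name, $attrs)
    {
        if ($name == 'linksblock') {
            $this->block = array('page' => array(), 'link' => array(), 'attributes' => $attrs);
        } elseif ($this->block !== null && ($name == 'page' || $name == 'link')) {
            $this->field = $name;
            $this->block[$name][] = '';   // start a new, empty entry
        }
    }

    public function end($parser, $name)
    {
        if ($name == 'linksblock' && $this->block !== null) {
            $this->links[] = $this->block;
            $this->block = null;
        } elseif ($name === $this->field) {
            $this->field = null;
        }
    }

    public function text($parser, $data)
    {
        // Character data may arrive in several chunks, so append.
        if ($this->block !== null && $this->field !== null) {
            $last = count($this->block[$this->field]) - 1;
            $this->block[$this->field][$last] .= $data;
        }
    }
}

$reader = new LinksSaxReader();
$reader->parse('<links><linksblock id="1"><page>/a</page><link>L1</link><link>L2</link></linksblock></links>');
print_r($reader->links);
```

The code is still event-driven and noticeably longer than the DOM versions, so the other disadvantages remain.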

Here are 2 parsing methods for the same XML file based on the PHP 5 DOM.
Method 1
/* The constructor takes the XML version, e.g. 1.0 */
$xml = new DOMDocument('1.0');
$xml->load($link_file);

$linksblocksa = array();
$i = 0;
// Walk the direct children of the root element.
foreach ($xml->documentElement->childNodes as $XMLlinksblock) {
    // Skip text nodes and anything that is not a linksblock element.
    if ($XMLlinksblock->nodeType == XML_ELEMENT_NODE && $XMLlinksblock->nodeName == "linksblock") {
        $linksblocksa[$i]['attributes'] = array();
        foreach ($XMLlinksblock->attributes as $attr)
            $linksblocksa[$i]['attributes'][$attr->name] = $attr->value;

        // Collect the page and link children of this block.
        foreach ($XMLlinksblock->childNodes as $node) {
            if ($node->nodeType == XML_ELEMENT_NODE && $node->nodeName == "page")
                $linksblocksa[$i]['page'][] = $node->nodeValue;
            elseif ($node->nodeType == XML_ELEMENT_NODE && $node->nodeName == "link")
                $linksblocksa[$i]['link'][] = iconv("UTF-8", $GLOBALS['E_server_encoding'], $node->nodeValue);
        }
        $i++;
    }
}
unset($xml);
dbg($linksblocksa, "linksblocksa");


This method walks the XML document tree positionally: it iterates over child nodes and checks the type and name of each one, without addressing elements by name.

Method 2
/* The constructor takes the XML version, e.g. 1.0 */
$xml = new DOMDocument('1.0');
$xml->load($link_file);

$linksblocksb = array();

$i = 0;
// Select every linksblock element in the document by tag name.
foreach ($xml->getElementsByTagName('linksblock') as $XMLlinksblock) {
    $linksblocksb[$i]['attributes'] = array();
    foreach ($XMLlinksblock->attributes as $attr)
        $linksblocksb[$i]['attributes'][$attr->name] = $attr->value;

    // Select the page elements inside this block.
    foreach ($XMLlinksblock->getElementsByTagName('page') as $page)
        $linksblocksb[$i]['page'][] = $page->nodeValue;

    // Select the link elements inside this block.
    foreach ($XMLlinksblock->getElementsByTagName('link') as $link)
        $linksblocksb[$i]['link'][] = iconv("UTF-8", $GLOBALS['E_server_encoding'], $link->nodeValue);
    $i++;
}
unset($xml);
dbg($linksblocksb, "linksblocksb");


This method addresses elements in the XML document tree by tag name, using getElementsByTagName.
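The two styles are not always interchangeable: childNodes returns only direct children, while getElementsByTagName returns every descendant with the given name. The sketch below shows the difference on a made-up document; the nested group element is an assumption added for illustration and does not appear in the article's XML file.

```php
<?php
// Sample document (structure assumed for illustration): a page element
// appears both directly under linksblock and inside a nested group.
$doc = new DOMDocument('1.0');
$doc->loadXML('<links><linksblock><page>/a</page><group><page>/b</page></group></linksblock></links>');
$block = $doc->documentElement->firstChild;   // the linksblock element

// Method 1 style: direct children only.
$direct = array();
foreach ($block->childNodes as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE && $node->nodeName == 'page') {
        $direct[] = $node->nodeValue;
    }
}

// Method 2 style: every descendant with the given tag name.
$byName = array();
foreach ($block->getElementsByTagName('page') as $page) {
    $byName[] = $page->nodeValue;
}

print_r($direct);   // holds only /a
print_r($byName);   // holds /a and /b
```

For a flat file, where page and link appear only directly under linksblock, both methods return the same data.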
Note that all three algorithms produce identical data arrays:

image

The performance tests were run under the following conditions:
Platform: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+, 1024 MB DDR2.
Web server: Windows NT 5.1 build 2600, Apache/1.3.33 (Win32), PHP/5.1.6.

image
image
image

The performance graphs support the following conclusion: SAX is the most stable, and its performance depends neither on its position in the program nor on the server load.
Now consider the root-mean-square (RMS) performance for each group of tests.

image

1 - SAX (Prod 1)
2 - DOM 1 (Prod 2)
3 - DOM 2 (Prod 3)
Make is the build mode; Run 10 times is the load mode.

1) Make 2-3-1 (order)
2) Run 10 times 2-3-1 (order)
3) Make 3-2-1 (order)
4) Run 10 times 3-2-1 (order)
5) Make 1-2-3 (order)
6) Run 10 times 1-2-3 (order)
7) Make 1-3-2 (order)
8) Run 10 times 1-3-2 (order)

At this stage of the analysis, the most important task is to identify the fastest DOM-based parsing method. SAX is not considered further, because its lag and its shortcomings are obvious.
Recall that method 1 walks the XML document tree positionally, node by node, and is less readable than method 2, which addresses elements in the tree by tag name.
The results that matter most to us are those obtained under load, that is, the even-numbered tests:

image

In tests 2 and 6, method 1 runs before method 2; in tests 4 and 8, method 2 runs before method 1.

The graph shows that method 2, for all its convenience, reaches the highest performance only when XML is parsed many times within the program.

Method 1 is less concise and has a lower peak performance than method 2, but it is more stable when parsing happens at a single point in a PHP script.

Thus, the move to the PHP 5 DOM is fully justified regardless of the parsing method, both in terms of code convenience and in terms of performance, all the more so given that PHP 4 is hardly used any more.

All the tests were improvised. Their main purpose was to show the difference between the parsers rather than to measure the absolute performance of any one of them. With properly tuned caching, the results may well differ.
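The article does not show the code used for timing. As a rough sketch of how such a measurement could be made, the following runs a parsing function several times, records the wall-clock time of each run with microtime, and reports the root mean square. The function names timeRuns, rms, and parseSample, the sample XML, and the run count are all assumptions for illustration.

```php
<?php
// Run $callback $runs times and return the wall-clock time of each run in seconds.
function timeRuns($callback, $runs)
{
    $times = array();
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        call_user_func($callback);
        $times[] = microtime(true) - $start;
    }
    return $times;
}

// Root mean square of a list of numbers.
function rms($values)
{
    $sum = 0.0;
    foreach ($values as $v) {
        $sum += $v * $v;
    }
    return sqrt($sum / count($values));
}

// Hypothetical workload: parse a small in-memory document with DOM.
function parseSample()
{
    $doc = new DOMDocument('1.0');
    $doc->loadXML('<links><linksblock><page>/a</page></linksblock></links>');
    return $doc->getElementsByTagName('page')->length;
}

$times = timeRuns('parseSample', 10);
printf("RMS over %d runs: %.6f s\n", count($times), rms($times));
```

Swapping parseSample for each of the three parsers, and changing the order in which they run, would reproduce the kind of comparison described above.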

A useful article on XML support in PHP 5: habrahabr.ru/blogs/php/31189

Source: https://habr.com/ru/post/50668/

