
Writing your own XML parser

Background


I decided to run a small service on the hosting I had been given, and it turned out that there was not a single convenient XML parser there: neither SimpleXML nor DOM XML, only libxml and xml-rpc. Without thinking twice, I decided to write my own. I only needed to parse simple RSS feeds, so an xml => array class was enough. [1]

But that was clearly not enough for an interesting article, so now we will write our own replacement for SimpleXML and, along the way, go over many interesting features of PHP 5.

Problem statement


Elements will be accessed as properties of the object, for example $xml->element, and element attributes as array entries, i.e. $xml->element['attr']. We will also implement checking whether an attribute exists with isset() and iterating over elements with foreach. So, let's begin.
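For comparison, this is roughly what the same access pattern looks like in SimpleXML itself, which is the behaviour we are trying to reproduce. The sample document and its element names are made up for the illustration, and the snippet assumes the SimpleXML extension is available.

```php
<?php
// SimpleXML is used here only to show the access pattern being reproduced.
$xml = simplexml_load_string(
    '<feed><item id="1">first</item><item id="2">second</item></feed>'
);

echo $xml->item[0], "\n";               // element as a property
echo $xml->item[0]['id'], "\n";         // attribute as an array key
var_dump(isset($xml->item[0]['id']));   // attribute existence check

foreach ($xml->item as $item) {         // iterate over same-named elements
    echo $item['id'], ': ', $item, "\n";
}
```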


A little bit of magic?


PHP 5 defines a number of 'magic' methods for classes. Their names begin with a double underscore '__', and they are called automatically when a certain action occurs. [2] In the class below we use __construct(), __toString() and __get().
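A minimal sketch of when these methods fire. The Box class and its data are hypothetical and exist only for this illustration; they are not part of the parser.

```php
<?php
// Hypothetical demo class, not part of the parser.
class Box
{
    private $items = array();
    private $label;

    // __construct runs when the object is created with "new".
    public function __construct($label, array $items)
    {
        $this->label = $label;
        $this->items = $items;
    }

    // __get runs when code reads an inaccessible or undefined property.
    public function __get($name)
    {
        if (isset($this->items[$name])) {
            return $this->items[$name];
        }
        throw new Exception("No item named [$name]");
    }

    // __toString runs when the object is used where a string is expected.
    public function __toString()
    {
        return (string) $this->label;
    }
}

$box = new Box('fruit', array('apple' => 3));
echo $box->apple, "\n";   // goes through __get('apple')
echo "Box: $box\n";       // goes through __toString()
```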


SPL


The Standard PHP Library (SPL), like the STL from the world of C++, was created to give the developer tools for solving typical problems. [3]
We will need to implement the following interfaces: ArrayAccess, IteratorAggregate and Countable.
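As a rough sketch of what each interface buys us, the hypothetical DemoNode class below wires array access to a list of attributes, and count() and foreach to a list of children. The class and its data are assumptions made for the example.

```php
<?php
// DemoNode is a hypothetical class used only for this illustration.
class DemoNode implements ArrayAccess, IteratorAggregate, Countable
{
    private $attributes;
    private $children;

    public function __construct(array $attributes, array $children)
    {
        $this->attributes = $attributes;
        $this->children = $children;
    }

    // ArrayAccess: $node['id'], isset($node['id']), assignment and unset().
    // The attribute lines silence PHP 8.1+ notices about missing return
    // types; older versions read them as ordinary comments.
    #[\ReturnTypeWillChange]
    public function offsetExists($offset)
    {
        return isset($this->attributes[$offset]);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        return isset($this->attributes[$offset]) ? $this->attributes[$offset] : null;
    }

    #[\ReturnTypeWillChange]
    public function offsetSet($offset, $value)
    {
        $this->attributes[$offset] = $value;
    }

    #[\ReturnTypeWillChange]
    public function offsetUnset($offset)
    {
        unset($this->attributes[$offset]);
    }

    // Countable: count($node).
    #[\ReturnTypeWillChange]
    public function count()
    {
        return count($this->children);
    }

    // IteratorAggregate: foreach ($node as $child).
    #[\ReturnTypeWillChange]
    public function getIterator()
    {
        return new ArrayIterator($this->children);
    }
}

$node = new DemoNode(array('id' => '7'), array('first', 'second'));
echo $node['id'], "\n";
echo count($node), "\n";
foreach ($node as $child) {
    echo $child, "\n";
}
```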


XML and expat


These are the standard libraries for working with XML and building XML parsers [4], which is exactly what we need to solve our problem. For the sake of interest, you could also parse the XML file by hand, for example with regular expressions.
The expat functions we are most interested in are xml_parser_create(), xml_set_object(), xml_parser_set_option(), xml_set_element_handler(), xml_set_character_data_handler() and xml_parse().

Note: a callback in PHP is either the name of a function passed as a string, or an array of two values: the first is an object or the name of a class, and the second is the name of a method of that class.
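To see the event-driven style of these functions in isolation, here is a small sketch that only logs the order in which the handlers are called. The EventLogger class and the sample document are hypothetical; the callbacks are passed in the array form described in the note, and the snippet assumes the xml extension is loaded.

```php
<?php
// Collects one line per parser event so the call order is visible.
class EventLogger
{
    public $events = array();

    public function open($parser, $tag, $attributes)
    {
        $this->events[] = "open $tag";
    }

    public function close($parser, $tag)
    {
        $this->events[] = "close $tag";
    }

    public function text($parser, $text)
    {
        // Skip whitespace-only chunks between tags.
        if (trim($text) !== '') {
            $this->events[] = "text $text";
        }
    }
}

$logger = new EventLogger();
$parser = xml_parser_create();
// Keep tag names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
// Callbacks given as array(object, method name).
xml_set_element_handler($parser, array($logger, 'open'), array($logger, 'close'));
xml_set_character_data_handler($parser, array($logger, 'text'));
// The third argument marks this chunk as the final one.
xml_parse($parser, '<feed><item id="1">Hello</item></feed>', true);

print_r($logger->events);
```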

Pointers


Pointers in PHP, which the manual calls references, do not quite work as they do in C or C++. [5] The construction $a = &$b only means that $a now refers to the same data as $b. It is impossible to change what $b refers to through $a; in other words, the indirection is only one level deep.
PHP passes variables to functions without copying them, and memory for a copy is allocated only when the value is changed (copy-on-write). Starting with version 5, objects are also passed as handles, so a function works with the same object rather than a copy. In our case this is useful for pointing to the parent element.
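The difference can be sketched in a few lines. The Item class and the two functions are hypothetical names introduced only for this example.

```php
<?php
$b = 1;
$a = &$b;   // $a and $b are now two names for the same value
$a = 2;     // so $b changes as well

$c = 10;
$a = &$c;   // rebinding $a leaves $b untouched
$a = 11;    // this changes $c, not $b

// Hypothetical class used to show how objects are passed.
class Item
{
    public $name = 'old';
}

// The function receives a handle to the same object, so the change is visible outside.
function renameItem(Item $item)
{
    $item->name = 'new';
}

// Reassigning the parameter only changes the local variable.
function replaceItem(Item $item)
{
    $item = new Item();
    $item->name = 'other';
}

$item = new Item();
renameItem($item);
replaceItem($item);

echo $b, ' ', $c, ' ', $item->name, "\n";
```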

Coding


With the theory out of the way, we can start writing the parser itself.
Each object will represent a single XML element, so it needs properties such as the tag name, attributes, character data, a reference to the parent and an array of children. In addition, we need a variable that points to the current element. As for methods, we need to implement all the interfaces, add a child, set the reference to the parent, assign the contents of the element, and provide the three functions required by the parser: open a tag, close a tag and receive the contents of an element.
Let's sketch the future class:
class XML implements ArrayAccess, IteratorAggregate, Countable {
    private $pointer;
    private $tagName;
    private $attributes = array();
    private $cdata;
    private $parent;
    private $childs = array();

    public function __construct($data) {}
    public function __toString() { return; }
    public function __get($name) { return; }

    public function offsetGet($offset) { return; }
    public function offsetExists($offset) { return; }
    public function offsetSet($offset, $value) { return; }
    public function offsetUnset($offset) { return; }

    public function count() { return; }
    public function getIterator() { return; }

    public function appendChild($tag, $attributes) { return; }
    public function setParent(XML $parent) {}
    public function getParent() { return; }
    public function setCData($cdata) {}

    private function parse($data) {}
    private function tag_open($parser, $tag, $attributes) {}
    private function cdata($parser, $cdata) {}
    private function tag_close($parser, $tag) {}
}

Now let's get down to implementing the functions, in order, starting with the constructor. In our case it can take two kinds of values: a string (the XML) or an array of two elements (element name, attributes). Since PHP has no overloading of a method with different parameters, we have to check the type manually.
public function __construct($data) {
    if (is_array($data)) {
        list($this->tagName, $this->attributes) = $data;
    } else if (is_string($data))
        $this->parse($data);
}

With the __toString() magic method, the user can get an element's data as a string and then convert it to whatever type is needed. Unfortunately, there is no way to return the desired type directly, so this is the only option.
While we are at it, let's look at the next magic method, __get($name), which gives access to the children of the current element. It is logical that if there is only one child with a given name, it is returned directly, without having to go through index 0 of the array. For example: $xml->rss->channel->item[5]->url instead of $xml->rss[0]->channel[0]->item[5]->url[0], provided that the rss, channel and url elements each occur only once at their nesting level.
public function __toString() {
    // __toString() must return a string, even for an element with no data.
    return (string) $this->cdata;
}

public function __get($name) {
    if (isset($this->childs[$name])) {
        if (count($this->childs[$name]) == 1)
            return $this->childs[$name][0];
        else
            return $this->childs[$name];
    }
    throw new Exception("UFO steals [$name]!");
}


The offsetGet, offsetExists, offsetSet and offsetUnset functions implement the ArrayAccess interface, which lets an object be accessed as an array. We use it to access element attributes. offsetSet and offsetUnset remain stubs for now.
public function offsetGet($offset) {
    if (isset($this->attributes[$offset]))
        return $this->attributes[$offset];
    throw new Exception("Holy cow! There is no [$offset] attribute!");
}

public function offsetExists($offset) {
    return isset($this->attributes[$offset]);
}


And now we run into a problem caused by the earlier decision to return a single child directly. If we start a foreach loop over a single element, it will run over the XML object itself! So we have to give up the simple way of iterating over element attributes with foreach and implement a getAttributes() method for that instead. The iterator and the element count will refer to the array of same-named elements that the current element belongs to, and if the element has no parent, to an array containing just the current element. This is how the IteratorAggregate and Countable interfaces are implemented.
public function count() {
    if ($this->parent != null)
        return count($this->parent->childs[$this->tagName]);
    return 1;
}

public function getIterator() {
    if ($this->parent != null)
        return new ArrayIterator($this->parent->childs[$this->tagName]);
    return new ArrayIterator(array($this));
}


Adding a child is a simple function. The only interesting thing about it is that, after adding the element, it returns a reference to it.
public function appendChild($tag, $attributes) {
    $element = new XML(array($tag, $attributes));
    $element->setParent($this);
    $this->childs[$tag][] = $element;
    return $element;
}

Now we implement the parser itself. To build the tree structure we use a pointer to the current element. At the start it is set to the object itself. When a tag is opened, it moves to the newly opened element, so that everything contained in it is added to its children. When a tag is closed, it moves back to the parent element.
private function parse($data) {
    // Objects are handles in PHP 5, so a plain assignment already shares
    // the object; no reference operator is needed.
    $this->pointer = $this;
    $parser = xml_parser_create();
    xml_set_object($parser, $this);
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
    xml_set_element_handler($parser, "tag_open", "tag_close");
    xml_set_character_data_handler($parser, "cdata");
    xml_parse($parser, $data);
}

private function tag_open($parser, $tag, $attributes) {
    // Assigning a function result by reference raises a notice,
    // so the new child is assigned by value (still the same object).
    $this->pointer = $this->pointer->appendChild($tag, $attributes);
}

private function cdata($parser, $cdata) {
    $this->pointer->setCData($cdata);
}

private function tag_close($parser, $tag) {
    $this->pointer = $this->pointer->getParent();
}


That's all. The parser is ready to go. In order not to inflate the article even more, I uploaded the entire source code with comments, along with a usage example, to Google Docs. [6]
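For readers who just want to see the pieces working together, here is a condensed sketch of the tree-building part. It is not the listing from [6]: the class name MiniXml is made up, the bodies of setParent(), getParent() and setCData() are my assumptions (they are not shown in the article), the handlers are public and passed as array callbacks, and the ArrayAccess, Countable and IteratorAggregate parts are left out for brevity.

```php
<?php
// Condensed sketch. MiniXml and the bodies of setParent(), getParent()
// and setCData() are assumptions, not the author's published code.
class MiniXml
{
    private $pointer;
    private $tagName;
    private $attributes = array();
    private $cdata = '';
    private $parent = null;
    private $childs = array();

    public function __construct($data)
    {
        if (is_array($data)) {
            list($this->tagName, $this->attributes) = $data;
        } elseif (is_string($data)) {
            $this->parse($data);
        }
    }

    public function __toString()
    {
        return (string) $this->cdata;
    }

    public function __get($name)
    {
        if (isset($this->childs[$name])) {
            // A single child is returned directly, several as an array.
            return count($this->childs[$name]) == 1
                ? $this->childs[$name][0]
                : $this->childs[$name];
        }
        throw new Exception("No child element [$name]");
    }

    public function appendChild($tag, $attributes)
    {
        $element = new MiniXml(array($tag, $attributes));
        $element->setParent($this);
        $this->childs[$tag][] = $element;
        return $element;
    }

    // Assumed implementations of the methods the article does not show.
    public function setParent(MiniXml $parent) { $this->parent = $parent; }
    public function getParent() { return $this->parent; }
    // Appends, because the parser may deliver text in several chunks.
    public function setCData($cdata) { $this->cdata .= $cdata; }

    private function parse($data)
    {
        $this->pointer = $this;
        $parser = xml_parser_create();
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler($parser, array($this, 'tagOpen'), array($this, 'tagClose'));
        xml_set_character_data_handler($parser, array($this, 'cdata'));
        xml_parse($parser, $data, true);
    }

    // Handlers are public in this sketch so they are callable in any context.
    public function tagOpen($parser, $tag, $attributes)
    {
        $this->pointer = $this->pointer->appendChild($tag, $attributes);
    }

    public function cdata($parser, $cdata)
    {
        $this->pointer->setCData($cdata);
    }

    public function tagClose($parser, $tag)
    {
        $this->pointer = $this->pointer->getParent();
    }
}

$xml = new MiniXml(
    '<rss><channel><title>News</title>'
    . '<item><url>a</url></item><item><url>b</url></item></channel></rss>'
);
echo $xml->rss->channel->title, "\n";          // single child: no [0] needed
echo $xml->rss->channel->item[1]->url, "\n";   // repeated child: array index
```

The sample document has no whitespace between tags. With indented XML, the whitespace would be collected as character data of the enclosing element, so a fuller version would probably trim or filter it.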

What's next?


This is still not a complete replacement for SimpleXML: our parser does not yet know how to build an XML document from the data it holds. Adding the necessary functions is not difficult, so I will leave it as homework for those who are interested :)

Links


1) The first version of the xml => array parser.
2) Documentation on magic methods (English, Russian).
3) SPL documentation.
4) Description of the XML parser functions.
5) Documentation on references (English, Russian).
6) The final version of the parser and a simple usage example.

Source: https://habr.com/ru/post/30353/

