Good afternoon, Habr readers. This post covers PHP Simple HTML DOM Parser, a joint project by SC Chen and John Schlick (the links point to SourceForge).
The idea of the project is to provide a tool for working with HTML code using jQuery-like selectors. The original idea belongs to Jose Solorzano and was implemented for PHP 4. This project is a more advanced version targeting PHP 5 and later.
This review gives brief excerpts from the official manual, as well as an example parser for Twitter. In fairness, a similar post already exists on Habrahabr, but in my opinion it contains too little information. If you are interested in the topic, read on.
Getting a page's HTML code
$html = file_get_html('http://habrahabr.ru/');
Habr user Fedcomp left a helpful comment about file_get_contents and 404 responses: when a 404 page is requested, the original script returns nothing. To fix this, I added a check based on get_headers. The modified script can be found here.
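The modified script itself is not reproduced here, so the following is only a sketch of what a get_headers-based check might look like. The function names last_status_code and fetch_html_checked are hypothetical and not part of simple_html_dom; the sketch also assumes that anything other than a final 200 status should be treated as a failure.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): extracts the final
// HTTP status code from the array returned by get_headers().
function last_status_code(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 200 OK". After redirects there
        // can be several of them, so the last match wins.
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical wrapper: returns the parsed DOM, or null when the page
// cannot be reached or does not answer with status 200.
// Requires simple_html_dom.php to be loaded before it is called.
function fetch_html_checked(string $url)
{
    $headers = @get_headers($url);
    if ($headers === false || last_status_code($headers) !== 200) {
        return null;
    }
    return file_get_html($url); // provided by simple_html_dom.php
}
```

With this in place, a caller can test the result for null before calling find() on it, instead of working with an empty result.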
Finding elements by tag name
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>'; // e.g. print each image's src attribute
}
Modifying HTML elements
// read HTML from a string (file_get_html() reads it from a file or URL)
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
// set the class of the div at index 1 (the second div) to "bar"
$html->find('div', 1)->class = 'bar';
// replace the contents of the div with id="hello" with foo
$html->find('div[id=hello]', 0)->innertext = 'foo';
// output: <div id="hello">foo</div><div id="world" class="bar">World</div>
echo $html;
Getting the text content of an element (plaintext)
echo file_get_html('http://habrahabr.ru/')->plaintext;
The purpose of this article is not to provide comprehensive documentation for the script; a detailed description of all its features can be found in the official manual. If the community is interested, I will gladly translate the entire manual into Russian. For now, here is the Twitter parser example promised at the beginning of the article.
Sample Twitter message parser
require_once 'simple_html_dom.php';
Output messages
for ($j = 0; $j < $maxpost; $j++) {
    echo '<div class="twitter_message">';
    echo '<p class="twitter_text">' . $articles[$j]['text'] . '</p>';
    echo '<p class="twitter_time">' . $articles[$j]['time'] . '</p>';
    echo '</div>';
}
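The output loop indexes $articles up to $maxpost and prints the stored strings as they are. As a hedged variation, the sketch below stops at the end of the array and escapes the values before printing. It assumes each entry holds plain-text 'text' and 'time' strings; if the parser stored HTML (for example innertext with links), escaping would display the tags literally. The function name render_messages and the sample data are invented for illustration.

```php
<?php
// Hypothetical helper: builds the markup for at most $maxpost messages.
// Assumes each entry of $articles has 'text' and 'time' string keys.
function render_messages(array $articles, int $maxpost): string
{
    $out = '';
    // array_slice keeps the loop safe when fewer than $maxpost entries exist.
    foreach (array_slice($articles, 0, $maxpost) as $article) {
        $text = htmlspecialchars($article['text'], ENT_QUOTES, 'UTF-8');
        $time = htmlspecialchars($article['time'], ENT_QUOTES, 'UTF-8');
        $out .= '<div class="twitter_message">'
              . '<p class="twitter_text">' . $text . '</p>'
              . '<p class="twitter_time">' . $time . '</p>'
              . '</div>';
    }
    return $out;
}

// Sample data, invented for illustration only.
$articles = [
    ['text' => 'Hello <b>world</b>', 'time' => '2 hours ago'],
    ['text' => 'Second message', 'time' => '1 day ago'],
];
echo render_messages($articles, 5);
```

Returning a string rather than echoing inside the function keeps the markup easy to test or cache.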
Thank you for your attention. I hope the article was not too heavy and was easy to read.
Related Libraries
htmlSQL - thanks to Chesnovich
Zend_Dom_Query - thanks to majesty
phpQuery - thanks to theRavel
QueryPath - thanks to ZonD80
DomCrawler (Symfony) - thanks to choor
CDom - thanks to its author, samally
The well-known XPath - thanks to KAndy for the reminder
P.S. Habr user Groove pointed out that such materials have already been published.
P.P.S. In my free time I will try to gather all these libraries and compile a comparison of their performance and ease of use.