
PHP HTML DOM parser with jQuery-like selectors

Good afternoon, dear Habr readers. This post covers PHP Simple HTML DOM Parser, a joint project by SC Chen and John Schlick (links on SourceForge).

The idea of the project is to provide a tool for working with HTML code through jQuery-like selectors. The original idea belongs to Jose Solorzano and was implemented for PHP 4. This project is a more advanced version built for PHP 5+.

This review gives brief excerpts from the official manual, along with an example parser for Twitter. In fairness, a similar post already exists on habrahabr, but in my opinion it contains too little information. If the topic interests you, read on.

Getting the HTML code of a page

$html = file_get_html('http://habrahabr.ru/'); // https:// URLs work as well

Habr user Fedcomp left a helpful comment about file_get_contents and 404 responses: when the requested page returns 404, the original script returns nothing. To fix this, I added a check based on get_headers. The modified script can be found here.
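
The modified script itself is not reproduced in this post, so what follows is only a minimal sketch of what a get_headers check could look like. The names final_status_code() and url_is_available() are hypothetical helpers invented for this illustration; they are not part of simple_html_dom.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): returns the final HTTP
// status code found in the array produced by get_headers(), or null if none.
function final_status_code(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 404 Not Found". After redirects
        // there are several of them, and the last one is the final response.
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical wrapper: true only when the URL answers with a 2xx status.
function url_is_available(string $url): bool
{
    $headers = @get_headers($url); // false when the request fails entirely
    if ($headers === false) {
        return false;
    }
    $code = final_status_code($headers);
    return $code !== null && $code >= 200 && $code < 300;
}
```

With helpers like these, a caller could run file_get_html($url) only when url_is_available($url) returns true, and handle the 404 case explicitly otherwise.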

Finding elements by tag name

foreach ($html->find('img') as $element) { // for every img element on the page
    echo $element->src . '<br>'; // print the value of its src attribute
}

Modification of html elements

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>'); // read HTML from a string (file_get_html() reads from a file)
$html->find('div', 1)->class = 'bar'; // give the div at index 1 the class "bar"
$html->find('div[id=hello]', 0)->innertext = 'foo'; // write the text foo into the div with id="hello"
echo $html; // prints <div id="hello">foo</div><div id="world" class="bar">World</div>

Getting the text content of an element (plaintext)

 echo file_get_html('http://habrahabr.ru/')->plaintext; 

The purpose of this article is not to provide comprehensive documentation for the script; a detailed description of all its features is in the official manual. If the community wants it, I will gladly translate the entire manual into Russian. For now, here is the Twitter parser example promised at the beginning of the article.

Sample twitter message parser

require_once 'simple_html_dom.php'; // load the library
$username = 'habrahabr'; // Twitter account name
$maxpost = 5; // number of posts to fetch
$html = file_get_html('https://twitter.com/' . $username);
$articles = []; // collected messages
$i = 0;
foreach ($html->find('li.expanding-stream-item') as $article) { // select every li that holds a message
    $item['text'] = $article->find('p.js-tweet-text', 0)->innertext; // message text as HTML
    $item['time'] = $article->find('small.time', 0)->innertext; // timestamp as HTML
    $articles[] = $item; // append to the array
    $i++;
    if ($i == $maxpost) break; // stop once the limit is reached
}


Output messages

foreach ($articles as $item) { // $articles holds at most $maxpost messages
    echo '<div class="twitter_message">';
    echo '<p class="twitter_text">' . $item['text'] . '</p>';
    echo '<p class="twitter_time">' . $item['time'] . '</p>';
    echo '</div>';
}


Thank you for your attention. I hope the article was not too heavy and was easy to read.

Related Libraries

htmlSQL - thanks to Chesnovich
Zend_Dom_Query - thanks to majesty
phpQuery - thanks to theRavel
QueryPath - thanks to ZonD80
DomCrawler (Symfony) - thanks to choor
CDom - thanks to its author, samally
The well-known XPath - thanks to KAndy for the reminder
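
For comparison, the XPath route from the list above can be tried with the DOM extension bundled with PHP, without simple_html_dom. This is only a rough sketch of the same "print every img src" task shown earlier; img_sources() is a hypothetical helper name and the markup string is made up for the illustration.

```php
<?php
// Sketch using PHP's bundled DOM extension (DOMDocument + DOMXPath) instead
// of simple_html_dom. img_sources() is a hypothetical helper name.
function img_sources(string $markup): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect HTML does not spam output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];
    // "//img[@src]" selects every img element that has a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Made-up markup, used only to exercise the helper.
foreach (img_sources('<div><img src="a.png"><p>text</p><img src="b.png"></div>') as $src) {
    echo $src . "\n";
}
```

The selector syntax is more verbose than the jQuery-like form, which is the trade-off the libraries in this list try to address.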

PS
Habr user Groove pointed out that similar material has already been posted.
PPS
In my free time I will try to gather all of these libraries and put together a comparison of their performance and ease of use.

Source: https://habr.com/ru/post/176635/

