
Writing a parser with XPath and Yii

Introduction

Sometimes a customer needs a wrapper around the API of some service. That kind of task is usually fairly simple, but the service does not always have an API, or the API is such that you wish it did not exist. In those cases you have to parse the content of the page itself.

As an example for this article, we will use the XenForo demo forum and a topic created there in advance. From it we will parse typical data: the title, the creation time and the text of the topic. The parsing is carried out from an authorized forum account. All other data can be obtained by analogy.

The parser itself is implemented as a component for convenient use in Yii2.

What do we need


Let's start

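Create the ParserXenforo component. Since we don't need events, it is enough to inherit from Object. (Starting with Yii 2.0.13 this base class is named yii\base\BaseObject, because object became a reserved word in PHP 7.2.)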
    <?php

    namespace app\components;

    use Yii;
    use \yii\base\Object;

    class ParserXenforo extends Object
    {
    }


Next, add the properties and constants needed to load the page. The host, username, password and curlOpt properties will be set in the component configuration.

    <?php

    namespace app\components;

    use Yii;
    use \yii\base\Object;

    class ParserXenforo extends Object
    {
        /**
         * URI of the login action
         */
        const REQUEST_URI_LOGIN = 'login/login';

        /**
         * Name of the file that stores cookies
         */
        const COOKIES_FILE_NAME = 'cookies.txt';

        /**
         * @var string loaded page data
         */
        private $_data;

        /**
         * @var string forum host
         */
        public $host;

        /**
         * @var string user name
         */
        public $username;

        /**
         * @var string user password
         */
        public $password;

        /**
         * @var array cURL options
         */
        public $curlOpt;
    }


Add the page loading methods.
First, we implement a method that returns the header and user-agent values stored in curlOpt, which will later be passed to cURL as options.

    protected function getCurlOpt($nameOpt)
    {
        // Only the two supported option names are allowed
        if ($nameOpt !== 'userAgent' && $nameOpt !== 'header') {
            return false;
        }

        return $this->curlOpt[$nameOpt];
    }

To log in to the forum, the username and password must be sent via POST. For this we build the login URL (host + login URI).

    protected function getLoginUrl()
    {
        return $this->host . self::REQUEST_URI_LOGIN;
    }

And the POST request string

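    protected function createPostRequestForCurl()
    {
        // http_build_query() URL-encodes the values, so special characters
        // in the username or password do not break the request
        return http_build_query([
            'login' => $this->username,
            'password' => $this->password,
            'remember' => 1,
        ]);
    }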
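
One thing to watch with a POST string is special characters in the credentials: an unencoded "&" or "=" inside a password would split the request into the wrong fields. The standalone sketch below shows how http_build_query() handles this. The function name buildLoginPostBody and the sample password are made up for illustration; the field names are the ones used in this article.

```php
<?php

/**
 * Builds an application/x-www-form-urlencoded body for the login request.
 * The function name is hypothetical; the field names follow the article.
 */
function buildLoginPostBody(string $username, string $password): string
{
    // http_build_query() percent-encodes every value, so characters such as
    // "&" or "=" inside the password cannot break the request apart.
    return http_build_query([
        'login' => $username,
        'password' => $password,
        'remember' => 1,
    ]);
}

// prints: login=admin&password=p%26ss%3Dword&remember=1
echo buildLoginPostBody('admin', 'p&ss=word'), PHP_EOL;
```

For plain credentials such as admin/admin the result is identical to simple string concatenation.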

To keep the session between requests, we store cookies in a file in the runtime directory. To get the full path to this file, create a method that resolves the path alias and appends the file name to it.

    protected function getPathToCookieFile($cookieFileName = self::COOKIES_FILE_NAME)
    {
        return Yii::getAlias('@app/runtime') . DIRECTORY_SEPARATOR . $cookieFileName;
    }

Now implement the method that loads the page at the given URL. First we request the login action, passing the POST values, and are then returned to the given URL, this time in an authorized account. This is a precaution: forums of this kind often have an add-on installed that hides content from unauthorized users.
After the data has been loaded into _data, we log this via the Yii::info() method.

    public function loadUsingCurl($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->loginUrl);
        curl_setopt($ch, CURLOPT_FAILONERROR, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_REFERER, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $this->getCurlOpt('header'));
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->pathToCookieFile);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->pathToCookieFile);
        curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $this->getCurlOpt('userAgent'));
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $this->createPostRequestForCurl());

        // Execute the request once and check the stored result
        $this->_data = curl_exec($ch);
        if ($this->_data === false) {
            $error = curl_errno($ch) . ': ' . curl_error($ch);
            curl_close($ch);
            throw new \Exception($error);
        }

        curl_close($ch);
        Yii::info(Yii::t('app', 'Loading data page'));

        return $this;
    }


The basic part of the component is done. Now register it in the application components and configure it: specify the user-agent (for example, that of the machine where the component runs), the base URL and the login credentials.
The demo was issued with the credentials admin:admin. It was only valid for a few days, until Mar 24, 2014 at 7:26 AM.

    ....
    'components' => [
        ...
        'parser' => [
            'class' => 'app\components\ParserXenforo',
            'host' => 'http://9af5766eb2759a49.demo-xenforo.com/130/index.php?',
            'username' => 'admin',
            'password' => 'admin',
            'curlOpt' => [
                'userAgent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36',
                'header' => [
                    'Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1',
                    'Accept-Language: en-US,en;q=0.8,ru;q=0.6,uk;q=0.4',
                    'Accept-Charset: Windows-1251, utf-8, *;q=0.1',
                    'Accept-Encoding: deflate, identity, *;q=0',
                ]
            ]
        ],
        ...
    ],
    ....

To check that it works, call the component in a controller action and look in the app.log file to see whether everything went well.

    $urlThread = 'http://9af5766eb2759a49.demo-xenforo.com/130/index.php?threads/some-thread.1/';

    /** @var \app\components\ParserXenforo $dataParse */
    $dataParse = Yii::$app->parser->loadUsingCurl($urlThread);


Data parsing

Start by creating a method that builds a DOMDocument object from our page, and add a property to store it. Before loading, we switch libxml to internal error handling and switch it back afterwards, so that malformed markup on the page does not cause problems during parsing. As a result, we get the DOM of our page for further work. Regular expressions could also be used, but working with the DOM is more convenient in this case.

    public function createDomDocument()
    {
        $this->_dom = new \DOMDocument();

        // Collect libxml errors internally instead of emitting warnings
        libxml_use_internal_errors(true);
        if ($this->_dom->loadHTML($this->_data)) {
            Yii::info(Yii::t('app', 'Create DomDocument'));
        } else {
            Yii::info(Yii::t('app', 'An error occurred when creating an object of class DOMDocument'));
        }
        libxml_use_internal_errors(false);

        return $this;
    }
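
The libxml error handling can also be tried outside of the component. The following is a minimal standalone sketch, not part of the article's class: the function name loadHtmlQuietly and the sample markup are made up for illustration. It additionally reads and clears the buffered messages, which is worth doing when many pages are loaded in one process.

```php
<?php

/**
 * Loads HTML into a DOMDocument while buffering libxml messages instead of
 * letting them surface as PHP warnings. The function name is hypothetical.
 *
 * @return array{0: DOMDocument, 1: int} the document and the number of buffered messages
 */
function loadHtmlQuietly(string $html): array
{
    $dom = new DOMDocument();

    // Remember the previous setting so that it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);

    // Messages are buffered internally: count them, then free the buffer.
    $errorCount = count(libxml_get_errors());
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $errorCount];
}

// Markup in the wild is often sloppy; the <p> below is never closed.
[$dom, $errorCount] = loadHtmlQuietly(
    '<html><body><h1>Some thread</h1><p>Unclosed paragraph</body></html>'
);

echo $dom->getElementsByTagName('h1')->item(0)->nodeValue, PHP_EOL;
```

How many messages libxml reports for a given page depends on the markup and the libxml version, so the count is best treated as diagnostic information only.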


Next comes a method that creates a DOMXPath object, which makes it convenient to run XPath expressions to get the data we need.

    public function createDomXpath()
    {
        $this->_xpath = new \DOMXPath($this->_dom);
        Yii::info(Yii::t('app', 'Create DomXpath'));

        return $this;
    }


Now we can move on to running XPath queries to get our data: title, timestamp and content.
First we get the title and add the _title property.

    public function parseTitle()
    {
        $xpathQuery = '*//h1';
        $nodes = $this->_xpath->query($xpathQuery, $this->_dom);
        if ($nodes->length === 0) {
            Yii::info(Yii::t('app', 'Error parse title'));
            // Nothing found: stop here instead of reading a missing node
            return $this;
        }

        $this->_title = $nodes->item(0)->nodeValue;
        Yii::info(Yii::t('app', 'Parse title'));

        return $this;
    }


Next, the timestamp of our topic

    public function parseTimestamp()
    {
        $xpathQuery = '*//p[@id="pageDescription"]/a/abbr';
        $nodes = $this->_xpath->query($xpathQuery, $this->_dom);
        if ($nodes->length === 0) {
            Yii::info(Yii::t('app', 'Error parse timestamp'));
            return $this;
        }

        // The data-time attribute holds the timestamp
        $this->_timestamp = $nodes->item(0)->getAttribute('data-time');
        Yii::info(Yii::t('app', 'Parse timestamp'));

        return $this;
    }
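
The attribute value arrives as a string, so it is worth validating before it is passed to a date function. A minimal sketch, assuming the value is a Unix timestamp in seconds; the function name formatThreadTimestamp and the sample value are hypothetical and not part of the component.

```php
<?php

/**
 * Converts a raw data-time attribute value (assumed to be a Unix timestamp
 * stored as a string) into a formatted UTC date.
 * The function name is hypothetical.
 */
function formatThreadTimestamp(string $rawTimestamp, string $format = 'Y-m-d H:i:s'): ?string
{
    // Reject anything that is not a plain run of digits.
    if (preg_match('/^\d+$/', $rawTimestamp) !== 1) {
        return null;
    }

    // gmdate() formats in UTC, so the result does not depend on the
    // timezone configured on the server.
    return gmdate($format, (int) $rawTimestamp);
}

// Sample value, made up for the example.
echo formatThreadTimestamp('1395646000'), PHP_EOL;

// An unexpected value yields null instead of a misleading date.
var_dump(formatThreadTimestamp('not-a-timestamp'));
```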

Finally, get the content

    public function parseContent()
    {
        $xpathQuery = '*//blockquote[@class="messageText ugc baseHtml"]';
        $nodes = $this->_xpath->query($xpathQuery, $this->_dom);
        if ($nodes->length === 0) {
            Yii::info(Yii::t('app', 'Error parse content'));
            return $this;
        }

        $this->_content = $nodes->item(0)->nodeValue;
        Yii::info(Yii::t('app', 'Parse content'));

        return $this;
    }
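
Note that @class="messageText ugc baseHtml" compares the whole attribute string, so the query stops matching if the forum theme reorders the classes or adds one. A common XPath 1.0 workaround is to match a single class token. The sketch below is an illustration under that assumption: the helper name classPredicate and the sample markup are made up, and the helper is only meant for simple class names without quotes.

```php
<?php

/**
 * Returns an XPath 1.0 predicate that matches one class token regardless of
 * the other classes on the element. The function name is hypothetical.
 */
function classPredicate(string $className): string
{
    return sprintf(
        'contains(concat(" ", normalize-space(@class), " "), " %s ")',
        $className
    );
}

// Sample markup (made up): different class order plus an extra class,
// and a second element whose class merely starts with the same letters.
$html = '<html><body>'
    . '<blockquote class="baseHtml messageText ugc extra">First post</blockquote>'
    . '<blockquote class="messageTextual">Not a post</blockquote>'
    . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Exact string comparison: finds nothing in this markup.
$exact = $xpath->query('//blockquote[@class="messageText ugc baseHtml"]');

// Token comparison: finds only the first blockquote.
$token = $xpath->query('//blockquote[' . classPredicate('messageText') . ']');

// prints: 0 1
echo $exact->length, ' ', $token->length, PHP_EOL;
```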


Let's step back a bit and look in more detail at the XPath queries we used
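
The three queries can be tried against a small document. The markup below is a made-up, heavily simplified approximation of a thread page, not a copy of real XenForo output; only the element names, the id and the class string are taken from the queries in this article.

```php
<?php

// Hypothetical, simplified thread page.
$html = '<html><body>'
    . '<h1>Some thread</h1>'
    . '<p id="pageDescription">Started by admin, '
    . '<a href="#"><abbr data-time="1395646000">Mar 24, 2014</abbr></a></p>'
    . '<blockquote class="messageText ugc baseHtml">Text of the first post</blockquote>'
    . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// "*" selects the element children of the context node. With the document as
// context this is the root <html>; "//h1" then finds every <h1> below it.
$title = $xpath->query('*//h1', $dom)->item(0)->nodeValue;

// Same start, then: a <p> whose id attribute equals "pageDescription",
// an <a> that is its direct child, and an <abbr> directly inside that link.
$abbr = $xpath->query('*//p[@id="pageDescription"]/a/abbr', $dom)->item(0);
$timestamp = $abbr->getAttribute('data-time');

// A <blockquote> whose class attribute is exactly the given string.
$content = $xpath
    ->query('*//blockquote[@class="messageText ugc baseHtml"]', $dom)
    ->item(0)
    ->nodeValue;

// prints: Some thread | 1395646000 | Text of the first post
echo $title, ' | ', $timestamp, ' | ', $content, PHP_EOL;
```

In each case query() returns a DOMNodeList, and item(0) takes the first match, which is why the component checks the list length before reading from it.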


Create a method that completes the parsing (it is not strictly necessary, but it makes it clearer that parsing has finished and all the data has been received), as well as methods for accessing the parsed data.

    /**
     * @return \app\components\ParserXenforo
     */
    public function endParse()
    {
        // Check all three parsed values: title, timestamp and content
        if (isset($this->_title, $this->_timestamp, $this->_content)) {
            Yii::info(Yii::t('app', 'End parse'));
        } else {
            Yii::info(Yii::t('app', 'Some data were not received'));
        }

        return $this;
    }

    /**
     * @return string title
     */
    public function getTitle()
    {
        return $this->_title;
    }

    /**
     * @return int timestamp
     */
    public function getTimestamp()
    {
        return $this->_timestamp;
    }

    /**
     * @return string content
     */
    public function getContent()
    {
        return $this->_content;
    }


Output Results

The component is now ready. To see how it works, add the necessary calls to a controller action and output the results in a view.

    $urlThread = 'http://9af5766eb2759a49.demo-xenforo.com/130/index.php?threads/some-thread.1/';

    /** @var \app\components\ParserXenforo $dataParse */
    $dataParse = Yii::$app->parser
        ->loadUsingCurl($urlThread)
        ->createDomDocument()
        ->createDomXpath()
        ->parseTitle()
        ->parseTimestamp()
        ->parseContent()
        ->endParse();

    return $this->render('index', ['data' => $dataParse]);


    <?php

    use yii\helpers\Html;

    /**
     * @var yii\web\View $this
     * @var \app\components\ParserXenforo $data
     */
    $this->title = 'My Yii Application';
    ?>
    <div class="site-index">
        <!-- Parsed text comes from a third-party page, so it is HTML-encoded -->
        <h1><?= Html::encode($data->title) ?></h1>
        <p>Created At: <?= date('Ymd H:i:s', (int) $data->timestamp) ?></p>
        <p><?= Html::encode($data->content) ?></p>
    </div>


The result is a page showing the title, creation time and text of the topic.

Conclusion

In this article, we looked at how to build a page content parser as a Yii component, using a XenForo forum topic as the example.
By analogy, you can parse other data, or create a slightly different class for parsing, for example, all the topics of a forum.
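
As a rough illustration of that idea, the sketch below collects thread links from a listing page. It rests on a loud assumption: that thread links are <a> elements whose href starts with "threads/", as in the thread URL used in this article. The function name extractThreadLinks and the sample markup are hypothetical, and the real forum markup should be checked before relying on this query.

```php
<?php

/**
 * Collects thread links from a forum listing page.
 * ASSUMPTION: thread links are <a> elements whose href starts with "threads/".
 * The function name is hypothetical.
 *
 * @return array<string, string> map of href => link text
 */
function extractThreadLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[starts-with(@href, "threads/")]') as $a) {
        $links[$a->getAttribute('href')] = trim($a->nodeValue);
    }

    return $links;
}

// Made-up listing with two thread links and one unrelated link.
$listing = '<html><body><ul>'
    . '<li><a href="threads/some-thread.1/">Some thread</a></li>'
    . '<li><a href="threads/another-thread.2/">Another thread</a></li>'
    . '<li><a href="members/admin.1/">admin</a></li>'
    . '</ul></body></html>';

print_r(extractThreadLinks($listing));
```

Each collected href could then be appended to the host and passed to loadUsingCurl() in a loop.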

The theoretical side was not covered in this article; the aim was to show, with a not very realistic but simple example, how to get data from a page.
A link to the example code can be found in the resources.

Resources

Description Yii2 minimal
Yii2 documentation
Xpath 1.0 specification in Russian
Source Code Repository

Source: https://habr.com/ru/post/216227/

