
MODx and parsing tags from another site

Probably every programmer-slash-site-administrator has faced the problem of importing data from third-party sites. The task itself is fairly trivial and requires no special knowledge... the only question is the wrapper around it. I am writing this article to add to the collection of MODx articles; perhaps it will come in handy for someone.
Attention! This post has no practical value on its own; it is purely theoretical, a la "a simple example of working with the MODx back end".
The task was: parse a table from the example.com page, reformat it, and embed it into our site.

My two cents


We will use cURL, since we won't find a better tool for the purpose. First of all we create two chunks, parserTplOuter and parserTplInner - the wrapper and its "insides", respectively. I happened to be building a table, so the example follows that, but nothing stops you from producing divs with whatever styles you need.

parserTplOuter
 <table>[+content+]</table>

The wrapper is kept separate deliberately, so that the appearance, the placement of elements, and other cosmetic details can be adjusted without touching the code. Sticking to the MVC model is sacred!
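For example, restyling the output then comes down to editing the chunk alone; a hypothetical variant (the class name here is invented purely for illustration) might look like:

 <table class="parsed-table">[+content+]</table>

and the snippet code would not change at all.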
parserTplInner
 <tr>
      <td>[+0+]</td> <td>[+1+]</td> <td>[+2+]</td> <td>[+3+]</td>
 </tr>

A word on why I used numeric identifiers: first, the task simply did not call for associative arrays :) and second, although there is some chance of getting confused, you end up with a kind of standardization where adding one more element to a "row" is very simple.
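To make the mapping concrete, here is a minimal sketch of what happens to one row further down: $modx->parseChunk() (the MODx Evolution API call used in the snippet below) substitutes each numeric array key into the matching [+N+] placeholder. The cell values here are, of course, made up:

 <?php
 // hypothetical row, as the parser below would produce it: plain numeric keys 0..3
 $row = array('14.05.2010', 'USD', '30.11', '30.25');
 echo $modx->parseChunk('parserTplInner', $row, '[+', '+]');
 // -> <tr><td>14.05.2010</td> <td>USD</td> <td>30.11</td> <td>30.25</td></tr>
 ?>

Adding a fifth column to the "row" is then just a matter of appending a [+4+] cell to the chunk.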

Well, the jars are ready, so let's fill them with juice: we create a snippet, which we'll call parser:

 <?php

 if (empty($url)) return false; // if no URL was passed to the snippet, stop execution
 // a whole set of URL validity checks belongs here, but we assume nobody
 // is going to write anything evil, and we won't give fools access anyway ;)

 $tplInner = (empty($tplInner)) ? 'parserTplInner' : $tplInner; // set the default
 $tplOuter = (empty($tplOuter)) ? 'parserTplOuter' : $tplOuter; // chunks for the snippet

 $c = (empty($count) || !is_numeric($count)) ? 6 : $count; // put a limit on the number of records
 $c = ($c > 100) ? 100 : $c; // maximum
 $c = ($c < 1) ? 1 : $c;     // minimum

 // initialize the cURL session
 $ch = curl_init();

 // set the URL and other required options
 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOPT_HEADER, 0);
 curl_setopt($ch, CURLOPT_TIMEOUT, 5);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

 // fetch the page into a variable
 $html = curl_exec($ch);

 // end the session and release resources
 curl_close($ch);

 if (mb_strlen($html) < 100) { return ''; } // if the response is too short, stop processing.
 // Be careful here: all sorts of code can come back, but a bare HTML skeleton
 // alone already adds up to roughly 100 characters.

 $pattern = "/<table(?:[^>]*)>([\s\S]+)<\/table>/i"; // this cuts out everything between the first <table> and the last </table>.
 // In theory the pattern could be stuffed into a chunk and edited there, but practice shows
 // the parsing will be more or less unique in every case anyway, and it is easier to write your
 // own variant based on this code than to tweak parameters until the result comes out right...

 preg_match($pattern, $html, $matches);
 unset($matches[0]); // for those who forgot: element 0 holds the entire matched string, and we don't need it

 $array = explode('</tr>', $matches[1]); // a sly move: turn all the rows into array elements

 $separator = '|==|'; $table = array(); // prepared, and...

 foreach ($array as &$value) {
     // (quick-and-dirty code in action)
     $value = str_replace('</td><td', '</td>' . $separator . '<td', $value); // ...pulled a neat trick :)
     $value = strip_tags($value); // too lazy to strip the table tags with a regex; it is easier to mark the
     $table[] = explode($separator, $value); // cell boundaries with a service string and split what strip_tags leaves by that separator
 }
 unset($value); // drop the reference left over from the foreach

 $i = 0; // note: element 0 holds the table headings; if you don't need them, set $i = 1 and add unset($table[0]);
 $rows = '';

 foreach ($table as $row) {
     // now run through every row
     if ($i++ > $c) break; // enforce the record limit
     $rows .= $modx->parseChunk($tplInner, $row, '[+', '+]'); // render the row
 }

 return $modx->parseChunk($tplOuter, array('content' => $rows), '[+', '+]'); // wrap the rows and return the result
 ?>

That is actually all that was required. Now we insert the call in the right place on the site:
  [[parser? &tplInner=`parserTplInner` &tplOuter=`parserTplOuter` &url=`http://example.com` &count=`10`]]

and enjoy the neat little table that appears.
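One caveat worth noting: in MODx Evolution a [[...]] snippet call is cached, so the remote page is effectively fetched only when the page cache is rebuilt. If the foreign data changes often, the uncached call syntax can be used instead:

  [!parser? &tplInner=`parserTplInner` &tplOuter=`parserTplOuter` &url=`http://example.com` &count=`10`!]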

Nobody needs THIS exact script as-is, but with a head and some gray matter in it, the snippet is very easy to adapt to your own needs. It is quite simple to change the template, or to swap the processing logic to slice up a set of divs instead of a table; either way, any parsing comes down to the fact that the target site has some kind of static structure. Guided by that structure we obtain the data for processing, and how to display it is an entirely separate headache.
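As a rough illustration of that claim: only the cutting step of the snippet is tied to tables. Assuming (purely hypothetically) flat markup of the form <div class="row"><span class="cell">...</span>...</div>, with no nested divs and no whitespace between cells, the replacement could look like this:

 <?php
 // collect every <div class="row">...</div> block (non-greedy match, no nested divs assumed)
 preg_match_all('/<div class="row">([\s\S]+?)<\/div>/i', $html, $matches);

 $separator = '|==|'; $table = array();
 foreach ($matches[1] as $value) {
     // the same trick as with the table cells, just with different boundaries
     $value = str_replace('</span><span', '</span>' . $separator . '<span', $value);
     $table[] = explode($separator, strip_tags($value));
 }
 ?>

Everything after that point (the record limit and the chunk rendering) stays exactly the same.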

Source: https://habr.com/ru/post/93905/

