
Parsing Telegram channels for a content aggregator in PHP

Hi, Habr!



A few years ago I started developing my own content aggregator to make browsing the web easier. At first it only parsed RSS, VK and Facebook, but last year I decided to refactor the whole project: drop client-side parsing, build a proper back end, store the data in a database, and expand the list of supported sources.



In addition to the standard set of RSS, Facebook, VK, Twitter, Instagram and YouTube, I added support for arbitrary public Telegram channels.




Below the cut are step-by-step instructions for parsing any Telegram channel without registration or SMS.



At first I assumed channels could be parsed through the popular Bot API, for which there are plenty of tutorials online. It turned out, however, that a bot can read a channel only after it has been added to that channel, which is not an option for third-party channels. So I moved on to the documentation for the main Telegram API.



After 30 minutes with the documentation I was in despair. All Telegram data is encrypted, and getting anything from the servers seems to require a master's degree in cryptography... And instead of HTTP requests it uses sockets, which I had never worked with before. In short, pure hardcore and no clear examples online... It was almost a fiasco.



My last hope was to find a ready-made solution, and then luck finally smiled on me. On the Telegram site I came across a link to an unofficial open-source PHP client. Yes, yes! You can use Telegram from PHP, and it even supports calls! This miracle is called MadelineProto. It connects to the servers using cryptographic magic and returns the data I need as a normal, human-readable associative array.



I started setting up the PHP client.



1. Registering your client.



Unfortunately, I deceived you at the beginning of the post: registration and SMS authorization in Telegram are still required...





If you already have a Telegram account, all that remains is to register your application (client) and get the keys for accessing the Telegram servers.



This is a standard procedure, similar to the one social networks use for API access. See the instructions for creating your keys.



After registering the client, we only need the “App api_id” and “App api_hash” values from the page my.telegram.org/apps
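
One way to keep these two values out of the source tree is to read them from environment variables. The following is only a minimal sketch: the variable names TELEGRAM_API_ID and TELEGRAM_API_HASH and the helper load_app_info() are assumptions made for this example, not something defined by Telegram or MadelineProto. The returned array has the same shape as the 'app_info' setting used in step 3.

```php
<?php
// Sketch: read the Telegram app credentials from the environment.
// TELEGRAM_API_ID / TELEGRAM_API_HASH are hypothetical names chosen for this example.
function load_app_info(): array
{
    $apiId   = getenv('TELEGRAM_API_ID');
    $apiHash = getenv('TELEGRAM_API_HASH');

    // getenv() returns false when the variable is not set.
    if ($apiId === false || $apiHash === false) {
        throw new RuntimeException('Set TELEGRAM_API_ID and TELEGRAM_API_HASH first.');
    }
    // api_id is a number, api_hash is a string.
    if (!ctype_digit($apiId)) {
        throw new InvalidArgumentException('TELEGRAM_API_ID must be an integer.');
    }

    return ['api_id' => (int) $apiId, 'api_hash' => $apiHash];
}
```

The result can then be assigned to the 'app_info' key of the settings array instead of hard-coding the values.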



2. Installing MadelineProto.



It requires PHP 7, although the README says there is a way to run it on PHP 5.6.



I had no problems running it either on macOS with PHP 7 from the MAMP package or on basic shared hosting for 150 rubles a month.



The process is straightforward: download the release, install the dependencies with Composer, and you can move on to the setup.



To reduce the size, I removed the extra dependencies and kept only danog, paragonie and phpseclib. This did not affect the client's operation.



3. Setting up MadelineProto and the first launch.



All usage and configuration examples are described in the client's repository, but I will give my own code with comments.



First you need to run the client in key generation mode.



At this stage you authorize a new connection by entering the verification code that is sent to a Telegram client you are already logged in to. Run the code from the console, because the script prompts for the authorization code while it runs. If you restart the script, a new code is issued.



The number of authorization requests is limited. If something does not work, do not run the code many times in a row, or Telegram will block sending confirmation codes for a day or more.



Unfortunately, I learned about this the hard way. Telegram has no proper tech support either, by the way, so if you get blocked you will just have to wait :)



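    set_time_limit(60); // Allow up to 60 seconds, since authorization can take a while.
    require_once ROOT_DIR.'/libs/MadelineProto/vendor/autoload.php'; // Path to the library's autoloader.

    // Client settings; the available options are described in the README on GitHub.
    $settings = [
        'authorization' => [
            'default_temp_auth_key_expires_in' => 315576000, // Key lifetime of 10 years, to avoid re-authorizing.
        ],
        'app_info' => [ // Application credentials from https://my.telegram.org
            'api_id'   => XXXXX,        // Replace with your integer api_id.
            'api_hash' => 'XXXXXXXXXX', // Replace with your api_hash (a string).
        ],
        'logger' => [ // Output of messages and errors.
            'logger'       => 3,             // Print messages with echo.
            'logger_level' => 'FATAL ERROR', // Report only critical errors.
        ],
        'max_tries' => [ // Number of attempts at each stage. Keep it low so that repeated failures do not trigger a block.
            'query'         => 5,
            'authorization' => 5,
            'response'      => 5,
        ],
        'updates' => [ // Updates are not needed here, so their handling is switched off.
            'handle_updates'     => false,
            'handle_old_updates' => false,
        ],
    ];

    $MadelineProto = new \danog\MadelineProto\API($settings);

    // Request a verification code for the given phone number.
    $MadelineProto->phone_login(readline('Enter your phone number: '));
    // Enter the code received in an already authorized Telegram client.
    $authorization = $MadelineProto->complete_phone_login(readline('Enter the code you received: '));

    if ($authorization['_'] === 'account.noPassword') {
        throw new \danog\MadelineProto\Exception('2FA is enabled but no password is set!');
    }
    if ($authorization['_'] === 'account.password') {
        // Only needed when two-factor authentication is enabled.
        $authorization = $MadelineProto->complete_2fa_login(readline('Please enter your password (hint '.$authorization['hint'].'): '));
    }
    if ($authorization['_'] === 'account.needSignup') {
        $authorization = $MadelineProto->complete_signup(readline('Please enter your first name: '), readline('Please enter your last name (can be empty): '));
    }

    // Save the session to a file so that later runs can resume it.
    $MadelineProto->session = 'session.madeline';
    $MadelineProto->serialize();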


After successful authorization the code above can be deleted; it is no longer needed.



The file “session.madeline” is created in the project root; it stores our session data in binary form. To change the client settings you must either create a new session or try to edit this file in a binary editor. Since Telegram limits the number of authorization attempts per day, I recommend choosing the settings carefully so that you do not have to change them often.
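
Because authorization attempts are limited, it may be worth guarding the login script so that it never requests a new code when a session file already exists. A minimal sketch of such a guard follows; needs_login() is a hypothetical helper written for this example and is not part of MadelineProto.

```php
<?php
// Sketch: decide whether the interactive login still has to be run.
// needs_login() is a hypothetical helper, not a MadelineProto function.
function needs_login(string $sessionFile): bool
{
    // Drop cached stat data so a file written moments ago is seen correctly.
    clearstatcache(true, $sessionFile);

    // A missing or empty session file means no successful login was saved.
    return !is_file($sessionFile) || filesize($sessionFile) === 0;
}
```

In the login script the call could look like `if (!needs_login('session.madeline')) { exit("Session already exists\n"); }`, placed before phone_login() is called.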



Now we can use fast session resumption, so let's write new code.



4. Getting posts from an arbitrary public Telegram channel.



Resuming a session is fairly quick. Fetching the data takes me 2-4 seconds, so set_time_limit is no longer needed.



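    require_once ROOT_DIR.'/libs/MadelineProto/vendor/autoload.php';

    // Resume the session saved during authorization.
    $MadelineProto = new \danog\MadelineProto\API('session.madeline');

    // $val holds the parameters of one channel, e.g. ['url' => 'breakingmash'].
    $settings = array(
        'peer'        => '@'.$val['url'],          // Channel name; must start with @, e.g. @breakingmash. All parameters except limit may be 0.
        'offset_id'   => $val['offset_id'] ?? 0,
        'offset_date' => $val['offset_date'] ?? 0,
        'add_offset'  => $val['add_offset'] ?? 0,
        'limit'       => ($val['limit'] ?? 0) ?: 10, // Number of posts to return.
        'max_id'      => $val['max_id'] ?? 0,      // Maximum post id.
        'min_id'      => $val['min_id'] ?? 0,      // Minimum post id; 0 means no lower bound.
        'hash'        => 0
    );

    $data = $MadelineProto->messages->getHistory($settings);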


Since I update many channels at once, it makes sense to reuse the same session rather than spend 2 seconds resuming it for each channel.



The final code is as follows:



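    // $url is either an array of channels or a comma-separated string of channel names.
    if (!is_array($url)){
        if (mb_strpos($url,',')!==false){
            $url = explode(',',$url);
        }else{
            $url = [$url];
        }
    }

    if (!empty($url)) {
        require_once ROOT_DIR.'/libs/MadelineProto/vendor/autoload.php';

        // Resume the session once and reuse it for every channel.
        $MadelineProto = new \danog\MadelineProto\API('session.madeline');

        $file_contents = [];
        foreach ($url as $val){
            // A bare channel name is wrapped into the same array form as full parameters.
            if (!is_array($val)){
                $val = array(
                    'url' => $val
                );
            }
            $settings = array(
                'peer'        => '@'.$val['url'],
                'offset_id'   => $val['offset_id'] ?? 0,
                'offset_date' => $val['offset_date'] ?? 0,
                'add_offset'  => $val['add_offset'] ?? 0,
                'limit'       => ($val['limit'] ?? 0) ?: 10,
                'max_id'      => $val['max_id'] ?? 0,
                'min_id'      => $val['min_id'] ?? 0,
                'hash'        => 0
            );
            // Results are keyed by channel name.
            $file_contents[$val['url']] = $MadelineProto->messages->getHistory($settings);
        }
    }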


After execution we get an array with the requested number of messages (posts), grouped by channel. Data about media attachments is included as well.



What remains is to save the text of each post and, if there is a photo or video, to get a preview and a caption for the media file, and to create a link for viewing the post.
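
This post-processing step might look like the sketch below. It assumes the response follows Telegram's messages.channelMessages layout: a 'messages' list whose items have a '_' type field, 'id', 'date', 'message' and, when present, 'media'. The function name extract_posts() and the output key names are made up for illustration and are not taken from my aggregator.

```php
<?php
// Sketch: turn one getHistory() response into flat post records.
// extract_posts() and its output keys are hypothetical names for this example.
function extract_posts(array $history, string $channel): array
{
    $posts = [];
    foreach ($history['messages'] ?? [] as $message) {
        // Service messages (pins, title changes and so on) carry no post text; skip them.
        if (($message['_'] ?? '') !== 'message') {
            continue;
        }
        $posts[] = [
            'post_id'   => $message['id'],
            'post_date' => $message['date'] ?? 0,       // Unix timestamp
            'post_text' => $message['message'] ?? '',
            'media'     => $message['media'] ?? null,    // raw media description, if any
            'post_link' => 'https://t.me/' . $channel . '/' . $message['id'],
        ];
    }
    return $posts;
}
```

Called as `extract_posts($file_contents['breakingmash'], 'breakingmash')`, it would return one record per regular post in that channel's response.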



5. Getting media attachments.



Fortunately, Telegram recently introduced HTML previews of posts, so instead of saving the binary data received from the client on your own server, you can simply take a link to the photo or video stored on Telegram's servers.



From the channel name and the post id we build a link of the form t.me/CHANNEL_NAME/POST_ID?embed=1, for example t.me/breakingmash/4193?embed=1
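
Building that link can be wrapped in a small function. This is just a sketch; the name telegram_embed_url() is hypothetical, and the https:// scheme is an assumption added so the link can be fetched directly.

```php
<?php
// Sketch: build the HTML-preview (embed) URL for one post.
// telegram_embed_url() is a hypothetical helper name.
function telegram_embed_url(string $channel, int $postId): string
{
    // Accept the channel name with or without a leading "@".
    $channel = ltrim($channel, '@');

    return 'https://t.me/' . rawurlencode($channel) . '/' . $postId . '?embed=1';
}
```

For the example above, `telegram_embed_url('@breakingmash', 4193)` gives https://t.me/breakingmash/4193?embed=1.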



From there everything is simple:



    private function telegram_media_parse($posts_data, $source){
        include_once(ROOT_DIR.'/libs/phpQuery.php'); // The HTML preview is parsed with the phpQuery library.
        foreach ($posts_data as &$post_data) {
            if (!empty($post_data['media'])){
                // Fetch the post's HTML preview with curl (loader is my own wrapper).
                $file_contents = self::loader($post_data['post_url'],'site');
                // Build a DOM document from the HTML.
                $document = phpQuery::newDocumentHTML($file_contents);
                // The preview image URL is taken from the element's background-image style.
                $post_data['post_image'] = preg_replace('/[\s\S]*background-image:[ ]*url\(["\']*([\s\S]*[^"\'])["\']*\)[\s\S]*/u','$1',$document->find($source['rules']['post_img_path'])->eq(0)->attr('style'));
                // Media caption.
                $post_data['post_description'] = $document->find($source['rules']['post_text_path'])->eq(0)->text();
            }
            // The raw media data is no longer needed.
            unset($post_data['media']);
        }
        unset($post_data); // Break the reference left over from the loop.
        return $posts_data;
    }
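
One caveat about the preg_replace call: when the pattern does not match, preg_replace returns the input string unchanged, so post_image could end up holding the whole style attribute. A possible alternative is sketched below using preg_match. The function name background_image_url() is hypothetical, and the pattern assumes the image URL itself contains no closing parenthesis.

```php
<?php
// Sketch: extract the URL from a CSS "background-image:url(...)" declaration.
// background_image_url() is a hypothetical helper; returns null when nothing is found.
function background_image_url(?string $style): ?string
{
    if ($style === null) {
        return null;
    }
    // Capture the text between "url(" and ")", ignoring optional quotes and spaces.
    if (preg_match('/background-image:\s*url\(\s*["\']?(.*?)["\']?\s*\)/i', $style, $m)) {
        return $m[1];
    }
    return null;
}
```

A null result can then be treated as "this post has no preview image" instead of being saved as a broken link.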


This completes the parsing: the posts can now be saved to the database or displayed on the page.



I hope my first post will be useful to someone. I am not leaving a link to my aggregator, as I am not sure whether that is allowed.

Source: https://habr.com/ru/post/349942/


