
A quick-and-dirty PHP parser, or how I rebuilt my music collection

So how did it all start? It all began at home, one winter Saturday night... And, of course, with a problem that needed a solution :)

The other day, through my own stupidity, I permanently lost my entire music collection (I'm a DJ and a musician). It was a real shame, because the collection was perfectly sorted and analyzed for bitrate, key, and so on.

Having come to terms with it, I thought: fine, I'll download all the tracks again, and I'll get them from promodj.com.
Why "promo" and not, say, SoundCloud? First, I spend more time on this site than on other music portals. Second, it has a very convenient search with filters along the lines of "Top for January 2017, 320 kbps quality, no longer than 10 minutes, and not a mashup".
As you can guess, I very quickly got tired of clicking the "Download" button by hand. And that's when the fun began :)

Task one: work out a regular expression for the link


I won't explain how to view the source code of a page element. I doubt anyone reading this needs that explained.

The DIV for each track looks like this:

image

And here is the code for this div:

<div class="track2 track2_no_avatar"> <div class="title"> <a amba="file:6224428" onclick="return cb(event);" href="http://promodj.com/sashasemenov/remixes/6224428/Syke_N_Sugarstarr_Feat_Alexandra_Prince_Are_You_Sasha_Semenov_Remix_Radio" class="invert">Syke 'N' Sugarstarr Feat. Alexandra Prince - Are You (Sasha Semenov Remix) (Radio)</a> </div> <div class="aftertitle"> <div id="fpp6224428" class="player"> <div class="playerr_standalone playerr __rototype_destructable" style="min-height: 70px;"> <div class="playerr_bigplaybutton"><img class="playerr_bigplaybutton" style="width: 30px; height: 30px; visibility: visible; margin-top: 20px;" src=""></div> <div class="playerr_bigdownloadbutton"><a class="playerr_bigdownloadbutton" href="http://promodj.com/download/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3" target="_self" style="width: 30px; height: 30px; margin-top: 19px;"><img style="visibility: visible; width: 30px; height: 30px;" src=""></a></div> <div style="padding-left: 30px; padding-right: 30px;"> <div class="playerr_waveformview" style="position: relative; width: 100%; height: 70px;"> <div style="position: absolute; width: 100%; height: 70px; clip: auto;"> <canvas style="width: 100%; height: 70px; opacity: 0.8;"></canvas> </div> <div style="position: absolute; width: 100%; height: 70px; clip: auto;"> <canvas style="width: 100%; height: 70px; opacity: 0.8;"></canvas> </div> <div style="position: absolute; width: 100%; height: 70px; clip: auto;"> <canvas style="width: 100%; height: 70px; opacity: 0.8; visibility: visible;" width="500" height="100"></canvas> </div> <div style="position: absolute; display: none; left: 0px; top: 0px; width: 1px; height: 70px; background-color: rgb(39, 39, 39);"></div> <div style="position: absolute; left: 0px; top: 0px; width: 100%; height: 70px; cursor: pointer; display: block; background-image: url("//cdn.promodj.com/core/i/playerr/playerr_0.gif"); background-position: 50% 50%; background-repeat: no-repeat;"></div> </div> </div> <div></div> </div> <div class="rbtify"></div> </div> <script> CORE.Player('fpp6224428', 'standalone.big', 6224428, { omitTitle: true, replace: true }); </script> <div class="notizer"></div> <div class="icons"> <span class="play_button" style="margin-left: 3px;"><a href="http://promodj.com/sashasemenov/remixes/6224428/Syke_N_Sugarstarr_Feat_Alexandra_Prince_Are_You_Sasha_Semenov_Remix_Radio?play=1" ambatitle="">10 204</a></span> <span class="comments_count"><a onclick="return cb(event);" href="http://promodj.com/sashasemenov/remixes/6224428/Syke_N_Sugarstarr_Feat_Alexandra_Prince_Are_You_Sasha_Semenov_Remix_Radio#comments" ambatitle=""><span class="cc17607699"><span class="newc">+18</span></span></a></span> <span class="downloads_count"><a onclick="return cb(event);" href="http://promodj.com/download/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3" ambatitle="">4 892</a></span> <span class="balls_count">PR <a href="#" id="fv7_6224428" ambatitle="  " onclick="Vote('file',6224428,this,'550cb446865f60595506b16fc51a35fd'); cb(event); return false;">228 ▲</a></span> <a class="bitrate" onclick="return cb(event);" href="http://promodj.com/source/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3">320</a> <span class="styles_list styles"><span class="styles"><b><a href="/music/deep_house?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=2">Deep House</a></b>, <b><a href="/music/club_house?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=2">Club House</a></b></span></span> <span class="small" style="margin-left: 6px;"></span> </div> </div> </div> 


At first glance, we are interested in this line:

 <a class="playerr_bigdownloadbutton" href="http://promodj.com/download/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3" target="_self" style="width: 30px; height: 30px; margin-top: 19px;"> 

It was this line that I started writing a regular expression for, to cut out the link to the audio file. But that turned out to be the wrong choice!

Besides the list of tracks matching the query, promodj.com pages also carry ads for promoted tracks. Every such ad shows a Download button with the same kind of link. That means promoted songs would get downloaded along with the tracks I actually want.

At first I was ready to shrug it off: fine, I'll get a few extra promoted tracks as a bonus. But once I figured out how much advertising junk I would end up with, I quickly dropped the idea.

Then the link's class name caught my eye. "bigdownloadbutton": maybe the site's developer just names everything that grandly, or maybe there is a smaller button too...

And there is! Remembering the small, inconspicuous download button under each track, I went looking for its markup to parse. Here it is:

 <span class="downloads_count"><a onclick="return cb(event);" href="http://promodj.com/download/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3" ambatitle="">4 892</a></span> 

Judging by the class name, this element was originally meant as a download counter. But we care about something else: there is a link inside it! Just in case, I checked, both visually and by searching the source, whether any other elements on the page use this class. Nope. Perfect!

It took a couple of minutes to put together a very simple regular expression:

 /<span class="downloads_count"><a onclick="return cb\(event\);" href="(.*)" ambatitle="Download">/im 

This pattern gives us every link on the page that sits inside a "downloads_count" SPAN. Great! On to phase two.
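
As a quick sanity check, the pattern can be run against a shortened copy of the markup shown earlier. This is only a sketch: the sample string is trimmed to the one element that matters, and its ambatitle="Download" value is an assumption chosen to match the pattern (the copied markup above shows an empty ambatitle). The parentheses in cb(event) are escaped so that they match literally instead of forming a group.

```php
<?php
// Trimmed sample of one track's markup.
// ASSUMPTION: ambatitle="Download" is set here so that the pattern matches.
$html = '<span class="downloads_count"><a onclick="return cb(event);" '
      . 'href="http://promodj.com/download/6224428/Syke%20%27N%27%20Sugarstarr%20Feat.%20Alexandra%20Prince%20-%20Are%20You%20%28Sasha%20Semenov%20Remix%29%20%28Radio%29%20%28promodj.com%29.mp3" '
      . 'ambatitle="Download">4 892</a></span>';

// \( and \) match the literal parentheses of cb(event).
$re = '/<span class="downloads_count"><a onclick="return cb\(event\);" href="(.*)" ambatitle="Download">/im';

preg_match_all($re, $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $match) {
    $links[] = $match[1]; // capture group 1 is the href value
}

// Decode the last path segment to get a readable file name.
$fileName = urldecode(basename($links[0]));

print_r($links);
echo $fileName, "\n";
```

If the real page uses a different ambatitle value, that part of the pattern has to be adjusted accordingly.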

Task two: generate links to pages for parsing


Since I wanted to restock my collection with tracks from different genres, and only the top ones, I set myself a precise goal.

For each style I'm interested in, parse the first 2 pages of results for the filters "sorted by rating, for 2017, for each month, quality of at least 320 kbps, no longer than 10 minutes, not a mashup" (mashups are lame, yuck; I want original music!).

Now I needed to generate a set of page links matching those criteria.

An ordinary request from the browser has the following URL and parameters:

    Protocol: http:
    Hostname: promodj.com
    Path name: /music/club_house
    Arguments:
        sortby = rating
        bitrate = high
        no_junk = 1
        period = date
        duration = 10m
        year = 2017
        month = 2
        page = 2

It's not hard to guess what is what. Better still, all parameters are passed via GET. All that's left is a small matter: generating the URLs for my query! I decided not to torture myself by typing all these URLs by hand.

We're programmers! Let's write a script that generates these URLs for us! Easy:

    // Styles (URL path segments) to parse
    $styles = array(
        'big_room_house',
        'club_house',
        'dance_pop',
        'deep_house',
        'electrohouse',
        'future_house',
        'g_house',
        'pop',
        'progressive_house',
        'russian_pop',
        'techhouse',
    );

    $urls = array();
    foreach ($styles as $style) {
        // $m < 12 covers months 1 to 11; use $m <= 12 to include December
        for ($m = 1; $m < 12; $m++) {
            // First two result pages for each month
            $urls[] = "http://promodj.com/music/$style/?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=$m&page=1";
            $urls[] = "http://promodj.com/music/$style/?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=$m&page=2";
        }
    }

    // Print the generated URLs, one per line
    foreach ($urls as $url) {
        echo $url."\r\n";
    }

And the output is a neat list of links to parse!
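
The same URLs can also be assembled from the parameter list rather than from a hand-written query string. The sketch below uses PHP's http_build_query() for that; buildListingUrl is a hypothetical helper name introduced only for this illustration, and the parameter values are the ones from the request shown above.

```php
<?php
// buildListingUrl is a hypothetical helper, not part of the original script.
function buildListingUrl($style, $month, $page)
{
    // The separator is passed explicitly so the result does not depend on php.ini.
    $query = http_build_query(array(
        'sortby'   => 'rating',
        'bitrate'  => 'high',
        'no_junk'  => 1,
        'period'   => 'date',
        'duration' => '10m',
        'year'     => 2017,
        'month'    => $month,
        'page'     => $page,
    ), '', '&');

    return "http://promodj.com/music/$style/?$query";
}

echo buildListingUrl('club_house', 2, 2), "\n";
// prints http://promodj.com/music/club_house/?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=2&page=2
```

Keeping the filters in an array makes it easier to change one of them (say, the year) without editing every URL string.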

image

Task three: parsing the links


So, we have the links to the pages to parse, and the regular expression is already written. Let's parse, gentlemen!

And how do we parse? With what? Pure PHP, of course! After all, we're wannabe l33t hackers and "but-you're-a-programmer" types!

Let's go all out. Searching for substrings by regular expression is easy in PHP: there is a function for it, preg_match_all(). But first we need to get the page's HTML in order to parse it.

And no, we won't use the DOM, and we won't even use cURL for this! We'll use a standard PHP function, file_get_contents(). In case anyone didn't know, this function can read not only a local text file but also the HTML generated by a server, if you pass it a URL as the argument (this requires the allow_url_fopen setting to be enabled).

Our parsing loop takes A WHOLE 8 LINES, formatting included. A WHOLE 8 LINES, CARL!!!

Here's the code, by the way:
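
As an aside, file_get_contents() returns false when a request fails, so a long unattended run may benefit from a small wrapper. The following is a minimal sketch, not the author's code: fetchPage is a hypothetical helper name, and the timeout and User-Agent values are arbitrary assumptions.

```php
<?php
// fetchPage is a hypothetical helper; timeout and user_agent values are assumptions.
function fetchPage($url)
{
    // These options apply to http:// URLs and are ignored for local files.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => 15,
            'user_agent' => 'promodj-parser-sketch',
        ),
    ));

    // @ suppresses the PHP warning; failure is reported through the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Possible usage inside the parsing loop:
//   $html = fetchPage($url);
//   if ($html === null) { continue; } // skip pages that failed to load
```

The function returns null instead of false so that a failed download is easy to tell apart from an empty page.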

    foreach ($urls as $url) {
        $html = file_get_contents($url);
        $re = '/<span class="downloads_count"><a onclick="return cb\(event\);" href="(.*)" ambatitle="Download">/im';
        preg_match_all($re, $html, $matches, PREG_SET_ORDER, 0);
        foreach ($matches as $key => $value) {
            echo $value[1]."\r\n";
        }
    }

Shall I explain what's what? Just in case, I will. The loop walks through the array of URLs generated earlier, and for each URL we fetch its HTML with file_get_contents(). Next comes the string holding the regular expression from step one.

After that we call preg_match_all() with the regular expression, the HTML, and an array variable that will receive all the matches. The PREG_SET_ORDER flag orders the results so that $matches[0] holds the first set of matches, $matches[1] holds the second set, and so on. Finally, an inner loop prints every link to the screen!

"HALLELUJAH!" I blurted out loud. And then I went for a smoke to think about what to do next...
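
To make the effect of PREG_SET_ORDER concrete, here is a small sketch comparing it with the default PREG_PATTERN_ORDER. The two-link sample string and the simplified pattern are made up for the illustration and are not taken from the site.

```php
<?php
// Made-up sample with two links; the pattern is simplified for the demo.
$html = '<a href="one.mp3">1</a> <a href="two.mp3">2</a>';
$re   = '/href="([^"]*)"/';

// PREG_SET_ORDER: one entry per match, each holding [full match, group 1].
preg_match_all($re, $html, $bySet, PREG_SET_ORDER);

// PREG_PATTERN_ORDER (the default): one entry per group, each holding all matches.
preg_match_all($re, $html, $byPattern, PREG_PATTERN_ORDER);

echo implode(' | ', $bySet[0]), "\n";     // first match: full text, then group 1
echo implode(' | ', $byPattern[1]), "\n"; // group 1 of every match
```

With PREG_SET_ORDER the link of each match is always at index 1 of its entry, which is why the loop above reads $value[1].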

Task four: downloading multiple files


By the simplest estimate, I ended up with 11 (styles to search) * 20 (results per search page) * 12 (months) * 2 (search pages) = 5,280 audio files. On top of that, loading each page for parsing takes time, and so does running the regular expression.

My first thought was to use cURL to download the files. But a minute later I was grinning with joy again :)

There is an excellent program, Download Master (this is not an ad)! The last time I saw it was back in 2010, when I was only just getting to know uTorrent.

The program's killer feature is that it can take a list of file URLs as input!

Now the second problem. If I take all the links at once, shove them into Download Master, and go drink tea / smoke / sleep, I'll end up with all the music in one folder! That is, not sorted by style.

The solution is simple and logical: parse each style in turn and feed it to Download Master. The moment I copied the resulting URLs, the polite downloader offered to start downloading them!
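
Instead of re-running the script once per style and copying the output by hand, the links could also be written to one text file per style. This is a sketch under assumptions, not what the author did: saveLinksByStyle is a hypothetical helper, the "<style>.txt" file naming is invented for the example, and whether a given download manager can import such a file directly needs to be checked in that program.

```php
<?php
// saveLinksByStyle is a hypothetical helper; the "<style>.txt" naming is an assumption.
function saveLinksByStyle(array $linksByStyle, $dir)
{
    $written = array();
    foreach ($linksByStyle as $style => $links) {
        $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $style . '.txt';
        // One URL per line, using the same \r\n line ending as the script's output
        file_put_contents($path, implode("\r\n", $links) . "\r\n");
        $written[$style] = $path;
    }
    return $written; // map of style => path of the file that was written
}

// Possible usage, with $linksByStyle filled by the parsing loop:
//   saveLinksByStyle(array('techhouse' => $techhouseLinks), __DIR__);
```

Each file can then be opened and its contents copied into the downloader separately, which keeps the styles apart.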

image

What's more, it immediately asked where to save the files and whether to apply the same settings to all the other files in the list!

image

image

Well, at that point I finally broke into a grin and went off to drink tea / smoke / sleep :)

In case anyone needs it, here is the full PHP code of the parser:

promodj_parser.php
    <?php
    // Full list of styles (URL path segments)
    $styles = array(
        'big_room_house',
        'club_house',
        'dance_pop',
        'deep_house',
        'electrohouse',
        'future_house',
        'g_house',
        'pop',
        'progressive_house',
        'russian_pop',
        'techhouse',
    );

    // Overrides the list above: parse one style per run,
    // so that each style can be downloaded into its own folder
    $styles = array(
        'techhouse',
    );

    $urls = array();
    foreach ($styles as $style) {
        // $m < 12 covers months 1 to 11; use $m <= 12 to include December
        for ($m = 1; $m < 12; $m++) {
            $urls[] = "http://promodj.com/music/$style/?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=$m&page=1";
            $urls[] = "http://promodj.com/music/$style/?sortby=rating&bitrate=high&no_junk=1&period=date&duration=10m&year=2017&month=$m&page=2";
        }
    }

    foreach ($urls as $url) {
        // Fetch the listing page and print every download link found in it
        $html = file_get_contents($url);
        $re = '/<span class="downloads_count"><a onclick="return cb\(event\);" href="(.*)" ambatitle="Download">/im';
        preg_match_all($re, $html, $matches, PREG_SET_ORDER, 0);
        foreach ($matches as $key => $value) {
            echo $value[1]."\r\n";
        }
    }
    ?>



UPD: In case anyone is curious, I ended up with 3350+ tracks with a total size of 36.5 GB. I don't think I could have done that by hand :)

Source: https://habr.com/ru/post/343294/

