
We continue parsing RSS, this time kinozal.tv's, using grep and wget/curl

RSS feed

In my previous post about automating downloads of new episodes from LostFilm's RSS feed, habrauser AmoN raised a fair point: the method I described cannot download releases when the RSS feed contains no direct links to their torrent files. The kinozal.tv tracker was given as an example. This post is dedicated to solving that problem ;)



Instead of an introduction


Let me briefly retell the gist of the previous post. Many popular torrent clients let you configure watch folders in their settings: the client monitors them for new files and automatically starts downloading whatever appears. The shell script written earlier periodically scans the tracker's RSS feed, selects the releases we are interested in, and saves their torrent files into the watch folder.



What's in a name?


The selection and filtering of the RSS feed in the previous approach was based on regular-expression analysis of the link to the torrent file. For example, a single glance at a link like http://www.lostfilm.tv/download.php/2035/Lost.s06e07.rus.PROPER.LostFilm.TV.torrent

immediately tells you the series, season, and episode. However, as AmoN correctly noted, not all tracker RSS feeds contain direct links to torrent files, which makes our download-automation task somewhat harder. That very feature prompted this post :)
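
For illustration, here is a minimal sketch of that kind of parsing; the exact regular expression from the previous post may have differed:

    # Pull the series/season/episode marker out of a direct torrent link
    # (the regex is illustrative, not necessarily the one used before)
    link='http://www.lostfilm.tv/download.php/2035/Lost.s06e07.rus.PROPER.LostFilm.TV.torrent'
    echo "$link" | grep -ioe '[a-z]*\.s[0-9]\{2\}e[0-9]\{2\}'
    # prints: Lost.s06e07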



Well then, let's get started


To start with, I took a careful look at the format of the RSS feed in question. Here is what I saw:


<item>

<title>The 3 Great Tenors - VA / Classic / 2002 / MP3 / 320 kbps</title>

<description>: - </description>

<link>http://kinozal.tv/details.php?id=546381</link>

</item>




Namely: the link not only lacks the release name, it is not a direct link to the torrent file at all. So to get the torrent file itself, you have to follow the link, and only the downloaded page contains a direct link to the file.
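
To make the idea concrete before working out the plan, here is a minimal two-request sketch; the cookie placeholders and the 'download.*\.torrent' pattern are the ones used in the steps below:

    # Request one: fetch the description page that the RSS <link> points to
    # (authorization cookies are required; placeholders as in the steps below).
    # Request two is implicit: grep the page's HTML for the relative torrent link.
    page='http://kinozal.tv/details.php?id=546381'   # the <link> from the RSS item above
    curl -sb "uid=***; pass=***; countrys=ua" "$page" | grep -m 1 -ioe 'download.*\.torrent'
    # prints a relative link ending in .torrent, to be resolved against http://kinozal.tv/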



We develop a plan


After a little thought, I came up with the following algorithm:

  1. read the RSS feed http://kinozal.tv/rss.xml and use grep to pick out the releases that interest us by their description:



    curl -s http://kinozal.tv/rss.xml | grep -iA 2 'MP3'



    where " -s " is an indication to "be quiet",

    " -i " is case-insensitive search,

    " -A 2 " - tells grep along with the found string to output two more following it (it is in them that the link of interest is contained)



  2. from the selected entries, use grep again to keep only the links:



    grep -ioe 'http.*[0-9]'



  3. loop over all the links found:



    for i in ... ; do ... ; done



    where, in place of the list, the "magic" backquotes `...` substitute the combined output of our previous two steps:



    for i in `curl -s http://kinozal.tv/rss.xml | grep -iA 2 'MP3' | grep -ioe 'http.*[0-9]'`; do ... ; done



  4. in the loop, download the page behind each link and, once again with grep, pull the link to the torrent file out of it:



    curl -sb "uid=***; pass=***; countrys=ua" $i | grep -m 1 -ioe 'download.*\.torrent'



    where, " -b "uid=***; pass=***; countrys=ua" -b "uid=***; pass=***; countrys=ua" -b "uid=***; pass=***; countrys=ua" " - option to set the transmitted cookies with authorization information,

    " -m 1 " - leaves only the first of two direct links to the torrent file (yes, the link to the same file is found twice on the cinema distribution pages)



    Note that neither the password nor the uid is transmitted in the clear! You can see their values by opening your browser's cookie viewer or, for example, by using a plugin for Firefox.



  5. download the torrent files with wget:



    wget -nc -qi - -B "http://kinozal.tv/" -P ~/.config/watch_dir --header "Cookie: uid=***; pass=***; countrys=ua"



    where I will single out " -B "http://kinozal.tv/" ", which sets the prefix/domain for resolving relative links when downloading (and relative links are exactly what the release description pages contain),

    and " --header "Cookie: uid=***; pass=***; countrys=ua" --header "Cookie: uid=***; pass=***; countrys=ua" --header "Cookie: uid=***; pass=***; countrys=ua" " - setting the header for the GET request (this time I wanted to transfer cookies in this way and not through the file :))



  6. repeat from the start of the loop




And what do we have


As a result, we get this "simple" command:

for i in `curl -s http://kinozal.tv/rss.xml | grep -iA 2 'mp3' | grep -ioe 'http.*[0-9]'`; do curl -sb "uid=***; pass=***; countrys=ua" $i | grep -m 1 -ioe 'download.*\.torrent' | wget -nc -qi - -B "http://kinozal.tv/" -P ~/.config/watch_dir --header "Cookie: uid=***; pass=***; countrys=ua"; done
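
For readability and reuse, the same pipeline can be kept in a small script; a sketch, where the script path, variable names, and comments are my additions:

    #!/bin/sh
    # A readable version of the one-liner above; save as e.g. ~/bin/kinozal-rss.sh

    FEED='http://kinozal.tv/rss.xml'
    COOKIES='uid=***; pass=***; countrys=ua'    # substitute your own cookie values
    WATCH_DIR="$HOME/.config/watch_dir"         # your torrent client's watch folder

    # steps 1-3: fetch the feed, filter items by description, keep only the page links
    for i in $(curl -s "$FEED" | grep -iA 2 'mp3' | grep -ioe 'http.*[0-9]'); do
        # step 4: take the first torrent link off the description page
        # step 5: hand it to wget, which resolves it against the tracker domain
        curl -sb "$COOKIES" "$i" \
            | grep -m 1 -ioe 'download.*\.torrent' \
            | wget -nc -qi - -B 'http://kinozal.tv/' -P "$WATCH_DIR" \
                   --header "Cookie: $COOKIES"
    done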



And for complete happiness, this command just needs to be put into cron:



*/15 * * * * for i in `curl -s http://kinozal.tv/rss.xml | grep -iA 2 'mp3' | grep -ioe 'http.*[0-9]'`; do curl -sb "uid=***; pass=***; countrys=ua" $i | grep -m 1 -ioe 'download.*\.torrent' | wget -nc -qi - -B "http://kinozal.tv/" -P ~/.config/watch_dir --header "Cookie: uid=***; pass=***; countrys=ua"; done > /dev/null 2>&1
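
If you keep the pipeline in the script sketched above instead, the crontab entry becomes much shorter (the script path is the same assumed example):

    # every 15 minutes, discarding output; cron runs the line through /bin/sh
    */15 * * * * $HOME/bin/kinozal-rss.sh > /dev/null 2>&1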



And on that note, allow me to take my leave :)




UPD. In the comments to my previous post in this series, several interesting suggestions were made for reducing the load on the server:

habrahabr.ru/blogs/p2p/87042/#comment_2609116 (check for the existence of files)

habrahabr.ru/blogs/p2p/87042/#comment_2609714 (using Last-Modified and ETag)
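
For the Last-Modified suggestion, curl can issue the conditional request by itself; a minimal sketch (the local file name is arbitrary), and note that the "-nc" flag already passed to wget above covers the file-existence check for the torrent files themselves:

    # -z <file> sends If-Modified-Since based on the local copy's timestamp
    # (on the first run the file does not exist yet, so curl simply downloads);
    # -R stamps the saved file with the server's Last-Modified time,
    # so an unchanged feed answers 304 and nothing is transferred again.
    curl -sR -z rss.xml -o rss.xml http://kinozal.tv/rss.xml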



UPD2. On apatrushev's advice, I replaced "head -1" with grep's "-m 1" option.

Source: https://habr.com/ru/post/87166/


