
Parsing the LostFilm RSS feed with grep and passing it to wget for download

RSS feed
At some point I got tired of manually checking LostFilm for new episodes, and I decided to automate the process. The thing is, many BitTorrent clients support so-called watch directories in their settings: as soon as a new torrent file appears in such a folder, the client immediately starts downloading it. A common practice, for example, is to create such a folder and grant write access to it over FTP. So all we need is to automatically drop the torrent file of a newly released episode into that folder, and the client will pick it up from there. I will now show exactly how to do that.

For reference: in Transmission the watch folder is configured with the watch-dir-enabled and watch-dir options, and in rTorrent you need to add the following line to the configuration file:
schedule = watch_directory,5,5,load_start=./watch/*.torrent
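
For example, in Transmission these two options live in its settings.json; a minimal sketch of the relevant entries (the path here is illustrative):

{
    "watch-dir": "/home/user/.config/watch_dir",
    "watch-dir-enabled": true
}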

Point one

So, first of all, we need to fetch the RSS feed from LostFilm. To do this, we use the wget command:

wget -qO - http://www.lostfilm.tv/rssdd.xml

here: the " -q " option tells wget not to display information about their work, i.e. " be quiet ";
" -O - " causes the loaded tape to be output not to a file, but to the standard output stream. This is done so that the data obtained can be passed down a pipeline to the input of the grep filter.

Point two

Now we need to extract all the links to torrent files from the resulting feed. To do this, we ask grep to look for substrings matching the regular expression 'http.*torrent'. Here the dot means "any character" and the asterisk means "repeated any number of times". That is, we will find every fragment that starts with "http" and ends with "torrent", which is exactly what links to torrent files look like. The command itself is:
grep -ioe 'http.*torrent'

where " -i " is case-insensitive search,
" -o " - select only the matched part of the substring (done to filter the tags that surround the link),
" -e " - search by regular expression

Point three

After we have found all the links to torrent files, we need to keep only the ones that interest us. For example, I like the series Lost, House MD, Lie to Me and Spartacus, so I will use them to show how to filter. All links to torrent files in the LostFilm RSS feed look like this:

http://lostfilm.tv/download.php/2030/The.Oscars.The.Red.Carpet.2010.rus.LostFilm.TV.torrent

So, to pick out the titles of the series I am interested in, I used the following regular expression: '[0-9]{4}/(lost|house|lie|spartacus)'. It looks for four digits in a row ("[0-9]{4}", where the number of repetitions is given in curly braces), followed by a slash, and then one of four alternatives for the series name ("(lost|house|lie|spartacus)", where the "|" character reads as OR). But in grep's basic regular expressions these special characters must be escaped with "\". In total, we have:

grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)'
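
To check it, here is a made-up episode link (the file name is my own illustration, patterned after the real one above):

echo 'http://lostfilm.tv/download.php/2031/Lost.s06e05.rus.LostFilm.TV.torrent' | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)'

This line passes the filter, because "lost" immediately follows the four digits and the slash. Note that the Oscars link above would not pass, since its file name does not start with one of the four series names.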

Point four

Now we have only the links to the torrent files of the series we are interested in, and all that is left is to download them into our torrent client's watch folder. The catch is that LostFilm will not let you download files without authorization. To be able to download, you have to send cookies with the authorization information along with the GET request. Fortunately, wget can load cookies from a specified file. Look at the wget call:

wget -nc -qi - -P ~/.config/watch_dir --load-cookies ~/.config/cookies.txt

where the option " -nc " tells the command not to reload files if we already have them on the disk,
" -q " - option above, indicates the command " to be quiet ",
" -i - " - get a list of files to load from standard input,
" -P ~/.config/watch_dir " - an indication of our tracking folder where files will be downloaded,
" --load-cookies=~/.config/cookies.txt " - use cookies from the specified file.

The cookies file has the following format:

.lostfilm.tv TRUE / FALSE 2147483643 pass < >
.lostfilm.tv TRUE / FALSE 2147483643 uid < >
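
This is the standard Netscape cookies.txt format that wget understands; the fields mean the following (the annotation is mine; they are shown here with spaces for readability, but the real file must separate them with tabs):

# domain       subdomains  path  secure  expiry       name  value
.lostfilm.tv   TRUE        /     FALSE   2147483643   uid   <your uid value>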


Note that neither the password nor the uid is transmitted in the clear! Their values can be seen in your browser's cookie viewer, or you can, for example, use a Firefox add-on that exports all cookies to a file, which can then be fed to wget.

The last point

And now, all together:

wget -qO - http://www.lostfilm.tv/rssdd.xml | grep -ioe 'http.*torrent' | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' | wget -nc -qi - -P ~/.config/watch_dir --load-cookies ~/.config/cookies.txt
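
If the one-liner feels unwieldy, you can also save it as a small script and call that from cron instead (the name and location are my choice):

#!/bin/sh
# ~/bin/lostfilm-fetch.sh -- fetch new torrents into the watch folder
wget -qO - http://www.lostfilm.tv/rssdd.xml \
  | grep -ioe 'http.*torrent' \
  | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' \
  | wget -nc -qi - -P ~/.config/watch_dir --load-cookies ~/.config/cookies.txt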

The final point :)

Well, now, for the final touch of automation, let's add all of this to cron, to run every 15 minutes:

*/15 * * * * wget -qO - http://www.lostfilm.tv/rssdd.xml | grep -ioe 'http.*torrent' | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' | wget -nc -qi - -P ~/.config/watch_dir --load-cookies ~/.config/cookies.txt > /dev/null 2>&1

where " > /dev/null 2>&1 " suppresses the output of the command and thus does not force cron generate you an email with the output of commands.

UPD. Added a follow-up article that deals with RSS feeds that do not contain direct links to torrent files.

UPD2. As rightly noted in the comments, this implementation hits the server every time, even when there is no new data.

Habr user Nebulosa suggested in a comment "checking for the existence of files so that wget does not pull the server every time",

and Guria, in turn, recommends: "so as not to parse and download the same thing over and over, and not to waste the server's resources, store the value of the Last-Modified header and send it back in the If-Modified-Since header. The server may also support ETag."
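
A minimal sketch of that idea, using wget's own timestamping (-N), which re-downloads the feed only when the server reports a newer version; the paths and file names here are my own illustration:

#!/bin/sh
# Fetch the feed only when it has changed on the server.
FEED_DIR=~/.cache/lostfilm
mkdir -p "$FEED_DIR"
cd "$FEED_DIR" || exit 1

# -N (--timestamping): skip the download if the local copy is up to date.
wget -Nq http://www.lostfilm.tv/rssdd.xml

# Parse and download only if the feed differs from the last processed copy.
if ! cmp -s rssdd.xml rssdd.xml.last; then
    grep -ioe 'http.*torrent' rssdd.xml \
      | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' \
      | wget -nc -qi - -P ~/.config/watch_dir --load-cookies ~/.config/cookies.txt
    cp rssdd.xml rssdd.xml.last
fi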

UPD3. If you have trouble passing the cookies, there is another way: replace the last wget command in the pipeline with the following:

wget -nc -qi - -P ~/ --header "Cookie: uid=***; pass=***"

Source: https://habr.com/ru/post/87042/

