A bit of wheel reinvention under Linux, or a homemade RSS aggregator thrown together on the kitchen table

Inspired by the article "Google continues to destroy RSS."

Some time ago I needed to follow several RSS feeds. The subject was fairly new to me, since I had only dealt with RSS occasionally before, so I started studying it and choosing a reader from scratch.

The results were... let's put it this way: not encouraging.
Toward cloud services with web interfaces I have a chronic allergy, half of it paranoia, and the subsequent funeral of Google Reader confirmed once again that paranoia gives no bad advice. The regular reshuffling of familiar interfaces in web services is not encouraging either.

Among locally installed readers there were several that I tried, but none of them became my working tool. There were annoying little things, such as a poorly customizable interface, weak filtering tools or strange logic for saving items. There was also a more global problem: one of the feeds I wanted to follow was the Flibusta book reviews, where the entire feed could easily turn over within a day, so some posts would inevitably be missed, and keeping a reader running around the clock was not an inspiring option at all. Besides, I had no desire to breed a zoo of assorted readers on my machine.

And so, after some time spent pondering the imperfection of this world and putting together a wish list, I picked up the file, as usual, and devoted a couple of days to reinventing the wheel.

The result has been working successfully for over a year now. Perhaps a ready-made recipe will save someone time and effort, although of course it will not suit everyone: it relies on the fact that I already had a home server running CentOS 6, connected to the Internet around the clock.

So, to build my own aggregator out of matchsticks and acorns, I needed, besides the server itself, a few additional packages:
rsstool
ssmtp
fdupes
... and also some time and a willingness to tinker with Perl.

The result was a rather primitive script. Quoting it in full here would not make much sense; an outline of its logic is enough:

After launch, the script reads the list of feeds from a text config with lines of the form:
habrahabr # http://habrahabr.ru/rss
That is: the name that will later be used when working with the feed, the separator "#" and the address of the feed itself. All this, plus the file names derived from the name for storing intermediate data, is kept in several arrays. Then a foreach () loop runs over these arrays and performs the following actions:

1. Using rsstool, the script downloads the feed as CSV with "@" as the delimiter, iconv converts the download from UTF-8 to KOI8-R, grep cuts off the header line, and the result is saved to the file for freshly downloaded items.

rsstool --wget --csv=@ @url[$i] (http://habrahabr.ru/rss) | iconv -c -f UTF-8 -t KOI8-R | grep SITE\@DATE\@URL\@TITLE\@DESC -v >/path_to_rss2email/feed_new_@rssname[$i].rss (/path_to_rss2email/feed_new_habrahabr.rss)

2. diff is run on the feed archive file and the freshly downloaded file. The difference between them, marked with ">", is exactly the news that has appeared since the last download. This difference is written to a separate file and also appended to the feed archive, so that on the next download these items count as already received.

 diff -iaEbwB --strip-trailing-cr $path_temp/@rssarcfile[$i] (.. /path_to_rss2email/habrahabr.rss) $path_temp/feed_new_@rssname[$i].rss (.. /path_to_rss2email/feed_new_habrahabr.rss) | grep ^\\>\\ >$path_temp/@rssdiffer[$i] (.. /path_to_rss2email/habrahabr.diff)

3. Now we can start rejoicing: we have the file habrahabr.diff with the latest items of the feed, and we can do anything we like with it. What we do is read the file line by line and, parsing lines of the form:

"Habrahabr / Hacked / Thematic / Posts" @ "1399799340" @ " habrahabr.ru/post/222391 " @ "Google continues to destroy RSS" @ "This week, namely on May 8, Google has disabled RSS-feed <... cut ...>. Read more → »

... create files in RFC 2822 format in a separate directory. After that the script moves on to the next feed in the list, or goes to sleep for a minute; meanwhile the generated letters are handled by a second script launched by cron. Using fdupes, it deletes all identical files in the directory with the RFC 2822 files, leaving one of each... (Yes, I know, this is a crutch. Yes, it is embarrassing. But the problem, that individual duplicate lines still slip past diff in spite of everything, surfaced some time after the script was written, and there was no time to get to the bottom of it.) ... and then sends them via ssmtp to a mailbox created specifically for this purpose.
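Step 3 can be sketched in Python roughly as follows. This is an illustration only, not the author's Perl: the function names, the field names and the example.com addresses are hypothetical, and the message is built as UTF-8 with the standard email library rather than hand-written KOI8-R headers. The subject suffix my_<feed>_rss follows the posted script.

```python
from email.message import EmailMessage

# Field order follows the CSV header SITE@DATE@URL@TITLE@DESC.
FIELDS = ("site", "date", "url", "title", "body")

def parse_record(line):
    """Split one '@'-delimited feed line into named fields."""
    # Drop the double quotes, as the script does, then split on '@'.
    # maxsplit=4 keeps any further '@' inside the description.
    parts = [p.strip() for p in line.replace('"', "").split("@", 4)]
    parts += [""] * (len(FIELDS) - len(parts))  # pad short lines
    return dict(zip(FIELDS, parts))

def build_message(record, feed_name, sender, recipient):
    """Build an RFC 2822 style message for one feed item."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"{record['title']} my_{feed_name}_rss"
    body = "\n".join(
        [record["site"], record["url"], record["title"], "", record["body"]]
    )
    msg.set_content(body)
    return msg

sample = (
    '"Habrahabr / Hacked / Thematic / Posts" @ "1399799340" @ '
    '" habrahabr.ru/post/222391 " @ "Google continues to destroy RSS" @ '
    '"This week, namely on May 8, Google has disabled RSS-feed"'
)
message = build_message(
    parse_record(sample), "habrahabr", "from@example.com", "to@example.com"
)
print(message.as_string())
```

Writing message.as_bytes() to a file gives something a sendmail-style program can read from standard input, which is how the second script feeds ssmtp.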

And now that the feed arrives in the mailbox, we can calmly read it anywhere, without suffering from the designers' next fit of creativity or the managers' momentous decisions, in the same familiar and beloved mail client (The Bat!, for example), with all its rich tooling of filters, sorting, search and other conveniences that make reading the feed more comfortable and efficient.

UPD: Having combed and commented them a little, I am posting the actual scripts. The first one downloads the feeds and creates the files with posts to be mailed, the second one sends them. Both are run from cron. Plus a piece of the config as an example.

rsstoemail.pl

    #! /usr/bin/perl
    use MIME::Base64;

    # Working directory
    $path_temp="/home/media/rsstool";

    # Read the config: each line holds an index, the feed name and the url;
    # the names of the archive and diff files are derived from the feed name
    open(config,"</etc/rss/rsstoemail.conf");
    $numb=0;
    while (!eof(config)) {
        $strcnf=<config>;
        $strcnf=~tr/\n//d;
        (@rss[$numb],@rssname[$numb],@rssfeeds[$numb])=split "#", $strcnf;
        @rssarcfile[$numb]="@rssname[$numb].rss";
        @rssdiffer[$numb]="@rssname[$numb].diff";
        $numb=$numb+1;
    }
    close config;

    # Create any archive file that does not exist yet
    foreach $srtname1(@rssarcfile) {
        if (! -e "$path_temp/$srtname1") {
            `echo >$path_temp/$srtname1`;
        }
    }

    # Exit unless exactly one copy of the script shows up in the process list
    $run=`ps -AH|grep -c rsstoemail.pl`;
    $run=~tr/\n//d;
    if ($run ne "1") {
        exit;
    }

    # Main loop over the feeds.
    # Note: $i takes the values of the first config field, which are then used as array indices.
    foreach $i(@rss) {
        # Skip empty entries
        if (@rss[$i] ne "") {
            $chkdiff="";
            # Reset the diff file
            `echo >$path_temp/@rssdiffer[$i]`;
            printf "run rsstool\n";
            # Download the feed, convert it to koi-8, cut off the CSV header
            `/sbin/myscript/rsstool --wget --csv=@ @rssfeeds[$i] | iconv -c -f UTF-8 -t KOI8-R| grep SITE\@DATE\@URL\@TITLE\@DESC -v >$path_temp/feed_new_@rssname[$i].rss`;
            # Compare the archive with the fresh download, keep only the added lines
            `diff -iaEbwB --strip-trailing-cr $path_temp/@rssarcfile[$i] $path_temp/feed_new_@rssname[$i].rss | grep ^\\>\\ >$path_temp/@rssdiffer[$i]`;
            # Count the records in the diff
            $chkdiff=`grep \@ $path_temp/@rssdiffer[$i] -c`;
            $chkdiff=~tr/\n//d;
            # Continue only if there is something new
            if ($chkdiff ne "0") {
                my @toemail = ();
                my @original = ();
                # Read the diff file, strip the leading "> " left by diff,
                # keep one copy for mailing and one for the archive
                open(diffopen,"<$path_temp/@rssdiffer[$i]");
                while (!eof(diffopen)) {
                    $string = <diffopen>;
                    $string=~s/^(\>\ )//;
                    $string=~tr/\n//d;
                    push (@toemail, $string);
                    push (@original, $string);
                }
                close diffopen;
                # Log the timestamp and the number of new messages
                $dirr=`date '+%Y-%m-%d-%H_%M_%S'`;
                $newm=$#toemail+1;
                printf "$dirr new messages $newm\n";
                # Turn every new record into a message file
                foreach $difs(@toemail) {
                    # Remove the double quotes
                    $difs=~tr/\"//d;
                    # Remove the trailing space
                    $difs=~s/\ $//;
                    $nmf=""; $date=""; $url=""; $bookname=""; $body="";
                    $resser=""; $ssurl=""; $sndusr="";
                    # Split the record into fields
                    ($nmf,$date,$url,$bookname,$body)=split '@', $difs;
                    # For the review feeds, the text before the first double space
                    # in the body goes into the "sendbyuser" line
                    if (@rssname[$i] eq "flibusta" or @rssname[$i] eq "librusec") {
                        $body=~m/(.+?)(\ \ )/;
                        $sndusr=$1;
                        $sndusr=~tr/\n//d;
                    }
                    # For the flibusta feeds, whichever host the url points to
                    # (proxy.flibusta.net, flibusta.net, www.flibusta.net),
                    # rewrite it into a download url on flibusta.net
                    if (@rssname[$i] eq "flibusta" or @rssname[$i] eq "flibustanewbooks") {
                        ($resser,$ssurl)=split "\/b\/",$url;
                        $resser=~m/(flibusta.net)/;
                        if ($1 eq "flibusta.net") {
                            $surl="http://flibusta.net/b/$ssurl/download\n";
                        } else {
                            $surl=$url;
                        }
                    } else {
                        $surl=$url;
                    }
                    # For new books, reorder the title as name - author - genre
                    if (@rssname[$i] eq "flibustanewbooks") {
                        ($book_auth,$book_name,$book_genre)=split "- ",$bookname;
                        $bookname="$book_name - $book_auth - $book_genre";
                    }
                    # Encode the title in base64 for use in the Subject: header
                    $booknameenc=encode_base64("$bookname my_@rssname[$i]_rss");
                    $booknameenc=~tr/\n//d;
                    $booknameenc="\=\?KOI8\-R\?B\?$booknameenc\?\=";
                    # Wait a second so that every file gets its own timestamp
                    sleep 1;
                    $dirr=`date '+%Y-%m-%d-%H_%M_%S'`;
                    # Write the message to a file named like msgrss_2014-05-20-19_26_09
                    open(fopen,">>$path_temp/mail/msgrss_$dirr");
                    print fopen "From\:\ fromemail\@domen\.ru\n";
                    print fopen "To\:\ toemail\@gmail\.com\n";
                    print fopen "Subject\: $booknameenc\n";
                    print fopen "MIME-Version: 1.0\n";
                    print fopen "Content-Type: multipart/mixed;\n";
                    print fopen " boundary=\"----------12012917B16D15D68\"\n";
                    print fopen "------------12012917B16D15D68\n";
                    print fopen "Content-Type: text/plain; charset=koi8-r\n";
                    print fopen "Content-Transfer-Encoding: 8bit\n";
                    print fopen "\n";
                    print fopen "$nmf\n";
                    print fopen "$surl\n";
                    print fopen "$bookname\n";
                    print fopen "\n";
                    if ($sndusr ne "") {
                        print fopen "sendbyuser:$sndusr\n";
                    }
                    print fopen "\n";
                    print fopen "$body\n";
                    print fopen "\n";
                    close fopen;
                }
                # Keep a copy of the diff in the archive directory
                `cp $path_temp/@rssdiffer[$i] $path_temp/arhiv/diffnew\_$dirr`;
                # Append the new records to the feed archive
                `echo >>$path_temp/@rssarcfile[$i]`;
                open(arcrss,">>$path_temp/@rssarcfile[$i]");
                foreach $difffs(@original) {
                    $difffs=~s/^(\>\ )//;
                    $difffs=~tr/\n//d;
                    print arcrss "$difffs\n";
                }
                close(arcrss);
            }
        }
    }
    printf "start send messages\n";

    # Cleanup of the archives with uniq -u, meant to run between 04:00 and 04:30.
    # Note: $hour and $min are not set anywhere in the posted script,
    # so as posted this block is skipped.
    if ($hour eq "04" and $min<"30") {
        foreach $srtname(@rssarcfile) {
            `uniq -u $path_temp/$srtname >$path_temp/$srtname.tmp`;
            `mv -f $path_temp/$srtname.tmp $path_temp/$srtname`;
        }
        sleep 1000;
    }
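The download-and-compare part of the loop (steps 1 and 2 of the description) can be sketched in Python like this. It is a simplified illustration under stated assumptions, not the author's method: the function names are hypothetical, the feed is assumed to be already downloaded as UTF-8 bytes, and the comparison is set-based rather than diff's line-sequence algorithm, so a line already present anywhere in the archive is never reported again.

```python
HEADER = "SITE@DATE@URL@TITLE@DESC"

def clean_download(utf8_bytes):
    """Drop the CSV header and anything KOI8-R cannot represent."""
    text = utf8_bytes.decode("utf-8", errors="ignore")
    lines = []
    for line in text.splitlines():
        if HEADER in line:  # same effect as grep -v on the header
            continue
        # Round-trip through KOI8-R, silently dropping characters that
        # do not fit, roughly what iconv -c does.
        lines.append(line.encode("koi8_r", errors="ignore").decode("koi8_r"))
    return lines

def normalize(line):
    # Roughly mirrors diff's -i and -w flags: ignore case and whitespace.
    return "".join(line.split()).lower()

def new_entries(archive_lines, fresh_lines):
    """Return fresh lines that are not yet in the archive, in feed order."""
    seen = {normalize(line) for line in archive_lines}
    result = []
    for line in fresh_lines:
        key = normalize(line)
        if not key or key in seen:
            continue  # blank, already archived, or repeated in this download
        seen.add(key)
        result.append(line)
    return result
```

Appending the returned lines to the archive after each run keeps the next comparison from reporting them again.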

rsssendemail.pl

    #! /usr/bin/perl

    # Working directories
    $path_all="/home/media/rsstool";
    $path_mail="$path_all/mail";
    $path_mail_arhiv="$path_all/mail_arhiv";
    $path_arhiv="$path_all/arhiv";

    # List duplicate files with fdupes, leaving out files already marked _send
    # and the empty separator lines
    @list_dupes=`/sbin/myscript/fdupes -f $path_mail|grep -v send|grep -v \^\$`;

    # Delete the duplicates
    foreach $fl(@list_dupes) {
        $fl=~tr/\n//d;
        `rm -f $fl`;
        printf "rm dupe $fl\n";
    }

    # List the files that have not been sent yet
    @list_files=`ls $path_mail/ |grep -v send`;
    sleep 5;

    # Send every file with ssmtp and rename it to *_send
    foreach $i(@list_files) {
        $i=~tr/\n//d;
        printf "sended $path_mail/$i\n";
        `/sbin/myscript/ssmtp toemail\@gmail.com \<$path_mail/$i`;
        `mv $path_mail/$i $path_mail/$i\_send`;
        sleep 1;
    }

    # If too many files have piled up in the mail directory,
    # move the first ones in ls order to the archive
    @all_files=`ls $path_mail/`;
    $cont=$#all_files-1000;
    if ($cont > 0) {
        printf "bigger 1000 to $cont\n";
        for ($y=0; $y<$cont;$y++) {
            @all_files[$y]=~tr/\n//d;
            `mv $path_mail/@all_files[$y] $path_mail_arhiv`;
            printf "@all_files[$y]\n";
        }
    }
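The duplicate-removal step that the sending script delegates to fdupes can be approximated in Python by hashing file contents. This is a sketch under assumptions, not a drop-in replacement: the function name is hypothetical, files whose names end in _send are treated as already sent and never deleted, and ordering is made deterministic by looking at sent files first, whereas the posted script relies on whatever order fdupes reports.

```python
import hashlib
from pathlib import Path

def remove_duplicate_files(directory):
    """Delete unsent files whose content matches a file seen earlier.

    Returns the names of the deleted files.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # Sent files first, so an unsent copy of a sent message is the one removed.
    files.sort(key=lambda p: (not p.name.endswith("_send"), p.name))
    seen = set()
    removed = []
    for path in files:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen and not path.name.endswith("_send"):
            path.unlink()
            removed.append(path.name)
        else:
            seen.add(digest)
    return removed
```

Because the message files are named by timestamp, sorting by name also means the earliest unsent copy of each message is the one that survives.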

rsstoemail.conf

    1#librusec#http://lib.rus.ec/polka/show/all/rss
    2#flibusta#http://flibusta.net/polka/show/all/rss
    3#nnm#http://nnm.me/rss/
    4#habrahabr#http://habrahabr.ru/rss
    5#3dnewssoft#http://www.3dnews.ru/software-news/rss/
    6#3dnewshard#http://www.3dnews.ru/news/rss/
    7#flibustanewbooks#http://flibusta.net/new/rss
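Reading a config like this takes only a few lines. The Python sketch below is an illustration with hypothetical names (parse_feed_config and the dictionary keys are not from the original scripts). It accepts both the two-field "name # url" form described in the text and the three-field form with a leading index shown in the listing, and derives the intermediate file names the same way the Perl script does.

```python
def parse_feed_config(text):
    """Parse feed config lines into a list of dictionaries."""
    feeds = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split("#", 2)]
        if len(parts) == 3 and parts[0].isdigit():
            # index#name#url, as in the posted config
            name, url = parts[1], parts[2]
        elif len(parts) >= 2:
            # name # url; re-join in case the url itself contains '#'
            name, url = parts[0], "#".join(parts[1:])
        else:
            continue  # skip malformed lines instead of failing
        if not name or not url:
            continue
        feeds.append({
            "name": name,
            "url": url,
            "archive": f"{name}.rss",         # everything received so far
            "fresh": f"feed_new_{name}.rss",  # latest download
            "diff": f"{name}.diff",           # new items only
        })
    return feeds

example = "4#habrahabr#http://habrahabr.ru/rss\n3#nnm#http://nnm.me/rss/\n"
for feed in parse_feed_config(example):
    print(feed["name"], feed["url"], feed["archive"])
```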

Source: https://habr.com/ru/post/223399/
