
Proxy RSS feed with Python

One of the most convenient ways to receive news, articles, etc. from various sites is RSS. However, every year the number of feeds steadily grows, while the time available to plow through it all keeps shrinking. Obviously, articles need to be filtered automatically somehow. That is what we will do today.


Existing Solutions


I tried a lot of services and offline readers, but could not find the ideal one. All the desktop applications had surprisingly little filtering functionality. One after another I installed more than a dozen different programs and did not find a single one that offered an interface for a simple task: filtering feed items by the link to the full article. Not to mention that I wanted to do it with regular expressions. And being able to edit the entries themselves was completely out of the question. I settled on the reader built into Opera, which lets you specify certain rules for filtering incoming feeds. It all worked tolerably well, until in some build Opera started marking extra articles as read for me. I readily believe that this was fixed long ago, but my trust was undermined.

So I turned to online services for help. There are a number of rather mediocre ones, like feedrinse.com, which can't really do anything, and, what is worse, are quite likely to suddenly cease to exist in a month or two. One service, however, stood out very strongly from the rest: Yahoo Pipes. It seemed to have everything needed, allowing flexible filtering of feeds, merging of feeds and much more, but occasional glitches and slowdowns negate all the service's conveniences. Besides, as I came to understand while fiddling with Pipes, visual programming is not for me.

Ideas


There was nothing left to do but sit down and reinvent the wheel myself.

At first I thought about writing my own offline RSS reader with convenient, extensive filtering and other goodies. I even wrote up a list of requirements for the program, a sort of spec, and an application skeleton... However, I quickly realized that it would take at best several months of work in the evenings, and the result would hardly turn out much more flexible or convenient than the existing alternatives. So the idea was discarded as unworkable.

Then I decided to revisit my requirements and keep only the essential ones.

As a result, I decided to write a kind of proxy, which parses the incoming RSS feed, represents each of its items as an object storing all the item's attributes, and then passes that object to a user-defined function written in a scripting language. In that function the end user can do anything with the received object: filter by any property of the item, change its contents, add something of their own. In other words, no restrictive visual interface, just pure programming, with potentially limitless possibilities. The output of the proxy is again an RSS feed, i.e. an XML document, which we point our reader at. After some hesitation, Python was chosen as the language, being simple and expressive.
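The core idea can be sketched in a few lines. This is an illustration only: the Entry class and hook below are made up for the example, not the library used later in the article.

```python
# Sketch of the proxy idea: each feed item becomes an object, and a
# user-defined hook decides whether to keep, drop, or rewrite it.

class Entry:
    def __init__(self, title, link):
        self.title = title
        self.link = link

def user_hook(entry):
    # Drop anything whose link points to the "php" blog; keep the rest.
    if "/blogs/php/" in entry.link:
        return None
    # The hook may also rewrite attributes before returning the entry.
    entry.title = entry.title.strip()
    return entry

def filter_feed(entries, hook):
    # Apply the hook to every entry; None means "cut it from the feed".
    return [e for e in (hook(entry) for entry in entries) if e is not None]

entries = [
    Entry("  Async in JS ", "http://habrahabr.ru/blogs/javascript/1/"),
    Entry("PHP news", "http://habrahabr.ru/blogs/php/2/"),
]
kept = filter_feed(entries, user_hook)
print([e.title for e in kept])  # only the non-PHP entry survives
```

The only contract between the proxy and the user is the hook's signature: entry in, entry (possibly modified) or None out.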

Implementation


First, I decided to move all my RSS subscriptions to Google Reader, thereby getting a unified interface and a few perks, such as unlimited history and the ability to tag items and feeds.

Next, I needed a web server able to run Python scripts. I took the path of least resistance: I installed IIS7 on my machine and set up Python on it (the setup is described, for example, here). Those who ideologically reject IIS can of course take Apache or something else.

Then we write a filtering script for each feed. I'll show everything using Habr as the example, so I named the file habrahabr.py and put it in the web server's directory; a small library that wraps the Google Reader API should go there as well. Everything needed, along with an example, can be downloaded from here.

So, back to the habrahabr.py script; it should contain something like this:

import re
import functools

import lib
import const

def hook_channel(channel):
    pass

def hook_entry(reg_exclude, entry):
    result = reg_exclude.match(entry._link)
    if result is None:
        return entry
    else:
        return None

def run():
    gr = lib.GReader()
    if not gr.login(const.EMAIL, const.PASSWORD):
        print "login failed"
        return

    pattern = 'http://habrahabr.ru/blogs/(%s)/.*'
    w = 'javascript|php|Flash_Platform'
    reg_exclude = re.compile(pattern % w, re.IGNORECASE)
    fhook_entry = functools.partial(hook_entry, reg_exclude)

    xml = gr.read_tag("habrahabr.ru", 300, hook_channel, fhook_entry)

    print "Content-Type: text/xml"
    print
    print xml

if __name__ == '__main__':
    run()


It's all pretty simple:

First of all, we connect to Google Reader (details on authentication with Google for applications can be read here); for ease of use, the authorization request is wrapped into one function and called like this: gr.login(const.EMAIL, const.PASSWORD).

Next, I compile a regular expression with which I will filter out uninteresting items by the link to the full version of the article.
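To see the filtering pattern in action on its own, here is the same pattern shape with example blog names, checked against two links (the article IDs are invented):

```python
import re

# Same pattern shape as in the script: match links into the Habr blogs
# that should be excluded from the feed.
pattern = 'http://habrahabr.ru/blogs/(%s)/.*'
blogs = 'javascript|php|Flash_Platform'
reg_exclude = re.compile(pattern % blogs, re.IGNORECASE)

# match() anchors at the start of the string, so only links into the
# listed blogs are caught; everything else passes through.
print(bool(reg_exclude.match('http://habrahabr.ru/blogs/php/133032/')))     # True
print(bool(reg_exclude.match('http://habrahabr.ru/blogs/python/133032/')))  # False
```

Adding or removing a blog from the feed is then just a matter of editing the alternation string.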

Then the call gr.read_tag("habrahabr.ru", 300, hook_channel, fhook_entry) fetches the last 300 articles stored in gReader under the habrahabr.ru folder (it does not have to be a folder name; it can be an arbitrary tag), and we pass in two hooks:

hook_channel is not very interesting; it only allows changing the channel's parameters (so far only its title).

fhook_entry allows filtering and modifying items. It receives as its input parameter an instance of the Entry class (from the lib.py file), which is the parsed feed item; its attributes correspond to the item's attributes. Note that any of these attributes can be changed arbitrarily, and the changed value will be inserted into the feed. The hook must return the modified Entry object, or None if we want to cut this entry from the feed.
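One detail the walkthrough glosses over is how the two-argument hook_entry becomes the one-argument hook that read_tag expects: functools.partial binds the compiled regex in advance. A minimal self-contained illustration (Entry here is a stand-in for the class in lib.py):

```python
import re
import functools

class Entry:
    # Stand-in for the Entry class from lib.py, for illustration only.
    def __init__(self, link):
        self._link = link

def hook_entry(reg_exclude, entry):
    # Two-argument hook: drop entries whose link matches the pattern.
    return None if reg_exclude.match(entry._link) else entry

# functools.partial fixes the compiled regex as the first argument,
# yielding a one-argument callable of the shape read_tag expects.
reg_exclude = re.compile('http://habrahabr.ru/blogs/(php)/.*', re.IGNORECASE)
fhook_entry = functools.partial(hook_entry, reg_exclude)

kept = fhook_entry(Entry('http://habrahabr.ru/blogs/python/1/'))
dropped = fhook_entry(Entry('http://habrahabr.ru/blogs/php/2/'))
print(kept is not None, dropped is None)  # True True
```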

The read_tag function returns an XML string in RSS 2.0 format. I print the resulting string, prepending the meta information for the web server.
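For a sense of what the proxy emits, here is a hand-rolled sketch of the RSS 2.0 shell built with the standard library; the element names follow the RSS 2.0 spec, while the channel and item values are invented for the example:

```python
import xml.etree.ElementTree as ET

# Build the minimal RSS 2.0 skeleton: rss -> channel -> item(s).
rss = ET.Element('rss', version='2.0')
channel = ET.SubElement(rss, 'channel')
ET.SubElement(channel, 'title').text = 'habrahabr.ru (filtered)'
ET.SubElement(channel, 'link').text = 'http://habrahabr.ru/'
ET.SubElement(channel, 'description').text = 'Proxied feed'

item = ET.SubElement(channel, 'item')
ET.SubElement(item, 'title').text = 'Example article'
ET.SubElement(item, 'link').text = 'http://habrahabr.ru/blogs/python/1/'

xml = ET.tostring(rss, encoding='unicode')

# As in the script, a CGI response is just the header, a blank line,
# and then the document body.
print("Content-Type: text/xml")
print()
print(xml)
```

Any reader that speaks RSS 2.0 will accept this output regardless of how the items were filtered on the way through.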

Well, that's all; it only remains to subscribe to the new address in your RSS reader. For me it looks like this: 127.0.0.1:8080/python/habrahabr.py.

You can read about the gReader API in detail here. And here you can find the RSS specification.

Summing up


The script currently stores the login and password for a Google account in plain text. For me this is not critical: I trust everyone who has physical access to the computer, and I don't believe a virus will get to this script and extract the password. Still, it is probably worth creating a separate account just for gReader; that should solve all such problems.
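If the plaintext password still bothers you, one common workaround (not part of the original script) is to read the credentials from environment variables instead of const.py; the variable names below are arbitrary:

```python
import os

# Read the gReader credentials from the environment at startup, so
# nothing secret is stored in the script's source files.
EMAIL = os.environ.get('GREADER_EMAIL', '')
PASSWORD = os.environ.get('GREADER_PASSWORD', '')

if not (EMAIL and PASSWORD):
    # Fail early instead of sending an empty login to Google.
    print('set GREADER_EMAIL and GREADER_PASSWORD first')
```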
Overall, the script is rough and not very feature-rich, but it meets all the requirements I set for it.
There are plans to add auto-generation of a button for sending an article to Read It Later or Evernote, and, for feeds with podcasts, automatic downloading of the podcast to a specified folder.
I also hope for some feedback from the Habr community to help improve the script.

Source: https://habr.com/ru/post/133032/

