
New-topic alerts for Habrahabr using Python

I like it when a program is entirely your own: you understand the purpose of every character and why the solution is exactly the way it is. In this post I want to share my own parser for Habrahabr topics, written in Python without third-party libraries.
When a new topic appears, a pop-up window reports it.


The current version is for Linux with GNOME, but only one line needs changing to adapt it to your system.
In words, it works like this:

1) Download the site's root page to a local file under a chosen name.
2) Open the file.
3) Read line by line up to the marker for the blog name and extract the name.
4) Keep reading up to the marker for the topic title and extract the title.
5) Do the same for the topic date.
6) Compare the title of each topic with the last one seen.
7) If the titles differ, show a pop-up window with the blog, title and date of the new topic.
8) If they match, exit the program.
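
Steps 6 to 8 can be sketched on their own, separately from the HTML parsing. This is only an illustration written for Python 3 (the script itself is Python 2); the function name new_topics and the sample titles are made up for the example.

```python
def new_topics(topics, last_seen):
    """Return the topics that come before `last_seen` in a newest-first list."""
    fresh = []
    for topic in topics:
        if topic == last_seen:
            break  # step 8: reached the topic we already know, so stop
        fresh.append(topic)  # step 7: this one is new and should be reported
    return fresh


# Made-up titles, newest first, as on the main page.
page = ["Topic C", "Topic B", "Topic A"]
print(new_topics(page, "Topic B"))  # ['Topic C']
```

With an empty last_seen (the first run) nothing matches, so every topic on the page counts as new.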
Now the Python implementation (the code fragments follow one another in order, with nothing omitted):

Specify the interpreter, the encoding, the required modules and the user's home folder where the temporary files will be stored (edit the last line for your system):
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys

HOME_DIR = "/home/user"


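We define the variables the script needs:
LAST_DIR = HOME_DIR + "/.habralast"  # file holding the title of the last seen topic
HTML_DIR = HOME_DIR + "/.habr.html"  # file the downloaded page is saved to
SHOW_FIRST_TIME = 5                  # how many topics to show on the first run
n = 1         # current page number
new_addr = 0  # set to 1 once the newest topic has been saved
count = 0     # number of topics shown so far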


Check whether the script is running for the first time: if the .habralast file exists, the script has been launched before; otherwise we create an empty file. The topic1 variable receives the title of the last seen topic, or an empty string if this is the first run:
if os.path.isfile(LAST_DIR):
    fp = open(LAST_DIR, "r")
    topic1 = fp.readline()
    fp.close()
    last_existed = 1
else:
    fp = open(LAST_DIR, "w")
    topic1 = ""
    fp.close()
    last_existed = 0
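
The same first-run check could be wrapped in a small function. A minimal sketch, assuming Python 3; the name read_last_topic is hypothetical, and the demonstration uses a throwaway temporary directory rather than the real home folder.

```python
import os
import tempfile


def read_last_topic(path):
    """Return (last_topic, existed); create an empty file on the first run."""
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as fp:
            return fp.readline().rstrip("\n"), True
    # First run: create the file so later runs can tell they are not the first.
    with open(path, "w", encoding="utf-8"):
        pass
    return "", False


# Demonstration in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, ".habralast")
    first = read_last_topic(state)   # file does not exist yet
    second = read_last_topic(state)  # file was created by the first call

print(first, second)
```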


Download the Habrahabr root page (the main page shows 10 topics; if you missed more than that, the script opens habrahabr.ru/pageN/, where N is the page number):
while(1):
    if n == 1:
        url = "habrahabr.ru"
    else:
        url = "habrahabr.ru/page" + str(n) + "/"
    wget = "wget " + url + " -O " + HTML_DIR
    try:
        os.system(wget)
    except:
        print "Cannot connect to server"
        sys.exit()
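
One thing worth knowing about this step: os.system returns the command's exit status rather than raising an exception when the command fails, so a failed wget would probably slip past a try/except. Below is a rough sketch of an alternative check, assuming Python 3 and the subprocess module; the helper names page_url and download are mine, not part of the script.

```python
import subprocess


def page_url(n):
    """Build the address of page n the same way the script does."""
    if n == 1:
        return "habrahabr.ru"
    return "habrahabr.ru/page" + str(n) + "/"


def download(url, dest):
    """Run wget and return True only if it exits with status 0."""
    try:
        result = subprocess.run(["wget", url, "-O", dest])
    except FileNotFoundError:
        return False  # wget itself is not installed
    return result.returncode == 0


print(page_url(1))  # habrahabr.ru
print(page_url(3))  # habrahabr.ru/page3/
```

The caller could then print "Cannot connect to server" and exit whenever download returns False.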


Open the downloaded page as a text file:
    index = open(HTML_DIR, "r")


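These strings are the foundation: later we will read the file line by line until we find exactly these fragments:
    s = ' <a href="http://habrahabr.ru/'  # marker of the line with the blog name
    ss = ' <a'                            # marker of the line with the topic title
    sss = ' <div class="published"><!--    ISO   title -->'  # marker of the line before the topic date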

I have checked many times that these fragments do not occur anywhere else on the page, so we can use them safely without any confusion.

We check each line in turn for the blog-name marker (the limit of 2000 was chosen experimentally from the number of lines the HTML file devotes to topics), extract the name and assign it to the blog variable:
    for i in range(2000):
        line = index.readline()
        if s in line:
            blog_s = line.find('">')
            blog_e = line.find("</a>")
            blog = line[blog_s+2:blog_e]
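
The extraction itself is just slicing between two markers. Here is a small standalone sketch of it, assuming Python 3; the function name link_text and the sample line are invented for the illustration and are not taken from the real page.

```python
def link_text(line):
    """Return the text between the first '">' and the first '</a>' in the line."""
    start = line.find('">')
    end = line.find("</a>")
    if start == -1 or end == -1 or end < start + 2:
        return None  # the line does not look like a link
    return line[start + 2:end]


# Hypothetical markup, only for the example.
sample = ' <a href="http://habrahabr.ru/blogs/python/">Python</a>'
print(link_text(sample))  # Python
```

Unlike the fragment above, this version returns None when a marker is missing instead of slicing with -1.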


Once a blog is found, we look for the topic title marker (the page source shows that the title is never more than 50 lines after the blog) and extract the title into the topic variable. If the topic differs from the last one seen (topic != topic1), we write it to the .habralast file. For the topics that follow we skip this write, so that an older topic does not overwrite it, since the newest topic comes first:
            for j in range(50):
                line = index.readline()
                if ss in line:
                    topic_s = line.find('">')
                    topic_e = line.find("</a>")
                    topic = line[topic_s+2:topic_e]
                    if topic.find("</span>") != -1:
                        topic = topic[topic.find("</span>")+7:]
                    if topic != topic1:
                        if new_addr == 0:
                            fp = open(LAST_DIR, "w")
                            fp.write(topic)
                            fp.close()
                            new_addr = 1
                        print "Blog:\t" + blog
                        print "Topic:\t" + topic

Note that tags are sometimes inserted at the beginning of the topic title; we do not need them, so we filter them out. We print the blog and topic names (if you like, all the print lines can be commented out).
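
The tag filter can be tried in isolation. A minimal sketch, assuming Python 3; strip_leading_span is a hypothetical name and the sample title is made up.

```python
def strip_leading_span(title):
    """Drop everything up to and including the first '</span>', if there is one."""
    marker = "</span>"
    pos = title.find(marker)
    if pos != -1:
        return title[pos + len(marker):]
    return title


# Hypothetical title with a tag inserted in front of it.
print(strip_leading_span('<span class="flag">New</span>Parsing Habrahabr'))  # Parsing Habrahabr
print(strip_leading_span("Plain title"))  # Plain title
```

Here len(marker) plays the role of the constant 7 in the script.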

Then we again read line by line until the topic date marker appears, allowing no more than 100 lines for this:
                        for k in range(100):
                            line = index.readline()
                            if sss in line:
                                line = index.readline()
                                time_s = line.find("<span>")
                                time_e = line.find("</span>")
                                date = line[time_s+6:time_e]
                                print "Date:\t" + date + "\n"
                                notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n<i>" + date + "</i>'"
                                os.system(notify)
                                count += 1
                                if count == SHOW_FIRST_TIME and last_existed == 0:
                                    os.system("rm -f " + HTML_DIR)
                                    sys.exit()
                                break
                        break

The line os.system(notify) creates the pop-up window with information about the new topic; its content is assembled on the line above. On the first run, once SHOW_FIRST_TIME topics have been shown, we delete the downloaded HTML file as no longer needed and exit the program.
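
The date search reads a limited number of lines and, once the marker is found, takes the date from the line that follows it. A rough sketch of that idea, assuming Python 3 and an in-memory list of lines instead of a file; the name find_date, the shortened marker and the sample lines (including the date) are all invented for the example.

```python
def find_date(lines, marker, limit=100):
    """Scan at most `limit` lines; return the <span> text on the line after `marker`."""
    it = iter(lines)
    for _, line in zip(range(limit), it):
        if marker in line:
            following = next(it, "")  # the date sits on the next line
            start = following.find("<span>")
            end = following.find("</span>")
            if start != -1 and end != -1:
                return following[start + len("<span>"):end]
            return None
    return None


# Hypothetical page fragment, only for the example.
sample = [
    "<p>text</p>",
    ' <div class="published">',
    "  <span>15 September 2011</span>",
    "</div>",
]
print(find_date(sample, '<div class="published">'))  # 15 September 2011
```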

As soon as we reach the last topic we have already seen, we delete the HTML file and exit:
                    else:
                        os.system("rm -f " + HTML_DIR)
                        sys.exit()


That was the first iteration of the main loop. If you missed more topics than fit on the first page, the next page is opened and everything repeats:
    n += 1
    index.close()


In the GNOME environment, the notify-send command is responsible for pop-up windows. On your system it may be a different command. In that case, edit the line notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n<i>" + date + "</i>'" to use your command and its syntax.
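
Because the command is assembled as a single shell string, a title containing an apostrophe would likely break the quoting. One possible way around this, sketched here for Python 3, is to pass the arguments as a list so that no shell is involved; the function names are hypothetical, and whether the <i> markup is rendered depends on the notification daemon.

```python
import subprocess


def notify_command(blog, topic, date):
    """Build the notify-send call as an argument list: summary first, then body."""
    summary = "Habrahabr.ru: " + blog
    body = topic + "\n<i>" + date + "</i>"
    return ["notify-send", summary, body]


def show_notification(blog, topic, date):
    """Show the pop-up; requires notify-send to be installed."""
    subprocess.run(notify_command(blog, topic, date))


# Only the command is built here; nothing is displayed.
print(notify_command("Python", "It's a new topic", "today"))
```

To adapt this to another system, only the list returned by notify_command would need to change.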

I deliberately did not left-align the fragments above, so that it is clearer what follows what and what each part depends on. I did have to shift the third fragment from the bottom two tabs to the left, otherwise it did not look very nice. So, to avoid confusion, here is the whole script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys

HOME_DIR = "/home/user"
LAST_DIR = HOME_DIR + "/.habralast"
HTML_DIR = HOME_DIR + "/.habr.html"
SHOW_FIRST_TIME = 5
n = 1
new_addr = 0
count = 0

if os.path.isfile(LAST_DIR):
    fp = open(LAST_DIR, "r")
    topic1 = fp.readline()
    fp.close()
    last_existed = 1
else:
    fp = open(LAST_DIR, "w")
    topic1 = ""
    fp.close()
    last_existed = 0

while(1):
    if n == 1:
        url = "habrahabr.ru"
    else:
        url = "habrahabr.ru/page" + str(n) + "/"
    wget = "wget " + url + " -O " + HTML_DIR
    try:
        os.system(wget)
    except:
        print "Cannot connect to server"
        sys.exit()
    index = open(HTML_DIR, "r")
    s = ' <a href="http://habrahabr.ru/'
    ss = ' <a'
    sss = ' <div class="published"><!--    ISO   title -->'
    for i in range(2000):
        line = index.readline()
        if s in line:
            blog_s = line.find('">')
            blog_e = line.find("</a>")
            blog = line[blog_s+2:blog_e]
            for j in range(50):
                line = index.readline()
                if ss in line:
                    topic_s = line.find('">')
                    topic_e = line.find("</a>")
                    topic = line[topic_s+2:topic_e]
                    if topic.find("</span>") != -1:
                        topic = topic[topic.find("</span>")+7:]
                    if topic != topic1:
                        if new_addr == 0:
                            fp = open(LAST_DIR, "w")
                            fp.write(topic)
                            fp.close()
                            new_addr = 1
                        print "Blog:\t" + blog
                        print "Topic:\t" + topic
                        for k in range(100):
                            line = index.readline()
                            if sss in line:
                                line = index.readline()
                                time_s = line.find("<span>")
                                time_e = line.find("</span>")
                                date = line[time_s+6:time_e]
                                print "Date:\t" + date + "\n"
                                notify = "notify-send 'Habrahabr.ru: " + blog + "' '" + topic + "\n<i>" + date + "</i>'"
                                os.system(notify)
                                count += 1
                                if count == SHOW_FIRST_TIME and last_existed == 0:
                                    os.system("rm -f " + HTML_DIR)
                                    sys.exit()
                                break
                        break
                    else:
                        os.system("rm -f " + HTML_DIR)
                        sys.exit()
    n += 1
    index.close()


I ran this script in a terminal, so both the script's output and the pop-up windows appeared. If you use the script only in a terminal and do not need pop-up windows, comment out the os.system(notify) line. Otherwise, you can put the command python path_to_script/HabraParser.py in crontab and run it, say, every 30 minutes. Or you can run it manually; the decision is yours! The main thing is to set the directory where the files will be saved.

Here is how it looks at the bottom of my screen when a new topic is found:
image

That's all... I find it convenient!

Source: https://habr.com/ru/post/127806/

