
First steps in Python programming

A couple of months ago I took up learning Python. After reading about data structures, string handling, generators, and the basics of OOP, I wondered what useful program I could write to apply all of this to a real task.
By a happy coincidence, some acquaintances asked me to download the cartoon "Miracles on the Turns" (TaleSpin) for them.

Getting to the point


On one of the popular trackers in UA-IX I found the cartoon, but each episode was posted separately, and I did not want to click the Download button 65 times. At that moment I remembered Python.
I immediately started looking for information on how to fetch files from a site. Thanks to Google and the well-known site Stack Overflow, the answer came quickly: it turned out you can "pull" files by importing a library and adding a couple of lines. After testing on small files to see how it all worked, I moved on to the next stage: collecting all the download links and the corresponding file names.
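For reference, the "couple of lines" mentioned above boil down to roughly the following (a minimal sketch with a placeholder URL, using the same Python 2 urllib module as the rest of this article):

    import urllib

    # Fetch a remote file and save it locally; the URL is only an example
    urllib.urlretrieve('http://example.com/episode01.avi', 'episode01.avi')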
The links and the file names were never listed inside a single tag, so they had to be collected separately.
To collect the links I used the lxml library, which has already been reviewed on this site. After downloading and installing it, I started writing the program itself. The code is presented below:
    #!/usr/bin/env python
    import urllib
    import lxml.html

    load = 'load'  # every real download link contains the substring 'load'

    # Fetch the tracker page and parse it into an lxml document tree
    page = urllib.urlopen('http://www.***.ua/view/12345678')
    doc = lxml.html.document_fromstring(page.read())

    # Keep only the anchors that are actual download buttons
    for link in doc.cssselect('p span.r_button_small a'):
        if link.text is None:
            continue
        if load not in link.get('href'):
            continue
        print 'http://***.ua' + link.get('href')

All collected links were saved to a file for further work. The if statements filter the data, so that only the links actually used to download a file to the computer remain.
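The saving step itself is not shown in the script above; a minimal sketch (reusing the doc and load names from that script, with the output file called 'link' to match the download script further below) might look like this:

    # Hypothetical saving step: write the filtered links to the file 'link'
    out = open('link', 'w')
    for link in doc.cssselect('p span.r_button_small a'):
        if link.text is None or load not in link.get('href'):
            continue
        out.write('http://***.ua' + link.get('href') + '\n')
    out.close()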
The original file names were not very readable, so as soon as the program extracted a name, it immediately replaced it with a more convenient one. As a result every file got a name of the form "Miracles on the Turns. XX", where XX is the episode number.
Program Code:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import urllib
    import lxml.html

    # The original Cyrillic title did not survive extraction;
    # an English placeholder with the same structure is used here
    file_name = u'Miracles on the Turns. '
    episode = 0  # episode counter appended to each name

    page = urllib.urlopen('http://www.***.ua/view/12345678')
    doc = lxml.html.document_fromstring(page.read())

    # Keep only the anchors whose text is an .avi file name, then rename
    for name in doc.cssselect('tr td a'):
        if name.text is None:
            continue
        if not name.text.endswith('.avi'):
            continue
        name.text = file_name + str(episode) + name.text[-4:]
        print name.text.encode('utf8')
        episode += 1

Since the Python 2.6 interpreter was used, I had to call the encode method to work correctly with Cyrillic text. The collected names were also saved to a file.
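The underlying issue is that in Python 2 a unicode string has to be encoded into bytes before it can be printed; a tiny illustration (the string here is just an example):

    # Printing a unicode object with non-ASCII characters can raise
    # UnicodeEncodeError in Python 2, so encode it explicitly first
    title = u'\u0427\u0443\u0434\u0435\u0441\u0430'  # Cyrillic for 'Miracles'
    print title.encode('utf8')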
After both programs had run, there were two text files on the hard disk: one with the download links and the other with the episode titles.
To tie each link to its file name I used a dictionary: the link became the key, and the file name was stored as its value. After that, all that remained was to take each key, pass it to the download function, and specify the destination path and file name to save.
The code that performs these actions:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import urllib

    # The two text files produced by the previous scripts
    links = open('link', 'r')
    names = open('file_name', 'r')

    path = '/media/6A9F550C59BC1824/TaleSpin/'

    # Pair every link with its file name: the link is the key,
    # the file name is stored in the value
    download = dict(zip(links, names))

    loadf = []  # names of episodes already downloaded during this run
    for link in download.iterkeys():
        name = download[link].rstrip()
        if name not in loadf:
            # rstrip() drops the trailing newline left over from reading the files
            urllib.urlretrieve(link.rstrip(), path + name)
            loadf.append(name)

A list is also used, recording the names of the episodes that have already been downloaded; its purpose is to ensure that if the download is interrupted, episodes already present on the hard disk are not downloaded again.
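The list guards against duplicate names within a single run; for resuming after a full restart, a variant (a sketch, not part of the original scripts) could consult the filesystem instead:

    import os
    import urllib

    # Hypothetical resume-friendly variant: skip an episode whose file
    # already exists on disk from a previous, interrupted run
    for link, name in download.iteritems():
        name = name.rstrip()
        if not os.path.exists(path + name):
            urllib.urlretrieve(link.rstrip(), path + name)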

Conclusion

Writing all this code probably took more time than clicking the Download button by hand would have. But watching the program work was much more fun, and it brought new knowledge as well.

References

  1. "LXML" or how to parse HTML with ease
  2. Official lxml documentation
  3. Documentation for the urllib library
  4. Python Tips, Tricks, and Hacks (Part 2)

Thank you for your attention.
The website address is hidden so that the post is not treated as advertising.


Source: https://habr.com/ru/post/134863/

