
Downloading audio from mail.ru



The task we face is downloading music from a site that provides such an opportunity. We will use the Python programming language.

To accomplish this, we will need knowledge about site parsing and working with media files.

The figure above (not reproduced here) shows the general algorithm for parsing the site: fetch a page, parse it, extract the data, and follow or download the extracted links. We will perform the parsing using the BeautifulSoup and requests modules, and the re module will be enough for working with the text.
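Since the figure itself is lost, here is a minimal sketch of that fetch-parse-extract loop; the function names are illustrative only and do not appear in the final program:

import urllib.request
from bs4 import BeautifulSoup

def fetch(url):
    return urllib.request.urlopen(url).read()  # raw HTML of the page

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    return [a['href'] for a in soup.find_all('a', href=True)]  # every link on the page

# start_url is a placeholder; the real program builds it from the search query
# for link in extract_links(fetch(start_url)):
#     ...follow or download the link...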

Import


import requests                # sending HTTP requests
import urllib.request          # opening pages over HTTP
from bs4 import BeautifulSoup  # parsing HTML
import re                      # regular expressions for working with text

Variable declaration and main procedure


We will need only two lists and one variable to store information:

page_count = []  # list of search-result pages that contain the songs we need
perehod = ''     # current link to a track page; we work with one link at a time
download = []    # direct download link and the track name

We write a procedure in which, first of all, we count all the pages on the site that contain the songs we need.

if __name__ == '__main__':
    # input() reads the search query from the user
    u = str(input('Enter the name of the song or artist:\n'))
    # search URL with the query appended at the end
    base_url = 'http://go.mail.ru/zaycev?sbmt=1486991736446&q=' + u
    count = 0
    page_count = [base_url]  # the list starts with the first results page
    print('Counting pages. Please wait...')
    while True:
        # try/except: when get_page_count() finds no "next page" link it raises
        # TypeError, which tells us we have reached the last page
        try:
            # get_html() fetches the page at page_count[count]; get_page_count()
            # returns the URL of the next results page, which we append to the list
            page_count = page_count + [get_page_count(get_html(page_count[count]), page_count)]
            count = count + 1
        except TypeError:
            break  # break leaves the loop
    # len() gives the number of pages found for the query
    print('Number of pages with your query -', len(page_count))

Getting HTTP Pages


To make the code above work, we need to write two functions: the first receives a URL and passes it as a parameter to the second, which in turn fetches the data from that link for parsing.

def get_html(url):
    # url comes from page_count[count]
    response = urllib.request.urlopen(url)  # open the page over HTTP
    return response.read()  # read() returns the raw HTML of the page

The next function is the parsing itself. The main thing we need in order to understand the structure of the site is to look at its HTML layout. To do this, open the site and press Shift+Ctrl+C (the browser's element inspector) to see the source code with the names of all the elements.
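As a tiny self-contained illustration of how BeautifulSoup's find() locates a tag by the attributes seen in the inspector (the sample markup below is invented for the demo):

from bs4 import BeautifulSoup

sample = '<div><a class="next" href="/search?page=2">Next</a></div>'  # invented markup
soup = BeautifulSoup(sample, "html.parser")
link = soup.find('a', {'class': 'next'})  # search by tag name and attribute
print(link['href'])  # -> /search?page=2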

def get_page_count(html, page_count):
    # html is the result of get_html(page_count[count])
    soup = BeautifulSoup(html, "html.parser")  # build a parse tree from the HTML
    # find the "next page" link by the text on the button; 'Дальше' is the site's
    # Russian "Next" label, which can be checked in the page source (Shift+Ctrl+C)
    href = soup.find('a', text='Дальше')
    base_url = 'http://go.mail.ru'  # href holds a relative path, so prepend the domain
    page_count = base_url + href['href']  # ['href'] extracts the attribute value from the tag
    return page_count  # full URL of the next results page

Important! All data obtained using BeautifulSoup is not a string data type, but a separate “beautiful soup” data type.
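A quick check makes the difference visible:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">link</a>', "html.parser")
tag = soup.find('a')
print(type(tag))       # <class 'bs4.element.Tag'> - not a string
print(type(str(tag)))  # <class 'str'> only after explicit conversion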

The next task is to get the new download address that appears when you click "Download" on each of the pages. Note that we will not use an array here, since in that case we would have to fill it completely and only then start downloading, which would greatly slow the program down. Instead we take a new link each time and work with it directly. To do this, we write the second function and add to the main procedure:

print('Searching for track pages...')  # progress message (the original text was in Russian)
try:
    for i in page_count:  # walk through every results page found earlier
        # parsing1() extracts the track-page links from the current page; we work
        # with each page's links right away instead of storing them all first
        perehod = parsing1(get_html(i), perehod)
except TypeError:
    print('Search finished')  # no more links to process

In the third function we will for the first time use re.findall(pattern, string), which searches for all occurrences of the given pattern in a string variable.
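A short self-contained example of what re.findall returns:

import re

print(re.findall(r'\d+', 'track 12 of 30'))  # -> ['12', '30'] - a list of all matches
print(re.findall(r'xyz', 'nothing here'))    # -> [] - an empty list when nothing matches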

def parsing1(html, perehod):
    soup = BeautifulSoup(html, "html.parser")  # parse the results page
    perehod = []  # reset the list: we only keep links from the current page
    # find_all('a') returns every link on the page (a results page holds about 20)
    for row in soup.find_all('a'):
        # keep only the links that lead to track pages; the original regex was
        # stripped in translation, so the fragment below is an assumption
        if re.findall(r'zaycev', str(row)):
            # str(row) converts the tag, since "a" is a BeautifulSoup type, not a string
            perehod = perehod + [row['href']]  # store the href attribute of the matching tag
    return perehod

Now the procedure takes a new link from each page every time, and we need to find the address of the next page, which contains the "Download" button with the final download address. In the main procedure we write:

for y in perehod:  # walk through every track page found on the current results page
    # download will hold the direct download link and the track name
    download = parsing2(get_html(y), download)

Getting HTTP for direct download


In the last function we will find the links for direct downloading. Here we use two new tools:

  1. re.sub(pattern, replacement, string) finds all matches of the pattern and replaces them with the given value. A function reference can also be passed as the replacement parameter.
  2. .text gets the text out of the HTML code (it works only on a BeautifulSoup search result; the plain string data type will not work). Both are shown in the short demo below.
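A short demonstration of both (the markup is invented for the example):

import re
from bs4 import BeautifulSoup

row = BeautifulSoup('<h1>\n\t\tSong title\n</h1>', "html.parser").find('h1')
print(repr(row.text))                   # the text still contains the layout whitespace
print(re.sub(r'[\n\t]', '', row.text))  # -> 'Song title' with the whitespace removed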

def parsing2(html, download):
    soup = BeautifulSoup(html, "html.parser")  # parse the track page
    # the "Download" link is an "a" tag identified by its "id" attribute;
    # both names were found by inspecting the page source
    table = soup.find('a', {'id': 'audiotrack-download-link'})
    href = ''  # will hold the direct download link
    name = ''  # will hold the track name
    if table != None:  # the link exists on this page; otherwise fall through to else
        # the track name sits in an "h1" header with this class
        row = soup.find('h1', {'class': "block__header block__header_audiotrack"})
        # strip the layout whitespace that surrounds the name in the HTML
        name = re.sub(r'\n\t\t\t\t\t\t', '', row.text)
        href = table.get('href')  # .get('href') does the same as ['href']
        download = [href] + [name]  # the link first, then the track name
        return download
    else:
        return download  # nothing found: return download unchanged

Writing the file


All that remains is to download and write the file.

  1. requests.get sends an HTTP request; we then check the response with req.status_code, which holds the HTTP status code (the full list can be found on the Internet; status 200 means a successful request).
  2. open opens the file for writing in binary mode; 'wb' creates a file with the given name if one does not exist, and the with statement closes it afterwards.

if download != []:  # make sure we actually found a link and a name
    # stream=True asks requests not to download the response body immediately
    req = requests.get(download[0], stream=True)
    if req.status_code == requests.codes.ok:  # proceed only on HTTP 200
        with open(download[1] + '.mp3', 'wb') as a:  # the track name becomes the file name
            a.write(req.content)  # write() saves the downloaded bytes to the file
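One caveat: req.content still loads the whole file into memory, so stream=True brings no real benefit in the block above. If memory use matters, the usual pattern is a chunked write (a sketch, not part of the original article):

req = requests.get(download[0], stream=True)
if req.status_code == requests.codes.ok:
    with open(download[1] + '.mp3', 'wb') as a:
        for chunk in req.iter_content(chunk_size=8192):  # write the file piece by piece
            a.write(chunk)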

Using only two modules, BeautifulSoup and requests, you can solve virtually any task associated with parsing a site. With this knowledge you can adapt the program to download other data, even from other sites. Good luck in your work!

Source: https://habr.com/ru/post/322502/

