
Implementing multithreaded data processing in Python for site parsing

Parsing is slowed down by the significant time spent processing data; running the work in several processes helps speed it up. The site we will parse is the "Handbook of World Notes" (banknotes.finance.ua), from which we will collect exchange rates of one currency against another.

Below is the program code, which cuts the processing time roughly in half.

Imports


import requests                    # HTTP requests
from bs4 import BeautifulSoup      # HTML parsing
import csv                         # writing CSV files
from multiprocessing import Pool   # pool of worker processes

Main procedure


def main():
    url = 'http://banknotes.finance.ua/'
    links = []  # list of links to the currency pages
    all_links = get_all_links(get_html(url), links)
    # process the pages in two worker processes
    with Pool(2) as p:
        p.map(make_all, all_links)

if __name__ == '__main__':
    main()
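A minimal, self-contained sketch of what `Pool.map` does in `main()`: the iterable is split across the worker processes, and the results come back in the original order. The worker and data here are hypothetical stand-ins, not part of the parser:

```python
from multiprocessing import Pool

def square(n):
    # stand-in for per-page work such as make_all(url)
    return n * n

if __name__ == '__main__':
    with Pool(2) as p:
        # the list is divided between two worker processes;
        # results are returned in the original order
        results = p.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
```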

Getting the page HTML


def get_html(url):
    r = requests.get(url)
    return r.text

The worker function


def make_all(url):
    html = get_html(url)
    data = get_page_data(html)
    write_csv(data)

Collecting links from the homepage


def get_all_links(html, links):
    # clear the output file before the workers start appending to it
    f = open('coin.csv', 'w')
    f.close()
    # parse the HTML and collect the links to the currency pages
    soup = BeautifulSoup(html, 'lxml')
    href = soup.find_all('div', class_="wm_countries")
    for i in href:
        for link in i.find_all('a'):
            links += [link['href']]
    return links

Parsing nested pages


def get_page_data(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        name = soup.find('div', 'pagehdr').find('h1').text
    except AttributeError:
        name = ''
    try:
        exchange = soup.find('div', class_='wm_exchange')
        massiv_price = ([pn.find('b').text
                         for pn in exchange.find_all('a', class_='button', target=False)]
                        + [pr.text for pr in exchange.find_all('td', class_='amount')])
        # interleave the currency labels with their amounts
        if len(massiv_price) == 6:
            massiv_price = (massiv_price[0] + massiv_price[3]
                            + massiv_price[1] + massiv_price[4]
                            + massiv_price[2] + massiv_price[5])
        elif len(massiv_price) == 4:
            massiv_price = (massiv_price[0] + massiv_price[2]
                            + massiv_price[1] + massiv_price[3])
    except AttributeError:
        massiv_price = ''
    data = {'name': name, 'price': massiv_price}
    return data
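The index shuffle in `get_page_data` pairs each currency label (the first half of the list) with its amount (the second half). A small sketch of the same idea with `zip`; the labels and figures below are invented for illustration:

```python
def interleave(parts):
    # first half of the list holds labels, second half holds amounts
    half = len(parts) // 2
    return ''.join(label + amount
                   for label, amount in zip(parts[:half], parts[half:]))

print(interleave(['USD', 'EUR', 'RUB', '27.1', '30.5', '0.35']))
# USD27.1EUR30.5RUB0.35
```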

Writing to the file


def write_csv(data):
    with open('coin.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow((data['name'], data['price']))
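Since both worker processes append to `coin.csv` concurrently, rows from different pages can in principle interleave. One safer variant (a sketch, not the author's code) is to let `Pool.map` return the parsed dicts to the parent process, which then writes the file once; `parse_one` here is a hypothetical stand-in for the real page parser:

```python
import csv
from multiprocessing import Pool

def parse_one(url):
    # stand-in for: return get_page_data(get_html(url))
    return {'name': url, 'price': '1.0'}

def run(urls, path='coin.csv'):
    with Pool(2) as p:
        rows = p.map(parse_one, urls)   # results are gathered in the parent
    # a single writer: no concurrent appends to the file
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for data in rows:
            writer.writerow((data['name'], data['price']))

if __name__ == '__main__':
    run(['http://example.com/a', 'http://example.com/b'])
```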

The proposed code can be adapted for parsing other sites (and for other tasks), taking each site's structure into account.


Source: https://habr.com/ru/post/323238/

