
Implementing multithreaded data processing in Python for site parsing

Parsing is slowed down by the significant time spent processing data; running the work in several processes helps speed it up. The site we will parse is the "Handbook of World Notes" (banknotes.finance.ua), from which we will collect exchange rates of one currency against another.

Below is the program code, which cuts the processing time roughly in half.

Imports


import requests                    # HTTP requests
from bs4 import BeautifulSoup      # HTML parsing
import csv                         # writing CSV files
from multiprocessing import Pool   # pool of worker processes

Main procedure


def main():
    url = 'http://banknotes.finance.ua/'
    links = []  # list of links to the currency pages
    all_links = get_all_links(get_html(url), links)
    # process the pages in two worker processes
    with Pool(2) as p:
        p.map(make_all, all_links)

if __name__ == '__main__':
    main()
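A minimal, self-contained sketch of what `Pool.map` does in `main()`: the iterable is split across the worker processes, and the results come back in the original order. The worker and data here are hypothetical stand-ins, not part of the parser:

```python
from multiprocessing import Pool

def square(n):
    # stand-in for per-page work such as make_all(url)
    return n * n

if __name__ == '__main__':
    with Pool(2) as p:
        # the list is divided between two worker processes;
        # results are returned in the original order
        results = p.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
```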

Getting the page HTML


def get_html(url):
    r = requests.get(url)
    return r.text

The worker function


def make_all(url):
    html = get_html(url)
    data = get_page_data(html)
    write_csv(data)

Collecting links from the homepage


def get_all_links(html, links):
    # clear the output file before the workers start appending to it
    f = open('coin.csv', 'w')
    f.close()
    # parse the HTML and collect the links to the currency pages
    soup = BeautifulSoup(html, 'lxml')
    href = soup.find_all('div', class_="wm_countries")
    for i in href:
        for link in i.find_all('a'):
            links += [link['href']]
    return links

Parsing nested pages


def get_page_data(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        name = soup.find('div', 'pagehdr').find('h1').text
    except AttributeError:
        name = ''
    try:
        exchange = soup.find('div', class_='wm_exchange')
        massiv_price = ([pn.find('b').text
                         for pn in exchange.find_all('a', class_='button', target=False)]
                        + [pr.text for pr in exchange.find_all('td', class_='amount')])
        # interleave the currency labels with their amounts
        if len(massiv_price) == 6:
            massiv_price = (massiv_price[0] + massiv_price[3]
                            + massiv_price[1] + massiv_price[4]
                            + massiv_price[2] + massiv_price[5])
        elif len(massiv_price) == 4:
            massiv_price = (massiv_price[0] + massiv_price[2]
                            + massiv_price[1] + massiv_price[3])
    except AttributeError:
        massiv_price = ''
    data = {'name': name, 'price': massiv_price}
    return data
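The index shuffle in `get_page_data` pairs each currency label (the first half of the list) with its amount (the second half). A small sketch of the same idea with `zip`; the labels and figures below are invented for illustration:

```python
def interleave(parts):
    # first half of the list holds labels, second half holds amounts
    half = len(parts) // 2
    return ''.join(label + amount
                   for label, amount in zip(parts[:half], parts[half:]))

print(interleave(['USD', 'EUR', 'RUB', '27.1', '30.5', '0.35']))
# USD27.1EUR30.5RUB0.35
```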

Writing to the file


def write_csv(data):
    with open('coin.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow((data['name'], data['price']))
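Since both worker processes append to `coin.csv` concurrently, rows from different pages can in principle interleave. One safer variant (a sketch, not the author's code) is to let `Pool.map` return the parsed dicts to the parent process, which then writes the file once; `parse_one` here is a hypothetical stand-in for the real page parser:

```python
import csv
from multiprocessing import Pool

def parse_one(url):
    # stand-in for: return get_page_data(get_html(url))
    return {'name': url, 'price': '1.0'}

def run(urls, path='coin.csv'):
    with Pool(2) as p:
        rows = p.map(parse_one, urls)   # results are gathered in the parent
    # a single writer: no concurrent appends to the file
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for data in rows:
            writer.writerow((data['name'], data['price']))

if __name__ == '__main__':
    run(['http://example.com/a', 'http://example.com/b'])
```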

The proposed code can be adapted for parsing other sites (and for other tasks), taking each site's structure into account.


Source: https://habr.com/ru/post/323238/

