
Parsing weblancer using a PROXY

Objective


  1. Parse the website through a proxy server.
  2. Save the data in CSV format.
  3. Write a search engine based on the collected data.
  4. Build an interface.




We will use the Python programming language. The site we will download data from is www.weblancer.net (a parser for the old version of this site was posted earlier); its job offers are listed at www.weblancer.net/jobs . From it we will collect the title, price, number of applications, category, and a brief description of each proposed job.

Accessing the site through a proxy means visiting it from a different IP address. This is useful when parsing a site that protects itself by banning IP addresses (that is, when you hit the site too often within a short period of time).
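As a quick illustration of the idea, the requests library accepts a proxies dictionary; the sketch below assumes a working HTTPS proxy at the placeholder address 1.2.3.4:8080:

import requests

# route the request through a proxy; '1.2.3.4:8080' is a placeholder, not a real proxy
page = requests.get('https://www.weblancer.net/jobs/',
                    proxies={'https': 'https://1.2.3.4:8080'})
print(page.status_code)   # 200 means the request went through the proxy successfully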

Import modules



For the parsing itself, the requests and BeautifulSoup modules are enough. The module of the same name, csv, will help us save the data in CSV format. The tkinter module will handle the interface, which is rather painful (if you want a nicer interface, I recommend the PyQt5 module). Searching through the data will be handled by the re module.

import requests                    # HTTP requests (used for the pages fetched through a proxy)
import urllib.request              # plain HTTP requests from the standard library
from lxml import html              # parsing XML and HTML, used for the proxy list
import re                          # regular expressions for the search
from bs4 import BeautifulSoup      # parsing HTML pages
import csv                         # reading and writing CSV files
import tkinter                     # GUI toolkit
from tkinter import *              # bring Tk, Text, Label, Button, END into the namespace
from tkinter.filedialog import *   # file dialogs

Variables



We create a list where we will store previously used proxies, and two string variables: the first holds the site's address, and the second is declared global so that the functions below can read and update it (in some cases global variables can adversely affect a program, so use them with care).

global proxy1                                  # the proxy currently in use, shared between functions
proxy1 = ''
BASE_URL = 'https://www.weblancer.net/jobs/'   # the page we are going to parse
massiv = []                                    # proxies that have already been used


Variables for tkinter:

root = Tk()                                        # main window
root.geometry('850x500')                           # window size
txt1 = Text(root, width = 18, height = 2)          # input field for the search query
txt2 = Text(root, width = 60, height = 22)         # output field for the search results
lbl4 = Label(root, text = '')                      # shows the proxy currently in use
btn1 = Button(root, text = 'Parse')                # starts parsing
btn2 = Button(root, text = 'Search')               # starts the search
btn3 = Button(root, text = 'Clear')                # clears the text fields
lbl1 = Label(root, text = 'Enter a search query')  # hint above the search field
lbl2 = Label(root, text = '')                      # shows parsing progress
lbl3 = Label(root, text = '')                      # shows the number of pages found


variable.grid(row, column) sets the position of an element in the window; bind attaches a handler to a mouse click or keystroke. The following code is placed at the very end of the program:

btn1.bind('<Button-1>', main)      # left click on btn1 starts parsing
btn2.bind('<Button-1>', poisk)     # left click on btn2 starts the search
btn3.bind('<Button-1>', delete)    # left click on btn3 clears the fields
lbl2.grid(row = 4, column = 1)
lbl4.grid(row = 5, column = 1)
lbl3.grid(row = 3, column = 1)
btn1.grid(row = 1, column = 1)
btn3.grid(row = 2, column = 1)
btn2.grid(row = 1, column = 2)
lbl1.grid(row = 2, column = 2)
txt1.grid(row = 3, column = 2)
txt2.grid(row = 6, column = 3)
root.mainloop()                    # start the event loop


Main function



First of all, let's write the main function (why a function rather than plain top-level code? Later we will need to start it with bind on a button click, and that is easier to do with a function); the other procedures it relies on are added afterwards:



def main(event):  # event is required because the function is triggered via bind
    page_count = get_page_count(get_html(BASE_URL))      # number of job pages on the site
    lbl3.config(text='Pages found: ' + str(page_count))  # show the page count in lbl3
    page = 1
    projects = []                                        # parsed job offers are collected here
    while page < page_count:                             # loop until every page has been fetched
        proxy = Proxy()                                  # build a fresh proxy list
        proxy = proxy.get_proxy()                        # pick a proxy address from it
        lbl4.update()
        lbl4.config(text='Proxy: ' + proxy)              # show the proxy currently in use
        global proxy1
        proxy1 = proxy                                   # remember it so it is not picked again
        try:
            for i in range(1, 10):                       # parse up to nine pages with one proxy, then switch
                page += 1
                lbl2.update()
                lbl2.config(text='Progress: %d%%' % (page / page_count * 100))
                r = requests.get(BASE_URL + '?page=%d' % page, proxies={'https': proxy})  # request the page through the proxy
                parsing = BeautifulSoup(r.content, "lxml")                                 # parse the page's HTML
                projects.extend(parse(BASE_URL + '?page=%d' % page, parsing))              # extract the job offers
                save(projects, 'proj.csv')               # save after every page so progress is not lost
        except requests.exceptions.ProxyError:
            continue                                     # the proxy refused the connection - take the next one
        except requests.exceptions.ConnectionError:
            continue                                     # the connection failed - take the next one
        except requests.exceptions.ChunkedEncodingError:
            continue                                     # the response was cut off - take the next one


Counting pages



First we write a function that downloads the page at the given URL:

def get_html(url):
    response = urllib.request.urlopen(url)    # open the URL (no proxy is needed just to count the pages)
    return response.read()                    # return the raw HTML of the page


Now, using that HTML, we find the total number of pages:

def get_page_count(html):
    soup = BeautifulSoup(html, 'html.parser')            # parse the downloaded HTML
    paggination = soup('ul')[3:4]                        # the pagination block is the fourth <ul> on the page
    lis = [li for ul in paggination for li in ul.findAll('li')][-1]  # the last <li> links to the last page
    for link in lis.find_all('a'):
        var1 = link.get('href')                          # e.g. '/jobs/?page=137'
        var2 = var1[-3:]                                 # the page number is in the last characters of the link
    return int(var2)
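Taking the last three characters of the href is fragile: it breaks as soon as the page count has a different number of digits. A slightly more robust variant, sketched below under the assumption that the pagination block is still the fourth <ul> and its last link ends in ?page=N, pulls the number out with a regular expression:

def get_page_count(html):
    # a sketch of a more tolerant variant, assuming the pagination block
    # is still the fourth <ul> and its last link ends in '?page=N'
    soup = BeautifulSoup(html, 'html.parser')
    last_link = soup('ul')[3].findAll('li')[-1].find('a')
    match = re.search(r'page=(\d+)', last_link.get('href'))
    return int(match.group(1)) if match else 1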


Getting a proxy



The code was partially taken from Igor Danilov. We will use __init__(self), the class constructor; self refers to the object being created and is substituted automatically at the moment of construction. Important: __init__ has two underscores on each side.
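A tiny illustration of how __init__ and self work (the Point class is just a toy example, unrelated to the parser):

class Point:
    def __init__(self, x, y):   # runs automatically when Point(...) is created
        self.x = x              # self is the object being constructed
        self.y = y

p = Point(2, 3)                 # __init__ is called here; p.x == 2, p.y == 3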

class Proxy:
    proxy_url = 'http://www.ip-adress.com/proxy_list/'   # page that publishes a list of free proxies
    proxy_list = []

    def __init__(self):
        r = requests.get(self.proxy_url)                        # download the proxy list page
        tree = html.fromstring(r.content)                       # build an lxml HtmlElement tree from it
        result = tree.xpath("//tr[@class='odd']/td[1]/text()")  # extract the IP:port strings
        result = [i for i in result if i not in massiv]         # drop proxies that have already been used
        self.list = result

    def get_proxy(self):
        global massiv
        for proxy in self.list:
            if 'https://' + proxy != proxy1:     # take the first proxy that differs from the one in use
                massiv = massiv + [proxy]        # remember it as used
                url = 'https://' + proxy
                return url


Page parsing



Now we extract the data we need from each page of the site. The new procedure:



def parse(html, parsing):
    projects = []                                             # job offers found on this page
    table = parsing.find('div', {'class': 'container-fluid cols_table show_visited'})  # block with the job listings
    for row in table.find_all('div', {'class': 'row'}):       # one row per job offer
        cols = row.find_all('div')
        price = row.find_all('div', {'class': 'col-sm-1 amount title'})   # price column
        cols1 = row.find_all('div', {'class': 'col-xs-12', 'style': 'margin-top: -10px; margin-bottom: -10px'})  # description block
        if cols1 == []:
            application_text = ''                             # the job has no description
        else:
            application_text = cols1[0].text                  # text of the description block
        cols2 = [category.text for category in row.find_all('a', {'class': 'text-muted'})]  # category links
        projects.append({'title': cols[0].a.text,
                         'category': cols2[0],
                         'applications': cols[2].text.strip(),
                         'price': price[0].text.strip(),
                         'description': application_text})    # collect the fields into one record
    return projects
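Each element of the returned list is a dictionary with one job offer; a single record might look roughly like this (the values below are invented for illustration):

# an illustrative example of one record produced by parse(); the values are invented
{
    'title': 'Parse an online store catalogue',
    'category': 'Web programming',
    'applications': '3 applications',
    'price': '$50',
    'description': 'Collect product names and prices into a spreadsheet.'
}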


Cleaning function



The only thing the clearing procedure needs to do is delete everything between the given indices in both text fields.

def delete(event):  # event is required because the function is triggered via bind
    txt1.delete(1.0, END)   # clear the search field
    txt2.delete(1.0, END)   # clear the results field


Searching the data



The function will search for jobs whose description mentions the words we need. The query typed into the field is treated as a regular expression (for example, python|Python, or C\+\+).
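For example, the query python|Python matches both spellings of the word; a quick check with re.findall (the sample description string is invented):

import re

description = 'Need a Python developer to parse a website'   # invented sample text
print(re.findall(r'python|Python', description))   # ['Python']
print(re.findall(r'C\+\+', description))            # [] - no match in this description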



def poisk(event):  # event is required because the function is triggered via bind
    file = open("proj.csv", "r")                      # the file produced by the parser
    rdr = csv.DictReader(file, fieldnames = ['name', 'categori', 'zajavki', 'case', 'opisanie'])
    poisk = txt1.get(1.0, END)                        # read the query from the search field
    poisk = poisk[0:len(poisk) - 1]                   # drop the trailing '\n' that Text.get appends
    for rec in rdr:                                   # iterate over the rows of the CSV file
        data = rec['opisanie'].split(';')             # job description
        data1 = rec['case'].split(';')                # job price
        data = ('').join(data)
        data1 = ('').join(data1)
        w = re.findall(poisk, data)                   # look for the query in the description
        if w != []:                                   # the query was found
            if data1 == '':
                data1 = 'price not specified'         # placeholder for jobs without a price
            txt2.insert(END, data + '--' + data1 + '\n' + '---------------' + '\n')  # print description, price and a separator
    file.close()


Saving data



As already said: the data will be saved in csv format. If you wish, you can rewrite the function for any other format.

def save(projects, path):
    with open(path, 'w') as csvfile:     # create (or overwrite) the file at path
        writer = csv.writer(csvfile)     # writer object for the CSV file
        writer.writerow(('title', 'category', 'applications', 'price', 'description'))  # header row
        for project in projects:
            try:
                writer.writerow((project['title'], project['category'], project['applications'], project['price'], project['description']))
            except UnicodeEncodeError:   # the description contains characters the default encoding cannot handle
                writer.writerow((project['title'], project['category'], project['applications'], project['price'], ''))  # write the row without the description
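The UnicodeEncodeError handling is only needed because the file is opened with the platform's default encoding. A variant that avoids the problem (a sketch, assuming UTF-8 output is acceptable) opens the file explicitly in UTF-8 and passes newline='' so that blank lines are not inserted between rows on Windows:

def save(projects, path):
    # a sketch of an alternative writer, assuming UTF-8 output is acceptable
    with open(path, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(('title', 'category', 'applications', 'price', 'description'))
        for project in projects:
            writer.writerow((project['title'], project['category'],
                             project['applications'], project['price'],
                             project['description']))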


I hope this information will be useful in your work. Good luck.

Source: https://habr.com/ru/post/322608/
