
Do-it-yourself position monitoring

Monitoring the positions of our search queries in a search engine: the beginning.


Usually we are interested in growing the number of customers.
And to grow something, you first need to measure it.
It has historically turned out that a share of online store customers comes from search engines.
(I will write about working with contextual advertising and price aggregators in the following articles, if anyone is interested.)
And to assess how you are doing in search engines, you usually need to collect statistics from them on where your queries stand in the results.

Our tool will consist of two parts:


Getting our position for a query from yandex.ru


I want to make clear right away: this article covers the very basics and builds the simplest version, which we will improve later.

To begin with, we will write a function that takes a url and returns the page html.

We will load the page using pycurl.
import pycurl

c = pycurl.Curl()

Set the url that will be downloaded

url = 'ya.ru'
c.setopt(pycurl.URL, url)

To return the page body, curl uses a callback function, to which it passes strings with html.
We use the StringIO string buffer: it takes input through its write() method, and we can get everything written to it back via getvalue().

from StringIO import StringIO

c.bodyio = StringIO()
c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
c.get_body = c.bodyio.getvalue

Just in case, we will make our curl look like a browser, set timeouts, user agents, headers, etc.
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, 60)
c.setopt(pycurl.TIMEOUT, 120)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
httpheader = [
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Charset: utf-8;q=0.7,*;q=0.5',
    'Connection: keep-alive',
]
c.setopt(pycurl.HTTPHEADER, httpheader)

Now load the page
 c.perform() 

That's it: we have the page and can print its html
 print c.get_body() 

We can also read the response details, for example the HTTP status code.
 print c.getinfo(pycurl.HTTP_CODE) 
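
If we also want the raw response headers, they can be collected the same way as the body. The final get_page function below does exactly this; here is the same idea as a small sketch on our existing curl object:

# Collect the raw response headers into their own buffer,
# using the same trick as with the body
# (set this up before calling c.perform())
c.headio = StringIO()
c.setopt(pycurl.HEADERFUNCTION, c.headio.write)
c.get_head = c.headio.getvalue

# after c.perform() the headers are available as one string
print c.get_head()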

And if the server response code is anything other than 200, we can handle it. For now we simply raise an exception; exception handling will be covered in the following articles.
if c.getinfo(pycurl.HTTP_CODE) != 200:
    raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))

Wrap everything up in a function; in the end we get
import pycurl

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


def get_page(url, *args, **kargs):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.bodyio = StringIO()
    c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
    c.get_body = c.bodyio.getvalue
    c.headio = StringIO()
    c.setopt(pycurl.HEADERFUNCTION, c.headio.write)
    c.get_head = c.headio.getvalue

    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
    httpheader = [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
        'Accept-Charset: utf-8;q=0.7,*;q=0.5',
        'Connection: keep-alive',
    ]
    c.setopt(pycurl.HTTPHEADER, httpheader)

    c.perform()
    if c.getinfo(pycurl.HTTP_CODE) != 200:
        raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))
    return c.get_body()

Check the function
 print get_page('ya.ru') 


Extract the list of sites with their positions from the search results page

Construct the search query.
We need to send three GET parameters to yandex.ru/yandsearch:
'text' - the query, 'lr' - the search region, 'p' - the results page.
import urllib
import urlparse

key = ''
region = 213
page = 1
params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
params[4] = urllib.urlencode({
    'text': key,
    'lr': region,
    'p': page - 1,
})
url = urlparse.urlunparse(params)

Print the url and check it in a browser
 print url 

Fetch the results page with the function from the previous step
 html = get_page(url) 

Now we parse it as a DOM tree using lxml
import lxml.html

site_list = []
for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
    b = h2.find_class('b-serp-item__number')
    if len(b):
        num = b[0].text.strip()
        url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
        site = urlparse.urlparse(url).hostname
        site_list.append((num, site, url))

Let me describe in more detail what happens here:
lxml.html.fromstring(html) - builds an html document object from the html string.
.find_class('b-serp-item__title') - searches the document for all tags with the class 'b-serp-item__title'; this returns the list of H2 elements that hold the position information we need, and we loop over them.
b = h2.find_class('b-serp-item__number') - inside each found H2 tag we look for a b element holding the site's position number; if it is there, we collect the position via b[0].text.strip() and the url of the site.
urlparse.urlparse(url).hostname - extracts the domain name from the url.
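
If find_class looks unfamiliar, here is a tiny self-contained sketch on made-up markup (the snippet content is invented purely for illustration; the class names are the ones we parse above):

import lxml.html
import urlparse

# toy markup imitating one result item from the page we parse above
snippet = '''
<div>
  <h2 class="b-serp-item__title">
    <b class="b-serp-item__number">3</b>
    <a class="b-serp-item__title-link" href="http://habrahabr.ru/post/1/">Example</a>
  </h2>
</div>
'''

for h2 in lxml.html.fromstring(snippet).find_class('b-serp-item__title'):
    num = h2.find_class('b-serp-item__number')[0].text.strip()
    href = h2.find_class('b-serp-item__title-link')[0].attrib['href']
    print num, urlparse.urlparse(href).hostname  # prints: 3 habrahabr.ru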

Check the resulting list
 print site_list 

And collect everything into a function
def site_list(key, region=213, page=1):
    params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
    params[4] = urllib.urlencode({
        'text': key,
        'lr': region,
        'p': page - 1,
    })
    url = urlparse.urlunparse(params)
    html = get_page(url)
    site_list = []
    for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
        b = h2.find_class('b-serp-item__number')
        if len(b):
            num = b[0].text.strip()
            url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
            site = urlparse.urlparse(url).hostname
            site_list.append((num, site, url))
    return site_list

Check the function
 print site_list('', 213, 2) 


Find our site in the list of sites

We will need a helper function that strips 'www.' from the beginning of a site
def cut_www(site):
    if site.startswith('www.'):
        site = site[4:]
    return site
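
A quick check of the helper on a couple of example domains:

print cut_www('www.habrahabr.ru')  # habrahabr.ru
print cut_www('habrahabr.ru')      # habrahabr.ru, unchanged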


Get a list of sites and compare with our site
site = 'habrahabr.ru'
for pos, s, url in site_list('python', 213, 1):
    if cut_www(s) == site:
        print pos, url

Hmm, habr is not on the first page of the results for 'python', so we will walk through the results pages in a loop,
but we need to set a limit, max_position - the position up to which we will check.
At the same time we wrap it all in a function; if nothing is found, we return None, None.
import math

def site_position(site, key, region=213, max_position=10):
    site = cut_www(site)
    for page in range(1, int(math.ceil(max_position / 10.0)) + 1):
        for pos, s, url in site_list(key, region, page):
            if cut_www(s) == site:
                return pos, url
    return None, None

Check
 print site_position('habrahabr.ru', 'python', 213, 100) 


Here we actually got our position.
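
To turn this into actual monitoring, you could simply run site_position over a list of queries. A minimal sketch, assuming an example keyword list and a one-second pause between requests:

import time

# example keywords, pick your own
queries = ['python', 'python lxml', 'pycurl']

for key in queries:
    pos, url = site_position('habrahabr.ru', key, 213, 100)
    print key, pos, url
    time.sleep(1)  # be gentle with the search engine between requests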

Please let me know in the comments whether this topic is interesting and whether it is worth continuing.

What should I write about in the next article?
- make a web interface for this function with tables and graphs, and run the script from cron
- add captcha handling to this function, both manual and via a special api
- make the monitoring script multithreaded
- describe working with Yandex.Direct, for example generating and uploading ads via the api, or managing prices

Source: https://habr.com/ru/post/146258/

