
What is Grab: Spider?

I just can't seem to finish the Grab documentation, so I decided to publish pieces of it on Habrahabr. I think that with some feedback the work will go faster. At the moment the documentation consists only of an introduction that describes, in general terms, what kind of beast Grab:Spider is. Here it is.

The Spider module is a framework that lets you describe a site parser as a set of handler functions, where each handler is responsible for a specific type of request. For example, when parsing a forum you would have handlers for the main page, the sub-forum page, the topic page, and the member profile page. Originally this parser structure grew out of the limitations of the asynchronous mode, but it turned out that writing parsers in such a structured form (one request, one function) is simply very convenient; a sketch of such a layout is shown below.
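To make the "one request, one function" idea more concrete, here is a rough sketch of how a forum parser might be laid out. The class name, handler names, URL, and XPath expressions are invented for illustration and are not part of Grab; only the overall structure matters:

from grab.spider import Spider, Task

class ForumSpider(Spider):
    # Entry point: one "initial" task is created for each of these URLs
    initial_urls = ['http://example.com/forum/']

    def task_initial(self, grab, task):
        # Main page: queue one task per sub-forum link
        for elem in grab.xpath_list('//a[@class="subforum-link"]'):
            yield Task('subforum', url=elem.get('href'))

    def task_subforum(self, grab, task):
        # Sub-forum page: queue one task per topic link
        for elem in grab.xpath_list('//a[@class="topic-link"]'):
            yield Task('topic', url=elem.get('href'))

    def task_topic(self, grab, task):
        # Topic page: extract data and queue member profiles if needed
        for elem in grab.xpath_list('//a[@class="member-link"]'):
            yield Task('member', url=elem.get('href'))

    def task_member(self, grab, task):
        # Member profile page: the last handler in the chain
        print('profile: %s' % task.url)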

The Spider module works asynchronously. This means there is only one worker process; no threads or extra processes are created to serve multiple requests. All created requests are processed by the multicurl library. The essence of the asynchronous approach is that the program creates network requests and then waits for signals that responses to those requests are ready. As soon as a response is ready, the handler function bound to that particular request is called. The asynchronous approach makes it possible to handle more simultaneous connections than a thread- or process-based approach, because memory is occupied by only one process and the CPU does not have to constantly switch between many processes.
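Grab:Spider hides this event loop from you, but to illustrate the model here is a minimal sketch of the same pattern written directly against pycurl's multi interface. This is only an illustration of the asynchronous idea, not Grab's actual internals, and the URLs are placeholders:

import pycurl
from io import BytesIO

urls = ['http://example.com/page1', 'http://example.com/page2']
multi = pycurl.CurlMulti()
buffers = {}

for url in urls:
    curl = pycurl.Curl()
    buf = BytesIO()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buf.write)
    curl.setopt(pycurl.FOLLOWLOCATION, 1)
    buffers[curl] = buf
    multi.add_handle(curl)

remaining = len(urls)
while remaining:
    # Push all transfers forward as far as possible without blocking
    while True:
        ret, num_handles = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Pick up finished transfers: this is the moment when a framework
    # like Spider would call the handler bound to the request
    num_queued, ok_handles, err_handles = multi.info_read()
    for curl in ok_handles:
        print('done: %d bytes received' % len(buffers[curl].getvalue()))
    for curl, errno, errmsg in err_handles:
        print('failed: %s' % errmsg)
    remaining -= len(ok_handles) + len(err_handles)
    # Wait until some socket becomes ready instead of busy-looping
    multi.select(1.0)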
There is one nuance that will feel very unusual to those used to working in a synchronous style. With the asynchronous approach, a handler function is called only when the network response is ready. If the parsing algorithm consists of several consecutive network requests, you need to store somewhere the information about why each request was created and what to do with its response. Spider lets you solve this problem quite conveniently.

Each handler function takes two arguments. The first argument is a Grab object, which holds information about the network response. The beauty of the Spider module is that it preserves the familiar interface for working with synchronous requests. The second argument of the handler function is a Task object. Task objects are created in Spider to add a new task to the network request queue. The Task object can also be used to carry intermediate data between multiple requests, as the small sketch below shows.
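For instance, any extra keyword argument you pass to Task becomes an attribute of that Task and is available in the handler that receives the response (the full example further down uses this with post=post / task.post). A tiny sketch with made-up handler names, URL, and selector:

from grab.spider import Spider, Task

class AuthorSpider(Spider):
    initial_urls = ['http://example.com/authors/1/']

    def task_initial(self, grab, task):
        # Collect data from the first page and keep it in a dict
        author = {'name': grab.xpath_text('//h1')}
        # The extra keyword argument (author=...) is stored on the Task
        # object and travels along with the request
        yield Task('author_posts', url=task.url + 'posts/', author=author)

    def task_author_posts(self, grab, task):
        # The data collected in the previous handler is available as task.author
        print('%s has posts at %s' % (task.author['name'], task.url))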

Consider an example of a simple parser. Suppose we want to go to habrahabr.ru, grab the latest post titles, then for each title find an image via images.yandex.ru and save the collected data to a file:

# coding: utf-8
import urllib
import csv
import logging

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    # For each URL in initial_urls, a task named "initial" is created
    # automatically when the spider starts
    initial_urls = ['http://habrahabr.ru/']

    def prepare(self):
        # The prepare method is called once before the spider starts working.
        # Prepare a file for the results and a counter for downloaded images.
        self.result_file = csv.writer(open('result.txt', 'w'))
        self.result_counter = 0

    def task_initial(self, grab, task):
        print 'Habrahabr home page'
        # This is the handler for the "initial" task, i.e. for the requests
        # created from self.initial_urls. Links to posts are extracted with
        # the familiar Grab interface.
        for elem in grab.xpath_list('//h1[@class="title"]/a[@class="post_title"]'):
            # For every found link we create a new "habrapost" task.
            # yield is a shortcut for: self.add_task(Task('habrapost', url=...))
            yield Task('habrapost', url=elem.get('href'))

    def task_habrapost(self, grab, task):
        print 'Habrahabr topic: %s' % task.url
        # First, save the data we are interested in into a dictionary
        post = {
            'url': task.url,
            'title': grab.xpath_text('//h1/span[@class="post_title"]'),
        }
        # Then build an image search request and pass the collected data
        # along inside the Task object via the "post" keyword argument,
        # so the next handler can access it as task.post
        query = urllib.quote_plus(post['title'].encode('utf-8'))
        search_url = 'http://images.yandex.ru/yandsearch?text=%s&rpt=image' % query
        yield Task('image_search', url=search_url, post=post)

    def task_image_search(self, grab, task):
        print 'Images search result for %s' % task.post['title']
        # Here we process the search results page. The data collected in the
        # previous handler is available via task.post. Extract the URL of the
        # first image and create a task to download it.
        image_url = grab.xpath_text('//div[@class="b-image"]/a/img/@src')
        yield Task('image', url=image_url, post=task.post)

    def task_image(self, grab, task):
        print 'Image downloaded for %s' % task.post['title']
        # Save the downloaded image to disk and write a row to the result file
        path = 'images/%s.jpg' % self.result_counter
        grab.response.save(path)
        self.result_file.writerow([
            task.post['url'].encode('utf-8'),
            task.post['title'].encode('utf-8'),
            path,
        ])
        # Increment the counter so the next image gets a different file name
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Run the spider with two network threads: the program still runs in a
    # single process, but multicurl serves several connections in parallel
    bot = ExampleSpider(thread_number=2)
    bot.run()


This example is a very simple parser and touches only a small part of what Spider can do; the details are in the full documentation. Note that some handler calls will end with an error, for example because Yandex finds nothing for a given query.

Next, I plan to describe the various ways of creating Task objects, handling network errors, and the mechanism for retrying tasks that failed with an error.

Questions about Grab are best asked in the mailing list: groups.google.com/group/python-grab

You can order custom work on Grab, as well as parsers based on Grab and Grab::Spider, here: grablab.org

Source: https://habr.com/ru/post/142288/

