
Grab:Spider parsing framework

I am the author of the Python library Grab, which makes it easy to write website parsers. Some time ago I published an introductory article about it on Habré. Recently I decided to take on parsing seriously, started looking for freelance parsing jobs, and realized I needed a tool for parsing sites with many pages.



I used to implement multithreaded parsers with Python threads. The threading approach has pros and cons. The advantage is that we launch a separate thread and do whatever we want in it: we can make several network calls in sequence, all within the same context, without having to switch anywhere or save and restore state. The downside is that threads are slow and eat memory.
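For comparison, here is a rough sketch of that thread-per-task style using only the standard library (this is my illustration, not code from Grab):

import threading
import urllib2

URLS = ['http://ya.ru', 'http://google.com']

def worker(url):
    # each thread performs its network calls sequentially, in its own context
    body = urllib2.urlopen(url).read()
    print url, len(body)

threads = [threading.Thread(target=worker, args=(url,)) for url in URLS]
for th in threads:
    th.start()
for th in threads:
    th.join()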



What are the alternatives?


Work with network resources asynchronously. There is a single flow of program execution, in which all the data-processing logic runs as soon as the data is ready; the data itself is downloaded asynchronously. In practice, this lets you work with the network over several hundred simultaneous connections without much effort; if you tried to launch that many threads, they would slow down badly.



So I wrote an interface to multicurl, the part of the pycurl library that allows asynchronous work with the network. I chose multicurl because Grab is built on pycurl, and I figured I could reuse it to work with multicurl. That is how it turned out; I was even somewhat surprised that it worked on the very first day of experiments :) The architecture of parsers based on Grab:Spider is very similar to parsers based on the scrapy framework, which, on the whole, is not surprising and quite logical.



I will give an example of the simplest spider:



# coding: utf-8
from grab.spider import Spider, Task


class SimpleSpider(Spider):
    initial_urls = ['http://ya.ru']

    def task_initial(self, grab, task):
        grab.set_input('text', u'')
        grab.submit(make_request=False)
        yield Task('search', grab=grab)

    def task_search(self, grab, task):
        for elem in grab.xpath_list('//h2/a'):
            print elem.text_content()


if __name__ == '__main__':
    bot = SimpleSpider()
    bot.run()
    print bot.render_stats()




What is going on here? For each URL in `self.initial_urls`, a task named initial is created; after multicurl downloads the document, the handler named `task_initial` is called. The key point is that inside the handler we get the Grab object associated with the requested document, so we can use practically any function of the Grab API. In this example we use its form handling. Note that we pass `make_request=False` so that the form is not submitted immediately: we want this network request to be processed asynchronously.



In short, working with Grab:Spider comes down to generating requests with Task objects and then processing them in special methods. Each task has a name, and that name determines which method will be called to process the downloaded network document.



You can create a Task object in two ways. Easy way:

 Task('foo', url='http://google.com') 




After the document is completely downloaded from the network, a method with the name `task_foo` will be called.
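For illustration, a minimal sketch of this naming rule (the spider class and URLs here are made up):

from grab.spider import Spider, Task

class FooSpider(Spider):
    initial_urls = ['http://google.com']

    def task_initial(self, grab, task):
        yield Task('foo', url='http://google.com')

    def task_foo(self, grab, task):
        # called once multicurl has finished downloading the document
        for elem in grab.xpath_list('//title'):
            print elem.text_content()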



More complicated way:

 g = Grab()
 g.setup(....   ...)
 Task('foo', grab=g)




In this way we can adjust the request parameters to our needs: set cookies, special headers, build a POST request, whatever we like.
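For example, something along these lines (the values are made up; as far as I recall, post, headers and cookies are ordinary Grab config options):

from grab import Grab
from grab.spider import Task

g = Grab()
g.setup(url='http://example.com/login',
        post={'login': 'user', 'password': 'secret'},   # makes it a POST request
        headers={'Accept-Language': 'ru'},
        cookies={'session': 'abc'})
# normally this Task would be yielded from a handler or task_generator
task = Task('login_done', grab=g)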



Where can requests be created? In any handler method you can yield a Task object, and it will be added to the asynchronous download queue. You can also return a Task object with return.
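A rough sketch of both variants inside handler methods (class, handler names and URLs are made up; the links are assumed to be absolute):

from grab.spider import Spider, Task

class ChainSpider(Spider):
    initial_urls = ['http://example.com/catalog']

    def task_initial(self, grab, task):
        # yield: queue a follow-up request for every link found on the page
        for href in grab.xpath_list('//h2/a/@href'):
            yield Task('item', url=href)

    def task_item(self, grab, task):
        # return also works when the handler produces exactly one new task
        return Task('page_two', url=task.url + '?page=2')

    def task_page_two(self, grab, task):
        print task.url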



Besides handler methods, there are two more ways to generate Task objects.

1) You can list addresses in the `self.initial_urls` attribute, and tasks named 'initial' will be created for them.



2) You can define a `task_generator` method and yield as many requests from it as you like. New requests are pulled from it as old ones are completed. This makes it possible, for example, to iterate over a million lines from a file without cluttering all the memory with them.
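A sketch of such a generator (the file name and task names are made up):

from grab.spider import Spider, Task

class BigListSpider(Spider):
    def task_generator(self):
        # lines are consumed gradually, as previously queued tasks complete
        for line in open('urls.txt'):
            yield Task('page', url=line.strip())

    def task_page(self, grab, task):
        for elem in grab.xpath_list('//title'):
            print elem.text_content()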



Initially, I planned to handle the extracted data the way scrapy does. There it is done with Pipeline objects. For example, you fetch a page about a movie, parse it and return an item of type Movie, and a pipeline configured in advance knows that Movie items should be saved to a database or to a CSV file. Something like that. In practice it turned out to be easier not to bother with the extra wrapper and to write data to the database or file right in the request handler method. Of course, this will not work once the methods are parallelized across a cloud of machines, but we still have to get to that point; for now it is more convenient to do everything directly in the handler.
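For example, a handler can write rows straight to a CSV file (a rough sketch with made-up names, not code from a real project):

import csv
from grab.spider import Spider, Task

class MovieSpider(Spider):
    initial_urls = ['http://example.com/movies']

    def __init__(self, *args, **kwargs):
        super(MovieSpider, self).__init__(*args, **kwargs)
        self.writer = csv.writer(open('movies.csv', 'wb'))

    def task_initial(self, grab, task):
        for elem in grab.xpath_list('//h2/a'):
            # write each record immediately, no intermediate Movie/Pipeline objects
            self.writer.writerow([elem.text_content().encode('utf-8')])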



Additional arguments can be attached to a Task object. For example, suppose we query Google search: we build the desired URL and create a Task object: `Task('search', url='...', query=query)`. Then, in the `task_search` method, we can find out exactly which query we were searching for by looking at the `task.query` attribute.
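A sketch of that (the queries and URL here are made up):

from grab.spider import Spider, Task

class SearchSpider(Spider):
    def task_generator(self):
        for query in ['grab', 'pycurl']:
            url = 'http://www.google.com/search?q=%s' % query
            yield Task('search', url=url, query=query)

    def task_search(self, grab, task):
        # the extra keyword argument comes back as an attribute of the task
        print 'results for query:', task.query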



Grab:Spider automatically tries to recover from network errors. In the case of a network timeout it runs the task again. You can configure the number of attempts with the `network_try_limit` option when creating the Spider object.
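For example (using the SimpleSpider class from the first example):

bot = SimpleSpider(network_try_limit=10)
bot.run()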



I must say that I really enjoyed writing parsers in the asynchronous style. The point is not only that the asynchronous approach puts less load on system resources, but also that the parser's source code acquires a clear and understandable structure.



Unfortunately, describing the work of the Spider module thoroughly would take a long time. I just wanted to tell the army of Grab library users, which, I know, numbers several people, about one of its capabilities still shrouded in the gloom of under-documentation.



Summary. If you use Grab, take a look at the Spider module; you might like it. If you do not know what Grab is, you are probably better off looking at the scrapy framework, which is documented a hundred times better than Grab.



PS I use MongoDB to store the parsing results, and it is just awesome :) Just do not forget to install a 64-bit system, otherwise you will not be able to create a database larger than two gigabytes.



PS An example of a real parser for one website: dumpz.org/119395



PS Official site of the project grablib.org (there are links to the repository, google group and documentation)



PS I write parsers based on Grab to order; details at grablab.org

Source: https://habr.com/ru/post/134918/
