About five or six years ago, when I was still programming mostly in PHP, I started using curl to scrape websites. I needed a tool that would let me emulate a user's session on a site, send the headers of a regular browser, and provide a convenient way to send POST requests. At first I tried to use the curl extension directly, but its interface was very inconvenient, so I wrote a wrapper with a simpler interface. As time went on, I moved to Python and ran into the same clunky curl-extension API, so I had to rewrite the wrapper in Python.
What is Grab?
It is a library for scraping websites. Its main features are:
- Preparing a network request (cookies, HTTP headers, POST/GET data)
- Sending the request to the server (optionally through an HTTP/SOCKS proxy)
- Receiving the server's response and its initial processing (parsing headers and cookies, detecting the document encoding, handling redirects (even a redirect via the meta refresh tag is supported))
- Working with the DOM tree of the response (if it is an HTML document)
- Working with forms (filling, auto-filling)
- Debugging: logging the process to the console, saving network requests and responses to files
Below I will cover each item in more detail. First, let's talk about creating the Grab object and preparing a network request. Here is an example that requests a page from Yandex and saves it to a file:
>>> g = Grab(log_file='out.html')
>>> g.go('http://yandex.ru')
Strictly speaking, the `log_file` parameter is intended for debugging: it specifies where to save the response body for later inspection. But you can also use it simply to download a file.
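The same idea works for binary files; a minimal sketch (the URL is the Yandex logo used again in the Django example later in the article):

>>> from grab import Grab
>>> # save the downloaded image straight to disk via log_file
>>> Grab(log_file='logo.png').go('http://img.yandex.net/i/www/logo.png')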
So far we have configured the Grab object right in the constructor. Here are a few other ways to write the same code:
>>> g = Grab()
>>> g.setup(url='http://yandex.ru', log_file='out.html')
>>> g.request()
or
>>> g = Grab()
>>> g.go('http://yandex.ru', log_file='out.html')
The shortest:
>>> Grab(log_file='out.html').go('http://yandex.ru')
To sum up: you can configure Grab through the constructor, through the `setup` method, or through the `go` and `request` methods. With the `go` method, the requested URL can be passed as a positional argument; in the other cases it must be passed as a named argument. The difference between `go` and `request` is that `go` requires a URL as its first parameter, while `request` requires nothing and uses the URL specified earlier.
In addition to the `log_file` option, there is the `log_dir` option, which makes debugging a multi-step parser much easier.
>>> import logging
>>> from grab import Grab
>>> logging.basicConfig(level=logging.DEBUG)
>>> g = Grab()
>>> g.setup(log_dir='log/grab')
>>> g.go('http://yandex.ru')
DEBUG:grab:[02] GET http://yandex.ru
>>> g.setup(post={'hi': u', !'})
>>> g.request()
DEBUG:grab:[03] POST http://yandex.ru
See? Each request got its own number. The response body of each request was saved in the log directory under [number].html, and a [number].log file with the HTTP response headers was created alongside it. What does the code above do? It fetches the main page of Yandex and then makes a meaningless POST request to the same page. Note that in the second request we do not specify a URL: by default, the URL of the previous request is used.
Let's look at another Grab debugging setting.
>>> g = Grab()
>>> g.setup(debug=True)
>>> g.go('http://youporn.com')
>>> g.request_headers
{'Accept-Language': 'en-us;q=0.9,en,ru;q=0.3', 'Accept-Encoding': 'gzip', 'Keep-Alive': '300', 'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.3', 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.2) Gecko/2008091620 Firefox/3.0.2', 'Accept-Charset': 'utf-8,windows-1251;q=0.7,*;q=0.7', 'Host': 'www.youporn.com'}
We made a request to youporn.com. The `debug` option makes Grab remember outgoing requests, so if we are unsure about something, we can see exactly what was sent to the server. The `request_headers` attribute contains a dictionary with the keys and values of the HTTP request headers.
Now let's look at the basic capabilities for composing requests.
HTTP request methods
POST request. It's pretty simple: specify a dictionary of keys and values in the `post` option, and Grab will automatically change the request method to POST.
>>> g = Grab()
>>> g.setup(post={'act': 'login', 'redirec_url': '', 'captcha': '', 'login': 'root', 'password': '123'})
>>> g.go('http://habrahabr.ru/ajax/auth/')
>>> print g.xpath_text('//error')
GET request. If no explicit POST data or request method is specified, Grab will generate a GET request.
PUT, DELETE, HEAD methods. In theory everything will work if you specify the option method='delete', method='put' or method='head'. In practice I have worked very little with these methods and cannot vouch for how well they work.
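A small sketch of switching methods via the `method` option; the URLs here are placeholders, and, as noted above, I cannot vouch for PUT/DELETE:

>>> g = Grab()
>>> # no post data and no method option: a plain GET request
>>> g.go('http://example.com/news/recent')
>>> # HEAD request: only the headers are fetched, e.g. to check a file size
>>> g.go('http://example.com/big-file.zip', method='head')
>>> print g.response.headers.get('Content-Length')
>>> # in theory DELETE and PUT work the same way: method='delete', method='put'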
Important note about POST requests. Grab is designed to save all the specified options and reuse them in subsequent requests. The only option it does not save is the `post` option. If it did, then in the following example you would also send a POST request to the second URL, which is hardly what you want:
>>> g.setup(post={'login': 'root', 'password': '123'})
>>> g.go('http://example.com/login')
>>> g.go('http://example.com/news/recent')
Configuring HTTP headers
Now let's look at how to customize the HTTP headers that are sent. Just pass a dictionary of headers in the `headers` option. By default, Grab generates several headers on its own to look more like a browser: Accept, Accept-Language, Accept-Charset, Keep-Alive. You can override these with the `headers` option as well:
>>> g = Grab()
>>> g.setup(headers={'Accept-Encoding': ''})
>>> g.go('http://digg.com')
>>> print g.response.headers.get('Content-Encoding')
None
>>> g.setup(headers={'Accept-Encoding': 'gzip'})
>>> g.go('http://digg.com')
>>> print g.response.headers['Content-Encoding']
gzip
Working with cookies
By default, Grab saves received cookies and sends them with the next request, so you get user-session emulation out of the box. If you do not need this, disable the `reuse_cookies` option. You can set cookies manually with the `cookies` option; it should contain a dictionary, which is processed similarly to the data passed in the `post` option.
>>> g.setup(cookies={'secureid': '234287a68s7df8asd6f'})
You can specify a file that should be used as a cookie storage, using the `cookiefile` option. This will allow you to save cookies between program launches.
At any time you can write the Grab object's cookies to a file with the `dump_cookies` method or load them from a file with the `load_cookies` method. To clear the Grab object's cookies, use the `clear_cookies` method.
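A minimal sketch of keeping a session between program launches; I am assuming here that `dump_cookies` and `load_cookies` accept a file path, and all the file names and the URL are arbitrary:

>>> g = Grab()
>>> g.setup(cookiefile='session.cookies')   # cookies persist in this file between launches
>>> g.go('http://example.com/login')
>>> g.dump_cookies('backup.cookies')        # explicitly save the current cookies (assumed to take a path)
>>> g.clear_cookies()                       # wipe the current session
>>> g.load_cookies('backup.cookies')        # and restore it from the saved file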
User-Agent
By default, Grab disguises itself as a real browser. It has a list of various User-Agent strings, one of which is picked at random when the Grab object is created. Of course, you can set your own User-Agent with the `user_agent` option.
>>> from grab import Grab
>>> g = Grab()
>>> g.go('http://whatsmyuseragent.com/')
>>> g.xpath('//td[contains(./h3/text(), "Your User Agent")]').text_content()
'The Elements of Your User Agent String Are:\nMozilla/5.0\r\nWindows\r\nU\r\nWindows\r\nNT\r\n5.1\r\nen\r\nrv\r\n1.9.0.1\r\nGecko/2008070208\r\nFirefox/3.0.1'
>>> g.setup(user_agent='Porn-Parser')
>>> g.go('http://whatsmyuseragent.com/')
>>> g.xpath('//td[contains(./h3/text(), "Your User Agent")]').text_content()
'The Elements of Your User Agent String Are:\nPorn-Parser'
Working with a proxy server
It is all very simple. In the `proxy` option, pass the proxy address in the form "server:port"; in the `proxy_type` option, pass its type: "http", "socks4" or "socks5". If your proxy requires authorization, use the `proxy_userpwd` option, whose value has the form "user:password".
The simplest proxy-server finder and checker, based on Google search results:
>>> from grab import Grab, GrabError
>>> from urllib import quote
>>> import re
>>> g = Grab()
>>> g.go('http://www.google.ru/search?num=100&q=' + quote('free proxy +":8080"'))
>>> rex = re.compile(r'(?:(?:[-a-z0-9]+\.)+)[a-z0-9]+:\d{2,4}')
>>> for proxy in rex.findall(g.drop_space(g.css_text('body'))):
...     g.setup(proxy=proxy, proxy_type='http', connect_timeout=5, timeout=5)
...     try:
...         g.go('http://google.com')
...     except GrabError:
...         print proxy, 'FAIL'
...     else:
...         print proxy, 'OK'
...
210.158.6.201:8080 FAIL
...
proxy2.com:80 OK
...
210.107.100.251:8080 OK
...
Working with the response
Let's say you made a network request with Grab. What next? The `go` and `request` methods return a Response object, which is also available through the `response` attribute of the Grab object. You may be interested in the following attributes and methods of the Response object: code, body, headers, url, cookies, charset.
- code - the HTTP response code. If the code differs from 200, no exception is raised; keep this in mind (see the short check after this list).
- body - the actual response body, without the HTTP headers
- headers - the response headers as a dictionary
- url - may differ from the original URL if there was a redirect
- cookies - the cookies as a dictionary
- charset - the document encoding; it is looked up in the META tag of the document, in the Content-Type HTTP header of the response and in the XML declaration of XML documents.
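Since a non-200 status does not raise anything by itself, it is worth checking the code explicitly; a minimal sketch (the URL is a placeholder):

>>> g = Grab()
>>> g.go('http://example.com/maybe-missing-page')
>>> if g.response.code != 200:
...     print 'Unexpected HTTP status:', g.response.code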
The Grab object has a `response_unicode_body` method that returns the response body converted to unicode; note that HTML entities such as "&amp;" are not converted to their unicode counterparts.
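A short illustration of the difference between the raw body and the unicode body; the comments describe assumptions rather than guaranteed behaviour:

>>> g = Grab()
>>> g.go('http://aport.ru')
>>> raw = g.response.body                 # byte string, still in the original encoding (windows-1251 for this site, as shown below)
>>> text = g.response_unicode_body()      # unicode string, decoded according to response.charset
>>> # HTML entities such as &amp; remain in the text as-is, they are not converted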
The response object of the last request is always stored in the attribute `response` of the Grab object.
>>> g = Grab()
>>> g.go('http://aport.ru')
>>> g.response.code
200
>>> g.response.cookies
{'aportuid': 'AAAAGU5gdfAAABRJAwMFAg=='}
>>> g.response.headers['Set-Cookie']
'aportuid=AAAAGU5gdfAAABRJAwMFAg==; path=/; domain=.aport.ru; expires=Wed, 01-Sep-21 18:21:36 GMT'
>>> g.response.charset
'windows-1251'
Working with response text (grab.ext.text extension)
The `search` method checks whether a given string is present in the response body; the `search_rex` method takes a regular expression object as its parameter. The `assert_substring` and `assert_rex` methods raise a DataNotFound exception if the argument was not found. The extension also provides convenient functions such as `find_number`, which finds the first numeric occurrence, `drop_space`, which removes all whitespace characters, and `normalize_space`, which collapses sequences of whitespace into a single space.
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> g.search(u'Google')
True
>>> g.search(u'')
False
>>> g.search(u'')
False
>>> g.search(u'')
False
>>> g.search(u'')
True
>>> g.search('')
Traceback (most recent call last):
  File "", line 1, in
  File "grab/ext/text.py", line 37, in search
    raise GrabMisuseError('The anchor should be byte string in non-byte mode')
grab.grab.GrabMisuseError: The anchor should be byte string in non-byte mode
>>> g.search('', byte=True)
True
>>> import re
>>> g.search_rex(re.compile('Google'))
<_sre.SRE_Match object at 0xb6b0a6b0>
>>> g.search_rex(re.compile('Google\s+\w+', re.U))
<_sre.SRE_Match object at 0xb6b0a6e8>
>>> g.search_rex(re.compile('Google\s+\w+', re.U)).group(0)
u'Google Chrome'
>>> g.assert_substring(' ')
Traceback (most recent call last):
  File "", line 1, in
  File "grab/ext/text.py", line 62, in assert_substring
    if not self.search(anchor, byte=byte):
  File "grab/ext/text.py", line 37, in search
    raise GrabMisuseError('The anchor should be byte string in non-byte mode')
grab.grab.GrabMisuseError: The anchor should be byte string in non-byte mode
>>> g.assert_substring(u' ')
Traceback (most recent call last):
  File "", line 1, in
  File "grab/ext/text.py", line 63, in assert_substring
    raise DataNotFound('Substring not found: %s' % anchor)
grab.grab.DataNotFound
>>> g.drop_spaces('foo bar')
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'Grab' object has no attribute 'drop_spaces'
>>> g.drop_space('foo bar')
'foobar'
>>> g.normalize_space(' foo \n \t bar')
'foo bar'
>>> g.find_number('12 ')
'12'
Working with the DOM tree (grab.ext.lxml extension)
Now we get to the most interesting part. Thanks to the wonderful lxml library, Grab lets you use XPath expressions to search for data. In short, the `tree` attribute gives you access to a DOM tree with an ElementTree interface, built by the lxml parser. You can query the DOM tree using two languages: XPath and CSS.
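For instance, nothing stops you from querying the `tree` attribute directly with the usual lxml API; a sketch, assuming the tree exposes the standard lxml methods (the XPath expression is just an illustration):

>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> g.tree.xpath('//title/text()')        # plain lxml call on the underlying DOM tree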
Methods for working with xpath:
- xpath - return the first element matching the query
- xpath_list - return all matching elements
- xpath_text - return the text content of the element (and of all nested elements)
- xpath_number - return the first numeric occurrence in the text of the element (and of all nested elements)
If the element is not found, the `xpath`, `xpath_text` and `xpath_number` functions raise a DataNotFound exception.
The `css`, `css_list`, `css_text` and `css_number` functions work the same way, with one difference: the argument is a CSS selector rather than an XPath path.
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> g.xpath('//h2/a[@class="topic"]').get('href')
'http://habrahabr.ru/blogs/qt_software/127555/'
>>> print g.xpath_text('//h2/a[@class="topic"]')
Qt Creator 2.3.0
>>> print g.css_text('h2 a.topic')
Qt Creator 2.3.0
>>> print 'Comments:', g.css_number('.comments .all')
Comments: 5
>>> from urlparse import urlsplit
>>> print ', '.join(urlsplit(x.get('href')).netloc for x in g.css_list('.hentry a') if not 'habrahabr.ru' in x.get('href') and x.get('href').startswith('http:'))
labs.qt.nokia.com, labs.qt.nokia.com, thisismynext.com, www.htc.com, www.htc.com, droider.ru, radikal.ru, www.gosuslugi.ru, bit.ly
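Because these functions raise DataNotFound when nothing matches, optional elements are conveniently wrapped in try/except. A sketch, reusing the selector from the example above; I import the exception from grab.grab, where the tracebacks earlier in the article show it lives:

>>> from grab import Grab
>>> from grab.grab import DataNotFound    # assumed import path, based on the tracebacks above
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> try:
...     comments = g.css_number('.comments .all')
... except DataNotFound:
...     comments = 0                      # the element is optional, fall back to zero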
Forms (grab.ext.lxml_form extension)
When I implemented the form auto-fill functionality I was very pleased with it; check it out! There are the methods `set_input`, which fills the field with the given name, `set_input_by_id`, which finds the field by the value of its id attribute, and `set_input_by_number`, which simply takes the field's number. These methods work with a form that can be chosen by hand, but usually Grab guesses correctly which form you want to work with: if there is only one form, the choice is obvious, and if there are several, Grab takes the form with the most fields. To choose the form manually, use the `choose_form` method. The `submit` method sends the completed form; Grab builds a POST/GET request including the fields we did not fill in explicitly (for example, hidden fields), and works out the form's action and the request method. There is also a `form_fields` method which returns all the form's fields and values as a dictionary.
>>> g.go('http://ya.ru/')
>>> g.set_input('text', u' ')
>>> g.submit()
>>> print ', '.join(x.get('href') for x in g.css_list('.b-serp-url__link'))
http://gigporno.ru/, http://drochinehochu.ru/, http://porno.bllogs.ru/, http://www.pornoflv.net/, http://www.plombir.ru/, http://vuku.ru/, http://www.carol.ru/, http://www.Porno-Mama.ru/, http://kashtanka.com/, http://www.xvidon.ru/
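And a sketch of the manual path: choosing a form and inspecting its fields by hand. I am assuming `choose_form` accepts the form's index on the page; the URL and field names are made up:

>>> g = Grab()
>>> g.go('http://example.com/signup')
>>> g.choose_form(1)                  # pick the second form on the page (assumed to take an index)
>>> print g.form_fields()             # all fields and their current values as a dictionary
>>> g.set_input_by_id('id_login', 'root')
>>> g.set_input('password', '123')
>>> g.submit()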
Transports
By default, Grab uses pycurl for all network operations. This functionality is also implemented as an extension, and you can plug in a different transport extension, for example one that makes requests through the urllib2 library. There is just one problem: that extension has to be written first :) Work on the urllib2 extension is underway, but very slowly, since I am 100% satisfied with pycurl. I expect the pycurl and urllib2 extensions to be similar in their capabilities, except that urllib2 cannot work with SOCKS proxies. All the examples in this article use the pycurl transport, which is enabled by default.
>>> g = Grab()
>>> g.curl
<pycurl.Curl object at 0x9d4ba04>
>>> g.extensions
[<grab.ext.pycurl.Extension object at 0xb749056c>, <grab.ext.lxml.Extension object at 0xb749046c>, <grab.ext.lxml_form.Extension object at 0xb6de136c>, <grab.ext.django.Extension object at 0xb6a7e0ac>]
Hammer mode (hammer-mode)
This mode is enabled by default. Grab has a timeout for each request. In hammer mode, when a timeout occurs, Grab does not raise an exception right away but retries the request several times with increasing timeouts. This mode can significantly improve the stability of your program, since brief hiccups on the site or in your connection are quite common. The mode is controlled by the `hammer_mode` option; to adjust the number and length of the timeouts, use the `hammer_timeouts` option, which takes a list of numeric pairs: the first number is the timeout for connecting to the server socket, the second is the timeout for the whole operation, including receiving the response.
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> g = Grab()
>>> g.setup(hammer_mode=True, hammer_timeouts=((1, 1), (2, 2), (30, 30)))
>>> URL = 'http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz'
>>> g.go(URL, method='head')
DEBUG:grab:[01] HEAD http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
>>> print 'File size: %d Mb' % (int(g.response.headers['Content-Length']) / (1024 * 1024))
File size: 3 Mb
>>> g.go(URL, method='get')
DEBUG:grab:[02] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
DEBUG:grab:Trying another timeouts. Connect: 2 sec., total: 2 sec.
DEBUG:grab:[03] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
DEBUG:grab:Trying another timeouts. Connect: 30 sec., total: 30 sec.
DEBUG:grab:[04] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
>>> print 'Downloaded: %d Mb' % (len(g.response.body) / (1024 * 1024))
Downloaded: 3 Mb
Django extension (grab.ext.django)
Yes, yes, there is such a thing :-) Suppose you have a Movie model with an ImageField named `picture`. Here is how to download a picture and save it into a Movie object.
>>> obj = Movie.objects.get(pk=797)
>>> g = Grab()
>>> g.go('http://img.yandex.net/i/www/logo.png')
>>> obj.picture = g.django_file()
>>> obj.save()
What else is in Grab?
There are other features, but I am afraid the article would get too long. The main rule for a Grab user: if something is unclear, look into the code. The documentation is still weak.
Development plans
I have been using Grab for many years, including on production sites, such as an aggregator where you can buy discount coupons in Moscow and other cities. In 2011 I started writing tests and documentation. Maybe I will add functionality for asynchronous requests based on multicurl. It would also be nice to finish the urllib transport.
How can you help the project? Just use it, send bug reports and patches. You can also hire me to write parsers, grabbers and information-processing scripts; I write such things with Grab regularly.
Official project repository:
bitbucket.org/lorien/grab
You can also install the library from pypi.python.org, but the code in the repository is usually fresher.
UPD: Various alternatives to Grab were mentioned in the comments. I decided to summarize them in a list, plus a few from memory. In fact, there are loads of such alternatives; I think every Nth programmer at some point decides to write his own utility for network requests:
UPD2: Please post your questions about the library to the Google group:
groups.google.com/group/python-grab/
Other Grab users will find it helpful to read the questions and answers there.
UPD3: Current documentation is available at:
docs.grablib.org/
UPD4: Current project site:
grablib.org
UPD5: Fixed the source code examples in the article. After another Habrahabr upgrade, for reasons I do not understand, the formatting of code in old articles was not preserved and it all fell apart. Thanks to Alexey Mazanov for correcting the article. He would also like to get onto Habr; if you have an invite, his email is: egocentrist@me.com