
Grab: a new interface for working with the DOM tree of an HTML document

A bit of history


Earlier, I wrote on Habr about Grab, a framework for writing site scrapers: one, two, three, four. In a nutshell, Grab is a handy wrapper on top of two libraries: pycurl for networking and lxml for parsing HTML documents.

The lxml library lets you run XPath queries against the DOM tree and get the results as ElementTree objects, which have many useful properties. A few years ago I wrote a few simple methods that apply XPath queries to a document loaded through Grab. I will illustrate with code:

>>> from grab import Grab
>>> g = Grab()
>>> g.go('http://habrahabr.ru/')
<grab.response.Response object at 0x7fe5f7189850>
>>> print g.xpath_text('//title')
This is essentially the same as the following code:

>>> from urllib import urlopen
>>> from lxml.html import fromstring
>>> data = urlopen('http://habrahabr.ru/').read()
>>> dom = fromstring(data)
>>> print dom.xpath('//title')[0].text_content()


The convenience of the xpath_text method is that it is applied to the document loaded via Grab automatically: you do not need to build the tree yourself, you do not need to pick out the first element of the result manually, and the method extracts text from all nested elements. Next, I list all the DOM methods of the Grab library with a brief description of each:
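Under the hood, xpath_text amounts to "find the first match, then collect the text of all nested elements". Here is a rough sketch of that behaviour using only the standard library; the real Grab relies on lxml and full XPath, whereas xml.etree supports only a subset of path expressions, so treat this as an approximation rather than Grab's actual code:

```python
import re
import xml.etree.ElementTree as ET

def xpath_text(doc, path):
    """Return the text of the first matching element, nested text included.

    A stdlib approximation of what Grab's xpath_text does: lxml would use
    full XPath and text_content(); here we use find() and itertext().
    """
    elem = doc.find(path)  # first match only, like Grab's xpath_text
    if elem is None:
        return None
    # itertext() walks text in all nested elements, mirroring text_content()
    return re.sub(r'\s+', ' ', ''.join(elem.itertext())).strip()

doc = ET.fromstring('<html><head><title>Demo <b>page</b></title></head></html>')
print(xpath_text(doc, './/title'))  # Demo page
```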

- grab.xpath: return the first element matching an XPath query
- grab.xpath_list: return all matching elements
- grab.xpath_text: return the text content of the first matching element, nested elements included
- grab.xpath_number: return the first number found in the text of the first matching element
It was not without its gotchas. The grab.xpath method returns the first element of the selection, while the xpath method of an ElementTree object returns the whole list. People tripped over this again and again. I should also note that there was an identical set of methods for CSS queries, i.e. grab.css, grab.css_list, grab.css_text and so on, but I personally gave up CSS expressions in favour of XPath. XPath is the more powerful tool, it usually makes sense to use it, and I did not want to see a jumble of CSS and XPath expressions in the code.

The above methods had a number of drawbacks:

First, when the element-selection code needed to be moved into a separate function, there was a temptation to pass the whole Grab object into it just to be able to call these methods on it. Either way you lose: you either pass around the entire Grab object, or you pass a bare DOM object that has no useful methods such as xpath_text.

Secondly, the functions grab.xpath and grab.xpath_list return bare ElementTree elements that no longer have methods like xpath_text.

Thirdly, although this is more a problem of the framework's extension system, one way or another the Grab object's namespace is cluttered with the many methods described above.

And, yes, fourthly. I was constantly asked how to get the HTML code of elements found with the grab.xpath and grab.xpath_list methods. People did not want to accept that grab is just a wrapper around lxml and that you simply need to read the manual at lxml.de.
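For the record, lxml answers that question with lxml.html.tostring(element). The standard library's ElementTree offers the same idea, so here is a minimal illustration using xml.etree rather than Grab's actual lxml stack:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<html><body><span id="color">green</span></body></html>')
span = doc.find('.//span')

# Serialize the element back to markup. With lxml the equivalent call is
# lxml.html.tostring(element); ElementTree exposes the same idea as tostring().
markup = ET.tostring(span, encoding='unicode')
print(markup)  # <span id="color">green</span>
```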

The new interface for working with the DOM tree is designed to eliminate these shortcomings. If you use the Scrapy framework, then the things described below will already be familiar to you. I want to talk about selectors.

Selectors



So what are selectors? They are wrappers around ElementTree elements. The initial wrapper wraps the root of the DOM document tree, i.e. it is built around the root html element. We can then use the select method to get a list of elements matching an XPath expression, and each such element is again wrapped in a Selector.
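The wrapping-and-rewrapping pattern can be sketched in a few lines of plain Python. This is a hypothetical toy built on the standard library, not Grab's actual implementation (which wraps lxml and supports full XPath), but it shows the idea:

```python
import xml.etree.ElementTree as ET

class SelectorList(list):
    """A list of Selectors; one() returns the first, as in Grab."""
    def one(self):
        return self[0]

class Selector(object):
    """A thin wrapper around an ElementTree node (illustrative sketch only)."""
    def __init__(self, node):
        self.node = node  # the raw ElementTree element stays reachable

    def select(self, path):
        # Every match is wrapped in a new Selector, so selections can be chained
        return SelectorList(Selector(n) for n in self.node.findall(path))

    def text(self):
        # Collect text from the node and everything nested inside it
        return ''.join(self.node.itertext()).strip()

root = Selector(ET.fromstring(
    '<html><body><h1>Header</h1><ul><li>Item 1</li></ul></body></html>'))
print(root.select('.//li').one().text())  # Item 1
```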

Let's see what we can do with selectors. First, we construct a selector:

>>> from grab.selector import Selector
>>> from lxml.html import fromstring
>>> root = Selector(fromstring('<html><body><h1>Header</h1><ul><li>Item 1</li><li><li>item 2</li></ul><span id="color">green</span>'))


Now let's make a selection with the select method and get back a list of new selectors. We can refer to the selector we want by index; there is also a one() method that returns the first selector. Note that to access the underlying ElementTree element directly, we use the node attribute of any selector.

>>> root.select('//ul')
<grab.selector.selector.SelectorList object at 0x7fe5f41922d0>
>>> root.select('//ul')[0]
<grab.selector.selector.Selector object at 0x7fe5f419bed0>
>>> root.select('//ul')[0].node
<Element ul at 0x7fe5f41a7a70>
>>> root.select('//ul').one()
<grab.selector.selector.Selector object at 0x7fe5f419bed0>


What can we do with the selectors we have found? We can extract the text content, try to extract a number from it, and even apply a regular expression.

>>> root.select('//ul/li')[0].text()
'Item 1'
>>> root.select('//ul/li')[0].number()
1
>>> root.select('//ul/li/text()')[0].rex('(\w+)').text()
'Item'


Note that above we used an index to refer to the first selector found. All of these calls can be shortened by omitting the index; the first selector is then used by default.

>>> root.select('//ul/li').text()
'Item 1'
>>> root.select('//ul/li').number()
1
>>> root.select('//ul/li/text()').rex('em (\d+)').text()
'1'
>>> root.select('//ul/li/text()').rex('em (\d+)').number()
1
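How might number() and rex() work internally? A plausible sketch (my guess at the behaviour shown above, not Grab's real code): number() pulls the first run of digits out of the text, and rex() applies a regular expression whose first group becomes the new text value.

```python
import re

class TextSelector(object):
    """Hypothetical selector over a plain text value, supporting the
    number() and rex() behaviour demonstrated in the transcript above."""
    def __init__(self, value):
        self.value = value

    def text(self):
        return self.value

    def number(self):
        # First run of digits in the text, so 'Item 1' yields 1
        match = re.search(r'\d+', self.value)
        return int(match.group(0)) if match else None

    def rex(self, pattern):
        # The first capture group becomes the text of a new selector
        match = re.search(pattern, self.value)
        return TextSelector(match.group(1)) if match else None

sel = TextSelector('Item 1')
print(sel.number())                  # 1
print(sel.rex(r'em (\d+)').text())   # 1
```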


What else is there? The html method returns the HTML code of the selector, and the exists method checks whether the selection matched anything. You can also call the select method on any selector.

>>> root.select('//span')[0].html()
u'<span id="color">green</span>'
>>> root.select('//span').exists()
True
>>> root.select('//god').exists()
False
>>> root.select('//ul')[0].select('./li[3]').text()
'item 2'
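The html and exists methods are easy to imagine: serialize the wrapped node, and truth-test the selection list. A standard-library sketch, illustrative only and not Grab's code:

```python
import xml.etree.ElementTree as ET

class SelectorList(list):
    def exists(self):
        # A selection "exists" when it matched at least one node
        return bool(self)

class Selector(object):
    def __init__(self, node):
        self.node = node

    def select(self, path):
        # select() works on any selector, so queries can be nested
        return SelectorList(Selector(n) for n in self.node.findall(path))

    def html(self):
        # Serialize the wrapped node back to markup
        return ET.tostring(self.node, encoding='unicode')

root = Selector(ET.fromstring(
    '<html><body><span id="color">green</span></body></html>'))
print(root.select('.//span')[0].html())  # <span id="color">green</span>
print(root.select('.//span').exists())   # True
print(root.select('.//god').exists())    # False
```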


How do you work with selectors directly from the Grab object? The doc attribute gives you the root selector of the DOM tree; from there, use the select method to make the selection you need:

>>> from grab import Grab
>>> g = Grab()
>>> g.go('http://habrahabr.ru/')
<grab.response.Response object at 0x2853410>
>>> print g.doc.select('//h1').text()
  ,      MIT
>>> print g.doc.select('//div[contains(@class, "post")][2]')[0].select('.//div[@class="favs_count"]').number()
60
>>> print g.doc.select('//div[contains(@class, "post")][2]')[0].select('.//div[@class="favs_count"]')[0].html()
<div class="favs_count" title=" ,    ">60</div>


The current implementation of selectors in Grab is still quite raw, but you can already get a feel for the new interface and judge it for yourself.

The version of Grab available through PyPI does not include selectors yet. If you want to play around with selectors, install Grab from the repository: bitbucket.org/lorien/grab . Specifically, the implementation of selectors is here.

I represent the GrabLab company: we do site scraping, with Grab and not only. If your company uses Grab, you can contact us about customizing Grab to suit your needs.

Source: https://habr.com/ru/post/173509/

