This data-collection framework has already been mentioned here in passing. The tool is really powerful and deserves more attention. In this review I will show you how to:

- create a spider that performs GET requests,
- extract data from an HTML document
- process and export data.
Installation
Requirements: Python 2.5+ (the 3.x branch is not supported), Twisted, lxml or libxml2, simplejson, pyopenssl (for HTTPS support).
It installed without any problems from the Ubuntu repositories. The Installation guide page describes installation on other Linux distributions, as well as on Mac OS X and Windows.
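On Ubuntu the whole installation comes down to one command; the package name below is an assumption that may differ between releases (pip install scrapy also works):
sudo apt-get install python-scrapy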
Task
Someone probably wants to scrape an online store and pull out the entire catalog with product descriptions and photos, but I will deliberately not do that. Let's take some open data instead, for example, a list of educational institutions. The site is fairly typical, and several techniques can be shown on it.
Before writing a spider, you need to examine the source site. Note that the site is built on frames (?!); in the frameset we look for the frame with the start page. It contains a search form. Suppose we only need universities in Moscow, so we fill in the appropriate field and click "Find".
Let's analyze the result. We get a page with pagination links, 15 universities per page. The filter parameters are passed via GET, and only the page value changes.
So, we formulate the problem:
- Go to abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
- Go through each page of results, changing the page value
- Go to each university description at abitur.nica.ru/new/www/vuz_detail.php?code=486&region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
- Save the detailed description of each university to a CSV file
Creating a project
Go to the folder where our project will live and create it:
scrapy startproject abitur
cd abitur
The abitur folder of our project contains the following files (see the layout sketch after this list):
- items.py contains classes that describe the fields of the data to be collected,
- pipelines.py lets you specify actions on opening/closing a spider and on saving data,
- settings.py contains the custom settings of the spider,
- spiders is the folder where the spider class files live. Each spider is usually written in a separate file named name_spider.py.
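For reference, a typical layout generated by scrapy startproject looks roughly like this (the spider file is the one we add ourselves):
abitur/
    scrapy.cfg
    abitur/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            abitur_spider.py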
Spider
In the new file spiders/abitur_spider.py we describe our spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AbiturSpider(CrawlSpider):
    name = "abitur"
    allowed_domains = ["abitur.nica.ru"]
    start_urls = ["http://abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1"]

    rules = (
        # follow the pagination links
        Rule(SgmlLinkExtractor(allow=('search\.php\?.+')), follow=True),
        # parse each university detail page with parse_item
        Rule(SgmlLinkExtractor(allow=('vuz_detail\.php\?.+')), callback='parse_item'),
    )
    # ...
Our class inherits from CrawlSpider, which lets us register link patterns; the spider itself extracts the matching links and follows them.
In order:
- name is the name of the spider and is used to run it,
- allowed_domains lists the site domains outside of which the spider should not look for anything,
- start_urls is the list of starting addresses,
- rules is the list of link-extraction rules.
As you noticed, a callback function is passed as a parameter in the second rule. We will come back to it shortly.
Items
As I said before, items.py contains classes that list the fields of the data to be collected.
This can be done like this:
from scrapy.item import Item, Field

class AbiturItem(Item):
    name = Field()
    state = Field()
    # ...
Parsed data can be processed before export. For example, an educational institution can be "state" or "non-state", and we may want to store this value as a boolean, or a date like "January 1, 2011" should be recorded as "01/01/2011".
For this there are input and output processors, so we define the state field differently:
import re

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose

class AbiturItem(Item):
    name = Field()
    # the input processor converts the "state"/"non-state" description into a boolean
    state = Field(input_processor=MapCompose(lambda s: not re.match(u'\s*', s)))
    # ...
MapCompose is applied to each element of the state list.
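To make that behaviour concrete, here is a small standalone sketch (the values are made up; the import path is the one used by the older Scrapy versions this article targets): MapCompose applies each of its functions to every element of the input list in turn.
from scrapy.contrib.loader.processor import MapCompose

# strip each value, then upper-case it
proc = MapCompose(lambda s: s.strip(), lambda s: s.upper())
print(proc([' state ', ' non-state ']))  # ['STATE', 'NON-STATE']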
Search for items on the page
We return to our parse_item method.
For each Item you can use your own loader; its purpose is, again, data processing.
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst
from scrapy.selector import HtmlXPathSelector

class AbiturLoader(XPathItemLoader):
    default_input_processor = MapCompose(lambda s: re.sub('\s+', ' ', s.strip()))
    default_output_processor = TakeFirst()

class AbiturSpider(CrawlSpider):
    # ...
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = AbiturLoader(AbiturItem(), hxs)
        l.add_xpath('name', '//td[@id="content"]/h1/text()')
        l.add_xpath('state', '//td[@id="content"]/div/span[@class="gray"]/text()')
        # ...
        return l.load_item()
In our case, leading/trailing and duplicate spaces are removed from every field. You can also add field-specific rules to the loader itself, the same rule we defined in the AbiturItem class:
class AbiturLoader(XPathItemLoader):
    # ...
    state_in = MapCompose(lambda s: not re.match(u'\s*', s))
So do it whichever way you prefer.
The parse_item() method returns an Item object, which is passed to the pipeline (described in pipelines.py). There you can write your own classes for saving data in formats that the standard Scrapy functionality does not provide, for example, export to MongoDB.
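As an illustration only (it is not part of this project), such a MongoDB pipeline might look roughly like the sketch below; it assumes a reasonably recent pymongo is installed, the host, database and collection names are made up, and the class still has to be registered in ITEM_PIPELINES in settings.py.
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        # open the connection when the spider starts
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['abitur']['universities']

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # every scraped item passes through here; store it and hand it on
        self.collection.insert_one(dict(item))
        return item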
The fields of this item are selected with XPath, which you can read about here or here. If you use FirePath, note that it adds a tbody tag inside tables. Use the built-in console to verify your XPath expressions.
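For example, after starting the console with scrapy shell followed by a page URL, you can check an expression like this (hxs is the selector object the shell provides in these older Scrapy versions):
# started with: scrapy shell "http://abitur.nica.ru/new/www/vuz_detail.php?..."
hxs.select('//td[@id="content"]/h1/text()').extract()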
One more note: when you use XPath, the results are returned as a list, so it is convenient to attach the TakeFirst output processor, which takes the first element of that list.
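A tiny illustrative check of that behaviour (again standalone, with made-up values; TakeFirst returns the first non-empty value):
from scrapy.contrib.loader.processor import TakeFirst

take_first = TakeFirst()
print(take_first(['', 'Moscow State University', 'something else']))  # 'Moscow State University'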
Launch
You can get the source code here. To launch it, go to the project folder and type in the console:
scrapy crawl abitur --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
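If you prefer, the same feed-export options can be set in settings.py instead of being passed on the command line (a sketch of the equivalent settings):
# settings.py
FEED_URI = 'scraped_data.csv'
FEED_FORMAT = 'csv'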
I have described everything only briefly, and this is just a small part of Scrapy's features:
- searching for and extracting data from HTML and XML,
- converting data before export,
- export to JSON, CSV, XML,
- downloading files,
- extending the framework with your own middlewares and pipelines,
- POST requests, cookie and session support, authentication,
- user-agent spoofing,
- shell debugging console,
- logging system,
- monitoring via a web interface,
- management via a telnet console.
It's impossible to describe everything in one article, so ask questions in the comments, read the documentation, and suggest topics for future articles about Scrapy.
The working example is posted on GitHub.