
Collecting data with Scrapy

Mentions of this data-collection framework have already slipped by here in passing. The tool is really powerful and deserves more attention. In this review I will show how to build a working spider with scrapy.

Installation



Requirements: Python 2.5+ (the 3.x branch is not supported), Twisted, lxml or libxml2, simplejson, pyopenssl (for HTTPS support)

It installs from the Ubuntu repositories without any problems. The Installation guide page describes installation on other Linux distributions, as well as on Mac OS X and Windows.
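For reference, a minimal way to install it by hand (assuming pip is available; the Ubuntu package name may differ between releases and Scrapy versions):

    pip install scrapy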

Task



Someone probably wants to parse an online store and pull out the entire catalog with product descriptions and photos, but I will deliberately not do that. Let's take some open data instead, for example a list of educational institutions. The site is quite typical, and it lets me show several techniques.

Before writing a spider, you need to examine the source site. Note that the site is built on frames (?!); in the frameset we look for the frame with the start page. It contains a search form. Suppose we only need universities in Moscow, so we fill in the appropriate field and click "Find".

Let's analyze the result. We get a page with pagination links and 15 universities per page. The filter parameters are passed via GET; only the page value changes.

So, we formulate the problem:

  1. Open abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
  2. Walk through each page of results, changing the value of the page parameter (see the sketch after this list)
  3. Open each university description, e.g. abitur.nica.ru/new/www/vuz_detail.php?code=486&region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
  4. Save the detailed description of each university to a CSV file
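
To make step 2 concrete, here is a tiny throwaway sketch (illustration only, not part of the spider) that prints the first few paginated search URLs; the spider below will let Scrapy discover these links by itself:

    # Illustration only: the search URL is fixed, only the page parameter changes.
    BASE = ("http://abitur.nica.ru/new/www/search.php"
            "?region=77&town=0&opf=0&type=0&spec=0"
            "&ed_level=0&ed_form=0&qualif=&substr=&page=%d")

    for page in range(1, 4):  # the first three result pages
        print BASE % page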


Creating a project



Go to the folder where our project will live and create it:

    scrapy startproject abitur
    cd abitur


The abitur folder of our project contains the following files:
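
A freshly generated project has the standard Scrapy layout:

    abitur/
        scrapy.cfg          -- project configuration file
        abitur/             -- the project's Python package
            __init__.py
            items.py        -- classes describing the fields of the collected data
            pipelines.py    -- post-processing of scraped items
            settings.py     -- project settings
            spiders/        -- spider modules go here
                __init__.py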



Spider



In the file spiders/abitur_spider.py that we create, we describe our spider:

    # imports for the Scrapy 0.x API used throughout this article
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class AbiturSpider(CrawlSpider):
        name = "abitur"
        allowed_domains = ["abitur.nica.ru"]
        start_urls = ["http://abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1"]

        rules = (
            Rule(SgmlLinkExtractor(allow=('search\.php\?.+')), follow=True),
            Rule(SgmlLinkExtractor(allow=('vuz_detail\.php\?.+')), callback='parse_item'),
        )

        # ...


Our class inherits from CrawlSpider, which lets us register link patterns (rules) that the spider will use to extract links and follow them on its own.

In order:

  1. name is the spider's name, used to launch it from the command line;
  2. allowed_domains restricts the crawl to the listed domains;
  3. start_urls is the list of pages the crawl starts from;
  4. rules describes which links the spider should extract: links matching search.php are simply followed, while links matching vuz_detail.php are passed to the parse_item callback.

As you noticed, a callback function is passed as a parameter in the second rule. We will come back to it shortly.

Items



As I said before, items.py contains classes that list the fields of the data to be collected.
This can be done like this:

    from scrapy.item import Item, Field

    class AbiturItem(Item):
        name = Field()
        state = Field()
        # ...


The parsed data can be processed before export. For example, an educational institution can be "state" or "non-state", and we may want to store this value as a boolean; or a date like "January 1, 2011" may need to be recorded as "01/01/2011".

For this there are input and output processors, so we declare the state field differently:

    import re

    from scrapy.item import Item, Field
    from scrapy.contrib.loader.processor import MapCompose

    class AbiturItem(Item):
        name = Field()
        # input processor applied to the raw 'state' text before it is stored
        state = Field(input_processor=MapCompose(lambda s: not re.match(u'\s*', s)))
        # ...


MapCompose is applied to each element of the state value list.
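
To get a feel for what an input processor does, you can call MapCompose by hand (a minimal sketch using the old scrapy.contrib.loader.processor module path that the rest of the article assumes):

    from scrapy.contrib.loader.processor import MapCompose

    strip_and_lower = MapCompose(unicode.strip, unicode.lower)
    print strip_and_lower([u'  State  ', u' Non-state '])
    # [u'state', u'non-state'] -- each function is applied to every list element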

Search for items on the page



We return to our parse_item method.

You can use a dedicated loader for each item type; its purpose, again, is data processing.

    import re

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose, TakeFirst
    from scrapy.selector import HtmlXPathSelector


    class AbiturLoader(XPathItemLoader):
        default_input_processor = MapCompose(lambda s: re.sub('\s+', ' ', s.strip()))
        default_output_processor = TakeFirst()


    class AbiturSpider(CrawlSpider):
        # ...

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            l = AbiturLoader(AbiturItem(), hxs)

            l.add_xpath('name', '//td[@id="content"]/h1/text()')
            l.add_xpath('state', '//td[@id="content"]/div/span[@class="gray"]/text()')
            # ...

            return l.load_item()


In our case, leading, trailing and repeated whitespace is stripped from every field. You can also attach field-specific rules to the loader itself, the same thing we did in the AbiturItem class:

    class AbiturLoader(XPathItemLoader):
        # ...
        state_in = MapCompose(lambda s: not re.match(u'\s*', s))


Use whichever approach you prefer.

The parse_item() function returns an Item object, which is passed to the pipeline (described in pipelines.py). There you can write your own classes for saving data in formats that the standard Scrapy functionality does not provide, for example export to MongoDB.
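
As an illustration only (this is not part of the article's project), a MongoDB export pipeline could look roughly like this; it assumes pymongo is installed, a local MongoDB server is running, and the class is enabled through the ITEM_PIPELINES setting in settings.py (the database and collection names are made up):

    # pipelines.py -- a sketch, not the article's code
    import pymongo

    class MongoDBPipeline(object):
        def __init__(self):
            client = pymongo.MongoClient('localhost', 27017)  # assumes pymongo 3.x
            self.collection = client['abitur']['universities']

        def process_item(self, item, spider):
            # store the item as a plain dict and pass it on unchanged
            self.collection.insert_one(dict(item))
            return item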

The item fields are selected using XPath expressions (there is plenty of material on XPath elsewhere). If you use FirePath, note that it adds a tbody tag inside tables. Use the built-in console (scrapy shell) to verify your XPath paths.
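
A console session for checking an expression looks roughly like this (using the old HtmlXPathSelector-based shell, where the hxs object is created for you):

    scrapy shell "<university detail page URL>"
    >>> hxs.select('//td[@id="content"]/h1/text()').extract()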

One more note: when you use XPath, the results are returned as a list, so it is convenient to plug in the TakeFirst output processor, which takes the first element of that list.
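
A one-line sketch of what TakeFirst does (same old module path as above):

    from scrapy.contrib.loader.processor import TakeFirst

    print TakeFirst()([u'Moscow State University', u'another value'])
    # u'Moscow State University' -- the first non-empty value is returned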

Launch



You can get the source code here; to run it, go to the project folder and type in the console:

    scrapy crawl abitur --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv


In short, I have described the basics, but this is only a small part of Scrapy's features.


It's impossible to describe everything in one article, so ask questions in the comments, read the documentation, and suggest topics for future articles about Scrapy.

A working example is posted on GitHub.

Source: https://habr.com/ru/post/115710/

