This data-collection framework has already been mentioned here in passing. The tool is really powerful and deserves more attention. In this review I will show you how to:

- create a spider that performs GET requests,
- extract data from an HTML document
- process and export data.
Installation
Requirements: Python 2.5+ (the 3.x branch is not supported), Twisted, lxml or libxml2, simplejson, pyopenssl (for HTTPS support).
It installed without any problems from the Ubuntu repositories. The Installation guide page describes installation on other Linux distributions, as well as on Mac OS X and Windows.
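On Ubuntu the whole installation comes down to one command; the package name below is an assumption that may differ between releases (pip install scrapy also works):
sudo apt-get install python-scrapy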
Task
Someone probably wants to scrape an online store and pull out the entire catalog with product descriptions and photos, but I will deliberately not do that. Let's take some open data instead, for example, a list of educational institutions. The site is fairly typical, and several techniques can be shown on it.
Before writing a spider, you need to examine the source site. Note that the site is built on frames (?!); in the frameset we look for the frame with the start page. It contains a search form. Suppose we only need universities in Moscow, so we fill in the appropriate field and click "Find".
Let's analyze the result. We get a page with pagination links, 15 universities per page. The filter parameters are passed via GET, and only the page value changes.
So, we formulate the problem:
- Go to abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
- Go through each page of results, changing the page value
- Go to each university description at abitur.nica.ru/new/www/vuz_detail.php?code=486&region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1
- Save the detailed description of each university to a CSV file
Creating a project
Go to the folder where our project will live and create it:
scrapy startproject abitur
cd abitur
The abitur folder of our project contains the following files (see the layout sketch after this list):
- items.py contains classes that describe the fields of the data to be collected,
- pipelines.py lets you specify actions on opening/closing a spider and on saving data,
- settings.py contains the custom settings of the spider,
- spiders is the folder where the spider class files live. Each spider is usually written in a separate file named name_spider.py.
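For reference, a typical layout generated by scrapy startproject looks roughly like this (the spider file is the one we add ourselves):
abitur/
    scrapy.cfg
    abitur/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            abitur_spider.py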
Spider
In the new file spiders/abitur_spider.py we describe our spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AbiturSpider(CrawlSpider):
    name = "abitur"
    allowed_domains = ["abitur.nica.ru"]
    start_urls = ["http://abitur.nica.ru/new/www/search.php?region=77&town=0&opf=0&type=0&spec=0&ed_level=0&ed_form=0&qualif=&substr=&page=1"]

    rules = (
        # follow the pagination links
        Rule(SgmlLinkExtractor(allow=('search\.php\?.+')), follow=True),
        # parse each university detail page with parse_item
        Rule(SgmlLinkExtractor(allow=('vuz_detail\.php\?.+')), callback='parse_item'),
    )
    # ...
Our class inherits from CrawlSpider, which lets us register link patterns; the spider itself extracts the matching links and follows them.
In order:
- name is the name of the spider and is used to run it,
- allowed_domains lists the site domains outside of which the spider should not look for anything,
- start_urls is the list of starting addresses,
- rules is the list of link-extraction rules.
As you noticed, a callback function is passed as a parameter in the second rule. We will come back to it shortly.
Items
As I said before, items.py contains classes that list the fields of the data to be collected.
This can be done like this:
from scrapy.item import Item, Field

class AbiturItem(Item):
    name = Field()
    state = Field()
    # ...
Parsed data can be processed before export. For example, an educational institution can be "state" or "non-state", and we may want to store this value as a boolean, or a date like "January 1, 2011" should be recorded as "01/01/2011".
For this there are input and output processors, so we define the state field differently:
import re

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose

class AbiturItem(Item):
    name = Field()
    # the input processor converts the "state"/"non-state" description into a boolean
    state = Field(input_processor=MapCompose(lambda s: not re.match(u'\s*', s)))
    # ...
MapCompose is applied to each element of the state list.
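To make that behaviour concrete, here is a small standalone sketch (the values are made up; the import path is the one used by the older Scrapy versions this article targets): MapCompose applies each of its functions to every element of the input list in turn.
from scrapy.contrib.loader.processor import MapCompose

# strip each value, then upper-case it
proc = MapCompose(lambda s: s.strip(), lambda s: s.upper())
print(proc([' state ', ' non-state ']))  # ['STATE', 'NON-STATE']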
Search for items on the page
We return to our parse_item method.
For each Item you can use your own loader; its purpose is, again, data processing.
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst
from scrapy.selector import HtmlXPathSelector

class AbiturLoader(XPathItemLoader):
    default_input_processor = MapCompose(lambda s: re.sub('\s+', ' ', s.strip()))
    default_output_processor = TakeFirst()

class AbiturSpider(CrawlSpider):
    # ...
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = AbiturLoader(AbiturItem(), hxs)
        l.add_xpath('name', '//td[@id="content"]/h1/text()')
        l.add_xpath('state', '//td[@id="content"]/div/span[@class="gray"]/text()')
        # ...
        return l.load_item()
In our case, leading/trailing and duplicate spaces are removed from every field. You can also add field-specific rules to the loader itself, the same rule we defined in the AbiturItem class:
class AbiturLoader(XPathItemLoader):
    # ...
    state_in = MapCompose(lambda s: not re.match(u'\s*', s))
So do it whichever way you prefer.
The parse_item() method returns an Item object, which is passed to the pipeline (described in pipelines.py). There you can write your own classes for saving data in formats that the standard Scrapy functionality does not provide, for example, export to MongoDB.
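As an illustration only (it is not part of this project), such a MongoDB pipeline might look roughly like the sketch below; it assumes a reasonably recent pymongo is installed, the host, database and collection names are made up, and the class still has to be registered in ITEM_PIPELINES in settings.py.
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        # open the connection when the spider starts
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['abitur']['universities']

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # every scraped item passes through here; store it and hand it on
        self.collection.insert_one(dict(item))
        return item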
The fields of this item are selected with XPath, which you can read about here or here. If you use FirePath, note that it adds a tbody tag inside tables. Use the built-in console to verify your XPath expressions.
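For example, after starting the console with scrapy shell followed by a page URL, you can check an expression like this (hxs is the selector object the shell provides in these older Scrapy versions):
# started with: scrapy shell "http://abitur.nica.ru/new/www/vuz_detail.php?..."
hxs.select('//td[@id="content"]/h1/text()').extract()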
One more note: when you use XPath, the results are returned as a list, so it is convenient to attach the TakeFirst output processor, which takes the first element of that list.
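A tiny illustrative check of that behaviour (again standalone, with made-up values; TakeFirst returns the first non-empty value):
from scrapy.contrib.loader.processor import TakeFirst

take_first = TakeFirst()
print(take_first(['', 'Moscow State University', 'something else']))  # 'Moscow State University'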
Launch
You can get the source code here. To launch it, go to the project folder and type in the console:
scrapy crawl abitur --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
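If you prefer, the same feed-export options can be set in settings.py instead of being passed on the command line (a sketch of the equivalent settings):
# settings.py
FEED_URI = 'scraped_data.csv'
FEED_FORMAT = 'csv'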
I have described everything only briefly, and this is just a small part of Scrapy's features:
- searching for and extracting data from HTML and XML,
- converting data before export,
- export to JSON, CSV, XML,
- downloading files,
- extending the framework with your own middlewares and pipelines,
- POST requests, cookie and session support, authentication,
- user-agent spoofing,
- shell debugging console,
- logging system,
- monitoring via a web interface,
- management via a telnet console.
It's impossible to describe everything in one article, so ask questions in the comments, read the documentation, and suggest topics for future articles about Scrapy.
The working example is posted on GitHub.