
Scrapy: collect data and save to database

Introduction


This framework for collecting information from websites caught my interest. There have already been publications about Scrapy, but since there is little detailed information in Russian, I would like to share my own experience.

Task


  1. Go to the page with the list of admitted applicants, oreluniver.ru/abits?src=all_postupil . Then follow each link and collect the data on the applicants and the points they scored.
  2. From the same page, collect the data on the specialties for which admission was held.
  3. Save all the results to a database.

Solution


To solve the problem I used Python 2.7, Scrapy 1.1, SQLAlchemy 1 and SQLite. I installed everything as described in the documentation; the article from the list of sources also covers the installation in Russian, along with the creation of the spider itself. Here is what I ended up with.
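For reference, the setup boils down to roughly the following commands (a sketch: the project name gtudata is taken from the imports used later, and versions should be pinned to match the environment above if needed):

 pip install Scrapy SQLAlchemy
 scrapy startproject gtudata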

Project structure:

\spiders
\spiders\__init__.py
\spiders\abiturlist.py
\spiders\SpecSpider.py
__init__.py
items.py
pipelines.py
settings.py

items.py file

from scrapy.item import Item, Field


class SpecItem(Item):
    spec = Field()
    SpecName = Field()


class GtudataItem(Item):
    family = Field()
    name = Field()
    surname = Field()
    spec = Field()
    ball = Field()
    url = Field()
    pagespec = Field()

Here is the spider class for getting the list of applicants.
abiturlist.py file

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import TakeFirst, Identity
from scrapy.loader import ItemLoader
from scrapy.selector import HtmlXPathSelector, Selector

from gtudata.items import GtudataItem


class AbiturLoader(ItemLoader):
    default_output_processor = Identity()


class AbiturlistSpider(CrawlSpider):
    name = "abiturlist"
    allowed_domains = ["oreluniver.ru"]
    start_urls = ["http://oreluniver.ru/abits?src=all_postupil"]

    rules = (
        Rule(LinkExtractor(allow=('spec_id=04.03.01')), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = Selector(response)
        # all table rows except the header row
        all = hxs.xpath("//tr[position()>1]")
        # text block at the top of the page (stored in pagespec)
        pg_spec = hxs.xpath("//div[@class='page-content']/b/div/text()").extract()[0].strip()
        for fld in all:
            Item = GtudataItem()
            # the second cell holds the full name: family name, first name, patronymic
            FIO = fld.xpath("./td[2]/p/text()").extract()[0].split()
            Item['family'] = FIO[0]
            Item['name'] = FIO[1]
            Item['surname'] = FIO[2]
            Item['spec'] = fld.xpath("./td[last()]/p/text()").extract()[0]
            ball = fld.xpath("string(./td[3]/p)").extract()[0]
            Item['ball'] = ball
            Item['url'] = response.url
            Item['pagespec'] = pg_spec
            yield Item

The next class describes the spider that collects the list of specialties. XPath expressions are used to locate the fields. To separate a specialty's number from its name, string slices are used: the number together with the separating space occupies the first nine characters of the line, so the number itself is txt[:8] and the name starts at txt[9:].

SpecSpider.py file
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector, Selector

from gtudata.items import SpecItem


class SpecSpider(CrawlSpider):
    name = "speclist"
    allowed_domains = ["oreluniver.ru"]
    start_urls = ["http://oreluniver.ru/abits?src=all_postupil"]

    rules = (
        Rule(LinkExtractor(allow=('src=all_postupil')), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = Selector(response)
        # text of every link that leads to a specialty page
        all = hxs.xpath('//a[contains(@href, "spec_id")]/text()').extract()
        for fld in all:
            txt = fld.strip()
            Item = SpecItem()
            Item['SpecName'] = txt[9:]   # specialty name
            Item['spec'] = txt[:8]       # specialty number, e.g. 04.03.01
            yield Item
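For illustration, here is how those slices split the link text (the string below is made up for the example; only the format "number, space, name" comes from the site):

 txt = "04.03.01 Informatics and Computer Science"   # hypothetical link text
 spec = txt[:8]        # "04.03.01" - the specialty number
 spec_name = txt[9:]   # "Informatics and Computer Science" - the name, after the space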

I would like to point out that several item classes can be defined for the collected data in the items.py file; in my case these are SpecItem and GtudataItem shown above.


Saving results to the database


The pipelines.py file describes the steps for saving the data. To create an SQLite database with the required table structure, I used SQLAlchemy.

First, we create a base class with declarative_base(), from which we inherit the classes describing the database tables where the collected information is stored: SpecTable, for the list of specialties, and DataTable, for the applicants' data.

In each class we set the __tablename__ attribute, which is the name of the table in the database, and then define the fields:

 id = Column(Integer, primary_key=True) 

id - integer, primary key.

The remaining fields are defined in the same way, for example the specialty number:

 spec = Column(String) 

In the __init__() method we fill in the fields of the table.
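Putting these pieces together, the class for the table of specialties looks like this (taken from the full pipelines.py listing below):

 class SpecTable(Base):
     __tablename__ = 'specdata'

     id = Column(Integer, primary_key=True)
     spec = Column(String)
     spectitle = Column(String)

     def __init__(self, spec, spectitle):
         self.spec = spec
         self.spectitle = spectitle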

The GtudataPipeline class describes how the database is used. During initialization we check whether the database file exists in the project folder; if it is missing, we create a database with the required structure:

 Base.metadata.create_all(self.engine) 
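For context, here is the surrounding fragment of GtudataPipeline.__init__() (taken from the full file below):

 basename = 'data_scraped'
 self.engine = create_engine("sqlite:///%s" % basename, echo=False)
 # the tables are created only if the database file does not exist yet
 if not os.path.exists(basename):
     Base.metadata.create_all(self.engine)
 self.fio = set()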

The process_item method describes the actual saving to the database. We check which class the item belongs to and, depending on that, fill one of the two tables by creating an instance of DataTable or SpecTable:

 # for a GtudataItem:
 dt = DataTable(item['family'], item['name'], item['surname'],
                item['spec'], item['ball'], item['url'], item['pagespec'])
 # for a SpecItem:
 dt = SpecTable(item['spec'], item['SpecName'])

To ensure the uniqueness of the stored data (the same applicant can appear in several lists), the fio attribute is used. It is a set whose elements are formed as follows:

 fio = item['family'] + item['name'] + item['surname'] 

If such an applicant has already been added, the record is not saved:

 if fio not in self.fio:
     dt = DataTable(item['family'], item['name'], item['surname'],
                    item['spec'], item['ball'], item['url'], item['pagespec'])
     self.fio.add(fio)
     self.session.add(dt)

Add a new entry:

 self.session.add(dt) 

When opening a spider, create a session:

 def open_spider(self, spider):
     self.session = Session(bind=self.engine)

When closing the spider we complete the changes:

 def close_spider(self, spider):
     self.session.commit()
     self.session.close()

Here is the complete file:

pipelines.py file

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy.orm import Session
import os

from gtudata.items import SpecItem, GtudataItem
from scrapy.exceptions import DropItem

Base = declarative_base()


class SpecTable(Base):
    __tablename__ = 'specdata'

    id = Column(Integer, primary_key=True)
    spec = Column(String)
    spectitle = Column(String)

    def __init__(self, spec, spectitle):
        self.spec = spec
        self.spectitle = spectitle

    def __repr__(self):
        return "<Data %s, %s>" % (self.spec, self.spectitle)


class DataTable(Base):
    __tablename__ = 'gtudata'

    id = Column(Integer, primary_key=True)
    family = Column(String)
    name = Column(String)
    surname = Column(String)
    spec = Column(String)
    ball = Column(Integer)
    url = Column(String)
    pagespec = Column(String)

    def __init__(self, family, name, surname, spec, ball, url, pagespec):
        self.family = family
        self.name = name
        self.surname = surname
        self.spec = spec
        self.ball = ball
        self.url = url
        self.pagespec = pagespec

    def __repr__(self):
        return "<Data %s, %s, %s, %s, %s, %s, %s>" % \
            (self.family, self.name, self.surname, self.spec,
             self.ball, self.url, self.pagespec)


class GtudataPipeline(object):
    def __init__(self):
        basename = 'data_scraped'
        self.engine = create_engine("sqlite:///%s" % basename, echo=False)
        # create the tables only if the database file does not exist yet
        if not os.path.exists(basename):
            Base.metadata.create_all(self.engine)
        self.fio = set()

    def process_item(self, item, spider):
        if isinstance(item, GtudataItem):
            # skip duplicate applicants within this crawl
            fio = item['family'] + item['name'] + item['surname']
            if fio not in self.fio:
                dt = DataTable(item['family'], item['name'], item['surname'],
                               item['spec'], item['ball'], item['url'], item['pagespec'])
                self.fio.add(fio)
                self.session.add(dt)
        elif isinstance(item, SpecItem):
            dt = SpecTable(item['spec'], item['SpecName'])
            self.session.add(dt)
        return item

    def close_spider(self, spider):
        self.session.commit()
        self.session.close()

    def open_spider(self, spider):
        self.session = Session(bind=self.engine)
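One detail not shown above: for Scrapy to call this pipeline at all, it has to be enabled in settings.py. A minimal sketch, assuming the standard layout produced by scrapy startproject gtudata (300 is just the conventional priority value):

 ITEM_PIPELINES = {
     'gtudata.pipelines.GtudataPipeline': 300,
 }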

Run in the project folder:

 scrapy crawl speclist
 scrapy crawl abiturlist

And we get the result. The full version of the project is posted on GitHub.
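To quickly check what ended up in the database, the data_scraped file can be opened directly; a minimal sketch using the standard sqlite3 module (the table names gtudata and specdata and the column names come from pipelines.py above):

 import sqlite3

 con = sqlite3.connect('data_scraped')
 print con.execute('SELECT COUNT(*) FROM gtudata').fetchone()[0]
 print con.execute('SELECT spec, spectitle FROM specdata LIMIT 3').fetchall()
 con.close()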

List of sources


  1. Scrapy documentation
  2. Collect data with Scrapy
  3. Sqlalchemy

Source: https://habr.com/ru/post/308660/

