
Using Grab:Spider for parsing sites

Hello!

I am an active user of the open-source framework Grab (itforge has already written about it here and here) and one half of the GrabLab project (which handles the actual commercial use of the framework). Since we parse sites often, in large volumes, and the tasks usually differ completely from one another, I would like to share my experience of structuring a typical parsing project.

A few words about the toolkit that helps me in this work.

As a working browser I use Firefox with the HttpFox (analyzes incoming/outgoing HTTP traffic), XPather (lets you test xpath expressions) and SQLite Manager (browses sqlite tables) plug-ins; I write code in Emacs, where I make heavy use of snippets (YASnippets) for common constructs.
Because of how the framework works, as a rule, at the first stage the site is saved completely (or, when there is a lot of data, partially, for convenience of further development) into a local mongodb-based cache, which saves a lot of time since page reads then come from the cache.


The extracted data usually has to be put into an SQL database (less often json/xml), and for that we use the SQLAlchemy ORM.

The Grab framework itself actually allows great flexibility in how a project is built and controlled. Nevertheless, the last few projects fit nicely into the following structure, well known to anyone involved in web development:

1) models.py - describes the data models
2) config.py - an analogue of settings.py from the Django world: settings and ORM initialization.
3) /spiders/*.py - the spiders' code
4) spider.py or project_name.py - the main project file; it usually also implements a command-line interface for launching individual spiders, since a site is often parsed in parts (a sketch of such a file follows below).
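
Purely for illustration, such a main file could look roughly like the sketch below. This is an assumption-laden sketch, not the code of the example project: the spider class names and module paths (spiders.trending.TrendingSpider, spiders.lang_python.LangPythonSpider) are invented for the illustration.

# project_name.py -- illustrative sketch of a command-line entry point
import sys

from config import default_spider_params
from spiders.trending import TrendingSpider        # hypothetical module/class names
from spiders.lang_python import LangPythonSpider   # hypothetical module/class names

SPIDERS = {
    'trending': TrendingSpider,
    'python': LangPythonSpider,
}


def main():
    # e.g. "python project_name.py python" runs the Most Watched Python spider
    name = sys.argv[1] if len(sys.argv) > 1 else 'trending'
    bot = SPIDERS[name](**default_spider_params())
    bot.run()


if __name__ == '__main__':
    main()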

As an example, let's write a parser for the “Trending projects” and “Most Popular Python projects” lists from GitHub, the stronghold of open source.

The full code of the example can be found here.

First, you need to describe the model.

import datetime

from sqlalchemy import Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Item(Base):
    __tablename__ = 'item'
    # for sqlite autoincrement to take effect it has to be passed via __table_args__
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column(Integer, primary_key=True)
    title = Column(String(160))
    author = Column(String(160))
    description = Column(String(255))
    url = Column(String(160))
    last_update = Column(DateTime, default=datetime.datetime.now)


Next, in config.py, the ORM is initialized, the tables are created and the constants are defined; it also holds default_spider_params(), a function shared by all spiders in the project that builds the launch parameters for a spider depending on the settings.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from models import Base


def init_engine():
    db_engine = create_engine(
        'sqlite+pysqlite:///data.sqlite', encoding='utf-8')
    Base.metadata.create_all(db_engine)
    return db_engine

db_engine = init_engine()
Session = sessionmaker(bind=db_engine)


def default_spider_params():
    params = {
        'thread_number': MAX_THREADS,
        'network_try_limit': 20,
        'task_try_limit': 20,
    }
    if USE_CACHE:
        params.update({
            'thread_number': 3,
            'use_cache': True,
            'cache_db': CACHE_DB,
            'debug_error': True,
        })
    return params


In most cases there is no need to run mongodb on the server, so it is convenient to be able to switch the cache off: when handing the project over I simply set USE_CACHE = False and everything keeps working. SAVE_TO_DB is used to enable/disable writing data to the database.
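
For reference, the corresponding block of constants in config.py could look roughly like this; the values and the cache database name are illustrative, not taken from the example project.

# config.py -- constants (illustrative values)
MAX_THREADS = 10            # number of worker threads for the spider
USE_CACHE = True            # set to False on servers where mongodb is not installed
CACHE_DB = 'github_cache'   # name of the mongodb database used as the page cache
SAVE_TO_DB = True           # enable/disable writing parsed items to the database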

Now we get to the most interesting part: we will have two spiders - the first parses the five Top Trending repositories and the second the Most Watched Python projects.

It is easy to see that these spiders share common parts, which can and should be moved into a separate base class to be inherited from: this reduces the amount of code, simplifies maintenance and makes the program more readable. In any more or less complex project, where there are many pages that differ only slightly from each other, the need to move part of the functionality into superclasses comes up constantly.

So, not neglecting OOP, let's write a BaseHubSpider in which we define two methods: save() and log_progress().

from grab.spider import Spider

from config import Session, SAVE_TO_DB
from models import Item


class BaseHubSpider(Spider):
    initial_urls = ['http://github.com']
    items_total = 0

    def save(self, data):
        if not SAVE_TO_DB:
            return
        session = Session()
        if not session.query(Item).filter_by(title=data['title']).first():
            obj = Item(**data)
            session.add(obj)
            session.commit()

    def log_progress(self, str):
        self.items_total += 1
        print "(%d) Item scraped: %s" % (self.items_total, str)
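
To make the inheritance concrete, here is a rough sketch of what a spider derived from BaseHubSpider might look like. It is only a sketch: the URL, the xpath expressions and the module layout are simplified assumptions, and the real code lives in the repository linked below.

# spiders/lang_python.py -- simplified sketch, not the exact code from the repository
from base import BaseHubSpider   # assuming the base class lives in base.py


class LangPythonSpider(BaseHubSpider):
    # every URL from initial_urls is handled by task_initial()
    initial_urls = ['https://github.com/languages/Python/most_watched']

    def task_initial(self, grab, task):
        # iterate over the repository rows (simplified xpath)
        repos = grab.xpath_list(
            '//table[@class="repo"]//tr/td[@class="title"]/..')
        for repo in repos:
            data = {
                'author': repo.xpath('./td[@class="owner"]/a/text()')[0],
                'title': repo.xpath('./td[@class="title"]/a/text()')[0],
            }
            self.save(data)
            self.log_progress(data['title'])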


In a real application the page-parsing function very often depends on parameters: for example, the field names differ from page to page while the xpath to them is almost the same, and so on.

For example, something like this (not a working example, just an illustration for better understanding):

XPATH = u'//table[@class="standart-table table"]' + \
        u'//tr[th[text() = "%s"]]/td'

values = (
    ('title', u' '),
    ('rating', u''),
    ('categories', u' '),
    ('description', u''),
)

for db_field, field_title in values:
    try:
        data[db_field] = get_node_text(grab.xpath(
            XPATH % field_title, None)).strip()
    except AttributeError:
        data[db_field] = ''


https://github.com/istinspring/grab-default-project-example/blob/master/spiders/lang_python.py

Here is the spider code that parses the 20 most popular Python projects and saves them to the database.

Note this fragment:

repos = grab.xpath_list(
    '//table[@class="repo"]//tr/td[@class="title"]/..')
for repo in repos:
    data = {
        'author': repo.xpath('./td[@class="owner"]/a/text()')[0],
        'title': repo.xpath('./td[@class="title"]/a/text()')[0],
    }

repos = grab.xpath_list('...') returns a list of lxml objects, while grab.xpath('...'), for example, returns the first matching element, because here xpath is a method of the grab object. Inside the loop, repo.xpath('./h3/a[1]/text()') is called on an lxml node and gives back a list (or an exception if lxml cannot evaluate the xpath). Simply put, xpath on the grab object and xpath on an lxml object are different things: the first returns a single element (or a default value, or raises an exception), while the second returns a list of elements like ['something'].
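
To make the difference concrete, a tiny illustration (the paths here are made up):

# grab.xpath() -- method of the grab object: returns the FIRST matching
# lxml element (or the given default, or raises an exception if nothing matched)
first_link = grab.xpath('//td[@class="title"]/a')

# grab.xpath_list() -- also a method of the grab object: returns a LIST of lxml elements
rows = grab.xpath_list('//table[@class="repo"]//tr')

# node.xpath() -- method of an lxml element: always returns a list (possibly empty)
for row in rows:
    titles = row.xpath('./td[@class="title"]/a/text()')  # e.g. [u'something'] or []
    if titles:
        print titles[0]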

This may be hard to grasp just from reading, but as soon as you run into it in practice, remember this paragraph.

I hope the information was useful. Comrade itforge works tirelessly on the development of the open-source product Grab; the documentation for grab is available in Russian, but for grab:spider, unfortunately, only the introductory part is available so far.

For questions on the framework, we have a jabber conference at grablab@conference.jabber.ru

Source: https://habr.com/ru/post/142212/

