
Using morph.io for web scraping

If you have read the previous two articles, Web Scraping with Ruby and Advanced Web Scraping with Mechanize, then you already have a basic idea of how to write a scraper that extracts structured data from a web site.

The next logical step is to run the scraper regularly so the data is always fresh. This is exactly what morph.io, built by the talented people at OpenAustralia, lets you do.

Morph.io positions itself as "a Heroku for scrapers". You can run scrapers manually or have them run automatically every day. Either way, you can then use the API to extract the data as JSON or CSV and use it in your application, or simply download the sqlite database.
Morph.io fills the gap left by Scraperwiki Classic. Scrapers on morph.io are hosted on GitHub, which means you can fork them and fix them if they ever stop working.

Create a scraper


We will use the code from my previous post to show how easy it is to get your scraper running on morph.io.

You log in to morph.io with your GitHub account. After authorizing you can create a scraper. At the moment morph.io supports scrapers written in Ruby, PHP, Python or Perl. Select a language and set the name of your scraper; I called mine pitchfork_scraper. Then click the "Create Scraper" button to create a new GitHub repository containing a skeleton scraper in the language you selected.

Clone the repository created in the previous step; in my case it looks like this:

 git clone https://github.com/chrismytton/pitchfork_scraper 

The repository will contain the files README.md and scraper.rb .

Morph.io expects two things from a scraper. First, the scraper's repository must contain a scraper.rb file (for Ruby scrapers). Second, the scraper must write its data to an sqlite database called data.sqlite.

Note on file names
For a Python scraper the file should be called scraper.py, for PHP scraper.php, and for Perl scraper.pl.
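
To make these requirements concrete, here is a minimal sketch of what a morph.io-compatible Ruby scraper might look like. The URL, selector and field name are placeholders for illustration only; the parts morph.io actually cares about are the file name and the data.sqlite database.

require 'mechanize'
require 'scraperwiki'

# Point the scraperwiki gem at the database morph.io expects
ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }

agent = Mechanize.new
page = agent.get('http://example.com/')

page.search('h1').each do |heading|
  # Each record goes into data.sqlite; :title is the unique key here
  ScraperWiki.save_sqlite([:title], { title: heading.text.strip })
end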


To meet these requirements we need to make a small change to our scraper so that it writes its data to the sqlite database instead of printing JSON to STDOUT.

First add the code from the previous article to scraper.rb, then change it to use the scraperwiki gem to write the data to the sqlite database.

diff --git a/scraper.rb b/scraper.rb
index 2d2baaa..f8b14d6 100644
--- a/scraper.rb
+++ b/scraper.rb
@@ -1,6 +1,8 @@
 require 'mechanize'
 require 'date'
-require 'json'
+require 'scraperwiki'
+
+ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }
 
 agent = Mechanize.new
 page = agent.get("http://pitchfork.com/reviews/albums/")
@@ -34,4 +36,6 @@ reviews = review_links.map do |link|
   }
 end
 
-puts JSON.pretty_generate(reviews)
+reviews.each do |review|
+  ScraperWiki.save_sqlite([:artist, :album], review)
+end

This code uses the ScraperWiki.save_sqlite method to save each review to the database. The first argument is the list of fields that should uniquely identify a record; here we use the artist and the album, since it is unlikely that the same artist will release two albums with the same name.
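
In other words, saving a record whose artist and album already exist updates that row instead of inserting a duplicate. A small sketch of the idea (the score field is made up for illustration):

require 'scraperwiki'

ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }

review = { artist: 'Example Artist', album: 'Example Album', score: 7.0 }
ScraperWiki.save_sqlite([:artist, :album], review)

# Same artist/album, new score: the existing row is updated, not duplicated
ScraperWiki.save_sqlite([:artist, :album], review.merge(score: 7.5))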
To run the code locally you need to install the scraperwiki Ruby gem in addition to the dependencies we already have.

 gem install scraperwiki 

Then you can run the code on the local machine:

 ruby scraper.rb 

As a result, a new file called data.sqlite will appear in the current directory, containing the scraped data.
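
If you want a quick look at what ended up in the database, you can query it with the sqlite3 gem (assumed to be installed separately; the column names match the fields the scraper saves):

require 'sqlite3'

db = SQLite3::Database.new('data.sqlite')
db.results_as_hash = true

# Print the first few scraped reviews from the "data" table
db.execute('SELECT artist, album FROM data LIMIT 5').each do |row|
  puts "#{row['artist']} - #{row['album']}"
end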

Run the scraper on morph.io


Now that we have made all the necessary changes, we can run the code on morph.io. First, commit the changes with git commit and push them to the GitHub repository with git push.
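
Something along these lines (the commit message is just an example):

 git commit -am "Save reviews to data.sqlite"
 git push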

Now you can run the scraper from its page on morph.io and the results will be added to the database there. It should look something like this:
[screenshot: the scraper's results page on morph.io]
As you can see, the data is available to authorized users in JSON or CSV format, or you can download the sqlite database and inspect it locally.
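
For example, the data can be pulled from morph.io with plain Ruby. This is a minimal sketch assuming the data API endpoint takes the form api.morph.io/<owner>/<scraper>/data.json with key and query parameters; check your scraper's page on morph.io for the exact URL, and note that the API key is personal to your account.

require 'net/http'
require 'json'

uri = URI('https://api.morph.io/chrismytton/pitchfork_scraper/data.json')
uri.query = URI.encode_www_form(
  key:   ENV['MORPH_API_KEY'],            # your personal morph.io API key
  query: 'select * from data limit 10'    # any SQL over the scraped data
)

reviews = JSON.parse(Net::HTTP.get(uri))
reviews.each { |review| puts "#{review['artist']} - #{review['album']}" }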

The scraper code is available on GitHub, and you can see its output on morph.io. Note that you need to log in via GitHub to access the data itself and to use the API.

This article should give you enough to start hosting your own scrapers on morph.io. In my opinion it is a great service that takes care of running and maintaining scrapers, letting you concentrate on the parts of your application that are unique.

Go forth and get structured data from the web!

All articles in the series: Web Scraping with Ruby, Advanced Web Scraping with Mechanize, Using morph.io for web scraping.

Source: https://habr.com/ru/post/262991/
