If you have read the previous two articles, Web Parsing for Ruby and Advanced Web Parsing with Mechanize, then you have a basic understanding of how to write a parser that extracts structured data from a web site.
The next logical step is to run the parser regularly so you always have fresh data. That is exactly what morph.io, built by the talented people at OpenAustralia, does.
Morph.io positions itself as “Heroku for Parsers”. You can run your parsers manually or have them run automatically every day. Either way, you can use the API to pull the data as JSON or CSV and use it in your application, or simply download the sqlite database with the data.
Morph.io fills the gap left by ScraperWiki Classic. Parsers on morph.io are hosted on GitHub, which means you can fork them and fix them if they ever stop working.

Create a parser
We will use the code from my previous post to show how easy it is to run your parser on morph.io.
You log in to morph.io with your GitHub account. After authorization you can create a parser. At the moment, morph.io supports parsers written in Ruby, PHP, Python or Perl. Select a language and give your parser a name; I called mine pitchfork_scraper. Then click the “Create Scraper” button to create a new GitHub repository containing a parser skeleton in the language you selected.
Clone the repository created in the previous step; in my case it looks like this:
git clone https:
The repository will contain two files: README.md and scraper.rb.
Morph.io expects two things from a parser. First, for Ruby parsers the repository must contain a scraper.rb file; second, the parser must write its data to a sqlite database called data.sqlite.
Note on file names: a Python parser's file should be called scraper.py, a PHP parser's scraper.php, and a Perl parser's scraper.pl.
To meet these requirements, we need to make a small change so that the parser writes its data to the sqlite database instead of printing JSON to STDOUT. First, add the code from the previous article to scraper.rb, then change it to use the scraperwiki gem to write the data to the sqlite database.
diff --git a/scraper.rb b/scraper.rb
index 2d2baaa..f8b14d6 100644
--- a/scraper.rb
+++ b/scraper.rb
@@ -1,6 +1,8 @@
 require 'mechanize'
 require 'date'
-require 'json'
+require 'scraperwiki'
+
+ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }
 
 agent = Mechanize.new
 page = agent.get("http://pitchfork.com/reviews/albums/")
@@ -34,4 +36,6 @@ reviews = review_links.map do |link|
   }
 end
 
-puts JSON.pretty_generate(reviews)
+reviews.each do |review|
+  ScraperWiki.save_sqlite([:artist, :album], review)
+end
This code uses the ScraperWiki.save_sqlite method to save each review to the database. The first argument is a list of fields that together must be unique. Here we use the artist and the album, since it is unlikely that the same artist will release two albums with the same name.
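To see what the unique key means in practice, here is a small sketch (the artist, album and score values are made up purely for illustration): a second save with the same artist and album updates the existing row instead of creating a duplicate.

require 'scraperwiki'

ScraperWiki.config = { db: 'data.sqlite', default_table_name: 'data' }

# The first save inserts a new row keyed on (artist, album).
ScraperWiki.save_sqlite([:artist, :album],
                        { artist: 'Example Artist', album: 'Example Album', score: 7.0 })

# The second save matches the same (artist, album) pair, so the existing
# row is updated in place rather than duplicated.
ScraperWiki.save_sqlite([:artist, :album],
                        { artist: 'Example Artist', album: 'Example Album', score: 8.5 })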
To run the code locally, you need to install the scraperwiki gem in addition to the dependencies you already have:
gem install scraperwiki
Then you can run the code on the local machine:
ruby scraper.rb
As a result, a new file called data.sqlite will be created in the current directory, containing the scraped data.
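If you want a quick sanity check of what was written, you can query the file directly from Ruby. The snippet below is just one way to do it; it assumes the sqlite3 gem is available (it comes in as a dependency of scraperwiki) and uses the default table name data.

require 'sqlite3'

# Open the database produced by the scraper and print a few rows.
db = SQLite3::Database.new('data.sqlite')
db.results_as_hash = true

db.execute('SELECT artist, album FROM data LIMIT 5').each do |row|
  puts "#{row['artist']} - #{row['album']}"
end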
Run the parser on morph.io
Now that we have made all the necessary changes, we can run our code on morph.io. First, commit the changes with git commit and push them to the GitHub repository with git push.
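For reference, the commands look roughly like this (assuming the default branch is master):

git add scraper.rb
git commit -m "Save reviews to sqlite with scraperwiki"
git push origin master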
Now you can run the parser and the result will be added to the database on morph.io. It should look something like this:

As you can see, the data is available to authorized users in JSON or CSV format, or you can download the sqlite database and view it locally.
The parser code is available on GitHub, and you can see the parser's output on morph.io. Note that you need to log in via GitHub to access the data itself and to use the API.
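As a rough sketch of how you might pull the data into a Ruby application through the API: OWNER and YOUR_API_KEY below are placeholders, and the exact endpoint format is an assumption on my part, so check the API section of your scraper's page on morph.io for the URL it gives you.

require 'open-uri'
require 'json'
require 'cgi'

# Placeholders: substitute your GitHub user name and the API key shown
# on your morph.io account page.
owner   = 'OWNER'
api_key = 'YOUR_API_KEY'
query   = CGI.escape('select * from data limit 5')

url = "https://api.morph.io/#{owner}/pitchfork_scraper/data.json?key=#{api_key}&query=#{query}"

reviews = JSON.parse(URI.open(url).read)
reviews.each do |review|
  puts "#{review['artist']} - #{review['album']}"
end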
This article should give you enough knowledge to start hosting your parsers on morph.io. In my opinion, it is an amazing service that takes care of running and maintaining your parsers, allowing you to concentrate on the unique parts of your application.
Go ahead and get structured data from the web!
All articles in the series: