
Advanced website parsing with Mechanize

Continuing the topic of parsing websites with Ruby, I decided to translate the following article by the same author.

In the previous post I covered the basics: an introduction to web parsing in Ruby. At the end of that post I mentioned Mechanize, a tool used for more advanced parsing.

This article explains how to do advanced parsing of websites with Mechanize, which in turn builds on Nokogiri for its HTML processing.

Parsing Reviews from Pitchfork


Out of the box, Mechanize provides tools for filling out form fields, following links and respecting the robots.txt file. In this post, I'll show how to use it to fetch the latest reviews from the Pitchfork site.
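To give a sense of the API before we start, here is a minimal sketch of those features. It is not part of the original article; the URL and link text are placeholders.

 require 'mechanize'

 agent = Mechanize.new
 agent.robots = true                    # respect robots.txt; disallowed pages raise an error
 agent.user_agent_alias = 'Mac Safari'  # identify as a common browser

 # example.com and the link text below are placeholders, not real targets
 page = agent.get('http://example.com/')
 next_page = page.link_with(text: 'More information...')&.click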
Parse responsibly
You should always parse websites with care. See the article Is scraping legal? on the ScraperWiki blog for a discussion of this topic.


The reviews are spread across several pages, so we cannot simply fetch a single page and parse it with Nokogiri. This is where we need Mechanize, with its ability to click links and navigate to other pages.

Installation


First, install Mechanize and its dependencies via RubyGems.

$ gem install mechanize 


Now we can start writing our parser. Create a scraper.rb file and add a few require statements to declare the dependencies our script needs. date and json are part of the Ruby standard library, so there is no need to install them separately.

 require 'mechanize'
 require 'date'
 require 'json'


Now we can start using Mechanize. The first thing to do is create a new instance of the Mechanize class (the agent) and use it to download the page.

 agent = Mechanize.new
 page = agent.get("http://pitchfork.com/reviews/albums/")


Find links to reviews


Now we can use the page object to find links to reviews.
Mechanize provides the .links_with method, which, as the name implies, finds links with the specified attributes. Here we look for all links whose href matches a regular expression.

This returns an array of links, but we only want links to reviews, not pagination links. To filter out the unwanted ones, we call .reject and drop the links that look like pagination.

 review_links = page.links_with(href: %r{^/reviews/albums/\w+})
 review_links = review_links.reject do |link|
   parent_classes = link.node.parent['class'].split
   parent_classes.any? { |p| %w[next-container page-number].include?(p) }
 end


For illustration purposes, and so as not to burden the Pitchfork server, we will only take the links to the first 4 reviews.

 review_links = review_links[0...4] 


Processing each review


We now have a list of links and want to process each one individually. For this we use the .map method and return a hash from each iteration.

The page object has a .search method that is delegated to Nokogiri's .search method. This means we can pass a CSS selector as an argument to .search and it will return an array of matching elements.

First we grab the review metadata using the CSS selector #main .review-meta .info, and then we search inside that review_meta element for the pieces of information we need.

 reviews = review_links.map do |link|
   review = link.click

   review_meta = review.search('#main .review-meta .info')

   artist = review_meta.search('h1')[0].text
   album = review_meta.search('h2')[0].text
   label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
   reviewer = review_meta.search('h4 address')[0].text
   review_date = Date.parse(review_meta.search('.pub-date')[0].text)
   score = review_meta.search('.score').text.to_f

   {
     artist: artist,
     album: album,
     label: label,
     year: year,
     reviewer: reviewer,
     review_date: review_date,
     score: score
   }
 end


Now we have an array of hashes with reviews, which we can, for example, output in JSON format.

 puts JSON.pretty_generate(reviews) 


Putting it all together


Complete script:

 require 'mechanize'
 require 'date'
 require 'json'

 agent = Mechanize.new
 page = agent.get("http://pitchfork.com/reviews/albums/")

 # Find the review links, dropping the pagination links
 review_links = page.links_with(href: %r{^/reviews/albums/\w+})
 review_links = review_links.reject do |link|
   parent_classes = link.node.parent['class'].split
   parent_classes.any? { |p| %w[next-container page-number].include?(p) }
 end

 # Limit ourselves to the first 4 reviews
 review_links = review_links[0...4]

 reviews = review_links.map do |link|
   review = link.click

   review_meta = review.search('#main .review-meta .info')

   artist = review_meta.search('h1')[0].text
   album = review_meta.search('h2')[0].text
   label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
   reviewer = review_meta.search('h4 address')[0].text
   review_date = Date.parse(review_meta.search('.pub-date')[0].text)
   score = review_meta.search('.score').text.to_f

   {
     artist: artist,
     album: album,
     label: label,
     year: year,
     reviewer: reviewer,
     review_date: review_date,
     score: score
   }
 end

 puts JSON.pretty_generate(reviews)


Save this code to the scraper.rb file and run it with the command:

 $ ruby scraper.rb 


We get something like this:

 [ { "artist": "Viet Cong", "album": "Viet Cong", "label": "Jagjaguwar", "year": "2015", "reviewer": "Ian Cohen", "review_date": "2015-01-22", "score": 8.5 }, { "artist": "Lupe Fiasco", "album": "Tetsuo & Youth", "label": "Atlantic / 1st and 15th", "year": "2015", "reviewer": "Jayson Greene", "review_date": "2015-01-22", "score": 7.2 }, { "artist": "The Go-Betweens", "album": "G Stands for Go-Betweens: Volume 1, 1978-1984", "label": "Domino", "year": "2015", "reviewer": "Douglas Wolk", "review_date": "2015-01-22", "score": 8.2 }, { "artist": "The Sidekicks", "album": "Runners in the Nerved World", "label": "Epitaph", "year": "2015", "reviewer": "Ian Cohen", "review_date": "2015-01-22", "score": 7.4 } ] 


If you want, you can redirect this data to a file.

 $ ruby scraper.rb > reviews.json 
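Alternatively (this is not part of the original script), you could write the file from inside Ruby instead of redirecting from the shell, for example by replacing the final puts line:

 # Write the JSON straight to a file instead of printing it to stdout
 File.write('reviews.json', JSON.pretty_generate(reviews))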


Conclusion


This only scratches the surface of what Mechanize can do. In this article I did not even touch on Mechanize's ability to fill out and submit forms. If you are interested, I recommend reading the Mechanize documentation and usage examples.
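For a rough idea of what form handling looks like, here is a hypothetical sketch; the URL, form action and field name are placeholders and depend on the actual page's markup.

 require 'mechanize'

 agent = Mechanize.new
 page = agent.get('http://example.com/search')         # placeholder URL

 # The form action and the 'q' field name are hypothetical -- inspect the real page first
 form = page.form_with(action: '/search') || page.forms.first
 form['q'] = 'pitchfork reviews'                        # fill a text field by its name attribute
 results = form.submit                                  # submit the form; returns a Mechanize::Page
 puts results.title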

Many people in the comments on the previous post said I should have just used Mechanize. While I agree that Mechanize is a great tool, the example in the first post was simple, and using Mechanize there seemed like overkill to me.

However, given Mechanize's capabilities, I am starting to think that even for simple parsing tasks it will often be the better choice.


Source: https://habr.com/ru/post/253439/

