$ gem install mechanize
scraper.rb
file and add some require
. This will indicate dependencies that are necessary for our script. date
and json
are parts of the standard ruby library, so there is no need to additionally install them. require 'mechanize' require 'date' require 'json'
agent
) and use it to download the page. agent = Mehanize.new page = agent.get("http://pitchfork.com/reviews/albums/")
page
object to find links to reviews..links_with
method, which, as the name implies, finds links with the specified attributes. Here we look for links that match the regular expression..reject
and drop links that look like pagination. review_links = page.links_with(href: %r{^/reviews/albums/\w+}) review_links = review_links.reject do |link| parent_classes = link.node.parent['class'].split parent_classes.any? { |p| %w[next-container page-number].include?(p) } end
review_links = review_links[0...4]
.map
method and return the hash after each iteration.page
object has a .search
method that is delegated to the .search
.search method. This means that we can use the CSS selector as an argument for .serach
and it will return an array of matched elements.#main .review-meta .info
, and then we search inside the review_meta
element for the pieces of information we need. reviews = review_links.map do |link| review = link.click review_meta = review.search('#main .review-meta .info') artist = review_meta.search('h1')[0].text album = review_meta.search('h2')[0].text label, year = review_meta.search('h3')[0].text.split(';').map(&:strip) reviewer = review_meta.search('h4 address')[0].text review_date = Date.parse(review_meta.search('.pub-date')[0].text) score = review_meta.search('.score').text.to_f { artist: artist, album: album, label: label, year: year, reviewer: reviewer, review_date: review_date, score: score } end
puts JSON.pretty_generate(reviews)
require 'mechanize' require 'date' require 'json' agent = Mechanize.new page = agent.get("http://pitchfork.com/reviews/albums/") review_links = page.links_with(href: %r{^/reviews/albums/\w+}) review_links = review_links.reject do |link| parent_classes = link.node.parent['class'].split parent_classes.any? { |p| %w[next-container page-number].include?(p) } end review_links = review_links[0...4] reviews = review_links.map do |link| review = link.click review_meta = review.search('#main .review-meta .info') artist = review_meta.search('h1')[0].text album = review_meta.search('h2')[0].text label, year = review_meta.search('h3')[0].text.split(';').map(&:strip) reviewer = review_meta.search('h4 address')[0].text review_date = Date.parse(review_meta.search('.pub-date')[0].text) score = review_meta.search('.score').text.to_f { artist: artist, album: album, label: label, year: year, reviewer: reviewer, review_date: review_date, score: score } end puts JSON.pretty_generate(reviews)
scraper.rb
file and running it with the command: $ ruby scraper.rb
[ { "artist": "Viet Cong", "album": "Viet Cong", "label": "Jagjaguwar", "year": "2015", "reviewer": "Ian Cohen", "review_date": "2015-01-22", "score": 8.5 }, { "artist": "Lupe Fiasco", "album": "Tetsuo & Youth", "label": "Atlantic / 1st and 15th", "year": "2015", "reviewer": "Jayson Greene", "review_date": "2015-01-22", "score": 7.2 }, { "artist": "The Go-Betweens", "album": "G Stands for Go-Betweens: Volume 1, 1978-1984", "label": "Domino", "year": "2015", "reviewer": "Douglas Wolk", "review_date": "2015-01-22", "score": 8.2 }, { "artist": "The Sidekicks", "album": "Runners in the Nerved World", "label": "Epitaph", "year": "2015", "reviewer": "Ian Cohen", "review_date": "2015-01-22", "score": 7.4 } ]
$ ruby scraper.rb > reviews.json
Source: https://habr.com/ru/post/253439/
All Articles