First, install Mechanize:

    $ gem install mechanize

Create a scraper.rb file and add some require statements. These declare the dependencies our script needs. date and json are part of the Ruby standard library, so there is no need to install them separately.

    require 'mechanize'
    require 'date'
    require 'json'

Now create a Mechanize instance (agent) and use it to download the page.

    agent = Mechanize.new
    page = agent.get("http://pitchfork.com/reviews/albums/")

We can use the page object to find links to reviews. Mechanize provides the .links_with method, which, as the name implies, finds links with the specified attributes. Here we look for links whose href matches the regular expression. That match also picks up pagination links, so we call .reject and drop links that look like pagination. Finally, we keep only the first four links.

    review_links = page.links_with(href: %r{^/reviews/albums/\w+})

    review_links = review_links.reject do |link|
      # to_s guards against parent nodes that have no class attribute
      parent_classes = link.node.parent['class'].to_s.split
      parent_classes.any? { |p| %w[next-container page-number].include?(p) }
    end

    review_links = review_links[0...4]

We now have a list of links that we can process with the .map method, returning a hash from each iteration. The Mechanize page object has a .search method that is delegated to Nokogiri's .search method. This means that we can pass a CSS selector to .search and it will return an array of matched elements. First we select the review metadata with the CSS selector #main .review-meta .info, and then we search inside the review_meta element for the pieces of information we need.
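The text that .search returns is a plain string, so each value needs a small conversion before it goes into the hash. The sketch below isolates those conversions using only the standard library. The sample strings are hard-coded assumptions standing in for what the page might contain (the exact date format on the site is not shown in this article), and build_review is a hypothetical helper name used only for illustration.

```ruby
require 'date'
require 'json'

# Hypothetical helper: converts raw text fragments into a review hash.
# The keyword arguments stand in for the strings returned by .search(...).text.
def build_review(artist:, album:, label_and_year:, reviewer:, pub_date:, score:)
  # "Label; Year" is split on the semicolon and each part is trimmed
  label, year = label_and_year.split(';').map(&:strip)

  {
    artist: artist,
    album: album,
    label: label,
    year: year,                         # stays a string, as in the article's script
    reviewer: reviewer,
    review_date: Date.parse(pub_date),  # becomes a Date object
    score: score.to_f                   # "8.5" becomes the float 8.5
  }
end

# Assumed sample input; real values would come from the scraped page.
review = build_review(
  artist: 'Viet Cong',
  album: 'Viet Cong',
  label_and_year: 'Jagjaguwar; 2015',
  reviewer: 'Ian Cohen',
  pub_date: 'January 22, 2015',
  score: '8.5'
)

puts JSON.pretty_generate([review])
```

Because only json is required here, the Date value is serialized through its string form, so review_date appears in the JSON as "2015-01-22".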
Inside the .map block this looks as follows:

    reviews = review_links.map do |link|
      review = link.click
      review_meta = review.search('#main .review-meta .info')
      artist = review_meta.search('h1')[0].text
      album = review_meta.search('h2')[0].text
      label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
      reviewer = review_meta.search('h4 address')[0].text
      review_date = Date.parse(review_meta.search('.pub-date')[0].text)
      score = review_meta.search('.score').text.to_f
      {
        artist: artist,
        album: album,
        label: label,
        year: year,
        reviewer: reviewer,
        review_date: review_date,
        score: score
      }
    end

With the reviews array in hand, we can print it as JSON:

    puts JSON.pretty_generate(reviews)

Here is the complete script:

    require 'mechanize'
    require 'date'
    require 'json'

    agent = Mechanize.new
    page = agent.get("http://pitchfork.com/reviews/albums/")

    # Collect links to reviews, then drop the pagination links
    review_links = page.links_with(href: %r{^/reviews/albums/\w+})

    review_links = review_links.reject do |link|
      parent_classes = link.node.parent['class'].to_s.split
      parent_classes.any? { |p| %w[next-container page-number].include?(p) }
    end

    review_links = review_links[0...4]

    # Follow each link and extract the review metadata
    reviews = review_links.map do |link|
      review = link.click
      review_meta = review.search('#main .review-meta .info')
      artist = review_meta.search('h1')[0].text
      album = review_meta.search('h2')[0].text
      label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
      reviewer = review_meta.search('h4 address')[0].text
      review_date = Date.parse(review_meta.search('.pub-date')[0].text)
      score = review_meta.search('.score').text.to_f
      {
        artist: artist,
        album: album,
        label: label,
        year: year,
        reviewer: reviewer,
        review_date: review_date,
        score: score
      }
    end

    puts JSON.pretty_generate(reviews)

Saving this in the scraper.rb file and running it with the command:

    $ ruby scraper.rb

produces output like this:

    [
      {
        "artist": "Viet Cong",
        "album": "Viet Cong",
        "label": "Jagjaguwar",
        "year": "2015",
        "reviewer": "Ian Cohen",
        "review_date": "2015-01-22",
        "score": 8.5
      },
      {
        "artist": "Lupe Fiasco",
        "album": "Tetsuo & Youth",
        "label": "Atlantic / 1st and 15th",
        "year": "2015",
        "reviewer": "Jayson Greene",
        "review_date": "2015-01-22",
        "score": 7.2
      },
      {
        "artist": "The Go-Betweens",
        "album": "G Stands for Go-Betweens: Volume 1, 1978-1984",
        "label": "Domino",
        "year": "2015",
        "reviewer": "Douglas Wolk",
        "review_date": "2015-01-22",
        "score": 8.2
      },
      {
        "artist": "The Sidekicks",
        "album": "Runners in the Nerved World",
        "label": "Epitaph",
        "year": "2015",
        "reviewer": "Ian Cohen",
        "review_date": "2015-01-22",
        "score": 7.4
      }
    ]

The output can also be redirected to a file:

    $ ruby scraper.rb > reviews.json

Source: https://habr.com/ru/post/253439/