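Ruby's standard library includes an HTTP client, Net::HTTP, and a wrapper built on top of it, open-uri.

    require 'open-uri'

    url = 'http://www.cubecinema.com/programme'
    # Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
    html = URI.open(url)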
    require 'nokogiri'

    doc = Nokogiri::HTML(html)
    <div class="showing" id="event_7557">
      <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
        <img src="" alt="Picture for event Live stand up + Monty Python and the Holy Grail">
      </a>
      <span class="tags">
        <a href="/programme/view/comedy/" class="tag_comedy">comedy</a>
        <a href="/programme/view/dvd/" class="tag_dvd">dvd</a>
        <a href="/programme/view/film/" class="tag_film">film</a>
      </span>
      <h1>
        <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
          <span class="pre_title">Comedy Combo presents</span>
          Live stand up + Monty Python and the Holy Grail
          <span class="post_title">Rare screening from 35mm!</span>
        </a>
      </h1>
      <div class="event_details">
        <p class="start_and_pricing">
          Sat 20 December | 19:30
          <br>
        </p>
        <p class="copy">Brave (and not so brave) Knights of the Round Table! Gain shelter from the vicious chicken of Bristol as we gather to bear witness to this 100% factually accurate retelling ... [<a class="more" href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">more...</a>]</p>
      </div>
    </div>
Each showing is wrapped in an element with the CSS class .showing, so we can select all the showings and process them in turn.

    showings = []
    doc.css('.showing').each do |showing|
      # "event_7557" -> 7557
      showing_id = showing['id'].split('_').last.to_i
      tags = showing.css('.tags a').map { |tag| tag.text.strip }
      # Drop the pre_title/post_title spans, keeping only the title text
      title_el = showing.at_css('h1 a')
      title_el.children.each { |c| c.remove if c.name == 'span' }
      title = title_el.text.strip
      # One date per <br>-separated chunk
      dates = showing.at_css('.start_and_pricing').inner_html.strip
      dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
      description = showing.at_css('.copy').text.gsub('[more...]', '').strip
      showings.push(
        id: showing_id,
        title: title,
        tags: tags,
        dates: dates,
        description: description
      )
    end
    showing_id = showing['id'].split('_').last.to_i

For the markup above, showing['id'] is "event_7557". We only need the numeric identifier, so we split the string on the underscore with .split('_'), then take the last element of the resulting array and convert it to an integer with .last.to_i.

    tags = showing.css('.tags a').map { |tag| tag.text.strip }
Here we use the .css method, which returns an array-like collection of matching elements (a Nokogiri::XML::NodeSet). We then map over the elements, taking the text of each one and stripping the surrounding whitespace. For our HTML, the result is ["comedy", "dvd", "film"].

    title_el = showing.at_css('h1 a')
    title_el.children.each { |c| c.remove if c.name == 'span' }
    title = title_el.text.strip
To get the title we use .at_css, which returns the first matching element. We then iterate over the direct children of the heading link and remove the extra spans. Once the spans are gone, we take the remaining text and strip the surrounding whitespace.

    dates = showing.at_css('.start_and_pricing').inner_html.strip
    dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
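We take the inner HTML of the start_and_pricing paragraph, split it on <br>, strip each piece, and pass it to DateTime.parse. As a result we get an array of Ruby DateTime objects.

    description = showing.at_css('.copy').text.gsub('[more...]', '').strip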
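For the description, we take the text of the .copy paragraph and remove the trailing [more...] link text using .gsub.

Finally, we append the collected fields to the showings array as a hash.

    showings.push(
      id: showing_id,
      title: title,
      tags: tags,
      dates: dates,
      description: description
    )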
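The string-handling steps above can be tried in isolation, without Nokogiri or a network connection. In the sketch below the input strings are assumptions: raw_id and raw_description are taken from the sample markup, while raw_dates is a hypothetical two-date value shaped like what inner_html might return. The reject(&:empty?) call is an added safeguard against blank chunks and is not part of the original script.

```ruby
require 'date'

# Values as they might come out of the parsed markup (assumed inputs).
raw_id = 'event_7557'
raw_dates = 'Mon 19 January | 20:00 <br> Tue 20 January | 20:00 <br>'
raw_description = 'Gain shelter from the vicious chicken of Bristol as we ' \
                  'gather to bear witness to this 100% factually accurate ' \
                  'retelling ... [more...]'

# "event_7557" -> ["event", "7557"] -> "7557" -> 7557
showing_id = raw_id.split('_').last.to_i

# One chunk per <br>; blank chunks are skipped before parsing.
# DateTime.parse fills in the current year when none is given.
dates = raw_dates.split('<br>')
                 .map(&:strip)
                 .reject(&:empty?)
                 .map { |d| DateTime.parse(d) }

# Remove the link text left over from the "more..." anchor.
description = raw_description.gsub('[more...]', '').strip

puts showing_id
puts dates.map { |d| d.strftime('%d %B %H:%M') }.inspect
puts description
```

Since the listing text carries no year, the parsed dates fall in whatever year the script is run, which matters when scraping close to the new year.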
    require 'json'

    puts JSON.pretty_generate(showings)
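The json library has no dedicated DateTime encoding by default; it falls back to the object's to_s, which for DateTime produces an ISO 8601 string. The sketch below illustrates this with a single hand-built record (its values are borrowed from the sample data and are only illustrative) and shows one way to turn the strings back into DateTime objects when reading the JSON later.

```ruby
require 'json'
require 'date'

# A hand-built record standing in for the scraped data (illustrative values).
showings = [
  {
    id: 7686,
    title: 'Harry Dean Stanton - Partly Fiction',
    tags: %w[dcp film ttt],
    dates: [DateTime.new(2015, 1, 19, 20, 0, 0)]
  }
]

# DateTime values are written out via to_s, i.e. as ISO 8601 strings.
json = JSON.pretty_generate(showings)
puts json

# Reading the data back: keys become strings and dates stay plain strings
# until they are parsed again.
parsed = JSON.parse(json)
restored = parsed.first['dates'].map { |s| DateTime.iso8601(s) }
puts restored.first == showings.first[:dates].first
```

Keeping the dates as ISO 8601 strings makes the output easy to consume from other languages as well.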
    require 'open-uri'
    require 'nokogiri'
    require 'json'
    require 'date' # DateTime is used below

    url = 'http://www.cubecinema.com/programme'
    # Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
    html = URI.open(url)

    doc = Nokogiri::HTML(html)

    showings = []
    doc.css('.showing').each do |showing|
      showing_id = showing['id'].split('_').last.to_i
      tags = showing.css('.tags a').map { |tag| tag.text.strip }
      title_el = showing.at_css('h1 a')
      title_el.children.each { |c| c.remove if c.name == 'span' }
      title = title_el.text.strip
      dates = showing.at_css('.start_and_pricing').inner_html.strip
      dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
      description = showing.at_css('.copy').text.gsub('[more...]', '').strip
      showings.push(
        id: showing_id,
        title: title,
        tags: tags,
        dates: dates,
        description: description
      )
    end

    puts JSON.pretty_generate(showings)
If you save this code as scraper.rb and run ruby scraper.rb, you should see output in JSON format, similar to the following.

    [
      {
        "id": 7686,
        "title": "Harry Dean Stanton - Partly Fiction",
        "tags": [
          "dcp",
          "film",
          "ttt"
        ],
        "dates": [
          "2015-01-19T20:00:00+00:00",
          "2015-01-20T20:00:00+00:00"
        ],
        "description": "A mesmerizing, impressionistic portrait of the iconic actor in his intimate moments, with film clips from some of his 250 films and his own heart-breaking renditions of American folk songs. ..."
      },
      {
        "id": 7519,
        "title": "Bang the Bore Audiovisual Spectacle: VA AA LR + Stephen Cornford + Seth Cooke",
        "tags": [
          "music"
        ],
        "dates": [
          "2015-01-21T20:00:00+00:00"
        ],
        "description": "An evening of hacked TVs, 4 screen cinematic drone and electroacoustics. VAAALR: Vasco Alves, Adam Asnan and Louie Rice create spectacles using distress flares, C02 and junk electronics. Stephen Cornford: ..."
      }
    ]
Source: https://habr.com/ru/post/252379/