
Ruby Web Parsing

This is a translation of the article “Web Scraping with Ruby”, which I found useful while learning the Ruby programming language. Parsing interests me for personal reasons: it seems to me not only a useful skill, but also a good way to learn a language.

Web parsing with Ruby is easier than you might think. Let's start with a simple example: I want to get a nicely formatted JSON array of objects representing the list of movies from the website of a local independent cinema.

First, we need a way to download the HTML page that contains all the movie listings. Ruby has a built-in HTTP client, Net::HTTP, and a wrapper on top of it, open-uri.

Open-uri
Open-uri is fine for basic tasks like the ones in this tutorial, but it has some problems, so you may want to choose a different HTTP client for a production environment.
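As a rough idea of what "another HTTP client" could look like without leaving the standard library, here is a minimal sketch built directly on Net::HTTP. The helper name fetch_html, the timeout values, and the redirect limit are my own assumptions for illustration; they are not part of the original article.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from the original article): fetch a page body
# with explicit timeouts, following a limited number of redirects.
def fetch_html(url, redirect_limit: 3)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  raise ArgumentError, "unsupported URL: #{url}" unless uri.is_a?(URI::HTTP)
  raise 'too many redirects' if redirect_limit.negative?

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 10) do |http|
    http.get(uri.request_uri)
  end

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the old URL.
    fetch_html(URI.join(url, response['location']).to_s,
               redirect_limit: redirect_limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end
```

Something like html = fetch_html('http://www.cubecinema.com/programme') would then return the page as a string, which Nokogiri::HTML accepts just as well as the object returned by open-uri.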

So, the first thing to do is download html from a remote server.

    require 'open-uri'

    url = 'http://www.cubecinema.com/programme'
    # On Ruby 3.0 and later, plain open(url) no longer fetches URLs; use URI.open.
    html = URI.open(url)

Great, we now have the page we want to parse; next we need to extract some information from it. The best tool for this is Nokogiri. We create a Nokogiri document from the HTML we have just downloaded.

    require 'nokogiri'

    doc = Nokogiri::HTML(html)

Nokogiri is cool because it lets you query HTML with CSS selectors, which, in my opinion, is much more convenient than using XPath.

OK, now we have a document from which we can pull the list of movies. Each item in the list has the following HTML structure.

    <div class="showing" id="event_7557">
      <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
        <img src="" alt="Picture for event Live stand up + Monty Python and the Holy Grail">
      </a>
      <span class="tags">
        <a href="/programme/view/comedy/" class="tag_comedy">comedy</a>
        <a href="/programme/view/dvd/" class="tag_dvd">dvd</a>
        <a href="/programme/view/film/" class="tag_film">film</a>
      </span>
      <h1>
        <a href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">
          <span class="pre_title">Comedy Combo presents</span>
          Live stand up + Monty Python and the Holy Grail
          <span class="post_title">Rare screening from 35mm!</span>
        </a>
      </h1>
      <div class="event_details">
        <p class="start_and_pricing">
          Sat 20 December | 19:30
          <br>
        </p>
        <p class="copy">Brave (and not so brave) Knights of the Round Table! Gain shelter from the vicious chicken of Bristol as we gather to bear witness to this 100% factually accurate retelling ... [<a class="more" href="/programme/event/live-stand-up-monty-python-and-the-holy-grail,7557/">more...</a>]</p>
      </div>
    </div>

HTML processing


Each movie has the CSS class .showing, so we can select all the showings and process them in turn.

    showings = []
    doc.css('.showing').each do |showing|
      showing_id = showing['id'].split('_').last.to_i
      tags = showing.css('.tags a').map { |tag| tag.text.strip }
      title_el = showing.at_css('h1 a')
      title_el.children.each { |c| c.remove if c.name == 'span' }
      title = title_el.text.strip
      dates = showing.at_css('.start_and_pricing').inner_html.strip
      dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
      description = showing.at_css('.copy').text.gsub('[more...]', '').strip
      showings.push(
        id: showing_id,
        title: title,
        tags: tags,
        dates: dates,
        description: description
      )
    end

Let's take a look at the code presented above.

 showing_id = showing['id'].split('_').last.to_i 

First, we take the showing's unique id, which is conveniently exposed as the id attribute in the markup. Square brackets give access to an element's attributes, so for the HTML above showing['id'] is "event_7557". We only want the numeric part, so we split the string on the underscore with .split('_'), then take the last element of the resulting array and convert it to an integer with .last.to_i.
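One thing worth knowing about this chain is that .to_i quietly returns 0 when the string contains no digits, and a missing id attribute (nil) raises NoMethodError on .split. If that matters to you, a slightly stricter variant could look like the sketch below; the helper name showing_id_from is hypothetical and not part of the original script.

```ruby
# Hypothetical helper: pull the numeric part out of an id such as "event_7557".
# Returns nil instead of 0 when the attribute is missing or has no digits.
def showing_id_from(attr)
  digits = attr.to_s[/\d+\z/] # trailing run of digits, or nil if there is none
  digits && Integer(digits, 10)
end

id_ok      = showing_id_from('event_7557')
id_empty   = showing_id_from('event_')
id_missing = showing_id_from(nil)
```

Here id_ok is the integer 7557, while id_empty and id_missing are both nil, so a malformed element is easy to detect and skip.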

 tags = showing.css('.tags a').map { |tag| tag.text.strip } 

Here we find all the tags for a movie using the .css method, which returns an array-like collection (a NodeSet) of matching elements. Then we map over the elements, taking the text of each and stripping the surrounding whitespace. For our HTML, the result will be ["comedy", "dvd", "film"].

    title_el = showing.at_css('h1 a')
    title_el.children.each { |c| c.remove if c.name == 'span' }
    title = title_el.text.strip

The code for getting the title is a bit more involved, because this element contains extra span elements holding a prefix and a suffix. We fetch the title link using .at_css, which returns the first matching element. Then we iterate over its direct children and remove the spans. Once the spans are gone, we take the element's text and strip the surrounding whitespace.

    dates = showing.at_css('.start_and_pricing').inner_html.strip
    dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }

Next comes the code for the date and time of the showing. It is a little more complicated, because a film can be shown on several days and, sometimes, the price appears in the same element. We split the element's inner HTML on <br>, parse each piece with DateTime.parse, and end up with an array of Ruby DateTime objects.
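Here is the date step in isolation, as a small sketch. The second date in the sample string is made up to show the multi-day case, and the reject(&:empty?) call is my own addition: DateTime.parse raises on an empty string, which could happen if two <br> tags follow each other.

```ruby
require 'date'

# Sample inner HTML for a showing on two days (the second date is made up).
inner_html = 'Sat 20 December | 19:30 <br> Sun 21 December | 19:30 <br>'

dates = inner_html.split('<br>')
                  .map(&:strip)
                  .reject(&:empty?) # skip blank pieces instead of raising
                  .map { |d| DateTime.parse(d) }

first = dates.first
summary = format('%02d/%02d at %02d:%02d',
                 first.day, first.month, first.hour, first.min)
```

Note that the strings contain no year. As far as I can tell, DateTime.parse fills in the current year in that case, so the year of the resulting objects depends on when the script is run.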

 description = showing.at_css('.copy').text.gsub('[more...]', '').strip 

Getting the description is fairly simple; the only thing we need to do is remove the text [more...] using .gsub.

    showings.push(
      id: showing_id,
      title: title,
      tags: tags,
      dates: dates,
      description: description
    )

Now that we have all the necessary pieces in variables, we can push them as a hash onto the showings array, which collects all the films.

JSON output


Now that we have extracted every movie into an array, we can convert the result to JSON.

    require 'json'

    puts JSON.pretty_generate(showings)

This code prints the showings array encoded as JSON. When you run the script, the output can be redirected to a file or piped to another program for further processing.
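If you prefer to write the file from Ruby rather than redirect the output in the shell, a sketch could look like this. The file location and the year 2014 in the sample date are assumptions made only for the example.

```ruby
require 'json'
require 'date'
require 'tmpdir'

# One hand-written showing; the year 2014 is assumed for illustration.
showings = [
  { id: 7557,
    title: 'Live stand up + Monty Python and the Holy Grail',
    tags: %w[comedy dvd film],
    dates: [DateTime.new(2014, 12, 20, 19, 30)] }
]

# Hypothetical output location, chosen only for this sketch.
path = File.join(Dir.tmpdir, 'showings.json')
File.write(path, JSON.pretty_generate(showings))

# Reading the file back: the dates are now plain ISO 8601 strings.
loaded = JSON.parse(File.read(path), symbolize_names: true)
```

JSON has no date type, so each DateTime is written out as a string. A program consuming the file has to parse those strings again if it needs real date objects.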

Putting it all together


Having collected all the parts in one place, we get the full version of our script:

    require 'open-uri'
    require 'nokogiri'
    require 'json'
    require 'date' # provides DateTime

    url = 'http://www.cubecinema.com/programme'
    # On Ruby 3.0 and later, plain open(url) no longer fetches URLs; use URI.open.
    html = URI.open(url)
    doc = Nokogiri::HTML(html)

    showings = []
    doc.css('.showing').each do |showing|
      showing_id = showing['id'].split('_').last.to_i
      tags = showing.css('.tags a').map { |tag| tag.text.strip }
      title_el = showing.at_css('h1 a')
      title_el.children.each { |c| c.remove if c.name == 'span' }
      title = title_el.text.strip
      dates = showing.at_css('.start_and_pricing').inner_html.strip
      dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
      description = showing.at_css('.copy').text.gsub('[more...]', '').strip
      showings.push(
        id: showing_id,
        title: title,
        tags: tags,
        dates: dates,
        description: description
      )
    end

    puts JSON.pretty_generate(showings)

If you save it to a file, for example scraper.rb, and run ruby scraper.rb, you should see the output in JSON format. It should look similar to the following.

    [
      {
        "id": 7686,
        "title": "Harry Dean Stanton - Partly Fiction",
        "tags": [
          "dcp",
          "film",
          "ttt"
        ],
        "dates": [
          "2015-01-19T20:00:00+00:00",
          "2015-01-20T20:00:00+00:00"
        ],
        "description": "A mesmerizing, impressionistic portrait of the iconic actor in his intimate moments, with film clips from some of his 250 films and his own heart-breaking renditions of American folk songs. ..."
      },
      {
        "id": 7519,
        "title": "Bang the Bore Audiovisual Spectacle: VA AA LR + Stephen Cornford + Seth Cooke",
        "tags": [
          "music"
        ],
        "dates": [
          "2015-01-21T20:00:00+00:00"
        ],
        "description": "An evening of hacked TVs, 4 screen cinematic drone and electroacoustics. VAAALR: Vasco Alves, Adam Asnan and Louie Rice create spectacles using distress flares, C02 and junk electronics. Stephen Cornford: ..."
      }
    ]

That's all. And this is just a basic example of parsing. Parsing a site that requires you to log in first is harder. For such cases, I recommend looking at mechanize, which is built on top of Nokogiri.

I hope this introduction to parsing gives you ideas about data you would like to see in a more structured format, and that the methods described above help you get it.

I also plan to translate another article on the topic of parsing from the same author.


Source: https://habr.com/ru/post/252379/

