Let's learn Ruby together! Drop twelve

It is time to write something useful in Ruby ;) Today we will learn to pull the information we need out of web pages with Ruby, using Habr as the example. Let's start with karma.

open-uri

Let each of us open their personal Habr center (or someone else's, if you still haven't got an invite ;)) at an address like %username%.habrahabr.ru. Our task is to extract our karma value from half a thousand lines of HTML. The plan is to save the page source to a file, open and read it, and pull out the value with a regular expression.

The open-uri library does the first part of the work for us. Once it is required, the open method can open URLs as well as local files (from Ruby 2.5 on it is spelled URI.open, and since Ruby 3.0 plain open no longer accepts URLs):

    require 'open-uri'

    url = 'http://maxelc.habrahabr.ru/'
    page = URI.open(url)   # plain open(url) on Ruby older than 3.0
    text = page.read

open saves the page to a temporary file, from which we read the contents into the text string.
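Since local files are read the same way, the read step can be tried offline first. A minimal sketch, assuming a hypothetical read_page helper and a made-up HTML fragment written to a temporary file; only URI.open, File.read and Tempfile come from Ruby itself:

```ruby
require 'open-uri'
require 'tempfile'

# Hypothetical helper (not from the article): returns the page source
# for either an http(s) URL or a path to a local file.
def read_page(source)
  if source.match?(%r{\Ahttps?://})
    URI.open(source) { |page| page.read }  # open-uri downloads, then we read
  else
    File.read(source)
  end
end

# Offline stand-in for a saved profile page (made-up markup).
sample = Tempfile.new(['profile', '.html'])
sample.write('<span class="mark"><span>68,25</span></span>')
sample.close

text = read_page(sample.path)
puts text
```

Calling read_page with a real profile URL instead of sample.path would take the URI.open branch.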

We use regular expressions

To catch the karma, we build a regular expression of the form /(.*)/ wrapped in the markup that surrounds the value. Let's look at the piece of HTML we need:

    <span class="mark"><span>68,25</span></span>

And the simple regular expression is ready:

    %r{<span class="mark"><span>(.*?)</span>}m

We use %r{} so that we do not have to escape slashes (very convenient with HTML in particular). The m modifier at the end makes the dot match newline characters as well, so a match can span several lines (it does not matter in our case, but it is very useful when working with HTML). To search the string for matches, we will use the scan method:

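    # scan returns an array of capture-group arrays, so take the first value
    karma = text.scan(%r{<span class="mark"><span>(.*?)</span>}m).flatten.first
    puts "Karma = #{karma}"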
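To see what scan and the m modifier actually return, here is a small offline sketch. The HTML fragment is made up for illustration (the real page markup may differ), and the non-greedy .*? is a choice of this sketch so that the capture stops at the first closing tag:

```ruby
# Made-up fragment standing in for the downloaded page.
html = <<~HTML
  <div>
    <span class="mark"><span>68,25</span></span>
  </div>
HTML

pattern = %r{<span class="mark"><span>(.*?)</span>}m

# scan returns one array per match, holding that match's capture groups.
matches = html.scan(pattern)
karma = matches.flatten.first

# String#[] with a group index is a shortcut when only one match is needed.
same_karma = html[pattern, 1]

# Without m the dot stops at a newline; with m it crosses it.
multiline = "mark\">\n68,25\n</span>"
without_m = multiline[%r{mark">(.*)</span>}, 1]
with_m = multiline[%r{mark">(.*)</span>}m, 1]

puts "Karma = #{karma}"
p without_m, with_m
```

The last two lines print the difference the modifier makes: the pattern without m finds nothing in the multi-line string, while the pattern with m captures the value together with its surrounding newlines.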

Done! Now you can add support for an arbitrary username and output of habrasila, and also give everything an OOP shape by extracting methods and classes.
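One possible shape for that refactoring, as a sketch only: the class name KarmaReader, its fetcher parameter and the sample markup are all assumptions made here, not part of the original program. Passing the download step in as a callable keeps the parsing testable without a network connection:

```ruby
require 'open-uri'

# Hypothetical class: wraps the URL building, download and regexp steps.
class KarmaReader
  PATTERN = %r{<span class="mark"><span>(.*?)</span>}m

  # fetcher is any callable that takes a URL and returns the page source;
  # by default it downloads the page with open-uri.
  def initialize(username, fetcher = ->(url) { URI.open(url, &:read) })
    @username = username
    @fetcher = fetcher
  end

  def url
    "http://#{@username}.habrahabr.ru/"
  end

  # Returns the karma as a string, or nil if the pattern is not found.
  def karma
    @fetcher.call(url)[PATTERN, 1]
  end
end

# Offline usage with a stub fetcher and made-up markup.
stub = ->(_url) { '<span class="mark"><span>68,25</span></span>' }
reader = KarmaReader.new('maxelc', stub)
puts "Karma = #{reader.karma}"
```

Leaving out the second argument, as in KarmaReader.new('maxelc'), would make the class download the live page instead.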

hpricot

It would be much more convenient to parse HTML by its own tags, because they structure the information remarkably well. In our example it is enough to ask for the text enclosed in a span that is itself enclosed in a span of class mark, and that's all: no complex regular expressions. There is one catch, however: not every document is valid HTML, and unclosed or missing tags are common. The solution is to convert the HTML into XML, a strictly structured format whose parsing is a routine task.

Hpricot is a fast, easy-to-use HTML parser that works exactly this way. Its element-selection syntax is modeled on jQuery.
Install it with gem install hpricot and let's start coding. We require hpricot in the program, load the URL and find the element we need, wrapping everything in OOP right away:

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    class Karma
      def initialize(name)
        @url = "http://#{name}.habrahabr.ru/"
        @hp = Hpricot(open(@url))   # URI.open(@url) on Ruby 3.0 and later
      end

      # Text of the span nested inside <span class="mark">
      def get
        (@hp / "span.mark/span").inner_text
      end
    end

    karma = Karma.new('maxelc')
    puts "Karma = #{karma.get}"

Hpricot(open()) converts the HTML into a clean XML-like document tree and gives the resulting object its search methods. @hp/"span.mark" is shorthand for @hp.search("//span[@class='mark']"), meaning "find every <span class="mark">" (search accepts an XPath or CSS expression as its parameter). The inner_text method returns the text content of the element, while inner_html returns its contents with nested tags intact. By extending the query we can descend into nested tags, which is what we did: @hp/"span.mark/span".
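The XPath form of that query can be tried without Hpricot. The sketch below swaps in REXML, the XML library distributed with Ruby, so it only works on well-formed markup and does not show Hpricot's own API; the fragment is made up for illustration:

```ruby
require 'rexml/document'

# Made-up, well-formed fragment standing in for the profile page.
xml = <<~XML
  <div>
    <span class="mark"><span>68,25</span></span>
    <span class="other"><span>ignored</span></span>
  </div>
XML

doc = REXML::Document.new(xml)

# Same idea as @hp / "span.mark/span": the span nested in span.mark.
node = REXML::XPath.first(doc, "//span[@class='mark']/span")
karma = node.text

puts "Karma = #{karma}"
```

The attribute test in square brackets is what keeps the second span, with class other, out of the result.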

WWW::Mechanize

Today most data lives in the "deep web": in databases that are reachable only through forms. The information is absent from static pages and is either generated on the fly or available only after registration and authentication. This is where WWW::Mechanize comes into play.

We have already learned to read karma, but what if we want to find out, for example, the number of unread private messages on Habr? We need to authenticate, receive cookies and only then pull out the number of messages. Let's try to solve the problem in the most convenient way!
As always, let's start by installing the gem: gem install mechanize. Then write the code:

    require 'rubygems'
    require 'mechanize'
    require 'hpricot'

    agent = WWW::Mechanize.new     # plain Mechanize.new in Mechanize 1.0 and later
    page = agent.get 'http://habrahabr.ru/login/'
    form = page.forms.first        # the first form on the page
    form.login = 'MaxElc'          # fields are accessed by the name attribute of the HTML input
    form.password = '****'
    page = agent.submit form       # submit the form; the agent keeps the cookies
    # search takes the same kind of query as in hpricot
    a = agent.get('http://habrahabr.ru/').search(".//a[@href='http://maxelc.habrahabr.ru/mail/']").inner_text
    puts " #{a}!"

Thus Mechanize lets us fill out forms, click buttons and follow links, imitating a browser. Together with hpricot it makes a dangerous mixture ;)
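Roughly what Mechanize automates here can be sketched with Ruby's standard Net::HTTP request objects. Nothing is sent over the network in this sketch; the field names mirror the form above, while the password, the Set-Cookie value and the session cookie name are invented for illustration:

```ruby
require 'net/http'

# Step 1: the login form becomes a URL-encoded POST body.
login = Net::HTTP::Post.new('/login/')
login.set_form_data('login' => 'MaxElc', 'password' => 'secret')

# Step 2: the server would answer with a Set-Cookie header (invented here).
set_cookie = 'session=abc123; path=/; HttpOnly'
cookie = set_cookie.split(';').first  # keep only the name=value pair

# Step 3: later requests send the cookie back to stay authenticated.
inbox = Net::HTTP::Get.new('/')
inbox['Cookie'] = cookie

puts login.body
puts inbox['Cookie']
```

Mechanize performs these three steps behind agent.submit and agent.get, which is why the script above never touches cookies directly.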

Epilogue

That is a little more information for today. Next we will continue to look at other libraries and practice putting them to good use. Waiting for feedback!

Source: https://habr.com/ru/post/51610/
