HTML::TokeParser

One of the most frequently used modules for parsing HTML is HTML::TokeParser. It breaks an HTML document into a stream of tokens, which are then convenient to work with.

Let's look at a practical example, using the site habrahabr.ru.

Example 1. We need to parse the list of links to the full articles.

First, determine the encoding used. To do this, just look at the meta tag; for Habr it is UTF-8:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Second, save the web page to a file and write a small script:

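    use strict;
    use warnings;
    use HTML::TokeParser;
    use Data::Dumper;

    # Open the saved page given as the first command-line argument.
    open(my $f, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $p = HTML::TokeParser->new($f);

    # Dump every token the parser produces.
    while (my $token = $p->get_token()) {
        print Dumper($token);
    }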

We pass our saved file to it as an argument and redirect STDOUT to a file. We should get something like:

    $VAR1 = [
      'T',
      ' ',
      ''
    ];
    $VAR1 = [
      'D',
      '<!DOCTYPE html>'
    ];
    $VAR1 = [
      'T',
      ' ',
      ''
    ];
    $VAR1 = [
      'S',
      'html',
      {
        'xmlns' => 'http://www.w3.org/1999/xhtml',
        'xml:lang' => 'ru'
      },
      [
        'xmlns',
        'xml:lang'
      ],
      '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ru">'
    ];

and so on. This file will be used for debugging.
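To make the dump easier to read, here is a minimal sketch of how the token types can be told apart. The sample markup is hypothetical and is not taken from the real page. The field positions follow the token layouts documented for HTML::TokeParser: a start tag is ['S', tag, attr, attrseq, text], an end tag is ['E', tag, text], text is ['T', text, is_data], and a declaration is ['D', text].

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, used only for illustration.
my $html = '<!DOCTYPE html><p class="x">Hi</p>';

# A reference to a scalar makes the parser read from the string itself.
my $p = HTML::TokeParser->new(\$html);

my @types;    # token types in the order they were seen
while (my $token = $p->get_token()) {
    my $type = $token->[0];
    push @types, $type;

    if ($type eq 'S') {
        # Start tag: name in [1], attribute hash in [2].
        my $class = defined $token->[2]{class} ? $token->[2]{class} : '';
        print "start tag: $token->[1] (class='$class')\n";
    }
    elsif ($type eq 'E') {
        print "end tag:   $token->[1]\n";
    }
    elsif ($type eq 'T') {
        print "text:      $token->[1]\n";
    }
    elsif ($type eq 'D') {
        print "doctype:   $token->[1]\n";
    }
}
```

Checking `$token->[0]` first and only then reading the other fields keeps the conditions safe, because the meaning of `$token->[1]` depends on the token type.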

Third, we use Firebug to see what the link to the full version of an article looks like. Here is what we get in our case:

 <a href="http://habrahabr.ru/post/163525/#habracut" class="button habracut">  →</a>

We can guess that all such links are easy to find thanks to class = "button habracut". We search the file created in step 2 for the string button habracut and then write the parser; I usually implement it as a separate class. The parser should receive the HTML as input. Here is what we get:

test.pl

    use strict;
    use warnings;
    use habr_parse;
    use LWP::UserAgent;
    use Data::Dumper;

    my $ua  = LWP::UserAgent->new();
    my $res = $ua->get("http://habrahabr.ru");

    if ($res->is_success()) {
        my $parser = habr_parse->new();
        # print Dumper ($res);
        my $conf = {};
        $conf->{content} = $res->content;
        $conf->{cp}      = 'utf8';
        my $r = $parser->get_page_links($conf);
        print Dumper($r);
    }

habr_parse.pm

    package habr_parse;

    use strict;
    use warnings;
    use HTML::TokeParser;
    use HTML::Entities;
    use Data::Dumper;
    use Encode;

    sub new {
        my $class = shift;
        my $self  = {};
        bless($self, $class);
    }

    sub get_page_links {
        my $self = shift;
        my $conf = shift;
        my @data;

        # get internal format
        $conf->{content} = decode($conf->{cp}, $conf->{content});
        # print Dumper ($conf);
        decode_entities($conf->{content});

        my $p = HTML::TokeParser->new(\$conf->{content});

        while (my $token = $p->get_token()) {
            # we found our link
            if ($token->[0] eq 'S' && $token->[1] eq 'a'
                && defined($token->[2]->{class})
                && $token->[2]->{class} =~ /^\s*button\s+habracut$/i) {
                push @data, $token->[2]->{href};
            }
        }
        # print Dumper ($p);
        return \@data;
    }

    return 1;

When writing the condition below, having the file created in step 2 at hand helps a lot (especially if there are many conditions):

    if ($token->[0] eq 'S' && $token->[1] eq 'a'
        && defined($token->[2]->{class})
        && $token->[2]->{class} =~ /^\s*button\s+habracut$/i)

In principle, this is a simple example, because each link has a unique attribute (the class value) that is not found anywhere else. But this is not where the power of HTML::TokeParser lies. Consider example 2.

Example 2. For each article, we need to get its list of categories. With Firebug, we notice that the categories are inside a div tag with the attribute class = 'hubs'.

Since we visit the site without cookies or any authentication, we cannot be subscribed to any hub, so every hub link has title = 'You are not subscribed to this hub'.

If we look at the dump created in step 2 (example 1), here is the fragment we need:

    $VAR1 = [
      'S',
      'a',
      {
        'href' => 'http://habrahabr.ru/hub/photo/',
        'title' => '     ',
        'class' => 'hub '
      },
      [
        'href',
        'class',
        'title'
      ],
      '<a href="http://habrahabr.ru/hub/photo/" class="hub " title="     " >'
    ];
    $VAR1 = [
      'T',
      '',
      ''
    ];

Everything turns out to be simple: once we find a link with title = 'You are not subscribed to this hub', we get the next token and, if it is text, save it.
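The simple approach just described could look roughly like the sketch below. The helper name hub_names and the sample markup are hypothetical. The title string is the one quoted above and is assumed to match exactly what the site sends.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text that immediately follows every
# <a> start tag whose title marks it as a hub link.
sub hub_names {
    my ($html) = @_;
    my $wanted_title = 'You are not subscribed to this hub';   # assumed exact value

    my $p = HTML::TokeParser->new(\$html);
    my @names;

    while (my $token = $p->get_token()) {
        next unless $token->[0] eq 'S' && $token->[1] eq 'a';

        my $title = defined $token->[2]{title} ? $token->[2]{title} : '';
        next unless $title eq $wanted_title;

        # Look one token ahead; keep it only if it is text.
        my $next = $p->get_token();
        last unless defined $next;
        if ($next->[0] eq 'T') {
            push @names, $next->[1];
        }
        else {
            # Not text: give the token back so the main loop sees it.
            $p->unget_token($next);
        }
    }
    return \@names;
}

# Hypothetical sample markup shaped like the dump fragment above.
my $sample = '<div class="hubs">'
    . '<a href="/hub/html/" class="hub " title="You are not subscribed to this hub">HTML</a>, '
    . '<a href="/hub/css/" class="hub " title="You are not subscribed to this hub">CSS</a>'
    . '</div>';

my $names = hub_names($sample);
print join(', ', @$names), "\n";
```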

I will show a slightly different technique: we push tokens onto a stack, checking the latest token each time, until we find what we need. If we do not encounter the required token, we return the tokens with unget_token().

Note one more pattern: the data we need is followed by a token with the closing tag a.

    $VAR1 = [
      'T',
      '.   ',
      ''
    ];
    $VAR1 = [
      'E',
      'a',
      '</a>'
    ];

We change habr_parse.pm:

    package habr_parse;

    use strict;
    use warnings;
    use HTML::TokeParser;
    use HTML::Entities;
    use Data::Dumper;
    use Encode;

    sub new {
        my $class = shift;
        my $self  = {};
        bless($self, $class);
    }

    sub get_page_links {
        my $self = shift;
        my $conf = shift;
        my @data;

        # get internal format
        # $conf->{content} = decode($conf->{cp}, $conf->{content});
        # print Dumper ($conf);
        # decode_entities($conf->{content});

        my $p = HTML::TokeParser->new(\$conf->{content});
        my $tmp_conf = {};

        while (my $token = $p->get_token()) {
            # we found our link
            if ($token->[0] eq 'S' && $token->[1] eq 'a'
                && defined($token->[2]->{class})
                && $token->[2]->{class} =~ /^\s*button\s+habracut$/i) {
                $tmp_conf->{href} = $token->[2]->{href};
            }
            elsif ($token->[0] eq 'S' && $token->[1] eq 'div'
                && defined($token->[2]->{class})
                && $token->[2]->{class} eq 'hubs') {
                my @next;
                my $found = 0;

                # start a new record for this article
                $tmp_conf = {};
                my $token = $p->get_token();
                push @next, $token;

                # read tokens until the next div token
                # (the loop assumes there are no nested divs here)
                while (defined $next[$#next] && $next[$#next][1] ne 'div') {
                    my $t = $p->get_token();
                    last unless defined $t;    # end of input
                    push @next, $t;
                    # print Dumper ($next[$#next][1]);

                    # closing tag a
                    if ($next[$#next][0] eq 'E' && $next[$#next][1] eq 'a') {
                        # the T token before it holds the category name
                        if ($next[$#next-1][0] eq 'T') {
                            # print $next[$#next-1][1] . "\n";
                            push @{$tmp_conf->{cats}}, $next[$#next-1][1];
                            $found = 1;
                        }
                    }
                }
                if (!$found) {
                    # nothing found: return the tokens to the stream
                    $p->unget_token(@next);
                }
                push @data, $tmp_conf;
            }
        }
        # print Dumper ($p);
        return \@data;
    }

    return 1;

Result:

    $VAR1 = [
      {
        'cats' => [ '    IT', ' ' ],
        'href' => 'http://habrahabr.ru/post/162053/#habracut'
      },
      {
        'cats' => [ '', ' ' ],
        'href' => 'http://habrahabr.ru/post/163433/#habracut'
      },
      {
        'cats' => [ '  ', '.   ', ' ' ],
        'href' => 'http://habrahabr.ru/post/163493/#habracut'
      },
      {
        'cats' => [ 'HTML', 'CSS' ],
        'href' => 'http://habrahabr.ru/post/163429/#habracut'
      },
      {
        'cats' => [ '', '  Intel' ],
        'href' => 'http://habrahabr.ru/company/intel/blog/162293/#habracut'
      },
      {
        'cats' => [ ' — ', '', '   ' ],
        'href' => 'http://habrahabr.ru/company/tm/blog/163483/#habracut'
      },
      {
        'cats' => [ '-', 'Open source' ],
        'href' => 'http://habrahabr.ru/post/163425/#habracut'
      },
      {
        'cats' => [ '', ' ', 'Open source' ],
        'href' => 'http://habrahabr.ru/post/148911/#habracut'
      },
      {
        'cats' => [ '' ],
        'href' => 'http://habrahabr.ru/post/163445/#habracut'
      },
      {
        'cats' => [ '  ', ' ' ],
        'href' => 'http://habrahabr.ru/post/163525/#habracut'
      }
    ];

A similar approach with unget_token() also lets you search for tokens by nesting level. For example, if we need the third token after a certain one, all we need to do is add three tokens to the array and check the last one. If it is not the one we are looking for, we return all the tokens to the input stream using unget_token().
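The "third token after a certain one" idea can be sketched as follows. The helper nth_token_after, its parameters, and the sample markup are hypothetical; only get_token() and unget_token() come from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: read $n tokens ahead and return the last one if the
# $want callback accepts it. Otherwise push everything that was read back
# onto the parser and return nothing.
sub nth_token_after {
    my ($p, $n, $want) = @_;
    my @ahead;

    while (@ahead < $n) {
        my $t = $p->get_token();
        last unless defined $t;    # end of input
        push @ahead, $t;
    }

    return $ahead[-1] if @ahead == $n && $want->($ahead[-1]);

    # Not what we were looking for: restore the stream.
    $p->unget_token(@ahead) if @ahead;
    return;
}

# Hypothetical sample markup.
my $html = '<li><b>Perl</b> rocks</li>';
my $p    = HTML::TokeParser->new(\$html);

my $found;
while (my $token = $p->get_token()) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'li';

    # Is the third token after <li> the closing </b>?
    $found = nth_token_after($p, 3, sub {
        my ($t) = @_;
        return $t->[0] eq 'E' && $t->[1] eq 'b';
    });
}

print "third token after <li>: $found->[2]\n" if $found;
```

Because the tokens are returned in their original order when the check fails, the main loop continues as if the lookahead had never happened.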

HTML::TokeParser does not store information about nesting, so an array of tokens combined with unget_token() is one way to work around that.

Source: https://habr.com/ru/post/163567/
