This article will discuss functional programming using the example of a search script for broken links using AnyEvent :: HTTP . The following topics will be covered:
An anonymous subroutine is declared as well as the usual one, but there is no name between the sub
keyword and the opening bracket of the program code block. In addition, this form of writing is regarded as part of an expression, therefore the declaration of an anonymous subroutine must be terminated with a semicolon or another expression separator, as in most cases:
sub { ... ... };
For example, we implement a subroutine tripling the value passed to it:
my $triple = sub { my $val = shift; return 3 * $val; }; say $triple->(2); # 6
The main advantage of anonymous subroutines is the use of "code as data." In other words, we save the code to a variable (for example, pass it to a function in the case of callbacks) for further execution.
Also, anonymous subroutines can be used to create recursions, including in combination with callbacks. For example, using the token __SUB__
, which appeared in the version of Perl 5.16.0
, and allows you to get a link to the current subroutine, we implement the calculation of factorial:
use 5.16.0; my $factorial = sub { my $x = shift; return 1 if $x == 1; return $x * __SUB__->($x - 1); }; say $factorial->(5); # 120
An example of using recursion in conjunction with callbacks will be shown below when considering the problem of finding broken links.
As stated in Wikipedia
A closure is a first-class function whose body contains references to variables that are declared outside the body of this function in the surrounding code and are not its parameters.
Essentially, a closure is an analogue of a class in OOP: it provides the functionality and data associated and packaged together. Consider an example of closure in Perl and a class in C ++:
sub multiplicator { my $multiplier = shift; return sub { return shift() * $multiplier; }; }
class multiplicator { public: multiplicator(const int &mul): multiplier(mul) { } long int operator()(const int &n) { return n * multiplier; } private: int multiplier; };
Let's analyze the given code:
private variable declaration:
we first define the lexical (my
) variable$multiplier
(my $multiplier = shift;
);
declare amultiplier
variable of typeint
after the access tokenprivate
;
private variable initialization:
when creating a variable, initialize the passed value;
overload the constructor so that it takes a number and in the initialization list we initialize the variable multiplier
;
creating a subroutine that multiplies the value passed to it from the previously initialized variable:
return an anonymous subroutine that takes a parameter as an input and multiplies it with the previously initialized variable $multiplier
and returns the resulting value;
we overload the function call operator()
, which takes the parametern
as an input, multiplies it with the variablemultiplier
and returns the value.
To use closures in Perl and class in C ++, they need to be defined, i.e. create object:
my $doubled = multiplicator(2);
my $tripled = multiplicator(3);
say $doubled->(3); # 6
say $tripled->(4); # 12
multiplicator doubled(2), tripled(3);
cout << doubled(3) << endl; // 6
cout << tripled(4) << endl; // 12
In C ++, the class object in which the definition operator ()
defined is often called a functional object, or functor. Functional objects are most often used as arguments for common algorithms. For example, in order to add elements of a vector, you can use the for_each algorithm, which applies the transferred function to each element of the sequence and the Sum class with an overloaded operand ()
, which adds all the elements of the sequence and returns the sum. Also, instead of the Sum class, you can use lambdas that appeared in C ++ 11.
#include <iostream> #include <vector> #include <algorithm> using std::cout; using std::endl; using std::vector; class Sum { public: Sum() : sum(0) { }; void operator() (int n) { sum += n; } inline int get_sum() { return sum; } private: int sum; }; int main() { vector<int> nums{3, 4, 2, 9, 15, 267}; Sum s = for_each(nums.begin(), nums.end(), Sum()); cout << " Sum: " << s.get_sum() << endl; long int sum_of_elems = 0; for_each(nums.begin(), nums.end(), [&](int n) { sum_of_elems += n; }); cout << " : " << sum_of_elems << endl; return 0; }
sub for_each { my($arr, $cb) = @_; for my $item (@$arr) { $cb->($item); } } my $sum = 0; for_each [3, 4, 2, 9, 15, 267], sub { $sum += $_[0]; }; say $sum;
As you can see from the example, in C ++ we declare the Sum
class, which contains:
sum
, which is initialized in the standard constructor;()
, which gets each value of dividing and adds to variable sum
;get_sum
method for accessing the private variable sum
.In the Perl example, we create a for_each
function that accepts an array reference and an anonymous function. Next we go through the array, and perform an anonymous function (closure), passing it the next element of the array as a parameter.
When using the for_each
function, we first define a lesbian variable $sum
, initialized to zero. Then, in the for_each
function, for_each
pass the reference to the array and the function closure, in which we summarize each element of the array into the $sum
variable. After the for_each
function is for_each
, the $sum
variable will contain the sum of the array.
The analogue of the function closure from the example in Perl, in C ++ is the use of lambda, as shown in the code. In the Perl example, the closure function passed to a function is also called a callback, or callback function.
As can be seen from the for_each
example, the callback function is the transfer of executable code as one of the parameters of another code. Often, the passed function works as a closure, i.e. It has access to lexical variables and can be defined in other contexts of the program code and can not be accessed directly from the parent function (the function to which the closure / callback was transferred).
In essence, the callback function is analogous to the polymorphism of functions, namely, it allows you to create more general-purpose functions instead of creating a series of functions that are identical in structure but differ only in certain places by the executed subtasks. Consider an example of the task of reading from a file and writing to a file. To do this, we will create two functions reader and writer using Perl (the example was taken from Mikhail Ozerov’s presentation Lazy iterators to parse heterogeneous data ), and using C ++ we will create the classes Reader_base, Writer_base, ReaderWriter.
use strict; use warnings; sub reader { my ($fn, $cb) = @_; open my $in, '<', $fn; while (my $ln = <$in>) { chomp $ln; $cb->($ln); # } close $in; } sub write_file { my ($fn, $cb) = @_; open my $out, '>', $fn; $cb->(sub { # my $ln = shift; syswrite($out, $ln.$/); }); close $out; } write_file('./out.cvs', sub { my $writer = shift; # sub { my $ln = shift; syswrite() } reader('./in.csv', sub { my $ln = shift; my @fields = split /;/, $ln; return unless substr($fields[1], 0, 1) == 6; @fields = @fields[0,1,2]; $writer->(join(';', @fields)); # }); });
#pragma once #include <iostream> #include <string> #include <fstream> // - using std::ifstream; using std::getline; using std::cout; using std::runtime_error; using std::endl; using std::cerr; using std::string; class Reader_base { public: Reader_base(const string &fn_in) : file_name(fn_in) { open(file_name); } virtual ~Reader_base() { infile.close(); } virtual void open(const string &fn_in) { infile.open(fn_in); // , if (! infile.is_open()) throw runtime_error("can't open input file \"" + file_name + "\""); } virtual void main_loop() { try { while(getline(infile, line)) { rcallback(line); } } catch(const runtime_error &e) { cerr << e.what() << " Try again." << endl; } } protected: virtual void rcallback(const string &ln) { throw runtime_error("Method 'callback' must me overloaded!"); }; private: ifstream infile; string line; string file_name; };
#pragma once #include <iostream> #include <string> #include <fstream> // - using std::string; using std::ofstream; using std::cout; using std::runtime_error; using std::endl; using std::cerr; class Writer_base { public: Writer_base(const string &fn_out) : file_name(fn_out) { open(file_name); } virtual ~Writer_base() { outfile.close(); } virtual void open(const string &fn_out) { outfile.open(file_name); if (! outfile.is_open()) throw runtime_error("can't open output file \"" + file_name + "\""); } virtual void write(const string &ln) { outfile << ln << endl; } private: string file_name; ofstream outfile; };
#pragma once #include "Reader.hpp" #include "Writer.hpp" class ReaderWriter : public Reader_base, public Writer_base { public: ReaderWriter(const string &fn_in, const string &fn_out) : Reader_base(fn_in), Writer_base(fn_out) {} virtual ~ReaderWriter() {} protected: virtual void rcallback(const string &ln) { write(ln); } };
#include "ReaderWriter.hpp" int main() { ReaderWriter rw("charset.out", "writer.out"); rw.main_loop(); return 0; }
Compile as follows:
$ g++ -std=c++11 -o main main.cpp
Let's analyze the code:
read from file:
in thereader
function we pass the name of the file for reading and callbacks. First we open the file for reading. Then in the loop we iterate line by line on the file, in each iteration we call the callback, passing it the next line. After completing the cycle, we close the file. If we speak in terms of OOP, then the constructor is responsible for initializing and opening the file, the main loop is responsible for themain_loop
method, whichmain_loop
over the file with a callback call. The file is closed in the destructor. Kolbek is essentially a virtual method that is overloaded in the descendant and called from the parent. This analogy can be traced in the example in C ++.
we in the constructor of theReader_base
class initialize thefile_name
variable, open the file for reading. Next, we create a virtual function member,main_loop
, in which we loop around the file line by line and pass the string to the member functionrcallback
, which should be loaded in the child.
write to file:
in thewriter
function, we pass the name of the file to be written and the callback. As in the example with thereader
function, we first open the file for writing. Then we call a callback into which we pass another callback (closure), in which we get a string and then write it to a file. After exiting the callback, we close the file. If we speak in terms of OOP, then the constructor is responsible for initializing and opening the file. The write method is responsible for writing to the file, which receives a string as input and writes it to a file. Then the file is closed in the destructor. This analogy can be traced in the example in C ++.
we in the constructor of the classWriter_base
initialize the variablefile_name
, open the file for writing. Next, we create a virtual member functionwriter
, to which a string is passed to write to the file. Then the file is closed in the destructor.
working with created functions in Perl and classes in C ++:
we first call thewriter
function, into which we transfer the name of the file to be written and the callback. In the callback, we get another callback variable in the$writer
variable, which will write the string passed to it to the file. Then we call thereader
function, into which we pass the file name for reading and callback. In the reader function callback, we get the next line from the file, work with it, and use the$writer
callback to write to the file. As can be seen from the example, the reader’s callback function is essentially a closure, since contains a link to the lexical variable$writer
.
we create aReaderWriter
class that uses multiple inheritance and inherits theReader_base
andWriter_base
. In the constructor, we initialize theReader_base
andWriter_base
classesReader_base
file name for reading and writing, respectively. Then we create an overloadedrcallback
method that receives the next line and writes to the file using thewrite
method of theWriter_base
class. The overloadedrcallback
method isrcallback
called from themain_loop
method of themain_loop
class. As you can see from the example file main.cpp, to work with classes, anrw
object of theReaderWriter
class isReaderWriter
, to the constructor of which the names of the files for reading and writing are transferred. Then we call the member functionmain_loop
objectrw
.
Next, we consider the complex practical task of finding broken links using AnyEvent :: HTTP, which will use the topics described above — anonymous subroutines, closures, and callback functions.
In order to solve the problem of finding broken links (links with response codes 4xx and 5xx), you need to understand how to implement a crawl site. In essence, the site is a link graph, i.e. URLs can refer to both external pages and internal pages. To crawl the site, we will use the following algorithm:
process_page(current_page): for each link on the current_page: if target_page is not already in your graph: create a Page object to represent target_page add it to to_be_scanned set add a link from current_page to target_page scan_website(start_page) create Page object for start_page to_be_scanned = set(start_page) while to_be_scanned is not empty: current_page = to_be_scanned.pop() process_page(current_page)
The implementation of this task lies in the Broken link checker repository. Consider the checker_with_graph.pl script. First, we initialize the variables $start_page_url
(url of the start page), $cnt
(the number of URLs to download), create a hash $to_be_scanned
and graph $g
.
Then we create the scan_website,
function scan_website,
into which we pass the limit on the maximum number of URLs for downloading and callbacks.
sub scan_website { my ($count_url_limit, $cb) = @_;
First, we initialize the $to_be_scanned
hash $to_be_scanned
start page.
# to_be_scanned = set(start_page) $to_be_scanned->{$start_page_url}{internal_urls} = [$start_page_url];
A complete analysis of the $to_be_scanned
structure will be further, and now it’s worth paying attention to the fact that the link is internal (internal_urls).
Next, create an anonymous function and execute it. Record view
my $do; $do = sub { ... }; $do->();
is a standard idiom and allows you to access the $do
variable from a closure, for example, to create a recursion:
my $do; $do = sub { ...; $do->(); ... }; $do->();
or delete circular reference:
my $do; $do = sub { ...; undef $do; ... }; $do->();
In the closure of $do
we create a %urls
hash into which we add the urls from the $to_be_scanned
hash.
my %urls; for my $parent_url (keys %$to_be_scanned) { my $type_urls = $to_be_scanned->{$parent_url}; # $type_urls - internal_urls|external_urls push @{$urls{$parent_url}}, splice(@{$type_urls->{internal_urls}}, 0, $max_connects); while (my ($root_domain, $external_urls) = each %{$type_urls->{external_urls}}) { push @{$urls{$parent_url}}, splice(@$external_urls, 0, 1); } }
The %urls
hash structure is %urls
follows:
{parent_url1 => [target_url1, target_url2, target_url3], parent_url2 => [...]}
Then we execute the function process_page
, passing it a link to the %urls
and hash.
process_page(\%urls, sub { ... });
In the process_page
function, we save the resulting hash and callback.
sub process_page { my ($current_page_urls, $cb) = @_;
Then we cycle through the url hash, getting a pair (parent_url => current_urls)
and then go through the list of current urls (current_urls)
while (my ($parent_url, $current_urls) = each %$current_page_urls) { for my $current_url (@$current_urls) {
Before proceeding to the consideration of receiving data from the pages, we will make a small digression. The basic algorithm for parsing a page and getting URLs from it assumes one HTTP GET method, regardless of whether this is internal or external. In this implementation, two HEAD and GET calls were used to reduce the server load as follows:
So, first we perform the http_head
function of the AnyEvent :: HTTP module, passing it the current URL, request parameters and callback.
$cv->begin; http_head $current_url, %params, sub {
In the callback, we get the headers (HTTP headers)
my $headers = $_[1];
from which we get the real url (url after redirects)
my $real_current_url = $headers->{URL};
Then we save the %urls_with_redirects
pair (current_url => real_current_url)
to the hash.
$urls_with_redirects{$current_url} = $real_current_url if $current_url ne $real_current_url;
Further, if an error occurs (status codes 4xx and 5xx), then we output the error to the log and save the header to the hash for further use
if ( $headers->{Status} =~ /^[45]/ && !($headers->{Status} == 405 && $headers->{allow} =~ /\bget\b/i) ) { $warn_log->("$headers->{Status} | $parent_url -> $real_current_url") if $warn; $note_log->(sub { p($headers) }) if $note; $urls_with_errors{$current_url} = $headers; # }
Otherwise, if the site is internal and this is a web page,
elsif ( # ($start_page_url_root eq $url_normalization->root_domain($real_current_url)) # - && ($headers->{'content-type'} =~ m{^text/html}) ) {
then we execute the http_get
function, with which we transfer the real current URL obtained above, the request parameters and the callback.
$cv->begin; http_get $real_current_url, %params, sub {
In the http_get function http_get
we get the headers and body of the page, we decode the page.
my ($content, $headers) = @_; $content = content_decode($content, $headers->{'content-type'});
Using the Web :: Query module, we perform the page parsing and getting URLs.
wq($content)->find('a') ->filter(sub { my $href = $_[1]->attr('href'); # , $href !~ /^#/ && $href ne '/' && $href !~ m{^mailto:(?://)?[A-Z0-9+_.-]+@[A-Z0-9.-]+}i && ++$hrefs{$href} == 1 # if $href }) ->each(sub { # for each link on the current page
At each iteration of the each
method, we get a link in the callback.
my $href = $_->attr('href');
and transform it
$href = $url_normalization->canonical($href); # '/', '/contact' (//dev.twitter.com/etc) if ($href =~ m{^/[^/].*}) { $href = $url_normalization->path($real_current_url, $href) ; } $href = $url_normalization->without_fragment($href);
Next, we check if there is no such link in the column.
unless($g->has_vertex($href)) { # if tarteg_page is not already in your graph
then we get the root domain of the link (or put it in 'fails')
my $root_domain = $url_normalization->root_domain($href) || 'fails';
Then we fill in the $new_urls
structure, which is similar to the $to_be_scanned
structure and has the following form:
$new_urls = $to_be_scanned = { parent_url => { external_urls => { root_domain1 => [qw/url1 url2 url3/], root_domain2 => [qw/url1 url2 url3/], }, internal_urls => [qw/url url url/], }, };
In the $new_urls
structure, we create a pair (parent_url => target_url)
, while target_url
divide target_url
into several parts, namely, we divide into internal URLs, which we save into an array, and external ones, which we also divide by domains and also save into an array. This structure reduces the load on sites as follows - we select $max_connects ( )
internal URLs and one external URL for each domain at a time, as shown in the closure $do
higher when constructing %urls
hash. Accordingly, at the beginning of the scan_website
function scan_website
we saved the start page as follows:
$to_be_scanned = { $start_page_url => { internal_urls => [$start_page_url], }, };
those. in this case, both the parent and the current page were the start page (in other cases, these pages are different).
The construction of this structure is as follows - if the site is internal, then we create the structure
$new_urls->{$real_current_url}{internal_urls} //= []
otherwise, if the site is internal, the structure
$new_urls->{$real_current_url}{external_urls}{$root_domain} //= []
and save one of these structures to the $urls
variable, which we then use to write to the $new_urls
structure.
push @$urls, $href; # add it to to_be_scanned set
In this case, we use links to create and work with complex data structures. The$urls
refers to the$new_urls
structure, and accordingly, when the$urls
variable changes, the$urls
structure$new_urls
. More information about data structures and algorithms in Perl can be found in the book "Jon Orwant - Mastering Algorithms with Perl".
Then we add a pair to the graph (real_current_url (parent) => href (current))
.
$g->add_edge($real_current_url, $href);
After that, we check the structure of $new_urls
- if the arrays of internal_urls
or external_urls
not empty, then output the data to the log and execute the callback, passing it the structure of $new_urls
if (is_to_be_scanned($new_urls)) { $debug_log->(($parent_url // '')." -> $real_current_url ".p($new_urls)) if $debug; $cb->($new_urls); }
If we did not hit any of the options (error or parsing the internal page), i.e. the site is external and without errors, then we execute Kolbek
else { $cb->(); }
This cobble call is needed when all external sites are on the current $current_urls
list, but $to_be_scanned
still $to_be_scanned
in $to_be_scanned
. Without this call, we will go through the $current_urls
list, http_head
, and http_head
.
In the process_page function callback, we save the resulting $new_urls
structure,
process_page(\%urls, sub { my $new_urls = shift;
combine it with the $to_be_scanned
variable.
$to_be_scanned = merge($to_be_scanned, $new_urls) if $new_urls;
Next, we check if the number of elements in the graph is greater than or equal to the limit on the number of URLs, then we exit by removing the link to the anonymous subroutine and executing $cv->send()
.
if (scalar($g->vertices) >= $count_url_limit) { undef $do; $cb->(); $cv->send; }
Otherwise, if there are URLs to check,
elsif (is_to_be_scanned($to_be_scanned)) {
then recursively call an anonymous subroutine
$do->();
which call was considered above. $to_be_scanned
process_page
( ).
, GraphViz — svg, png .. :
$ perl bin/checker_with_graph.pl -u planetperl.ru -m 500 -c 5 \ -g -f svg -o etc/panetperl_ru.svg -l "broken link check" -r "http_//planetperl.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 500 -c 5 \ -g -f svg -o etc/habr_ru.svg -l "broken link check" -r "https_//habrahabr.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 100 -c 5 \ -g -f png -o etc/habr_ru.png -l "broken link check" -r "https_//habrahabr.ru/"
Where
--url | -u --max_urls | -m --max_connects | -c --graphviz | -g --graphviz_log_level | -e , . perldoc Log::Handler --format | -f - png, svg, etc --output_file | -o --label | -l --root | -r - .. twopi
PERL_ANYEVENT_VERBOSE,
$ export PERL_ANYEVENT_VERBOSE=n
n:
Perl, , — , . Perl C++, (callbacks) Perl - C++. AnyEvent::HTTP, .
Source: https://habr.com/ru/post/326028/