Functional Perl Programming in the Examples

This article will discuss functional programming using the example of a search script for broken links using AnyEvent :: HTTP . The following topics will be covered:

anonymous subroutines;
closures;
callbacks;

Anonymous routines

An anonymous subroutine is declared as well as the usual one, but there is no name between the sub keyword and the opening bracket of the program code block. In addition, this form of writing is regarded as part of an expression, therefore the declaration of an anonymous subroutine must be terminated with a semicolon or another expression separator, as in most cases:

 sub { ...   ... };

For example, we implement a subroutine tripling the value passed to it:

 my $triple = sub { my $val = shift; return 3 * $val; }; say $triple->(2); # 6

The main advantage of anonymous subroutines is the use of "code as data." In other words, we save the code to a variable (for example, pass it to a function in the case of callbacks) for further execution.

Also, anonymous subroutines can be used to create recursions, including in combination with callbacks. For example, using the token __SUB__ , which appeared in the version of Perl 5.16.0 , and allows you to get a link to the current subroutine, we implement the calculation of factorial:

 use 5.16.0; my $factorial = sub { my $x = shift; return 1 if $x == 1; return $x * __SUB__->($x - 1); }; say $factorial->(5); # 120

An example of using recursion in conjunction with callbacks will be shown below when considering the problem of finding broken links.

Closures

As stated in Wikipedia

A closure is a first-class function whose body contains references to variables that are declared outside the body of this function in the surrounding code and are not its parameters.

Essentially, a closure is an analogue of a class in OOP: it provides the functionality and data associated and packaged together. Consider an example of closure in Perl and a class in C ++:

Perl

 sub multiplicator { my $multiplier = shift; return sub { return shift() * $multiplier; }; }

C ++

 class multiplicator { public: multiplicator(const int &mul): multiplier(mul) { } long int operator()(const int &n) { return n * multiplier; } private: int multiplier; };

Let's analyze the given code:

private variable declaration:
- Perl:
we first define the lexical ( my ) variable $multiplier ( my $multiplier = shift; );
- C ++:
declare a multiplier variable of type int after the access token private ;
private variable initialization:
- Perl:
when creating a variable, initialize the passed value;
- C ++:
overload the constructor so that it takes a number and in the initialization list we initialize the variable multiplier ;
creating a subroutine that multiplies the value passed to it from the previously initialized variable:
- Perl:
return an anonymous subroutine that takes a parameter as an input and multiplies it with the previously initialized variable $multiplier and returns the resulting value;
- C ++:
we overload the function call operator () , which takes the parameter n as an input, multiplies it with the variable multiplier and returns the value.

To use closures in Perl and class in C ++, they need to be defined, i.e. create object:

Perl:

Object Definition:

my $doubled = multiplicator(2);

my $tripled = multiplicator(3);

Using:

say $doubled->(3); # 6

say $tripled->(4); # 12

C ++:

Object Definition:

multiplicator doubled(2), tripled(3);

Using:

cout << doubled(3) << endl; // 6

cout << tripled(4) << endl; // 12

In C ++, the class object in which the definition operator () defined is often called a functional object, or functor. Functional objects are most often used as arguments for common algorithms. For example, in order to add elements of a vector, you can use the for_each algorithm, which applies the transferred function to each element of the sequence and the Sum class with an overloaded operand () , which adds all the elements of the sequence and returns the sum. Also, instead of the Sum class, you can use lambdas that appeared in C ++ 11.

C ++:

 #include <iostream> #include <vector> #include <algorithm> using std::cout; using std::endl; using std::vector; class Sum { public: Sum() : sum(0) { }; void operator() (int n) { sum += n; } inline int get_sum() { return sum; } private: int sum; }; int main() { vector<int> nums{3, 4, 2, 9, 15, 267}; Sum s = for_each(nums.begin(), nums.end(), Sum()); cout << "    Sum: " << s.get_sum() << endl; long int sum_of_elems = 0; for_each(nums.begin(), nums.end(), [&](int n) { sum_of_elems += n; }); cout << "   : " << sum_of_elems << endl; return 0; }

Perl:

 sub for_each { my($arr, $cb) = @_; for my $item (@$arr) { $cb->($item); } } my $sum = 0; for_each [3, 4, 2, 9, 15, 267], sub { $sum += $_[0]; }; say $sum;

As you can see from the example, in C ++ we declare the Sum class, which contains:

private variable sum , which is initialized in the standard constructor;
overloaded operator () , which gets each value of dividing and adds to variable sum ;
get_sum method for accessing the private variable sum .

In the Perl example, we create a for_each function that accepts an array reference and an anonymous function. Next we go through the array, and perform an anonymous function (closure), passing it the next element of the array as a parameter.

When using the for_each function, we first define a lesbian variable $sum , initialized to zero. Then, in the for_each function, for_each pass the reference to the array and the function closure, in which we summarize each element of the array into the $sum variable. After the for_each function is for_each , the $sum variable will contain the sum of the array.

The analogue of the function closure from the example in Perl, in C ++ is the use of lambda, as shown in the code. In the Perl example, the closure function passed to a function is also called a callback, or callback function.

Callback Functions

As can be seen from the for_each example, the callback function is the transfer of executable code as one of the parameters of another code. Often, the passed function works as a closure, i.e. It has access to lexical variables and can be defined in other contexts of the program code and can not be accessed directly from the parent function (the function to which the closure / callback was transferred).

In essence, the callback function is analogous to the polymorphism of functions, namely, it allows you to create more general-purpose functions instead of creating a series of functions that are identical in structure but differ only in certain places by the executed subtasks. Consider an example of the task of reading from a file and writing to a file. To do this, we will create two functions reader and writer using Perl (the example was taken from Mikhail Ozerov’s presentation Lazy iterators to parse heterogeneous data ), and using C ++ we will create the classes Reader_base, Writer_base, ReaderWriter.

Perl

read_write_file.pl

 use strict; use warnings; sub reader { my ($fn, $cb) = @_; open my $in, '<', $fn; while (my $ln = <$in>) { chomp $ln; $cb->($ln); #       } close $in; } sub write_file { my ($fn, $cb) = @_; open my $out, '>', $fn; $cb->(sub { #        my $ln = shift; syswrite($out, $ln.$/); }); close $out; } write_file('./out.cvs', sub { my $writer = shift; # sub { my $ln = shift; syswrite() } reader('./in.csv', sub { my $ln = shift; my @fields = split /;/, $ln; return unless substr($fields[1], 0, 1) == 6; @fields = @fields[0,1,2]; $writer->(join(';', @fields)); #        }); });

C ++

Reader_base.hpp

 #pragma once #include <iostream> #include <string> #include <fstream> //   - using std::ifstream; using std::getline; using std::cout; using std::runtime_error; using std::endl; using std::cerr; using std::string; class Reader_base { public: Reader_base(const string &fn_in) : file_name(fn_in) { open(file_name); } virtual ~Reader_base() { infile.close(); } virtual void open(const string &fn_in) { infile.open(fn_in); //  ,       if (! infile.is_open()) throw runtime_error("can't open input file \"" + file_name + "\""); } virtual void main_loop() { try { while(getline(infile, line)) { rcallback(line); } } catch(const runtime_error &e) { cerr << e.what() << " Try again." << endl; } } protected: virtual void rcallback(const string &ln) { throw runtime_error("Method 'callback' must me overloaded!"); }; private: ifstream infile; string line; string file_name; };

Writer_base.hpp

 #pragma once #include <iostream> #include <string> #include <fstream> //   - using std::string; using std::ofstream; using std::cout; using std::runtime_error; using std::endl; using std::cerr; class Writer_base { public: Writer_base(const string &fn_out) : file_name(fn_out) { open(file_name); } virtual ~Writer_base() { outfile.close(); } virtual void open(const string &fn_out) { outfile.open(file_name); if (! outfile.is_open()) throw runtime_error("can't open output file \"" + file_name + "\""); } virtual void write(const string &ln) { outfile << ln << endl; } private: string file_name; ofstream outfile; };

ReaderWriter.hpp

 #pragma once #include "Reader.hpp" #include "Writer.hpp" class ReaderWriter : public Reader_base, public Writer_base { public: ReaderWriter(const string &fn_in, const string &fn_out) : Reader_base(fn_in), Writer_base(fn_out) {} virtual ~ReaderWriter() {} protected: virtual void rcallback(const string &ln) { write(ln); } };

main.cpp

 #include "ReaderWriter.hpp" int main() { ReaderWriter rw("charset.out", "writer.out"); rw.main_loop(); return 0; }

Compile as follows:

 $ g++ -std=c++11 -o main main.cpp

Let's analyze the code:

read from file:
- Perl:
in the reader function we pass the name of the file for reading and callbacks. First we open the file for reading. Then in the loop we iterate line by line on the file, in each iteration we call the callback, passing it the next line. After completing the cycle, we close the file. If we speak in terms of OOP, then the constructor is responsible for initializing and opening the file, the main loop is responsible for the main_loop method, which main_loop over the file with a callback call. The file is closed in the destructor. Kolbek is essentially a virtual method that is overloaded in the descendant and called from the parent. This analogy can be traced in the example in C ++.
- C ++:
we in the constructor of the Reader_base class initialize the file_name variable, open the file for reading. Next, we create a virtual function member, main_loop , in which we loop around the file line by line and pass the string to the member function rcallback , which should be loaded in the child.
write to file:
- Perl:
in the writer function, we pass the name of the file to be written and the callback. As in the example with the reader function, we first open the file for writing. Then we call a callback into which we pass another callback (closure), in which we get a string and then write it to a file. After exiting the callback, we close the file. If we speak in terms of OOP, then the constructor is responsible for initializing and opening the file. The write method is responsible for writing to the file, which receives a string as input and writes it to a file. Then the file is closed in the destructor. This analogy can be traced in the example in C ++.
- C ++:
we in the constructor of the class Writer_base initialize the variable file_name , open the file for writing. Next, we create a virtual member function writer , to which a string is passed to write to the file. Then the file is closed in the destructor.
working with created functions in Perl and classes in C ++:
- Perl:
we first call the writer function, into which we transfer the name of the file to be written and the callback. In the callback, we get another callback variable in the $writer variable, which will write the string passed to it to the file. Then we call the reader function, into which we pass the file name for reading and callback. In the reader function callback, we get the next line from the file, work with it, and use the $writer callback to write to the file. As can be seen from the example, the reader’s callback function is essentially a closure, since contains a link to the lexical variable $writer .
- C ++:
we create a ReaderWriter class that uses multiple inheritance and inherits the Reader_base and Writer_base . In the constructor, we initialize the Reader_base and Writer_base classes Reader_base file name for reading and writing, respectively. Then we create an overloaded rcallback method that receives the next line and writes to the file using the write method of the Writer_base class. The overloaded rcallback method is rcallback called from the main_loop method of the main_loop class. As you can see from the example file main.cpp, to work with classes, an rw object of the ReaderWriter class is ReaderWriter , to the constructor of which the names of the files for reading and writing are transferred. Then we call the member function main_loop object rw .

Next, we consider the complex practical task of finding broken links using AnyEvent :: HTTP, which will use the topics described above — anonymous subroutines, closures, and callback functions.

The task of finding broken links

In order to solve the problem of finding broken links (links with response codes 4xx and 5xx), you need to understand how to implement a crawl site. In essence, the site is a link graph, i.e. URLs can refer to both external pages and internal pages. To crawl the site, we will use the following algorithm:

 process_page(current_page): for each link on the current_page: if target_page is not already in your graph: create a Page object to represent target_page add it to to_be_scanned set add a link from current_page to target_page scan_website(start_page) create Page object for start_page to_be_scanned = set(start_page) while to_be_scanned is not empty: current_page = to_be_scanned.pop() process_page(current_page)

The implementation of this task lies in the Broken link checker repository. Consider the checker_with_graph.pl script. First, we initialize the variables $start_page_url (url of the start page), $cnt (the number of URLs to download), create a hash $to_be_scanned and graph $g .

Then we create the scan_website, function scan_website, into which we pass the limit on the maximum number of URLs for downloading and callbacks.

 sub scan_website { my ($count_url_limit, $cb) = @_;

First, we initialize the $to_be_scanned hash $to_be_scanned start page.

 # to_be_scanned = set(start_page) $to_be_scanned->{$start_page_url}{internal_urls} = [$start_page_url];

A complete analysis of the $to_be_scanned structure will be further, and now it’s worth paying attention to the fact that the link is internal (internal_urls).

Next, create an anonymous function and execute it. Record view

 my $do; $do = sub { ... }; $do->();

is a standard idiom and allows you to access the $do variable from a closure, for example, to create a recursion:

 my $do; $do = sub { ...; $do->(); ... }; $do->();

or delete circular reference:

 my $do; $do = sub { ...; undef $do; ... }; $do->();

In the closure of $do we create a %urls hash into which we add the urls from the $to_be_scanned hash.

 my %urls; for my $parent_url (keys %$to_be_scanned) { my $type_urls = $to_be_scanned->{$parent_url}; # $type_urls - internal_urls|external_urls push @{$urls{$parent_url}}, splice(@{$type_urls->{internal_urls}}, 0, $max_connects); while (my ($root_domain, $external_urls) = each %{$type_urls->{external_urls}}) { push @{$urls{$parent_url}}, splice(@$external_urls, 0, 1); } }

The %urls hash structure is %urls follows:

 {parent_url1 => [target_url1, target_url2, target_url3], parent_url2 => [...]}

Then we execute the function process_page , passing it a link to the %urls and hash.

 process_page(\%urls, sub { ... });

In the process_page function, we save the resulting hash and callback.

 sub process_page { my ($current_page_urls, $cb) = @_;

Then we cycle through the url hash, getting a pair (parent_url => current_urls) and then go through the list of current urls (current_urls)

 while (my ($parent_url, $current_urls) = each %$current_page_urls) { for my $current_url (@$current_urls) {

Before proceeding to the consideration of receiving data from the pages, we will make a small digression. The basic algorithm for parsing a page and getting URLs from it assumes one HTTP GET method, regardless of whether this is internal or external. In this implementation, two HEAD and GET calls were used to reduce the server load as follows:

HEAD requests are executed for all external URLs (regardless of whether they are in error or not); for internal with an error and for non-web pages;
HEAD and GET requests are executed for internal web pages without errors;

So, first we perform the http_head function of the AnyEvent :: HTTP module, passing it the current URL, request parameters and callback.

 $cv->begin; http_head $current_url, %params, sub {

In the callback, we get the headers (HTTP headers)

 my $headers = $_[1];

from which we get the real url (url after redirects)

 my $real_current_url = $headers->{URL};

Then we save the %urls_with_redirects pair (current_url => real_current_url) to the hash.

 $urls_with_redirects{$current_url} = $real_current_url if $current_url ne $real_current_url;

Further, if an error occurs (status codes 4xx and 5xx), then we output the error to the log and save the header to the hash for further use

 if ( $headers->{Status} =~ /^[45]/ && !($headers->{Status} == 405 && $headers->{allow} =~ /\bget\b/i) ) { $warn_log->("$headers->{Status} | $parent_url -> $real_current_url") if $warn; $note_log->(sub { p($headers) }) if $note; $urls_with_errors{$current_url} = $headers; #      }

Otherwise, if the site is internal and this is a web page,

  elsif ( #   ($start_page_url_root eq $url_normalization->root_domain($real_current_url)) #   - && ($headers->{'content-type'} =~ m{^text/html}) ) {

then we execute the http_get function, with which we transfer the real current URL obtained above, the request parameters and the callback.

 $cv->begin; http_get $real_current_url, %params, sub {

In the http_get function http_get we get the headers and body of the page, we decode the page.

 my ($content, $headers) = @_; $content = content_decode($content, $headers->{'content-type'});

Using the Web :: Query module, we perform the page parsing and getting URLs.

 wq($content)->find('a') ->filter(sub { my $href = $_[1]->attr('href'); #           ,   $href !~ /^#/ && $href ne '/' && $href !~ m{^mailto:(?://)?[A-Z0-9+_.-]+@[A-Z0-9.-]+}i && ++$hrefs{$href} == 1 #      if $href }) ->each(sub { # for each link on the current page

At each iteration of the each method, we get a link in the callback.

 my $href = $_->attr('href');

and transform it

 $href = $url_normalization->canonical($href); #     '/', '/contact'    (//dev.twitter.com/etc) if ($href =~ m{^/[^/].*}) { $href = $url_normalization->path($real_current_url, $href) ; } $href = $url_normalization->without_fragment($href);

Next, we check if there is no such link in the column.

 unless($g->has_vertex($href)) { # if tarteg_page is not already in your graph

then we get the root domain of the link (or put it in 'fails')

 my $root_domain = $url_normalization->root_domain($href) || 'fails';

Then we fill in the $new_urls structure, which is similar to the $to_be_scanned structure and has the following form:

 $new_urls = $to_be_scanned = { parent_url => { external_urls => { root_domain1 => [qw/url1 url2 url3/], root_domain2 => [qw/url1 url2 url3/], }, internal_urls => [qw/url url url/], }, };

In the $new_urls structure, we create a pair (parent_url => target_url) , while target_url divide target_url into several parts, namely, we divide into internal URLs, which we save into an array, and external ones, which we also divide by domains and also save into an array. This structure reduces the load on sites as follows - we select $max_connects ( ) internal URLs and one external URL for each domain at a time, as shown in the closure $do higher when constructing %urls hash. Accordingly, at the beginning of the scan_website function scan_website we saved the start page as follows:

 $to_be_scanned = { $start_page_url => { internal_urls => [$start_page_url], }, };

those. in this case, both the parent and the current page were the start page (in other cases, these pages are different).

The construction of this structure is as follows - if the site is internal, then we create the structure

 $new_urls->{$real_current_url}{internal_urls} //= []

otherwise, if the site is internal, the structure

 $new_urls->{$real_current_url}{external_urls}{$root_domain} //= []

and save one of these structures to the $urls variable, which we then use to write to the $new_urls structure.

 push @$urls, $href; # add it to to_be_scanned set

In this case, we use links to create and work with complex data structures. The $urls refers to the $new_urls structure, and accordingly, when the $urls variable changes, the $urls structure $new_urls . More information about data structures and algorithms in Perl can be found in the book "Jon Orwant - Mastering Algorithms with Perl".

Then we add a pair to the graph (real_current_url (parent) => href (current)) .

 $g->add_edge($real_current_url, $href);

After that, we check the structure of $new_urls - if the arrays of internal_urls or external_urls not empty, then output the data to the log and execute the callback, passing it the structure of $new_urls

 if (is_to_be_scanned($new_urls)) { $debug_log->(($parent_url // '')." -> $real_current_url ".p($new_urls)) if $debug; $cb->($new_urls); }

If we did not hit any of the options (error or parsing the internal page), i.e. the site is external and without errors, then we execute Kolbek

  else { $cb->(); }

This cobble call is needed when all external sites are on the current $current_urls list, but $to_be_scanned still $to_be_scanned in $to_be_scanned . Without this call, we will go through the $current_urls list, http_head , and http_head .

In the process_page function callback, we save the resulting $new_urls structure,

 process_page(\%urls, sub { my $new_urls = shift;

combine it with the $to_be_scanned variable.

 $to_be_scanned = merge($to_be_scanned, $new_urls) if $new_urls;

Next, we check if the number of elements in the graph is greater than or equal to the limit on the number of URLs, then we exit by removing the link to the anonymous subroutine and executing $cv->send() .

 if (scalar($g->vertices) >= $count_url_limit) { undef $do; $cb->(); $cv->send; }

Otherwise, if there are URLs to check,

  elsif (is_to_be_scanned($to_be_scanned)) {

then recursively call an anonymous subroutine

 $do->();

which call was considered above. $to_be_scanned process_page ( ).

, GraphViz — svg, png .. :

 $ perl bin/checker_with_graph.pl -u planetperl.ru -m 500 -c 5 \ -g -f svg -o etc/panetperl_ru.svg -l "broken link check" -r "http_//planetperl.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 500 -c 5 \ -g -f svg -o etc/habr_ru.svg -l "broken link check" -r "https_//habrahabr.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 100 -c 5 \ -g -f png -o etc/habr_ru.png -l "broken link check" -r "https_//habrahabr.ru/"

Where

 --url | -u   --max_urls | -m      --max_connects | -c     --graphviz | -g    --graphviz_log_level | -e       , . perldoc Log::Handler --format | -f    - png, svg, etc --output_file | -o     --label | -l   --root | -r     - ..   twopi

PERL_ANYEVENT_VERBOSE,

 $ export PERL_ANYEVENT_VERBOSE=n

5 (warn) — http
6 (note) — http ( $headers)
7 (info) — URLs
8 (debug) — ,

Conclusion

Perl, , — , . Perl C++, (callbacks) Perl - C++. AnyEvent::HTTP, .

Source: https://habr.com/ru/post/326028/

All Articles

Functional Perl Programming in the Examples

Anonymous routines

Closures

Perl

C ++

Perl:

C ++:

C ++:

Perl:

Callback Functions

Perl

C ++

The task of finding broken links

Conclusion

More articles: