⬆️ ⬇️

Functional Perl Programming in the Examples

This article will discuss functional programming using the example of a search script for broken links using AnyEvent :: HTTP . The following topics will be covered:





Anonymous routines



An anonymous subroutine is declared as well as the usual one, but there is no name between the sub keyword and the opening bracket of the program code block. In addition, this form of writing is regarded as part of an expression, therefore the declaration of an anonymous subroutine must be terminated with a semicolon or another expression separator, as in most cases:



 sub { ...   ... }; 


For example, we implement a subroutine tripling the value passed to it:



 my $triple = sub { my $val = shift; return 3 * $val; }; say $triple->(2); # 6 


The main advantage of anonymous subroutines is the use of "code as data." In other words, we save the code to a variable (for example, pass it to a function in the case of callbacks) for further execution.



Also, anonymous subroutines can be used to create recursions, including in combination with callbacks. For example, using the token __SUB__ , which appeared in the version of Perl 5.16.0 , and allows you to get a link to the current subroutine, we implement the calculation of factorial:



 use 5.16.0; my $factorial = sub { my $x = shift; return 1 if $x == 1; return $x * __SUB__->($x - 1); }; say $factorial->(5); # 120 


An example of using recursion in conjunction with callbacks will be shown below when considering the problem of finding broken links.



Closures



As stated in Wikipedia



A closure is a first-class function whose body contains references to variables that are declared outside the body of this function in the surrounding code and are not its parameters.

Essentially, a closure is an analogue of a class in OOP: it provides the functionality and data associated and packaged together. Consider an example of closure in Perl and a class in C ++:



Perl



 sub multiplicator { my $multiplier = shift; return sub { return shift() * $multiplier; }; } 


C ++



 class multiplicator { public: multiplicator(const int &mul): multiplier(mul) { } long int operator()(const int &n) { return n * multiplier; } private: int multiplier; }; 


Let's analyze the given code:





To use closures in Perl and class in C ++, they need to be defined, i.e. create object:



Perl:





my $doubled = multiplicator(2);



my $tripled = multiplicator(3);




say $doubled->(3); # 6



say $tripled->(4); # 12


C ++:





multiplicator doubled(2), tripled(3);




cout << doubled(3) << endl; // 6



cout << tripled(4) << endl; // 12

In C ++, the class object in which the definition operator () defined is often called a functional object, or functor. Functional objects are most often used as arguments for common algorithms. For example, in order to add elements of a vector, you can use the for_each algorithm, which applies the transferred function to each element of the sequence and the Sum class with an overloaded operand () , which adds all the elements of the sequence and returns the sum. Also, instead of the Sum class, you can use lambdas that appeared in C ++ 11.



C ++:



 #include <iostream> #include <vector> #include <algorithm> using std::cout; using std::endl; using std::vector; class Sum { public: Sum() : sum(0) { }; void operator() (int n) { sum += n; } inline int get_sum() { return sum; } private: int sum; }; int main() { vector<int> nums{3, 4, 2, 9, 15, 267}; Sum s = for_each(nums.begin(), nums.end(), Sum()); cout << "    Sum: " << s.get_sum() << endl; long int sum_of_elems = 0; for_each(nums.begin(), nums.end(), [&](int n) { sum_of_elems += n; }); cout << "   : " << sum_of_elems << endl; return 0; } 


Perl:



 sub for_each { my($arr, $cb) = @_; for my $item (@$arr) { $cb->($item); } } my $sum = 0; for_each [3, 4, 2, 9, 15, 267], sub { $sum += $_[0]; }; say $sum; 


As you can see from the example, in C ++ we declare the Sum class, which contains:





In the Perl example, we create a for_each function that accepts an array reference and an anonymous function. Next we go through the array, and perform an anonymous function (closure), passing it the next element of the array as a parameter.



When using the for_each function, we first define a lesbian variable $sum , initialized to zero. Then, in the for_each function, for_each pass the reference to the array and the function closure, in which we summarize each element of the array into the $sum variable. After the for_each function is for_each , the $sum variable will contain the sum of the array.



The analogue of the function closure from the example in Perl, in C ++ is the use of lambda, as shown in the code. In the Perl example, the closure function passed to a function is also called a callback, or callback function.



Callback Functions



As can be seen from the for_each example, the callback function is the transfer of executable code as one of the parameters of another code. Often, the passed function works as a closure, i.e. It has access to lexical variables and can be defined in other contexts of the program code and can not be accessed directly from the parent function (the function to which the closure / callback was transferred).



In essence, the callback function is analogous to the polymorphism of functions, namely, it allows you to create more general-purpose functions instead of creating a series of functions that are identical in structure but differ only in certain places by the executed subtasks. Consider an example of the task of reading from a file and writing to a file. To do this, we will create two functions reader and writer using Perl (the example was taken from Mikhail Ozerov’s presentation Lazy iterators to parse heterogeneous data ), and using C ++ we will create the classes Reader_base, Writer_base, ReaderWriter.



Perl



read_write_file.pl
 use strict; use warnings; sub reader { my ($fn, $cb) = @_; open my $in, '<', $fn; while (my $ln = <$in>) { chomp $ln; $cb->($ln); #       } close $in; } sub write_file { my ($fn, $cb) = @_; open my $out, '>', $fn; $cb->(sub { #        my $ln = shift; syswrite($out, $ln.$/); }); close $out; } write_file('./out.cvs', sub { my $writer = shift; # sub { my $ln = shift; syswrite() } reader('./in.csv', sub { my $ln = shift; my @fields = split /;/, $ln; return unless substr($fields[1], 0, 1) == 6; @fields = @fields[0,1,2]; $writer->(join(';', @fields)); #        }); }); 


C ++



Reader_base.hpp
 #pragma once #include <iostream> #include <string> #include <fstream> //   - using std::ifstream; using std::getline; using std::cout; using std::runtime_error; using std::endl; using std::cerr; using std::string; class Reader_base { public: Reader_base(const string &fn_in) : file_name(fn_in) { open(file_name); } virtual ~Reader_base() { infile.close(); } virtual void open(const string &fn_in) { infile.open(fn_in); //  ,       if (! infile.is_open()) throw runtime_error("can't open input file \"" + file_name + "\""); } virtual void main_loop() { try { while(getline(infile, line)) { rcallback(line); } } catch(const runtime_error &e) { cerr << e.what() << " Try again." << endl; } } protected: virtual void rcallback(const string &ln) { throw runtime_error("Method 'callback' must me overloaded!"); }; private: ifstream infile; string line; string file_name; }; 


Writer_base.hpp
 #pragma once #include <iostream> #include <string> #include <fstream> //   - using std::string; using std::ofstream; using std::cout; using std::runtime_error; using std::endl; using std::cerr; class Writer_base { public: Writer_base(const string &fn_out) : file_name(fn_out) { open(file_name); } virtual ~Writer_base() { outfile.close(); } virtual void open(const string &fn_out) { outfile.open(file_name); if (! outfile.is_open()) throw runtime_error("can't open output file \"" + file_name + "\""); } virtual void write(const string &ln) { outfile << ln << endl; } private: string file_name; ofstream outfile; }; 


ReaderWriter.hpp
 #pragma once #include "Reader.hpp" #include "Writer.hpp" class ReaderWriter : public Reader_base, public Writer_base { public: ReaderWriter(const string &fn_in, const string &fn_out) : Reader_base(fn_in), Writer_base(fn_out) {} virtual ~ReaderWriter() {} protected: virtual void rcallback(const string &ln) { write(ln); } }; 


main.cpp
 #include "ReaderWriter.hpp" int main() { ReaderWriter rw("charset.out", "writer.out"); rw.main_loop(); return 0; } 


Compile as follows:



 $ g++ -std=c++11 -o main main.cpp 


Let's analyze the code:





Next, we consider the complex practical task of finding broken links using AnyEvent :: HTTP, which will use the topics described above — anonymous subroutines, closures, and callback functions.



The task of finding broken links



In order to solve the problem of finding broken links (links with response codes 4xx and 5xx), you need to understand how to implement a crawl site. In essence, the site is a link graph, i.e. URLs can refer to both external pages and internal pages. To crawl the site, we will use the following algorithm:



 process_page(current_page): for each link on the current_page: if target_page is not already in your graph: create a Page object to represent target_page add it to to_be_scanned set add a link from current_page to target_page scan_website(start_page) create Page object for start_page to_be_scanned = set(start_page) while to_be_scanned is not empty: current_page = to_be_scanned.pop() process_page(current_page) 


The implementation of this task lies in the Broken link checker repository. Consider the checker_with_graph.pl script. First, we initialize the variables $start_page_url (url of the start page), $cnt (the number of URLs to download), create a hash $to_be_scanned and graph $g .



Then we create the scan_website, function scan_website, into which we pass the limit on the maximum number of URLs for downloading and callbacks.



 sub scan_website { my ($count_url_limit, $cb) = @_; 


First, we initialize the $to_be_scanned hash $to_be_scanned start page.



 # to_be_scanned = set(start_page) $to_be_scanned->{$start_page_url}{internal_urls} = [$start_page_url]; 


A complete analysis of the $to_be_scanned structure will be further, and now it’s worth paying attention to the fact that the link is internal (internal_urls).



Next, create an anonymous function and execute it. Record view



 my $do; $do = sub { ... }; $do->(); 


is a standard idiom and allows you to access the $do variable from a closure, for example, to create a recursion:



 my $do; $do = sub { ...; $do->(); ... }; $do->(); 


or delete circular reference:



 my $do; $do = sub { ...; undef $do; ... }; $do->(); 


In the closure of $do we create a %urls hash into which we add the urls from the $to_be_scanned hash.



 my %urls; for my $parent_url (keys %$to_be_scanned) { my $type_urls = $to_be_scanned->{$parent_url}; # $type_urls - internal_urls|external_urls push @{$urls{$parent_url}}, splice(@{$type_urls->{internal_urls}}, 0, $max_connects); while (my ($root_domain, $external_urls) = each %{$type_urls->{external_urls}}) { push @{$urls{$parent_url}}, splice(@$external_urls, 0, 1); } } 


The %urls hash structure is %urls follows:



 {parent_url1 => [target_url1, target_url2, target_url3], parent_url2 => [...]} 


Then we execute the function process_page , passing it a link to the %urls and hash.



 process_page(\%urls, sub { ... }); 


In the process_page function, we save the resulting hash and callback.



 sub process_page { my ($current_page_urls, $cb) = @_; 


Then we cycle through the url hash, getting a pair (parent_url => current_urls) and then go through the list of current urls (current_urls)



 while (my ($parent_url, $current_urls) = each %$current_page_urls) { for my $current_url (@$current_urls) { 


Before proceeding to the consideration of receiving data from the pages, we will make a small digression. The basic algorithm for parsing a page and getting URLs from it assumes one HTTP GET method, regardless of whether this is internal or external. In this implementation, two HEAD and GET calls were used to reduce the server load as follows:





So, first we perform the http_head function of the AnyEvent :: HTTP module, passing it the current URL, request parameters and callback.



 $cv->begin; http_head $current_url, %params, sub { 


In the callback, we get the headers (HTTP headers)



 my $headers = $_[1]; 


from which we get the real url (url after redirects)



 my $real_current_url = $headers->{URL}; 


Then we save the %urls_with_redirects pair (current_url => real_current_url) to the hash.



 $urls_with_redirects{$current_url} = $real_current_url if $current_url ne $real_current_url; 


Further, if an error occurs (status codes 4xx and 5xx), then we output the error to the log and save the header to the hash for further use



 if ( $headers->{Status} =~ /^[45]/ && !($headers->{Status} == 405 && $headers->{allow} =~ /\bget\b/i) ) { $warn_log->("$headers->{Status} | $parent_url -> $real_current_url") if $warn; $note_log->(sub { p($headers) }) if $note; $urls_with_errors{$current_url} = $headers; #      } 


Otherwise, if the site is internal and this is a web page,



  elsif ( #   ($start_page_url_root eq $url_normalization->root_domain($real_current_url)) #   - && ($headers->{'content-type'} =~ m{^text/html}) ) { 


then we execute the http_get function, with which we transfer the real current URL obtained above, the request parameters and the callback.



 $cv->begin; http_get $real_current_url, %params, sub { 


In the http_get function http_get we get the headers and body of the page, we decode the page.



 my ($content, $headers) = @_; $content = content_decode($content, $headers->{'content-type'}); 


Using the Web :: Query module, we perform the page parsing and getting URLs.



 wq($content)->find('a') ->filter(sub { my $href = $_[1]->attr('href'); #           ,   $href !~ /^#/ && $href ne '/' && $href !~ m{^mailto:(?://)?[A-Z0-9+_.-]+@[A-Z0-9.-]+}i && ++$hrefs{$href} == 1 #      if $href }) ->each(sub { # for each link on the current page 


At each iteration of the each method, we get a link in the callback.



 my $href = $_->attr('href'); 


and transform it



 $href = $url_normalization->canonical($href); #     '/', '/contact'    (//dev.twitter.com/etc) if ($href =~ m{^/[^/].*}) { $href = $url_normalization->path($real_current_url, $href) ; } $href = $url_normalization->without_fragment($href); 


Next, we check if there is no such link in the column.



 unless($g->has_vertex($href)) { # if tarteg_page is not already in your graph 


then we get the root domain of the link (or put it in 'fails')



 my $root_domain = $url_normalization->root_domain($href) || 'fails'; 


Then we fill in the $new_urls structure, which is similar to the $to_be_scanned structure and has the following form:



 $new_urls = $to_be_scanned = { parent_url => { external_urls => { root_domain1 => [qw/url1 url2 url3/], root_domain2 => [qw/url1 url2 url3/], }, internal_urls => [qw/url url url/], }, }; 


In the $new_urls structure, we create a pair (parent_url => target_url) , while target_url divide target_url into several parts, namely, we divide into internal URLs, which we save into an array, and external ones, which we also divide by domains and also save into an array. This structure reduces the load on sites as follows - we select $max_connects ( ) internal URLs and one external URL for each domain at a time, as shown in the closure $do higher when constructing %urls hash. Accordingly, at the beginning of the scan_website function scan_website we saved the start page as follows:



 $to_be_scanned = { $start_page_url => { internal_urls => [$start_page_url], }, }; 


those. in this case, both the parent and the current page were the start page (in other cases, these pages are different).



The construction of this structure is as follows - if the site is internal, then we create the structure



 $new_urls->{$real_current_url}{internal_urls} //= [] 


otherwise, if the site is internal, the structure



 $new_urls->{$real_current_url}{external_urls}{$root_domain} //= [] 


and save one of these structures to the $urls variable, which we then use to write to the $new_urls structure.



 push @$urls, $href; # add it to to_be_scanned set 


In this case, we use links to create and work with complex data structures. The $urls refers to the $new_urls structure, and accordingly, when the $urls variable changes, the $urls structure $new_urls . More information about data structures and algorithms in Perl can be found in the book "Jon Orwant - Mastering Algorithms with Perl".

Then we add a pair to the graph (real_current_url (parent) => href (current)) .



 $g->add_edge($real_current_url, $href); 


After that, we check the structure of $new_urls - if the arrays of internal_urls or external_urls not empty, then output the data to the log and execute the callback, passing it the structure of $new_urls



 if (is_to_be_scanned($new_urls)) { $debug_log->(($parent_url // '')." -> $real_current_url ".p($new_urls)) if $debug; $cb->($new_urls); } 


If we did not hit any of the options (error or parsing the internal page), i.e. the site is external and without errors, then we execute Kolbek



  else { $cb->(); } 


This cobble call is needed when all external sites are on the current $current_urls list, but $to_be_scanned still $to_be_scanned in $to_be_scanned . Without this call, we will go through the $current_urls list, http_head , and http_head .



In the process_page function callback, we save the resulting $new_urls structure,



 process_page(\%urls, sub { my $new_urls = shift; 


combine it with the $to_be_scanned variable.



 $to_be_scanned = merge($to_be_scanned, $new_urls) if $new_urls; 


Next, we check if the number of elements in the graph is greater than or equal to the limit on the number of URLs, then we exit by removing the link to the anonymous subroutine and executing $cv->send() .



 if (scalar($g->vertices) >= $count_url_limit) { undef $do; $cb->(); $cv->send; } 


Otherwise, if there are URLs to check,



  elsif (is_to_be_scanned($to_be_scanned)) { 


then recursively call an anonymous subroutine



 $do->(); 


which call was considered above. $to_be_scanned process_page ( ).



, GraphViz — svg, png .. :



 $ perl bin/checker_with_graph.pl -u planetperl.ru -m 500 -c 5 \ -g -f svg -o etc/panetperl_ru.svg -l "broken link check" -r "http_//planetperl.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 500 -c 5 \ -g -f svg -o etc/habr_ru.svg -l "broken link check" -r "https_//habrahabr.ru/" $ perl bin/checker_with_graph.pl -u habrahabr.ru -m 100 -c 5 \ -g -f png -o etc/habr_ru.png -l "broken link check" -r "https_//habrahabr.ru/" 


Where



 --url | -u   --max_urls | -m      --max_connects | -c     --graphviz | -g    --graphviz_log_level | -e       , . perldoc Log::Handler --format | -f    - png, svg, etc --output_file | -o     --label | -l   --root | -r     - ..   twopi      


PERL_ANYEVENT_VERBOSE,



 $ export PERL_ANYEVENT_VERBOSE=n 


n:





Conclusion



Perl, , — , . Perl C++, (callbacks) Perl - C++. AnyEvent::HTTP, .



')

Source: https://habr.com/ru/post/326028/



All Articles