
Parsing Perl 5 pod with Perl 6

I’ve just finished developing a parser for Perl 5 pod (plain old documentation), written in Perl 6. Writing the grammar was surprisingly easy, thanks to Perl 6, because I am no programming genius myself. With help from the folks on #perl6 I learned a lot of interesting things along the way, and I want to share them here. The code is included as well.

By the way, I recommend reading my introduction to grammars in Perl 6 first; much of this article will be clearer afterwards.

Grammar Development


In Perl 6, a grammar is a special kind of class for parsing text. The idea is to declare a series of regexes as named tokens, which can then be used to parse the input. For Pod::Perl5::Grammar, I worked through the perlpod specification in detail, adding the necessary tokens as I went.
Of course, there were a few problems. First, how do you define a regex for lists? In pod, lists can contain lists, so can a definition include itself? It turns out that recursive definitions are fine, as long as they cannot match a zero-length string, which would lead to an infinite loop. Here is the definition:

    token over_back
    {
        <over>
        [
            <_item> | <paragraph> | <verbatim_paragraph> | <blank_line> |
            <_for>  | <begin_end> | <pod> | <encoding> | <over_back>
        ]*
        <back>
    }

    token over  { ^^\=over [\h+ <[0..9]>+ ]? \n }
    token _item { ^^\=item \h+ <name>
                  [
                      [ \h+ <paragraph> ]
                    | [ \h* \n <blank_line> <paragraph>? ]
                  ]
                }
    token back  { ^^\=back \h* \n }


The over_back token describes an entire list from start to finish. Simply put, it says that a list must start with =over and end with =back, and that many things can appear in between, including another over_back!
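To see a recursive token in isolation, here is a minimal sketch using a toy grammar. NestedList and its token names are hypothetical and not part of Pod::Perl5::Grammar. The point is only that list refers to itself, and that every alternative inside the brackets consumes at least one character, so the repetition cannot loop on an empty match.

```raku
# Hypothetical toy grammar (not from Pod::Perl5::Grammar):
# a list is "(" ... ")" and may contain words, whitespace or further lists.
grammar NestedList {
    token TOP  { ^ <list> $ }
    token list { '(' [ <word> | <list> | \s+ ]* ')' }   # refers to itself
    token word { \w+ }
}

my $m = NestedList.parse('(a (b c) d)');
say $m.defined;               # True
say $m<list><list>[0].Str;    # (b c)
```

Because the nested list sits inside a quantified group, it is captured as an array, which is why the inner list is reached with [0].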

For simplicity, I tried to name tokens the way they are written in pod, although sometimes that was not possible because of namespace clashes.

I particularly like the following pattern, and I used it often:

 [ <pod_section> | <?!before <pod_section> > .]* 


This is useful when you want to match a pattern and ignore everything else. In our case, pod_section is a token that defines a section of pod, but pod is often written inline with Perl code, and everything that is not pod should be ignored. So the second half of the alternation uses the negative lookahead ?!before to check that the upcoming text is not a pod_section, and a dot to match “everything else”, including newlines. The two alternatives are grouped in square brackets with a star outside, so the text is checked character by character.
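The same pattern can be tried on a small scale. The sketch below assumes a toy grammar (Headings, heading and title are made-up names) that collects =head1 lines and skips the surrounding code. It spells the negative lookahead as <!before ...>, the documented form, so treat it as an illustration rather than the module's actual code.

```raku
# Hypothetical toy grammar: keep "=head1 ..." lines, skip everything else.
grammar Headings {
    token TOP     { [ <heading> | <!before <heading> > . ]* }
    token heading { ^^ '=head1' \h+ <title> \n }
    token title   { \N+ }
}

# Perl code with two pod headings mixed in.
my $source = ('my $x = 1;', '=head1 First', 'print $x;', '=head1 Second').join("\n") ~ "\n";

my $m = Headings.parse($source);
say $m<heading>.map({ .<title>.Str }).join(', ');   # First, Second
```

Only the heading matches are captured; the characters consumed by the dot leave no entry in the match object.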

The grammar can parse pod whether it is in a standalone file or embedded in code. It extracts all the pod sections into a match object, which you can then work with. It is easy to use:

    use Pod::Perl5::Grammar;

    my $match = Pod::Perl5::Grammar.parse($pod);

    # or
    my $match = Pod::Perl5::Grammar.parsefile("/path/to/some.pod");


Action classes


Action classes are ordinary Perl 6 classes that can be passed to a grammar when parsing. They let you attach behavior (actions) to tokens, which runs whenever the token matches. You just name each method in the class after the token it should run for. I wrote an action class that converts pod to HTML. Here is the method that converts =head1 to HTML:

    method head1 ($/)
    {
        self.add_to_html('body', "<h1>{$/<singleline_text>.Str}</h1>\n");
    }


Every time the grammar matches the head1 token, this method runs. It is passed the $/ variable containing the head1 match, from which the heading text is extracted.

To convert to HTML, each action method simply extracts the text from its token, reformats it and adds it to the output. Everything worked fine until I ran into nested tokens, such as formatting codes inside a paragraph of text. For example, this pod:

 There are different ways to emphasize text, I<this is in italics> and B<this is in bold> 


It came out as:

    <i>this is in italics</i>
    <b>this is in bold</b>
    <p>There are different ways to emphasize text, I<this is in italics> and B<this is in bold></p>


This happens because the italic and bold tokens match first, before the paragraph that contains them. I had to use a buffer to store the HTML produced by the nested tokens. When the paragraph token matches, its action substitutes the buffered HTML for the original formatting codes in the text. The methods look like this:

    method paragraph ($/ is copy)
    {
        my $original_text = $/<text>.Str.chomp;
        my $para_text     = $/<text>.Str.chomp;

        # walk the buffer in reverse order, swapping each formatting code for its HTML
        for self.get_buffer('paragraph').reverse -> $pair
        {
            $para_text = $para_text.subst($pair.key, {$pair.value});
        }
        self.add_to_html('body', "<p>{$para_text}</p>\n");
        self.clear_buffer('paragraph');
    }

    method italic ($/)
    {
        self.add_to_buffer('paragraph', $/.Str => "<i>{$/<multiline_text>.Str}</i>");
    }

    method bold ($/)
    {
        self.add_to_buffer('paragraph', $/.Str => "<b>{$/<multiline_text>.Str}</b>");
    }


Take particular care when running regexes inside action methods. Every example action method takes $/ as its parameter. That is a trap; guess what happens here:

    method head1 ($/)
    {
        if $/.Str ~~ m/foobar/ # run a regex inside the action
        {
            self.add_to_html('body', "<h1>{$/<singleline_text>.Str}</h1>\n");
        }
    }

    Cannot assign to a readonly variable or a value

Boom. When $/ is passed to head1 it is read-only, and running any regex in the same lexical scope tries to overwrite $/. I tried a few things, and with help from the #perl6 channel I settled on this:

    method head1 ($/ is copy)
    {
        my $match = $/;

        if $match.Str ~~ m/foobar/
        {
            self.add_to_html('body', "<h1>{$match<singleline_text>.Str}</h1>\n");
        }
    }


Adding is copy to the parameter gives the method its own copy of the value instead of a read-only binding. I then save the match in $match, so the next regex can safely overwrite $/. I think this would be better:

    method head1 ($match)
    {
        if $match.Str ~~ m/foobar/
        {
            self.add_to_html('body', "<h1>{$match<singleline_text>.Str}</h1>\n");
        }
    }


Just don't name the parameter $/, and everything should work. I have not tested this yet, though.
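A quick way to check this is a small self-contained sketch. The grammar and action class below are hypothetical stand-ins (Heading, HeadingActions and the title token are not from Pod::Perl5). The action method takes its argument as $match and then runs a regex, which sets the method's own $/ rather than a read-only parameter.

```raku
# Hypothetical toy grammar: one or more "=head1 ..." lines.
grammar Heading {
    token TOP   { <head1>+ }
    token head1 { ^^ '=head1' \h+ <title> \n }
    token title { \N+ }
}

class HeadingActions {
    has @.html;

    # The parameter is not called $/, so the regex below is free to
    # assign to the $/ that belongs to this method.
    method head1 ($match) {
        if $match<title>.Str ~~ m/foobar/ {
            @!html.push: "<h1>{ $match<title>.Str }</h1>";
        }
    }
}

my $actions = HeadingActions.new;
Heading.parse("=head1 foobar rules\n=head1 other\n", :$actions);
say $actions.html.join("\n");   # <h1>foobar rules</h1>
```

Only the first heading contains "foobar", so only one line of HTML is collected, and no read-only error is raised.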

To use the action class, we simply pass it to the grammar:

    use Pod::Perl5::Grammar;
    use Pod::Perl5::ToHTML;

    my $actions = Pod::Perl5::ToHTML.new;
    my $match   = Pod::Perl5::Grammar.parse($pod, :$actions);

    # or
    my $match = Pod::Perl5::Grammar.parse($pod, :actions($actions));


The first example uses the named-argument shorthand :$actions, which only works if the variable is called $actions. In the second example the argument is written out as :actions($actions), and in that case the variable holding the action class object can be called anything.
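The difference between the two spellings can be seen without a grammar at all. In this sketch parse-with is a hypothetical sub that stands in for .parse and simply reports the named argument it received.

```raku
# Hypothetical stand-in for .parse: it accepts one named argument, actions.
sub parse-with (:$actions) { "actions = $actions" }

my $actions    = 'ToHTML';
my $my-actions = 'ToText';

say parse-with(:$actions);               # actions = ToHTML  (shorthand for :actions($actions))
say parse-with(:actions($my-actions));   # actions = ToText  (variable name is free)
```

Writing :$my-actions instead would pass an argument named my-actions, which is not the name the callee expects.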

Extending pod


PerlTricks.com articles are written in HTML, with custom class names and span tags. That makes them hard to write and hard to edit. I would like to use pod instead, which would be easier for both the writers and the editor. So I want to extend pod with features that are useful for blogging. For example, formatting is done with codes such as B<...>. Why not add @<...> for Twitter links, or M<...> for MetaCPAN links?

Since grammars in Perl 6 are classes, they can be inherited and redefined. So I can add my own codes like this:

    grammar Pod::Perl5::Grammar::PerlTricks is Pod::Perl5::Grammar
    {
        token twitter  { \@\< <name> \> }
        token metacpan { M\< <name> \> }
    }


You also need to override the format_codes token to include new ones:

    token format_codes
    {
        [
            <italic>|<bold>|<code>|<link>
            |<escape>|<filename>|<singleline>
            |<index>|<zeroeffect>|<twitter>|<metacpan>
        ]
    }


It's that simple. The new grammar can parse pod and handles my new formatting codes as well. Of course, the Pod::Perl5::ToHTML action class can be extended and overridden in the same way, giving something like:

    class Pod::Perl5::ToHTML::PerlTricks is Pod::Perl5::ToHTML
    {
        method twitter ($match)
        {
            self.add_to_buffer('paragraph',
                $match.Str => "<a href=\"http://twitter.com/{$match<name>.Str}\">{$match<name>.Str}</a>");
        }

        method metacpan ($match)
        {
            self.add_to_buffer('paragraph',
                $match.Str => "<a href=\"https://metacpan.org/pod//{$match<name>.Str}\">{$match<name>.Str}</a>");
        }
    }


That's not all


There is a cleaner way to manage groups of tokens: multi-dispatch. Instead of defining format_codes as a list of alternative tokens, we declare a proto token and then declare each formatting code as a multi candidate of that proto.

    proto token format_codes {*}
    multi token format_codes:italic { I\< <multiline_text> \> }
    multi token format_codes:bold   { B\< <multiline_text> \> }
    multi token format_codes:code   { C\< <multiline_text> \> }
    ...


When inheriting from the grammar, there is no need to override format_codes. New codes can simply be added as further multi candidates:

    grammar Pod::Perl5::Grammar::PerlTricks is Pod::Perl5::Grammar
    {
        multi token format_codes:twitter  { \@\< <name> \> }
        multi token format_codes:metacpan { M\< <name> \> }
    }


This approach also simplifies pulling data out of the match object. For example, the following code extracts the section of a link from the third paragraph of the first pod block:

    # without multi-dispatch
    is $match<pod_section>[0]<paragraph>[2]<text><format_codes>[0]<link><section>.Str

    # with multi-dispatch
    is $match<pod_section>[0]<paragraph>[2]<text><format_codes>[0]<section>.Str


In the first example, the path has to include the name of the formatting token (link). With multi-dispatch that step disappears, as the second example shows.
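The multi-dispatch approach can be sketched end to end with two toy grammars. Codes and MoreCodes are hypothetical names, and the sketch names its candidates with :sym<...>, the documented spelling, where the article uses the shorter :italic form. It shows both points made above: the child grammar adds a candidate without redefining format_codes, and the captured name is reached without an intermediate token name in the path.

```raku
# Hypothetical toy grammars; only the proto/candidate mechanism matters here.
grammar Codes {
    token TOP  { [ <format_codes> \s* ]+ }
    token name { \w+ }

    proto token format_codes {*}
    token format_codes:sym<italic> { 'I<' <name> '>' }
    token format_codes:sym<bold>   { 'B<' <name> '>' }
}

# The child grammar adds a candidate and leaves format_codes itself alone.
grammar MoreCodes is Codes {
    token format_codes:sym<twitter> { '@<' <name> '>' }
}

my $text = 'I<a> @<bob>';

say Codes.parse($text).defined;       # False: the parent has no '@<...>' candidate

my $m = MoreCodes.parse($text);
say $m<format_codes>[1]<name>.Str;    # bob
```

Note that the path is format_codes, then name, with no twitter key in between.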

Conclusion


Overall, writing a Perl 5 pod parser in Perl 6 was a fairly straightforward exercise. If you have questions while programming in Perl 6, I highly recommend the #perl6 IRC channel on freenode; the people there are friendly and responsive.

Source: https://habr.com/ru/post/264225/

