📜 ⬆️ ⬇️

Metaregular expressions on D

I ran through the hubs and did not find anything written at the same time in the "D" and "Abnormal Programming" hubs. There can be a completely false idea that only normal people write to D, or even worse, that knowledge of D automatically makes a normal person out of any programmer. I hasten to refute.


Although I myself, strictly speaking, am not a D programmer, I don’t have a single industrial project, but occasionally I’m happy to dig into someone else’s code for picking up delicious zest. I also write small utilities for myself, most often for processing text data, what is usually done in scripting languages, the benefit of D offers a very weak set of tools for working with strings.
Well, where there are word processors, there are regular expressions, as without them. And here D is again at its best, its ease of use and ease of use for its regular expression library approaches Perl. But in Perl, regulars are part of the syntax, it can be said that the language itself is built largely around them, and in D it is a completely independent module - std.regex from the standard library written by Dmitry Olshansky. Another great thing is that the expression parser can be built at compile time (of course, if the expression itself is specified by a literal), and of course I could not help but see how it works inside.
And here that, understanding in details at me hat flew off there was a thought, but is it possible to call one regular expression from within another? Do not insert a literal (as trivially you can do in Perl for example), but directly call the compiled code of one expression from the inside of another. Enough in my opinion a stupid idea that it was worth playing with it.
So what do we want? Something like this (for now this is pseudocode):


INT=regexp("\d+"); LIST=regexp("INT(,INT)*"); 

Since the post is designed for a wide audience of abnormal programmers, not only those who are familiar with D, I probably will first describe those language constructs that we will use.


spoiler

I myself prefer to have a cheat sheet before my eyes, just like a first grader who has learned a quatrain to a lesson.


In general, D offers the broadest choice of basic programming and metaprogramming techniques, but we will try to limit ourselves to the minimum necessary set, but we will try to deal with each in detail. For those who are interested, I highly recommend reading the wonderful article Templates Revisited written by the author of the language, it is dedicated to templates as a whole, but as an example just shows the compile-time regular expression parser which I took as a basis. Yes, of course I wrote my little parser for experiments, far from complete and not entirely correct, but ideally suited as a guinea pig.
So:


compile time calculations.


With this, everything is simple, roughly speaking any expression that can be calculated at compile time will be calculated. The second fundamental construct is static if (...) , is checked at compile time (or is not compiled at all), and incredibly simplifies metaprogramming.


templates


In D, the templates of both functions and classes are declared using an additional set of parameters:


 int f(T)(T x int y); struct(T,U)  { T x; U y; } 

and called as


  f!(int)(1, 2); A!(int,string) a; 

A template parameter can also be roughly speaking anything, including floating and string literals and any characters . However, the template keyword is also saved and creates parameterized namespace's which we will use extensively:


 template X(T) { const T x=0; } 

mixin


This construct inserts a literal directly into the code: mixin("static int x=0;"); also useful.


string operations


There are a million of them, but we really only need slicing : "ABCDEFGH"[2..4] == "CDE"; Note that the above example is also computed at compile time.


alias


This means just a synonym, but with the widest range of applications, especially in the parameters of the templates (the very any symbol mentioned)


famous eponymous trick


I will deliberately not use it, because at first the newcomers slightly blow the roof from it. In the end, it is nothing more than syntactic sugar, and you can always do without it.


That's actually all of our tools, go to the parser.


View from above:


 1. template regex(string re) 2. { 3. static if(re.length) { 4. // parse escaped sequences 5. static if(re[0] == '\\') { 6. alias atom=compile_escape!re; 7. 8. // parse character class 9. } else static if(re[0] == '[') { 10. alias atom=compile_class!re; 11. 12. // parse atom 13. } else static if(re[0] == '(') { 14. alias atom=compile_atom!re; 15. 16. // parse dot 17. } else static if(re[0] == '.') { 18. alias atom=compile_anychar!re; 19. 20. // parse regular literals 21 } else { 22. alias atom=compile_char!re; 23. } 24. 25. // parse predicate (if any) *+? 26. alias pred=compile_pred!(atom, re[atom.skip..$]); 27. alias result=both_of!(pred, regex!(re[pred.skip..$])); 28. 29. // public result 30. static const size_t skip=result.skip; 31. alias match=result.match; 32. 33. } else { 34. static const size_t skip=0; 35. // end of regular expression 36. alias match=test_empty!(re); 37. } 38. } 

First, what is it? Not a function, not a class, a template in its pure form, just a piece of code parameterized by a literal. However, the alias on him is regularly taken.
The second thing to notice is that from all this abundance of code only one small branch is compiled (pay attention to the static if cascade)
Return Values? There are two of them (pp. 30-31 and 34-36):
static const size_t skip; - shows how many characters from the regular expression we used.
alias match; Is something ( alias ) that can be called as a function with the string parameter.
Well, it is proposed to use it like this:


 alias myRe=regex!"A?[az]+\d*"; auto result=myRe.match(some_string); 

Looking ahead, I will say that the return value is a structure of two elements:


 struct Match { bool _success; ulong length; bool opCast(T : bool)() const { return _success; } } 

the first shows whether the string satisfies the regular expression, and the second shows the length of the passed substring. Although now, at the time of compilation, we are deeply on the drum.


Let's walk now along the branches, it is more convenient to start from the bottom:


If the regular expression is empty (re.length == 0) (p. 33) , we return the reference to the function that returns always true , that is, Match (true, 0) . This is the successful end of the parsing.
If the regular expression is not empty (page 3), we first choose a subparser from the compile XXX set depending on the first character: service \. [(_ Or any other character. The subparser has the same interface as the top level parser - defines constants inside itself) skip and match and in the last (simplest, p. 22) branch returns skip = 1 and a link to the function that checks the first characters of two lines for equality, like this:


 template compile_char(string re) { static const size_t skip=1; alias match=test_char!re; } // match single character Match test_char(string re)(string s) { static if(re.length) { if(s.length && s[0] == re[0]) return Match(1,1); } return Match(0,0); } 

The second construction here is a regular template function that will be called during execution.
The rest of the subparsers are not much more complicated and similarly arranged, but what happens after the parser is selected? Then a predicate can go, one of ? * + , Or not, so the selected subparser is passed to another parser: compile_pred (page 26) :


 // modify previous (term) regex with one of *+? predicates, or none template compile_pred(alias term, string re) { static if(re.length) { static if(re[0] == '*') { static const size_t skip=term.skip+1; alias match=zero_or_more!term; } else static if(re[0] == '+') { static const size_t skip=term.skip+1; alias match=one_or_more!term; } else static if(re[0] == '?') { static const size_t skip=term.skip+1; alias match=zero_or_one!term; } else { // no predicate, return term static const size_t skip=term.skip; alias match=term.match; } } else { // no predicate, end of regex, also return term static const size_t skip=term.skip; alias match=term.match; } } // conditionally match the expression, always match but matched length may be different Match zero_or_one(alias term)(string s) { Match r=term.match(s); return r? r : Match(1,0); } 

Here, according to the already run-in scenario, depending on the predicate, skip is incremented by one and the match function is set to one of the _zero_or_more (*), zero_or_one (?) Or one_or more (+) template functions, I gave one of them for example. Or, if the predicate is absent, both of these values ​​are taken unchanged from the atomic parser.


As always with patterns, there must be a recursion at the end.


Returning to the main parser, we got the complete expression for the subparser with the predicate and now finally pass it to the last subparser, which calls both parsers in turn and calculates the total length (page 27):


  alias result=both_of!(pred, regex!(re[pred.skip..$])); static const size_t skip=result.skip; alias match=result.match; ... // match both expressions template both_of(alias re1, alias re2) { static const size_t skip=re1.skip+re2.skip; alias match=test_join!(re1,re2); } // match both of expressions, return matched length Match test_join(alias re1, alias re2)(string s) { auto m1=re1.match(s); if(m1) { auto m2=re2.match(s[m1.length..$]); return Match(m1 && m2, m1.length+m2.length); } return m1; } 

Note that the second parameter for the sub_parser both_of the main parser passes itself, but with the trimmed regular expression regex! (re [pred.skip .. $]) , the first characters already used by the subparser are cut off. Thus, the regular expression will be recursively passed to the end, a chain of sample functions such as test_char, test_space, test_range, one_or_more , etc. will be built, and we will return to what we started with - an empty expression for which the test_empty function is always returned that returns (true, 0) .
Parsing is over.


Before going to the essence of the post, I want to give another auxiliary template (it has already been used in parsing the expressions "[]" and "()"). As it is easy to understand, it recursively calculates the length of a string to a given character. Soon it will be useful to us:


 // misc., return length of literal up the to terminator template extract_until(string s, char terminator) { static assert(s.length, "missing terminator"); static if(s[0] == terminator) { static const length=0; } else { static const length=1+extract_until!(s[1..$], terminator).length; } } 

Finally we got to the bottom


So, we built a model parser that understands the \ s \ d \ w \ x constructions . [az] (AB) * +? it is now quite easy to add one more branch to it. If anyone noticed, along with other missing expressions, I was too lazy to build processing the predicate {n, m}. Since it’s a sin not to use such beautiful brackets, I usurp them to express {SUBEXPR} . See, first add a branch to the main parser:


  // parse meta expression } else static if(re[0] == '{') { alias atom=compile_meta!re; 

Now we add a subparser for it, surprisingly simple:


 template compile_meta(string re) { static const skip=2+extract_until!(re[1..$], '}').length; mixin("alias match="~re[1..skip-1]~".match;"); } 

the first line calculates the length of the subexpression up to the closing bracket, and the second uses mixin to insert the following link into the code: alias match=SUBEXPR.match; . And it's all! It starts to work.
Now we can finally write:


 alias NUM=regex!"\\d+"; alias INT=regex!"[+-]?{NUM}"; alias LIST=regex!"{INT}(,{INT})*"; assert(NUM.match("1234") == Match(true, "1234".length)); assert(INT.match("-12") == Match(true, "-12".length)); assert(LIST.match("1,+2,-3,4") == Match(true, "1,+2,-3,4".length)); 

Just some buffalo turned out.


And then I thought ...


If I were a character in a Hollywood film, a sort of crazy genius, I would, according to the laws of the genre, must have missed something, some trifle that would have destroyed my evil plans at the end.
And just thought how here it is - the cyclic dependence of the modules . The fact is that this all works fine inside a single file, and if I tried to isolate this code into a separate library or module, nothing would have happened. In the example above, the NUM and INT characters defined in user-code should be exported down to the metare module level, which is obviously impossible. But nothing, someday, in any language, someone will suggest a way around and then I will finally conquer this world .
Thank you all, I hope it was fun.
As usual, if anyone needs it - a link to the code


')

Source: https://habr.com/ru/post/323186/


All Articles