
Dissecting a function from the D standard library

Hi Habr, I'd like to invite everyone on a short tour of the D language. Why? Well, why do people go on excursions at all: to have fun, to see something new, and simply because it is interesting. D can hardly be called new, or even young, but over the last couple of years it has developed rapidly. Andrei Alexandrescu joined the community and quickly became one of its leading developers; with his knack for anticipating trends, he has contributed a great deal to the concepts of the language itself and especially to its standard library.

Since its inception, D has been positioned as an improved C++ (at least in my reading of the story). The freedom to discard some outdated constructs and replace them with something new that could not be implemented in classic C++, while carefully preserving low-level features such as the built-in assembler, pointers and the use of C libraries, makes D a unique contender for the title of "next in the series C, C++, ...". At least, that is how it looks from my point of view. I myself am (it would probably be polite to add "unfortunately") absolutely monolingual: I have been writing C++ for many years, and every attempt to get acquainted with another language has inevitably ended in sound, healthy sleep. However, I have heard from representatives of other faiths that D is interesting to them as a language too, so everyone is invited on the excursion.

What will I show? Several very good books about D have already been written, so I decided simply to take the getopt() function from the standard library and look at its code, an invaluable exercise that brings to life what you have read in the books. Why this particular function? Well, it is familiar to everyone and platform-independent; I personally use it 3-4 times a week and can picture in detail how it could be written in three different languages. Besides, the author of the code is Alexandrescu: I have seen teaching examples of his code in books many times but had never seen code he wrote for production, so I was curious. In the end, of course, I could not resist and wrote my own bicycle (improved, naturally); in this case it is entirely appropriate and no less useful than reading someone else's code.
We will see far from everything worth seeing, and I am far from being an expert myself, so anyone interested should read further using the links at the end.

External examination

Code like the following illustrates how the function is used:
    import std.getopt;

    void main(string[] args)
    {
        // placeholders
        string file;
        bool quiet;
        enum Count { zero, one, two, three }
        Count count;
        int selector;
        int[] list;
        string[string] dict;

        std.getopt.arraySep = ",";
        auto help = getopt(args,
            std.getopt.config.bundling,
            "q|quiet", "opposite of verbose", &quiet,
            "v|verbose", delegate { quiet = false; },
            "o|output", &file,
            "on", delegate { selector = 1; },
            "off", delegate { selector = -1; },
            std.getopt.config.required,
            "c|count", "counter", &count,
            "list", &list,
            "map", &dict);
        if (help.helpWanted)
            defaultGetoptPrinter("Options:", help.options);
    }

The first thing we see is "almost C"; then we notice dynamic arrays (string[] and int[]) and associative arrays (string[string]). Then comes a rather suspicious assignment, std.getopt.arraySep = ",". Is that really a global variable!? Have we wandered into a cabinet of curiosities? Quite so: dynamic and associative arrays are part of the language and form one of its foundations (I personally am immediately reminded of Perl, in the good sense of the word). And std.getopt.arraySep really is a global variable belonging to a module, and assigning to it is probably horrifying from a purist's point of view, even for such a specific function as getopt(). However, it is not that simple: arraySep could just as well be defined as a pair of functions:

    @property string arraySep() { return ... }
    @property void arraySep(string separator) { .... }

and still look like a variable while meeting the strictest standards of data encapsulation. This is something of a trademark of D: syntactic sugar brought to perfection, which gives the language its unique look. Moreover, the call could even be written as
    ",".arraySep;

Does that look like a contrived perversion? Then how about this construct:
    auto helloWorld = "dlrowolleh".reverse.capitalize_at(0).capitalize_at(5).insert_at(5, ' ');

This is of course a contrived example, just to show that such syntax makes sense, but the construct is used in D as widely and as successfully as the pipe (the | sign) in bash scripts. It has a beautiful name of its own, Uniform Function Call Syntax, although in fact it is nothing more than syntactic sugar that lets you call fun(a, b, c) as a.fun(b, c).
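To make the two ideas above concrete, here is a minimal sketch. The names separator and twice are made up for the illustration, and this arraySep is a local stand-in, not the one from std.getopt; the only library calls used are filter, map and array from Phobos.

```d
import std.algorithm : filter, map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical module-level state hidden behind a property pair.
private string separator = ",";

@property string arraySep() { return separator; }
@property void arraySep(string s) { separator = s; }

// An ordinary free function; UFCS lets it be called like a member of int.
int twice(int x) { return 2 * x; }

void main()
{
    arraySep = ";";            // calls the setter, reads like an assignment
    assert(arraySep == ";");   // calls the getter

    assert(twice(21) == 21.twice);   // fun(a) and a.fun are the same call

    // UFCS chains read left to right, much like a shell pipeline.
    auto result = [1, 2, 3, 4].filter!(x => x % 2 == 0).map!twice.array;
    assert(result == [4, 8]);
    writeln(result);
}
```

The caller cannot tell whether arraySep is a variable or a pair of functions, which is exactly the point the article makes about encapsulation.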
Then we see the function call itself, and the incredible flexibility of the interface immediately catches the eye: an arbitrary number of configuration parameters, including an arbitrary handler and a description, are passed directly to the function. The suspicion that D is a dynamically typed language involuntarily creeps in. Nothing of the sort: as we will see later, this is just template technique brought to perfection.
In general, an option is described by the following line:
    [modifier,] option names, [description,] &handler
The most trivial part here is the option names, just a string of the form "f|foo|x|something-else" that defines the possible synonyms, both short and long. The description (the help string) is also just a string, but it is optional, which already implies some work with types at compile time.
The real magic begins with the handler. It should be an address, but the address of almost anything, including an enum (at this point my inner C++ programmer frowned), as well as the address of a function or a lambda (well, that's simple, right?).

The function returns a result with two fields: options, a list of options that can be printed, and the boolean helpWanted, which is true if the -h or --help option (added to the list automatically) is present on the command line.
To complete the picture, each option may have a modifier, for example required or caseInsensitive. In addition, the module defines several global variables, such as optionChar = '-', endOfOptions = "--" and arraySep (empty by default; the example above sets it to ","); assigning to them changes the syntax of the command line.
As a result we get a universal and convenient function. It is obviously a template, and it is roughly clear how to implement something similar in C++, but how exactly is it done in D?

Open the hood

The first thing that catches the eye is the extremely simple and natural way of defining template functions. The difference in syntax between ordinary and template functions is so subtle that it changes your perception: you no longer write "ordinary" and "template" functions, just functions, some of whose formal parameters happen to be templated. Looking ahead, I will say that the opts arguments can be accessed like an array: opts[0], opts[$ - 1] or opts[2 .. 5].
    GetoptResult getopt(T...)(ref string[] args, T opts)
    {
        ...
        getoptImpl(args, cfg, rslt, opts);
        return rslt;
    }

There is nothing more to say about the top-level function, because it immediately hands control over to getoptImpl(), which we will look at now.
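Before diving into getoptImpl(), here is a small sketch of how a variadic parameter list can be indexed and sliced at compile time. The functions describe, countTail and packLength are hypothetical helpers written for this illustration only; they are not part of std.getopt.

```d
import std.conv : text;

// T... is a compile-time list of types; opts is the matching list of values.
string describe(T...)(T opts)
{
    static if (opts.length == 0)
        return "nothing";
    else
        return text(opts.length, " argument(s), first is ",
                    typeof(opts[0]).stringof, ", last is ",
                    typeof(opts[$ - 1]).stringof);
}

// Number of arguments in the pack.
size_t packLength(T...)(T opts) { return opts.length; }

// Slicing the pack gives a shorter pack that can be forwarded as arguments.
size_t countTail(T...)(T opts)
{
    static if (opts.length == 0)
        return 0;
    else
        return packLength(opts[1 .. $]);
}

void main()
{
    assert(describe() == "nothing");
    assert(describe(1, 2.5, "x") == "3 argument(s), first is int, last is string");
    assert(countTail('a', 'b', 'c') == 2);
}
```

Each distinct argument list produces its own instance of the function, so the indexing and slicing cost nothing at run time.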
     1  private void getoptImpl(T...)(ref string[] args, ref configuration cfg, ref GetoptResult rslt, T opts)
     2  {
     5      static if(opts.length) {
     6          static if(is(typeof(opts[0]) : config)) {
     7              // it's a configuration flag, act on it
     8              setConfig(cfg, opts[0]);
     9              return getoptImpl(args, cfg, rslt, opts[1 .. $]);
    10          } else {
    11              // it's an option string
                    ...
    16              static if(is(typeof(opts[1]) : string)) {
    17                  auto receiver = opts[2];
    18                  optionHelp.help = opts[1];
    19                  immutable lowSliceIdx = 3;
    20              } else {
    21                  auto receiver = opts[1];
    22                  immutable lowSliceIdx = 2;
    23              }
                    ...
    34              bool optWasHandled = handleOption(option, receiver, args, cfg, incremental);
    41              return getoptImpl(args, cfg, rslt, opts[lowSliceIdx .. $]);
    42          }
    43      } else {
    44          // no more options to look for, potentially some arguments left
                ...
    68          }
    75      }
    76  }

As the line numbers show, I threw out very few lines, and the whole structure of the code is in plain view.
The first thing that catches the eye is the static if () {} else static if () {} else {} construct, and yes, it is exactly what you probably thought. The branch of a static if is selected at compile time; naturally, the condition must also be known at compile time. Thus this code (which smells slightly of spaghetti to my picky taste) is cut down during compilation to the few lines that make sense for the particular set of function arguments. As I said before, template parameters can be treated as an immutable array; opts.length is 0 when the list of options is empty, so the else branch starting at line 43 takes the place of a template specialization for that case.
Another interesting point is that the braces after static if () do not introduce a new scope. Take a look:
    16  static if(...) {
    19      immutable lowSliceIdx = 3;
    20  } else {
    22      immutable lowSliceIdx = 2;
    23  }
    41  return getoptImpl(args, cfg, rslt, opts[lowSliceIdx .. $]);

The variable lowSliceIdx is defined in one of the blocks but used outside them, which is very logical in my opinion. Since the variable is declared immutable (roughly constexpr) with a constant initializer, its value is also available at compile time and can be used in templates.
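The scoping rule is easy to check on a toy example. In the sketch below, sizeClass is a hypothetical function invented for the illustration; the constant label is declared inside the static if branches and used after them.

```d
import std.stdio : writeln;

// Picks a label at compile time; only one branch survives in each instance.
string sizeClass(T)()
{
    static if (T.sizeof <= 2)
    {
        enum label = "small";
    }
    else static if (T.sizeof <= 4)
    {
        enum label = "medium";
    }
    else
    {
        enum label = "large";
    }
    // label was declared inside a static if branch, yet it is visible here
    return label;
}

void main()
{
    assert(sizeClass!byte() == "small");
    assert(sizeClass!int() == "medium");
    assert(sizeClass!double() == "large");
    // the result can itself be evaluated at compile time
    static assert(sizeClass!short() == "small");
    writeln(sizeClass!long());
}
```

With an ordinary if the same code would not compile, because each pair of braces would open its own scope.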
Let's look deeper, to where the analysis of options and the work with types actually begin:
     6  static if(is(typeof(opts[0]) : config)) {
     7      // it's a configuration flag, act on it
     8      setConfig(cfg, opts[0]);
     9      return getoptImpl(args, cfg, rslt, opts[1 .. $]);
    10  } else {
            ......
    42  }

Ohhh, here it is! D has the typeof(expr) that C++ waited so long for, and it works exactly as intended. But that's not all: the expression is(T == U) is true, at compile time of course, if and only if the types T and U are equal, and combined with template parameters and its other forms, is() turns into a Swiss army knife for working with types. Generally speaking, is() is a built-in SFINAE: is(T) is true if and only if T is a valid type, so is(typeof(expr)) tells you whether the expression compiles. For example, is(T == U[], U) checks that T is an array, and is(T : int) checks that T can be implicitly converted to int; the colon discreetly hints at inheritance. More examples follow later. Thus the expression on line 6 statically checks whether the type of the first parameter, typeof(opts[0]), converts to the type config. And config is simply an enumeration of all possible option modifiers:
    enum config {
        /// Turns case sensitivity on
        caseSensitive,
        /// Turns case sensitivity off
        caseInsensitive,
        /// Turns bundling on
        bundling,
        /// Turns bundling off
        noBundling,
        /// Pass unrecognized arguments through
        passThrough,
        /// Signal unrecognized arguments as errors
        noPassThrough,
        /// Stop at first argument that does not look like an option
        stopOnFirstNonOption,
        /// Do not erase the endOfOptions separator from args
        keepEndOfOptions,
        /// Makes the next option a required option
        required
    }

after which getoptImpl() saves the value of the modifier (saving a value means run time) and recursively calls itself with the first argument removed from the options (opts[1 .. $]). So we have dealt with the first case of type handling, and it turned out to be surprisingly simple. If you set aside the endless compile time versus run time distinction and just read the code as written, and on meeting typeof(T) look a couple of pages up to where that type is defined (in our case, in the argument list of the getopt() call), it is almost insultingly simple, whereas in C++ the same thing looks much more like magic. Or maybe that was the intention? In the end, the compiler has exactly the same information as I do: the source code.
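The forms of is() mentioned above can be tried in isolation. The following sketch uses a made-up enum Flag as a stand-in for config and a hypothetical helper template ElementOf; everything is checked by static assert, so the program does nothing at run time.

```d
enum Flag { on, off }   // hypothetical stand-in for std.getopt.config

// is(T == U): exact type equality
static assert(is(typeof(1 + 1) == int));
static assert(!is(int == long));

// is(T : U): T converts implicitly to U
static assert(is(short : int));
static assert(is(Flag : int));        // an enum converts to its base type
static assert(is(typeof(Flag.on) : Flag));
static assert(!is(string : int));

// is(T): true only if T is a valid type, so a bad expression just gives false
static assert(!is(typeof(undefinedSymbol)));

// is(T == U[], U): pattern match that also deduces the element type U
template ElementOf(T)
{
    static if (is(T == U[], U))
        alias ElementOf = U;
    else
        alias ElementOf = void;
}

static assert(is(ElementOf!(int[]) == int));
static assert(is(ElementOf!string == immutable(char)));
static assert(is(ElementOf!double == void));

void main() {}
```

Note that a failed is() never produces a compiler error by itself; it simply evaluates to false, which is what makes it usable as a branch condition.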
Next, pulling one element at a time off the input list, the compiler reaches the first string parameter, which must be the list of names for this option (line 11). Here there are again two alternatives, just as easily resolved: if the second (next) parameter is a string, it is the description, and the third is the address of the handler; otherwise (not a string), the second parameter is the handler itself. Accordingly, we take either three or two parameters off the list and pass them to the next function, handleOption(), which parses the command line itself, and then, naturally, recursively call ourselves, and everything starts all over again.
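The two-or-three-parameter walk described above can be reproduced in miniature. This is only a sketch of the recursion pattern, not the library code: Spec and collect are hypothetical names, and instead of parsing a command line the function merely records each option's names and description.

```d
import std.stdio : writeln;

// Hypothetical record of one option specification.
struct Spec { string names; string help; }

// Walks (names, [help,] receiver) groups the way getoptImpl does,
// but only collects the descriptions instead of parsing anything.
void collect(T...)(ref Spec[] specs, T opts)
{
    static if (opts.length)
    {
        static assert(is(typeof(opts[0]) : string), "option names expected");
        static if (is(typeof(opts[1]) : string))
        {
            specs ~= Spec(opts[0], opts[1]);   // names, help, receiver
            enum next = 3;
        }
        else
        {
            specs ~= Spec(opts[0], "");        // names, receiver
            enum next = 2;
        }
        collect(specs, opts[next .. $]);       // recurse on the remaining pack
    }
}

void main()
{
    bool quiet;
    string file;
    Spec[] specs;
    collect(specs, "q|quiet", "opposite of verbose", &quiet, "o|output", &file);
    assert(specs == [Spec("q|quiet", "opposite of verbose"), Spec("o|output", "")]);
    writeln(specs);
}
```

The recursion ends when the sliced pack becomes empty, at which point static if (opts.length) removes the whole body.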
Beyond this point nothing new happens compared with what we have already seen. The handleOption() function, a template with a single parameter (the type of the handler), walks through the whole command line checking whether each argument fits the description and, if it finds a match, performs the action corresponding to its handler. I will briefly go over the points that I find most interesting.
First, a bird's-eye view:
    static if(is(typeof(*receiver) == bool)) {
        *receiver = true;
    } else {
        // non-boolean option, which might include an argument
        static if(is(typeof(*receiver) == enum)) {
            *receiver = to!(typeof(*receiver))(val);
        } else static if(is(typeof(*receiver) : real)) {
            *receiver = to!(typeof(*receiver))(val);
        } else static if(is(typeof(*receiver) == string)) {
            *receiver = to!(typeof(*receiver))(val);
        } else static if(is(typeof(receiver) == delegate) || is(typeof(*receiver) == function)) {
            // functor with two, one or no parameters
            static if(is(typeof(receiver("", "")) : void)) {
                receiver(option, val);
            } else static if(is(typeof(receiver("")) : void)) {
                receiver(option);
            } else {
                static assert(is(typeof(receiver()) : void));
                receiver();
            }
        } else static if(isArray!(typeof(*receiver))) {
            foreach (elem; ...)
                *receiver ~= elem;
        } else static if(isAssociativeArray!(typeof(*receiver))) {
            foreach (k, v; ...)
                (*receiver)[k] = v;
        } else {
            static assert(false, "Dunno how to deal with type " ~ typeof(receiver).stringof);
        }
    }

Note the recurring construct
    static if(is(typeof(*receiver) == ...)) {
        *receiver = to!(typeof(*receiver))(val);
which actually means "if a pointer to something was passed as the handler, try to convert the argument to that type and assign it through the pointer".
Handled separately are pointers to bool, which may take no argument; arrays and associative arrays, where the argument is added to the container; and functions and lambdas, which may take two, one or no arguments. Pay attention to the inner selector on the function type:
    static if(is(typeof(receiver("", "")) : void)) {
        receiver(option, val);
    } else static if(is(typeof(receiver("")) : void)) {
        receiver(option);
    } else {
        static assert(is(typeof(receiver()) : void));
        receiver();
    }

This is one more use of the is(T) expression, which evaluates to true only if T is an existing type. In this particular case it looks at the type returned by the call receiver(), receiver("") or receiver("", ""): if a function with such a signature exists, the type exists too; otherwise, SFINAE. (void is a full-fledged type.)
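The same selector works outside getopt. In the sketch below, dispatch is a hypothetical function written for this illustration: it calls the handler with as many arguments as the handler accepts and returns the number it used.

```d
// Calls handler with as many of (name, value) as it accepts
// and reports how many arguments were passed.
int dispatch(F)(F handler, string name, string value)
{
    static if (is(typeof(handler("", "")) : void))
    {
        handler(name, value);
        return 2;
    }
    else static if (is(typeof(handler("")) : void))
    {
        handler(name);
        return 1;
    }
    else
    {
        static assert(is(typeof(handler()) : void), "unsupported handler");
        handler();
        return 0;
    }
}

void main()
{
    string log;
    assert(dispatch((string n, string v) { log ~= n ~ "=" ~ v; }, "map", "a") == 2);
    assert(dispatch((string n) { log ~= "|" ~ n; }, "on", "") == 1);
    assert(dispatch(() { log ~= "|none"; }, "off", "") == 0);
    assert(log == "map=a|on|none");
}
```

A handler with any other signature is rejected at compile time by the static assert, with no run-time check involved.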
It is also worth getting acquainted with D's universal converter from the std.conv module, to!(T)(value). It works like boost::lexical_cast but, unlike it, can even convert a string to an enum, since D shamelessly uses all the information available at compile time, as we see in the code above.
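A few conversions with std.conv.to, as a quick illustration. The enum Count repeats the one from the usage example at the top; the input strings are made up.

```d
import std.conv : to, ConvException;
import std.exception : assertThrown;

enum Count { zero, one, two, three }

void main()
{
    assert("42".to!int == 42);
    assert("2.5".to!double == 2.5);
    assert("two".to!Count == Count.two);        // enum member looked up by name
    assert(Count.three.to!string == "three");   // and back again

    // Bad input is reported with an exception rather than a silent default.
    assertThrown!ConvException("five".to!Count);
    assertThrown!ConvException("4x".to!int);
}
```

This is what lets handleOption() use one and the same line, to!(typeof(*receiver))(val), for numbers, strings and enums.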
That's all: about 400 significant lines of code implement a fairly complex function, with a result that is very difficult, if not impossible, to reproduce in C++. And we, for our part, have become acquainted with some peculiarities of working with types in D: template functions with a variable number of arguments, selection of types and code branches at compile time, and type conversion. In fact, this is only a small part of the toolkit that D offers developers; the D site has a huge collection of articles on various topics. I am not urging anyone to switch to D or to learn D, but if you have a spark of curiosity and an interest in new things, this is certainly a language worth getting to know, at least superficially.

Criticism of pure reason

However, I cannot refrain from criticism: there are things in the proposed implementation that I really do not like. By and large this has nothing to do with the language itself, but it is still interesting to discuss from a general programming standpoint.
Firstly, the implementation is one-pass: each option is taken from the list and immediately run against the command line, and the first match found terminates the loop. This means that you cannot write -qqq as a synonym for "quieter, quieter, even quieter", or --map A=1 --map B=2 --map C=3 instead of --map A=1,B=2,C=3. Generally speaking this is not a bug, but it violates some established conventions for using getopt(), and I would like to see more traditional behavior.
Secondly, and this in my opinion is a serious architectural mistake, the function returns a structure with the help text that is usually printed for the -h|--help key, but the same function throws an exception in case of an error. That is, if you make a mistake on the command line, the program can no longer tell you how to do it properly. Generally speaking, this follows from the same single-pass implementation.
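A minimal sketch of the second problem, assuming the behavior described above. The wrapper run is a hypothetical name, introduced only so that the parsing can be exercised without a real command line; GetOptException is the exception class that std.getopt throws for an unrecognized option.

```d
import std.getopt;
import std.stdio : stderr;

// Hypothetical wrapper around a getopt call; returns a process exit code.
int run(string[] args)
{
    bool quiet;
    try
    {
        auto help = getopt(args, "q|quiet", "suppress output", &quiet);
        if (help.helpWanted)
            defaultGetoptPrinter("Options:", help.options);
        return 0;
    }
    catch (GetOptException e)
    {
        // help is out of scope here, so only the error text can be reported
        stderr.writeln("error: ", e.msg);
        return 1;
    }
}

void main()
{
    assert(run(["prog", "--quiet"]) == 0);
    assert(run(["prog", "--no-such-option"]) == 1);
}
```

In the catch branch the list of options that would explain the correct syntax is simply not available, because the call that should have returned it never completed.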
UPD: Does Alexandrescu read Habr?
This was fixed in the latest commit, not quite the way I would have done it, but nonetheless.
In addition, there are several minor flaws. For example, an option can have any number of synonyms, but only the first two are included in the help text: in the option "x|abscissa|initialX", the last name can be discovered only by looking at the code. And there are other annoying little things of that kind.
Therefore I wrote my own implementation, purely as an exercise, in which I fixed these flaws and added various bells and whistles of my own; in short, I had as much fun as I wanted.

Here was my bike! Where is my bike?

But no, the bike now lives over there; it happens. I decided that a good guide should know when to stop, so the tour ends here.
I hope it was interesting.

Bibliography

I have read the first three books, so I give each of them a separate number. They are all good, but none of them is perfect for my taste, so I read them sandwich-style: a chapter from one, then the corresponding chapters from the other two.
  1. Probably the first book on D historically
  2. Alexandrescu's book
  3. A very good cookbook
  4. All other currently existing books on D
  5. A bunch of articles, even more interesting than the books, but each on a separate topic
  6. The wiki, just as interesting as the articles, but unfortunately you need to know what you are looking for
  7. The main D site
  8. A Russian version also exists
  9. The standard library on GitHub. The sources of getopt() are there too.

Source: https://habr.com/ru/post/263515/

