
Metaprogramming: what it is and how it should be


Metaprogramming is a kind of programming concerned with writing programs that produce other programs as their output (wiki). It is a rather broad term: according to the same Wikipedia, it covers source code generated by external tools, various preprocessors, "compiler plugins" (macros that can modify the syntax tree directly during compilation), and even such an exotic possibility as self-modifying code, where a program modifies itself at runtime.

Self-modifying code, though, is a large topic worthy of a separate study; here, by metaprogramming, we will mean the processes that take place while a program is being compiled.

Metaprogramming is implemented in one form or another in very different languages. Leaving aside exotic and near-exotic languages, the best-known example of metaprogramming is C++ with its template system. Among the "newer" languages, D and Nim are worth considering. One of the most successful attempts to implement metaprogramming is the Nemerle language. We will examine the subject using these four languages as examples.
Metaprogramming is an interesting topic; in this article I will try to figure out what it is and what it should look like in the ideal case.

Compilation stages


Before we start discussing the topic, we should look at how a compiler works.
So, the compiler's input is the source code, as program text.

The first stage of compilation is lexical analysis. At this stage, the program is split from solid text into tokens (lexemes): each variable, literal, operator, keyword, and comment becomes a separate object. Of course, working with such objects is much more convenient than working directly with the source lines.

The next stage is parsing, which builds a syntax tree. At this stage, the linear structure becomes hierarchical, matching the way we actually think of a program when writing it. Classes, functions, code blocks, and operations become nodes of an abstract syntax tree (AST). This analysis itself consists of many stages, including work with types (such as type inference) and various optimizations.

The final stage is code generation. Based on the syntax tree, virtual and/or machine code is generated; everything here depends on the target architecture: registers and memory are allocated, tree nodes are turned into sequences of instructions, and additional optimization is carried out.

First of all, we will be interested in lexical analysis and parsing (although metaprogramming is also possible at the code generation stage... but that is a separate, large, and almost entirely unexplored topic).


Lexical Macros


One of the oldest metaprogramming tools that has survived to the present day is the C preprocessor. The first preprocessors were probably indeed separate programs that wrote the result of their work back out as source text. The preprocessor is lexical because it operates at the level of tokens, that is, on the sequence of tokens it receives (although in fact the preprocessor still performs the simplest parsing for its own purposes, for example for macros with arguments). A preprocessor can replace one sequence of tokens with another; it knows nothing about the syntactic structure of the program, which is why it is so easy to write the famous
#define true false 

The problem with the C preprocessor is that, on the one hand, it works at too early a stage (right after lexical analysis, when the program still differs little from plain text); on the other hand, it is not a full-fledged programming language, but merely a system for conditional compilation and for replacing some sequences of lexemes with others. Of course, a lot can be done even with this (see boost.preprocessor), but it still falls short of full-fledged, and most importantly clear and convenient, metaprogramming.

C++ Templates


The next most famous metaprogramming tool is C++ templates: constructs that allow you to create parameterized classes and functions. Templates, unlike C macros, already work with the syntax tree. Consider the most ordinary template in C++
 template<class T, int N>
 struct Array {
     T data[N];
 };

and its application (instantiation):
 Array<int,100> arr; 


What happens here from the compiler's point of view? The template structure is a separate, special AST node; the template has two parameters, a type and an integer constant. At the point where the template is instantiated, the template arguments (which are actually also AST nodes) are substituted into the template body in place of the formal parameter names; the result is the creation (or lookup of a previously created) node, which is then used in the main syntax tree. The following is important here: both the template itself and the template arguments at the point of instantiation are not just a type and a number; they are nodes of the syntax tree. That is, by passing the type int and the number 100, you actually construct two small syntax trees (in this case, each with a single node), pass them into the larger syntax tree (the template body), and end up with a new tree that is inserted into the main syntax tree. It looks like the macro substitution mechanism, but at the level of syntax trees.

Of course, template arguments can also be more complex constructs (for example, std::vector<std::set<int>> can be passed as a type). But here it is time to note the fundamental incompleteness of the capabilities of C++ templates. In accordance with clause 14.1 of the standard, template parameters can only be types and a few kinds of non-types: integers, enumeration values, pointers to class members, pointers to global objects, and pointers to functions. In general, the logic is clear: the list contains things that can be determined at compile time. But, for example, strings and floating-point numbers are missing from it for no obvious reason. And once you remember that the arguments are AST nodes, it becomes obvious that many other useful things are missing too. After all, what prevents passing an arbitrary AST node, such as a variable name or a code block, as an argument? Similarly, only classes (structures) and functions can themselves be templates. And what prevents making an arbitrary block of code a template, whether imperative (control statements, expressions) or declarative (for example, a fragment of a structure or enumeration)? Nothing but the absence of such features in C++ itself.

From templates to syntax macros


The template mechanism, even in the form in which it exists in C++, provides quite broad opportunities for metaprogramming. And yet it is just a system for substituting some AST fragments into others. What if we go further and, besides substitutions, allow something more, namely the execution of arbitrary actions on AST nodes by a script? These are syntax macros, the most powerful metaprogramming tool to date: arbitrary code written by the programmer and executed while the main program is compiled, with access to the compiler's API and to the structure of the compiled program in the form of an AST; in effect, full-fledged compiler plugins embedded in the program being compiled. Not many programming languages implement this feature; one of the best implementations, in my opinion, is in the Nemerle language, so we will consider it in more detail.
Here is the simplest example from almost official documentation:
 macro TestMacro() {
     WriteLine("compile-time\n");
     <[ WriteLine("run-time\n") ]>;
 }

If you insert a macro call (which, by the way, looks no different from a function call) into another file
 TestMacro(); 

then during compilation we will see the message “compile-time” in the compiler's log, and when the program is run, the message “run-time” will be printed to the console.

As we can see, the macro is ordinary code (in this case, in the same Nemerle language as the main program); the difference is that this code is executed while the main program is being compiled. Compilation is thus divided into two phases: first the macros are compiled, and then the main program, during whose compilation the previously compiled macros can be invoked. The first line of the macro is executed at compile time. The second line contains an interesting syntax: the special brackets <[ ]>. With these brackets you can quote code fragments, by analogy with ordinary strings. But unlike strings, these are fragments of AST, and they are inserted into the main syntax tree of the program, just like templates at instantiation.

The special brackets are needed because macros, unlike templates, live in a different context, a different dimension; we need to somehow separate the macro's own code from the code the macro operates on. Such quotations in Nemerle terms are called quasiquotes: “quasi” because they can be constructed on the fly using interpolation, a feature familiar to everyone who writes in scripting languages, where variable names inserted into a string with a special syntax are replaced by the values of those variables. Another example from the documentation:
 macro TestMacro(myName) {
     WriteLine("Compile-time. myName = " + myName.ToString());
     <[ WriteLine("Run-time.\n Hello, " + $myName) ]>;
 }

The argument is an AST node (just as with templates); to insert a node into a quasiquote, the $ symbol is placed before its name.

Of course, instead of such a quotation, the inserted AST fragment could have been constructed manually, using the compiler API and the types available in the macro's context that correspond to AST nodes. Something like
 new FunctionCall( new Literal("run-time\n") ) 

but it is much easier to write the code “as is” and entrust the work of building the AST to the compiler; after all, that is exactly what it is for!

In D, metaprogramming is represented by templates (which are broadly similar to C++ templates, although with some improvements) and by mixins. Let us consider the latter in more detail. The first kind is template mixins, essentially an extension of templates with the ability to turn arbitrary code fragments into templates. For example, this program will print the number 5:
 import std.stdio;

 mixin template testMixin() {
     int x = 5;
 }

 int main(string[] argv) {
     mixin testMixin!();
     writeln(x);
     return 0;
 }

The variable declared in the mixin becomes available after the mixin is included in the code.

The second kind is string mixins; in this case, the argument of the mixin keyword is a string containing D code:
 mixin (`writeln("hello world");`); 

The string, of course, must be known at compile time; and it can be not only an explicitly written constant string (otherwise there would be little point in the feature), but also a string built programmatically at compile time! Ordinary D functions can be used to build the string, the same ones usable at runtime. With certain restrictions, of course: the compiler must be able to execute the code of these functions at compile time (yes, a rather powerful interpreter of the D language itself is built into the D compiler).

With string mixins we do not work with AST nodes in the form of quasiquotes; instead we work with source code of the language, which is formed explicitly (for example, by concatenating strings) and then travels the full path of lexical analysis and parsing. This approach has both advantages and disadvantages. Personally, working directly with the AST seems to me cleaner and ideologically more correct than generating lines of source code; still, working with strings may be useful in some situations.

The Nim language also deserves a brief mention. In it, the template keyword works similarly to D's mixin template (while for classic templates in the early C++ style a different concept, generics, is used). The macro keyword declares syntax macros somewhat similar to Nemerle's. Nim attempts to formalize the phases at which templates are invoked: using special pragmas, you can specify whether a template should be expanded before or after variable names are resolved. Unlike D, there is an API to the compiler with which you can explicitly create AST nodes. The issues of macro “hygiene” are also addressed (a macro is “hygienic” if it is guaranteed not to affect identifiers at the point of its use... these issues deserve a closer look, but probably another time).

Conclusions


Looking at the implementations of metaprogramming in different languages, some thoughts arise about what it should look like in the ideal case. Here are a few of them.

Metaprogramming should be explicit.

A macro call is a very specific thing (actually, a VERY specific thing!), and the programmer must be able to identify macros in the code unambiguously and visually (even without syntax highlighting). Therefore, the syntax of macro calls must clearly differ from the syntax of function calls. This requirement is more or less fulfilled only in D (the special keyword mixin at the call site); Nemerle and Nim macros are indistinguishable from functions. Moreover, Nemerle has several other ways to invoke a macro: macro attributes and the ability to redefine the language's syntax. Here we can digress a little and note that syntax, unlike functions and classes, is global; I view this possibility rather negatively, because it leads to the erosion of the language and its transformation into a generator of languages, which means that for each new project you would have to learn a new language... a doubtful prospect, I must say.

Using the same language for programs and metaprograms is not necessary.
In all the examples considered, metaprograms (macros) were written in the same language as the main program. It seems that language designers did not consider the possibility of using a different programming language for macros, one better suited to interpretation than to compilation.

Meanwhile, an example of an alternative approach has always been in plain sight: web programming uses the HTML markup language and the JavaScript programming language. JavaScript is executed while the HTML is rendered (the analog of compilation), and scripts have access to the object model of the HTML document (the HTML DOM), a close enough analog of the AST. Using the appropriate functions, you can add, modify, and delete HTML DOM nodes, and at different levels: both as HTML source text and as nodes of the DOM tree.

 // Inserting html as source text: the analog of string mixins in D
 document.getElementById('myDiv').innerHTML = '<ol><li>html data</li></ol>';

 // Building html as DOM-tree nodes: the analog of AST construction in Nim
 var link = document.createElement('a');
 link.setAttribute('href', 'mypage.htm');
 document.getElementById('myDiv').appendChild(link);

What are the advantages and disadvantages of this approach?
Obviously, a metaprogram should not be able to crash or hang the compiler. And obviously, if metaprograms are written in C, with pointers and other low-level things, it will be very easy to bring it down. On the other hand, sharing code between programs and metaprograms is convenient. Although this “shared code” is still limited to fairly generic things like pure algorithms: the compiler's API is not applicable in programs, and the libraries of the “real OS” (graphics, windowing, networking...) are of very little use in metaprograms. Of course, you could create a couple of windows during compilation and show them on the screen, but why?

Thus, programs and metaprograms need not be in the same language. They have completely different tasks and completely different execution environments. Probably the best solution is to leave the choice to the programmer and allow several languages: both a safe subset of the main language and some common scripting language; the same JavaScript would do fine.

Standardizing the compiler API
The appearance and spread of full-fledged metaprogramming in a language will inevitably require standardization of the compiler's API. Of course, this would also have a positive impact on the quality of the compilers themselves, on their compliance with the standard, and on their compatibility with each other. The example of HTML and browsers is in itself quite encouraging here. Although the structure of an AST is more complex than HTML markup (incompatibility between some kinds of nodes, and other peculiarities), it would be quite reasonable to build such an API on the experience of browser technologies combined with the experience of existing metaprogramming implementations.

IDE support
Metaprogramming can be quite complicated. So far, none of the implementations known to me offers any means of easing the programmer's work: compiling in your head is still quite an undertaking (though of course there are fans...), even if metaprogrammers in C++, for example, do exactly that. Therefore, I consider it necessary to have tools such as expanding templates and macros in a special IDE mode, both while debugging and while editing code, and some analog of running code “from the command line”, a REPL for macros. A programmer must have a full set of tools for visual access to the AST, for separate compilation and test runs of macros, and for stepping through compilation (exactly that: watching in a special debugger how a macro works, step by step, while the main code is compiled).

Well, that is perhaps all. Many questions remained behind the scenes, but I think even this is enough to appreciate the incredible power of metaprogramming. Now imagine that all of this were already in C++. Look at the Boost library, at the amazing and incredible things people do even with the existing templates and lexical macros...

If it were so... what truly hacker-like possibilities would open up to us!

Source: https://habr.com/ru/post/258667/

