AST analysis using patterns

I'm currently working on senjin / gglsl , a library for programming shaders using Groovy, which I recently wrote about .

Here I will describe three approaches to the analysis of AST (abstract syntax tree), all with examples of sub-tasks arising from one another and related to a common context: recursive functions, pattern Visitor, and pattern matching.
Pattern matching is implemented in Java and is available on GitHub .

Recursive functions

90% of the functionality is done in 1% of the time (or something like that)

At the first stage, it was enough for me to simply translate the Groovy code into glsl code. For this, I used simple tree parsing functions that recursively call each other. Each takes an AST node of a parsed Groovy as input and returns a line representing the translated code already on glsl. This is quite enough. languages are very similar, in the process there is no need to accumulate any information, and in each node all necessary information is locally available.

An example of the functions of translation of a binary expression and unary minus:
')

private String translateExpression(BinaryExpression e) { def opName = e.operation.getText() if (opName == "[") return translateExpression(e.leftExpression) + "[" + translateExpression(e.rightExpression) + "]" return "(" + translateExpression(e.leftExpression) + " " + opName + " " + translateExpression(e.rightExpression) + ")" } private String translateExpression(UnaryMinusExpression e) { return "-" + translateExpression(e.expression) }

Visitor pattern

The devil is in the details

It would hardly be convenient to create shaders if functions cannot be created. So after a couple of working shaders, I decided to implement the broadcast functions. Here from all the cracks climbed subtleties and details.
In glsl, passing arguments is by value, and in Groovy by value we work only with primitives. In addition, using the out qualifier in glsl, you can pass a primitive or vector back, and in Groovy this is not possible. Annotations will not help either, since I want to have not just the “looking” code correctly, but both the one being launched and the one being debugged. This is just a couple of examples, in fact there are many subtleties.

After the despair stage, several decisions were made. For example, to prohibit rewriting non-primitive parameters, prohibiting the return of them if they are method parameters, etc.
I will give the code for checking the presence of a modification of non-primitive parameters using a visitor:

 for (Parameter parameter : methodNode.getParameters()) if (!isPrimitive(translateType(parameter.getText()))) { YList<Boolean> rewritten = al(false); CodeVisitorSupport detectRewrittenVisitor = new CodeVisitorSupport() { @Override public void visitBinaryExpression(BinaryExpression expression) { if (expression.getOperation().getText().equals("=")) { Expression left = expression.getLeftExpression(); if (left instanceof VariableExpression) { if (((VariableExpression) left).getName().equals(parameter.getName())) { rewritten.set(0, true); } } } super.visitBinaryExpression(expression); } }; detectRewrittenVisitor.visitBlockStatement((BlockStatement) methodNode.getCode()); if (rewritten.get(0)) System.out.println(parameter.getName() + " rewritten!"); }

The code is simple, but the task is primitive. And although the visitor for such tasks and thought - already here the code prohibitively much. Especially depressing checks of the types of child nodes and extracting additional information from them. The problem will be aggravated when the task becomes more difficult - when it is necessary, for example, to check for the presence of a field in the code. When parsing Groovy, you will first have to make sure that we are in the right branch of the assignment (or a number of other nodes), and it is only DEEPER to find, in fact, the node in which reading takes place. Those. here, you either have to enter a state into the visitor, or make two visitors, one to search for the “right part of assignment” branch, and the second to search for the “read field” node somewhere in the depth of the expressions of this branch.

Pattern matching

When it becomes easier to breathe, and there is no headache from changes in the TK

Unfortunately, the previous options are suitable for simple cases, but make you despond at difficult tasks. You have to write a lot of code for simple concepts, the existing code begins to interfere with the new, you have to add different drives and intermediate states. And I want to - just specify a piece of AST tree with variables, and say what to do if one is found ...

No sooner said than done! The final version of the pattern for finding the rewriting of the field in a given variable:

 Object varWritePattern = stairs( deeper(G_BODY_ACCESSORS), p(BinaryExpression.class, "operation", p("text", "="), "leftExpression"), p(VariableExpression.class, "variable", var("VAR_NAME"))); boolean wasWritten = match(methodNode, varWritePattern).notEmpty();

(DO NOT PANIC, a little further I will describe the details, now you just need to enjoy the spectacle)

There is only one variable (var ("VAR_NAME")), and it points to the name of the variable to be rewritten (in the source code).

The most interesting thing here is that the addition of quite complex (for the visitor) conditions - results in the addition of one line for each condition! Those. amazing result - now the size of the code linearly depends on the complexity of the TOR (technical assignment) and this complexity does not accumulate! More complex pattern:

 Object G_FIELD_AS_ARG_PATTERN = stairs( deeper(G_BODY_ACCESSORS), p(MethodCallExpression.class, "getMethod", p("getText", var("METHOD_NAME")), "getArguments"), p(ArgumentListExpression.class, "expressions"), i(), p(PropertyExpression.class, "getObjectExpression", p(VariableExpression.class, "variable", var("OBJ_NAME")), "getProperty", p("value", var("FIELD_NAME")))); Matcher.match(tree, G_FIELD_AS_ARG_PATTERN)

Here, the place in the tree is calculated, where the field is passed to some variable as an argument to some function. There are already 3 variables at once - the name of the method (METHOD_NAME), the name of the object (OBJ_NAME), and the name of the field in it (FIELD_NAME). If desired, you can easily add more variables (for example, var ("ARG_INDEX") - to get or bind the argument number). The great fact is that you can pass previously bound variables to the match function! Then, without changing the pattern, we can find a subtree with a specific object name, or with a specific index, individually or all at once. One can only imagine how much it would be necessary to enter a visitor in different places in order to achieve the same goals.

Passage backstage

When magic is explained, it ceases to be magic (but remains a fairly advanced technology)

The pattern is a cast of a piece of wood to find. Matcher takes this cast and searches in a tree - which piece would fit.

 Matcher.match(tree, pattern)

This function returns a list of all matches. Each record is a mapping of a variable to that element of the tree that corresponded to it during the match. If there were no variables in the pattern, then the records will be empty, but by their number one can see if there were matches and how many (which can also be the purpose of the query).
In addition, you can pass pre-connected variables to the match to limit the options:

 boolean wasWritten = match(methodNode, varWritePattern, hm("VAR_NAME", parameterName)).notEmpty();

Here we state in advance that only such a sub-tree is suitable, in which the variable VAR_NAME corresponds to a certain parameterName

It should be noted that by a tree here it is meant not some specific data structure (tree), but anything. It may just be an array with numbers or objects. Or just one object of any type.

The pattern itself does not contain an exhaustive 100% matching piece. It only indicates what matters. In addition - it consists not of the same classes as the tree, but of special ones. Next I will give examples of such classes with explanations.

 new Property("getClass", ReturnStatement.class, "getExpression", var("EXPR"));

Property is a class containing a set of name-value pairs. Its meaning is that the node it is looking for must contain a field or method with the specified name, and the value (or result) must match the specified one. Values can be:

other elements of the pattern (maybe again Property, for example), then the search will continue on them
ordinary objects (string, number, or any class), then it will just be checked for a match

 new Var("VAR_NAME", rest);

Var is the same variable that contains the name and the continuation of the pattern. As a result, it is by this name that the link to the corresponding tree node will be contained.

 new ByIndex(3, rest);

ByIndex - allows to match List and arrays. Moreover, the index itself is also considered a “node”, so you can match, for example, “the same values in two arrays located in the same place”.

 new Deeper(G_BODY_ACCESSORS, rest);

Deeper is an interesting item. Lets say that "no matter what we meet next, but in the end we will find ...". It allows you to go down the data structure to an unspecified depth initially. A good example is expressions in a language — it is not known how many levels of BinaryExpression and other branches are located between the declaration of a method and the specific use of a variable. Also in it you need to specify a set of mini-patterns, accessors, with which it will work its way.

I am not a fan of records like new LongJavaClassName (..., so if something is created more than a few times, I write a static method. So by rewriting new Property to p (), new Variable to var (), etc. ., the pattern takes on such a neat look:

 Object varWritePattern = deeper(G_BODY_ACCESSORS, p( "getClass", BinaryExpression.class, "operation", p("text", "="), "leftExpression", p( "getClass", VariableExpression.class, "variable", var("VAR_NAME"))));

As you can see, here is a declaratively described template for finding assignments. You can use it like this:

 YSet<YMap<String, Object>> match1 = match(node, varWritePattern); YSet<YMap<String, Object>> match2 = match(node, varWritePattern, hm("VAR_NAME", varName));

In the first case, we get all the options, and in the second, only those in which the assignment occurs in a variable with the transferred value varName (the binding of the variable).

We get rid of steps of indents

When the stairs in the house more than floors

I never liked this style in Lisp when you always need to open one more bracket (apparently not only to me, because Clojure has -> and - >>). It is much more convenient when I “call a method” on the result of the previous call, as in Scala, Xtend, my YCollections , and many more:

  String names = al(new File("/home/user/").listFiles()) .filter(File::isDirectory) //only dirs .map(File::getName) //get name .filter(n -> n.startsWith(".")) //only invisible .sorted() //sorted .foldLeft("", (r, n) -> r + ", " + n); //to print fine System.out.println(names);

To achieve the same effect, I added the “stairs” function, which (by analogy with “- >>” from Clojure) - simply adds what comes later to the rest field of what comes before (not so easy, but the essence is ). Now a rather complex pattern begins to look quite acceptable:

 Object G_FIELD_AS_ARG_PATTERN = stairs( deeper(G_BODY_ACCESSORS), p(MethodCallExpression.class, "getMethod", p("getText", var("METHOD_NAME")), "getArguments"), p(ArgumentListExpression.class, "expressions"), i(var("ARG_INDEX")), p(PropertyExpression.class, "getObjectExpression", p(VariableExpression.class, "variable", "OBJ_NAME"), "getProperty", p("value", var("FIELD_NAME"))));

In this form, it is very convenient to read the pattern - just from the top down, the branches - to the right. It is easy to save and paste logic from neighboring patterns.

Same commentary example

 Object G_FIELD_AS_ARG_PATTERN = stairs( // -  deeper(G_BODY_ACCESSORS), //    p(MethodCallExpression.class, //   "getMethod", p("getText", var("METHOD_NAME")), //  getArguments ... "getArguments"), //  ,     expressions, ... p(ArgumentListExpression.class, "expressions"), // ,    (  ) ... i(var("ARG_INDEX")), //  p(PropertyExpression.class, // -  "getObjectExpression", p(VariableExpression.class, "variable", var("OBJ_NAME")), //    "getProperty", p("value", var("FIELD_NAME"))));

Now use this pattern to analyze the following Groovy code:

 public void foo(Vec3f vecA, Vec3f vecB) { bar(vecA.x, vecA.y) bar(vecA.x, vecB.x) }

Razpars him:

 String src = IO.readFile("src/main/java/yk/senjin/shaders/gshader/analysis/HabraExample.groovy"); //parse kotlin file Object node = new AstBuilder().buildFromString(CompilePhase.INSTRUCTION_SELECTION, src);

And finally, we will make several requests:

 //select method "foo" for (YMap<String, Object> m : match(node, stairs(deeper(G_METHOD_ACCESSORS), var("method"), p(MethodNode.class, "name"), "foo"))) { //getting methodNode object from select result Object methodNode = m.get("method"); System.out.println("all variables are free:"); System.out.println(match(methodNode, G_FIELD_AS_ARG_PATTERN).toString("\n")); System.out.println("fixed OBJ_NAME:"); System.out.println(match(methodNode, G_FIELD_AS_ARG_PATTERN, hm("OBJ_NAME", "vecB")).toString("\n")); System.out.println("fixed ARG_INDEX:"); System.out.println(match(methodNode, G_FIELD_AS_ARG_PATTERN, hm("ARG_INDEX", 0)).toString("\n")); }

Output to console:

 all variables are free: {METHOD_NAME=bar, OBJ_NAME=vecA, FIELD_NAME=x, ARG_INDEX=0} {METHOD_NAME=bar, OBJ_NAME=vecA, FIELD_NAME=y, ARG_INDEX=1} {METHOD_NAME=bar, OBJ_NAME=vecB, FIELD_NAME=x, ARG_INDEX=1} fixed OBJ_NAME: {OBJ_NAME=vecB, METHOD_NAME=bar, FIELD_NAME=x, ARG_INDEX=1} fixed ARG_INDEX: {ARG_INDEX=0, METHOD_NAME=bar, OBJ_NAME=vecA, FIELD_NAME=x}

Another example, for a very simple data structure - an array of vectors:

 Vec3f[] vv = new Vec3f[]{new Vec3f(0, 0, 0), new Vec3f(1, 1, 1)}; System.out.println(match(vv, i(p("x", var("X_VALUE"))))); System.out.println(match(vv, i(var("OBJ_INDEX"), p("x", var("X_VALUE")))));

The console will appear:

 [{X_VALUE=0.0}, {X_VALUE=1.0}] [{OBJ_INDEX=0, X_VALUE=0.0}, {OBJ_INDEX=1, X_VALUE=1.0}]

For completeness, here is an example of accessory for Deeper:

 public static final YArrayList<Object> G_BODY_ACCESSORS = al( i(var("access")), p("methodsList", var("access")), p(MethodNode.class, "code", var("access")), p(MethodCallExpression.class, "getReceiver", var("access")), p(BlockStatement.class, "getStatements", var("access")), p(ExpressionStatement.class, "expression", var("access")), p(BinaryExpression.class, "leftExpression", var("access")), p(BinaryExpression.class, "rightExpression", var("access")), p(DeclarationExpression.class, "getLeftExpression", var("access")), p(DeclarationExpression.class, "getRightExpression", var("access")), p(UnaryMinusExpression.class, "getExpression", var("access")), p(UnaryPlusExpression.class, "getExpression", var("access")), p(ReturnStatement.class, "getExpression", var("access")), p(ConstructorCallExpression.class, "arguments", p("expressions", i(var("access")))), p(IfStatement.class, "booleanExpression", var("access")), p(IfStatement.class, "ifBlock", var("access")), p(IfStatement.class, "elseBlock", var("access"))

We can say that it performs a role similar to the one that the visitor's interface performs, only much simpler and more flexible. For example, the line i (var ("access")) allows you to descend into any list or array, i.e. it is not tied specifically to this tree, but is common. So It is easy to set rules about where you can and where you cannot go down with the help of Deeper - just by specifying the appropriate set of accessors.

Interesting idea. It seems that using the pattern-match, you can analyze not only the AST, but also the control-flow , for this you need to properly describe the accessors in Deeper and ...

About optimization

It is time

It may seem that the approach is good, but not suitable for high-loaded areas (for example, parsing millions of lines of code). This is true, but not true.

Firstly, it is very easy to describe the subject area with patterns. And this allows us to divide the tasks - into “the implementation of the TOR”, and into “code optimization”. Having a “slow”, simple, correct variant, well-adjusted and covered with tests, an optimized variant is much easier and faster to create than to accomplish by solving two problems at the same time - the implementation of the TOR and optimization. Those. here it can be divided - the ease of implementation of the TZ and the optimization of the working code.

And this is much more important than it may seem. I had tasks in which the implementation of TZ, and THEN optimization can take a week, and implementation TOGETHER with optimization can take a month. Yes, the difference may be the order, and on the scale of the project even the orders.

Secondly, the very fact of declarativeness in the description of the subject area opens up possibilities. According to the declarations (theoretically), you can generate an optimized code that stores all the necessary state in variables, traverses the tree once, performing many parallel tasks, etc. Those. In theory, you can make an optimized, “unpleasant” code generator. Unpleasant - because a lot of state, a lot of parallel processes. As a parser generator.

Sum up the pros

You can't praise yourself - how will Google find it?

declarativeness (you can draw, you can analyze, ...)
easy syntax
non-accumulative code complexity
variable descent depth
no binding to a specific data type
two roles of variables (result / link)
potential for optimizations
easy reuse and copy-paste templates
can be used as a normal select
separation of tasks

Wiki with mavena github.com/kravchik/jcommon/wiki/pattern-matching

PS you can once again admire Lisp, in which pattern matching for AST analysis would be completely natural and not worth a separate article.

PPS and how do you disassemble the trees? What are my industrial bike analogues?

Thanks to kleshney and oshyshko for reading the draft

Source: https://habr.com/ru/post/270173/

All Articles