JavaScript compiler using ANTLR

Earlier this year, I joined the team working on MongoDB Compass , a GUI for MongoDB. Compass users through Intercom have requested a tool that allows them to write database queries using any convenient programming language supported by the MongoDB driver . That is, we needed the ability to transform (compile) the Mongo Shell language into other languages and vice versa.

This article can be a practical guide to help when writing a compiler in JavaScript, as well as a theoretical resource that includes the basic concepts and principles of compiler creation. At the end is not only a complete list of all materials used in writing, but also links to additional literature aimed at a deeper study of the issue. Information in the article is submitted sequentially, starting with the study of the subject area and then gradually complicating the functionality of the application being developed as an example. If during the reading it seems to you that you do not catch the transition from one step to another, you can refer to the full version of this program and perhaps this will help to eliminate the gap.

')

Content

Terminology

The compiler converts the source program in a high-level programming language into an equivalent program in another language that can be executed without the compiler.

Lexer - A compiler element that performs lexical analysis.

Tokenizing - The process of analyzing the input sequence of characters into recognized groups (tokens, tokens).

Parser (Parser) - The element of the compiler that performs parsing, the result of which is the parse tree.

Parsing - The process of comparing a linear sequence of lexemes of a natural or formal language with its formal grammar.

Lexeme or Token - A sequence of valid characters in a programming language that makes sense to the compiler.

Visitor (Visitor) - The pattern of working with a tree, in which, to continue processing of descendants, you must manually call a workaround.

Listener (Walker or Walker) - Pattern of working with a tree, when the methods of visiting all descendants are called automatically. In the Listener, there is a method called at the beginning of a site visit (enterNode) and a method called after a site visit (exitNode).

Parse Tree or Parse Tree - A structure representing the result of the work of the parser. It reflects the syntax of the structures of the input language and clearly contains the complete interrelation of operations.

The abstract syntax tree (AST - Abstract Syntax Tree) differs from the parse tree in that there are no nodes and edges for those syntax rules that do not affect the semantics of the program.

Universal Abstract Syntax Tree (UAST - Universal Abstract Syntax Tree) is a normalized form of AST with language independent annotations.

Traversing a tree in depth (DFS - Depth-first search) - One of the methods of traversing the graph. The strategy of searching in depth, as the name implies, is to go "in depth" of the graph as far as possible.

Grammar (Grammar) - A set of instructions for constructing lexical and syntactic analyzers.

Root - The topmost node of the tree from which to start the walk. This is the only tree node that does not have an ancestor, but is itself an ancestor for all other tree nodes.

Leaf, leaf or terminal node (Leaf) - A node that has no descendants.

Parent - A node that has a child. Each tree node has zero or more descendant nodes.

Literal - Representation of some fixed value or data type.

The code generator receives as input the intermediate representation of the source program and transforms it into the target language.

Study

You can consider three ways to solve the problem of transforming one programming language into another (Source-to-source transformation):

Use existing parsers for specific programming languages.
Create your own parser.
Use a tool or library to generate parsers.

The first option is good, but covers only the most famous and supported languages. For JavaScript, there are such parsers as Esprima , Falafel , UglifyJS , Jison and others. Before you write something yourself, it is worth exploring the existing tools, perhaps they completely cover the functionality you need.

If you are unlucky and have not found a parser for your language or the found parser does not satisfy all your needs, you can resort to the second option and write it yourself.

A good start for understanding compiler writing from scratch can be Super Tiny Compiler . If you remove comments from the file, then only 200 lines of code remain, which contain all the basic principles of a modern compiler.

The author of the article Implementing a Simple Compiler on 25 Lines of JavaScript shares its experience of creating a JavaScript compiler. And covers concepts such as lexical analysis, parsing, and code generation.

How to write a simple interpreter in JavaScript is another resource that considers the process of creating an interpreter using the example of a simple calculator.

Writing a compiler from scratch is a time-consuming process that requires a thorough preliminary study of the syntactic features of the languages being compiled. In this case, it is necessary to recognize not only keywords, but also their position relative to each other. The rules for analyzing the source code should be unambiguous and give an identical result at the output, under the same initial conditions.

With this, tools and libraries for parser generation can help. They take the raw source code, break it into tokens (lexical analysis), then match the linear sequences of tokens to their formal grammar (syntactic analysis) and put them into a tree-like organized structure, according to which a new code can be built. Some of these are covered in the article Parsing in JavaScript: Tools and Libraries .

At first glance it may seem that the solution to the problem is in our pocket, however, in order to teach the parser to recognize the source code, we will have to spend many more man-hours on writing instructions (grammar). And if the compiler has to support several programming languages, this task becomes much more complicated.

It is clear that we are not the first developers who are faced with a similar task. The work of any IDE is related to code analysis, Babel transforms modern JavaScript into a standard supported by all browsers. This means that there must be grammars that we could reuse and thus not only facilitate our task, but also avoid a number of potentially serious errors and inaccuracies.

Thus, our choice fell on ANTLR, which is best suited to our requirements, since it contains grammars for almost all programming languages.

As an alternative, you can consider Babelfish , which analyzes any file in any supported language, extracts AST from it, and converts it to UAST, in which nodes are not tied to the syntax of the source language. At the entrance we can have javascript or c #, but at the level of UAST we will not see any differences. In the terminology of compilers, the transformation process is responsible for the process of converting AST to a universal type.

The novice compiler author may also be interested in Astexplorer , an interface that allows you to see what the syntax tree for a given code fragment and corresponding to the chosen language of the parser can be. It can be useful for debagging or forming a general idea of the structure of AST.

Compiler building

ANTLR (Another Tool For Language Recognition - another language recognition tool) is a parser generator written in Java. At the entrance, it takes a piece of text, then analyzes it on the basis of grammars and converts it into an organized structure, which in turn can be used to create an abstract syntax tree. ANTLR 3 took on this task too - generated AST. However, ANTLR 4 eliminated this functionality in favor of using StringTemplates and operates only with such a concept as “Parse Tree”.

For more information, you can refer to the documentation , or read the excellent article The ANTLR Mega Tutorial , which explains what a parser is, why you need it, how to configure ANTLR, useful ANTLR functions and much more with a ton of examples.

In more detail about the creation of grammars can be read here:

To transform one programming language into another, we decided to use ANTLR 4 and one of its grammars, namely ECMAScript.g4. We chose JavaScript because its syntax corresponds to the Mongo Shell language and is also the development language of the Compass application. Interesting fact: we can build a parse tree using Lexer and Parser C #, but bypassing it using ECMAScript grammar nodes.

This question requires more detailed research. It’s safe to say that not every code structure will be correctly recognized by default. It will be necessary to expand the workaround functionality with new methods and checks. However, it is already clear that ANTLR is an excellent tool when it comes to supporting multiple parsers within a single application.

ANTLR creates for us a list of auxiliary files for working with trees. The content of these files is directly dependent on the rules specified in the grammar. Therefore, with any grammar changes, these files must be regenerated. This means that we should not use them directly for writing code, otherwise we will lose our changes during the next iteration. We must create our classes and inherit them from the classes produced by ANTLR.

The code generated as a result of ANTLR helps to create a parse tree, which in turn is the fundamental tool for generating new code. The bottom line is to call the child nodes from left to right (assuming that this is the order of the source text) to return the formatted text that they represent.

If the node is a literal, we need to return its actual value. It's harder than it looks if you want the result to be accurate. In this case, you should consider the possibility of displaying floating-point numbers without loss of accuracy, as well as numbers in different number systems. For string literals, you should consider which type of quotes you support, and also not forget about the escape sequences of characters that should be escaped. Do you support code comments? Do you want to save the format of user input (spaces, blank lines), or do you want to bring the text to a more standard easier to read form. On the one hand, the second option will look more professional, on the other - the user of your compiler may not be satisfied in the end, since he probably wants to get the output format identical to his input. For these problems there is no universal solution, and they require a more detailed study of the scope of your compiler.

We will consider a more simplified example to concentrate on the basics of writing a compiler using ANTLR.

Install ANTLR

$ brew cask install java $ cd /usr/local/lib $ curl -O http://www.antlr.org/download/antlr-4.7.1-complete.jar $ export CLASSPATH=".:/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" $ alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool' $ alias grun='java org.antlr.v4.gui.TestRig'

To make sure that all of your settings were successful, type in the terminal:

 $ java org.antlr.v4.Tool

You should see information about the version of ANTLR, as well as help on commands.

Creating a project on Node.js using ANTLR

 $ mkdir js-runtime $ cd js-runtime $ npm init

Install JavaScript runtime , for this we need npm package antlr4 - JavaScript target for ANTLR 4 .

 $ npm i antlr4 --save

Download the ECMAScript.g4 grammar, which we will further feed to our ANTLR.

 $ mkdir grammars $ curl --http1.1 https://github.com/antlr/grammars-v4/blob/master/ecmascript/ECMAScript.g4 --output grammars/ECMAScript.g4

By the way, on the ANTLR site in the Development Tools section, you can find links to plugins for IDEs such as Intellij, NetBeans, Eclipse, Visual Studio Code, and jEdit. Color themes, semantic error checking, and visualization with diagrams make it easier to write and test grammars.

Finally, run ANTLR.

 $ java -Xmx500M -cp '/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH' org.antlr.v4.Tool -Dlanguage=JavaScript -lib grammars -o lib -visitor -Xexact-output-dir grammars/ECMAScript.g4

Add this script to package.json to always have access to it. With any change to the grammar file, we need to restart ANTLR to apply the changes.

Go to the lib folder and see that ANTLR has created a list of files for us. Let's look at three of these in more detail:

ECMAScriptLexer.js splits the stream of source code symbols into a stream of tokens in accordance with the rules specified in the grammar.
ECMAScriptParser.js from a stream of tokens builds an abstract connected tree structure called a parse tree.
ECMAScriptVisitor.js is responsible for traversing the constructed tree. Theoretically, we could independently process this tree using recursive traversal of descendants in depth. However, with a large number of types of nodes and complex logic of their processing, it is better to visit each type of node with its unique method, as does the visitor.

Note that the default ANTLR does not create * Visitor.js. The ANTLR deals with the listener in the standard way to traverse trees. If you want to generate and then use Visitor instead of Listener, you need to explicitly specify this using the ' -visitor ' flag, as we did in our script:

 $ java -Xmx500M -cp '/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH' org.antlr.v4.Tool -Dlanguage=JavaScript -lib grammars -o lib -visitor -Xexact-output-dir grammars/ECMAScript.g4

The essence of the work of both methods, as well as the result, is very similar, however with Visitor your code looks cleaner and you have more control over the transformation process. You can set in what order to visit the nodes of the tree and visit them at all. You can also modify existing nodes during a crawl and you do not need to store information about visited nodes. In the article Antlr4 - Visitor vs Listener Pattern, this topic is discussed in more detail and by examples.

Source code analysis and syntax tree construction

Let's finally get started on programming. You will find an example of similar code literally everywhere if you search by a combination of ANTLR and JavaScript.

 const antlr4 = require('antlr4'); const ECMAScriptLexer = require('./lib/ECMAScriptLexer.js'); const ECMAScriptParser = require('./lib/ECMAScriptParser.js'); const input = '{x: 1}'; const chars = new antlr4.InputStream(input); const lexer = new ECMAScriptLexer.ECMAScriptLexer(chars); lexer.strictMode = false; //   JavaScript strictMode const tokens = new antlr4.CommonTokenStream(lexer); const parser = new ECMAScriptParser.ECMAScriptParser(tokens); const tree = parser.program(); console.log(tree.toStringTree(parser.ruleNames));

What we have done now: using the lexer and parser that ANTLR generated for us, we brought our source line to a tree view in LISP format (this is the standard tree output format in ANTLR 4). According to the grammar, the root of the ECMAScript tree is the ` program rule, so we chose it as the starting point of the detour in our example. However, this does not mean that this node should always be the first. For the original string ` {x: 1} ` it would be perfectly fair to start a crawl with ` expressionSequence .

 const tree = parser.expressionSequence();

If you remove the formatting ` .toStringTree() `, then you can see the internal structure of the tree object.

Conventionally, the whole process of transforming one programming language into another can be divided into three stages:

Source code analysis.
Building a syntax tree.
Generate new code.

As we can see, thanks to ANTLR, we have significantly simplified our work and in a few lines of code covered two whole steps. Of course, we will come back to them, update the grammar, transform the tree, but nevertheless a good foundation has already been laid. Grammars that are downloaded from the repository may also be incomplete or even with errors. But perhaps they have already corrected the mistakes that you would have made if you started writing grammar from scratch yourself. The meaning of this is, do not blindly believe code written by other developers, but you can save time writing identical rules for the improvement of existing ones. Perhaps the next generation of young ANTLR students will already download your more perfect version of the grammar.

Code generation

Create a new directory ` codegeneration and a new file PythonGenerator.js in the project.

 $ mkdir codegeneration $ cd codegeneration $ touch PythonGenerator.js

As you have already guessed, as an example, we are transforming something from JavaScript into Python.

The generated ECMAScriptVisitor.js file contains a huge list of methods, each of which is automatically called during a tree traversal if the corresponding node is visited. And at this very moment we can change the current node. To do this, create a class that will inherit from ECMAScriptVisitor and override the methods we need.

 const ECMAScriptVisitor = require('../lib/ECMAScriptVisitor').ECMAScriptVisitor; /** * Visitor   ,  ANTLR *    JavaScript   Python  * * @returns {object} */ class Visitor extends ECMAScriptVisitor { /** *     * * @param {object} ctx * @returns {string} */ start(ctx) { return this.visitExpressionSequence(ctx); } } module.exports = Visitor;

In addition to methods that conform to the syntax rules of grammar, ANTLR also supports four special public methods :

visit () is responsible for traversing the tree.
visitChildren () is responsible for crawling sites.
visitTerminal () is responsible for bypassing tokens.
visitErrorNode () is responsible for bypassing invalid tokens.

We implement in our class the methods ` visitChildren() `, ` visitTerminal() ` and also ` visitPropertyExpressionAssignment() `.

 /* *     * * @param {object} ctx * @returns {string} */ visitChildren(ctx) { let code = ''; for (let i = 0; i < ctx.getChildCount(); i++) { code += this.visit(ctx.getChild(i)); } return code.trim(); } /** *    (  )    * * @param {object} ctx * @returns {string} */ visitTerminal(ctx) { return ctx.getText(); } /** *  ,      * * @param {object} ctx * @returns {string} */ visitPropertyExpressionAssignment(ctx) { const key = this.visit(ctx.propertyName()); const value = this.visit(ctx.singleExpression()); return `'${key}': ${value}`; }

The ` visitPropertyExpressionAssignment function visits the node responsible for assigning the value ` visitPropertyExpressionAssignment ` visitPropertyExpressionAssignment . In Python, the string parameters of an object must be enclosed in quotes, as opposed to JavaScript, where they are optional. In this case, this is the only modification that is required to transform a fragment of JavaScript code into Python code.

Add to index.js a call to our PythonGenerator.

 console.log('JavaScript input:'); console.log(input); console.log('Python output:'); const output = new PythonGenerator().start(tree); console.log(output);

As a result of the program execution, we see that our compiler has successfully coped with its task and converted a JavaScript object into a Python object.

We begin bypassing the tree from parent to child, gradually descending to its sheets. And then we go in the reverse order and substitute the formatted values for a higher level, thus replacing the entire chain of tree nodes with their textual representation corresponding to the syntax of a new programming language.

Add some debugging to our ` visitPropertyExpressionAssignment function.

 //    console.log(ctx.getText()); // -   console.log(ctx.getChildCount()); // console.log(ctx.propertyName().getText())  x console.log(ctx.getChild(0).getText()); // : console.log(ctx.getChild(1).getText()); // console.log(ctx.singleExpression().getText())  1 console.log(ctx.getChild(2).getText());

The descendants can be accessed both by name and by sequence number. The descendants are also nodes, so we used the ` .getText() ` method to get their text value, rather than the object representation.

Let's add ECMAScript.g4 and teach our compiler to recognize the ` Number keyword.

Re-generate the visitor to tighten the changes made in the grammar.

 $ npm run compile

Now back to our PythonGenerator.js file and add a list of new methods to it.

 /** *    `new` * * @param {object} ctx * @returns {string} */ visitNewExpression(ctx) { return this.visit(ctx.singleExpression()); } /** *      `Number` * * @param {object} ctx * @returns {string} */ visitNumberExpression(ctx) { const argumentList = ctx.arguments().argumentList(); // JavaScript Number   , //         if (argumentList === null || argumentList.getChildCount() !== 1) { return 'Error: Number   '; } const arg = argumentList.singleExpression()[0]; const number = this.removeQuotes(this.visit(arg)); return `int(${number})`; } /** *          * * @param {String} str * @returns {String} */ removeQuotes(str) { let newStr = str; if ( (str.charAt(0) === '"' && str.charAt(str.length - 1) === '"') || (str.charAt(0) === '\'' && str.charAt(str.length - 1) === '\'') ) { newStr = str.substr(1, str.length - 2); } return newStr; }

Python does not use the keyword ` visitNewExpression when invoking the constructor, so we remove it, or rather, in the ` visitNewExpression node, visitNewExpression continue to traverse the tree by excluding the first descendant. Then we replace the `Number` keyword with ` int , which is its equivalent in Python. Since ` Number is an Expression, we have access to its arguments through the ` .arguments() ` method.

Similarly, we can go through all the methods listed in ECMAScriptVisitor.js and transform all the JavaScript literals, symbols, rules, etc. into their equivalents for Python (or any other programming language).

Error processing

ANTLR by default validates the input to match the syntax specified in the grammar. In the event of a discrepancy, the console will display information about the problem that has occurred, and ANTLR will also return the string as it was able to recognize it. If you remove the `{x: 2}` colon from the source line, ANTLR will replace the unrecognized nodes with `undefined`.

We can influence this behavior and output detailed error information instead of a broken line. To begin with, we will create a new module in the root of our application, which is responsible for generating custom error types.

 $ mkdir error $ cd error $ touch helper.js $ touch config.json

I will not dwell on the specifics of the implementation of this module, since this is beyond the scope of the topic about compilers. You can copy the finished version below, or write your own version, more suitable for the infrastructure of your application.

All types of errors that you want to use in your application are listed in config.json.

 { "syntax": {"generic": {"message": "Something went wrong"}}, "semantic": { "argumentCountMismatch": { "message": "Argument count mismatch" } } }

Then error.js cycles through the list from the config and for each entry in it creates a separate class inherited from Error.

 const config = require('./config.json'); const errors = {}; Object.keys(config).forEach((group) => { Object.keys(config[group]).forEach((definition) => { //        const name = [ group[0].toUpperCase(), group.slice(1), definition[0].toUpperCase(), definition.slice(1), 'Error' ].join(''); const code = `E_${group.toUpperCase()}_${definition.toUpperCase()}`; const message = config[group][definition].message; errors[name] = class extends Error { constructor(payload) { super(payload); this.code = code; this.message = message; if (typeof payload !== 'undefined') { this.message = payload.message || message; this.payload = payload; } Error.captureStackTrace(this, errors[name]); } }; }); }); module.exports = errors;

Update the ` visitNumberExpression method and now instead of a text message marked visitNumberExpression we will throw out the error ` SemanticArgumentCountMismatchError , which is easier to catch and thus begin to distinguish between successful and problematic results of our application.

 const path = require('path'); const { SemanticArgumentCountMismatchError } = require(path.resolve('error', 'helper')); /** *      `Number` * * @param {object} ctx * @returns {string} */ visitNumberExpression(ctx) { const argumentList = ctx.arguments().argumentList(); if (argumentList === null || argumentList.getChildCount() !== 1) { throw new SemanticArgumentCountMismatchError(); } const arg = argumentList.singleExpression()[0]; const number = this.removeQuotes(this.visit(arg)); return `int(${number})`; }

Now we will deal with errors directly related to the work of ANTLR, namely those that occur during the parsing of the code. In the codegeneration directory we will create a new ErrorListener.js file and in it a class inherited from `antlr4.error.ErrorListener`.

 const antlr4 = require('antlr4'); const path = require('path'); const {SyntaxGenericError} = require(path.resolve('error', 'helper')); /** *        * * @returns {object} */ class ErrorListener extends antlr4.error.ErrorListener { /** *     * * @param {object} recognizer ,     * @param {object} symbol ,   * @param {int} line    * @param {int} column    * @param {string} message   * @param {string} payload   */ syntaxError(recognizer, symbol, line, column, message, payload) { throw new SyntaxGenericError({line, column, message}); } } module.exports = ErrorListener;

To override the standard error output method, we use two methods available to the ANTLR parser:

parser.removeErrorListeners () removes the standard ConsoleErrorListener.
parser.addErrorListener () adds a custom ErrorListener.

This should be done after creating the parser, but before calling it. The full code of the updated index.js will look like this:

 const antlr4 = require('antlr4'); const ECMAScriptLexer = require('./lib/ECMAScriptLexer.js'); const ECMAScriptParser = require('./lib/ECMAScriptParser.js'); const PythonGenerator = require('./codegeneration/PythonGenerator.js'); const ErrorListener = require('./codegeneration/ErrorListener.js'); const input = '{x 2}'; const chars = new antlr4.InputStream(input); const lexer = new ECMAScriptLexer.ECMAScriptLexer(chars); lexer.strictMode = false; //   JavaScript strictMode const tokens = new antlr4.CommonTokenStream(lexer); const parser = new ECMAScriptParser.ECMAScriptParser(tokens); const listener = new ErrorListener(); parser.removeErrorListeners(); parser.addErrorListener(listener); console.log('JavaScript input:'); console.log(input); console.log('Python output:'); try { const tree = parser.expressionSequence(); const output = new PythonGenerator().start(tree); console.log(output); // console.log(tree.toStringTree(parser.ruleNames)); } catch (error) { console.log(error); }

Thanks to the information that is now contained in the error object, we can decide how to correctly handle the exception, interrupt or continue program execution, create an informative log or, last but not least, write tests that support correct and incorrect source data for the compiler.

Conclusion

If someone offers you to write a compiler, immediately agree! This is very interesting and most likely significantly different from your usual programming tasks. We have considered only the simplest nodes in order to form an overview of the process of writing a JavaScript compiler using ANTLR. You can extend the functionality by validating the types of arguments passed, add Extended JSON or BSON support to the grammar, apply an identifier table to recognize such methods as toJSON (), toString (), getTimestamp () and so on. In fact, the possibilities are endless.

At the time of this writing, work on the MongoDB compiler is in its initial stages. It is likely that this code transformation approach will change over time, or more optimal solutions will appear, and then I may write a new article with more relevant information.

Today I am very passionate about writing a compiler and I want to preserve the knowledge gained in the process, which may be useful to someone else.

If you want to go deeper into the topic, I can recommend the following resources for reading:

The language of languages by Matt Might
Compiler Construction by Niklaus Wirth
Compilers Principles, technologies and tools of authors Alfred V. Aho, Monica S. Lam, Ravi Seti, Jeffrey D. Ulman
Tree structures processing and unified AST by Ivan Kochurkin, Positive Technologies
Lecture "Fundamentals of Compilers" by A. Aho, R. Seti, J. Ullman

And links to resources used in the article:

Super Tiny Compiler by James Kyle
Implementing a Simple Compiler on 25 Lines of JavaScript by Minko Gechev
How to write a simple interpreter in JavaScript by Peter Olson
Parsing in JavaScript: Tools and Libraries by Gabriele Tomassetti
The ANTLR Mega Tutorial by Gabriele Tomassetti
MySQL grammar on ANTLR 4 by Ivan Khudyashov
Treating tree structures and unified AST by Ivan Kochurkin
Theory and practice of source code parsing using ANTLR and Roslyn by Ivan Kochurkin
Antlr4 - Visitor vs Listener Pattern by Saumitra Srivastav

Thanks to my Compass team and, in particular, Anna Herlihy for mentoring and contributing to compiler writing, to reviewers Alex Komigin , Misha Tyulenev for recommendations on the structure of the article, and for lettering to the title illustration of Oksana Nalyvaiko .

The English version of the article on Medium: Compiler in JavaScript using ANTLR

Source: https://habr.com/ru/post/351906/

All Articles