Content
That which checks and determines the meaning of expressions in a programming language is in turn simply a program.
Hal Abelson and Gérald Sasman, "Structure and Interpretation of Computer Programs."
')
When the teacher asked the teacher about the nature of the Data and Control cycle, Yuan-Ma replied: "Think about a compiler compiling yourself."
Master Yuan-Ma, "Programming Book"Creating your own programming language is surprisingly easy (as long as you do not set any extraordinary goals) and quite instructive.
The main thing that I want to demonstrate in this chapter is that there is no magic in the construction of a language. It often seemed to me that some human inventions are so complex and abstruse that I will never understand them. However, after a little self-education and picking, such things often turn out to be quite ordinary.
We will build the programming language Egg (Egg). It will be small, simple, but powerful enough to express any calculations. It will also implement simple abstractions based on functions.
Parsing
What lies on the surface of the language is syntax, notation. A grammar analyzer, or parser, is a program that reads a piece of text and issues a data structure that describes the structure of the program contained in the text. If the text does not describe the correct program, the parser should complain and indicate an error.
Our language will have a simple and uniform syntax. In Egg, everything will be an expression. An expression can be a variable, a number, a string, or an application. Applications are used to call functions and constructions of if or while type.
To simplify parsing, strings in Egg will not support backslashes and similar things. A string is simply a sequence of characters that are not double quotes enclosed in double quotes. Number - a sequence of numbers. Variable names can consist of any characters that are not spaces and have no special meaning in the syntax.
Applications are written the same way as in JS - with the help of brackets after the expression and with any number of arguments in brackets, separated by commas.
do(define(x, 10), if(>(x, 5)), print(""), print(""))
The homogeneity of the language means that what is an operator in JS is applied in the same way as other functions. Since there is no concept of blocks in the syntax, we need the do construction to refer to several things that are performed sequentially.
The data structure that describes the program will consist of expression objects, each of which will have a type property that reflects the type of this expression and other properties that describe the content.
Expressions of type “value” represent strings or numbers. Their value property contains the string or number they represent. Expressions like “word” are used for identifiers (names). Such objects have a name property that contains the name of the identifier as a string. Finally, the “apply” clauses represent applications. They have an “operator” property, referencing the applicable expression, and an “args” property with an array of arguments.
The> (x, 5) part will be represented as:
{ type: "apply", operator: {type: "word", name: ">"}, args: [ {type: "word", name: "x"}, {type: "value", value: 5} ] }
This data structure is called a syntax tree. If you represent objects as points, and connections between them as lines, you will get a tree structure. The fact that expressions contain other expressions, which in turn may contain their own expressions, is similar to how branches branch.
Syntax tree structure
Compare this with the parser we wrote for the configuration file in Chapter 9, which had a simple structure: it divided the input into lines and processed them one by one. There were only a few forms that are allowed to take the string.
Here we need a different approach. Expressions are not divided into lines, and their structure is recursive. Application expressions contain other expressions. Fortunately, this problem is elegantly solved by using a recursive function that reflects the recursiveness of the language.
We define the parseExpression function, which accepts an input string and returns an object containing the data structure for the expression from the beginning of the string, along with the portion of the string left after parsing. When parsing subexpressions (such as an application argument), this function is called again, returning the expression of the argument along with the remaining text. That text may, in turn, contain more arguments, or be a closing parenthesis, completing the list of arguments.
The first part of the parser:
function parseExpression(program) { program = skipSpace(program); var match, expr; if (match = /^"([^"]*)"/.exec(program)) expr = {type: "value", value: match[1]}; else if (match = /^\d+\b/.exec(program)) expr = {type: "value", value: Number(match[0])}; else if (match = /^[^\s(),"]+/.exec(program)) expr = {type: "word", name: match[0]}; else throw new SyntaxError(" : " + program); return parseApply(expr, program.slice(match[0].length)); } function skipSpace(string) { var first = string.search(/\S/); if (first == -1) return ""; return string.slice(first); }
Because Egg allows any number of spaces in the elements, we need to constantly cut the spaces from the beginning of the line. SkipSpace copes with this.
Passing leading spaces, parseExpression uses three regulars to recognize three simple (atomic) elements supported by the language: strings, numbers, and words. The parser creates different structures for different types. If the input does not match any of the forms, this is not a valid expression, and it throws an error. SyntaxError is a standard error object that is created when you try to run an incorrect JavaScript program.
We can cut off the processed part of the program, and pass it, along with the expression object, to parseApply, which determines whether the expression is not an application. If so, it parses the list of arguments in brackets.
function parseApply(expr, program) { program = skipSpace(program); if (program[0] != "(") return {expr: expr, rest: program}; program = skipSpace(program.slice(1)); expr = {type: "apply", operator: expr, args: []}; while (program[0] != ")") { var arg = parseExpression(program); expr.args.push(arg.expr); program = skipSpace(arg.rest); if (program[0] == ",") program = skipSpace(program.slice(1)); else if (program[0] != ")") throw new SyntaxError(" ',' or ')'"); } return parseApply(expr, program.slice(1)); }
If the next program character is a non-opening parenthesis, then this is not an application, and parseApply simply returns the expression given to it.
Otherwise, it skips the opening bracket and creates a syntax tree object for this expression. It then recursively calls parseExpression to parse each argument until it encounters a closing bracket. Indirect recursion, parseApply and parseExpression cause each other.
Since the application itself can be an expression (multiplier (2) (1)), parseApply should, after parsing the application, invoke itself again by checking if there is another pair of braces.
That's all we need to parse Egg. We wrap this into a convenient parse function that checks that it reached the end of the line after parsing an expression (the Egg program is one expression), and this will give us the program's data structure.
function parse(program) { var result = parseExpression(program); if (skipSpace(result.rest).length > 0) throw new SyntaxError(" "); return result.expr; } console.log(parse("+(a, 10)"));
Works! It does not provide useful information in case of an error, and does not store the row and column numbers from which each expression begins, which could be useful in analyzing errors - but that’s enough for us.
Interpreter
And what should we do with the syntax tree of the program? Run it! The interpreter does this. You give it a syntax tree and an environment object that associates names with values, and it interprets the expression represented by the tree and returns the result.
function evaluate(expr, env) { switch(expr.type) { case "value": return expr.value; case "word": if (expr.name in env) return env[expr.name]; else throw new ReferenceError(" : " + expr.name); case "apply": if (expr.operator.type == "word" && expr.operator.name in specialForms) return specialForms[expr.operator.name](expr.args, env); var op = evaluate(expr.operator, env); if (typeof op != "function") throw new TypeError(" ."); return op.apply(null, expr.args.map(function(arg) { return evaluate(arg, env); })); } } var specialForms = Object.create(null);
The interpreter has code for each type of expression. For literals, it returns their value. For example, the expression 100 is interpreted into the number 100. We must check the variable if it is defined in the environment, and if so, ask for its value.
With applications harder. If this is a special form of the if type, we do not interpret anything, but simply pass the arguments along with the environment to the function that processes the form. If this is a simple call, we interpret the operator, verify that it is a function and call it with the result of the interpretation of the arguments.
To represent the values ​​of Egg functions, we will use simple values ​​of JavaScript functions. We will come back to this later when we define a special form of fun.
The recursive interpreter structure resembles a parser. Both reflect the structure of the language. One could integrate the parser into the interpreter and interpret it during parsing, but their separation makes the program more readable.
That's all you need to interpret Egg. Just like that. But without defining a few special forms and adding useful values ​​to the environment, you cannot do anything with this language.
Special forms
The specialForms object is used to define special Egg syntax. He matches words with functions that interpret these special forms. While it is empty. Let's add some forms.
specialForms["if"] = function(args, env) { if (args.length != 3) throw new SyntaxError(" if"); if (evaluate(args[0], env) !== false) return evaluate(args[1], env); else return evaluate(args[2], env); };
The if construct of the Egg language waits for three arguments. It calculates the first, and if the result is not false, it calculates the second. Otherwise, it calculates the third. This if is more like a ternary operator?:. It is an expression, not an instruction, and it gives the value, namely, the result of the second or third expression.
Egg differs from JavaScript in how it handles an if condition. It will not count a zero or empty string as false.
if is represented as a special form and not as an ordinary function, because the arguments of functions are evaluated before the call, and if must interpret one of two arguments, the second or third, depending on the value of the first.
The form for while is similar.
specialForms["while"] = function(args, env) { if (args.length != 2) throw new SyntaxError(" while"); while (evaluate(args[0], env) !== false) evaluate(args[1], env);
Another main part of the language is do, which performs all the arguments from top to bottom. Its value is the value returned by the last argument.
specialForms["do"] = function(args, env) { var value = false; args.forEach(function(arg) { value = evaluate(arg, env); }); return value; };
To create variables and give them values, we create a define form. She expects word as the first argument, and an expression that produces the value to assign to this word as the second. define, like everything, is an expression, so it must return a value. Let it return the assigned value (just like the operator = in JavaScript).
specialForms["define"] = function(args, env) { if (args.length != 2 || args[0].type != "word") throw new SyntaxError("Bad use of define"); var value = evaluate(args[1], env); env[args[0].name] = value; return value; };
Environment
The environment accepted by the interpreter is an object with properties, whose names correspond to the names of variables, and values ​​- to the values ​​of these variables. Let's define an environment object that represents a global scope.
To use the if construct, we must create boolean values. Since there are only two of them, special syntax is not needed for them. We simply make two variables true and false.
var topEnv = Object.create(null); topEnv["true"] = true; topEnv["false"] = false; , . var prog = parse("if(true, false, true)"); console.log(evaluate(prog, topEnv));
To support simple arithmetic operators and comparisons, we add several functions to the environment. To simplify the code, we will use the new Function to create a set of operator functions in a loop, rather than defining them all separately.
["+", "-", "*", "/", "==", "<", ">"].forEach(function(op) { topEnv[op] = new Function("a, b", "return a " + op + " b;"); });
Also useful is a way to output values, so we wrap console.log into a function and call it print.
topEnv["print"] = function(value) { console.log(value); return value; };
This gives us enough elementary tools for writing simple programs. The next run function provides a convenient way to write and run. It creates a fresh environment, parsit and parses the lines that we pass to it, as if they were one program.
function run() { var env = Object.create(topEnv); var program = Array.prototype.slice .call(arguments, 0).join("\n"); return evaluate(parse(program), env); }
Using Array.prototype.slice.call is a trick to turning an object like an array, such as an argument, into a real array, so we can apply join to it. It takes all the arguments passed to run, and considers that all of them are program lines.
run("do(define(total, 0),", " define(count, 1),", " while(<(count, 11),", " do(define(total, +(total, count)),", " define(count, +(count, 1)))),", " print(total))");
We have already seen this program several times - it calculates the sum of numbers from 1 to 10 in the Egg language. It is uglier than an equivalent JavaScript program, but not so bad for a language given in less than 150 lines of code.
Functions
A programming language without functions is a bad language.
Fortunately, it is easy to add the fun construct, which regards the last argument as the function body, and all the previous ones are the function argument names.
specialForms["fun"] = function(args, env) { if (!args.length) throw new SyntaxError(" "); function name(expr) { if (expr.type != "word") throw new SyntaxError(" word"); return expr.name; } var argNames = args.slice(0, args.length - 1).map(name); var body = args[args.length - 1]; return function() { if (arguments.length != argNames.length) throw new TypeError(" "); var localEnv = Object.create(env); for (var i = 0; i < arguments.length; i++) localEnv[argNames[i]] = arguments[i]; return evaluate(body, localEnv); }; };
Functions in Egg have their own local environment, just like in JavaScript. We use Object.create to create a new object that has access to variables in the external environment (of its prototype), but it can also contain new variables without changing the external scope.
The function created by the fun form creates its local environment and adds variable arguments to it. She then interprets the body in this environment and returns the result.
run("do(define(plusOne, fun(a, +(a, 1))),", " print(plusOne(10)))");
Compilation
We have built an interpreter. During interpretation, it works with the program representation created by the parser.
Compiling - adding one more step between the parser and running the program, which turns into a program into something that can be performed more efficiently, by doing most of the work in advance. For example, in well-organized languages, each time a variable is used, it is obvious which variable is accessed, even without starting the program. This can be used not to look for a variable by name every time it is accessed, but to directly call it from some predetermined area of ​​memory.
By tradition, the compilation also turns the program into machine code - a raw format suitable for execution by the processor. But each process of turning a program into a different kind is, in fact, a compilation.
You could create another Egg interpreter, which first turns the program into a JavaScript program, uses the new Function to invoke the JavaScript compiler, and returns the result. If implemented correctly, Egg would execute very quickly with a relatively simple implementation.
If this is interesting for you and you want to spend time on it, I encourage you to try to make such a compiler as an exercise.
Fraud
When we defined if and while, you might notice that they were simple wrappers around if and while in JavaScript. Values ​​in Egg are also common JavaScript values.
By comparing the JavaScript implementation of Egg with the amount of work needed to create a programming language directly in machine language, the difference becomes huge. Nevertheless, this example, I hope, gives you an idea of ​​the work of programming languages.
And when you need to do something, it will be more efficient to cheat than to do everything from scratch yourself. And although a toy language is no better than JavaScript, in some situations, writing your own language helps you do work faster.
Such a language is not obliged to remind ordinary PL. If JavaScript did not contain regular expressions, you could write your own parser and interpreter for such a sub-language.
Or imagine that you are building a giant dinosaur robot and you need to program its behavior. JavaScript is not the most efficient way to do this. You can instead choose a language like this:
behavior walk perform when destination ahead actions move left-foot move right-foot behavior attack perform when Godzilla in-view actions fire laser-eyes launch arm-rockets
This is commonly referred to as a domain-specific language (domain-specific language), a language specifically designed to work in a narrow direction. Such a language can be more expressive than a general purpose language, because it is designed to express precisely those things that need to be expressed in this area - and nothing more.
Exercises
Arrays
Add array support to Egg. To do this, add three functions to the main scope: array (...) to create an array containing the values ​​of the arguments, length (array) to return the length of the array, and element (array, n) to return the nth element.
Closures
The definition of fun allows functions in Egg to lock around the environment, and use local variables in the function body that are visible during the definition, just like in JavaScript functions.
The following program illustrates this: the function f returns a function that adds its argument to the argument f, that is, it needs access to the local scope within f to use the variable a.
run("do(define(f, fun(a, fun(b, +(a, b)))),", " print(f(4)(5)))");
Explain, using the form definition fun, which mechanism allows this construction to work.
Comments
It would be nice to have comments in Egg. For example, we could ignore the rest of the line, encountering the “#” character, as it does with “//” in JS.
Big changes in the parser do not have to. We will simply change skipSpace so that it skips comments, as if they are spaces - and in all places where skipSpace is called, comments will also be skipped. Make this change.
We fix the scope
Now we can assign a variable value only through define. This construct works both when assigning old variables and creating new ones.
This ambiguity leads to problems. If you are trying to assign a new value to a non-local variable, instead you define a local with the same name. (Some languages ​​do that, but it has always seemed to me a foolish way of working with the scope).
Add a set form, similar to define, which assigns a new value to a variable, updating the variable in the external scope if it is not set in the local one. If the variable is not set at all, throw a ReferenceError (another standard error type).
The technique of presenting scopes as simple objects, which until now has been convenient, will now disturb you. You may need the Object.getPrototypeOf function to return the prototype of the object. Also remember that the scope is not inherited from the Object.prototype, so if you need to call hasOwnProperty on them, you will have to use this clumsy construct:
Object.prototype.hasOwnProperty.call(scope, name);
This calls the hasOwnProperty method of the Object prototype and then calls it on the scope object.
specialForms["set"] = function(args, env) {