Operator preceding descending parser

Douglas Crockford

2007-02-21

Introduction

In 1973, at the first annual symposium “ Principles of Programming Languages ” ( Principles of Programming Languages Symposium ), Von Pratt introduced the article “ Top-down Operator Precedence Parser ”. In this article, Pratt described a syntactic parsing method that combines the best of recursive descent and Floyd's operator precedence method. The Pratt method is very similar to recursive descent, but requires less code and runs much faster. Pratt said that his method is easy to learn, implement and use, extremely effective and very flexible. Due to its dynamism, it can be used for extensible languages.

But if the method is really flawless, why do compiler developers ignore it to this day? In his article, Pratt suggested that BNF grammars and their numerous modifications, as well as related theorems and automata, occupied a niche earlier and now impede the development of the theory of syntactic analysis in other directions.
')
There is another explanation: this method is most effective for dynamic, functional programming languages and it is much more difficult to use it in a static, procedural language. Pratt illustrates his article with Lisp as an example and effortlessly builds syntax trees along a stream of tokens. But parsing methods are not particularly appreciated in the community of Lisp programmers who preach a Spartan rejection of syntax. Since the creation of Lisp, many attempts have been made to give this language a rich ALGOL style syntax: Pratt's CGOL , Lisp-2 , MLISP , Dylan , Interlisp's Clisp , the original McCarthy M-expressions, and so on. But they all failed. For the Lisp community, consistency of programs and data turned out to be more important than expressive syntax. On the other hand, the vast majority of programmers love syntax, so Lisp itself has never become popular. The Pratt method needs a dynamic language, but the dynamic language community has historically not used the syntax that has been so conveniently implemented by the Pratt method.

Javascript

The situation has changed with the advent of javascript. JavaScript is a dynamic, functional language, but syntactically it belongs to the C family. It is a dynamic language, and its community loves syntax.

JavaScript is also object-oriented. Pratt's article anticipated the object-oriented approach, but it lacked expressive notation for that. JavaScript is the ideal language for implementing Pratt's method. I’ll show you how to create JavaScript parsers quickly and efficiently.

One article is not enough to deal with JavaScript completely and, perhaps, we don’t want to, because in this language the devil will break his leg. But there are brilliant sides in it, quite worthy of consideration. We will create a parser that can handle Simplified JavaScript. And we will write this parser in Simplified JavaScript. Simplified JavaScript is all the best of the language, including:

Functions as first class objects. In Simplified JavaScript, functions are lambda expressions with a lexical scope.
Dynamic objects with prototype inheritance. There are no classes. We can add a new member to the object using regular assignment. An object can inherit members of another object.
Literals for objects and arrays. This notation is very convenient for creating new objects and arrays. JavaScript literals provided inspiration for the JSON format .

We take advantage of JavaScript prototypes to create lexeme objects that inherit from characters. Our implementation will need the Object.create method (creates a new object that inherits the members of an existing object) and a lexical analyzer, which, on the input line, creates an array of lexemes objects. Moving along this array, we will build a syntax parsing tree.

table of symbols

Each token, for example, an operator or identifier, will be inherited from a character. We will write all of our possible characters (which define the types of language tokens) into the symbol_table object.

 var symbol_table = {};

The original_symbol object is the prototype for all other characters. His methods are usually overwhelmed. The meaning of the nud and led methods, as well as the binding power, will be explained below in the “Priorities” section.

 var original_symbol = { nud: function () { this.error("Undefined."); }, led: function (left) { this.error("Missing operator."); } };

Define the function that creates the characters. It accepts a symbol identifier ( id ) and a binding strength (optional bp parameter, the default is zero), and returns a symbol object for the given id . If the symbol is already in symbol_table , the function returns it. Otherwise, it creates a new character that is inherited from original_symbol , stores it in the symbol table, and returns. The character object initially contains the id , value, left binding power ( lbp ), and everything that came from original_symbol .

 var symbol = function (id, bp) { var s = symbol_table[id]; bp = bp || 0; if (s) { if (bp >= s.lbp) { s.lbp = bp; } } else { s = Object.create(original_symbol); s.id = s.value = id; s.lbp = bp; symbol_table[id] = s; } return s; };

We declare frequently occurring delimiters and trailing characters.

 symbol(":"); symbol(";"); symbol(","); symbol(")"); symbol("]"); symbol("}"); symbol("else");

The symbol (end) indicates the end of the stream of tokens. The symbol (name) is a prototype for new names, for example, variable names. I included brackets in their ids to avoid possible matches with custom lexemes.

 symbol("(end)"); symbol("(name)");

Lexemes

We assume that the source text has already been converted into an array of tokens primitive lexemes containing the type field ( "name" , "string" , "number" , or "operator" ), and the value field (string or number). The variable token always refers to the current token.

 var token;

The advance function creates a new token object from the following primitive token and assigns it to the variable token . If the optional id parameter is passed, the function checks that the token has the corresponding identifier. The prototype of the new lexeme object is a symbol (name) in the current scope or a symbol from the symbol table. The arity field of the new lexeme will be either "name" , or "literal" , or "operator" . Subsequently, when we learn more about the role of a token in a program, this value may change to "binary" , "unary" or "statement" .

 var advance = function (id) { var a, o, t, v; if (id && token.id !== id) { token.error("Expected '" + id + "'."); } if (token_nr >= tokens.length) { token = symbol_table["(end)"]; return; } t = tokens[token_nr]; token_nr += 1; v = t.value; a = t.type; if (a === "name") { o = scope.find(v); } else if (a === "operator") { o = symbol_table[v]; if (!o) { t.error("Unknown operator."); } } else if (a === "string" || a === "number") { a = "literal"; o = symbol_table["(literal)"]; } else { t.error("Unexpected token."); } token = Object.create(o); token.value = v; token.arity = a; return token; };

Area of visibility

Most languages have a notation for defining new characters (for example, variable names). In simple language, when we meet a new word, we automatically define it and put it in the symbol table. In more complex languages, there is a scope that allows a programmer to control access to a variable and its lifetime.

The scope is the part of the program in which the variable is defined and accessible. Scopes can be nested. A variable defined in a certain scope is not visible outside.

We will store the current scope in a separate variable scope .

 var scope;

The original_scope object is the prototype for all objects that represent the scope. It contains a define method that allows you to define new variables. The define method converts a name token into a token variable. It gives an error if the variable has already been defined in scope or the name is a reserved word.

 var itself = function () { return this; }; var original_scope = { define: function (n) { var t = this.def[n.value]; if (typeof t === "object") { n.error(t.reserved ? "Already reserved." : "Already defined."); } this.def[n.value] = n; n.reserved = false; n.nud = itself; n.led = null; n.std = null; n.lbp = 0; n.scope = scope; return n; },

The find method is used to find a definition by name. He begins the search from the current scope and goes, if necessary, up the chain, ending with a table of characters. If the definition could not be found, it returns symbol_table["(name)"] . The method checks that the value with the given name is not equal to undefined (which would mean referring to an undeclared name) and that this is not a function (which would indicate a conflict with the inherited method).

  find: function (n) { var e = this, o; while (true) { o = e.def[n]; if (o && typeof o !== 'function') { return e.def[n]; } e = e.parent; if (!e) { o = symbol_table[n]; return o && typeof o !== 'function' ? o : symbol_table["(name)"]; } } },

The pop method closes the scope by replacing it with its parent.

  pop: function () { scope = this.parent; },

The reserve method is used to indicate that a given name is a reserved word in the current scope.

  reserve: function (n) { if (n.arity !== "name" || n.reserved) { return; } var t = this.def[n.value]; if (t) { if (t.reserved) { return; } if (t.arity === "name") { n.error("Already defined."); } } this.def[n.value] = n; n.reserved = true; } };

Now we need a strategy for processing reserved words. In some languages, words describing the structure of a program (for example, if ) are reserved and cannot be used as variable names. The flexibility of our parser allows you to achieve more. For example, we can say that in any function any given name can be used either as a language operator or as a variable name. We will reserve words locally only after they have been used as reserved. This simplifies the life of the creator of the language, because the addition of new keywords will not break existing programs, and simplifies the life of programmers, because they do not interfere with unnecessary restrictions on the use of names.

When we want to create a new scope for a function or block, we call the new_scope function, which creates a new prototype instance, original_scope .

 var new_scope = function () { var s = scope; scope = Object.create(original_scope); scope.def = {}; scope.parent = s; return scope; };

A priority

The lexeme objects contain methods that allow you to make decisions about priorities, select other lexemes and build trees (and in more complex projects you can also check types, optimize and generate code). The main task of priorities is as follows: for a given operand between two operators, determine whether the operand belongs to the left operator or to the right operator:

d A e B f

If A and B are operators, to which of them does operand e belong? In other words, we have to choose between (d A e) B f
and d A (e B f) .

Ultimately, the main complexity of the syntactic analysis is to resolve this uncertainty. The objects of the tokens from our method store the binding strength (or priority level), and the simple methods nud (null denotation, null-match) and led (left denotation, left-match). The nud does not matter which lexemes are to the left, and the led method is important. The nud method nud used by values (variables and literals) and prefix operators. The led method is used by infix and postfix operators. A lexeme can have both nud and led methods nud . For example, minus ( - ) can be either prefix (changing the sign of a number) or infix (subtraction), so both methods are defined for it.

Our parser uses the following binding forces:

0	operators without communication `;` etc.
ten	assignment operators: `=` , etc.
20	`?:`
thirty	`\|\| &&`
40	comparison operators: `===` , etc.
50	`+ -`
60	`* /`
70	unary operators:! etc.
80	`. [ (`

Expressions

The main component of the Pratt method is the function expression . It takes as input the right binding force, which indicates how actively the expression is associated with the tokens on the right.

 var expression = function (rbp) { var left; var t = token; advance(); left = t.nud(); while (rbp < token.lbp) { t = token; advance(); left = t.led(left); } return left; }

The expression function calls the nud method of the current token token , which processes literals, variables, and prefix operators. Then, as long as the right binding power is less than the left binding power of the next lexeme, the led method is invoked. This method handles infix and postfix operators. The process can be recursive, since the nud and led methods themselves can call expression .

Infix operators

The + operator is an infix operator, so it has a led method that turns the lexeme object into a tree, whose two branches ( first and second ) are operands to the left and right of the + sign. The led method accepts the left operand as a parameter and finds the right operand by calling expression .

 symbol("+", 50).led = function (left) { this.first = left; this.second = expression(50); this.arity = "binary"; return this; };

The * symbol is similar to + except for the id value and the binding strength. It is more tightly bound to operands, so its binding power is higher.

 symbol("*", 60).led = function (left) { this.first = left; this.second = expression(60); this.arity = "binary"; return this; };

Not all infix operators will look the same, but many will. Therefore, we can simplify our work by defining an infix function that helps us create infix operators. The infix function takes id , binding strength, and optionally led function. If the function is not specified, infix creates the default led function, which is suitable in most cases.

 var infix = function (id, bp, led) { var s = symbol(id, bp); s.led = led || function (left) { this.first = left; this.second = expression(bp); this.arity = "binary"; return this; }; return s; }

Now we can describe infix operators in a more declarative style:

 infix("+", 50); infix("-", 50); infix("*", 60); infix("/", 60);

=== is an exact comparison operator in javascript.

 infix("===", 40); infix("!==", 40); infix("<", 40); infix("<=", 40); infix(">", 40); infix(">=", 40);

The ternary operator accepts three expressions separated by ? and : This is not a regular infix operator, so you have to set the led function here.

 infix("?", 20, function (left) { this.first = left; this.second = expression(0); advance(":"); this.third = expression(0); this.arity = "ternary"; return this; });

The dot operator ( . ) Is used to refer to a member of an object. To his right there must be a name, but it will be used as a literal.

 infix(".", 80, function (left) { this.first = left; if (token.arity !== "name") { token.error("Expected a property name."); } token.arity = "literal"; this.second = token; this.arity = "binary"; advance(); return this; });

The operator [ used to access a member of an object or an element of an array dynamically. The expression on the right should be followed by a closing bracket ] .

 infix("[", 80, function (left) { this.first = left; this.second = expression(0); this.arity = "binary"; advance("]"); return this; });

All these infix operators are left associative. We can also create right associative operators (for example, logical || and &&), reducing the right binding force.

 var infixr = function (id, bp, led) { var s = symbol(id, bp); s.led = led || function (left) { this.first = left; this.second = expression(bp - 1); this.arity = "binary"; return this; }; return s; }

The && operator returns the first operand if it is false, otherwise it returns the second. Operator || returns the first operand if it is true, otherwise returns the second. False value is 0 , the empty string is "" , false or null . Any other value (including any object) is considered true.

 infixr("&&", 30); infixr("||", 30);

Prefix Operators

The code that we used for right-associative infisk operators can be adapted for prefix operators. Prefix operators are right associative. The prefix has no left binding power, because it does not bind to anything on the left. Sometimes prefix operators are reserved words.

 var prefix = function (id, nud) { var s = symbol(id); s.nud = nud || function () { scope.reserve(this); this.first = expression(70); this.arity = "unary"; return this; }; return s; } prefix("-"); prefix("!"); prefix("typeof");

The nud method for the bracket ( calls advance(")") to find the matching bracket ) . The lexeme itself ( misses the syntax tree, because nud returns only the contents of the brackets.

 prefix("(", function () { var e = expression(0); advance(")"); return e; });

Assignment operators

To define assignment operators, we could use the infixr function. But better we will write a separate assignment function, because we want to do something else: make sure that the expression on the left is lvalue, that is, we can assign something to it, and set the field of assignment so that we can easily find all assignments in the future .

 var assignment = function (id) { return infixr(id, 10, function (left) { if (left.id !== "." && left.id !== "[" && left.arity !== "name") { left.error("Bad lvalue."); } this.first = left; this.second = expression(9); this.assignment = true; this.arity = "binary"; return this; }); }; assignment("="); assignment("+="); assignment("-=");

Notice: we implemented something like inheritance. The assignment function returns the result of the infixr call, and infixr result of the symbol call.

Constants

The constant function creates language constants. The nud method turns a name token into a literal token.

 var constant = function (s, v) { var x = symbol(s); x.nud = function () { scope.reserve(this); this.value = symbol_table[this.id].value; this.arity = "literal"; return this; }; x.value = v; return x; }; constant("true", true); constant("false", false); constant("null", null); constant("pi", 3.141592653589793);

A character (literal) is a prototype for all string and numeric literals. The nud method of a nud token returns the token itself.

 symbol("(literal)").nud = itself;

suggestions

In the original, the Pratt method was created for functional languages, where there are only expressions. Most popular languages use statements (statements), which are not so difficult to invest in each other as expressions. We can easily process and proposals, if we add a new method to the tokens: std (statement denotation, matching the proposal). The std method is similar to nud , but it is called only at the beginning of a sentence.

The statement function parses one statement . If the current token contains the std method, the token is reserved and this method is called. Otherwise, we believe that a sentence is an expression ending in a semicolon. For reliability, we will consider expressions that are not assignments or function calls for error.

 var statement = function () { var n = token, v; if (n.std) { advance(); scope.reserve(n); return n.std(); } v = expression(0); if (!v.assignment && v.id !== "(") { v.error("Bad expression statement."); } advance(";"); return v; };

The statements function statements sentences until it encounters a token (end) or } , which marks the end of a block. The function returns a sentence, an array of sentences, or null if no sentences are found.

 var statements = function () { var a = [], s; while (true) { if (token.id === "}" || token.id === "(end)") { break; } s = statement(); if (s) { a.push(s); } } return a.length === 0 ? null : a.length === 1 ? a[0] : a; };

The stmt function is used to add sentence symbols to the symbol table. It takes the id and the std function as parameters.

 var stmt = function (s, f) { var x = symbol(s); x.std = f; return x; };

A block clause is a list of sentences in curly braces for which a new scope is defined. In normal JavaScript, there are no scope for blocks, but they will be in our Simplified JavaScript.

 stmt("{", function () { new_scope(); var a = statements(); advance("}"); scope.pop(); return a; });

The block function parses a block.

 var block = function () { var t = token; advance("{"); return t.std(); };

The var clause defines one or more variables in the current block. The variable name can be followed by the equal sign = and the initial value of the variable.

 stmt("var", function () { var a = [], n, t; while (true) { n = token; if (n.arity !== "name") { n.error("Expected a new variable name."); } scope.define(n); advance(); if (token.id === "=") { t = token; advance("="); t.first = n; t.second = expression(0); t.arity = "binary"; a.push(t); } if (token.id !== ",") { break; } advance(","); } advance(";"); return a.length === 0 ? null : a.length === 1 ? a[0] : a; });

The while clause defines a loop. It includes an expression in brackets and a block.

 stmt("while", function () { advance("("); this.first = expression(0); advance(")"); this.second = block(); this.arity = "statement"; return this; });

The if clause creates a conditional construct. If the block is followed by an else symbol, then we also analyze the next block or the next if .

 stmt("if", function () { advance("("); this.first = expression(0); advance(")"); this.second = block(); if (token.id === "else") { scope.reserve(token); advance("else"); this.third = token.id === "if" ? statement() : block(); } else { this.third = null; } this.arity = "statement"; return this; });

The break clause is used to end the loop ahead of time

 stmt("break", function () { advance(";"); if (token.id !== "}") { token.error("Unreachable statement."); } this.arity = "statement"; return this; });

The return clause is used to exit a function. It may contain an optional expression (the return value of the function).

 stmt("return", function () { if (token.id !== ";") { this.first = expression(0); } advance(";"); if (token.id !== "}") { token.error("Unreachable statement."); } this.arity = "statement"; return this; });

Functions

Functions are executable values. A function may have an optional name (so that it can call itself recursively), a list of parameter names in brackets, and a body — a list of sentences in curly braces. The function has its own scope.

 prefix("function", function () { var a = []; new_scope(); if (token.arity === "name") { scope.define(token); this.name = token.value; advance(); } advance("("); if (token.id !== ")") { while (true) { if (token.arity !== "name") { token.error("Expected a parameter name."); } scope.define(token); a.push(token); advance(); if (token.id !== ",") { break; } advance(","); } } this.first = a; advance(")"); advance("{"); this.second = statements(); advance("}"); this.arity = "function"; scope.pop(); return this; });

Functions are performed using the operator ( . When calling, you can specify a number of arguments. We will check the left operand to cut off situations where the value on the left cannot be a function.

 infix("(", 80, function (left) { var a = []; if (left.id === "." || left.id === "[") { this.arity = "ternary"; this.first = left.first; this.second = left.second; this.third = a; } else { this.arity = "binary"; this.first = left; this.second = a; if ((left.arity !== "unary" || left.id !== "function") && left.arity !== "name" && left.id !== "(" && left.id !== "&&" && left.id !== "||" && left.id !== "?") { left.error("Expected a variable name."); } } if (token.id !== ")") { while (true) { a.push(expression(0)); if (token.id !== ",") { break; } advance(","); } } advance(")"); return this; });

The this symbol is a special variable. When calling a method, it contains a reference to the object.

 symbol("this").nud = function () { scope.reserve(this); this.arity = "this"; return this; };

Literals for objects and arrays

An array literal is a set of expressions in square brackets, separated by commas. Each expression is evaluated, and all results form a new array.

 prefix("[", function () { var a = []; if (token.id !== "]") { while (true) { a.push(expression(0)); if (token.id !== ",") { break; } advance(","); } } advance("]"); this.first = a; this.arity = "unary"; return this; });

The literal for an object is a set of pairs in curly braces, separated by commas. A pair consists of a key and an expression, which are separated by a colon ( : . The key is a literal or a name that is interpreted as a literal.

 prefix("{", function () { var a = []; if (token.id !== "}") { while (true) { var n = token; if (n.arity !== "name" && n.arity !== "literal") { token.error("Bad key."); } advance(); advance(":"); var v = expression(0); v.key = n.value; a.push(v); if (token.id !== ",") { break; } advance(","); } } advance("}"); this.first = a; this.arity = "unary"; return this; });

What to think and what to do

The created tree can be transferred to a code generator or interpreter. To create a tree requires a minimum of calculations. And, as we see, it takes not so much effort from a programmer to write such a parser.

We can add an operation code parameter to the infix function to help the code generator. We can also pass in additional methods for convolving constants and generating code.

We can add other proposals (for example, for, switchand try), the label, more code to check for errors, error recovery, and a bunch of new operators. We can add task and type inference.

We can make our language extensible. We can allow the programmer to declare new operators and sentences as easily as declare new variables.

Test yourself the parser described in this article.

You can find another example of using this method of parsing in the JSLint project .

From the translator: picked the source code for JSLint and decided that the Russian translation of this wonderful article would not hurt. The parser in JSLint is really exceptionally clear, powerful and easily extensible. Thank you very muchKVie for editing the translation.

This article was published as part of the book Beautiful Code (chapter nine). Russian translation of the entire book in electronic form can be purchased on the website of the publishing house "Peter" .

Source: https://habr.com/ru/post/227241/

All Articles