We write the three-address code interpreter

Introduction

Good day.
I continue to write about near-compiler topics. This time I’ll touch on the issue of designing and creating an interpreter that works with syntax trees.
I recommend to get acquainted with the previous article - “We write LR (0) -analyzer. Simple words about the complicated ” , because in the interpreter I do not build the parser from scratch, but use the developments described in that article. Oh yeah, one more important moment - we will write in JavaScript. I am not a fan of this language, but I believe that this is the most convenient way for the public to see the result. Not everyone dares to download the unknown, and it’s still more difficult than just opening a page. Atypicality of the instrument is compensated by the “learning” of the example The speed of work is not important (100-150 lines is the limit, it seems to me that no one else wants to type in order to play with the interpreter), and the JS code is reasonably clear.

')

Three Address Code

First you need to decide what we interpret. I chose the three-address code.
What is it? It was assumed that this would be an intermediate language between the high-level code and the byte-code being compiled. For example, Sishnaya line:

a = (b + c) * (d - e)

Turn into this:

 f = b + c g = d - e a = f * g

The main postulate of this language is that each command contains no more than three operands. This makes it very easy to compile the intermediate code into bytecode.
What else except arithmetic operations will contain language?
First of all, these are declarations of variables and arrays. All variables are integers. By the way, the array index is also considered an operand, so we cannot use arrays in binary operations in order not to exceed the limit of 3 operands. Variables can be pointers (our pointer type is identically equal to integer).
It is logical to add the ability to input and output user data to the language. Also need control over the execution of the code (control flow) - conditional and unconditional transitions. Well, how to do without the stack, although it is not needed because we have an unlimited number of variables, but still, it is often quite convenient. Having thought out the language, you need to design a grammar. This is also not very difficult to do for our synthetic language.

 //   <line> = <vars_decl>|<command> //    <var_decl> = <ident>|<ident><left_sq_bracket><number><right_sq_bracket> //   <var_list> = <var_decl>|<var_list><comma><whitespace><var_decl> //    <vars_decl> = <var_keyword><whitespace><var_list> //  L: operation <command> = <ident><colon><whitespace><operation>|<operation> //  <operation> = <io>|<binary>|<control>|<assign>|<stack> // / in var, out var <io> = <in_keyword><whitespace><var_output_array>|<out_keyword><whitespace><var_input_array> //     - [], ,   <var_output> = <var_ptr>|<ident> <var_output_array> = <var_output>|<var_ar> //  <var_ar> = <ident><left_sq_bracket><var_numb><right_sq_bracket> <var_numb> = <ident>|<number> //  <var_ptr> = <asterisk><ident> //  <var_ref> = <ref><ident> //     - [], ,  ,   ,  <var_input_array> = <var_input>|<var_ar> <var_input> = <var_output>|<var_ref>|<number> //   ,  3  <binary> = <var_output><assign_sign><var_input><operator><var_input> <operator> = <plus>|<minus>|<asterisk>|<div>|<mod> //  <control> = <goto>|<if> //  <goto> = <goto_keyword><whitespace><ident> //  <if> = <if_keyword><whitespace><var_input><comp_operator><var_input><whitespace><goto> //  <assign> = <var_output_array><assign_sign><var_input>|<var_output><assign_sign><var_input_array> //    <stack> = <pop_keyword><whitespace><var_output_array>|<push_keyword><whitespace><var_input_array>

All other characters (which are in the right parts, but not in the left) are terminals, I will describe them a little lower.

Building a parse tree

As I wrote earlier, the stream of tokens comes to the input of the parser, this stream is produced by the lexical analyzer based on the input string.
For example for the string

 if a<b goto L

we get the next stream of tokens - LexKeywordIf , LexWhitespace , LexIdent , LexComp , LexIdent , LexWhitespace , LexKeywordGoto , LexWhitespace , LexIdent .
It is easy to trace the necessary lexemes from the description of non-terminals:

 <ident> = <letter>|<ident><letter>|<ident><digit> <number> = <digit>|<number><digit> <plus> = '+' <minus> = '-' <div> = '/' <mod> = '%' <comp_operator> = '<'|'>'|'>='|'<='|'=='|'!=' <whitespace> = ' ' <pop_keyword> = "pop" <push_keyword> = "push" <goto_keyword> = "goto" <in_keyword> = "in" <out_keyword> = "out" <var_keyword> = "var" <if_keyword> = "if" <left_sq_bracket> = '[' <right_sq_bracket> = ']' <colon> = ':' <asterisk> = '*' <assign_sign> = '=' <ref> = '&' <comma> = ',' $ = end of line

The code itself is extremely simple, I will give only in general form:

 function GetLex(stream) { var c = stream.get(); switch ( c ) { case ' ': return {type: LexTypes.LexWhitespace}; ... } if (c <= '9' && c >= '0') return {type: LexTypes.LexNumber, number: readNumber()}; if (c > 'z' || c < 'a') return {type: LexTypes.LexError}; var ident = readIdent(); if (ident == 'pop') return {type: LexTypes.LexKeywordPop}; ... return {type: LexTypes.LexIdent, ident: ident}; }

Now we have a stream of tokens.
We can pass it to the parser. Of course, I chose my LR (0) with pre-generated tables for the interpreter (I also wrote how to get them). It turned out 82 states and 1400 lines of formatted code under the table. The parser code itself has hardly changed with the same in C ++, it’s worth mentioning the construction of the tree itself.
For each ActionShift action, we create a leaf of the tree (a node that has no children) containing our token. And when we fold the rule, we associate the corresponding symbols of the right part into one bundle, which forms a new node, with the indication of the non-terminal of this node (which is the node within our grammar). It sounds a little difficult, but in fact everything is simple, an example of a tree for a binary operation:

Interpreter structure and launch preparation

The interpreter includes a tree for each line of the source code, the number of the current line (needed for error output, and as an IP - an intruction pointer, the address of the executable code), an array of tags, an array of variables, the stack itself for stack operations and a couple of objects helpers.

Before start it is necessary to reset all objects. Then we parse each line. Why do this? In order to create an array of tags, as in the team we can meet the transition to a label that is located further in the code than the line being executed. Therefore, we perform code analysis before the start of execution. Here we should also take into account such a moment that even if we get a syntactically incorrect string, this does not mean that we should immediately report about it.
When interpreting, we may not reach this site.

Code execution

I introduced two execution modes - step-by-step and simple. From the names it is obvious what each of them represents.
Code:

 function executeLine() { if (input_data.active) { setTimeout(executeLine, 100); return ; } if (input_data.error) return ; var line = lines[currentline]; var ret; if (!line.correct) { addlog('syntax error at position ' + line.pos); return ; } var general = line.tree.childs[0]; if (general.nonterm == NonTerminals.vars_decl) ret = interp_var_list(general.childs[0]); else ret = interp_operation(general.childs[0]); if (!ret) return ; currentline++; if (currentline == linecount) { addlog('Finish!', true); return ; } if (!step_by_step) setTimeout(executeLine, 0); }

input_data is the first helper object. In JavaScript, our own event model, and our events are not possible to use, so when we display the value input dialog (in command) we set the “freeze” flag - input_data.active, meaning that we need to wait for the dialogue to end, and only then proceed to execute the code . The second object, step_by_step, indicates that the step mode is used, if not, then we plan the function call again.
The execution itself is primitive - for almost every non-terminal we have a function that processes it. And we go down the tree collect non-terminals, getting their values.
Nonterminal processing example:

 function interp_control_goto(node) { var ident = node.childs[0].lex.ident; if (ident in labels) { currentline = labels[ident] - 1; return true; } if (!check_var(ident)) return false; var val = vars[ident]; if (val >= linecount) { addlog('invalid address of code'); return false; } currentline = val.value - 1; return true; }

You can see all this in the work here , a mirror .
It works best in FireFox, in Opera (keypress / keydown) some of the editor’s tweaks do not work, in Chrome and IE (selectionStart) it is similar. But the interpreter itself should work properly.
As an example, the recursive function of calculating factorial (overloaded with code, of course, but this is specifically to show the possibilities of the language).

Source: https://habr.com/ru/post/117173/

All Articles