(How to write (Lisp) interpreter (in Python))

Translation of the article "(How to Write a (Lisp) Interpreter (in Python))" by Peter Norvig. In this article, Peter quite briefly, but succinctly, describes the principle of the interpreters and shows how you can write a very tiny (only 90 lines in Python) interpreter of a subset of the Scheme language. The translation is published with the permission of the author.

Peter Norvig (English Peter Norvig) - American scientist in the field of computing. Currently working as research director (previously - director for search quality) at Google. Previously, he was the head of the Computer Engineering Division at the Ames NASA Research Center, where he led a staff of two hundred NASA scientists in the fields of anatomy and robotics, automated software development and data analysis, neuroengineering, collective systems development, and decision-making based on simulation.

This article has two goals: to describe in general the implementation of programming language interpreters, as well as to demonstrate the implementation of a subset of the Scheme language , a Lisp dialect, using Python . I called my interpreter Lispy ( lis.py ). A year ago, I showed how to write the Scheme interpreter in Java , as well as in Common Lisp . This time, the goal is to demonstrate, briefly and as accessible as possible, what Alan Kay called " Maxwell's Equations of Software".
')

Lispy syntax and semantics

Most programming languages have different syntax conventions (keywords, infix operators, brackets, operator precedence, dot notation, semicolons, etc.), but like a member of the Lisp family of languages, Scheme’s entire syntax is based on lists in parentheses. prefix notation. This form may seem unfamiliar, but it has advantages in simplicity and consistency. Some people joke that “Lisp” means “Lots of Irritating Silly Parentheses” , I think it means “Lisp is syntactically clean” (Lisp Is Syntactically Pure). See:

Java		Scheme
`if ( x.val() > 0 ) {<br> z = f(a * x.val() + b); <br>}`		`(if (> (val x) 0)<br> (set! z (f (+ (* a (val x)) b))))`

Note that the exclamation mark is not a special character in Scheme, it is just part of the name " set! ". Only brackets are special characters. A list, such as (set! xy) , with a special keyword at the beginning is called a special form in Scheme; The beauty of the language is that we need only 6 special forms, plus 3 syntactic constructions - variables, constants and procedure calls:

The form	Syntax	Semantics and Example
variable reference	var	A symbol is interpreted as a variable name, its value is a variable value. Example: `x`
constant literal	number	The number means itself. Examples: `12` or `-3.45e+6`
quote	`(quote` exp `)`	Returns exp literally without evaluating it. Example: `(quote (abc)) ⇒ (abc)`
conditional	`(if` test conseq alt `)`	Calculate test ; if true, calculate and return conseq ; otherwise calculate and return alt . Example: `(if (< 10 20) (+ 1 1) (+ 3 3)) ⇒ 2`
assignment	`(set!` var exp `)`	Calculate exp and assign the value to the variable var , which should be previously defined (using `define` or as a procedure parameter). Example: `(set! x2 (* xx))`
definition	`(define` var exp `)`	Define a new variable in the nested environment itself and give it the value of the expression exp . Examples: `(define r 3)` or `(define square (lambda (x) (* xx)))` .
function	`(lambda (` var ... `)` exp `)`	Create a function with a parameter (s) with the names var ... and an expression in the body. Example: `(lambda ( r ) (* 3.141592653 (* rr)))`
sequencing	`(begin` exp ... `)`	Calculate each expression from left to right and return the last value. Example: `(begin (set! x 1) (set! x (+ x 1)) (* x 2)) ⇒ 4`
procedure call	`(` proc exp ... `)`	If proc is not `if, set!, define, lambda, begin,` or `quote` then it is considered as a procedure. It is calculated according to the same rules as defined here. All expressions are evaluated, and then the function is called with a list of expressions in the argument. Example: `(square 12) ⇒ 144`

In this table, var must be an identifier symbol, such as x or square ; number - must be an integer or floating point number, while the remaining italicized words can be any expression. The notation exp ... means zero or more reps of exp .

To learn more about the Scheme, refer to the following great books ( Friedman and Fellesein , Dybvig , Queinnec , Harvey and Wright or Sussman and Abelson ), video ( Abelson and Sussman ), introductions ( Dorai , PLT , or Neller ), or a reference guide .

What the language interpreter does

The interpreter consists of two parts:

Parsing: the parsing component receives a program as a sequence of characters as input, checks it in accordance with the syntax rules of the language and translates the program into an internal representation. In a simple interpreter, the internal representation looks like a tree structure that accurately reflects the structure of nested operators and expressions in the program. In a language translator called a compiler , the internal representation is a sequence of instructions that can be directly executed by a computer. As Steve Yegge said , “If you don’t know how compilers work, then you don’t know how computers work . ” Egge described 8 scenarios that can be solved using compilers (or, equivalently, interpreters). The Lispy parser is implemented using the parse function.
Execution: the internal representation is processed in accordance with the semantic rules of the language, thereby the calculations are performed. Execution is performed by the eval function (note, it hides the built-in Python function).

Here is a picture of the process of interpretation and interactive session, showing how parse and eval handle a short program:

>> program = "(begin (define r 3) (* 3.141592653 (* r r)))"

>>> parse ( program )
[ ' begin ', [ ' define ', 'r', 3 ] , [ ' * ', 3.141592653 , [ ' * ', 'r', 'r' ] ] ]

>>> eval ( parse ( program ) )
28.274333877

We use the simplest internal representation, where the lists, numbers, and symbols of Scheme are represented by lists, numbers, and Python strings, respectively.

Execution: `eval`

Each of the nine cases in the table above has one, two, or three lines of code. To define eval nothing else is needed:

def eval ( x, env = global_env ) :
"Evaluate an expression in an environment."
if isa ( x, Symbol ) : # variable reference
return env. find ( x ) [ x ]
elif not isa ( x, list ) : # constant literal
return x
elif x [ 0 ] == 'quote' : # (quote exp)
( _, exp ) = x
return exp
elif x [ 0 ] == 'if' : # (if test conseq alt)
( _, test , conseq, alt ) = x
return eval ( ( conseq if eval ( test , env ) else alt ) , env )
elif x [ 0 ] == 'set!' : # (set! var exp)
( _, var, exp ) = x
env. find ( var ) [ var ] = eval ( exp, env )
elif x [ 0 ] == 'define' : # (define var exp)
( _, var, exp ) = x
env [ var ] = eval ( exp, env )
elif x [ 0 ] == 'lambda' : # (lambda (var *) exp)
( _, vars , exp ) = x
return lambda * args: eval ( exp, env ( vars , args, env ) )
elif x [ 0 ] == 'begin' : # (begin exp *)
for exp in x [ 1 : ] :
val = eval ( exp, env )
return val
else : # (proc exp *)
exps = [ eval ( exp, env ) for exp in x ]
proc = exps. pop ( 0 )
return proc ( * exps )

isa = isinstance

Symbol = str

That's all that eval needs! .. well, except for the media. The environment is simply the mapping of characters to the values belonging to them. New symbol / value associations are added using define or ( lambda expression).

Let's take a look at an example of what happens when we declare and call a function in Scheme (the hint lis.py> means that we interact with the Lisp interpreter, not Python):

lis.py > ( define area ( lambda ( r ) ( * 3.141592653 ( * r r ) ) ) )
lis.py > ( area 3 )
28.274333877

When we execute (lambda ( r ) (* 3.141592653 (* rr))) , we fall into the elif x[0] == 'lambda' branch in eval , assign the three elements of the list x ( and signal an error if the length x not 3). We create a new function that, when called, will compute the expression ['*', 3.141592653 ['*', 'r', 'r']] in the environment formed by a bunch of formal parameters of the function (in this case, only one parameter r ) with arguments in the function call, as well as using the current environment for any variables that are not parameters (in the example, the variable * ). The value of this newly-minted function is then assigned as the value of the area in global_env .

Now, what happens when we calculate (area 3) ? Because area is not a special form of characters, so this is a function call (the last else: in eval ), and the entire list of expressions is executed at once. The area calculation returns the function we just created; calculating 3 returns 3 . Then (according to the last line of eval ) the newly created function with the list of arguments [3] called. This means exp , which is ['*', 3.141592653 ['*', 'r', 'r']] in an environment where r is 3 and the external environment is global and therefore * is a multiplication function.

Now we are ready to explain the details of the Env class:

class Env ( dict ) :
"An environment: a dict of {'var': val} pairs, with an outer Env."
def __init__ ( self , parms = ( ) , args = ( ) , outer = None ) :
self . update ( zip ( parms, args ) )
self . outer = outer
def find ( self , var ) :
"Find the innermost Env where the var appears."
return self if var in self else self self . outer . find ( var )

Please note that the Env class is a subclass of dict , which means that it is using regular dictionary operations. In addition, there are two methods, the constructor __init__ and find for finding the right environment for a variable. The key to understanding this class (as well as the reason that we do not use just dict ) is the concept of the external environment. Consider the following program:

Each rectangle represents a medium, and its color corresponds to the color of the variables that were defined in that medium. In the last two lines, we define a1 and call (a1 -20.00) ; These two lines represent the creation of a bank account with a balance of $ 100 and the subsequent withdrawal of $ 20. In the process of execution (a1 -20.00) we calculate the expression highlighted in yellow. 3 variables are involved in this expression. The amt variable can be found immediately in the most nested (green) environment. But balance not defined here, we will search in the external environment relatively green, ie in blue. Finally, the variable + not found in any of these environments; we need to take another step into the global (red) environment. This search process, from the internal environment to the external, is called lexical scoping. Procedure.find finds the right environment according to the lexical scope.

All that remains is to define the global environment. It is necessary to have + and all other functions built into Scheme. We will not worry about the implementation of all of them, we will get a bunch of necessary by importing the math module in Python and then explicitly adding 20 popular ones:

def add_globals ( env ) :
"Add some Scheme standard procedures to an environment."
import math , operator as op
env. update ( vars ( math ) ) # sin, sqrt, ...
env. update (
{ '+' : op. add , '-' : op. sub , '*' : op. mul , '/' : op. div , 'not' : op. not_ ,
'>' : op. gt , '<' : op. lt , '> =' : op. ge , '<=' : op. le , '=' : op. eq ,
'equal?' : op. eq , 'eq?' : op. is_ , 'length' : len , 'cons' : lambda x, y: [ x ] + y,
'car' : lambda x: x [ 0 ] , 'cdr' : lambda x: x [ 1 : ] , 'append' : op. add ,
'list' : lambda * x: list ( x ) , 'list?' : lambda x: isa ( x, list ) ,
'null?' : lambda x: x == [ ] , 'symbol?' : lambda x: isa ( x, Symbol ) } )
return env

global_env = add_globals ( Env ( ) )

Parsing: `read` and `parse`

Now about the parse function. Parsing is traditionally divided into two parts: lexical analysis , during which the input string of characters is divided into a sequence of tokens , and syntactic analysis , during which the tokens are assembled into an internal representation. Lispy tokens are brackets, characters (such as set! Or x ), and numbers. Here's how it works:

>>> program = "(set! x * 2 (* x 2))"

>>> tokenize ( program )
[ ' ( ', ' set! ', 'x * 2 ', ' ( ', ' * ', 'x', ' 2 ', ' ) ', ' ) ' ]

>>> parse ( program )
[ ' set! ',' x * 2 ', [ ' * ',' x ', 2 ] ]

There are a lot of tools for lexical analysis (for example, lex Mike Lesk and Eric Schmidt), but we will go the simple way: we use the Python str.split . We simply add spaces around each bracket, and then get a list of tokens by calling str.split .

Now about the syntax analysis. We saw that the Lisp syntax is very simple, but some Lisp interpreters have made the parsing work even easier, taking as a program any character string that is a list. In other words, the string (set! 1 2) will be accepted by the syntactically correct program, and only at run time the interpreter will swear that set! expects the first argument to be a character, not a number. In Java or Python, the equivalent assignment of 1 = 2 will be recognized as an error at compile time. On the other hand, Java and Python do not require detection at compile time of an error in the x/0 expression, therefore, as you can see, it is not always strictly defined when errors should be recognized. Lispy implements parse using read , a function that reads any expression (number, character, or nested list).

The read function works by passing read_from tokens obtained after lexical analysis. Having a list of tokens, we look at the first token; if this is a syntax error. If this is '(' , then we start building a list of expressions until we get to the corresponding ')' . Everything else must be characters or numbers, which are full expressions in their own right. The final trick is knowing that '2' represents an integer, '2.0' is a floating point number, and x represents a character. We allow Python to make these distinctions: for each “non-bracket” token, we first try to interpret it as an integer, and then as a floating point number, and finally as a symbol. Following these instructions will get:

def read ( s ) :
"Read a Scheme expression from a string."
return read_from ( tokenize ( s ) )

parse = read

def tokenize ( s ) :
"Convert a string into a list of tokens."
return s. replace ( '(' , '(' ) . replace ( ')' , ')' ) . split ( )

def read_from ( tokens ) :
"Read an expression of a sequence of tokens."
if len ( tokens ) == 0 :
raise SyntaxError ( 'unexpected EOF while reading' )
token = tokens. pop ( 0 )
if '(' == token :
L = [ ]
while tokens [ 0 ] ! = ')' :
L. append ( read_from ( tokens ) )
tokens pop ( 0 ) # pop off ')'
return L
elif ')' == token :
raise SyntaxError ( 'unexpected)' )
else :
return atom ( token )

def atom ( token ) :
"Numbers become numbers; every other token is a symbol."
try : return int ( token )
except ValueError :
try : return float ( token )
except ValueError :
return Symbol ( token )

Finally, we add the to_string function in order to convert the expression back to a readable Lisp string, as well as the repl function, which is needed for the read-eval-print loop in the form of an interactive interpreter:

def to_string ( exp ) :
"Convert a Python object back into a Lisp-readable string."
return '(' + '' . join ( map ( to_string, exp ) ) + ')' if isa ( exp, list ) else str ( exp )

def repl ( prompt = 'lis.py>' ) :
"A prompt-read-eval-print loop."
while true :
val = eval ( parse ( raw_input ( prompt ) ) )
if val is not None : print to_string ( val )

Here is our code at work:

>>> repl ( )
lis.py > ( define area ( lambda ( r ) ( * 3.141592653 ( * r r ) ) ) )
lis.py > ( area 3 )
28.274333877
lis.py > ( define fact ( lambda ( n ) ( if ( <= n 1 ) 1 ( * n ( fact ( - n 1 ) ) ) ) ) )
lis.py > ( fact 10 )
3628800
lis.py > ( fact 100 )
93326215443944152681699238856266700490715968264381621468592963895217599322991
56089414639761565182862536979208272237582511852109168640000000000000000000000
lis.py > ( area ( fact 10 ) )
4.1369087198e + 13
lis.py > ( define first car )
lis.py > ( define rest cdr )
lis.py > ( define count ( lambda ( item L ) ( if L ( + ( equal? item ( first L ) ) ( count item ( rest L ) ) ) 0 ) ) )
lis.py > ( count 0 ( list 0 1 2 3 0 0 ) )
3
lis.py > ( count ( quote the ) ( quote ( the more the bigger the better the better ) ) )
four

How small / fast / complete / good is Lispy?

We will judge Lispy by the following criteria:

Size: Lispy is tiny, 90 lines with no comments, less than 4K of source code. There are fewer lines compared to the first version, which had 96 lines - I accepted the offer of Eric Cooper to remove the definition of the Procedure class and use the Python lambda instead. The smallest Java version of my Scheme ( Jscheme ) consisted of 1664 lines and 57K source code. Jsceheme was originally called SILK (Scheme in Fifty Kilobytes, Scheme is 50 kilobytes), but I only met this limit in relation to the resulting bytecode. I think it satisfies Alan Kay’s 1972 requirement that you could define “the most powerful language in the world” in the “code page” .
bash $ grep "^ \ s * [^ # \ s]" lis.py | wc
90 398 3423
Speed: Lispy calculates (fact 100) in 0.004 seconds. For me, this is quite fast (although much slower than most other ways to calculate it).
Completion: Lispy does not fully comply with the Scheme standard. Here are the main disadvantages:
- Syntax: missing comments, quoting / quasi-citing, # literals, derived expression types (such as cond , derived from if , or let , derived from lambda ), and point list notation.
- Semantics: call / cc and tail recursion absent.
- Data types: no lines, single characters, boolean type, ports (ports), vectors, exact / inaccurate numbers. Lists in Python are actually closer to vectors in Scheme than pairs and lists.
- Functions: More than 100 primitive functions are missing, most of which are for missing data types, plus some others (such as set-car! And set-cdr! we cannot implement set-cdr! Completely using Python lists) .
- Error handling: Lispy does not attempt to detect errors, make an acceptable report of them, or recover. Lispy expects programmer excellence.
Good: it's up to the readers. I find it good for my purpose to explain Lisp interpreters.

True story

Returning to the idea that knowing how interpreters work can be very useful, I will tell a story. In 1984, I wrote my doctoral thesis. That was before LaTeX, before Microsoft Word - we used troff. Unfortunately, troff had no way to link to symbolic labels: I wanted to be able to write "As we will see on the @theoremx page", and then write something like "@ (set theoremx \ n%)" in the appropriate place ( troff reserved \ n% as page number). My friend Tony Deroz felt the same need, and together we sketched a simple Lisp program that worked as a preprocessor. However, it turned out that Lisp, which we had at that time, was good for reading Lisp expressions, but was so slow to read everything else that it irritated.

Tony and I went in different ways. He thought that the difficult part was the interpreter for expressions, he needed Lisp for this, but he knew how to write a small C subroutine to repeat non-Lisp characters and link them to a Lisp program. I did not know how to do this binding, but I thought that writing an interpreter for this trivial language (all that was in it - setting the value of a variable, getting the value of a variable, and concatenating strings) was a simple task, so I wrote this interpreter in C. Funny that Tony wrote a Lisp program because he was a C programmer, and I wrote a C program because I was a Lisp programmer.

In the end, we both achieved the result.

Together

The full Lispy code is available for download: lis.py

Source: https://habr.com/ru/post/115206/

All Articles