Manual: writing an interpreter with JIT to PyPy

All source codes and examples from this article are available here .

When I first watched the PyPy project, it took me a while to figure out what it was. It consists of two things:

- a set of tools for writing interpreters of programming languages;
- implementation of Python using this toolkit.
')
Probably most people think that PyPy is only the second part, but this guide is not about the Python interpreter. It is about how to write an interpreter of your language.

I took this tutorial in order to better understand how PyPy works and what it is like. It is assumed that you know very little about PyPy, so I will start from the beginning.

What is PyPy

Suppose we want to write an interpreter. This includes writing a source code parser, a bytecode loop, and a lot of standard library code.

Writing a parser and compiler is usually not fun at all, so there are tools that generate a parser and compiler for you.

And even then you have to take care of the memory management in the interpreter and you will have to implement data types such as integers of arbitrary dimension, hash tables and so on. This is enough for many people to decide not to implement their interpreter.

Wouldn't it be great if you could implement your language in a high-level language, for example, Python? You would get all the benefits of a high-level language, such as automatic memory management and a rich set of data types. But the interpretation of a language in another interpreted language must be very slow, right?

As you can guess, PyPy solves this problem. PyPy is a sophisticated toolkit for analyzing and translating your interpreter code into C (or JVM or CLI). This process is called “translation.” He knows how to translate not the entire syntax of Python, but quite a large part of it. All you have to do is write your interpreter in RPython , a subset of Python, after which you will get a very efficient interpreter.

Because writing effective interpreters should not be a problem.

Tongue

The language I chose to implement is terribly simple. The runtime of a language consists of a tape of integers initialized to zero, and one pointer to the current cell in this tape. The language has only 8 commands:

< - move the pointer to the previous cell.

> - move the pointer to the next cell.

+ - increase by one number in the current cell.

- - decrease by one number in the current cell.

[ - if the number in the current cell is 0, then skip all instructions to the corresponding instruction].

] - go back to the corresponding instruction [.

. - output to the standard output stream one byte from the current cell.

, - read one byte from the standard input stream and put in the current cell.

Any unrecognized characters should be ignored.

Some could learn this language, this is a brainfuck.

The language itself is already a bytecode, so it does not require a separate translation of the source code into bytecode. This means that the main execution loop of our interpreter will work directly with the source code. This simplifies its implementation a little.

The first steps

Let's start by writing an interpreter on a regular Python. Here is a sketch of the main run cycle.

def mainloop(program): tape = Tape() pc = 0 while pc < len(program): code = program[pc] if code == ">": tape.advance() elif code == "<": tape.devance() elif code == "+": tape.inc() elif code == "-": tape.dec() elif code == ".": sys.stdout.write(chr(tape.get())) elif code == ",": tape.set(ord(sys.stdin.read(1))) elif code == "[" and value() == 0: # Skip forward to the matching ] elif code == "]" and value() != 0: # Skip back to the matching [ pc += 1

As you can see, the command counter (pc) stores a pointer to the current instruction. The first expression in the loop retrieves the instruction, then several conditional statements determine how to execute it.

The implementation of the operators “[” and “]” is omitted; they must change the command counter to the position of the matching bracket.

And now the implementation of the class Tape, which stores a tape of integers and a pointer to the current number.

 class Tape(object): def __init__(self): self.thetape = [0] self.position = 0 def get(self): return self.thetape[self.position] def set(self, val): self.thetape[self.position] = val def inc(self): self.thetape[self.position] += 1 def dec(self): self.thetape[self.position] -= 1 def advance(self): self.position += 1 if len(self.thetape) <= self.position: self.thetape.append(0) def devance(self): self.position -= 1

As you can see, the tape increases if necessary. In fact, it would be worthwhile to add error checking when the pointer becomes negative. But so far it does not matter.

If the program has many comments, they will be read by one byte, so it is better to parse the source code in advance. At the same time we will make a dictionary for brackets, so that you can find paired brackets in it.

 def parse(program): parsed = [] bracket_map = {} leftstack = [] pc = 0 for char in program: if char in ('[', ']', '<', '>', '+', '-', ',', '.'): parsed.append(char) if char == '[': leftstack.append(pc) elif char == ']': left = leftstack.pop() right = pc bracket_map[left] = right bracket_map[right] = left pc += 1 return "".join(parsed), bracket_map

This function returns a string only from language commands and a dictionary of paired brackets.

It remains to connect this, and we get a working brainfuck interpreter.

 def run(input): program, map = parse(input.read()) mainloop(program, map) if __name__ == "__main__": import sys run(open(sys.argv[1], 'r'))

The full interpreter code, including the implementation of square brackets, can be viewed in the first example. example1.py

Now you can try running the interpreter on Python to make sure that it works.

$ python example1.py 99bottles.b

PyPy Broadcast

But our goal was not only to write a brainfuck interpreter. What do you need to do so that PyPy creates a super-fast executable file from this code?

In the PyPy source, the pypy / translator / goal folder contains simple examples that will come in handy. First, let's take a look at targetnopstandalone.py - a simple hello world for PyPy.

It is important that the module contains the target function, which returns the entry point. Broadcast starts from this point.

 def run(fp): program_contents = "" while True: read = os.read(fp, 4096) if len(read) == 0: break program_contents += read os.close(fp) program, bm = parse(program_contents) mainloop(program, bm) def entry_point(argv): try: filename = argv[1] except IndexError: print "You must supply a filename" return 1 run(os.open(filename, os.O_RDONLY, 0777)) return 0 def target(*args): return entry_point, None if __name__ == "__main__": entry_point(sys.argv)

The entry_point function will take command line arguments when you run the resulting executable file.

RPython

Let's talk about RPython. PyPy cannot translate regular Python code, because Python is slightly too dynamic. There are some restrictions that apply to the standard library and Python syntax so that PyPy can translate it. I will not list them all, it is better to see them in the documentation .

In the above example, several things had to be changed. Now you have to use low-level descriptors with the os.open and os.read functions instead of using file objects. The implementation of "." And "," is also slightly modified. These are all the changes necessary for PyPy to digest the code.

It wasn't too hard, was it? I continue to use dictionaries, expandable lists, and even classes with objects. And if the file descriptors are too low-level for you, there are several useful abstractions in the rlib.streamio module that comes with the standard RPython library.

The full code now looks like this: example2.py

Broadcast

If you have not done so already, merge the latest version of PyPy from the repository to bitbucket.org.

$ hg clone bitbucket.org/pypy/pypy

The script to run is pypy / translator / goal / translate.py. As a parameter, it takes a module that needs to be translated.

$ python ./pypy/pypy/translator/goal/translate.py example2.py

For faster translation, you can use PyPy instead of Python.

The result of the execution will be an executable file - the brainfuck interpreter. The repository is a generator of fractals on the brainfuck, which takes about 45 seconds to complete on my machine. Try it yourself.

$ ./example2-c mandel.b

And compare the speed with that which gives the same interpreter running on Python.

$ python example2.py mandel.b

So, we wrote an interpreter in RPython and translated it using the PyPy toolkit.

Add JIT

RPython translation in C is certainly cool, but one of the main features of PyPy is the ability to generate a runtime compiler (JIT). Using just a few tips on how your interpreter works, PyPy will generate a JIT compiler that will translate the interpreted brainfuck code into machine code.

For this to happen, PyPy needs to know where the program execution cycle begins. This allows you to track which instructions are executed on the brainfuck.

We must also indicate the features of the run cycle. Since there is no stack in our language, we only need to specify which variables relate to the program code, and which to its data. This is called green and red variables appropriately.

Let's return to the second example.

Our main run loop uses four variables: pc, program, bracket_map and tape. Of course, pc, program and bracket_map are green variables, they determine the execution of the interpreted program. The variable tape is red, it changes when the interpreted program is executed.

Let's tell PyPy this data. Begin by importing the JitDriver class and creating its instance.

 from pypy.rlib.jit import JitDriver jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'], reds=['tape'])

And add this line to the very beginning of the execution loop:

 fjitdriver.jit_merge_point(pc=pc, tape=tape, program=program, bracket_map=bracket_map)

We also need to define JitPolicy.

 def jitpolicy(driver): from pypy.jit.codewriter.policy import JitPolicy return JitPolicy()

The full text of the example: example3.py

Now let's translate the code again, but with the flag --opt = jit:

$ python ./pypy/pypy/translator/goal/translate.py --opt = jit example3.py

The translation will go much longer, almost 8 minutes on my machine, and the resulting executable file will be much longer. After the end of the broadcast run the program to generate fractals again. The difference is huge - about 12 seconds versus 45 in the previous version!

As you can see, the JIT compiler really used machine code instead of interpretation. The first few lines of the picture are displayed quickly enough, then the program is accelerated and the rest is displayed even faster.

Few tracing JIT compilers

It is worth talking about how tracing JIT compilers work at all. Your interpretive code runs as usual. When the JIT encounters a frequently executed loop in an interpreted language (brainfuck), the loop is marked for tracing. The next time the same cycle is reached, logging of each executed instruction is enabled.

The resulting log is sent to the optimizer, the result of which is converted into machine code. This code is used for subsequent loop executions.

The resulting machine code is correct only under certain conditions under which it was received. Therefore, before using it, these conditions are checked. If the check fails, the interpreter is restarted instead of the machine code.

More information can be found on Wikipedia .

Debugging and trace logs

Can we see what JIT does?

Let's add the function get_printable_location, which will be used to display debug information.

 def get_location(pc, program, bracket_map): return "%s_%s_%s" % ( program[:pc], program[pc], program[pc+1:] ) jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'], reds=['tape'], get_printable_location=get_location)

This function accepts green variables and returns a due date. We output the brainfuck code in which the current instruction being executed is surrounded by underscores.

Translate example code example4.py .

Now run the test program (test.b just prints the letter “A” about 15 times) with the output of trace logs.

$ PYPYLOG = jit-log-opt: logfile ./example4-c test.b

The logfile file contains logs of all produced traces and allows you to look at what instructions were compiled into machine code. The file is useful in that it allows you to see unnecessary instructions or ways to optimize.

Each trace starts with a line like this:
[3c091099e7a4a7] {jit-log-opt-loop

And ends with this line:
[3c091099eae17d jit-log-opt-loop}

Immediately after the trace header is a comment with a sequence number and the number of operations. In my case, the first trace looks like this.

  1: [3c167c92b9118f] {jit-log-opt-loop 2: # Loop 0 : loop with 26 ops 3: [p0, p1, i2, i3] 4: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0) 5: debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0) 6: i4 = getarrayitem_gc(p1, i2, descr=<SignedArrayDescr>) 7: i6 = int_add(i4, 1) 8: setarrayitem_gc(p1, i2, i6, descr=<SignedArrayDescr>) 9: debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0) 10: debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0) 11: i7 = getarrayitem_gc(p1, i3, descr=<SignedArrayDescr>) 12: i9 = int_sub(i7, 1) 13: setarrayitem_gc(p1, i3, i9, descr=<SignedArrayDescr>) 14: debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0) 15: i10 = int_is_true(i9) 16: guard_true(i10, descr=<Guard2>) [p0] 17: i14 = call(ConstClass(ll_dict_lookup__dicttablePtr_Signed_Signed), ConstPtr(ptr12), 90, 90, descr=<SignedCallDescr>) 18: guard_no_exception(, descr=<Guard3>) [i14, p0] 19: i16 = int_and(i14, -9223372036854775808) 20: i17 = int_is_true(i16) 21: guard_false(i17, descr=<Guard4>) [i14, p0] 22: i19 = call(ConstClass(ll_get_value__dicttablePtr_Signed), ConstPtr(ptr12), i14, descr=<SignedCallDescr>) 23: guard_no_exception(, descr=<Guard5>) [i19, p0] 24: i21 = int_add(i19, 1) 25: i23 = int_lt(i21, 114) 26: guard_true(i23, descr=<Guard6>) [i21, p0] 27: guard_value(i21, 86, descr=<Guard7>) [i21, p0] 28: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0) 29: jump(p0, p1, i2, i3, descr=<Loop0>) 30: [3c167c92bc6a15] jit-log-opt-loop}

I trimmed the debug_merge_point lines too long.

This code segment takes four parameters: two pointers to objects (p0 and p1) and two numbers (i2 and i3).

The first operator “>” begins on the 4th line. It runs without instructions and looks finally optimized. This loop always works with one part of the tape, and the pointer to the current cell remains constant.

Fifth to eighth lines - operator “+”. First, the element of the array with index i2 is extracted from the pointer p1 (line 6), one is added and stored in i6 (line 7). The result is placed back into the array (line 8).

Line 9 corresponds to the instruction "<", but it also does not require operations. Apparently - i2 and i3 - these are two pointers to the tape cells, which are calculated in advance. You can also see that p1 is a ribbon of commands. It is not yet clear what p0 is.

Lines 10 through 13 perform the “-” operator: extract the element of the array, subtract and put it back.

In the 14th line we come to the operator "]". Lines 15 and 16 check if i9 is true (i.e. is non-zero). i9 is the value that we just reduced by one and put it on tape. Line 16 - check. If the condition is not met, the function with the name <Guard2> is executed, to which one parameter is passed, p0.

If the check is passed, lines 17 to 23 take out from the dictionary of the bracket_map the address of the instruction to which you want to go. I'm not sure what exactly these lines do, but it is clear that they contain two external calls and 3 checks. This is too wasteful, given that the bracket_map does not change and the result will be the same address to which to go. But PyPy doesn't know about this, but we know, so we can optimize this place.

Line 24 increments the pointer received from bracket_map. Lines 25 and 26 verify that it did not exceed the length of the program.

In addition, line 27 carries out an additional check that the pointer is strictly equal to 86. This is necessary in order to make sure that the jump must be made at the beginning of the cycle.

At the end, the cycle closes on line 28, and on line 29 there is a jump to the beginning of the cycle with parameters p0, p1, i2, i3.

Optimization

As noted, at each iteration of the loop, a search is performed in the dictionary to find a matching pair. This is terribly inefficient because the goal of the transition does not change at different iterations.

We need to give another hint to the translator to say that a query to the dictionary will always return the same elements for the same dictionary indices.

To do this, we will render the call to the dictionary into a separate function and wrap it with pypy.rlib.jit.purefunction.

 @purefunction def get_matching_bracket(bracket_map, pc): return bracket_map[pc]

This version can be found in example5.py .

Broadcast this example. Mandelbrot now runs in 6 seconds instead of 12!

Let's take a look at the new trace log.

  1: [3c29fad7b792b0] {jit-log-opt-loop 2: # Loop 0 : loop with 15 ops 3: [p0, p1, i2, i3] 4: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0) 5: debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0) 6: i4 = getarrayitem_gc(p1, i2, descr=<SignedArrayDescr>) 7: i6 = int_add(i4, 1) 8: setarrayitem_gc(p1, i2, i6, descr=<SignedArrayDescr>) 9: debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0) 10: debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0) 11: i7 = getarrayitem_gc(p1, i3, descr=<SignedArrayDescr>) 12: i9 = int_sub(i7, 1) 13: setarrayitem_gc(p1, i3, i9, descr=<SignedArrayDescr>) 14: debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0) 15: i10 = int_is_true(i9) 16: guard_true(i10, descr=<Guard2>) [p0] 17: debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0) 18: jump(p0, p1, i2, i3, descr=<Loop0>) 19: [3c29fad7ba32ec] jit-log-opt-loop}

Much better! Now, each cycle is just one addition, subtraction, two loads from the array, two rooms into the array, and a check on exit. The code does not require any changes to the command counter.

This optimization was suggested to me by Armin Rigo in the pypy-dev mailing list. Karl Friedrich has several articles on interpreter optimization that have also proven useful.

Conclusion

I hope this article has convinced you that PyPy is not only a fast Python interpreter.

For those of you who want to learn more about the PyPy JIT compiler, I recommend reading the Tracing the Meta-Level article : PyPy's Tracing JIT Compiler .

Source: https://habr.com/ru/post/124418/

All Articles