The Python interpreter: what is the snake thinking? (Parts I-III)

From the translator

A very loose translation of a series of three articles about the internals of the Python interpreter. The author is building his own implementation (reinventing the wheel, so to speak) and decided to share the knowledge gained along the way. Let's see what came of it.

This series is intended for those who can write Python in general but have only a vague idea of how the language works on the inside. Just like me three months ago.

A small disclaimer: I'll tell my story using the Python 2.7 interpreter as an example. Everything discussed below can be repeated in Python 3.x, allowing for some differences in syntax and in the names of some functions.

So, let's begin.

Part I. Hey Python, what's inside you?

Let's start with a slightly (actually, very) high-level view of what our beloved snake is. What happens when you type a line like this in the interactive interpreter?

>>> a = "hello"

You press Enter and Python kicks off the following four processes: lexical analysis, parsing, compilation, and interpretation proper. Lexical analysis splits the line of code you typed into a sequence of units called tokens. Next, the parser uses these tokens to build a structure that reflects the relationships between them (in this case, the structure is an abstract syntax tree, or AST). Then, using the AST, the compiler creates one or more code objects and passes them to the interpreter for execution.
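The four stages can be poked at from Python itself. The sketch below is written for Python 3 and uses the standard library modules tokenize and ast plus the built-ins compile and exec. These are only convenient windows onto the stages, not necessarily the exact machinery the interactive interpreter runs, and the names source, token_strings and namespace are made up for the illustration.

```python
import ast
import io
import tokenize

source = 'a = "hello"\n'

# 1. Lexical analysis: split the source text into tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# Keep only tokens with visible text (drops NEWLINE and ENDMARKER).
token_strings = [tok.string for tok in tokens if tok.string.strip()]

# 2. Parsing: build an abstract syntax tree (AST).
tree = ast.parse(source)

# 3. Compilation: turn the AST into a code object.
code = compile(tree, "<example>", "exec")

# 4. Interpretation: execute the code object in a fresh namespace.
namespace = {}
exec(code, namespace)

print(token_strings)
print(type(tree.body[0]).__name__)
print(namespace["a"])
```

After running it, namespace holds the variable a that the original line assigns.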

I will not go into lexical analysis, parsing, or compilation, mainly because I know next to nothing about them myself. Instead, let's just imagine that smart people did everything right and these stages of the Python interpreter work without errors. Got the picture? Moving on.

Before moving on to code objects, something should be clarified. In this series we will talk about function objects, code objects, and bytecode. These are all completely different, though related, concepts. Although we don't need to know what function objects are to understand the interpreter, I would still like to draw your attention to them. Not to mention that they are simply cool.

So,

Function objects or functions as objects

If this is not the first article on Python programming you have read, you have probably heard of "function objects". They are what people with a knowing look have in mind when they talk about "functions as first-class objects" and "Python having first-class functions". Consider the following example:

 >>> def foo(a):
 ...     x = 3
 ...     return x + a
 ...
 >>> foo
 <function foo at 0x107ef7aa0>

The expression "functions are first-class objects" means that functions are ~~first-class~~ objects in the same sense that lists are objects and an instance of the class MyObject is an object. And since foo is an object, it has a value of its own, whether or not it is called as a function (that is, foo and foo() are two different things). We can pass foo to another function as an argument, and we can bind it to a new name (other_function = foo). With first-class functions you can do whatever you like, and they will put up with it.
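A minimal sketch of what this means in practice. The helper call_with_ten is a hypothetical name invented for the illustration; foo is the function from the listing above.

```python
def foo(a):
    x = 3
    return x + a


def call_with_ten(func):
    # Hypothetical helper: receives a function object and calls it.
    return func(10)


# foo (no parentheses) is the object itself; foo(10) is the result of a call.
other_function = foo                  # bind the same object to a new name
result_direct = foo(10)
result_alias = other_function(10)
result_passed = call_with_ten(foo)    # pass the function as an argument

print(result_direct, result_alias, result_passed)
```

All three calls go through the very same function object, so they give the same result.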

Part II. Code objects

At this point we need to dig a little deeper and discover that a function object, in turn, contains a code object:

 >>> def foo(a):
 ...     x = 3
 ...     return x + a
 ...
 >>> foo
 <function foo at 0x107ef7aa0>
 >>> foo.func_code
 <code object foo at 0x107eeccb0, file "<stdin>", line 1>

As the listing above shows, the code object is an attribute of the function object (which has many other attributes, but they are of no particular interest here because foo is so simple).

The code object is generated by the Python compiler and then passed to the interpreter. It contains all the information needed for execution. Let's look at its attributes:

 >>> dir(foo.func_code)
 ['__class__', '__cmp__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags', 'co_freevars', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']

As you can see, there are a lot of them, so we will not go through them all. As an example, let's look at three of the most understandable ones:

 >>> foo.func_code.co_varnames
 ('a', 'x')
 >>> foo.func_code.co_consts
 (None, 3)
 >>> foo.func_code.co_argcount
 1

The attributes look fairly intuitive:
co_varnames - names of the local variables (arguments first)
co_consts - constant values that the function knows about
co_argcount - the number of arguments that the function takes
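The same three attributes can be read in a version-independent way. The sketch below uses foo.__code__, which exists since Python 2.6 and is the only spelling in Python 3 (func_code is Python 2 only). The exact contents of co_consts may vary between interpreter versions, so the example only checks that the constant 3 is present.

```python
def foo(a):
    x = 3
    return x + a


code = foo.__code__            # same object as foo.func_code in Python 2

local_names = code.co_varnames     # argument names first, then other locals
constants = code.co_consts         # constants referenced by the function body
argument_count = code.co_argcount  # number of positional arguments

print(local_names)
print(3 in constants)
print(argument_count)
```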

All this is very informative, but it looks a little too high-level for our topic, doesn't it? Where are the instructions that the interpreter actually executes? Such instructions do exist, and they are represented by bytecode. The bytecode is also an attribute of the code object:

 >>> foo.func_code.co_code
 'd\x01\x00}\x01\x00|\x01\x00|\x00\x00\x17S'

What is this unintelligible byte garbage, you ask?

Part III. Bytecode

You probably understand this already but, just in case, I will say it out loud: "bytecode" and "code object" are different things. The first is one attribute of the second, among many others (see Part II). The attribute is called co_code and contains all the instructions the interpreter needs for execution.

So what is this bytecode? As the name suggests, it is simply a sequence of bytes. Printed to the console it looks like gibberish, so let's turn it into a sequence of numbers by passing each byte through ord:

 >>> [ord(b) for b in foo.func_code.co_code]
 [100, 1, 0, 125, 1, 0, 124, 1, 0, 124, 0, 0, 23, 83]

Thus, we obtained a numerical representation of Python bytecode. The interpreter will traverse each byte in the sequence and execute the instructions associated with it. Note that bytecode itself does not contain Python objects, references to objects, etc.
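In Python 3, co_code is a bytes object whose elements are already integers, so the ord trick fails there. A small sketch that should work under both 2.7 and 3.x goes through bytearray instead. The numbers you get depend on the interpreter version and will generally not match the 2.7 listing above.

```python
def foo(a):
    x = 3
    return x + a


raw = foo.__code__.co_code          # str in Python 2, bytes in Python 3
byte_values = list(bytearray(raw))  # a list of integers in either version

print(byte_values)
```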

You can try to understand bytecode by opening the CPython interpreter file (ceval.c), but we will not do this. More precisely, we will, but later. Now let's go in a simple way and use the dis module from the standard library.

Disassemble it

Disassembling is the translation of a byte sequence into something more understandable to the human mind. For this purpose, there is a dis module in python that will show you in detail everything that is hidden. The module has no special application in production code, the results of its work are needed only by you, not the interpreter.

So let's apply dis and lift the veil from our code object. To do this, use the function dis.dis:

 >>> def foo(a):
 ...     x = 3
 ...     return x + a
 ...
 >>> import dis
 >>> dis.dis(foo.func_code)
   2           0 LOAD_CONST               1 (3)
               3 STORE_FAST               1 (x)

   3           6 LOAD_FAST                1 (x)
               9 LOAD_FAST                0 (a)
              12 BINARY_ADD
              13 RETURN_VALUE

You will often see calls like dis.dis(foo), where the function object is passed to the disassembler directly. This is done for convenience; under the hood dis still finds func_code and analyzes it. In our example we pass the code object explicitly to make the process easier to follow.
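A quick way to convince yourself that both spellings disassemble the same thing. This sketch assumes Python 3.4 or newer, where dis.get_instructions is available (it does not exist in 2.7), and it compares only the instruction names.

```python
import dis


def foo(a):
    x = 3
    return x + a


# Disassemble via the function object and via its code object.
names_from_function = [ins.opname for ins in dis.get_instructions(foo)]
names_from_code = [ins.opname for ins in dis.get_instructions(foo.__code__)]

print(names_from_function == names_from_code)
```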

The numbers in the first column are the line numbers of the source code being analyzed. The second column shows the offset of each instruction in the bytecode: LOAD_CONST is at offset 0, STORE_FAST at offset 3, and so on. The third column gives the byte instructions human-readable names. These names are needed only by ~~wretched little people~~ us; the interpreter does not use them.

The last two columns contain details about the instruction's argument, if it has one. The fourth column shows the index of the argument in the corresponding attribute of the code object. In our example, the argument of LOAD_CONST is at index 1 of the co_consts tuple, and the argument of STORE_FAST is at index 1 of co_varnames. Finally, in the fifth column, dis shows the value or the name of the corresponding variable. Let's check this in practice:

 >>> foo.func_code.co_consts[1]
 3
 >>> foo.func_code.co_varnames[1]
 'x'

This also explains why STORE_FAST sits at offset 3 in the bytecode: if an instruction takes an argument, the two bytes that follow it hold that argument. Handling such cases correctly is also the interpreter's job.
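To make the layout concrete, here is a toy decoder for the Python 2.7 byte sequence printed earlier. It is a sketch for this one example, not how dis is implemented. It relies on the 2.7 rule that opcodes numbered 90 and above (HAVE_ARGUMENT in opcode.py) are followed by a two-byte little-endian argument; the small OPNAMES table is filled in by hand from the dis listing above. Newer interpreters (3.6 and later) use a different, fixed-width encoding, so this decoder does not apply to them.

```python
# Byte values of foo's bytecode as printed by Python 2.7 in the article.
BYTECODE = [100, 1, 0, 125, 1, 0, 124, 1, 0, 124, 0, 0, 23, 83]

HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 take a 2-byte argument

# Hand-written excerpt of the 2.7 opcode table (only what this example needs).
OPNAMES = {
    23: "BINARY_ADD",
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
    124: "LOAD_FAST",
    125: "STORE_FAST",
}


def decode(bytecode):
    """Return a list of (offset, name, argument-or-None) tuples."""
    instructions = []
    offset = 0
    while offset < len(bytecode):
        opcode = bytecode[offset]
        if opcode >= HAVE_ARGUMENT:
            # Low byte first, then high byte.
            arg = bytecode[offset + 1] | (bytecode[offset + 2] << 8)
            size = 3
        else:
            arg = None
            size = 1
        name = OPNAMES.get(opcode, "<%d>" % opcode)
        instructions.append((offset, name, arg))
        offset += size
    return instructions


decoded = decode(BYTECODE)
for offset, name, arg in decoded:
    print(offset, name, arg)
```

The offsets it produces (0, 3, 6, 9, 12, 13) are the ones in the second column of the dis output.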

Hint

If you were surprised that BINARY_ADD has no arguments, have a cookie for your attentiveness, but don't worry just yet. We will come back to this a little later, when we get to the interpreter itself.

How does dis translate bytes (for example, 100) into meaningful names (for example, LOAD_CONST) and back? Think about how you would organize such a system yourself. If your thoughts run along the lines of "well, maybe there is some list that defines the bytes in order" or "some dictionary with instruction names as keys and bytes as values", congratulations: you are absolutely right. That is exactly how it works. The definitions themselves live in the opcode.py file (you can also look at the header file opcode.h), where you can see about a hundred and fifty lines like these:

 def_op('LOAD_CONST', 100)       # Index in const list
 def_op('BUILD_TUPLE', 102)      # Number of tuple items
 def_op('BUILD_LIST', 103)       # Number of list items
 def_op('BUILD_SET', 104)        # Number of set items
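Both lookup structures are exposed by the dis module as dis.opname (a list indexed by opcode number) and dis.opmap (a dictionary from name to number). The sketch below does a round trip through them. The concrete number behind LOAD_CONST is 100 in Python 2.7 but is not guaranteed to be the same in other versions, so the example does not hard-code it.

```python
import dis

# Name -> number, using the dictionary.
load_const_number = dis.opmap["LOAD_CONST"]

# Number -> name, using the list.
round_trip_name = dis.opname[load_const_number]

print(load_const_number, round_trip_name)
```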

(Some fan of comments kindly left us explanations for the instructions.)

Now we have some idea of what bytecode is (and what it is not) and how to use dis to analyze it. In the following parts we will look at how Python can be compiled to bytecode and still remain a dynamic language.

Source: https://habr.com/ru/post/206420/
