Diving into the depths of the Python interpreter. Part 1

From the translator: Probably everyone is curious about what goes on inside the tools they use. That curiosity got hold of me too, though the main thing is not to drown in it or dig in so deep that you can't climb back out. Having found some interesting material, I decided to translate it carefully and present it to the Habr community (this is my first publication, so please don't kick me too hard). If you are interested in how Python actually works, read on.

For the last three months I have spent a lot of time on byterun, an interpreter of Python bytecode written in Python. Working on this project has been exciting and fun for me, and I would be glad if you poked around in it too. But first we need to get our bearings a bit and understand how Python works, so that we know what an interpreter really is and how to make sense of it.

I'll assume that you are now where I was three months ago: you understand Python, but have no idea how it works.

A quick note: I am working with version 2.7 in this post. Version 3 is very similar to version 2; there are small differences in syntax and naming, but on the whole everything here applies.

How does Python work?

We will start with a very (very, very) high-level view of the internals. What happens when you run a line of code in your interpreter?

    ~ $ python
    Python 2.7.2 (default, Jun 20 2012, 16:23:33)
    [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a = "hello"

Years go by, glaciers melt, Linus Torvalds hacks on the next kernel, and the 64-bit processor toils tirelessly; meanwhile, four steps take place: lexical analysis, parsing, compilation, and finally interpretation. Lexical analysis breaks the line you typed into tokens. The parser takes those tokens and generates a structure that describes how they relate to each other, an AST (Abstract Syntax Tree). The compiler then converts the AST into one (or several) code objects (bytecode plus supporting data). Finally, the interpreter executes each code object.
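The four steps can be poked at from Python itself. The following is a minimal sketch, assuming Python 3 (the sessions in this article use 2.7) and relying only on the standard tokenize and ast modules plus the built-in compile and exec. The filename "<example>" is an arbitrary label chosen for this illustration.

```python
import ast
import io
import tokenize

source = 'a = "hello"\n'

# 1. Lexical analysis: split the source text into tokens.
#    (ast.parse below runs its own lexer internally; this step only
#    makes the tokens visible.)
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
token_strings = [tok.string for tok in tokens if tok.string.strip()]

# 2. Parsing: build an abstract syntax tree from the source.
tree = ast.parse(source)

# 3. Compilation: turn the AST into a code object.
code = compile(tree, "<example>", "exec")

# 4. Interpretation: execute the code object in a fresh namespace.
namespace = {}
exec(code, namespace)

print(token_strings)                 # the non-whitespace tokens
print(type(tree.body[0]).__name__)   # kind of the first AST node
print(type(code).__name__)           # the compiler's output type
print(namespace["a"])                # the value bound by running the code
```

Each stage hands its result to the next one, which is why the code object, and not the source text, is what the interpreter actually runs.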

I am not going to talk about lexical analysis, parsing, or compilation today, mainly because I don't know anything about them myself. But don't be discouraged: you can always learn all this by spending fifty hours or so. We will assume that these steps went smoothly and that we have Python code objects in our hands.

Before I get down to business, a small remark: in this post we will talk about function objects, code objects, and bytecode. These are all different things. Let's start with functions. We don't have to go deep into them to get to the interpreter, but I want to make clear that function objects and code objects are two very different things, and function objects are the most interesting.

Function Objects

You have probably heard of "function objects". They are what people mean when they say "functions are first-class objects". Let's explore them in more detail:

    >>> def foo(a):
    ...     x = 3
    ...     return x + a
    ...
    >>> foo
    <function foo at 0x107ef7aa0>

“Functions are first-class objects” means that functions are objects in the same way that a list is an object or an instance of MyObject is an object. Since foo is an object, we can refer to it without calling it (this is the difference between foo and foo()). We can pass foo as an argument to another function, or we can assign it to a variable.
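A small sketch of what that looks like in practice. The helper apply_twice and the name bar are hypothetical, introduced only for this illustration; foo is the function from the session above.

```python
def foo(a):
    x = 3
    return x + a


def apply_twice(func, value):
    # `func` is an ordinary argument that happens to be a function object.
    return func(func(value))


bar = foo                            # bind the same function object to another name
result_direct = foo(1)               # calling the function: 3 + 1
result_alias = bar(1)                # same object, so the same result
result_passed = apply_twice(foo, 1)  # foo(foo(1)) = foo(4)

print(bar is foo, result_direct, result_alias, result_passed)
```

Note that bar = foo copies nothing: both names refer to one and the same function object.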

Let's look at foo in a bit more detail:

    >>> def foo(a):
    ...     x = 3
    ...     return x + a
    ...
    >>> foo
    <function foo at 0x107ef7aa0>
    >>> foo.func_code
    <code object foo at 0x107eeccb0, file "<stdin>", line 1>

As you can see above, the code object is an attribute of the function object. The code object is generated by the Python compiler and executed by the interpreter; it contains the information the interpreter needs to do its work. Let's look at the attributes of the code object:

    >>> dir(foo.func_code)
    ['__class__', '__cmp__', '__delattr__', '__doc__', '__eq__', '__format__',
     '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__',
     '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
     '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'co_argcount',
     'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno',
     'co_flags', 'co_freevars', 'co_lnotab', 'co_name', 'co_names',
     'co_nlocals', 'co_stacksize', 'co_varnames']

There is a whole bunch of goodies here, most of which we don't need right now. Let's take a closer look at three attributes of foo's code object.

    >>> foo.func_code.co_varnames
    ('a', 'x')
    >>> foo.func_code.co_consts
    (None, 3)
    >>> foo.func_code.co_argcount
    1

Here is what we have: the names of the variables and the constants used in our function, and the number of arguments it takes. But we still don't see anything that looks like instructions. The instructions are called bytecode, which, by the way, is also an attribute of the code object:

    >>> foo.func_code.co_code
    'd\x01\x00}\x01\x00|\x01\x00|\x00\x00\x17S'

Remember that bytecode and code objects are not the same thing. Bytecode is one attribute of a code object among many others. So what is bytecode? Well, it's just a sequence of bytes. They look weird when we print them because some bytes correspond to printable characters and others don't, so let's output them as numbers.

    >>> [ord(b) for b in foo.func_code.co_code]
    [100, 1, 0, 125, 1, 0, 124, 1, 0, 124, 0, 0, 23, 83]

These are the bytes that do all the magic. The interpreter will sequentially and tirelessly fetch bytes, look up which operations they stand for and with which arguments, and execute them. To go even further, you can look through the CPython source code, specifically ceval.c, which we will do later.
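The sessions above use Python 2.7 names. As a hedged sketch for readers on Python 3: the code object lives in foo.__code__ instead of foo.func_code, and co_code is a bytes object, so iterating over it already yields integers. The actual byte values and the contents of co_consts differ between interpreter versions, so no specific output is shown here.

```python
def foo(a):
    x = 3
    return x + a


code = foo.__code__           # Python 3 spelling of foo.func_code

print(code.co_varnames)       # names of the argument and the local variable
print(code.co_argcount)       # number of positional arguments
print(code.co_consts)         # constants; exact contents vary by version

# Iterating over bytes in Python 3 yields integers directly,
# so ord() is no longer needed.
raw_bytes = list(code.co_code)
print(raw_bytes)
```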

Disassembling bytecode

Disassembling means taking all these bytes and turning them into something that we humans can understand. It is not part of the normal Python execution cycle. Fortunately, there is a great tool for the task: the dis module. We will use the dis.dis function to analyze what our foo is doing.

    >>> def foo(a):
    ...     x = 3
    ...     return x + a
    ...
    >>> import dis
    >>> dis.dis(foo.func_code)
      2           0 LOAD_CONST               1 (3)
                  3 STORE_FAST               1 (x)
      3           6 LOAD_FAST                1 (x)
                  9 LOAD_FAST                0 (a)
                 12 BINARY_ADD
                 13 RETURN_VALUE

The first number is the line number in the Python source code, and the second is the offset within the bytecode: LOAD_CONST is at position 0, STORE_FAST is at position 3, and so on. The middle column is the name of the instruction itself. The last two columns describe the instruction's argument (if it has one): the fourth column shows the argument itself, which is an index into another attribute of the code object. In this example, the argument of LOAD_CONST is an index into the co_consts tuple, and the argument of STORE_FAST is an index into co_varnames. The fifth column shows the variable name or constant value that the index refers to. We can easily verify this:

    >>> foo.func_code.co_consts[1]
    3
    >>> foo.func_code.co_varnames[1]
    'x'

This also explains why the second instruction, STORE_FAST, is at position 3 in the bytecode. If an instruction has an argument, the next two bytes are that argument, so an instruction with an argument takes three bytes in total. It is the interpreter's job to keep track of this and not get confused. (You may have noticed that BINARY_ADD has no argument. Don't worry, we will come back to this.)
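The same columns can also be read programmatically. This is a minimal sketch, assuming Python 3.4 or later, where the dis module provides get_instructions; the helper describe is hypothetical and exists only for this example. Instruction names, offsets, and instruction sizes differ from the 2.7 listing above (for instance, newer versions no longer use three-byte instructions), so the output is not reproduced here.

```python
import dis


def foo(a):
    x = 3
    return x + a


def describe(func):
    """Return (offset, opname, argrepr) for every instruction in func."""
    return [(ins.offset, ins.opname, ins.argrepr)
            for ins in dis.get_instructions(func)]


rows = describe(foo)
for offset, opname, argrepr in rows:
    # offset: position in the bytecode; opname: instruction name;
    # argrepr: human-readable form of the argument (empty if none).
    print(f"{offset:>4} {opname:<24} {argrepr}")
```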

One thing surprised me when I started to figure out how Python works: how can Python be dynamic if it is also “compiled”? These two words are usually treated as opposites: there are dynamic languages such as Python, Ruby, and JavaScript, and there are compiled ones like C, Java, and Haskell.

When people talk about compiled languages, they mean compilation into native x86 / ARM / etc. instructions. An interpreted language either has no compilation step at all or is only “compiled” on the fly into bytecode. The Python interpreter reads the bytecode and executes it inside a virtual machine, which, by the way, does a lot of work, but we'll talk about that later.

To be dynamic you need to be abstract. Let's see what that means:

    >>> def modulus(x, y):
    ...     return x % y
    ...
    >>> [ord(b) for b in modulus.func_code.co_code]
    [124, 0, 0, 124, 1, 0, 22, 83]
    >>> dis.dis(modulus.func_code)
      2           0 LOAD_FAST                0 (x)
                  3 LOAD_FAST                1 (y)
                  6 BINARY_MODULO
                  7 RETURN_VALUE

This is the disassembled bytecode of the function. By the time we get the prompt back, the modulus function has been compiled and its code object has been generated. Not surprisingly, the remainder operation % (the modulus operation) is compiled to BINARY_MODULO. It looks like this function can be used for numbers:

    >>> modulus(15,4)
    3

Not bad, but what if we pass something else, such as a string?

    >>> modulus("hello %s", "world")
    'hello world'

Whoa, what happened here? You have probably seen this before:

    >>> print "hello %s" % "world"
    hello world

When the BINARY_MODULO operation is performed on two strings, it does string substitution instead of computing a remainder. This is an excellent example of dynamic typing. When the compiler generates the code object for modulus, it has no idea what x and y are: strings, numbers, or something else. It simply emits instructions: load one variable, load another, perform a binary modulo, return the result. It is the interpreter's job to work out what BINARY_MODULO means in the current context. Our modulus function can compute a remainder, substitute strings ... maybe something else? If we define a class with a __mod__ method, we can make it do anything.

    >>> class Surprise(object):
    ...     def __init__(self, num):
    ...         self.num = num
    ...     def __mod__(self, other):
    ...         return self.num + other.num
    ...
    >>> seven = Surprise(7)
    >>> four = Surprise(4)
    >>> modulus(seven, four)
    11
    >>> modulus(7,4)
    3
    >>> modulus("hello %s", "world")
    'hello world'

The same function with the same bytecode can perform different operations depending on the types of the objects it receives. The modulus function can also raise an exception, for example TypeError, if we call it on objects that do not implement the % operation.
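The TypeError case is easy to reproduce. Here is a minimal sketch, assuming Python 3 (hence class Surprise without the explicit object base); passing None is just one arbitrary example of an operand that does not implement %.

```python
class Surprise:
    def __init__(self, num):
        self.num = num

    def __mod__(self, other):
        # Deliberately surprising: % on Surprise objects adds instead.
        return self.num + other.num


def modulus(x, y):
    return x % y


# One function, one piece of bytecode, three different behaviours.
results = [
    modulus(15, 4),                     # integer remainder
    modulus("hello %s", "world"),       # string substitution
    modulus(Surprise(7), Surprise(4)),  # whatever __mod__ decides
]

try:
    modulus(None, 4)                    # NoneType does not implement %
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(results, error_name)
```

The compiler produced modulus once; which behaviour you get, or whether you get an exception, is decided only when the interpreter sees the actual operands.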

This is one of the reasons why Python is hard to optimize: when you generate the code object and the bytecode, you do not know what kinds of objects will eventually turn up. Russell Power and Alex Rubinsteyn wrote a paper, “How fast can we make interpreted Python?”, which is quite informative.

That's all for now. The original article is here. I apologize for possible errors: I am naturally afflicted with inborn illiteracy and have to rely on automated checking of the text.

Source: https://habr.com/ru/post/264609/
