
Python inside. Introduction

1. Introduction
2. Objects. Head
3. Objects. Tail
4. Process structures

Besides studying the standard library, it is always interesting, and sometimes useful, to know how the language is built on the inside. Andrei Svetlov (svetlov), one of the Python developers, recommends a series of articles on CPython internals to everyone who is interested. Here is my translation of the first installment.

A friend of mine once told me: “You know, for some people, C is just a set of macros that expand into assembler instructions.” That was a long time ago (for the know-it-alls: yes, even before LLVM appeared), but I remembered those words well. Maybe when Kernighan and Ritchie look at a C program, they actually see assembler code? And Tim Berners-Lee? Does he surf the internet differently from the rest of us? And what, after all, did Keanu Reeves see in that eerie green mess? No, really, what the hell did he see there?! Um... back to programs. What does Guido van Rossum see when he reads Python programs?

This post is the first in a series of articles on Python internals. I believe that explaining a topic to other people is the best way to understand it, and I really wanted to learn to see and understand the “eerie green mess” behind Python code. I will mostly write about CPython 3.x and about bytecode (I'm not a big fan of the compilation phase), but I may well touch on other things related to executing any kind of Python code (Unladen Swallow, Jython, Cython, etc.). For brevity, I write Python to mean CPython unless stated otherwise. I also assume a POSIX-compatible OS or, where it matters, Linux, unless stated otherwise. If you are curious about how Python works, I advise you to read this post to the end, all the more so if you want to contribute to CPython. You can also read it to find the mistakes I made, laugh at me and leave a nasty comment, if that is your only way to express your feelings.

Practically everything I am going to write about can be found in the Python sources or in other good sources (the documentation, especially this and this page, certain PyCon talks, searching python-dev, etc.). It is all out there, but I hope that my effort to bring the material together in one place, which you can subscribe to via RSS, will make your adventures easier. I assume the reader knows a little C, a little operating system theory and a little assembler for any architecture, knows Python reasonably well and is comfortable in UNIX (for example, can easily install something from source). Don't worry if you lack experience in some of this, but I don't promise smooth sailing. If you don't have an environment set up for Python development, I suggest you go here and follow the steps.

Let's start with what you probably already know. To understand what is going on, I find the metaphor of machines convenient. In Python's case this is easy, because Python relies on a virtual machine to do what it does (like most interpreted languages). It is important to understand the term “virtual machine” correctly here: think JVM rather than VirtualBox (technically they are essentially the same thing, but in the real world the two are usually distinguished). I find it easiest to take the term literally: it is a machine built out of software. Your processor is just a complex electronic machine that takes machine code and data as input, has state (registers), and, based on the input and the current state, outputs new data to memory or to the bus. Clear so far? CPython, likewise, is a machine assembled from software components that has state and processes instructions (different implementations may use different instructions). This machine runs inside the process that hosts the Python interpreter. I like this “machines” metaphor, and I have already described it in great detail.

With all that in mind, let's take a bird's-eye view of what happens when we run this command:

$ python -c 'print("Hello, world!")' 

The Python binary is launched, the standard C library is initialized (this happens when almost any process starts), and the main function is called (see the sources: ./Modules/python.c : main, which calls ./Modules/main.c : Py_Main). After some preparatory steps (parsing arguments, checking environment variables, examining the state of the standard streams, etc.), ./Python/pythonrun.c : Py_Initialize is called. By and large, this function “creates” and assembles the parts needed to run the CPython machine, turning a mere “process” into a “process with a Python interpreter inside”. Among other things, it creates two very important structures: the interpreter state and the thread state. It also initializes the built-in sys module and the module that holds all built-in functions and variables. These steps will be described in detail in later episodes.
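The results of this initialization can be observed from Python itself. The snippet below is only a small sanity check, not a picture of what Py_Initialize does internally; it assumes CPython, where the module holding the built-in functions is exposed under the name builtins.

```python
import builtins
import sys

# Both modules are already registered by the time any user code runs.
builtins_module_ready = "builtins" in sys.modules
sys_module_ready = "sys" in sys.modules

# The name `print` is not magic: it is an attribute of the builtins module.
print_is_builtin = builtins.print is print

# In CPython both modules are compiled into the interpreter binary itself.
compiled_in = {"sys", "builtins"} <= set(sys.builtin_module_names)
```

All four variables should come out True on a stock CPython 3 interpreter.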

With all of this in place, Python goes one of several ways depending on what it was given: execute a string (the -c option), execute a module (the -m option), execute a file (passed explicitly on the command line, or passed by the kernel when Python is used as a script interpreter), or start the REPL (a special case of executing a file, where the file is an interactive device). In our case a string is executed, since we passed the -c option. To execute it, ./Python/pythonrun.c : PyRun_SimpleStringFlags is called. This function creates the __main__ namespace in which our line of code will run (where will a be stored if we run $ python -c 'a=1; print(a)' ? Right, in this namespace). Once the namespace is created, the string is executed in it (or rather, interpreted). For that to happen, the string first has to be converted into something the machine can understand.
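A rough Python-level analogue of “execute a string inside a module namespace” is the built-in exec with an explicit dictionary. This is a sketch of the idea only, not of what PyRun_SimpleStringFlags does in C; main_like is a hypothetical stand-in module created here so that the real __main__ is left alone.

```python
import types

# A fresh module object named "__main__"; its __dict__ is the namespace.
main_like = types.ModuleType("__main__")

# Run the one-liner from the text inside that namespace.
exec("a = 1; print(a)", main_like.__dict__)

# The variable now lives in the module's namespace, exactly where we ran it.
stored_value = main_like.__dict__["a"]
```

After the call, the name a is an ordinary key in the module's dictionary, which is all a namespace really is at this level.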

As I said, I will not focus on the Python parser and compiler. I am no expert in these areas, they do not interest me much, and as far as I know there is no special magic in the Python compiler beyond what a university compilers course covers. We will only skim these topics and may come back to them later to look at some visible features of CPython's behavior (for example, the global statement, which affects the parser). In broad strokes, the parsing and compilation stages in PyRun_SimpleStringFlags are: lexical analysis and creation of a parse tree, conversion of that tree into an abstract syntax tree (AST) using ./Python/ast.c : PyAST_FromNode, and compilation of the AST into a code object. For now, you can think of a code object as a binary string that the machinery of the Python virtual machine can work with. Now we are ready for interpretation.
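The last two stages can be reproduced from Python with the standard ast module and the built-in compile. This is a sketch of the pipeline's shape, not of the C code paths; the parse tree stage has no public Python equivalent here, so the example starts at the AST.

```python
import ast
import types

source = "spam = eggs - 1"

# Source text -> abstract syntax tree.
tree = ast.parse(source, filename="<string>", mode="exec")

# AST -> code object (compile accepts an AST as well as a string).
code = compile(tree, "<string>", "exec")

# The code object can now be executed in a namespace of our choosing.
namespace = {"eggs": 43}
exec(code, namespace)
```

Printing ast.dump(tree) shows the tree, and attributes such as code.co_names and code.co_consts show what the compiler extracted from it.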

We have an almost empty __main__, we have a code object, and we want to execute it. What's next? One line in ./Python/pythonrun.c : run_mod does all the work:

 v = PyEval_EvalCode((PyObject*)co, globals, locals); 

The function takes the code object and the globals and locals namespaces (in our case both are the newly created __main__ namespace), creates a frame object and executes it. Recall that Py_Initialize created the thread state. Each Python thread is represented by its own thread state structure, which (among other things) points to the stack of frames currently being executed. Once the frame object is created and placed on top of the thread state's frame stack, it (or rather, the bytecode it points to) is executed, operation by operation, by the rather long function ./Python/ceval.c : PyEval_EvalFrameEx.
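The frame stack of the current thread can be inspected from Python. The sketch below relies on sys._getframe, which is a CPython implementation detail rather than a portable API; inner and outer are hypothetical functions used only for illustration.

```python
import sys

def inner():
    # Start at the frame executing right now and walk towards the caller.
    frame = sys._getframe()
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)  # each frame points at its code object
        frame = frame.f_back                # the frame below it on the stack
    return names

def outer():
    return inner()

stack = outer()
```

The first two entries of stack are "inner" and "outer"; whatever follows depends on how the snippet itself was launched.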

PyEval_EvalFrameEx takes the frame, extracts opcodes (and operands, if any; more on that later) and executes the piece of C code that corresponds to each opcode. Let's disassemble a snippet of Python code and see what these opcodes look like:

 >>> from dis import dis
 >>> co = compile("spam = eggs - 1", "<string>", "exec")
 >>> dis(co)
   1           0 LOAD_NAME                0 (eggs)
               3 LOAD_CONST               0 (1)
               6 BINARY_SUBTRACT
               7 STORE_NAME               1 (spam)
              10 LOAD_CONST               1 (None)
              13 RETURN_VALUE
 >>>

... even without special knowledge, the bytecode is quite readable. We “load” something named eggs (where do we load it from? where do we load it to?), load the constant value 1, then do a “binary subtraction” (what does “binary” mean here? what are the operands?), and so on.
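If you run the disassembly yourself, expect the listing to differ in its details: offsets and some opcode names depend on the interpreter version (for example, CPython 3.11 and later emit a generic BINARY_OP instead of BINARY_SUBTRACT). A version-tolerant way to look at the same information, assuming Python 3.4 or newer for dis.get_instructions, might be:

```python
import dis

co = compile("spam = eggs - 1", "<string>", "exec")

# Collect (opcode name, human-readable argument) pairs instead of printing.
instructions = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(co)]
opnames = [name for name, _ in instructions]

for name, arg in instructions:
    print(name, arg)
```

The load of eggs and the store into spam are there in every version; the opcodes in between are where the versions disagree.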

As you might have guessed, the variables are “loaded” from the global and local namespaces we saw earlier onto the operand stack (not to be confused with the stack of executing frames), which is exactly where the binary subtraction pops them from, subtracts one from the other and pushes the result back. “Binary subtraction” simply means subtraction with two operands (hence “binary”; it has nothing to do with binary numbers).
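To make the operand stack concrete, here is a toy interpreter for the six instructions from the listing above. It is a deliberately naive model written for this explanation: run is a hypothetical helper, the instructions are plain (name, argument) tuples rather than real bytecode, and a single dictionary plays the role of both namespaces.

```python
def run(program, names):
    """Execute (opname, argument) pairs using a list as the operand stack."""
    stack = []
    for opname, arg in program:
        if opname == "LOAD_NAME":
            stack.append(names[arg])      # namespace -> stack
        elif opname == "LOAD_CONST":
            stack.append(arg)             # constant -> stack
        elif opname == "BINARY_SUBTRACT":
            right = stack.pop()           # pushed last, popped first
            left = stack.pop()
            stack.append(left - right)    # result goes back on the stack
        elif opname == "STORE_NAME":
            names[arg] = stack.pop()      # stack -> namespace
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opname}")

program = [
    ("LOAD_NAME", "eggs"),
    ("LOAD_CONST", 1),
    ("BINARY_SUBTRACT", None),
    ("STORE_NAME", "spam"),
    ("LOAD_CONST", None),
    ("RETURN_VALUE", None),
]

namespace = {"eggs": 43}
result = run(program, namespace)
```

After the run, namespace["spam"] is 42 and result is None, mirroring the fact that a module's code object returns None when it finishes.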

You can study PyEval_EvalFrameEx yourself in ./Python/ceval.c. It is quite large, and for obvious reasons I will not reproduce it in full here, but I will show the code that runs when the BINARY_SUBTRACT operation is processed:

 TARGET(BINARY_SUBTRACT) {
     PyObject *right = POP();
     PyObject *left = TOP();
     PyObject *diff = PyNumber_Subtract(left, right);
     Py_DECREF(right);
     Py_DECREF(left);
     SET_TOP(diff);
     if (diff == NULL)
         goto error;
     DISPATCH();
 }

... pop one operand (the right one), take the other (the left one) from the top of the stack, pass both to the C function PyNumber_Subtract, do the mysterious Py_DECREF on both operands (we will deal with it later), overwrite the top of the stack with the result of the subtraction and, if diff is not NULL, do some kind of DISPATCH. So, although some things are still unclear, I think subtracting two numbers in Python is now understandable at the lowest level. But it took us about one and a half thousand words to get here!
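PyNumber_Subtract does not know how to subtract anything by itself; it asks the operands' types. From the Python side this shows up as the __sub__ and __rsub__ special methods. The sketch below uses a hypothetical Eggs class to make the dispatch visible; it illustrates the Python-level protocol, not the C implementation.

```python
import operator

class Eggs:
    def __init__(self, count):
        self.count = count

    def __sub__(self, other):
        # Called for `Eggs(...) - other`: the left operand is asked first.
        return Eggs(self.count - other)

    def __rsub__(self, other):
        # Called for `other - Eggs(...)` when the left operand gives up.
        return Eggs(other - self.count)

left_case = Eggs(43) - 1            # handled by Eggs.__sub__
right_case = 10 - Eggs(3)           # int cannot do it, so Eggs.__rsub__ runs
plain_case = operator.sub(5, 1)     # the function form of the `-` operator
```

This is also why the handler above works for any pair of objects and not just numbers: the opcode only says “subtract”, and the types decide what that means.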

After the frame has been executed, PyRun_SimpleStringFlags returns an exit code, the main function performs cleanup (pay special attention to Py_Finalize), libc is deinitialized (atexit, etc.), and the process ends.

I hope this post turned out to be informative enough to serve as a foundation when we discuss different parts of Python later. We still have plenty of terms to come back to: the interpreter, the thread state, namespaces, modules, built-in functions and variables, code and frame objects, and those mysterious words DECREF and DISPATCH from the BINARY_SUBTRACT handler. There is also a key “phantom” term that we circled around throughout this article without ever calling it by name: object. The CPython object system is essential to understanding how it all works, and I hope we will discuss it in detail in the next post.

Stay tuned.

Something must have suffered in translation: meanings, terms, or reptiles. Let's make the world better together: report errors in the comments, it's safer that way.

Join us, come to Buruki ! Together we will learn the wisdom of modern tools and create cool products.

Source: https://habr.com/ru/post/189972/

