
1. Introduction
2. Objects. Head
3. Objects. Tail
4. Process structures
In addition to studying the standard library, it is always interesting, and sometimes useful, to know how the language is built from the inside. Andrei Svetlov (svetlov), one of the Python developers, recommends a series of articles on CPython internals to everyone interested. Here is my translation of the first episode.
A friend of mine once told me: “You know, for some people, C is just a set of macros that expands into assembler instructions.” It was a long time ago (for know-it-alls: yes, even before LLVM appeared), but I remembered these words well. Maybe when Kernighan and Ritchie look at a C program, they actually see the assembler code? And Tim Berners-Lee? Can he surf the internet in some different way, not like the rest of us? And what, after all, did Keanu Reeves see in that creepy green mess? No, really, what the hell did he see there?! Um... back to programs. What does Guido van Rossum see when he reads Python programs?
This post is the first in a series of articles on Python internals. I believe that explaining a topic to other people is the best way to understand it. And I really wanted to learn to see and understand the “creepy green mess” that stands behind Python code. I will mostly write about CPython 3 and about bytecode evaluation (I'm not a big fan of the compilation phase), but I may also touch on other things related to executing Python code in any form (Unladen Swallow, Jython, Cython, etc.). For brevity, I write Python when I mean CPython, unless stated otherwise. I also assume a POSIX-compatible OS or, where it matters, Linux, unless stated otherwise. If you are curious about how Python works, I recommend reading this post to the end. All the more so if you want to contribute to CPython. You may also read it to find my mistakes, laugh at me and leave a nasty comment, if that is the only way you can express your feelings.
Practically everything I am going to write about can be found in the Python sources or in other good sources (the documentation, especially this and this page, individual PyCon talks, searching python-dev, etc.). It is all out there, but I hope that my effort to gather the material in one place, which you can subscribe to via RSS, will make your adventures easier. I assume the reader knows a little C, some operating systems theory and a little assembly for any architecture, knows Python reasonably well and feels comfortable in UNIX (for example, can easily build and install something from source). Don't worry if you lack experience in some of this, but I can't promise smooth sailing. If you don't have a Python development environment set up, I suggest you go here and follow the steps.
Let's start with something you probably already know. To understand what is going on, I find the metaphor of a machine convenient. With Python this is easy, because Python relies on a virtual machine to do its work (like many interpreted languages). It is important to read the term “virtual machine” correctly here: think JVM rather than VirtualBox (technically they are much the same thing, but in practice the two are usually distinguished). I find it easiest to take the term literally: it is a machine built out of software. Your processor is just a complex electronic machine that takes machine code and data as input, has state (registers), and, based on the input and its current state, writes new data to memory or to the bus. Clear so far? CPython, likewise, is a machine assembled from software components that has state and processes instructions (different implementations may use different instructions). This machine lives in the process that hosts the Python interpreter. I like this “machine” metaphor, and I have already described it in great detail.
With that in mind, let's take a bird's-eye view of what happens when we run this command:
$ python -c 'print("Hello, world!")'
The Python binary is launched, the standard C library is initialized (this happens when almost any process starts), and the main function is called (see ./Modules/python.c:main in the sources, which in turn calls ./Modules/main.c:Py_Main). After some preparatory steps (parsing arguments, reading environment variables, checking the state of the standard streams, etc.), ./Python/pythonrun.c:Py_Initialize is called. By and large, this function “creates” and assembles the parts needed to run the CPython machine, turning a mere “process” into a “process with a Python interpreter inside”. Among other things, it creates two very important structures: the interpreter state and the thread state. It also initializes the built-in sys module and the module that holds all built-in functions and variables. These steps will be described in detail in later episodes.
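The result of this initialization can be observed from Python itself. The following is a minimal sketch, assuming CPython 3, where the module holding the built-in functions and variables is called builtins; it only inspects the finished state and does not show what Py_Initialize does internally.

```python
import builtins
import sys

# Both modules are compiled into the interpreter binary itself,
# so they are listed among the built-in module names.
sys_is_builtin = "sys" in sys.builtin_module_names
builtins_is_builtin = "builtins" in sys.builtin_module_names
print(sys_is_builtin, builtins_is_builtin)

# Both are already registered in the module table before any user code runs.
print(sys.modules["sys"] is sys)
print(sys.modules["builtins"] is builtins)

# Names such as print and len are simply attributes of the builtins module.
print(builtins.len is len)
```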
With all this in place, Python proceeds in one of several ways, depending on what it was given: it executes a string (the -c option), executes a module (the -m option), executes a file (passed explicitly on the command line, or passed by the kernel when Python is used as a script interpreter), or starts the REPL (a special case of executing a file, where the file is an interactive device). In our case it executes a string, since we passed the -c option. To execute this string, ./Python/pythonrun.c:PyRun_SimpleStringFlags is called. This function creates the __main__ namespace in which our line of code will be executed (where will a be stored if we run $ python -c 'a=1; print(a)'? Right, in this namespace). Once the namespace is created, the string is executed (or rather, interpreted) in it. For that to happen, the string first has to be converted into something the machine can work with.
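The claim about where a ends up can be checked from the outside. The sketch below starts a child interpreter with the -c option and asks it to look inside its own __main__ module; using subprocess and sys.executable is my own choice for illustration, so that the example does not depend on which python is on the PATH.

```python
import subprocess
import sys

# The string handed to -c: bind a, then inspect the __main__ module.
program = "a = 1; import __main__; print(__name__, __main__.a, 'a' in vars(__main__))"

completed = subprocess.run(
    [sys.executable, "-c", program],
    capture_output=True,
    text=True,
    check=True,
)
output = completed.stdout.strip()
print(output)  # __main__ 1 True
```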
As I said, I will not focus on Python's parser and compiler. I am not an expert in these areas, they do not interest me much, and as far as I know there is no special magic in the Python compiler beyond what a university compilers course covers. We will only skim these topics and perhaps return to them later to look at some specifics of CPython's behavior (for example, the global statement, which affects the parser). In general, the parsing and compilation stages in PyRun_SimpleStringFlags are as follows: lexical analysis and creation of a parse tree, its conversion into an abstract syntax tree (AST) by ./Python/ast.c:PyAST_FromNode, and compilation of the AST into a code object. For now you can think of the code object as a binary string that the machinery of the Python virtual machine can work with. With that, we are ready for interpretation.
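The last two stages can be reproduced from Python with the standard ast module and the built-in compile function. This is a minimal sketch: it skips the intermediate parse tree and goes from source text to AST to code object, and the source string is just the one-liner used earlier in the text.

```python
import ast
import types

source = "a = 1; print(a)"

# Source text -> abstract syntax tree.
tree = ast.parse(source, filename="<string>", mode="exec")
statement_types = [type(node).__name__ for node in tree.body]
print(statement_types)  # ['Assign', 'Expr']

# AST -> code object.
code = compile(tree, filename="<string>", mode="exec")
print(isinstance(code, types.CodeType))

# The raw bytecode really is a byte string; the names it refers to
# are kept separately in the code object.
print(type(code.co_code).__name__)
print(code.co_names)
```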
We have an almost empty __main__, we have a code object, and we want to execute it. What's next? The work is done by this line from ./Python/pythonrun.c:run_mod:
v = PyEval_EvalCode((PyObject*)co, globals, locals);
This function takes the code object and the namespaces globals and locals (in our case, both are the newly created __main__ namespace), creates a frame object from them and executes it. Recall that Py_Initialize creates the thread state. Each Python thread is represented by its own thread state structure, which (among other things) points to the stack of currently executing frames. After the frame object is created and placed on top of the thread state's frame stack, it (more precisely, the bytecode it points to) is executed, operation by operation, by the rather long function ./Python/ceval.c:PyEval_EvalFrameEx.
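Both ideas have rough Python-level counterparts. The sketch below first executes a code object in explicitly supplied namespaces with the built-in exec, then walks the current thread's frame stack with sys._getframe, a CPython-specific helper. The function names frame_names, inner and outer are made up for illustration.

```python
import sys

# 1. Execute a code object with explicit globals and locals.
#    Passing the same dict twice mirrors the __main__ case described above.
code = compile("a = 1", "<string>", "exec")
namespace = {}
exec(code, namespace, namespace)
print(namespace["a"])  # 1

# 2. Walk the stack of executing frames, innermost first.
def frame_names():
    frame = sys._getframe()  # the frame of frame_names itself
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back  # the caller's frame
    return names

def inner():
    return frame_names()

def outer():
    return inner()

stack = outer()
print(stack[:3])  # ['frame_names', 'inner', 'outer']
```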
PyEval_EvalFrameEx takes the frame, extracts the opcodes (and operands, if any; more on that later) and executes the pieces of C code that correspond to each opcode. Let's disassemble a Python code snippet and see what these opcodes look like:
>>> from dis import dis
... even without special knowledge, the bytecode is quite readable. “Load” something named eggs (where do we load it from? where do we load it to?), load a constant value (1), then do a “binary subtraction” (what does “binary” mean here? what are the operands?), and so on.
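A disassembly of this kind can be produced with the standard dis module. The sketch below is an assumption-laden reconstruction: only eggs, the constant 1 and the subtraction come from the text, while the statement as a whole and the target name spam are my own. Opcode names also depend on the CPython version; newer releases emit a generic BINARY_OP where older ones emit BINARY_SUBTRACT.

```python
import dis

# Hypothetical statement: the target name spam is an assumption.
source = "spam = eggs - 1"
code = compile(source, "<string>", "exec")

# Human-readable listing; the exact opcodes vary between CPython versions.
dis.dis(code)

# The same information as data: just the opcode names, in order.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]
print(opnames)
```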
As you might have guessed, the names are “loaded” from the global and local namespaces we saw earlier onto the operand stack (not to be confused with the stack of executing frames), which is exactly where the binary subtraction pops them from, subtracts one from the other and pushes the result back. “Binary subtraction” simply means subtraction with two operands (hence “binary”; it has nothing to do with binary numbers).
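The interplay of namespaces and the operand stack can be sketched as a toy interpreter. This is not how CPython is implemented: the tuple-based instruction format, the run function and the name spam are all invented for illustration, and only the four instructions needed for one subtraction are handled.

```python
def run(instructions, namespace):
    """Execute a tiny, made-up instruction list using an operand stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_NAME":
            stack.append(namespace[arg])   # namespace -> operand stack
        elif opname == "LOAD_CONST":
            stack.append(arg)              # constant -> operand stack
        elif opname == "BINARY_SUBTRACT":
            right = stack.pop()            # the last value pushed is the right operand
            left = stack.pop()
            stack.append(left - right)     # the result goes back on the stack
        elif opname == "STORE_NAME":
            namespace[arg] = stack.pop()   # operand stack -> namespace
        else:
            raise ValueError(f"unknown instruction: {opname}")
    return namespace

program = [
    ("LOAD_NAME", "eggs"),
    ("LOAD_CONST", 1),
    ("BINARY_SUBTRACT", None),
    ("STORE_NAME", "spam"),
]

result = run(program, {"eggs": 3})
print(result)  # {'eggs': 3, 'spam': 2}
```

Popping the right operand first matters: subtraction is not commutative, so the order in which values leave the stack has to mirror the order in which they were pushed.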
You can study the PyEval_EvalFrameEx function yourself in ./Python/ceval.c. It is quite large, and for obvious reasons I will not reproduce it here in full, but here is the code that runs when the BINARY_SUBTRACT operation is processed:
TARGET(BINARY_SUBTRACT) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *diff = PyNumber_Subtract(left, right);
    Py_DECREF(right);
    Py_DECREF(left);
    SET_TOP(diff);
    if (diff == NULL)
        goto error;
    DISPATCH();
}
... pop one operand (the right one), read the other (the left one) from the top of the stack, pass both to the C function PyNumber_Subtract, do something we don't understand yet (we will get to it later) called Py_DECREF on both operands, overwrite the top of the stack with the result of the subtraction and, if diff is not NULL, do something called DISPATCH. So, although some things are still unclear, I think the subtraction of two numbers in Python is now understandable at the lowest level. And it only took us about fifteen hundred words to get here!
After the frame has been executed, PyRun_SimpleStringFlags returns a status code, the main function cleans up (pay special attention to Py_Finalize), libc is deinitialized (atexit, etc.), and the process exits.
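The end of this sequence is visible from outside as the exit status of the process. The sketch below runs a few child interpreters with -c and reads their exit statuses. The helper name exit_code is made up, and the example shows only the externally observable result, not the C-level return value of PyRun_SimpleStringFlags.

```python
import subprocess
import sys

def exit_code(program):
    """Run `python -c program` in a child process and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", program],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,  # hide the traceback of the failing case
    )
    return completed.returncode

ok = exit_code("pass")                        # normal completion
failed = exit_code("1/0")                     # uncaught exception
explicit = exit_code("raise SystemExit(3)")   # status requested by the program
print(ok, failed, explicit)  # 0 1 3
```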
I hope this post turned out to be informative enough to serve as a foundation when we discuss different parts of Python later. We still have many terms to come back to: interpreter state, thread state, namespaces, modules, built-in functions and variables, code and frame objects, and those mysterious DECREF and DISPATCH from the BINARY_SUBTRACT handler. There is also a key “phantom” term that we kept circling in this article without ever calling it by name: object. CPython's object system is central to understanding how everything works, and I hope we will discuss it in detail in the next post.
Stay tuned.
Something surely suffered in translation: meanings, terms, and reptiles. Let's make the world better together: report any errors in the comments.