Hello! The long March weekend has come to an end, and we want to dedicate our first post-holiday publication to one of our favorite courses, “Python Developer”, which starts in less than two weeks. Let's go.
Contents
- Memory is an empty book
- Memory management: from hardware to software
- The default Python implementation
- Global Interpreter Lock (GIL)
- Garbage collector
- CPython memory management
- Conclusion

Have you ever wondered how Python processes your data behind the scenes? How are your variables stored in memory, and at what point are they deleted?
In this article, we dig into Python's internals to understand how memory management works.
After reading this article, you:
- Learn more about low-level operations, especially memory.
- Understand how Python abstracts low-level operations.
- Learn about memory management algorithms in Python.
Knowing Python's internals will give you a better understanding of how it behaves. I hope it lets you look at Python from a new perspective: behind the scenes, a great many operations take place so that your program just works.
Memory is an empty book
You can think of a computer's memory as an empty book waiting for many short stories to be written into it. There is nothing on its pages yet, but soon authors will come who want to write down their stories, and they will need space for them.
Since they cannot write one story on top of another, they have to be careful about which pages they write on. Before they start, they consult the book's manager, who decides where in the book each author may write their story.
The book has been around for many years, so many of the stories in it are outdated. When nobody reads a story or refers to it anymore, it is erased to make room for new ones.
At its core, computer memory is like an empty book. Continuous fixed-length memory blocks are usually called pages, so this analogy comes in handy.
The authors are the various applications or processes that need to store data in memory. The manager who decides where the authors may write plays the role of the memory manager, and the one who erases old stories is the garbage collector.
Memory management: from hardware to software
Memory management is the process by which software applications read and write data. The memory manager determines where to put an application's data. Since the amount of memory is finite, like the number of pages in the book, the manager has to find free space and hand it over to the application. This process is called memory allocation.
On the other hand, when data is no longer needed, it can be deleted; in that case we talk about freeing, or deallocating, memory. But freed from what, and returned where?
Somewhere inside the computer there is a physical device that stores data when you run Python programs. Python code goes through many levels of abstraction before it reaches this device.
One of the main levels lying above the hardware (RAM, hard disk, etc.) is the operating system. It manages requests to read and write to memory.
Above the operating system sits the application layer, where one of the implementations of Python runs (bundled with your OS or downloaded from python.org). Memory management for your code in this language is handled by dedicated Python machinery: the algorithms and structures Python uses to manage memory are the main topic of this article.
The default Python implementation
The default (reference) implementation of Python is CPython, and it is written in C.
I was very surprised when I first heard this. How can one language be written in another language?! Well, not literally, of course, but the idea is roughly that.
The Python language itself is defined in a reference manual written in English. However, that manual alone is not very useful: you still need something that interprets code written according to its rules.
And you will need something to run the code on your computer. A basic Python implementation ensures that both conditions are met. It converts Python code into instructions that are executed on a virtual machine.
Note: Virtual machines are like physical computers, but they are implemented in software. They process basic instructions similar to assembly code.
Python is an interpreted programming language. Your Python code is compiled into instructions that are closer to what the computer understands, called bytecode. These instructions are interpreted by the virtual machine when you run the code.
Have you ever seen files with the .pyc extension or a __pycache__ folder? That is the bytecode that the virtual machine interprets.
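If you are curious what these instructions look like, the standard dis module can show them. Here is a minimal sketch (the function itself is just an example):

import dis

def add(a, b):
    return a + b

# Print the bytecode instructions that the CPython virtual
# machine will interpret when add() is called.
dis.dis(add)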
It is important to understand that there are other implementations besides CPython. IronPython compiles down to run on Microsoft's Common Language Runtime (CLR). Jython compiles to Java bytecode to run on the Java Virtual Machine. And then there is PyPy, which deserves an article of its own, so I will only mention it in passing.
In this article, we will focus on memory management using CPython tools.
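If you want to check which implementation your interpreter is, the standard library can tell you (a small sketch using platform and sys):

import platform
import sys

# On the standard interpreter this prints "CPython".
print(platform.python_implementation())

# sys.implementation also names the running implementation.
print(sys.implementation.name, sys.version.split()[0])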
Warning: new Python versions keep coming out, and things may change in the future. At the time of writing, the latest version was Python 3.7.
Well, we have CPython, written in C, which interprets Python bytecode. How does this relate to memory management? To begin with, the algorithms and structures for memory management exist in the CPython code, in C. To understand these principles in Python, you need a basic understanding of CPython.
CPython is written in C, which in turn does not support object-oriented programming. Because of this, the CPython code has a rather interesting structure.
You must have heard that everything in Python is an object, even types such as int and str, for example. This is true at the CPython implementation level. There is a structure called PyObject that every object in CPython uses.
Note: A structure in C is a user-defined data type, which in itself groups various data types. You can draw an analogy with object-oriented languages and say that a structure is a class with attributes, but without methods.
PyObject is the progenitor of all objects in Python, containing just two things:
- ob_refcnt : reference count;
- ob_type : a pointer to the object's type.
The reference count is used for garbage collection. The type pointer points to another structure that describes a specific kind of Python object (such as dict or int).
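Both fields can be observed from Python itself: type() reflects ob_type, and sys.getrefcount() reads ob_refcnt (a minimal sketch; the exact counts depend on interpreter internals):

import sys

numbers = [1, 2, 3]

# ob_type: every object carries a pointer to its type.
print(type(numbers))            # <class 'list'>
print(type(42), type("text"))   # <class 'int'> <class 'str'>

# ob_refcnt: every object carries a reference count.
# getrefcount() itself adds one temporary reference to the number it reports.
print(sys.getrefcount(numbers))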
Each object type has its own object-specific memory allocator that knows how to obtain memory and store the object, and an object-specific deallocator that frees the memory once its contents are no longer needed.
There is one important factor to mention when talking about allocating and freeing memory. Memory is a shared computer resource, and rather unpleasant things can happen if two processes try to write data to the same memory location at the same time.
Global Interpreter Lock (GIL)
The GIL is one solution to the general problem of working with shared resources such as computer memory. When two threads try to change the same resource at the same time, they step on each other's heels. The result is a complete mess in memory, and neither thread finishes its work with the desired result.
Returning to the book analogy: suppose each of two authors decides that it is he who should write his story on the current page at this very moment. Each ignores the other's attempts and stubbornly writes on the page. The result is two stories on top of each other and a completely unreadable page.
The GIL solves this problem by locking the interpreter while a thread interacts with a shared resource, thus allowing one and only one thread to write to the allocated memory area at a time. When CPython allocates memory, it relies on the GIL to make sure it does so correctly.
This approach has many advantages and just as many drawbacks, so the GIL sparks heated debates in the Python community. To learn more about the GIL, I suggest reading this article.
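One common way to see the GIL at work (a rough sketch; exact timings depend on your machine) is to compare a CPU-bound function run twice in sequence with the same work split across two threads. Because only one thread executes Python bytecode at a time, the threaded version gains little or nothing:

import threading
import time

def countdown(n):
    # Pure-Python, CPU-bound loop: the GIL lets only one thread
    # run this bytecode at any given moment.
    while n > 0:
        n -= 1

N = 5_000_000

start = time.perf_counter()
countdown(N)
countdown(N)
print("sequential :", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
threads = [threading.Thread(target=countdown, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", round(time.perf_counter() - start, 2), "s")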
Garbage collector
Let us return to our book analogy and imagine that some of the stories in it are hopelessly outdated. Nobody reads them or refers to them anymore. In that case, the natural thing to do is to get rid of them, freeing up space for new stories.
Such old, unused stories can be compared to Python objects whose reference count has dropped to 0. Remember that every object in Python has a reference count and a pointer to its type.
The reference count can increase for several reasons. For example, it increases when you assign one variable to another variable.
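A minimal sketch (the variable names are just illustrative):

numbers = [1, 2, 3]
# The list object now has one reference: the name `numbers`.

more_numbers = numbers
# Assignment does not copy the list; it creates a second reference
# to the same object, so its reference count goes up.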

It also increases when you pass the object as an argument.
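For example (a small sketch with a hypothetical helper function):

def print_total(items):
    # For the duration of the call, the parameter `items` holds an
    # extra reference to the object that was passed in.
    print(sum(items))

numbers = [1, 2, 3]
print_total(numbers)   # the list's reference count is higher while the call runs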

Finally, the reference count also increases when you include the object in a list.
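A minimal sketch:

numbers = [1, 2, 3]

# Storing the object inside another container keeps one more
# reference to it, so its reference count increases again.
matrix = [numbers, numbers]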

Python lets you check the current reference count with the sys module: call sys.getrefcount(numbers), but keep in mind that calling getrefcount() itself increases the reference count by one, since the argument is one more temporary reference.
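Putting it together (a small sketch; the exact numbers you see may differ slightly between interpreter versions):

import sys

numbers = [1, 2, 3]
# One reference from `numbers` plus the temporary one created by
# passing the list to getrefcount() itself.
print(sys.getrefcount(numbers))   # typically 2

more_numbers = numbers
print(sys.getrefcount(numbers))   # one higher after the extra assignment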
In any case, as long as an object is still needed in your code, its reference count stays greater than 0. When it drops to zero, a dedicated deallocation routine is invoked that frees the memory and makes it available to other objects.
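You can watch this happen with the weakref module, since a weak reference does not keep an object alive (a minimal sketch with an illustrative class):

import weakref

class Story:
    pass

story = Story()

# Register a callback that fires when the object is deallocated.
weakref.finalize(story, print, "story deallocated")

del story   # the reference count drops to zero and the message prints immediately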
But what does “free up memory” mean and how do other objects use it? Let's dive directly into memory management in CPython.
CPython memory management
In this part, we dive into CPython's memory architecture and the algorithms by which it operates.
As mentioned earlier, there are layers of abstraction between the physical hardware and CPython. The operating system (OS) abstracts the physical memory and creates a virtual memory layer that applications, including Python, can access.
The operating system's virtual memory manager carves out a dedicated area of that memory for the Python process, and the process works within this area.

Python uses part of that memory for internal needs and non-object memory. The other part is dedicated to object storage (your int, dict, and so on). I am simplifying here, but you can look right under the hood, that is, into the CPython source code, and see how it all works in practice.
CPython has an object allocator responsible for allocating memory within the object memory area. This is where all the magic happens: it is called every time a new object needs to acquire or release memory.
Typically, adding or removing Python data such as an int or a list does not involve much data at a time. That is why the allocator is tuned for working with small amounts of data at a time. It also tries not to allocate memory until it is absolutely necessary.
The comments in the source code describe it as a fast, special-purpose memory allocator for small blocks that is used on top of the general-purpose malloc. In C, malloc is the general-purpose function used for memory allocation.
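CPython even lets you peek at the allocator's bookkeeping from Python, via the CPython-specific helper sys._debugmallocstats(), which dumps block, pool and arena statistics per size class to stderr (a one-line sketch; other implementations may not provide it):

import sys

# CPython-specific: print pymalloc statistics (blocks, pools and
# arenas per size class) to stderr.
sys._debugmallocstats()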
Now let's take a look at the memory allocation strategy in CPython. To begin, let's talk about the three main parts and how they relate to each other.
Arenas are the largest memory areas; they are aligned on page boundaries in memory. A page boundary is the edge of a fixed-length contiguous block of memory used by the OS. Python assumes the system page size to be 256 KB.

Inside arenas are pools, each the size of one virtual memory page (4 KB). They are like the pages in our analogy. Pools are in turn divided into even smaller pieces of memory called blocks.
All blocks within a pool belong to the same size class. A size class determines which block size is used for a given amount of requested data. The gradation in the table below is taken directly from comments in the source code:

Request in bytes    Size of allocated block    Size class idx
1-8                 8                          0
9-16                16                         1
17-24               24                         2
25-32               32                         3
33-40               40                         4
41-48               48                         5
...                 ...                        ...
505-512             512                        63
For example, if 42 bytes are needed, then the data will be placed in a block of 48 bytes in size.
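The rounding rule is simple: a request is rounded up to the next multiple of 8 bytes, and anything over 512 bytes bypasses the small-object allocator. A small illustrative sketch (not CPython's actual code, although the constant names mirror those in the source):

ALIGNMENT = 8                    # block sizes are multiples of 8 bytes
SMALL_REQUEST_THRESHOLD = 512    # larger requests go to the general-purpose allocator

def block_size(request):
    """Return the block size a request of `request` bytes would land in,
    or None if it is too large for the small-object allocator."""
    if request == 0 or request > SMALL_REQUEST_THRESHOLD:
        return None
    # Round up to the next multiple of 8.
    return (request + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT

print(block_size(42))    # 48
print(block_size(8))     # 8
print(block_size(600))   # None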
Pools
Pools are made up of blocks of a single size class. Each pool is linked to other pools of the same size class in a doubly linked list, so the algorithm can easily find free space for a block of the required size, even across multiple pools.
The usedpools list keeps track of all the pools that still have free space for data of each size class. When a block of a given size is requested, the algorithm checks this list for a pool that can hold it.
Pools can be in three states: used, full, and empty. A used pool has blocks that can still hold data. In a full pool, all blocks are allocated and already contain data. An empty pool holds no data and can be assigned to whatever size class is needed.
The freepools list, accordingly, contains all the pools in the empty state. But when do they come into play?
Say your code needs an 8-byte chunk of memory. If there is no pool with an 8-byte size class in usedpools, a new empty pool is initialized to store 8-byte blocks. That pool is then added to usedpools and can serve subsequent requests.
When a full pool frees some of its blocks because their data is no longer needed, it is put back onto usedpools for its size class. You can see how pools change state, and even size class, as the algorithm runs.
Blocks
Pools keep pointers to their free memory blocks, and there is a small nuance here. According to the comments in the source code, the allocator strives, at all levels (arena, pool, and block), never to touch a piece of memory until it is actually needed.
This means that a block can be in one of three states:
- Untouched : memory areas that have not been allocated yet;
- Free : memory areas that were allocated but later released by CPython because they no longer held relevant data;
- Allocated : memory areas that currently hold relevant data.
The freeblock pointer points to a singly linked list of free memory blocks, that is, a list of places where data can be written. If more memory is needed than the free blocks provide, the allocator starts using the untouched blocks in the pool.
As the memory manager frees blocks, they are added to the front of the freeblock list. The actual list is not necessarily a contiguous run of memory blocks; freed blocks may be scattered across the pool.
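The idea of the free list is easy to model in a few lines of Python (a toy sketch, not CPython's C implementation): freeing pushes a block onto the front of a singly linked list, and allocating pops from the front, so the list need not be a contiguous run of blocks:

class FreeList:
    """Toy model of a pool's singly linked list of free blocks."""

    def __init__(self, num_blocks):
        self.next_link = [None] * num_blocks  # per-block "next free" pointer
        self.head = None                      # index of the first free block

    def free(self, block):
        # A freed block is pushed onto the front of the list.
        self.next_link[block] = self.head
        self.head = block

    def allocate(self):
        # Allocation pops the block at the front of the list.
        if self.head is None:
            return None  # no freed blocks left: fall back to untouched blocks
        block = self.head
        self.head = self.next_link[block]
        return block

pool = FreeList(num_blocks=5)
pool.free(3)
pool.free(1)              # block 1 is now at the front, pointing to block 3
print(pool.allocate())    # 1
print(pool.allocate())    # 3
print(pool.allocate())    # None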
Arenas
Arenas contain pools. Unlike pools, arenas do not have explicit states.
They are organized into a doubly linked list called usable_arenas. The list is sorted by the number of free pools: the fewer free pools an arena has, the closer it is to the front of the list.

This means that the fullest arena will be chosen for writing new data. But why? Why not write data where there is the most free space?
This brings us to the idea of truly freeing memory. When memory is "freed", it is often still unavailable to the operating system: the Python process keeps it allocated and reuses it later for new data. Truly freeing memory means returning it to the operating system.
Arenas are the only areas that can be truly freed in this way. So it makes sense to let the arenas that are closest to being empty actually become empty: then their memory can be released entirely, and the overall memory footprint of your Python program shrinks.
Conclusion
Memory management is one of the most important aspects of working with a computer, and Python performs almost all of it behind the scenes.
From this article, you learned:
- What is memory management and why is it important?
- What is CPython, the basic implementation of Python;
- What data structures and algorithms CPython uses to manage memory and store your data.
Python abstracts away many of the small details of working with a computer. This lets you work at a higher level and spares you the headache of deciding how and where your program's bytes are stored.
That is what we learned about memory management in Python. As always, we look forward to your comments, and we invite you to the open day of the Python Developer course, which will be held on March 13.