Everything you need to know about the Python garbage collector

As a rule, you do not need to worry about the garbage collector and working with memory when you write code in Python. Once objects are no longer needed, Python automatically frees memory from under them. Despite this, understanding how GC works will help you write better code.

Memory manager

Unlike other popular languages, Python does not free all memory back to the operating system as soon as it deletes an object. Instead, it uses an additional memory manager designed for small objects (the size of which is less than 512 bytes). To work with such objects, it allocates large blocks of memory, in which many small objects will be stored in the future.

As soon as one of the small objects is removed - the memory does not pass from under it to the operating system, Python leaves it for new objects with the same size. If there are no objects left in one of the allocated memory blocks, then Python can release it to the operating system. As a rule, the release of blocks happens when the script creates a lot of temporary objects.

Thus, if a long-lived Python process begins to consume more memory over time, this does not mean at all that your code has a memory leak problem. If you want to learn more about the memory manager in Python, you can read about it in my other article (eng.).
')

Garbage collection algorithms

The standard python interpreter (CPython) uses two algorithms at once, reference counting and the generational garbage collector (hereinafter GC), better known as the standard gc module from Python.

The reference counting algorithm is very simple and efficient, but it has one major drawback. He does not know how to define circular references. It is because of this, in python, there is an additional collector, referred to as a generational GC, which tracks objects with potential circular references.

In Python, the reference counting algorithm is fundamental and cannot be disabled, whereas GC is optional and can be disabled.

Link counting algorithm

The reference counting algorithm is one of the easiest techniques for garbage collection. Objects are deleted as soon as they are no longer referenced.

In Python, variables do not store values, but act as references to objects. That is, when you assign a value to a new variable, an object is first created with this value, and only then the variable begins to refer to it. A set of variables can refer to one object.

Each object in Python contains an additional field (reference count) that stores the number of links to it. As soon as someone refers to an object, this field is incremented by one. If for some reason the link disappears, then this field is reduced by one.

Examples when the number of links increases:

assignment operator
passing arguments
insert a new object into the list (the number of links for the object increases)
construction of the form foo = bar (foo starts referring to the same object as bar)

As soon as the reference count for a particular object reaches zero, the interpreter starts the process of destroying the object. If the deleted object contained references to other objects, these links are also deleted. Thus, the deletion of one object may entail the removal of others.

For example, if a list is deleted, the reference count in all its elements is reduced by one. If all objects inside the list are not used anywhere else, they will also be deleted.

Variables that are declared outside functions, classes, and blocks are called global. Typically, the life cycle of such variables is equal to the life of the Python process. Thus, the number of references to objects referenced by global variables never drops to zero.

Variables that are declared inside a block (function, class) have local visibility (i.e., they are visible only inside the block). As soon as the python interpreter leaves the block, it destroys all references created by local variables inside it.

You can always check the number of links using the sys.getrefcount function.

An example of the reference counter operation:

 foo = [] # 2 ,    foo    getrefcount print(sys.getrefcount(foo)) def bar(a): # 4  #  foo,   (a), getrefcount       print(sys.getrefcount(a)) bar(foo) # 2 ,      print(sys.getrefcount(foo))

The main reason why the standard interpreter (CPython) uses a reference count is historical. There are currently many debates about this approach. Some people believe that a garbage collector can be much more efficient without reference counting. This algorithm has many problems, such as circular references, blocking threads, as well as additional overhead for memory and cpu.

The main advantage of this algorithm is that objects are deleted as soon as they are not needed.

Optional garbage collector

Why do we need an additional algorithm when we already have a reference count?

Unfortunately, the classic reference counting algorithm has one major drawback - it does not know how to find circular references. Cyclic links occur when one or more objects link to each other.

Two examples:

As you can see, the lst object refers to itself, while object1 and object2 refer to each other. For such objects, the reference count will always be 1.

Python demo:

 import gc #  ctypes        class PyObject(ctypes.Structure): _fields_ = [("refcnt", ctypes.c_long)] gc.disable() #   GC lst = [] lst.append(lst) #    lst lst_address = id(lst) #   lst del lst object_1 = {} object_2 = {} object_1['obj2'] = object_2 object_2['obj1'] = object_1 obj_address = id(object_1) #   del object_1, object_2 #          # gc.collect() #    print(PyObject.from_address(obj_address).refcnt) print(PyObject.from_address(lst_address).refcnt)

In the example above, the del instruction removes references to our objects (not the objects themselves). As soon as Python executes the del instruction, these objects become inaccessible from the Python code. However, with the gc module turned off, they will still remain in memory, since they had circular references and their counter is still equal to one. You can visually explore such links using the objgraph library.

In order to fix this problem, an additional algorithm was added in Python 1.5, known as the gc module . The only task of which is the removal of cyclic objects that are no longer accessible from the code.

Cyclic links can occur only in “container” objects. Those. in objects that can store other objects, such as lists, dictionaries, classes, and tuples. GC does not keep track of simple and immutable types, with the exception of tuples. Some tuples and dictionaries are also excluded from the list of surveillance when certain conditions are met. All other objects are guaranteed to handle reference counting.

When GC is triggered

Unlike the reference counting algorithm, cyclic GC does not work in real time and runs periodically. Each launch of the collector creates micropauses in the code, so CPython (the standard interpreter) uses different heuristics to determine the frequency at which the garbage collector runs.

The cyclic garbage collector divides all objects into 3 generations (generations). New objects fall into the first generation. If a new object survives the garbage collection process, then it moves to the next generation. The higher the generation, the less often it is scanned for rubbish. Since new facilities often have a very small lifespan (are temporary), it makes sense to interview them more often than those that have already gone through several stages of garbage collection.

In each generation there is a special counter and a trigger threshold, upon reaching which the process of garbage collection is triggered. Each counter stores the number of allocations minus the number of deallocations in this generation. As soon as any container object is created in Python, it checks these counters. If the conditions work, then the process of garbage collection begins.

If several or more generations crossed the threshold at once, then the older generation is chosen. This is due to the fact that older generations also scan all previous ones. To reduce the number of garbage collection pauses for long-lived objects, the oldest generation has an additional set of conditions .

The default thresholds for generations are set to 700, 10 10 respectively, but you can always change them using the gc.get_threshold gc.set_threshold .

Cycle search algorithm

For a full description of the cycle search algorithm, a separate article is required. In short, the GC iterates each object from the selected generations and temporarily removes all references from a single object (all references to which this object refers). After a full pass, all objects whose reference count less than two is considered inaccessible from python and can be deleted.

For a deeper understanding, I recommend reading (approx. Translator: English material) the original description of the algorithm from Neil Schemenauer and the collect function from CPython sources . A description from Quora and a post about a garbage collector can also be useful.

It should be noted that the problem with destructors described in the original description of the algorithm has been fixed since Python 3.4 (more details in PEP 442 ).

Optimization Tips

Loops often occur in real-world tasks, they can be found in problems with graphs, linked lists, or in data structures where you need to keep track of the relationships between objects. If your program has a high load and is demanding for delays, then, if possible, cycles are best avoided.

In places where you deliberately use circular links, you can use "weak" links. Weak links are implemented in the weakref module and, unlike regular links, have no effect on the link count. If an object with weak links is deleted, then None returned instead.

In some cases, it is useful to disable the automatic assembly module gc and call it manually. It is enough to call gc.disable() and then call gc.collect() manually.

How to find and debug circular references

Debugging loops can be painful, especially if your code uses a lot of third-party modules.

The gc module provides helper functions that can help debug. If the GC parameters are set to the DEBUG_SAVEALL flag, then all inaccessible objects will be added to the gc.garbage list.

 import gc gc.set_debug(gc.DEBUG_SAVEALL) print(gc.get_count()) lst = [] lst.append(lst) list_id = id(lst) del lst gc.collect() for item in gc.garbage: print(item) assert list_id == id(item)

As soon as you identify the problem area, it can be visualized using objgraph .

Conclusion

The main garbage collection process is performed by a reference counting algorithm, which is very simple and has no settings. An additional algorithm is used only for finding and deleting objects with circular references.

You should not deal with premature code optimization for the garbage collector, in practice, serious problems with garbage collection are quite rare.

PS: I am the author of this article, you can ask any questions.

Source: https://habr.com/ru/post/417215/

All Articles