I have used Python more than any other programming language over the past 4-5 years. Python is the predominant language for Firefox builds, testing, and CI tooling. Mercurial is also mostly written in Python. I have written many of my own side projects in it as well.
Over that time I have accumulated some knowledge about Python performance and about its optimization tools. In this article I would like to share that knowledge.
My experience with Python mostly concerns the CPython interpreter, especially CPython 2.7. Not all of my observations apply to all Python distributions, nor do they hold equally across Python versions. I will try to call this out along the way. Remember that this article is not a detailed review of Python performance; I will only cover what I have run into myself.
Overhead of interpreter startup and module import
Starting the Python interpreter and importing modules takes a long time when measured in milliseconds.
If one of your projects needs to launch hundreds or thousands of Python processes, those milliseconds of overhead add up to delays of whole seconds.
If you use Python to implement CLI tools, the overhead can cause a noticeable lag for the user. If you need CLI tools to respond instantly, launching a Python interpreter on every invocation makes that goal much harder to reach.
I have written about this issue before; several of my past posts discuss it, for example in 2014, in May 2018, and in October 2018.
There is not much you can do to reduce the startup delay: fixing it means changing the Python interpreter itself, since it controls the execution of the code that takes too long. The best you can do is disable the import of the site module (for example, by passing -S to the interpreter) to avoid executing unnecessary Python code during startup. On the other hand, many applications rely on the functionality of site.py, so disable it at your own risk.
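As a rough illustration, here is a minimal sketch for feeling the startup cost on your own machine (the exact numbers vary widely by machine and Python version; the python3 executable name on PATH is an assumption):

import subprocess
import time

# Time a full "do nothing" interpreter startup, with and without
# importing the site module (-S).
for args in (["python3", "-c", "pass"], ["python3", "-S", "-c", "pass"]):
    start = time.perf_counter()
    subprocess.run(args, check=True)
    print(args, f"{time.perf_counter() - start:.3f}s")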
We should also consider the cost of importing modules. An interpreter is not much use if it does not run any code, and code usually reaches the interpreter through imported modules.
Importing a module takes several steps, and each of them is a potential source of overhead and delay.
Some delay comes from locating modules and reading their data. As I demonstrated with PyOxidizer, replacing filesystem-based module lookup and loading with an architecturally simpler solution, reading module data from an in-memory data structure, lets you import the entire Python standard library in 70-80% of the usual time. Having one module per file on the filesystem increases filesystem load and can slow a Python application down during the critical first milliseconds of execution. Solutions like PyOxidizer help avoid this. I hope the Python community sees the costs of the current approach and considers moving toward module distribution mechanisms that do not depend so heavily on a separate file per module.
Another source of import overhead is executing code in the module during import. Some modules have code outside of their functions and classes that runs when the module is imported, and executing it makes the import more expensive. The workaround: don't execute all the code at import time; run it only when it is needed. Python 3.7 supports a module-level __getattr__, which is called when an attribute of the module is not found. It can be used to lazily populate module attributes on first access.
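Here is a minimal sketch of that idea (PEP 562, Python 3.7+), assuming a hypothetical module with an expensive-to-build settings attribute:

# mymodule.py -- lazy attribute population via module-level __getattr__.
# The `settings` attribute and _load_settings() are hypothetical examples.

def _load_settings():
    # Imagine expensive work here: importing a heavy dependency,
    # parsing a config file, etc. It runs only on first access.
    return {"debug": False}

def __getattr__(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "settings":
        value = _load_settings()
        globals()[name] = value  # cache it so __getattr__ is not hit again
        return value
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")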
Another way to avoid import slowdowns is to import modules lazily. Instead of loading a module directly when it is imported, you register a custom importer that returns a stub instead. On first access, the stub loads the real module and "mutates" into it.
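The standard library ships a building block for this: importlib.util.LazyLoader defers a module's execution until first attribute access. A minimal sketch, following the recipe in the importlib documentation:

import importlib.util
import sys

def lazy_import(name):
    # Resolve the module's spec now, but defer executing its code
    # until the first attribute access on the returned module object.
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

json = lazy_import("json")   # nothing in json has executed yet
print(json.dumps({"a": 1}))  # first attribute access triggers the real import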
In an application that imports several dozen modules, bypassing the filesystem and skipping execution of unneeded parts of modules can save tens of milliseconds (modules are usually imported wholesale, even when only a few of their functions are actually used).
Lazy module importing is fragile, though. Many modules use patterns like the following:

try:
    import foo
except ImportError:
    foo = None  # fall back to an alternative, or disable the feature
A lazy importer may never raise ImportError, because doing so would require checking the filesystem to see whether the module exists at all, and that adds exactly the workload and time cost lazy importers exist to avoid, so as a rule they simply don't! This problem is rather unpleasant. Mercurial's lazy module importer maintains a list of modules that cannot be imported lazily and has to bypass them. Another problem is the syntax from foo import x, y, which also defeats lazy importing when foo is a module (as opposed to a package), because the module must be imported immediately to return references to x and y.
PyOxidizer has a fixed set of modules embedded in the binary, so it can raise ImportError efficiently. The module-level __getattr__ from Python 3.7 gives lazy module importers additional flexibility. I hope to integrate a reliable lazy importer into PyOxidizer to automate some of this.
The best way to avoid interpreter startup delays entirely is to run a background Python process. If you run Python as a daemon process, say for a web server, this comes naturally. Mercurial's solution is a background process that speaks a command server protocol: hg is a C (now Rust) executable that connects to this background process and sends it commands. Getting the command server approach right takes a lot of work; it is extremely fragile and has security problems. I am considering the idea of shipping a command server via PyOxidizer, so the executable keeps its startup advantages and the engineering cost of the solution is absorbed by the PyOxidizer project.
Function call overhead
Calling functions in Python is relatively slow. (This observation applies less to PyPy, which JIT-compiles code.)
I have seen dozens of patches to Mercurial that inlined and combined code specifically to avoid function call overhead. In the current development cycle, some effort went into reducing the number of functions called when updating the progress bar. (We use progress bars for any operation that may take a while, so the user understands what is happening.) Caching function call results and avoiding trivial pass-through functions saves tens to hundreds of milliseconds when we are talking about a million executions.
If you have hot loops or recursive functions in Python where hundreds of thousands or more function calls can occur, you should be aware of the overhead of a single call, because it adds up. Keep in mind the existence of built-in functions and the possibility of combining functions to avoid the overhead.
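A small sketch for measuring this yourself with the standard timeit module (the add function is a made-up example; absolute numbers vary by machine and CPython version):

import timeit

def add(a, b):
    return a + b

# The same arithmetic done one million times, with and without a
# function call wrapped around it.
with_call = timeit.timeit("add(a, b)",
                          setup="from __main__ import add; a, b = 1, 2",
                          number=1_000_000)
inlined = timeit.timeit("a + b", setup="a, b = 1, 2", number=1_000_000)
print(f"with call: {with_call:.3f}s, inlined: {inlined:.3f}s")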
Attribute lookup overhead
This problem is similar to function call overhead; in fact, the meaning is almost the same!
Resolving attributes in Python can be slow. (Again, this is faster in PyPy.) And again, dealing with this problem is something we do a lot in Mercurial.
Suppose you have the following code:
obj = MyObject()
total = 0
for i in range(len(obj.member)):
    total += obj.member[i]
Setting aside that there are more efficient ways to write this example (for instance, total = sum(obj.member)), note that the loop has to resolve obj.member on every iteration. Python has a relatively complex mechanism for resolving attributes. For simple types it can be quite fast. But for complex types, attribute access can implicitly invoke __getattr__, __getattribute__, various other dunder methods, and even user-defined @property functions. What looks like a single attribute lookup can turn into several function calls, which adds overhead. And that overhead compounds when you use chains like obj.member1.member2.member3, and so on.
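For instance, here is a hypothetical MyObject where member is implemented as a @property, so every obj.member in the loop above silently runs a method call:

class MyObject:
    def __init__(self):
        self._member = list(range(100))

    @property
    def member(self):
        # Every single `obj.member` access executes this method.
        return self._member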
Every attribute resolution costs something. And since almost everything in Python is a dictionary, almost every attribute lookup is a dictionary lookup. We know from basic data structures that a dictionary lookup is not as fast as, say, following a pointer. Yes, CPython has a few tricks to shave off dictionary-lookup overhead. But the main point I want to make is that any attribute lookup is a potential performance leak.
For hot loops, especially those that may run hundreds of thousands of iterations or more, you can avoid this measurable attribute lookup cost by aliasing the value to a local variable. Consider the following example:
obj = MyObject()
total = 0
member = obj.member
for i in range(len(member)):
    total += member[i]
Of course, this is only safe if obj.member is not reassigned inside the loop. If it is, the local alias will keep a reference to the old value and things can blow up.
The same trick works for method calls. Instead of:

obj = MyObject()
for i in range(1000000):
    obj.process(i)
you can do the following:

obj = MyObject()
fn = obj.process
for i in range(1000000):
    fn(i)
It is also worth noting that when an attribute lookup resolves to a method that is then called (as in the example above), Python 3.7 is noticeably faster than previous releases. But I am fairly sure the overhead there is connected, first of all, with the function call rather than with the attribute lookup itself. So everything will still run faster if you drop the redundant attribute lookups.
Finally, since attribute lookups usually end up calling functions, attribute lookup is generally less of a problem than function call overhead. To notice a significant speedup, you typically have to eliminate a great many attribute lookups. That said, once you start accessing attributes inside a loop, you may easily hit 10 or 20 lookups in the loop body alone before any function is even called, and loops of merely thousands or tens of thousands of iterations quickly produce hundreds of thousands or millions of attribute lookups. So be careful!
Object overhead
From the Python interpreter's point of view, every value is an object. In CPython, each value is a PyObject structure. Every object managed by the interpreter lives on the heap and has its own memory holding a reference count, the object's type, and other bookkeeping. Every object goes through the garbage collector. Each new object therefore adds overhead from reference counting, garbage collection, and so on. (Again, PyPy can avoid much of this overhead, since it tracks the lifetime of short-lived values more carefully.)
As a rule, the more unique values and Python objects you create, the slower everything works.
Say you iterate over a collection of one million objects, calling a function on each and collecting the results into a tuple:

for x in my_collection:
    a, b, c, d, e, f, g, h = process(x)
In this example, process() returns an 8-tuple. Whether or not we discard the return value, this tuple requires at least 9 values to be created in Python: 1 for the tuple itself and 8 for its members. In real life there may be fewer, if process() returns references to existing objects, or, on the contrary, more, if the members are not simple types and require several PyObjects each to represent. The point is that under the hood the interpreter is juggling real objects to represent even simple structures.
From my own experience, this overhead only matters for operations that would otherwise be fast when implemented in a native language such as C or Rust. The issue is that the CPython interpreter simply cannot execute bytecode fast enough for the sheer number of objects to be the bottleneck: you will most likely lose performance to function calls or heavy computation long before the object overhead becomes noticeable. There are, of course, a few exceptions, such as constructing many tuples or dictionaries of several values.
As a concrete example of this overhead: Mercurial has C code that parses low-level data structures. The raw parsing runs an order of magnitude faster in C than in CPython. But as soon as the C code creates PyObjects to represent the results, the speed drops several times over. In other words, the overhead lies in creating and managing the Python objects so the results can be used from Python code.
The way around this problem is to produce fewer Python objects. If you only need a single value, have the function return it rather than a tuple or a dictionary of N elements. But don't stop watching out for function call overhead while you do!
If you have a lot of performance-sensitive code using the CPython C API, and values that need to be passed between different modules, skip defining Python types for that data: represent it as plain C structures and use already-compiled code to access them, instead of going through the CPython C API. By avoiding the CPython C API for data access, you get rid of a large amount of unnecessary overhead.
Treating values as data, instead of having functions that access everything under the sun, would be the more Pythonic approach anyway. Another workaround for compiled code is to create PyObject instances lazily. If you create a custom Python type (PyTypeObject) to represent complex values, you define tp_members or tp_getset fields to install custom C functions for attribute lookup. If you are, say, writing a parser and you know consumers will only access a subset of the parsed fields, you can quickly create a type holding the raw data, return instances of that type, and have a C attribute-lookup function build PyObjects on demand. You can even postpone parsing until the lookup is called, saving resources in case the parse is never needed! This technique is quite rare because it requires writing non-trivial code, but it pays off.
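The same lazy-materialization idea can be sketched at the Python level (the C API version would use tp_getset slots instead). The RawRecord name and field layout here are hypothetical; functools.cached_property requires Python 3.8+:

from functools import cached_property

class RawRecord:
    def __init__(self, raw: bytes):
        self._raw = raw  # keep the unparsed bytes; do no work up front

    @cached_property
    def header(self):
        # Parsed only on first access, then cached on the instance.
        return self._raw[:4].decode("ascii")

    @cached_property
    def body(self):
        return self._raw[4:]

record = RawRecord(b"HDR1payload")
print(record.header)  # parsing happens here, not at construction time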
Pre-sizing collections
This one concerns the CPython C API. When creating collections such as lists or dictionaries whose size is already known at creation time, use PyList_New() + PyList_SET_ITEM() to populate the new collection. This pre-sizes the collection so it can hold the final number of elements, and skips the capacity checks that happen when items are inserted one at a time. When creating collections of thousands of elements, this saves some real resources!
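The same idea can be expressed in pure Python, where pre-allocating the final size avoids repeated list growth (a sketch for illustration; the C API version uses PyList_New + PyList_SET_ITEM):

def squares_presized(n):
    out = [None] * n       # one allocation at the final size
    for i in range(n):
        out[i] = i * i     # fill by index; no resizing needed
    return out

def squares_appended(n):
    out = []
    for i in range(n):
        out.append(i * i)  # may trigger periodic reallocation
    return out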
Using zero-copy in the C API
The Python C API is really more inclined to copy objects than to return references to them. For example, PyBytes_FromStringAndSize() copies the char* into memory managed by Python. If you do this for a large number of values or for large buffers, you could be talking about gigabytes of memory I/O and the associated allocator overhead.
If you need to write high-performance code against the C API, you should familiarize yourself with the buffer protocol and the corresponding types, such as memoryview. The buffer protocol is built into Python's type system and lets objects expose their underlying memory: it allows C code to obtain a void* handle of a certain size, effectively associating any address in memory with a PyObject. Many functions that work with binary data transparently accept any object implementing the buffer protocol. And if you want to accept any object that can be treated as bytes, use the s*, y*, or w* format units when parsing function arguments.
By using the buffer protocol, you give the interpreter the best chance to use zero-copy operations and to avoid copying extra bytes around in memory. By using the Python memoryview type, you also let Python code reference a region of memory instead of creating copies of it.
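A small sketch of the difference, using nothing but built-ins: slicing a bytes object copies data, while slicing a memoryview of it only creates a new view onto the same buffer:

data = bytes(range(256)) * 4096      # ~1 MiB of sample data

copied = data[16:1024]               # allocates and copies 1008 bytes
view = memoryview(data)
zero_copy = view[16:1024]            # no copy; references the original buffer

assert bytes(zero_copy) == copied    # converting back to bytes does copy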
If you push gigabytes of data through your Python program, thoughtful use of Python types that support zero-copy will make a measurable performance difference. I once noticed that python-zstandard appeared faster than some Python LZ4 bindings (even though it should be the other way around) because I used the buffer protocol extensively and avoided excessive memory I/O in python-zstandard!
Conclusion
In this article I have tried to cover some of the things I learned while optimizing Python programs over several years. To repeat: it is by no means a comprehensive review of Python performance improvement techniques. I admit that my use of Python is probably more demanding than most, and my recommendations will not apply to every program.
You should by no means rush off to mass-edit your Python code and remove, say, every attribute lookup after reading this article. As always with performance optimization, fix the places where the code is demonstrably slow first. I highly recommend py-spy for profiling Python applications.