Python extension modules on Rust

“Absolute statements are the root of all evil.
The key is balance. There are no answers, only questions. ”
????

The author of the article: zolkko .

Optimization

When they talk about optimization in the context of software, they often mean optimization of the programmer’s performance and / or optimization of the software itself.
')
Based on the YAGNI principle, Python allows the programmer to focus on software implementation, eliminating the need to take care of low-level things: regions of memory in which objects are allocated, freeing up memory, calling conventions.

Simon Jones pointed to the reverse problem in one of his Haskell lectures . He had a slide with a gradient-colored arrow. In the beginning was written “no types”, in the middle - “Haskell”, at the end - “Coq”. Pointing to Coq, he said: “This stresses power over usability. Right ?! You need a PhD here! ”[1]. Despite the fact that it was a joke, the Python mantra is one of the programmers' favorite features of this language. In my experience, this is what allows you to produce a finished product a little faster.

As for software optimization, different sources say this differently, but for myself I divide it into three levels:

at the level of architecture
high-level algorithms and data structures
low level

An interesting feature here is this: the higher the level at which the optimization is carried out, the more effective it is. Usually so. On the other hand, the more we optimize at a higher level, the earlier it needs to be done: of course, at the end of the project it is more difficult to redo the application architecture. In addition, it is problematic to identify in advance where the bottleneck will be, and in general I want to avoid premature optimizations, because in case of a change in the requirements, it becomes more difficult to change the software.

Run Level Optimization

Probably the most logical and correct (in terms of complexity) strategy for low-level optimization of the Python code is the use of special tools such as PyPy, Pyston and others. This is due to the fact that the frequently used Cpython code is already optimal, and an attempt to add any line will, rather, result in performance degradation. In addition, it is impossible to use classical optimization methods due to the dynamic typing of Python.

This problem was also noted by Kevin Modzelewski at Pyston Talk 2015 [2]. According to him, you can count on about 10% runtime. By combining different techniques - JIT, tracing JIT, heuristic analysis, Pyston - you can achieve a 25% increase in productivity.
And here is one benchmark chart taken from his report:

The graph shows that at some point PyPy becomes 38 times slower than normal Cpython. The result suggests that, using such tools, it is necessary to measure performance. And it should be done on real data, in conditions close to real conditions of software execution. And it is desirable to carry out such an exercise every time the versions of interpreters are updated. Here you can quote: ”[3].

Source Code Optimization

A similar problem can also be revealed when optimizing at the language level through the use of idiomatic productive code. To illustrate, I give an example of a small program (not quite idiomatic [4]), which defines a list of words and three functions that convert it into a list of words from capital letters:

LST = list(map(''.join, product('abc', repeat=10))) def foo(): return map(str.upper, LST) def bar(): res = [] for i in LST: res.append(i.upper()) return res def baz(): return [i.upper() for i in LST]

In it, three logically equivalent functions differ semantically and throughput. At the same time, performance semantics say nothing. In any case, for the inexperienced Python programmer - while 1: pass vs while: True: pass is magic that runs the risk of becoming a hoax when switching to Python 3.

CPython Modules

Another option for low-level Python optimization is expansion modules. By transferring part of the logic to an expansion module, in some cases you can achieve good performance with a predictable result.

Tools

Many of the available Python-tools offer a variety of functionality, ranging from generating code for CUDA and ending with transparent integration with numpy or C ++. However, below I will consider their behavior only in the context of writing extension modules on a specially selected boundary example:

 def add_mul_two(a, b): acc = 0 i = 0 while i < 1000: acc += a + b i += 1 return acc

As you can see, CPython takes it literally:

 12 SETUP_LOOP 40 (to 55) 15 LOAD_FAST 3 (i) 18 LOAD_CONST 2 (1000) 21 COMPARE_OP 0 (<) 24 POP_JUMP_IF_FALSE 54 27 LOAD_FAST 2 (acc) 30 LOAD_FAST 0 (a) 33 LOAD_FAST 1 (b) 36 BINARY_ADD 37 INPLACE_ADD 38 STORE_FAST 2 (acc) 41 LOAD_FAST 3 (i) 44 LOAD_CONST 3 (1) 47 INPLACE_ADD 48 STORE_FAST 3 (i) 51 JUMP_ABSOLUTE 15 54 POP_BLOCK

You can correct the situation by writing the simplest extension module in C.
To do this, determine the minimum initialization function of the module:

 // example.c void initexample(void) { Py_InitModule("example", NULL); }

This function is called so, because in fact the execution of the import statement ...

 import example IMPORT_NAME 0 (example) STORE_FAST 0 (example)

... will lead to fulfillment ...

 // ceval.c ... w = GETITEM(names, oparg); v = PyDict_GetItemString(f->f_builtins, "__import__"); ... x = PyEval_CallObject(v, w); ...

... the builtin function builtin___import__ (bltinmodule.c), and further along the call chain:

 dl_funcptr _PyImport_GetDynLoadFunc(const char *fqname, const char *shortname, const char *pathname, FILE *fp) { char funcname[258]; PyOS_snprintf(funcname, sizeof(funcname), "init%.200s", shortname); return dl_loadmod(Py_GetProgramName(), pathname, funcname); }

In any case, for some platforms and under certain conditions: CPython is built with support for dynamically loadable extension modules, the module has not yet been loaded, the file name of the module has a specific and platform-specific extension, etc.

Next, the module method is determined ...

 static PyObject * add_mul_two(PyObject * self, PyObject * args); static PyMethodDef ExampleMethods[] = { {"add_mul_two", add_mul_two, METH_VARARGS, ""}, {NULL, NULL, 0, NULL} }; void initexample(void) { Py_InitModule("example", ExampleMethods); }

... and its implementation itself. Since in this case the types of input variables are precisely known, the function can be defined as:

 PyObject * add_mul_two(PyObject * self, PyObject * args) { int a, b, acc = 0; if (!PyArg_ParseTuple(args, "ii", &a, &b)) { PyErr_SetNone(PyExc_ValueError); return NULL; } for (int i = 0; i < 1000; i++) acc += a + b; return Py_BuildValue("i", acc); }

The output will be a binary code, about the same that can be obtained using Numba ...

 ___main__.add_mul_two$1.int32.int32: addl %r8d, %ecx imull $1000, %ecx, %eax movl %eax, (%rdi) xorl %eax, %eax retq

... but having written only two lines and not going beyond the framework of one programming language.

 @jit(int32(int32, int32), nopython=True)

In addition to this code, numba generates ...

 add_mul_two.inspect_asm().values()[0].decode('string_escape')

... type wrapper function:

 _wrapper.__main__.add_mul_two$1.int32.int32: ... movq %rdi, %r14 movabsq $_.const.add_mul_two, %r10 movabsq $_PyArg_UnpackTuple, %r11 ... movabsq $_PyNumber_Long, %r15 callq *%r15 movq %rax, %rbx xorl %r14d, %r14d testq %rbx, %rbx je LBB1_8 movabsq $_PyLong_AsLongLong, %rax …

Its task is to parse the input arguments according to the signatures described in the decorator and, if it worked, to run the compiled version. This method seems very tempting, but if, for example, you take the body of the loop into a separate function, you will also need to frame it with the decorator or disable nopython.

Cython is the next challenger. It is a Python superset with support for calling C functions and defining C types. Therefore, in the simplest case, the add_mul_two function on it will look similar to Cpython. However, the extensive functionality is not given just like that, and, unlike the C-version, the resulting file will be almost 2000 lines of CPython API-type:

 __pyx_t_2 = PyNumber_Add(__pyx_v_a, __pyx_v_b); if (unlikely(!__pyx_t_2)) { __pyx_filename = __pyx_f[0]; __pyx_lineno = 14; __pyx_clineno = __LINE__; goto __pyx_L1_error; } __Pyx_GOTREF(__pyx_t_2); __pyx_t_3 = PyNumber_InPlaceAdd(__pyx_v_acc, __pyx_t_2); if (unlikely(!__pyx_t_3)) { __pyx_filename = __pyx_f[0]; __pyx_lineno = 14; __pyx_clineno = __LINE__; goto __pyx_L1_error; }

To improve the situation in terms of specificity, but not in terms of code size, one could write, for example, the implementation of the function itself in C, and use Cython to define a wrapper ...

 int cadd_mul_two(int a, int b) { int32_t acc = 0; for (int i = 0; i < 1000; i++) acc += a + b; return acc; } cdef extern from "example_func.h": int cadd_mul_two(int, int) def add_two(a, b): return cadd_two(a, b) cythonize("sample.pyx", sources=[ 'example_func.c' ])

... while getting almost perfect option, but in this case you need to write in C, Cython, Python.

  __pyx_t_1 = __Pyx_PyInt_As_int32_t(__pyx_v_a); if (unlikely((__pyx_t_1 == (int32_t)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_ __pyx_t_2 = __Pyx_PyInt_As_int32_t(__pyx_v_b); if (unlikely((__pyx_t_2 == (int32_t)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_ __pyx_t_3 = __Pyx_PyInt_From_int32_t(cadd_two(__pyx_t_1, __pyx_t_2)); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 8; _

Rusty

To make a module on Rust, you only need to declare the extern-function c no_mangle ...

 #[no_mangle] pub extern fn initexample() { unsafe { Py_InitModule4_64(&SAMPLE[0] as *const _, &METHODS[0] as *const _, 0 as *const _, 0, PYTHON_API_VERSION); }; }

... and describe the types:

 type PyCFunction = unsafe extern "C" fn (slf: *mut isize, args: *mut isize) -> *mut isize; #[repr(C)] struct PyMethodDef { pub ml_name: *const i8, pub ml_meth: Option<PyCFunction>, pub ml_flags: i32, pub ml_doc: *const i8, } unsafe impl Sync for PyMethodDef { }

As in C, you need to declare PyMethod:

 lazy_static! { static ref METHODS: Vec = { vec![ PyMethodDef { ml_name: &ADD_MUL_TWO[0] as *const _, ml_meth: Some(add_mul_two), }, ... ] }; }

Due to the fact that there are many C API calls in CPython, you will have to write something like this:

 #[link(name="python2.7")] extern { fn Py_InitModule4_64(name: *const i8, methods: *const PyMethodDef, doc: *const i8, s: isize, apiver: usize) -> *mut isize; fn PyArg_ParseTuple(arg1: *mut isize, arg2: *const i8, ...) -> isize; fn Py_BuildValue(arg1: *const i8, ...) -> *mut isize; }

But in the end we will get this beautiful function:

 #[allow(unused_variables)] unsafe extern "C" fn add_mul_two(slf: *mut isize, args: *mut isize) -> *mut isize { let mut a: i32 = 0; let mut b: i32 = 0; if PyArg_ParseTuple(args, &II_ARGS[0] as *const _, &a as *const i32, &b as *const i32) == 0 { return 0 as *mut _; } let mut acc: i32 = 0; for i in 0..1000 { acc += a + b; } Py_BuildValue(&I_ARGS[0] as *const _, acc) }

Or, if desired ...

 let acc: i32 = (0..).take(1000) .map(|_| a + b) .fold(0, |acc, x| acc + x);

... this function will also be compiled into two machine instructions:

 __ZN7add_mul_two20h391818698d43ab0ffcaE: ... callq 0x7a002 ## symbol stub for: _PyArg_ParseTuple testq %rax, %rax je 0x14e3 movl -0x8(%rbp), %eax addl -0x4(%rbp), %eax imull $0x3e8, %eax, %esi ## imm = 0x3E8 leaq _ref5540(%rip), %rdi ## literal pool for: "h" ...

The disadvantages of this approach are as follows:

only CPython API 2.7 and if you also need Python 3, you will need to duplicate a lot of code
If you try to reduce the size of the binary due to no_std, then the code will be larger
including A number of data structures in Rust are different from C. So for example Rust uses pascal strings and to interact with C you have to use something like std :: ffi :: CString

But, fortunately, there is a wonderful rust-cpython project that not only described all the necessary CpythonAPIs, but also provides high-level abstractions for them, and at the same time supports Python 2.x and 3.x. The code is approximately like this:

 [package] name = "example" version = "0.1.0" [lib] name = "example" crate-type = ["dylib"] [dependencies] interpolate_idents = "0.0.9" [dependencies.cpython] version = "0.0.5" default-features = false features = ["python27-sys"]

 #![feature(slice_patterns)] #![feature(plugin)] #![plugin(interpolate_idents)] #[macro_use] extern crate cpython; use cpython::{PyObject, PyResult, Python, PyTuple, PyDict, ToPyObject, PythonObject}; fn add_two(py: Python, args: &PyTuple, _: Option<&PyDict>) -> PyResult<PyObject> { match args.as_slice() { [ref a_obj, ref b_obj] => { let a = a_obj.extract::<i32>(py).unwrap(); let b = b_obj.extract::<i32>(py).unwrap(); let mut acc:i32 = 0; for _ in 0..1000 { acc += a + b; } Ok(acc.to_py_object(py).into_object()) }, _ => Ok(py.None()) } }

 py_module_initializer!(example, |py, module| { try!(module.add(py, "add_two", py_fn!(add_two))); Ok(()) });

Here nightly Rust is used, as a matter of fact, only for sclice_pattens and PyTuple.as_slice.
But, in my opinion, Rust in this situation offers a solution with powerful high-level abstractions, the ability to fine-tune algorithms and data structures, an effective and predictable result of optimizations. That is, it looks like a worthy alternative to other tools.

Code examples used in the article, you can see
here

Bibliography

1: Simon Peyton Jones, Adventure with Types in Haskell - Simon Peyton Jones (Lecture 2), 2014, youtu.be/brE_dyedGm0?t=536
2: Kevin Modzelewski, 2015/11/10 Pyston Meetup, 2015, www.youtube.com/watch?v=NdB9XoBg5zI
3: Martin Fowler, Yet Another Optimization Article , 2002, martinfowler.com/ieeeSoftware/yetOptimization.pdf
4: Raymond Hettinger, Transforming Code into Beautiful, Idiomatic Python, 2013, www.youtube.com/watch?v=OSGv2VnC0go

Source: https://habr.com/ru/post/279561/

All Articles