
Quick Python performance test for computing tasks

Motivation


Recently, version 0.34 of Numba, a JIT-compilation library for Python, was released. And, hooray: it finally brings the long-awaited annotation semantics and a set of constructs for organizing parallel computations, based on Intel's Parallel Accelerator technology.

In this article I want to share the results of a first test of this library's computation speed on a modern machine with a quad-core processor.

Introduction


Python is currently used very actively in scientific computing, and in machine learning it is practically one of the standards. But look a little deeper and you will find that almost everywhere Python serves as a wrapper over lower-level libraries written mostly in C/C++. Is it really possible to write fast, parallel code in pure Python?

Consider a very simple task. Suppose we are given two sets of N points in three-dimensional space, p and q. We need to compute a matrix built from the pairwise distances between all the points:

R(i, j) = \frac{1}{1 + \|p(i) - q(j)\|_2}


For all tests we take N = 5000, and the computation time is averaged over 10 runs.
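
The original article does not show the measurement harness itself. A minimal sketch of how the setup could look (the uniform random points and the helper name timeit_avg are my assumptions, not from the source; note that p is stored as N x 3 and q as 3 x N, matching the indexing used in all the implementations below):

import time
import numpy as np

N = 5000
p = np.random.rand(N, 3)   # N points, one per row
q = np.random.rand(3, N)   # N points, one per column (transposed layout)

def timeit_avg(fn, *args, runs=10):
    # Average wall-clock time of fn(*args) over `runs` starts, in ms.
    # The extra first call warms up JIT-compiled functions so that
    # compilation time is not included in the average.
    fn(*args)
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        total += time.perf_counter() - start
    return 1000 * total / runs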

C++ implementation


As a starting point, let's take the following C++ implementation:

void getR(std::vector<Point3D> & p, std::vector<Point3D> & q, Matrix & R) {
    #pragma omp parallel for
    for (int i = 0; i < p.size(); i++) {
        for (int j = 0; j < q.size(); j++) {
            // declared inside the loop so each OpenMP thread gets private copies
            double rx = p[i].x - q[j].x;
            double ry = p[i].y - q[j].y;
            double rz = p[i].z - q[j].z;
            R(i, j) = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz));
        }
    }
}

The outer loop over the points of p is executed in parallel using OpenMP.

Execution time: 44 ms.

Pure Python


Let's start the speed test with pure Python code:

import math
import numpy as np

def get_R(p, q):
    R = np.empty((p.shape[0], q.shape[1]))
    for i in range(p.shape[0]):
        for j in range(q.shape[1]):
            rx = p[i, 0] - q[0, j]
            ry = p[i, 1] - q[1, j]
            rz = p[i, 2] - q[2, j]
            R[i, j] = 1 / (1 + math.sqrt(rx * rx + ry * ry + rz * rz))
    return R

Execution time: 52,861 ms, more than 1,000 times slower than the baseline implementation.

Python is an interpreted language, and it uses the GIL internally, which makes parallelization at the level of Python code itself impossible. All of this is very slow. Now let's start speeding it up.

Python + NumPy + SciPy


The slowness of Python for numerical problems has been recognized for a very long time, and the answer to it was the NumPy library. The ideology of NumPy is in many ways close to MATLAB, a recognized tool for scientific computing.

We stop thinking in iterations and begin to think of matrices and vectors as atomic objects of computation. All operations on matrices and vectors at the lower level are then performed by high-performance linear algebra libraries such as Intel MKL or ATLAS.
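
A tiny illustration of this style (a sketch, not from the original article): subtracting a row vector from a column vector broadcasts into a full matrix of pairwise differences, with no loop in sight:

import numpy as np

a = np.arange(3).reshape(3, 1)   # column vector, shape (3, 1)
b = np.arange(4).reshape(1, 4)   # row vector, shape (1, 4)
print((a - b).shape)             # (3, 4): all pairwise differences at once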

The implementation on NumPy looks like this:

import numpy as np

def get_R_numpy(p, q):
    # p has shape (N, 3) and q has shape (3, N); each difference below
    # broadcasts a column against a row into a full (N, N) matrix
    Rx = p[:, 0:1] - q[0:1]
    Ry = p[:, 1:2] - q[1:2]
    Rz = p[:, 2:3] - q[2:3]
    R = 1 / (1 + np.sqrt(Rx * Rx + Ry * Ry + Rz * Rz))
    return R

In this implementation there is no explicit loop at all!

Execution time: 839 ms, about 19 times slower than the baseline implementation.

Moreover, NumPy and SciPy provide a huge number of built-in functions. The SciPy implementation of this task looks like this:

import scipy.spatial.distance as spd

def get_R_scipy(p, q):
    # cdist expects both point sets as (N, 3) arrays, hence the transpose of q
    D = spd.cdist(p, q.T)
    R = 1 / (1 + D)
    return R

Execution time: 353 ms, which is 8 times slower than the base implementation.
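
As a quick sanity check (not part of the original benchmark), it is easy to verify that the two vectorized versions agree with each other elementwise:

import numpy as np

R1 = get_R_numpy(p, q)
R2 = get_R_scipy(p, q)
print(np.allclose(R1, R2))   # True: both compute 1 / (1 + ||p_i - q_j||)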

For most tasks this is a perfectly acceptable time. The price is a switch in the way of thinking: you now have to assemble your code from the basic operations of linear algebra. Sometimes this looks very elegant, but sometimes you have to invent all sorts of tricks.

But what about parallelism? Here it is implicit: we rely on all the low-level operations on matrices and vectors being implemented efficiently and in parallel.
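
If NumPy is linked against MKL or another OpenMP-enabled BLAS (an assumption about the particular build), the number of low-level threads can usually be controlled with environment variables set before NumPy is imported; a sketch:

import os

# must be set before the first `import numpy`
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np   # the thread settings take effect at import time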

And what should we do if our code does not fit into linear algebra, or if we want explicit parallelization?

Python + Cython


This is where Cython comes in. Cython is a special language that allows you to embed code in a C-like language inside normal Python code. Cython translates this code into .c files, which are then compiled into Python modules. These modules can be called transparently from the rest of the Python code. The Cython implementation looks like this:

# cimports needed at the top of the .pyx file
from cython.parallel import prange
from libc.math cimport sqrt
cimport cython
import numpy as np

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
def get_R_cython_mp(py_p, py_q):
    py_R = np.empty((py_p.shape[0], py_q.shape[1]))
    cdef int nP = py_p.shape[0]
    cdef int nQ = py_q.shape[1]
    cdef double[:, :] p = py_p
    cdef double[:, :] q = py_q
    cdef double[:, :] R = py_R
    cdef int i, j
    cdef double rx, ry, rz
    with nogil:
        for i in prange(nP):
            for j in range(nQ):
                rx = p[i, 0] - q[0, j]
                ry = p[i, 1] - q[1, j]
                rz = p[i, 2] - q[2, j]
                R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
    return py_R

What's going on here? The function accepts NumPy objects as input; they are bound to typed Cython memoryviews, then the GIL is released and the outer loop is executed in parallel using the special prange construct.
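
To use this function, the Cython code has to be compiled into an extension module first. A minimal setup.py sketch, assuming the code above is saved as get_r.pyx (the module and file names here are hypothetical); the -fopenmp flags are what make prange actually run in parallel with gcc/clang:

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "get_r",
    sources=["get_r.pyx"],
    extra_compile_args=["-fopenmp"],   # enable OpenMP for prange
    extra_link_args=["-fopenmp"],
    include_dirs=[np.get_include()],   # for the NumPy C headers
)

setup(ext_modules=cythonize(ext))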

Execution time: 76 ms, which is about 1.75 times slower than the baseline implementation.

Well, we have come close to the baseline implementation, and we can now write explicitly parallel code. But the price for this is less readable code: we have left pure Python.

In general, this is how most numerical computations are written: the bulk of the code uses NumPy, and the speed-critical spots are moved into separate modules implemented in Cython.

Python + Numba


We have come a long way: we started with pure Python, then walked the path of matrix-computation magic, then plunged into special language extensions. It's time to return to where we started. So, the Python + Numba implementation:

import math
import numpy as np
from numba import jit, prange, float64

@jit(float64[:, :](float64[:, :], float64[:, :]), nopython=True, parallel=True)
def get_R_numba_mp(p, q):
    R = np.empty((p.shape[0], q.shape[1]))
    for i in prange(p.shape[0]):
        for j in range(q.shape[1]):
            rx = p[i, 0] - q[0, j]
            ry = p[i, 1] - q[1, j]
            rz = p[i, 2] - q[2, j]
            R[i, j] = 1 / (1 + math.sqrt(rx * rx + ry * ry + rz * rz))
    return R

Execution time: 46 ms, practically the same as the baseline implementation!

And all that had to be done to the original slow Python code was to add the @jit annotation with nopython=True and parallel=True, and to replace range with prange in the outer loop.

Numba is a very interesting project: an optimizing JIT compiler for Python built on LLVM. It is completely transparent to use, with no separate modules or build steps required; all you need to do is add a few annotations to the speed-critical methods.
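
The explicit type signature in the decorator above is optional: Numba can also infer the types lazily at the first call. A sketch of the same function with the shorter annotation (the name get_R_numba_lazy is mine):

import numpy as np
from numba import njit, prange

@njit(parallel=True)   # types are inferred at the first call
def get_R_numba_lazy(p, q):
    R = np.empty((p.shape[0], q.shape[1]))
    for i in prange(p.shape[0]):
        for j in range(q.shape[1]):
            rx = p[i, 0] - q[0, j]
            ry = p[i, 1] - q[1, j]
            rz = p[i, 2] - q[2, j]
            R[i, j] = 1 / (1 + np.sqrt(rx * rx + ry * ry + rz * rz))
    return R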

In summary, the execution times are as follows:
Implementation   Time
C++              44 ms
Python           52,861 ms
NumPy            839 ms
SciPy            353 ms
Cython           76 ms
Numba            46 ms

Source: https://habr.com/ru/post/336684/

