
Once again about the GIL in Python

Foreword


The field I have been lucky to work in is computational electrophysiology of the heart. The physiology of cardiac activity is determined by electrical processes occurring at the level of individual myocardial cells. These electrical processes create an electric field that is fairly easy to measure and that is very well described by the mathematical models of electrostatics. This opens a unique opportunity to describe the work of the heart mathematically, and therefore to improve the treatment of many heart diseases.

During my work in this field I have gained some experience with various computing technologies. In this article I will try to answer some questions that may be of interest not only to me.

Scientific Python in Brief


Starting from my first years at university, I tried to find the perfect tool for rapid development of numerical algorithms. Discarding a number of frankly marginal technologies, I wandered between C++ and MATLAB. This continued until I discovered Scientific Python [1].

Scientific Python is a collection of Python libraries for scientific computing and scientific visualization. In my work I use the following packages, which cover about 90% of my needs:

NumPy: one of the base libraries; lets you work with multidimensional arrays as single objects, MATLAB-style. Includes implementations of the basic linear algebra routines, the Fourier transform, random number generation, etc.
SciPy: an extension of NumPy; includes implementations of optimization methods, sparse matrices, statistics, etc.
Pandas: a separate package for multidimensional data analysis and statistics.
SymPy: a symbolic mathematics package.
Matplotlib: two-dimensional plotting.
Mayavi2: three-dimensional plotting based on VTK.
Spyder: a convenient IDE for interactive development of mathematical algorithms.
In Scientific Python I found a great balance between convenient high-level abstractions for rapid development of numerical algorithms and a modern, expressive language. But, as we know, there are no perfect tools, and one of the rather critical problems in Python is the problem of parallel computing.

Problems of parallel computing in Python


By parallel computing in this article I mean SMP: symmetric multiprocessing with shared memory. CUDA and distributed-memory systems (most commonly programmed with the MPI standard) are beyond the scope of this article.

The problem is the GIL. The GIL (Global Interpreter Lock) is a lock (mutex) that prevents multiple threads from executing Python bytecode simultaneously. This lock is, unfortunately, necessary, because the memory management system in CPython is not thread-safe. Yes, the GIL is not a problem of the Python language itself but of the CPython interpreter implementation. Unfortunately, though, the other Python implementations are not well suited to fast numerical work.
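To make the effect tangible, here is a minimal sketch (my own illustration, not from the original benchmarks): a CPU-bound function run in four threads under CPython takes roughly as long as running it four times sequentially, because only one thread executes bytecode at any moment.

 import threading
 import time

 def burn(n):
     # Pure-Python CPU-bound work; holds the GIL the whole time
     s = 0
     for i in xrange(n):
         s += i * i
     return s

 start = time.time()
 threads = [threading.Thread(target=burn, args=(10 ** 7,)) for _ in range(4)]
 for t in threads:
     t.start()
 for t in threads:
     t.join()
 # Expect roughly 4x the single-call time, not ~1x, because of the GIL
 print("4 threads: %.2f s" % (time.time() - start))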

Fortunately, there are currently several ways to work around the GIL. Let's consider them.

Test task


Two sets of N vectors each are given in three-dimensional Euclidean space: P = {p_1, p_2, ..., p_N} and Q = {q_1, q_2, ..., q_N}. We need to construct a matrix R of dimension N x N, each element r_ij of which is calculated by the formula:

r_ij = 1 / (1 + |p_i - q_j|),

where |p_i - q_j| is the Euclidean distance between the vectors p_i and q_j.
Roughly speaking, we need to compute a matrix built from the pairwise distances between all the vectors. Such a matrix appears quite often in real calculations, for example in RBF interpolation or when solving differential equations by the method of integral equations.

In the test experiments the number of vectors is N = 5000. A 4-core processor was used for the calculations. The reported results are times averaged over 10 runs.

The full implementation of the test tasks can be found on GitHub [2].
A fair remark from @chersaya in the comments: this test task is used here only as an example. If you really need to compute pairwise distances, use the scipy.spatial.distance.cdist function.
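For reference, a minimal sketch of how R could be obtained this way, assuming both sets are stored as (N, 3) arrays (one vector per row; the random data here stand in for real inputs):

 import numpy as np
 from scipy.spatial.distance import cdist

 N = 5000
 P = np.random.rand(N, 3)  # one vector per row
 Q = np.random.rand(N, 3)
 # cdist returns the (N, N) matrix of pairwise Euclidean distances
 R = 1 / (1 + cdist(P, Q))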

Parallel implementation in C++


To have a baseline for the efficiency of parallel computing in Python, I implemented the same task in C++. The code of the main functions is given below.

Single-processor implementation:

 //! Single thread matrix R calculation
 void spGetR(vector<Vector3D> & p, vector<Vector3D> & q, MatrixMN & R)
 {
     for (int i = 0; i < p.size(); i++) {
         Vector3D & a = p[i];
         for (int j = 0; j < q.size(); j++) {
             Vector3D & b = q[j];
             Vector3D r = b - a;
             R(i, j) = 1 / (1 + sqrt(r * r));
         }
     }
 }

Multiprocessing implementation:

 //! OpenMP matrix R calculation
 void mpGetR(vector<Vector3D> & p, vector<Vector3D> & q, MatrixMN & R)
 {
     #pragma omp parallel for
     for (int i = 0; i < p.size(); i++) {
         Vector3D & a = p[i];
         for (int j = 0; j < q.size(); j++) {
             Vector3D & b = q[j];
             Vector3D r = b - a;
             R(i, j) = 1 / (1 + sqrt(r * r));
         }
     }
 }

What is interesting here? First, I used a separate Vector3D class to represent a vector in three-dimensional space; the overloaded operator "*" in this class has the meaning of a scalar (dot) product. A set of vectors is represented by std::vector. For parallel computing I used OpenMP: to parallelize the algorithm it suffices to add the "#pragma omp parallel for" directive before the outer loop.

Results:
Single-processor C++: 224 ms
Multiprocessor C++: 65 ms
A speedup of 3.45 times from parallel execution; I think that is quite good for a quad-core processor.

Parallel Python implementations


1. Native implementation in pure Python

In this test, I wanted to check how fast this problem would be solved in pure Python, without any special packages.

Solution code:

 from math import sqrt
 import numpy as np

 def sppyGetR(p, q):
     R = np.empty((p.shape[0], q.shape[1]))
     nP = p.shape[0]
     nQ = q.shape[1]
     for i in xrange(nP):
         for j in xrange(nQ):
             rx = p[i, 0] - q[0, j]
             ry = p[i, 1] - q[1, j]
             rz = p[i, 2] - q[2, j]
             R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
     return R

Here p and q are the input data as NumPy arrays of dimensions (N, 3) and (3, N). Then comes an honest nested loop in Python that computes the elements of the matrix R.
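For illustration, the inputs under these shape conventions could be generated like this (a sketch, with random data standing in for the real vectors, calling the sppyGetR from the listing above):

 import numpy as np

 N = 5000
 p = np.random.rand(N, 3)  # P: one vector per row, shape (N, 3)
 q = np.random.rand(3, N)  # Q: one vector per column, shape (3, N)
 R = sppyGetR(p, q)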

Results:
Single-processor Python: 57,386 ms
Yes, precisely 57 thousand milliseconds: roughly 256 times slower than single-processor C++. For numerical calculations this is simply not an option.

2. Single-processor NumPy

In general, for NumPy-based calculations you sometimes do not even have to think about parallelism yourself. For example, multiplying two matrices in NumPy is ultimately performed by low-level high-performance linear algebra libraries (MKL or ATLAS). But, unfortunately, this holds only for the most typical operations and does not work in the general case. Our test task, unfortunately, will be executed sequentially.
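For example, a plain matrix product is handed off to whatever BLAS implementation NumPy was built against, which may itself use multiple cores. A small sketch:

 import numpy as np

 a = np.random.rand(2000, 2000)
 b = np.random.rand(2000, 2000)
 # Executed by the underlying BLAS library (MKL, ATLAS, ...), not by the interpreter
 c = np.dot(a, b)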

The solution code is as follows:

 import numpy as np

 def spnpGetR(p, q):
     Rx = p[:, 0:1] - q[0:1]
     Ry = p[:, 1:2] - q[1:2]
     Rz = p[:, 2:3] - q[2:3]
     R = 1 / (1 + np.sqrt(Rx * Rx + Ry * Ry + Rz * Rz))
     return R

Only four lines and no loops! That's what I love NumPy for.
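The trick is NumPy broadcasting: subtracting a column of shape (N, 1) from a row of shape (1, N) yields a full (N, N) matrix of pairwise differences. A minimal illustration:

 import numpy as np

 col = np.arange(3.0).reshape(3, 1)  # shape (3, 1)
 row = np.arange(4.0).reshape(1, 4)  # shape (1, 4)
 diff = col - row                    # broadcast to shape (3, 4)
 print(diff.shape)  # (3, 4)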

Results:
Single-processor NumPy: 973 ms
About 4.3 times slower than single-processor C++. This is quite a good result: for the vast majority of calculations this performance is enough. But these are still single-processor results. Let's move on to multiprocessing.

3. Multiprocessor NumPy

As a solution to the GIL problem, it is traditionally suggested to use several independent processes instead of several threads. All well and good, but there is a problem: each process has independent memory, and we need every process to see the result matrix. To solve this, Python's multiprocessing module provides the RawArray class, which allows a single data array to be shared between processes. I do not know exactly what RawArray is based on; it seems to me that it is memory-mapped files.

The solution code is as follows:

 import ctypes
 import multiprocessing as mp
 import numpy as np
 import utils  # helper module from the repository [2]

 def mpnpGetR_worker(job):
     start, stop = job
     # Wrap the shared buffers in NumPy arrays (no copying)
     p = np.reshape(np.frombuffer(mp_share.p), (-1, 3))
     q = np.reshape(np.frombuffer(mp_share.q), (3, -1))
     R = np.reshape(np.frombuffer(mp_share.R), (p.shape[0], q.shape[1]))
     Rx = p[start:stop, 0:1] - q[0:1]
     Ry = p[start:stop, 1:2] - q[1:2]
     Rz = p[start:stop, 2:3] - q[2:3]
     R[start:stop, :] = 1 / (1 + np.sqrt(Rx * Rx + Ry * Ry + Rz * Rz))

 def mpnpGetR(p, q):
     nP, nQ = p.shape[0], q.shape[1]
     sh_p = mp.RawArray(ctypes.c_double, p.ravel())
     sh_q = mp.RawArray(ctypes.c_double, q.ravel())
     sh_R = mp.RawArray(ctypes.c_double, nP * nQ)
     nCPU = 4
     jobs = utils.generateJobs(nP, nCPU)
     pool = mp.Pool(processes=nCPU, initializer=mp_init,
                    initargs=(sh_p, sh_q, sh_R))
     pool.map(mpnpGetR_worker, jobs, chunksize=1)
     R = np.reshape(np.frombuffer(sh_R), (nP, nQ))
     return R

We create shared arrays for the input data and the output matrix, create a pool of processes sized to the number of cores, split the problem into subtasks, and solve them in parallel.
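The workers find the shared buffers through a module-level mp_share object filled in by the pool initializer mp_init; neither is shown in the excerpt above, so here is a minimal sketch of what they might look like (the real helpers live in the repository [2]):

 class mp_share(object):
     # Module-level holder for the shared RawArray buffers
     pass

 def mp_init(sh_p, sh_q, sh_R):
     # Runs once in every worker process: stash the shared arrays
     # where mpnpGetR_worker can find them
     mp_share.p = sh_p
     mp_share.q = sh_q
     mp_share.R = sh_R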

Results:
Multiprocessor NumPy: 795 ms
Yes, faster than the single-processor version, but only 1.22 times. The efficiency of this solution grows with N. But on the whole our test problem is not well suited to being solved by a set of independent processes with independent memory, although for other tasks this option can be quite effective.

This exhausts the solutions for parallel programming using only Python that are known to me. From here on, like it or not, escaping the GIL requires going down to the C level. But that is not as scary as it seems.

4. Cython

Cython [3] is an extension of the Python language that allows C instructions to be embedded in Python code. Thus we can take Python code and, by adding a few instructions, significantly speed up the performance bottlenecks. Cython modules are translated to C code and then compiled into Python extension modules. The code for solving our problem in Cython is as follows:

Single-processor Cython:

 import numpy as np
 cimport cython
 from libc.math cimport sqrt

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def spcyGetR(pp, pq):
     pR = np.empty((pp.shape[0], pq.shape[1]))
     cdef int i, j, k
     cdef int nP = pp.shape[0]
     cdef int nQ = pq.shape[1]
     cdef double[:, :] p = pp
     cdef double[:, :] q = pq
     cdef double[:, :] R = pR
     cdef double rx, ry, rz
     with nogil:
         for i in xrange(nP):
             for j in xrange(nQ):
                 rx = p[i, 0] - q[0, j]
                 ry = p[i, 1] - q[1, j]
                 rz = p[i, 2] - q[2, j]
                 R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
     return R

Multiprocessor Cython:
 import numpy as np
 cimport cython
 from cython.parallel import parallel, prange
 from libc.math cimport sqrt

 @cython.boundscheck(False)
 @cython.wraparound(False)
 def mpcyGetR(pp, pq):
     pR = np.empty((pp.shape[0], pq.shape[1]))
     cdef int i, j, k
     cdef int nP = pp.shape[0]
     cdef int nQ = pq.shape[1]
     cdef double[:, :] p = pp
     cdef double[:, :] q = pq
     cdef double[:, :] R = pR
     cdef double rx, ry, rz
     with nogil, parallel():
         for i in prange(nP, schedule='guided'):
             for j in xrange(nQ):
                 rx = p[i, 0] - q[0, j]
                 ry = p[i, 1] - q[1, j]
                 rz = p[i, 2] - q[2, j]
                 R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
     return R

If we compare this code with the pure Python implementation, all we had to do was specify types for the variables used. The GIL is released in a single line. The parallel loop is organized simply by using prange instead of xrange. In my opinion, this is quite easy and beautiful!
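A note on building: Cython modules are compiled before use, and prange needs OpenMP enabled. A minimal sketch of a build script, assuming the Cython source lives in a hypothetical file getr.pyx:

 # setup.py: a minimal sketch; "getr" and "getr.pyx" are hypothetical names
 from distutils.core import setup
 from distutils.extension import Extension
 from Cython.Build import cythonize

 ext = Extension(
     "getr",
     sources=["getr.pyx"],
     extra_compile_args=["-fopenmp"],  # gcc flag required by cython.parallel.prange
     extra_link_args=["-fopenmp"],
 )
 setup(ext_modules=cythonize([ext]))

The module is then built in place with "python setup.py build_ext --inplace".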

Results:
Single-processor Cython: 255 ms
Multiprocessor Cython: 75 ms
Wow! The execution times almost coincide with the C++ times. The lag, about 1.1 times in both the single-processor and the multiprocessor versions, is almost imperceptible on real tasks.

5. Numba

Numba [4] is a fairly new library under active development. The idea is about the same as in Cython: an attempt to reach C-level performance within Python code. But the idea is implemented much more elegantly.

Numba is based on the LLVM compiler infrastructure, which allows compiling directly during program execution (JIT compilation). For example, to compile any Python procedure it is enough to add the "jit" decorator. Moreover, the decorator lets you specify the types of the input/output data, which makes the JIT compilation much more efficient.
The task implementation code is as follows.

Single-processor Numba:

 from math import sqrt
 import numpy as np
 from numba import jit, double

 @jit(double[:, :](double[:, :], double[:, :]))
 def spnbGetR(p, q):
     nP = p.shape[0]
     nQ = q.shape[1]
     R = np.empty((nP, nQ))
     for i in xrange(nP):
         for j in xrange(nQ):
             rx = p[i, 0] - q[0, j]
             ry = p[i, 1] - q[1, j]
             rz = p[i, 2] - q[2, j]
             R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
     return R

Multiprocessor Numba:
 import threading
 from ctypes import pythonapi, c_void_p
 from math import sqrt
 import numpy as np
 from numba import jit, void, double, int64
 import utils  # helper module from the repository [2]

 def makeWorker():
     # ctypes bindings for manually releasing and re-acquiring the GIL
     savethread = pythonapi.PyEval_SaveThread
     savethread.argtypes = []
     savethread.restype = c_void_p
     restorethread = pythonapi.PyEval_RestoreThread
     restorethread.argtypes = [c_void_p]
     restorethread.restype = None

     def worker(p, q, R, job):
         threadstate = savethread()  # release the GIL
         nQ = q.shape[1]
         for i in xrange(job[0], job[1]):
             for j in xrange(nQ):
                 rx = p[i, 0] - q[0, j]
                 ry = p[i, 1] - q[1, j]
                 rz = p[i, 2] - q[2, j]
                 R[i, j] = 1 / (1 + sqrt(rx * rx + ry * ry + rz * rz))
         restorethread(threadstate)  # re-acquire the GIL

     signature = void(double[:, :], double[:, :], double[:, :], int64[:])
     worker_ext = jit(signature, nopython=True)(worker)
     return worker_ext

 def mpnbGetR(p, q):
     nP, nQ = p.shape[0], q.shape[1]
     R = np.empty((nP, nQ))
     nCPU = utils.getCPUCount()
     jobs = utils.generateJobs(nP, nCPU)
     worker_ext = makeWorker()
     threads = [threading.Thread(target=worker_ext, args=(p, q, R, job))
                for job in jobs]
     for thread in threads:
         thread.start()
     for thread in threads:
         thread.join()
     return R

Compared with pure Python, only one decorator is added to the single-processor Numba solution! The multiprocessor version, unfortunately, is not so beautiful: it requires organizing a pool of threads and releasing the GIL manually. In previous Numba releases there was an attempt to implement a parallel loop in a single instruction, but due to stability problems this feature was removed in later releases. I am sure this capability will be restored in time.

Results:
Single-processor Numba: 359 ms
Multiprocessor Numba: 180 ms
Slightly worse than Cython, but the results are still very decent! And the solution itself is extremely elegant.

Conclusions


I want to illustrate the results with the following diagrams:

[Fig. 1. Results of the single-processor calculations]

[Fig. 2. Results of the multiprocessor calculations]

It seems to me that the GIL problem in Python is almost overcome for numerical calculations. For now, as a parallel computing technology, I would recommend Cython, but I would keep a close eye on Numba.

Links


[1] Scientific Python: scipy.org
[2] Full test source codes: github.com/alec-kalinin/open-nuance
[3] Cython: cython.org
[4] Numba: numba.pydata.org

P.S. In the comments, @chersaya correctly pointed out another method of parallel computing: the numexpr library. Numexpr uses its own virtual machine written in C and its own JIT compiler. This allows it to accept simple mathematical expressions as strings, compile them, and compute them quickly.

Usage example:
 import numpy as np
 import numexpr as ne

 a = np.arange(1e6)
 b = np.arange(1e6)
 result = ne.evaluate("sin(a) + arcsinh(a/b)")
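Numexpr also evaluates expressions with its own internal thread pool, so it parallelizes across cores out of the box. Applied to the test task of this article, a sketch might look like this (the function name is hypothetical; p and q keep the (N, 3) and (3, N) shapes used above):

 import numpy as np
 import numexpr as ne

 def nenpGetR(p, q):  # hypothetical name
     Rx = p[:, 0:1] - q[0:1]
     Ry = p[:, 1:2] - q[1:2]
     Rz = p[:, 2:3] - q[2:3]
     # numexpr compiles the expression string and evaluates it in parallel
     return ne.evaluate("1 / (1 + sqrt(Rx * Rx + Ry * Ry + Rz * Rz))")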

Source: https://habr.com/ru/post/238703/

