
Writing a numpy module to accelerate math functions using SIMD instructions

The numpy and scipy packages provide excellent tools for quickly solving a wide range of computational problems. The concept of universal functions (ufuncs), which work with both scalar values and arrays of various dimensions, makes it possible to achieve high performance while preserving the simplicity and elegance inherent in Python. A universal function is typically used to apply a single operation to a large array of data, which is ideal for optimization with SIMD instructions, yet I could not find a complete free-software solution that uses SIMD to compute mathematical functions such as numpy's sine, cosine, and exponential. I did not want to implement the algorithms for these functions from scratch, and fortunately there turned out to be several free C libraries on the Internet. Overcoming my laziness, I decided to write my own numpy module offering universal functions for the sine, cosine, and exponential. The details and test results follow below.


A bit about SIMD instructions


SIMD instructions perform the same operation simultaneously on several numbers packed into a single register. Thus you can process several numbers in one clock cycle and potentially speed up computation severalfold.
For example, the Advanced Vector Extensions (AVX) SIMD instruction set operates on 256-bit registers, each of which can hold eight 32-bit floating-point numbers (single precision) or four 64-bit ones (double precision). The set of available operations is rather modest: mainly addition, subtraction, multiplication, and division. An accurate and fast implementation of trigonometric functions on top of these operations is not a trivial task, so, to avoid reinventing the wheel, it is worth using a ready-made library.
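To give a sense of what these instructions look like in code, here is a minimal sketch of my own (not taken from any of the libraries discussed below) that uses GCC's AVX intrinsics to add four pairs of double-precision numbers in a single instruction; compile it with gcc -mavx:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double a[4] = { 1.0,  2.0,  3.0,  4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double r[4];

    __m256d va = _mm256_loadu_pd(a);    /* load four doubles into a 256-bit register */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vr = _mm256_add_pd(va, vb); /* all four additions in one instruction */
    _mm256_storeu_pd(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); /* prints: 11 22 33 44 */
    return 0;
}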
Apart from the proprietary Intel MKL (which already knows how to work with numpy), only three options turned up (one, two, three). The first is a single header file with a very modest set of functions and almost no documentation or tests. The second is the C++ library vecmathlib, which for some reason stubbornly refused to compile, even with the recommended GCC 4.7; a promising option, but it still seems raw. The third option, the SLEEF library, I found thanks to vecmathlib, which uses its code base. I immediately liked it for the simplicity and clarity of its code and for its abundance of tests.

A motivating test


To gather enough motivation to write the module, and at the same time to figure out how to use SLEEF, I decided to compare the speed of computing the sine in C with SLEEF against the standard math.h. Naturally, we are talking about element-by-element computation of the sine over a large data array.
Unfortunately, SLEEF has practically no documentation or examples, but its tests are written quite clearly, so it was easy to work out how to use the library. The SLEEF source code consists of four directories: java, purec, simd, and tester. In addition, there is a README with a brief description of the library and a top-level Makefile that invokes the Makefiles in the listed directories. Naturally, I was most interested in the simd directory, which, as the name suggests, contains the functions optimized with SIMD instructions.
From the Makefile in the simd directory it is clear that four variants of SIMD instructions are supported: SSE2, AVX, AVX2, and FMA4. The function prototypes are declared in the sleefsimd.h file, and the desired instruction set is selected at compile time with the -DENABLE_SSE2, -DENABLE_AVX, -DENABLE_AVX2, or -DENABLE_FMA4 flag. The Makefile builds an executable for testing the functions with each instruction set: iutsse2, iutavx, iutavx2, or iutfma4. These files are invoked by the universal tester program (from the tester directory) and execute the commands it sends. The commands are implemented in the iut.c file, which makes the use of the library obvious.
The sine test function from the simd/iut.c file of the SLEEF source code
double xxsin(double d) {
    double s[VECTLENDP];
    int i;
    for(i = 0; i < VECTLENDP; i++) {
        s[i] = random()/(double)RAND_MAX*20000-10000;
    }
    int idx = random() & (VECTLENDP-1);
    s[idx] = d;
    vdouble a = vloadu(s);
    a = xsin(a);
    vstoreu(s, a);
    return s[idx];
}


For double-precision numbers (double), you define an array of length VECTLENDP, fill it with the arguments of the function of interest, and pass it to the vloadu function, which copies them into the layout SIMD requires and returns a value of type vdouble. That vdouble value is passed to the xsin function, which computes the sine of all VECTLENDP arguments at once and again returns a vdouble. The result is unpacked back into an array of double with the vstoreu function.
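Condensed to its essentials, the pattern looks like this; a minimal sketch of my own, assuming the SLEEF simd sources are on the include path and one of the -DENABLE_* flags is passed to the compiler:

#include <stdio.h>
#include "sleefsimd.h"

int main(void)
{
    double s[VECTLENDP];
    int i;

    for(i = 0; i < VECTLENDP; i++)
        s[i] = i * 0.1;        /* arguments for one SIMD batch */

    vdouble a = vloadu(s);     /* pack the batch into a SIMD register */
    a = xsin(a);               /* VECTLENDP sines computed at once */
    vstoreu(s, a);             /* unpack the results back into the array */

    for(i = 0; i < VECTLENDP; i++)
        printf("sin(%g) = %g\n", i * 0.1, s[i]);
    return 0;
}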
For those who want to try SLEEF on their own machine, here is the full source code of the program I wrote to estimate the potential speedup from using SIMD via SLEEF.
A program to estimate the speed of calculating the sine using SLEEF
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "sleefsimd.h"

#define TESTSIZE (VECTLENDP*10000000)

double s[TESTSIZE];
double r1[TESTSIZE];
double r2[TESTSIZE];

#define COUNT 10

int main(int argc, char *argv[])
{
    int k, i;
    clock_t t1, t2;
    double time1, time2;
    double max, rmax;

    srandom(time(NULL));
    for(i = 0; i < TESTSIZE; i++) {
        s[i] = random()/(double)RAND_MAX*20000-10000;
    }

    printf("Testing sin, %d values\n", TESTSIZE*COUNT);
    t1 = clock();
    for(k = 0; k < COUNT; k++) {
        for(i = 0; i < TESTSIZE; i++) {
            r1[i] = sin(s[i]);
        }
    }
    t2 = clock();
    time1 = (double)(t2 - t1)/CLOCKS_PER_SEC;
    printf("Finish sin, spent time = %lg sec\n\n", time1);

    printf("Testing xsin\n");
    t1 = clock();
    for(k = 0; k < COUNT; k++) {
        for(i = 0; i < TESTSIZE; i += VECTLENDP) {
            vdouble a = vloadu(s+i);
            a = xsin(a);
            vstoreu(r2+i, a);
        }
    }
    t2 = clock();
    time2 = (double)(t2 - t1)/CLOCKS_PER_SEC;
    printf("Finish xsin, spent time = %lg sec\n\n", time2);

    printf("Speed ratio: %lf\n", time1/time2);

    /* find the largest absolute and relative deviations */
    max = r1[0] - r2[0];
    rmax = (r1[0] - r2[0])/r1[0];
    for(i = 0; i < TESTSIZE; i++) {
        double delta = (r1[i] - r2[i]);
        if(fabs(delta) > fabs(max))
            max = delta;
        delta = (r1[i] - r2[i])/r1[i];
        if(fabs(delta) > fabs(rmax))
            rmax = delta;
    }
    printf("Max absolute delta: %lg, relative delta %lg\n", max, rmax);

    return 0;
}


The most advanced instruction set supported on my computer is AVX, so I compiled the program (saved as simd/speedtest.c in the SLEEF sources) with the following command:
 gcc -O3 -Wall -Wno-unused -Wno-attributes -DENABLE_AVX -mavx speedtest.c sleefsimddp.c sleefsimdsp.c -o speedtest -lm 

I expected a speedup of about 4×, but the result exceeded all my expectations:
Testing sin, 400000000 values
Finish sin, spent time = 14.95 sec

Testing xsin
Finish xsin, spent time = 1.31 sec

Speed ratio: 11.412214
Max absolute delta: 5.55112e-17, relative delta 1.58441e-16

A speedup of more than 10×, with a relative error below 2·10⁻¹⁶ (on the order of the precision of double itself), on a single processor core! Of course, in a real application the speedup will be smaller, but this is already motivation enough to write our own numpy module.

A few words about universal functions


In numpy, data is represented as multidimensional arrays. Universal functions work element-wise with arrays of any dimension, and when there are several parameters, their dimensions need not coincide. The parameters of a universal function are first brought to a common shape according to special rules (this is called broadcasting), and then the necessary calculations are performed element by element. The result is an array with the largest of the input shapes.
For example, the same add function (which is invoked automatically by the "+" operator on numpy arrays) can add two numbers, add two one-dimensional arrays, add a number to an array, or add a one-dimensional array to a two-dimensional one.
Example
>>> from numpy import array, add
>>> add(1, 2)
3
>>> add(array([1,2]), array([4,5]))          # element-wise addition of two arrays
array([5, 7])
>>> add(array([1,2]), 1)                     # the number is added to each element (broadcasting)
array([2, 3])
>>> add(array([[1,2],[3,4]]), array([1,2]))  # the array is added to each row (broadcasting)
array([[2, 4],
       [4, 6]])


More information on numpy can be found in the official documentation (in English); in Russian, see, for example, here.

Writing our own numpy module with a universal function and SIMD instructions


Numpy and scipy have a fairly convenient API and good documentation, which includes a tutorial for those who want to write their own numpy module with a universal function. First, we write a C function that, in a simple loop, computes the value of the mathematical function of interest for scalar arguments:
The function for computing the sine in the numpy module
static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;

    /* elements that do not fill a whole SIMD vector take the slow path
       below; if the strides are not contiguous, all of them do */
    int slow_n = n % VECTLENDP;
    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    /* slow path: copy elements into tmp, compute, copy back out */
    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    /* fast path: contiguous data is loaded and stored directly */
    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}


numpy passes us pointers to the input and output data in the args array. In our case the function has one input and one output, so the address of the input data is args[0] and that of the output is args[1]. The number of elements is passed in dimensions[0]. To step through the input data, you advance the pointer by steps[0]; for the output data, by steps[1] (it is important that the pointers have type char, since steps are specified in bytes). Unfortunately, I could not find anything in the numpy documentation guaranteeing that the values in steps equal the sizes of the corresponding data types, although experiment showed that on my system this holds for arrays of nonzero dimension. When it does not hold, the computation is slower, since the elements must additionally be copied into and out of the tmp array.
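For comparison, without the SIMD fast path the inner loop reduces to the canonical byte-stepping loop from the numpy tutorial. A minimal sketch (the function name double_identity is mine, for illustration; the numpy headers from the full module below are assumed):

/* the strides in steps[] are given in bytes, hence the char pointers */
static void double_identity(char **args, npy_intp *dimensions,
                            npy_intp *steps, void *data)
{
    char *in = args[0], *out = args[1];
    npy_intp n = dimensions[0];
    npy_intp i;

    for(i = 0; i < n; i++) {
        *(double *)out = *(double *)in;  /* the "operation": copy one element */
        in += steps[0];
        out += steps[1];
    }
}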
The same universal function in numpy can work with several data types, but a separate C function is written for each type. When registering the universal function, we list the supported types and, for each combination of input and output types, supply a pointer to the C function that will handle that combination:
static PyUFuncGenericFunction funcs[] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};

The types array lists both the input and the output types, which is why it is longer than the funcs and data arrays. The data array of pointers lets you specify, for each type combination, an additional parameter that will be passed to the C function as its last argument, void *data. In particular, this can be used to implement several universal functions with a single C function.
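For example, one could route a pointer to an ordinary scalar function through data and back several universal functions with one loop. A hypothetical sketch, not part of the avxmath module:

#include <math.h>

typedef double (*scalar_fn)(double);

/* one generic loop: the data pointer selects the scalar function */
static void double_apply(char **args, npy_intp *dimensions,
                         npy_intp *steps, void *data)
{
    scalar_fn f = (scalar_fn)data;   /* e.g. sin or cos from math.h */
    char *in = args[0], *out = args[1];
    npy_intp n = dimensions[0];
    npy_intp i;

    for(i = 0; i < n; i++) {
        *(double *)out = f(*(double *)in);
        in += steps[0];
        out += steps[1];
    }
}

/* registration would then pass, e.g., {(void *)sin} or {(void *)cos}
   as the data array for the corresponding ufunc */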
To register our universal function, we call PyUFunc_FromFuncAndData, passing it the arrays described above (funcs, data, and types), the numbers of input and output arguments of the universal function, the number of supported type combinations, the name of the function within the module, and its documentation string.
Full module source code
#include "Python.h"
#include "numpy/ndarraytypes.h"
#include "numpy/ufuncobject.h"
#include "numpy/npy_3kcompat.h"
#include "sleef/sleefsimd.h"

/* The loop definition must precede the PyMODINIT_FUNC. */

static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;
    int slow_n = n % VECTLENDP;

    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}

static PyMethodDef AvxmathMethods[] = {
    {NULL, NULL, 0, NULL}
};

static PyUFuncGenericFunction funcs[1] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};

void register_xsin(PyObject *module)
{
    PyObject *xsin, *d;

    import_array();
    import_umath();

    xsin = PyUFunc_FromFuncAndData(funcs, data, types, 1, 1, 1,
                                   PyUFunc_None, "sin",
                                   "AVX-accelerated sine calculation", 0);
    d = PyModule_GetDict(module);
    PyDict_SetItemString(d, "sin", xsin);
    Py_DECREF(xsin);
}

#if PY_VERSION_HEX >= 0x03000000
static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "avxmath",
    NULL,
    -1,
    AvxmathMethods,
    NULL,
    NULL,
    NULL,
    NULL
};

PyMODINIT_FUNC PyInit_avxmath(void)
{
    PyObject *m;
    m = PyModule_Create(&moduledef);
    if (!m) {
        return NULL;
    }
    register_xsin(m);
    return m;
}
#else
PyMODINIT_FUNC initavxmath(void)
{
    PyObject *m;
    m = Py_InitModule("avxmath", AvxmathMethods);
    if (m == NULL) {
        return;
    }
    register_xsin(m);
}
#endif


To build the module, I used the standard setup.py from the documentation, replacing the module name and adding the SLEEF library C files and the compiler and linker flags. I saved the module source code shown above as avxmath.c next to setup.py, renamed the simd directory from the SLEEF sources to sleef, and put it next to setup.py as well.
The setup.py file for building the avxmath module
def configuration(parent_package='', top_path=None):
    import numpy
    from numpy.distutils.misc_util import Configuration

    config = Configuration('', parent_package, top_path)
    config.add_extension('avxmath',
                         ['avxmath.c',
                          'sleef/sleefsimddp.c',
                          'sleef/sleefsimdsp.c'],
                         extra_compile_args=['-O3', '-Wall', '-Wno-unused',
                                             '-Wno-attributes',
                                             '-DENABLE_AVX', '-mavx'],
                         extra_link_args=['-lm'])
    return config

if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup(configuration=configuration)


To build without installing into the system, run the command python setup.py build_ext --inplace; successful execution should produce the finished module in the file avxmath.so. Now we can check that our function works: run python in the same directory as avxmath.so and try it:
>>> from numpy import array, pi
>>> import avxmath
>>> avxmath.sin(0)
0.0
>>> avxmath.sin(pi)
1.2246467991473532e-16
>>> avxmath.sin([0, pi/2, pi, 3*pi/2, 2*pi])
array([  0.00000000e+00,   1.00000000e+00,   1.22464680e-16,
        -1.00000000e+00,  -2.44929360e-16])
>>>

Having made sure that the avxmath module imports and runs without errors, we can do a small test of the new function's performance and accuracy.
The program for checking the sin function of the avxmath module and the result of its execution
import numpy
import avxmath
import time
from numpy import random, pi

COUNT = 10

x = 2e4*random.random(40000000) - 1e4

t = time.clock()
for i in xrange(COUNT):
    y1 = numpy.sin(x)
duration1 = time.clock() - t
print "numpy.sin %f sec" % duration1

t = time.clock()
for i in xrange(COUNT):
    y2 = avxmath.sin(x)
duration2 = time.clock() - t
print "avxmath.sin %f sec" % duration2

delta = y2 - y1
rdelta = delta/y1
print "max absolute difference is %lg, relative %lg" % (
    delta[abs(delta).argmax()], rdelta[abs(rdelta).argmax()])
print "speedup is %lg" % (duration1/duration2)

numpy.sin 15.510000 sec
avxmath.sin 2.260000 sec
max absolute difference is 2.22045e-16, relative 2.63873e-16
speedup is 6.86283


So, we got a speedup of more than 6× with accuracy no worse than 3·10⁻¹⁶! By replacing the xsin calls with a plain memory copy, it is easy to check that we fell short of a 10× speedup because about 1 second of the 2.26 seconds of execution went to overhead. Likewise, replacing xsin with the ordinary sine from math.h shows that the run times of avxmath.sin and numpy.sin in our test coincide to high accuracy.
Thus, SIMD instructions can deliver a significant speedup of computations performed with numpy and scipy, simply by replacing the import of the ordinary functions with optimized ones. And of course, the source code of the avxmath module, somewhat expanded compared to this article, is available on GitHub.
Upd: do not use SLEEF's sine for argument values on the order of 1e10 or greater (see the comments).

Source: https://habr.com/ru/post/198916/

