java, purec, simd and tester. In addition, there is a README file with a brief description of the library and a top-level Makefile that calls the Makefiles from the listed directories. Naturally, I was most interested in the simd directory, which, as the name suggests, contains functions optimized with SIMD instructions.

From the contents of the simd directory it is clear that four variants of SIMD instructions are supported: SSE2, AVX, AVX2 and FMA4. The function prototypes are declared in the sleefsimd.h file, and the required instruction set is selected at compile time with one of the defines -DENABLE_SSE2, -DENABLE_AVX, -DENABLE_AVX2 or -DENABLE_FMA4. The Makefile builds an executable for testing the functions with each of the instruction sets: iutsse2, iutavx, iutavx2 or iutfma4. These programs are run by the universal tester (from the tester directory) and execute the commands they receive from it. The implementation of these commands lives in the iut.c file, and from it the intended use of the library becomes obvious. For example, here is how iut.c computes the sine of a single double:
file, from where the use of the library becomes obvious. double xxsin(double d) { double s[VECTLENDP]; int i; for(i=0;i<VECTLENDP;i++) { s[i] = random()/(double)RAND_MAX*20000-10000; } int idx = random() & (VECTLENDP-1); s[idx] = d; vdouble a = vloadu(s); a = xsin(a); vstoreu(s, a); return s[idx]; }
To calculate the function even for a single argument (of type double), you need to define an array of length VECTLENDP, fill it with the arguments of interest and pass it to the vloadu function, which copies them into the layout required by the SIMD instructions and returns a value of type vdouble. We pass this vdouble to the xsin function, which calculates the sine of all VECTLENDP arguments at once and again returns a vdouble. The result is unpacked back into an array of double by the vstoreu function. Stripped of the random packing used in iut.c, the pattern looks as sketched below.
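A minimal sketch of this calling convention (my own illustration, not from the SLEEF sources), assuming an AVX build where VECTLENDP is 4:

#include "sleefsimd.h"

void four_sines(void) {
  double s[VECTLENDP] = {0.1, 0.2, 0.3, 0.4}; /* exactly VECTLENDP inputs */

  vdouble a = vloadu(s);   /* pack the values into a SIMD register */
  a = xsin(a);             /* four sines computed in one call */
  vstoreu(s, a);           /* unpack: s[i] now holds sin of the old s[i] */
}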
To estimate the gain, I wrote a simple speed test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "sleefsimd.h"

#define TESTSIZE (VECTLENDP*10000000)

double s[TESTSIZE];
double r1[TESTSIZE];
double r2[TESTSIZE];

#define COUNT 10

int main(int argc, char *argv[]) {
  int k, i;
  clock_t t1, t2;
  double time1, time2;
  double max, rmax;

  srandom(time(NULL));
  for(i = 0; i < TESTSIZE; i++) {
    s[i] = random()/(double)RAND_MAX*20000-10000;
  }

  printf("Testing sin, %d values\n", TESTSIZE*COUNT);
  t1 = clock();
  for(k = 0; k < COUNT; k++) {
    for(i = 0; i < TESTSIZE; i++) {
      r1[i] = sin(s[i]);
    }
  }
  t2 = clock();
  time1 = (double)(t2 - t1)/CLOCKS_PER_SEC;
  printf("Finish sin, spent time = %lg sec\n\n", time1);

  printf("Testing xsin\n");
  t1 = clock();
  for(k = 0; k < COUNT; k++) {
    for(i = 0; i < TESTSIZE; i += VECTLENDP) {
      vdouble a = vloadu(s+i);
      a = xsin(a);
      vstoreu(r2+i, a);
    }
  }
  t2 = clock();
  time2 = (double)(t2 - t1)/CLOCKS_PER_SEC;
  printf("Finish xsin, spent time = %lg sec\n\n", time2);

  printf("Speed ratio: %lf\n", time1/time2);

  max = r1[0] - r2[0];
  rmax = (r1[0] - r2[0])/r1[0];
  for(i = 0; i < TESTSIZE; i++) {
    double delta = (r1[i] - r2[i]);
    if(fabs(delta) > fabs(max)) max = delta;
    delta = (r1[i] - r2[i])/r1[i];
    if(fabs(delta) > fabs(rmax)) rmax = delta;
  }
  printf("Max absolute delta: %lg, relative delta %lg\n", max, rmax);

  return 0;
}

I compiled this test (it is saved as simd/speedtest.c in the SLEEF sources) with the following command:

gcc -O3 -Wall -Wno-unused -Wno-attributes -DENABLE_AVX -mavx speedtest.c sleefsimddp.c sleefsimdsp.c -o speedtest -lm
Testing sin, 400000000 values
Finish sin, spent time = 14.95 sec

Testing xsin
Finish xsin, spent time = 1.31 sec

Speed ratio: 11.412214
Max absolute delta: 5.55112e-17, relative delta 1.58441e-16
That is more than an 11-fold speedup (with the differences at the level of the last bits of the double representation itself), on one processor core! Of course, in a real application the gain will be smaller, but this is already enough motivation to write our own numpy module.

In numpy, elementwise operations are implemented as universal functions (ufuncs), which accept both scalars and arrays and support broadcasting:

>>> from numpy import array, add
>>> add(1, 2)
3
>>> add(array([1,2]), array([4,5]))
array([5, 7])
>>> add(array([1,2]), 1)                      # the scalar is broadcast
array([2, 3])
>>> add(array([[1,2],[3,4]]), array([1,2]))   # the 1-D array is added to each row
array([[2, 4],
       [4, 6]])

The C loop function that numpy will call for double arguments looks like this:
static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;
    int slow_n = n % VECTLENDP;

    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}
The addresses of the input and output data are passed in the args array. In our case the function has one input and one output, so the address of the input data is args[0] and the address of the output is args[1]. The number of elements is passed in dimensions[0]. To step through the input data you increment the pointer by steps[0], and through the output by steps[1] (it is important that the pointer has type char, since the steps are given in bytes). Unfortunately, I could not find a statement in the numpy documentation that the values in steps must be equal to the sizes of the corresponding data types, although experiments showed that on my system this rule holds for arrays of nonzero dimension. If it is violated, the calculation becomes slower, because the elements have to be copied into and out of the tmp array.

Next, we declare the arrays describing the loop functions, the supported types and the additional data:

static PyUFuncGenericFunction funcs[] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};
In the types array, both the input and the output types are listed, so it is longer than the funcs and data arrays. The data array of pointers makes it possible to supply, for each type combination, its own additional parameter, which is passed to the C function as its last argument void* data. In particular, this can be used to implement several different universal functions with a single C function; a sketch of this idea is shown below.
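For illustration, here is a hypothetical sketch (not part of the article's module) of how the data pointer could back two ufuncs with one C loop; the names double_unary, sin_op, cos_op and the *_data arrays are my own:

#include "Python.h"
#include <math.h>
#include "numpy/ndarraytypes.h"
#include "numpy/ufuncobject.h"

/* One generic scalar loop, parameterized through void* data. */
typedef double (*unary_op)(double);

static unary_op sin_op = sin;   /* plain sine from math.h */
static unary_op cos_op = cos;

static void double_unary(char **args, npy_intp *dimensions,
                         npy_intp *steps, void *data)
{
    unary_op op = *(unary_op *)data;        /* data points at a function pointer */
    npy_intp i, n = dimensions[0];
    char *in = args[0], *out = args[1];

    for(i = 0; i < n; i++) {
        *(double *)out = op(*(double *)in); /* scalar loop, no SIMD here */
        in += steps[0];
        out += steps[1];
    }
}

/* The same loop is registered twice, differing only in the data entry: */
static PyUFuncGenericFunction unary_funcs[] = {&double_unary};
static char unary_types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *sin_data[] = {&sin_op};
static void *cos_data[] = {&cos_op};

PyUFunc_FromFuncAndData would then be called twice, once with sin_data and once with cos_data, producing two separate ufuncs that share the loop.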
To create the universal function, we call PyUFunc_FromFuncAndData and pass it the arrays described above (funcs, data and types), as well as the number of input and output arguments of the universal function, the number of supported type combinations, the name of the function inside the module and the documentation string. The complete source of the module:

#include "Python.h"
#include "numpy/ndarraytypes.h"
#include "numpy/ufuncobject.h"
#include "numpy/npy_3kcompat.h"
#include "sleef/sleefsimd.h"

/* The loop definition must precede the PyMODINIT_FUNC. */
static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp* steps, void* data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;
    int slow_n = n % VECTLENDP;

    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}

static PyMethodDef AvxmathMethods[] = {
    {NULL, NULL, 0, NULL}
};

static PyUFuncGenericFunction funcs[1] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};

void register_xsin(PyObject *module)
{
    PyObject *xsin, *d;

    import_array();
    import_umath();

    xsin = PyUFunc_FromFuncAndData(funcs, data, types, 1, 1, 1,
                                   PyUFunc_None, "sin",
                                   "AVX-accelerated sine calculation", 0);

    d = PyModule_GetDict(module);
    PyDict_SetItemString(d, "sin", xsin);
    Py_DECREF(xsin);
}

#if PY_VERSION_HEX >= 0x03000000
static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "avxmath",
    NULL,
    -1,
    AvxmathMethods,
    NULL,
    NULL,
    NULL,
    NULL
};

PyMODINIT_FUNC PyInit_avxmath(void)
{
    PyObject *m;
    m = PyModule_Create(&moduledef);
    if (!m) {
        return NULL;
    }
    register_xsin(m);
    return m;
}
#else
PyMODINIT_FUNC initavxmath(void)
{
    PyObject *m;
    m = Py_InitModule("avxmath", AvxmathMethods);
    if (m == NULL) {
        return;
    }
    register_xsin(m);
}
#endif
To build the module, I took the example setup.py from the numpy documentation, replaced the module name and added the SLEEF C files along with the compiler and linker flags. I saved the module source shown above next to setup.py in a file called avxmath.c, renamed the simd directory from the SLEEF sources to sleef and also put it next to setup.py.

def configuration(parent_package='', top_path=None):
    import numpy
    from numpy.distutils.misc_util import Configuration

    config = Configuration('', parent_package, top_path)
    config.add_extension('avxmath',
                         ['avxmath.c', 'sleef/sleefsimddp.c', 'sleef/sleefsimdsp.c'],
                         extra_compile_args=['-O3', '-Wall', '-Wno-unused',
                                             '-Wno-attributes', '-DENABLE_AVX', '-mavx'],
                         extra_link_args=['-lm'])
    return config

if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup(configuration=configuration)
The module is built with the command python setup.py build_ext --inplace; if it finishes successfully, the result is a ready-made module in the file avxmath.so. Now we can check how our function behaves. Run python in the same directory as avxmath.so and try it:

>>> from numpy import array, pi
>>> import avxmath
>>> avxmath.sin(0)
0.0
>>> avxmath.sin(pi)
1.2246467991473532e-16
>>> avxmath.sin([0, pi/2, pi, 3*pi/2, 2*pi])
array([  0.00000000e+00,   1.00000000e+00,   1.22464680e-16,
        -1.00000000e+00,  -2.44929360e-16])
>>>
Now that the avxmath module imports and works without errors, we can run a small test of the performance and accuracy of the new function:

import numpy
import avxmath
import time
from numpy import random, pi

COUNT = 10

x = 2e4*random.random(40000000) - 1e4

t = time.clock()
for i in xrange(COUNT):
    y1 = numpy.sin(x)
duration1 = time.clock() - t
print "numpy.sin %f sec" % duration1

t = time.clock()
for i in xrange(COUNT):
    y2 = avxmath.sin(x)
duration2 = time.clock() - t
print "avxmath.sin %f sec" % duration2

delta = y2 - y1
rdelta = delta/y1
print "max absolute difference is %lg, relative %lg" % (
    delta[abs(delta).argmax()], rdelta[abs(rdelta).argmax()])
print "speedup is %lg" % (duration1/duration2)
numpy.sin 15.510000 sec
avxmath.sin 2.260000 sec
max absolute difference is 2.22045e-16, relative 2.63873e-16
speedup is 6.86283
By replacing the xsin call with a simple copy of the memory (a sketch of such a "copy only" loop is shown below), it is easy to verify that we fell short of a 10-fold speedup because about 1 second of the 2.26 seconds of execution time we measured is overhead. Similarly, by replacing xsin with the ordinary sine from math.h, we find that in this test the running times of avxmath.sin and numpy.sin coincide to high accuracy.
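A minimal sketch of the "copy only" variant used for such an overhead estimate (my own illustration, modeled on the double_xsin loop shown earlier; the name double_copy is hypothetical):

/* Same signature as double_xsin, but the SIMD sine is replaced by a plain
 * element-by-element copy, so whatever time remains is pure ufunc overhead. */
static void double_copy(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i, n = dimensions[0];
    char *in = args[0], *out = args[1];

    for(i = 0; i < n; i++) {
        *(double *)out = *(double *)in;   /* no math, just move the data */
        in += steps[0];
        out += steps[1];
    }
}

Registering this loop in place of double_xsin and timing it with the same Python script gives the overhead figure mentioned above.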
The avxmath module is available on GitHub.

Source: https://habr.com/ru/post/198916/