java, purec, simd and tester. In addition, there is a README file with a brief description of the library and a top-level Makefile that pulls in the Makefiles from the listed directories. Naturally, I was most interested in the simd directory, which, as the name suggests, contains the functions optimized with SIMD instructions. A look into the simd directory makes it clear that four variants of SIMD instructions are supported: SSE2, AVX, AVX2 and FMA4. The function prototypes are defined in the sleefsimd.h file, and the required instruction set is selected at compile time with the -DENABLE_SSE2, -DENABLE_AVX, -DENABLE_AVX2 or -DENABLE_FMA4 flag. The Makefile builds an executable for testing the functions with each of the instruction sets: iutsse2, iutavx, iutavx2 or iutfma4. These programs are called from the universal tester (in the tester directory) and execute the commands they receive from it. The commands are implemented in the iut.c file, from which the usage of the library becomes obvious:

```c
double xxsin(double d) {
  double s[VECTLENDP];
  int i;
  /* fill the vector with random arguments ... */
  for(i=0;i<VECTLENDP;i++) {
    s[i] = random()/(double)RAND_MAX*20000-10000;
  }
  /* ... and put the argument of interest into a random slot */
  int idx = random() & (VECTLENDP-1);
  s[idx] = d;

  vdouble a = vloadu(s);
  a = xsin(a);
  vstoreu(s, a);

  return s[idx];
}
```

To compute the sine of a single number (of type double), you define an array of length VECTLENDP, fill it with the arguments of the function of interest and pass it to the vloadu function, which copies them to the location required for SIMD use and returns a value of type vdouble. We pass this vdouble value to the xsin function, which computes the sine of all VECTLENDP arguments at once and again returns a vdouble. The result is unpacked back into an array of double by the vstoreu function.

To estimate the possible speedup, I wrote a simple test:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#include "sleefsimd.h"

#define TESTSIZE (VECTLENDP*10000000)

double s[TESTSIZE];
double r1[TESTSIZE];
double r2[TESTSIZE];

#define COUNT 10

int main(int argc, char *argv[]) {
  int k, i;
  clock_t t1, t2;
  double time1, time2;
  double max, rmax;

  srandom(time(NULL));
  for(i = 0; i < TESTSIZE; i++) {
    s[i] = random()/(double)RAND_MAX*20000-10000;
  }

  /* libm sine */
  printf("Testing sin, %d values\n", TESTSIZE*COUNT);
  t1 = clock();
  for(k = 0; k < COUNT; k++) {
    for(i = 0; i < TESTSIZE; i++) {
      r1[i] = sin(s[i]);
    }
  }
  t2 = clock();
  time1 = (double)(t2 - t1)/CLOCKS_PER_SEC;
  printf("Finish sin, spent time = %lg sec\n\n", time1);

  /* SLEEF sine */
  printf("Testing xsin\n");
  t1 = clock();
  for(k = 0; k < COUNT; k++) {
    for(i = 0; i < TESTSIZE; i += VECTLENDP) {
      vdouble a = vloadu(s+i);
      a = xsin(a);
      vstoreu(r2+i, a);
    }
  }
  t2 = clock();
  time2 = (double)(t2 - t1)/CLOCKS_PER_SEC;
  printf("Finish xsin, spent time = %lg sec\n\n", time2);

  printf("Speed ratio: %lf\n", time1/time2);

  /* find the largest absolute and relative differences
     (fabs() is needed here: abs() would truncate doubles to int) */
  max = r1[0] - r2[0];
  rmax = (r1[0] - r2[0])/r1[0];
  for(i = 0; i < TESTSIZE; i++) {
    double delta = (r1[i] - r2[i]);
    if(fabs(delta) > fabs(max)) max = delta;
    delta = (r1[i] - r2[i])/r1[i];
    if(fabs(delta) > fabs(rmax)) rmax = delta;
  }
  printf("Max absolute delta: %lg, relative delta %lg\n", max, rmax);

  return 0;
}
```

I compiled this test (speedtest.c in the simd directory of the SLEEF sources) with the following command:

```
gcc -O3 -Wall -Wno-unused -Wno-attributes -DENABLE_AVX -mavx speedtest.c sleefsimddp.c sleefsimdsp.c -o speedtest -lm
```

Running it gave:

```
Testing sin, 400000000 values
Finish sin, spent time = 14.95 sec

Testing xsin
Finish xsin, spent time = 1.31 sec

Speed ratio: 11.412214
Max absolute delta: 5.55112e-17, relative delta 1.58441e-16
```

More than a tenfold speedup, with an error on the order of the representation accuracy of double itself, on a single processor core! Of course, in a real application the speedup will be smaller, but this is already enough motivation to write one's own numpy module.
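Before turning to numpy, it is worth showing how the same calls extend to an array of arbitrary length, which is essentially what the ufunc loop written below will do. The following is a minimal sketch of my own, relying only on the sleefsimd.h names already shown (VECTLENDP, vdouble, vloadu, vstoreu, xsin); the helper itself and its tail handling are illustrative and are not part of SLEEF.

```c
#include <string.h>
#include "sleefsimd.h"

/* Illustrative helper (not part of SLEEF): compute sin() for n doubles. */
void sin_array(double *in, double *out, int n) {
    int i;
    /* process full SIMD-width chunks directly from the input array */
    for (i = 0; i + VECTLENDP <= n; i += VECTLENDP) {
        vdouble a = vloadu(in + i);
        a = xsin(a);
        vstoreu(out + i, a);
    }
    /* pad the remaining tail into a temporary buffer */
    if (i < n) {
        double tmp[VECTLENDP];
        memset(tmp, 0, sizeof(tmp));
        memcpy(tmp, in + i, (n - i) * sizeof(double));
        vdouble a = xsin(vloadu(tmp));
        vstoreu(tmp, a);
        memcpy(out + i, tmp, (n - i) * sizeof(double));
    }
}
```

The tail is padded with zeros so that a full vector can always be loaded, and only the valid results are copied back to the output array.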
In numpy, element-wise operations are implemented as universal functions (ufuncs), which work the same way on scalars and on arrays of any shape, broadcasting their arguments when necessary:

```python
>>> from numpy import array, add
>>> add(1, 2)
3
>>> add(array([1,2]), array([4,5]))
array([5, 7])
>>> add(array([1,2]), 1)                       # the scalar is broadcast
array([2, 3])
>>> add(array([[1,2],[3,4]]), array([1,2]))    # broadcast over the rows
array([[2, 4],
       [4, 6]])
```

numpy also allows registering your own universal functions implemented in C. The inner loop that computes the sine with SLEEF looks like this:

```c
static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;

    int slow_n = n % VECTLENDP;
    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    /* slow path: strided data and/or a tail shorter than one SIMD vector,
       copied through the tmp buffer */
    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    /* fast path: contiguous data, loaded and stored directly */
    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}
```

The addresses of the input and output data are passed in the args array. In our case the function has one input and one output, so the address of the input data is args[0] and of the output data args[1]. The number of elements is passed in dimensions[0]. To step through the input data, the pointer must be incremented by steps[0], and through the output data by steps[1] (it is important that the pointers are of type char, since the steps are given in bytes). Unfortunately, I could not find a statement in the numpy documentation that the values in steps equal the sizes of the corresponding data types, although experiments showed that on my system this rule holds for arrays of non-zero dimension. When it is violated, the computation is slower, since the elements have to be additionally copied into and out of the tmp array.

```c
static PyUFuncGenericFunction funcs[] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};
```

The types array lists the input and output types for each registered C function (here both are NPY_DOUBLE), which is why it is longer than the funcs and data arrays. The array of data pointers allows a separate additional parameter to be specified for each type combination; it is passed to the C function as the last argument, void *data. In particular, this can be used to implement several universal functions with a single C function. The universal function itself is created by calling PyUFunc_FromFuncAndData, to which we pass the arrays described above (funcs, data and types), as well as the number of input and output arguments of the universal function, the number of supported type combinations, the name of the function in the module and the documentation string.
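To make the relationship between these arrays more concrete, here is a hypothetical sketch (not part of the module built below) of what registering a second, single-precision loop for the same ufunc could look like; float_xsin, funcs2, types2 and data2 are illustrative names of my own.

```c
#include <math.h>  /* for the scalar fallback below */

/* Hypothetical single-precision loop: a real version would use SLEEF's
 * single-precision sine; a scalar libm fallback keeps the sketch
 * self-contained. */
static void float_xsin(char **args, npy_intp *dimensions,
                       npy_intp *steps, void *data)
{
    npy_intp i, n = dimensions[0];
    char *in = args[0], *out = args[1];
    for (i = 0; i < n; i++) {
        *(float *)out = (float)sin((double)*(float *)in);
        in += steps[0];
        out += steps[1];
    }
}

/* Two loops behind one ufunc: types2 holds the in/out types for each loop,
 * so it is twice as long as funcs2 and data2. */
static PyUFuncGenericFunction funcs2[2] = {&float_xsin, &double_xsin};
static char types2[4] = {NPY_FLOAT,  NPY_FLOAT,     /* loop 0: in, out */
                         NPY_DOUBLE, NPY_DOUBLE};   /* loop 1: in, out */
static void *data2[2] = {NULL, NULL};

/* ntypes becomes 2, nin and nout stay 1:
 * PyUFunc_FromFuncAndData(funcs2, data2, types2, 2, 1, 1,
 *                         PyUFunc_None, "sin", "sine", 0);
 */
```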
Putting it all together, here is the complete source of the module:

```c
#include "Python.h"
#include "numpy/ndarraytypes.h"
#include "numpy/ufuncobject.h"
#include "numpy/npy_3kcompat.h"
#include "sleef/sleefsimd.h"

/* The loop definition must precede the PyMODINIT_FUNC. */
static void double_xsin(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i;
    npy_intp n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];
    double tmp[VECTLENDP];
    vdouble a;

    int slow_n = n % VECTLENDP;
    if(in_step != sizeof(double) || out_step != sizeof(double))
        slow_n = n;

    for(i = 0; i < slow_n; i += VECTLENDP) {
        int j;
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            tmp[j] = *(double *)in;
            in += in_step;
        }
        a = vloadu(tmp);
        a = xsin(a);
        vstoreu(tmp, a);
        for(j = 0; j < VECTLENDP && i + j < slow_n; j++) {
            *(double *)out = tmp[j];
            out += out_step;
        }
    }

    if(n > slow_n) {
        double *in_array = (double *)in;
        double *out_array = (double *)out;
        for(i = 0; i < n - slow_n; i += VECTLENDP) {
            a = vloadu(in_array + i);
            a = xsin(a);
            vstoreu(out_array + i, a);
        }
    }
}

static PyMethodDef AvxmathMethods[] = {
    {NULL, NULL, 0, NULL}
};

static PyUFuncGenericFunction funcs[1] = {&double_xsin};
static char types[] = {NPY_DOUBLE, NPY_DOUBLE};
static void *data[] = {NULL};

void register_xsin(PyObject *module) {
    PyObject *xsin, *d;

    import_array();
    import_umath();

    xsin = PyUFunc_FromFuncAndData(funcs, data, types, 1, 1, 1,
                                   PyUFunc_None, "sin",
                                   "AVX-accelerated sine calculation", 0);

    d = PyModule_GetDict(module);
    PyDict_SetItemString(d, "sin", xsin);
    Py_DECREF(xsin);
}

#if PY_VERSION_HEX >= 0x03000000
static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "avxmath",
    NULL,
    -1,
    AvxmathMethods,
    NULL,
    NULL,
    NULL,
    NULL
};

PyMODINIT_FUNC PyInit_avxmath(void) {
    PyObject *m;
    m = PyModule_Create(&moduledef);
    if (!m) {
        return NULL;
    }
    register_xsin(m);
    return m;
}
#else
PyMODINIT_FUNC initavxmath(void) {
    PyObject *m;
    m = Py_InitModule("avxmath", AvxmathMethods);
    if (m == NULL) {
        return;
    }
    register_xsin(m);
}
#endif
```

For the build, I took the example setup.py from the numpy documentation, replaced the module name, and added the SLEEF C files and the compiler and linker flags. I saved the module source shown above next to setup.py in a file called avxmath.c, renamed the simd directory from the SLEEF sources to sleef, and also put it next to setup.py.

```python
def configuration(parent_package='', top_path=None):
    import numpy
    from numpy.distutils.misc_util import Configuration

    config = Configuration('', parent_package, top_path)
    config.add_extension('avxmath',
                         ['avxmath.c',
                          'sleef/sleefsimddp.c',
                          'sleef/sleefsimdsp.c'],
                         extra_compile_args=['-O3', '-Wall',
                                             '-Wno-unused', '-Wno-attributes',
                                             '-DENABLE_AVX', '-mavx'],
                         extra_link_args=['-lm'])
    return config

if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup(configuration=configuration)
```

The module is built with the command python setup.py build_ext --inplace; its successful execution should produce a ready-made module in the file avxmath.so. Now you can check how our function works. Run python in the same directory as avxmath.so and try:

```python
>>> from numpy import array, pi
>>> import avxmath
>>> avxmath.sin(0)
0.0
>>> avxmath.sin(pi)
1.2246467991473532e-16
>>> avxmath.sin([0, pi/2, pi, 3*pi/2, 2*pi])
array([  0.00000000e+00,   1.00000000e+00,   1.22464680e-16,
        -1.00000000e+00,  -2.44929360e-16])
```

Once the avxmath module is imported and works without errors, we can run a small test of the performance and accuracy of the new function.
```python
import numpy
import avxmath
import time
from numpy import random, pi

COUNT = 10

x = 2e4*random.random(40000000) - 1e4

t = time.clock()
for i in xrange(COUNT):
    y1 = numpy.sin(x)
duration1 = time.clock() - t
print "numpy.sin %f sec" % duration1

t = time.clock()
for i in xrange(COUNT):
    y2 = avxmath.sin(x)
duration2 = time.clock() - t
print "avxmath.sin %f sec" % duration2

delta = y2 - y1
rdelta = delta/y1
print "max absolute difference is %lg, relative %lg" % (
    delta[abs(delta).argmax()], rdelta[abs(rdelta).argmax()])
print "speedup is %lg" % (duration1/duration2)
```

The results:

```
numpy.sin 15.510000 sec
avxmath.sin 2.260000 sec
max absolute difference is 2.22045e-16, relative 2.63873e-16
speedup is 6.86283
```

By replacing the xsin call with a simple memory copy, it is easy to verify that we did not get the full tenfold speedup because roughly 1 second of the 2.26 seconds of execution was spent on overhead. Similarly, replacing xsin with the ordinary sine from math.h, we find that in this test the computation times of avxmath.sin and numpy.sin coincide to high accuracy.
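The memory-copy experiment mentioned above can be reproduced with an inner loop like the one below, dropped into the same avxmath.c and registered in place of double_xsin; this is my own sketch of the idea, not the code actually used for the measurement, and double_copy is a hypothetical name. Its execution time is pure ufunc-dispatch and memory overhead.

```c
#include <string.h>  /* memcpy */

/* Hypothetical loop: same signature as double_xsin, but the xsin call is
 * replaced by a plain copy, so the measured time is only overhead. */
static void double_copy(char **args, npy_intp *dimensions,
                        npy_intp *steps, void *data)
{
    npy_intp i, n = dimensions[0];
    char *in = args[0], *out = args[1];
    npy_intp in_step = steps[0], out_step = steps[1];

    if (in_step == sizeof(double) && out_step == sizeof(double)) {
        memcpy(out, in, n * sizeof(double));        /* contiguous case */
    } else {
        for (i = 0; i < n; i++) {                   /* strided fallback */
            *(double *)out = *(double *)in;
            in += in_step;
            out += out_step;
        }
    }
}
```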
The complete code of the avxmath module is available on GitHub.

Source: https://habr.com/ru/post/198916/