
Maximally accurate code measurement


My article on long arithmetic six months ago contained speed measurements (throughput in cycles) of very short code fragments, just a few instructions each. The measurement technique was crude, but it gave plausible results. Later it turned out that the results were wrong: a superficial approach always takes its toll.

In this post I describe the reliable method of "nanobenchmarking" that I finally arrived at: minimal error, no special libraries or drivers required. Applicability: comparing the single-threaded potential of processors, and plain curiosity.

I use only GCC, so the method is tuned for it. But I will generalize where I can, so that users of other compilers can adapt it.


The RDTSC instruction is the direct measuring tool. Wikipedia rightly notes that it is unreliable and recommends using special OS services instead. However, for micro-measurements those services take too long (hundreds of cycles) and vary from run to run, which introduces unavoidable error. RDTSC itself takes no more than a few tens of cycles, and the count is a constant, or one value out of a small set.
The only drawback of RDTSC for micro-measurements is the floating clock frequency of modern processors, because the time stamp counter always ticks in accordance with the nominal multiplier. Pinning the frequency is not always trivial; search for "disable cpu scaling" together with the name of your OS and processor. A good solution under Gnome is the indicator-cpufreq applet.

The measuring harness consists of three nested cycles.

The inner loop drives the data through the code the same way the real program would. Something in this spirit:

    type input1[n]; type input2[n]; type output[n];
    ...
    for (int i = 0; i < n; i++) {
        // read input1[i] and input2[i],
        // run the code under test,
        // write output[i]
    }

Note that in the phrase "measure the code", "code" in this article means a specific sequence of processor instructions. So the body of the loop must either be written in assembler, or in C with a clear understanding of what you expect from the compiler. To rein in GCC's vigorous activity at -O3, immediately add the options -fno-prefetch-loop-arrays, -fno-unroll-loops, -ftree-vectorizer-verbose=1. Then -fno-tree-vectorize or -ftree-vectorize, depending on what you want at the output: the loop "as written" or its vectorized version.

If you want to measure the processing of one particular input, or code with no input/output at all, still wrap it in a loop. To keep GCC from hoisting the code out of the loop, enable -fno-gcse (global common subexpression elimination), -fno-tree-pta (points-to analysis) and -fno-tree-pre (partial redundancy elimination). See the full list of optimization options.

Align the start of the loop to 32 bytes. With -falign-loops (implied by -O2), GCC does this by itself.

The middle loop contains two RDTSC reads with the constant-length inner loop between them. Its role is to find the minimum time in which the inner loop can execute. 20-30 iterations are enough for all the data to end up in cache, for the initial and final RDTSC to take the same time, and for all the other stars to align, if they exist :-)

The outer loop varies the length of the inner one. Put the initialization of the input data into it, before the middle loop.

The outer loop is necessary because the time obtained from the middle loop always includes a constant: the startup time of the inner loop plus the cost of one branch misprediction at its exit (the smartest Intel cores mispredict less often). So the time from the middle loop cannot simply be divided by the number of iterations.

But even that is not all! The difference between the execution times of inner loops that differ in length by one iteration often varies considerably. The reason is the mutual influence of the pipeline stages: while one iteration is nominally the one executing, stages of neighboring iterations are actually in flight alongside it.
In addition, some stages, such as the branch predictor and the micro-operation scheduler, have peculiar behavior of their own. All of this produces intricate effects.

As a result, a "pattern" of work from 1 to 10-15 (?) iterations long settles on the pipeline:

Exact throughput in cycles therefore makes sense per such pattern, not per iteration.

As the figures in the examples below show, even when measuring whole patterns some variation in the results remains. Presumably RDTSC is not quite as well-behaved as described above :-)

So, having obtained the differences between the execution times of inner loops whose lengths differ by a whole number of patterns, it remains to compute the statistics.

Examples


Compare the measurement results (here and below, all values are in cycles) with those from the article on long arithmetic:

    Crude method:  7.5  5.5  5.5  7  5  2  2.5  3.25 (?)–3.5
    Smart method:  7    6    6    7  5  2  2    3

All further tests were run on two cores: AMD K10 and Intel Core 2 Wolfdale.

First, the measuring tools themselves need to be evaluated.

Empty loop
The inner loop looks like this:

    for (int i = 0; i < inner_len; i++) {
        asm volatile ( "" );
    }

K10: 1.8 ± 0.7
Core 2: 10.0 ± 2.4 per 10 iterations.
Hereafter, "(10, 1.0)" means (pattern length, average per iteration).

RDTSC
Without the middle and inner loops:

    typedef unsigned long long ull;

    inline ull rdtsc() {
        unsigned int lo, hi;
        asm volatile ( "rdtsc\n" : "=a" (lo), "=d" (hi) );
        return ((ull)hi << 32) | lo;
    }

    ...

    for (int i = 0; i < TOTAL_VALUES; i++) {
        ull t1 = rdtsc();
        ull t2 = rdtsc();
        printf("%lld\n", t2 - t1);
    }

K10: 69.7 ± 1.5
Core 2: 31.0 ± 0.3


Approximate sine calculation


It is interesting to see how much can be saved by computing the sine with a third-order Taylor series. For angles from −π/2 to π/2 this gives about two decimal places of accuracy. One can imagine applications where that is enough.

The harness:

    #include <cstdio>
    #include <cstdlib>
    #include <cmath>
    #include <limits>

    typedef unsigned long long ull;

    #define MIDDLE_LEN (20)
    #define TOTAL_VALUES (10000)
    #define VEC_LEN (1) // len in _numbers_
    #define DATA_LEN (TOTAL_VALUES * VEC_LEN)

    inline ull rdtsc() {
        unsigned int lo, hi;
        asm volatile ( "rdtsc\n" : "=a" (lo), "=d" (hi) );
        return ((ull)hi << 32) | lo;
    }

    typedef double my_float;
    #define BYTE_LEN (DATA_LEN * sizeof(my_float))

    int main() {
        my_float *angles = (my_float *) malloc(BYTE_LEN);
        my_float *sines = (my_float *) malloc(BYTE_LEN);

        for (int inner_len = 0; inner_len < DATA_LEN; inner_len += VEC_LEN) {
            for (int i = 0; i < inner_len; i++) {
                // input initialization: set angles[i]
            }
            ull inner_min = std::numeric_limits<ull>::max();
            for (int mi = 0; mi < MIDDLE_LEN; mi++) {
                ull t1 = rdtsc();
                for (int i = 0; i < inner_len; i += VEC_LEN) {
                    // code under test: reads angles[i], writes sines[i]
                }
                ull t = rdtsc() - t1;
                inner_min = t < inner_min ? t : inner_min;
            }
            // minimum time of the inner loop for this length
            printf("%lld\n", inner_min);
        }
    }


FSIN instruction: the exact sine
This is what sin from math.h calls. The micro-operations it generates probably resemble this software implementation of the sine, since its execution speed also depends on the angle. Therefore exact throughput only makes sense when the loop computes the sine of one and the same angle. The average over random angles is needed for comparison with the approximate calculation, which does not depend on the angle.

    // initialization
    my_float randoms[DATA_LEN];
    for (int i = 0; i < DATA_LEN; i++)
        randoms[i] = rand() / 2.0 / RAND_MAX * M_PI;

    // input data: one of
    angles[i] = 0.0 /* or 0.0001, M_PI * 0.5, randoms[i] */;

    // code under test
    asm volatile (
        "fldl (%0)\n\t"
        "fsin\n\t"
        "fstpl (%1)\n\t"
        :: "r" (angles + i), "r" (sines + i)
    );
    angle     0.0            0.0001        π/2                       random
    K10       30.2 ± 10.3    89.8 ± 2.9    143.1 ± 8.5 (2, 71.6)     75.6
    Core 2    40.0 ± 11.0    68.0 ± 5.6    88.0 ± 13.0               89.4


Third-order Taylor series

    // initialization
    my_float d6 = 1.0 / 6.0;
    my_float d120 = 1.0 / 120.0;
    my_float randoms[DATA_LEN];
    for (int i = 0; i < DATA_LEN; i++)
        randoms[i] = rand() / 2.0 / RAND_MAX * M_PI;

    // input data
    angles[i] = randoms[i];

    // code under test
    my_float x = angles[i];
    sines[i] = x - x*x*x*d6 + x*x*x*x*x*d120;

K10: 61.2 ± 15.6 (8, 7.7)
Core 2: 35.2 ± 16.8 (4, 8.8)

Vector Taylor Series
If you simply add the GCC option -ftree-vectorize, the result is in fact the same (see above). Here GCC Vector Extensions are used instead.

    #define VEC_LEN (2)
    typedef my_float float_vector __attribute__ ((vector_size (16)));
    ...
    // initialization
    float_vector d6_v = {1.0 / 6.0, 1.0 / 6.0};
    float_vector d120_v = {1.0 / 120.0, 1.0 / 120.0};
    my_float randoms[DATA_LEN];
    for (int i = 0; i < DATA_LEN; i++)
        randoms[i] = rand() / 2.0 / RAND_MAX * M_PI;

    // input data
    angles[i] = randoms[i];

    // code under test
    float_vector x = *((float_vector *)(angles + i));
    *((float_vector *)(sines + i)) = x - x*x*x*d6_v + x*x*x*x*x*d120_v;
K10: 41.8 ± 14.2 (5, 8.4; 4.2 per sine)
Core 2: 44.3 ± 16.6 (5, 8.9; 4.5 per sine)
One iteration runs slightly slower than in the scalar version, but it computes two sines at once, so one sine comes out almost twice as fast.

It turns out that a sine accurate to two decimal places can be computed at least 10 times faster than the usual one.




P.S.

The described method is not a profiler. It is also unlikely to help with optimization: even when performance is limited by the computational pipeline, candidate solutions can usually just be compared with plain clock() around a loop of a million iterations.

Source: https://habr.com/ru/post/147852/

