
Balancing Accuracy and Performance


Several important aspects must be taken into account when creating an application that performs calculations, or more precisely, floating-point operations. What do we expect from such applications (scientific ones, in most cases)? First of all, we are interested in the accuracy of the calculations: the result should be as close as possible to the "correct" one. The other side of the coin is the stability of the results and the portability of the application: we want to get the same, consistently reproducible result from run to run, and across different machines and architectures. And last, but not least, there is performance: how fast will our application run with all of this, and when will we get the results of our calculations?

The Intel compiler has a set of options that control the optimization of floating-point calculations. Let's consider the most interesting one, -fp-model, which, judging by the description in the documentation, controls the semantics of floating-point calculations. It is worth noting that similar options exist in other compilers, not only Intel's; we will get to that as well. In effect, this option lets us control the balance between performance and accuracy of calculations. The possible values of -fp-model are: precise, fast[=1|2], strict, source, and [no-]except (Linux*) or except[-] (Windows*). Let's see what they give us when compiling our code.

In the end, all these options let us control the following compiler rules and answer the corresponding questions:

Value safety
Can the compiler perform transformations that change the result?
For example, in safe mode the compiler will not optimize the expression x/x to 1.0, because at run time the value may be 0 or NaN. Reordering of calculations (in particular, applying the laws of associativity or distributivity) may also be prohibited.
Floating-point expression evaluation
With what precision should the compiler compute and round intermediate results?

Floating-point environment access
Can the application change the environment settings (for example, the rounding mode) at run time? In general, the FP environment is a set of registers that controls the operation of the FPU (floating-point unit). It covers the rounding mode, exception flags and masks, the handling of denormalized numbers (flushing to zero), and other functions. With the default options, the compiler assumes that the application does not access the FPU environment.

Contractions (fused multiply-add)
Should the compiler generate fused multiply-add (FMA) instructions, which combine a multiplication and an addition in a single operation? Unlike a pair of separate instructions, no intermediate rounding to N bits is performed after the multiplication: the addition operates on a more precise internal representation, and rounding happens only once, at the end. This can increase accuracy.

Precise floating-point exceptions
Does the compiler take into account that floating-point operations may raise exceptions?
Under certain conditions (for example, division by zero), the FPU may generate an exception. By default, precise exception semantics is disabled. Note that unmasking exceptions and enabling exception semantics are not the same thing: in the latter case, the compiler merely accounts for the fact that floating-point operations can raise an exception. Also, since the FPU is a separate part of the processor, the exception is not delivered immediately, but only when the CPU reaches the next floating-point instruction (at which point it checks for pending FPU exceptions).

So, these five basic questions are controlled by the compiler options as follows:

  option             value safety   expression evaluation   FMA         environment access   exceptions
  precise            safe           varies                  yes         no                   no
  source             safe           as in source            yes         no                   no
  strict             safe           varies                  no          yes                  yes
  fast=1 (default)   unsafe         unknown                 yes         no                   no
  fast=2             very unsafe    unknown                 yes         no                   no
  except / except-   unchanged      as in source            unchanged   unchanged            yes / no

So, we have figured out what these options control. Now let's use examples to understand how and when to use them.

For example, take the Kahan summation algorithm, which allows for more accurate results when summing up floating point numbers.

 float KahanSum(const float A[], int n)
 {
     float sum = 0, Y, T;
     float C = 0;              // A running compensation for lost low-order bits.
     for (int i = 0; i < n; i++) {
         Y = A[i] - C;         // So far, so good: C is zero.
         T = sum + Y;          // Alas, sum is big, Y small, so low-order digits of Y are lost.
         C = (T - sum) - Y;    // (T - sum) recovers the high-order part of Y;
                               // subtracting Y recovers -(low part of Y).
         sum = T;              // Next time around, the lost low part will be added to Y in a fresh attempt.
     }
     return sum;
 }

This algorithm is very sensitive to compiler optimizations. If we apply the algebraic transformations that the compiler is free to perform under the fast model, we get the following:

Since C is hoisted out of the loop as a constant equal to 0, we get:

 Y = A[i] - C   ==>   Y = A[i]
 T = sum + Y    ==>   T = sum + A[i]
 sum = T        ==>   sum = sum + A[i]

As a result, after optimization we get an ordinary summation loop, which is far from what we expected. It is therefore important to forbid the compiler such reorderings and optimizations by setting, for example, the -fp-model precise flag.

By the way, various combinations of these options are interesting, because it feels like precise and source do the same thing. All the models fall into 3 groups:

• A: precise, fast, strict
• B: source
• C: except

Accordingly, the models can be "mixed", but with certain reservations:
• You cannot use fast and except together; and since fast is the default model, you cannot add except without also specifying an option from group A or B.
• You can specify only one model from group A and one from group B. If you specify more than one, the last one on the compilation line (the rightmost) is used.
• If except is specified more than once, the last occurrence is likewise used.

Therefore, in the general case, precise and source set the same floating-point model. But if you set fast and source together, source determines the precision to which intermediate results are rounded (as the name suggests, the precision used in the source code).

What about other compilers? Everything there is much the same, and the meaning is unchanged: the options control the same 5 basic compiler rules for floating-point computation; only the defaults differ. In the Microsoft compiler, for example, even the names of the options and models match Intel's: there is an /fp:precise flag, and it is the default. Note that Microsoft's default thus emphasizes safety of computation, whereas Intel's emphasizes performance (fast=1).
There are differences in compiler behavior, though: with the precise option, the Microsoft compiler will use the maximum precision (extended); that is, the code

 float a, b, c, d;
 double x;
 ...
 x = a*b + c*d;

will be interpreted by the compiler like this:

 float a, b, c, d;
 double x;
 ...
 register tmp1 = a*b;
 register tmp2 = c*d;
 register tmp3 = tmp1 + tmp2;
 x = (double) tmp3;

Of course, such nuances are significant, but once you understand what to control and how, switching to another compiler is easy; the main thing is that it gives the developer this control.

By the way, MSDN has an excellent article on this topic, which describes in detail the behavior of the Microsoft compiler, and discusses many examples.

Let's turn to one more example, this time in Fortran (I mentioned the scientific angle at the beginning):

 REAL T0, T1, T2
 ...
 T0 = 4.0E0 + 0.1E0 + T1 + T2

The question is: how will this expression be evaluated with the -fp-model fast option? Based on the table above, we can assume that the additions may be performed in any order, that intermediate results may be computed in single, double, or extended precision, and that the constant part may be folded at compile time.

For example, the compiler can interpret our code as follows:

 REAL T0, T1, T2
 ...
 T0 = (T1 + T2) + 4.1E0

Or

 REAL T0, T1, T2
 ...
 T0 = (T1 + 4.1E0) + T2

If the -fp-model source option is set (or -fp-model precise; used on their own, they are equivalent), the additions are performed strictly in the order written in the code, single precision is used (as in the code), and the constant may still be folded in advance using the default rounding mode:

 REAL T0, T1, T2
 ...
 T0 = ((4.1E0 + T1) + T2)

Finally, the "strictest" level of control over accuracy is -fp-model strict.
In this case, we get something like this:

 REAL T0, T1, T2
 ...
 T0 = REAL(REAL(REAL(REAL(4.0E0) + REAL(0.1E0)) + REAL(T1)) + REAL(T2))

As with all models other than fast, the precision specified in the code is used (single, in this case). The constant is not folded in advance, because we do not know which rounding mode will be in effect when the application runs.

That is really all I wanted to cover in this post. The topic of floating-point computation is vast, and there is more to tell: besides the options considered here, there are the -prec-div, -prec-sqrt, -ftz, and -assume:protect_parens flags, as well as the IEEE 754-2008 standard, which is supported by many compilers and holds plenty of curiosities of its own. So, given enough interest, we will continue this conversation.

Source: https://habr.com/ru/post/160747/

