
C/C++ code optimization

This article is a free translation of the article Optimizing C++ / Code optimization / Faster operations. The original can be found at the link.


Foreword


Often, elementary operations that seem conceptually similar can differ dramatically in execution speed. Therefore, when writing code, you need to be able to choose the faster instructions. Of course, some compilers already have built-in optimizations for specific processors, so these methods are sometimes unnecessary, but even then it is useful to know the possibilities for code optimization. It is also worth noting that some of these techniques can even degrade performance, so you need to apply them wisely.


Part 1


The location of members in the structures / classes


Arrange the members of structures/classes so that the most frequently used variables fall within the first 128 bytes, and then sort the members from largest to smallest.


Suppose there is some structure:


struct { char msg[400]; double d; int i; }; 

If we assume that in the above structure the msg member is used only for error messages, while the other members are used for calculations, then you can speed up the calculations by changing the order of the members in the structure to the following:


 struct { double d; int i; char msg[400]; }; 

On some processors, addressing an element is more efficient if its distance from the beginning of the structure is less than 128 bytes. In the first example, addressing the d and i fields through a pointer to the beginning of the structure requires an offset of at least 400 bytes, whereas in the second the offset is only a few bytes, which allows the use of more compact instructions.


Another example:


 struct { bool b; double d; short s; int i; }; 

If we calculate the size of the structure, 1 (bool) + 7 (padding) + 8 (double) + 2 (short) + 2 (padding) + 4 (int), we get 24 bytes, 9 of which are used for alignment, which is quite a lot relative to the structure's size. Let's rewrite the same structure this way:


 struct { double d; int i; short s; bool b; }; 

Let's count again: 8 (double) + 4 (int) + 2 (short) + 1 (bool) + 1 (padding). Now our structure takes 16 bytes. Sorting eliminated the unnecessary padding, giving us a more compact structure.
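The padding arithmetic above can be checked directly with sizeof. A minimal sketch (the names Unsorted and Sorted are ours; the article's structs are anonymous, and the exact byte counts assume a typical 64-bit ABI where double is 8-byte aligned):

```cpp
#include <cstddef>

struct Unsorted { bool b; double d; short s; int i; };  // 24 bytes on common 64-bit ABIs
struct Sorted   { double d; int i; short s; bool b; };  // 16 bytes on the same ABIs

// Sorting members from largest to smallest never increases the size,
// regardless of the exact ABI.
static_assert(sizeof(Sorted) <= sizeof(Unsorted),
              "largest-to-smallest ordering must not grow the struct");
```

Printing sizeof for both structs on your target platform is the quickest way to confirm what the compiler actually does, since padding rules are ABI-specific.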


Converting a floating-point number to an integer


The C++ language does not provide a primitive rounding operation for floating-point numbers. The simplest way to convert a floating-point number x to the nearest integer n is the expression:


 n = int(floor(x + 0.5f)); 

With this method, if x is exactly halfway between two integers, n is rounded up. For example, 0.5 -> 1; 1.5 -> 2; -0.5 -> 0; -1.5 -> -1; ...


Unfortunately, on a number of processors this expression is compiled into very slow machine code. Some processors have specific instructions for rounding numbers. In particular, the Pentium family has the fistp instruction, which, used as in the following code, produces much faster, though not exactly equivalent, code:


 #if defined(__unix__) || defined(__GNUC__)
     // For 32-bit Linux, with GNU/AT&T syntax
     __asm ("fldl %1 \n fistpl %0" : "=m"(n) : "m"(x) : "memory");
 #else
     // For 32-bit Windows, with Intel/MASM syntax
     __asm fld qword ptr x;
     __asm fistp dword ptr n;
 #endif

The code above rounds x to the nearest integer, but if the value of x is exactly between integers, then n will be the closest even integer. For example, 0.5 -> 0; 1.5 -> 2; -0.5 -> 0; -1.5 -> -2; ...


If this result is acceptable or even desirable, and inline assembly can be used, then use this code. Its portability to other processor families, however, leaves much to be desired.
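A portable alternative not mentioned in the original article is std::lrint from <cmath>: it rounds according to the current floating-point rounding mode, which defaults to round-to-nearest-even, i.e. the same tie-breaking as the fistp code above, and modern compilers typically lower it to a single instruction. A minimal sketch (the function name round_to_even is ours):

```cpp
#include <cmath>

// Rounds x to the nearest integer; ties go to the nearest even integer
// under the default rounding mode (FE_TONEAREST), matching fistp.
long round_to_even(double x) {
    return std::lrint(x);
}
```

For example, round_to_even(0.5) yields 0 and round_to_even(1.5) yields 2, reproducing the tie-to-even behavior described above without inline assembly.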


Working with bits


You can improve performance by applying knowledge of bit manipulation; there is a well-known collection of bit hacks that can help with this. Some of the "tricks" cited there are in fact already applied by some compilers, others are useful for solving rare problems, and some are useful only on certain platforms.
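Two classic tricks of the kind that collection describes, shown here as our own illustrative examples (the function names are ours):

```cpp
#include <cstdint>

// A nonzero x is a power of two iff it has exactly one bit set;
// x & (x - 1) clears the lowest set bit.
bool is_power_of_two(std::uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Round x up to the next multiple of a power-of-two alignment
// without division or branching.
std::uint32_t align_up(std::uint32_t x, std::uint32_t pow2) {
    return (x + pow2 - 1) & ~(pow2 - 1);
}
```

As the article notes, compilers already perform many such rewrites themselves (e.g. replacing division by a power of two with a shift), so these tricks pay off mainly where the compiler cannot prove the precondition.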


Array cell size


It happens that the size (which can be checked with sizeof) of small array or vector cells is a power of two, while the size of large cells is not. Direct access to an array cell is performed by multiplying the index by the cell size, which is a constant. If the second factor of this multiplication is a power of two, the operation is much faster, since it is carried out as a bit shift. The same applies to multidimensional arrays.


You can get the desired size by adding unused fields to structures and unused cells to arrays. For example, if each cell is a 3-tuple of floating-point values, it is enough to add a fourth dummy floating-point value to each cell.
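The 3-tuple example can be sketched as follows (the names Vec3 and Vec3Padded are ours, for illustration):

```cpp
// A 3-tuple of floats is 12 bytes; the index multiplication in
// arr[i] then needs a true multiply. Padding with a dummy fourth
// component makes the cell 16 bytes, a power of two, so the
// compiler can address elements with a shift instead.
struct Vec3       { float x, y, z; };       // 12 bytes
struct Vec3Padded { float x, y, z, pad; };  // 16 bytes

static_assert((sizeof(Vec3Padded) & (sizeof(Vec3Padded) - 1)) == 0,
              "padded cell size is a power of two");
```

The dummy field costs a third more memory per cell, so this is a trade of space for indexing speed.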


However, when accessing cells of a multidimensional array in which this constant reaches a sufficiently large power of two, you risk cache contention, which can slow down the computation by a factor of 10 or more. This happens only when the array cells exceed a certain size, which depends on the data cache but typically ranges from 1 to 8 KB. Therefore, if an algorithm must process an array whose cells have, or may have, a size of 1024 bytes or more, it is worth determining whether there is cache contention in order to avoid it.


For example, a 100 x 512 matrix of float is 100 arrays of 512 elements each. Each cell of the first-level array has a size of 512 x 4 = 2048 bytes, and is therefore at risk of cache contention.


To detect contention, it is enough to add one more cell (float) to each last-level array, while still processing the same cells as before, and measure whether the processing time decreases significantly (by at least 20%). If so, you should make this improvement more robust. For this purpose you can use one of the following methods:
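The padding test described above can be sketched like this (our own illustration, using a flat row-major buffer; the names and the column-sum workload are ours). With COLS == 512 the row stride is 2048 bytes, a large power of two, so a column-wise traversal hits the same cache sets on every row; a stride of COLS + 1 breaks that pattern while the code still processes the same COLS elements per row:

```cpp
#include <vector>

constexpr int ROWS = 100, COLS = 512;

// Column-wise sum over a row-major matrix with the given row stride
// (in elements). This access pattern is what triggers the contention
// when stride * sizeof(float) is a large power of two.
float column_sum(const std::vector<float>& m, int stride, int col) {
    float s = 0.0f;
    for (int r = 0; r < ROWS; ++r)
        s += m[r * stride + col];
    return s;
}
```

To run the comparison, time column_sum over a buffer of ROWS * COLS floats with stride COLS against a buffer of ROWS * (COLS + 1) floats with stride COLS + 1; a marked speedup with the padded stride indicates contention.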



Continued in the second part.



Source: https://habr.com/ru/post/339406/

