
Some thoughts and tips on optimizing C++ code



I wrote this article a long time ago for my blog, which is now abandoned. It seems to me that it still contains very useful information, so I would not like it to simply disappear. Some of it may well be outdated; I will be grateful if readers point that out to me.

As a rule, C++ is used where high speed is required. But in C++ you can effortlessly write code that runs slower than the equivalent Python or Ruby, and it is exactly this kind of code that fuels the numerous Any-Lang vs C++ comparisons.
In general, optimization can be of three types:

  1. Optimizing finished, tested, working code.
  2. Writing optimal code from the start.
  3. Simply using optimal language constructs.

Deliberate optimization of finished code should begin only after the project is complete and in use. As a rule, only a small part of a project needs optimizing, so you first have to find the places in the code that consume most of the CPU time. After all, what is the point of speeding code up by even 500% if it accounts for only 1% of the running time? Remember also that optimizing the algorithms themselves, rather than the code, usually gives a far greater speedup. It is optimization of this kind, done too early, that the saying "premature optimization is evil" refers to.

The second type of optimization is designing the code from the start with the performance requirements in mind. Such design is not premature optimization.

The third type is not quite optimization at all; rather, it is avoiding non-optimal language constructs. C++ is a complex language, and using it well often requires knowing how a given construct is implemented. It is also low-level enough that the programmer has to take the peculiarities of processors and operating systems into account.

1. Features of C++


1.1. Passing arguments


Pass arguments by reference or by pointer, not by value. Passing by value makes a complete copy of the object, and the larger the object, the more there is to copy; if the object also allocates heap memory, it is a real disaster. Simple types, of course, can and should be passed by value, but complex objects should be passed only by reference or by pointer.

 // bad: foo is copied
 void func1(Foo foo) { }

 // good: foo is passed by const reference
 void func2(const Foo &foo) { }


1.2. Exceptions


Use exceptions only where they are really necessary. Exceptions are a rather heavyweight mechanism, so do not use them as a substitute for goto, for exiting nested loops, and the like. A simple rule: throw exceptions only in exceptional situations. This does not mean abandoning them altogether; merely enabling exceptions adds only a meager overhead from the small amount of extra generated code. The real impact on performance comes from using them improperly.
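A minimal sketch of the rule above; the function names and the parsing scenario are invented for illustration. An expected failure (a malformed input) is reported through the return type, while an exception is reserved for a genuinely exceptional condition:

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Expected failure (input may simply be malformed): report it via the
// return type instead of throwing, so the hot path never pays for a throw.
std::optional<int> parse_digit(const std::string &s) {
    if (s.size() != 1 || s[0] < '0' || s[0] > '9')
        return std::nullopt;
    return s[0] - '0';
}

// Truly exceptional situation (a broken precondition): an exception is
// fine here, because it should never fire in a correct program.
int checked_divide(int a, int b) {
    if (b == 0)
        throw std::invalid_argument("division by zero");
    return a / b;
}
```

Callers of parse_digit handle the "not a digit" case with an ordinary if, while checked_divide only throws on programmer error.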

1.3. RTTI


In code that requires high performance, avoid RTTI. In most compilers (perhaps in all?) the RTTI mechanism is implemented through string comparison. Most of the time this is not critical, but when the code must be fast, another solution should be devised, for example numeric class identifiers.
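A sketch of the numeric-identifier idea; the Kind enum and the class names are invented for illustration, not taken from any real code base. A "type check" becomes one virtual call plus an integer comparison:

```cpp
// Hand-rolled alternative to RTTI: each class reports a numeric id.
enum class Kind { Circle = 1, Square = 2 };

struct Shape {
    virtual ~Shape() = default;
    virtual Kind kind() const = 0;   // replaces typeid/dynamic_cast queries
};

struct Circle : Shape {
    Kind kind() const override { return Kind::Circle; }
};

struct Square : Shape {
    Kind kind() const override { return Kind::Square; }
};

// Cheap downcast check: no string comparison involved.
bool is_circle(const Shape &s) { return s.kind() == Kind::Circle; }
```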

1.4. Increment and decrement


Use the prefix form of increment and decrement. A C++ developer should make a habit of writing the prefix form ( ++i and --i ) everywhere and reaching for the postfix form ( i++ ) only when it is actually needed. The postfix form is implemented by saving and returning a temporary copy of the object from before the change. In simple cases, for operations on built-in types, the compiler will optimize the code and avoid the temporary, but for a user-defined class it most likely will not.
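The point is easy to see in the canonical increment pair of a user-defined class (the Counter type here is a minimal illustration): the postfix form must copy *this before mutating it, and that copy is exactly the overhead being discussed.

```cpp
struct Counter {
    int value = 0;

    Counter &operator++() {      // prefix: mutate and return *this, no copy
        ++value;
        return *this;
    }

    Counter operator++(int) {    // postfix: copy, mutate, return the copy
        Counter old = *this;
        ++value;
        return old;
    }
};
```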

1.5. Do not create temporary objects - 1


Temporary objects are created, for example, by code like this:

 std::string s1, s2, s3;
 ...
 std::string s = s1 + s2 + s3;

In this case two unnecessary temporary objects are created: std::string tmp1 = s1 + s2; and std::string tmp2 = tmp1 + s3; . The correct way to concatenate strings looks like this:

 std::string s1, s2, s3;
 ...
 std::string s = s1;
 s += s2;
 s += s3;

1.6. Do not create temporary objects - 2


A variable can be declared anywhere. And if that variable is a complex object, it should not be declared in places where it may never be needed. Example:

 void foo(int x)
 {
     int a, b, c;
     std::string s;  // constructed even when the early return below fires

     if( x == 0 )
         return;
     ...
 }

1.7. Memory reservation


Returning to the example in section 1.5, the fully correct way to concatenate is this:

 std::string s1, s2, s3;
 ...
 std::string s;
 s.reserve( s1.size() + s2.size() + s3.size() );
 s += s1;
 s += s2;
 s += s3;

Memory allocation is very expensive, so allocating once up front with a reserve call can save a lot of CPU time. In the STL this applies to the vector and string classes.
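The same technique with std::vector, as a small sketch (the squares function is invented for illustration): when the final size is known, reserve() gets the buffer in one allocation, and the subsequent push_back calls never reallocate.

```cpp
#include <cstddef>
#include <vector>

// Fill a vector of known final size with a single allocation.
std::vector<int> squares(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);                          // one allocation up front
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i * i)); // never exceeds capacity
    return v;
}
```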

1.8. Generally avoid extra work


It seemed to me that this advice appears in every beginner's book and that a basic understanding of C++ should be enough... yet I still see inexperienced programmers stumble over it.

 std::string s = "";  // 1
 s = "";              // 2

Line 1 calls the constructor std::string(const char *) , which has no idea that the string is empty: it will measure the string's length, allocate memory, run copy loops, handle a possible out-of-memory condition, and so on. This is more expensive than simply

 std::string s;  // already an empty string

Line 2 is the same situation: it calls operator = (const char *s) , which also does not know that the programmer just wanted an empty string. A simple call is more efficient:

 s.clear();  // just clears the string

1.9. Estimate the cost of function calls in for / while loops


With the STL, you do not need to worry that an expensive function is being called here:

 std::vector<int> vec;
 ...
 for(size_t i = 0; i < vec.size(); ++i)
 {
     ...
 }

because in this case it is cheap; the code is equivalent to the following:

 size_t size = vec.size();
 for(size_t i = 0; i < size; ++i)

However, this is not always the case. A frequent misconception before C++11 was to expect the same constant complexity from std::list::size , although in many implementations its complexity was O(N). It looked especially unpleasant where if( list.size() > 0 ) was written instead of if( list.empty() ) .
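The preferred spelling, as a minimal sketch (the has_items helper is invented for illustration): empty() is guaranteed O(1) for every standard container, while size() on std::list was allowed to be O(N) before C++11.

```cpp
#include <list>

// "Is the list non-empty?" phrased with empty(), never with size() > 0:
// empty() is O(1) on every implementation.
bool has_items(const std::list<int> &l) {
    return !l.empty();
}
```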

1.10. Do not use vector where list or deque could do.


The vector container stores its elements as one contiguous block of memory. Therefore, when new items are added and there is not enough capacity, the container has to allocate a new block and copy the data from the old location to the new one. If this happens frequently, performance can drop significantly. Unlike vector , the list and deque containers do not store their data contiguously, so no such copying is required.

On the other hand, a vector with a preliminary reservation (that is, a single allocation of all the memory needed) is the fastest and most economical option, because a list or deque allocates many small chunks of memory. When choosing a container, think about exactly which operations will be performed on it.
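The reallocate-and-copy behaviour is easy to observe; this sketch (the count_reallocations helper is invented for illustration) counts how many times push_back forces the vector to move its storage. With a prior reserve() the answer is zero; without it, the buffer is moved repeatedly as capacity grows.

```cpp
#include <cstddef>
#include <vector>

// Count capacity changes caused by push_back; each change means a new
// allocation plus a copy/move of all existing elements.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first)
        v.reserve(n);              // single allocation up front
    std::size_t reallocs = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t cap = v.capacity();
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap)   // storage was moved
            ++reallocs;
    }
    return reallocs;
}
```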

1.11. References or pointers?


Try to use references, not pointers. References do not require null checks: a reference refers directly to the object, while a pointer holds an address that must first be read.

1.12. Constructor initialization list


Initialize member variables in the constructor's initialization list. Otherwise they are first default-constructed and only then assigned a value.

 // bad
 class Foo
 {
 public:
     Foo()
     {
         a = 0;        // a is assigned after construction
         s = "string"; // s was default-constructed first, then assigned
     }
 private:
     int a;
     std::string s;
 };

 // good
 class Foo
 {
 public:
     Foo() : a(0), s("string") // members initialized directly
     {}
 private:
     int a;
     std::string s;
 };

2. Compilers


The compiler can perform many different optimizations. Sometimes it should be helped; sometimes, on the contrary, an attempt at manual optimization prevents the compiler from optimizing the code as well as it would have done without such "help".

2.1. Unrolling loops


Modern processors contain several functional units (ALUs and FPUs) and can execute instructions in parallel, i.e., several instructions may complete in a single clock cycle on one core. Unrolling a loop therefore lets the work be done in fewer iterations, and it also reduces the number of comparisons and conditional jumps:

 for(int i = 0; i < len; ++i)
 {
     sum += s[i];
 }

should be unrolled into something like

 if( len >= 4 )
 {
     for(; i < len - 3; i += 4)
     {
         sum += s[i];
         sum += s[i + 1];
         sum += s[i + 2];
         sum += s[i + 3];
     }
 }

 switch(len % 4)
 {
     case 3: sum += s[i + 2];
     case 2: sum += s[i + 1];
     case 1: sum += s[i];
             break;
 }

Here is part of the generated assembly, without the switch :

 .L5:
     movsx rdi, BYTE PTR [rbx+rax]
     movsx rdx, BYTE PTR [rbx+1+rax]
     movsx r11, BYTE PTR [rbx+2+rax]
     movsx r10, BYTE PTR [rbx+3+rax]
     movsx r9, BYTE PTR [rbx+4+rax]
     movsx r8, BYTE PTR [rbx+5+rax]
     add rsi, rdi
     movsx rdi, BYTE PTR [rbx+6+rax]
     add rsi, rdx
     movsx rdx, BYTE PTR [rbx+7+rax]
     add rax, 8
     add rsi, r11
     add rsi, r10
     add rsi, r9
     add rsi, r8
     add rsi, rdi
     add rsi, rdx
     cmp rax, rcx
     jne .L5

You can see that here the step is not 4 but 8 bytes per iteration: the compiler unrolled further on its own. Extra conditions inside the loop, or calculations that affect the loop counter, will make unrolling impossible.

The compiler unrolls loops by itself; you should only help it by keeping unrolling possible. Small loops are also best merged into one, but a loop with conditions or a large body should, on the contrary, be split into several loops so that at least one of them can be unrolled by the compiler.

2.2. Laziness of calculations - 1


Remember that the compiler evaluates the operands of && (logical AND) and || (logical OR) from left to right. For logical AND, if the first condition is false, the second is not evaluated at all; likewise for logical OR, if the first condition is true, there is no point in evaluating the second. A simple example:

 const char *s;
 ...
 if( strlen(s) > 3 && s[0] == 'y' )
 {
     ...
 }

We want a string longer than three characters whose first character is y. Here strlen(s) is an expensive operation, while s[0] == 'y' is a cheap one. Accordingly, if you swap them, the length often does not need to be computed at all:

 if( s && s[0] == 'y' && strlen(s) > 3 )
 {
     ...
 }

2.3. Laziness of calculations - 2


Lazy evaluation works only until you overload the && or || operator. An overloaded operator is a function call:

 bool operator && (arg1, arg2)

so all of its arguments must be evaluated before the call.
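A minimal demonstration (the Tracked type and probe helper are invented for illustration): with an overloaded operator&&, both operands are always evaluated, unlike the built-in && on bool.

```cpp
// Toy wrapper with an overloaded && that loses short-circuiting.
struct Tracked {
    bool value;
};

int evaluations = 0;   // counts how many operands were actually evaluated

Tracked probe(bool v) {
    ++evaluations;
    return Tracked{v};
}

// Overloaded operator&& is an ordinary function call: both arguments are
// constructed (and both probe() calls run) before it executes.
bool operator&&(const Tracked &a, const Tracked &b) {
    return a.value && b.value;
}
```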

2.4. Switch or if?


Whenever possible, prefer switch . Unlike a chain of if conditions, a switch is often optimized into a jump table. Example:

 int func(int i)
 {
     if( i == 1 )
         return 10;
     else if( i == 2 )
         return 20;
     else if( i == 3 )
         return 30;
     else if( i == 4 )
         return 40;
     else if( i == 5 )
         return 50;

     return 100;
 }

is translated into

 cmp edi, 1
 mov eax, 10
 je .L2
 cmp edi, 2
 mov al, 20
 je .L2
 cmp edi, 3
 mov al, 30
 je .L2
 cmp edi, 4
 mov al, 40
 je .L2
 mov al, 50
 cmp edi, 5
 mov edx, 100
 cmovne eax, edx

And this code:

 int func(int i)
 {
     switch(i)
     {
         case 1: return 10; break;
         case 2: return 20; break;
         case 3: return 30; break;
         case 4: return 40; break;
         case 5: return 50; break;
         default: return 100;
     }
 }

is translated into

 dec edi
 mov eax, 100
 cmp edi, 4
 ja .L10
 mov eax, DWORD PTR CSWTCH.1[0+rdi*4]

 CSWTCH.1:
     .long 10
     .long 20
     .long 30
     .long 40
     .long 50

2.5. Inline keyword


This keyword would seem to have been invented to speed programs up, but some take it too literally and put inline before every function. That bloats the code, and the larger the code, the more memory it needs, above all cache memory in the processor. Modern compilers have long stopped paying much attention to this keyword and decide for themselves when to inline a function and when not. Yet programmers still try to bloat the code by inserting things like __forceinline . Use inline only where it is really necessary.

2.6. RVO - Return Value Optimization


This optimization allows the C++ compiler not to create a copy of the returned value. The compiler should be helped to apply it.

 // bad
 std::string foo()
 {
     std::string s = "string";
     return s; // returning a named object hinders RVO
 }

 // good
 std::string foo()
 {
     return std::string("string"); // RVO applies
 }

For the same reason a single exit point, though prettier, is less efficient:

 std::string bar(...)
 {
     std::string result;
     if( x )
     {
         result = foo();
     }
     return result; // a named object is returned, RVO does not apply
 }

 std::string bar(...)
 {
     if( x )
     {
         return foo(); // RVO applies
     }
     else
     {
         return std::string(); // RVO applies
     }
 }

This advice has largely lost its relevance, since compilers have learned to apply NRVO well and move semantics now exist. Still, NRVO cannot always kick in, and not every object has a move constructor.

2.7. Alignment of structures


In class and structure declarations, try to order the members by decreasing size, and pay particular attention to grouping together members smaller than 8 bytes. To optimize access, compilers align the addresses of variables: reading a long at an aligned address takes one processor cycle, while reading an unaligned one takes two on the i386 architecture, and on some architectures reading from an unaligned address is simply impossible. Roughly speaking, an unaligned variable straddles several memory cells: part of it sits in one and part in the next. Because of this alignment, a 1-byte variable may end up occupying 4 or 8 bytes. An illustrative example:

 #include <stdio.h>

 class Foo
 {
     int a;
     char b;
     int c;
     char d;
 };

 class Bar
 {
     int a;
     int c;
     char b;
     char d;
 };

 int main()
 {
     printf( "%zu - %zu\n", sizeof(Foo), sizeof(Bar) );
     return 0;
 }

On my machine the output is:

 $ g++ test.cpp && ./a.out
 16 - 12

Here you can see that alignment was done on a four-byte boundary, and two classes with identical members, Foo and Bar , occupy different amounts of memory. Usually you need not pay attention to this, but if you are going to create thousands of instances of a class, the Bar layout is preferable. The compiler itself, of course, has no right to reorder the members.

Of course, you should not compute the size of every member in order to pack them as tightly as possible: member sizes depend on the compiler, the compilation options, and the architecture. Also, do not suppress alignment without a real need.

3. Multithreading


It is important to understand that writing good multithreaded code is not about placing synchronization objects in the right places, but about writing code that needs no synchronization at all.

3.1. Atomic operations


In general, almost any call into the kernel is expensive, whether it concerns memory or many other services, so the fewer such calls a program makes, the better. With synchronization, the extra overhead comes from context switches under contention. If contention is heavy and synchronization is done with a mutex or critical section, the overhead can be very serious, and the more contention there is, the worse it gets. Here is an example of bad code from the fairly well-known programs (as of this writing) LinuxDC++ / StrongDC++ and, probably, other similar programs built on the same code base:

 static long safeInc(volatile long& v)
 {
     pthread_mutex_lock(&mtx);
     long ret = ++v;
     pthread_mutex_unlock(&mtx);
     return ret;
 }

 static long safeDec(volatile long& v)
 {
     pthread_mutex_lock(&mtx);
     long ret = --v;
     pthread_mutex_unlock(&mtx);
     return ret;
 }

 static long safeExchange(volatile long& target, long value)
 {
     pthread_mutex_lock(&mtx);
     long ret = target;
     target = value;
     pthread_mutex_unlock(&mtx);
     return ret;
 }

This is the code compiled when building for Linux. The Windows version, by contrast, is correct:

 static long safeInc(volatile long& v)
 {
     return InterlockedIncrement(&v);
 }

 static long safeDec(volatile long& v)
 {
     return InterlockedDecrement(&v);
 }

 static long safeExchange(volatile long& target, long value)
 {
     return InterlockedExchange(&target, value);
 }

The difference is that the Linux build uses a mutex, while the Windows build uses atomic operations that need no heavy locking.
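Since C++11 the same idea is portable without platform-specific intrinsics; a minimal sketch (the function names mirror the originals for comparison):

```cpp
#include <atomic>

// std::atomic gives lock-free increment/decrement/exchange on both Linux
// and Windows, with no mutex and no InterlockedXxx calls.
long safe_inc(std::atomic<long> &v) { return ++v; }
long safe_dec(std::atomic<long> &v) { return --v; }
long safe_exchange(std::atomic<long> &target, long value) {
    return target.exchange(value);  // returns the previous value
}
```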

3.2. Cache line


Try not to let different threads access closely located parts of memory. For example, suppose there is such a structure:

 struct Shared
 {
     int a;
     int b;
     int c;
 };

and each thread works with one of the variables. If the threads run on different cores, an effect called cache line ping-pong occurs: the two cores constantly need to see each other's changes, which forces the cache line to be flushed and the data re-fetched from RAM. In such cases, when threads must share nearby data, you need to insert padding between the variables so that each ends up in its own cache line. The difficulty is that the cache line size differs between processors; I use a value of 128 bytes:

 struct Shared
 {
     int a;
     char pad1[128];
     int b;
     char pad2[128];
     int c;
 };
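Since C++11 the padding can be expressed with alignas instead of manual char arrays. A sketch under an assumption: 64 bytes is a common cache-line size on x86, but the true value is hardware-dependent (C++17 adds std::hardware_destructive_interference_size as a portable hint).

```cpp
#include <cstddef>

// Each member starts on its own (assumed) 64-byte cache-line boundary,
// so no two members share a line and no ping-pong occurs.
struct SharedAligned {
    alignas(64) int a;
    alignas(64) int b;
    alignas(64) int c;
};

static_assert(sizeof(SharedAligned) >= 3 * 64,
              "each member occupies its own cache line");
```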

4. Operating systems


This is a level you should descend to only if you understand the functions you use well. Pitfalls can lurk in unexpected places.

4.1. Memory


Try to avoid frequent memory allocation: it is a very expensive operation, and the difference in cost between "allocate 100 MB" and "allocate 1 MB" is small. So try to organize the code to allocate a large block in advance and then use it without going back to the OS kernel.

If you do have to allocate often, keep in mind that the allocator built into the standard library is inefficient, especially under active memory operations from parallel threads. Consider using an alternative allocator such as nedmalloc or TCMalloc, or memory pools such as Boost.Pool .

4.2. I / O Buffering


System calls such as read/write or ReadFile/WriteFile do no buffering. So reading one byte reads a whole block from the device and returns a single byte of it; reading the next byte reads the same block all over again. Writing is similar: writing one byte immediately writes that byte. This is very, very inefficient. Use functions that provide buffering instead, for example fread/fwrite .
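A sketch of byte-at-a-time reading through stdio (the read_all helper and the temp-file name in the usage are invented for illustration): fgetc looks like a per-byte call, but the FILE* layer reads the underlying file in large blocks, so only a tiny fraction of the calls reach the kernel, unlike a one-byte read() in a loop.

```cpp
#include <cstdio>
#include <string>

// Read a whole file one byte at a time; stdio's internal buffer makes
// this cheap, because the kernel is asked for large blocks, not bytes.
std::string read_all(const char *path) {
    std::string out;
    if (FILE *f = std::fopen(path, "rb")) {
        for (int c; (c = std::fgetc(f)) != EOF; )
            out.push_back(static_cast<char>(c));
        std::fclose(f);
    }
    return out;
}
```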

5. Processors


5.1. RAM is no longer RAM


RAM stands for Random Access Memory. Today, however, treating main memory as a source of fast random access leads to nothing good, because a memory access costs several hundred CPU cycles!

The only thing that saves us is the processor cache, which costs about a dozen cycles to access (source). It follows that you should try to make all the necessary data fit in the processor cache, and not only the program's data but its code as well. Use sequential memory access where possible. When building a structure such as a hash table or other associative structure, keep unnecessary information out of the keys so that as many of them as possible fit in the cache.

5.2. Signed or unsigned?


Most of the time programmers do not think about whether a variable should be signed. Yet the length of a string cannot be negative, nor can a weight, a price, and many other values. The signed range is probably enough to store the value, but the processor instructions still differ. For example, this code:

 int func(int i)
 {
     return i / 4;
 }

is translated into

 lea eax, [rdi+3]
 test edi, edi
 cmovns eax, edi
 sar eax, 2

while this one:

 unsigned int func(unsigned int i)
 {
     return i / 4;
 }

comes out much shorter:

 mov eax, edi
 shr eax, 2

Typical places where unsigned numbers are preferable: division and multiplication, loop counters, array indices.

Floating-point types cannot be unsigned, but dedicated processor instructions are used for them anyway.

5.3. Branching


The processor pipeline is the conveyor that processes the continuous stream of instructions: the processor keeps feeding instructions into it, so that as soon as one finishes, the next begins. But at a branch, that is, an if ... else , the processor does not know which branch's instructions to fetch, the if side or the else side, so it tries to predict which one will be taken. On a misprediction it has to flush the pipeline and load new instructions into it, and the pipeline stalls meanwhile.

It follows that the sooner the branch predictor figures out which path the program will take, the fewer pipeline flushes occur. The usual recommendation is to place the most likely branch first (that is, in the if part).
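A sketch of the recommendation, plus an explicit hint. Assumptions are flagged in the comments: __builtin_expect is a GCC/Clang extension (C++20 offers the [[likely]] attribute), and the process function and its validity rule are invented for illustration; the hint changes only code layout, never the result.

```cpp
// Portable wrapper: use the GCC/Clang hint where available, a no-op elsewhere.
#if defined(__GNUC__)
#  define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#  define LIKELY(x) (x)
#endif

// Validation where almost every input is expected to be valid:
// the hot, likely branch comes first.
int process(int value) {
    if (LIKELY(value >= 0))
        return value * 2;   // common path
    return -1;              // rare error path
}
```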

6. Conclusion


So, in this article we looked at some ways to optimize C++ code. I hope some of them were useful to you. If you know other techniques not mentioned here, write them in the comments!

And finally, two more tips:

  1. Prefer ready-made solutions. They have already been tested, you do not have to spend time supporting and refining them, and other people will keep developing and fixing them. Reinventing the wheel is great for self-development but very bad for team development.
  2. Think more about how to make the code simple and clear, not fast.

Source: https://habr.com/ru/post/279449/
