Fun starts 2: for_each vs accumulate

Formulation of the problem

Our goal is to understand how much you can trust the optimization of your code to the compiler, and is it possible to facilitate (complicate) its task?

Why is everything the same?

We repeat our measurements (see previous article ), asking the compiler to optimize our code. To prevent the compiler from being carried away, add the sum output to the code:

cout << "sum=" << sum << endl;

Compile option with accumulate:

 sum = accumulate(vec.begin(), vec.end(), 0);

All code

 #include <iostream> #include <sys/time.h> #include <iomanip> #include <vector> #include <algorithm> #include <functional> using namespace std; typedef int A; const int COUNT = 1000 * 1000 * 100; int main () { vector<A>vec(COUNT); generate(begin(vec), end(vec), std::rand); A sum = 0; struct timespec tm1, tm2; clock_gettime(CLOCK_REALTIME, &tm1); sum = accumulate(vec.begin(), vec.end(), A(0)); clock_gettime(CLOCK_REALTIME, &tm2); cout << "accumulate" << endl; double t1 = 1000.0 * tm1.tv_sec + tm1.tv_nsec / (1000.0 * 1000); double t2 = 1000.0 * tm2.tv_sec + tm2.tv_nsec / (1000.0 * 1000); cout << "t=" << setprecision(5) << t2 -t1 << " ms" << endl; cout << "sum=" << sum << endl; return 0; };

With maximum optimization:
')

 $ g++ -std=c++11 main.cpp -o test -O3 && ./test $ duration 33.995 ms mseconds

Let's compare this result with the result of for_each:

 for_each(vec.begin(), vec.end(), [&sum](int i) { sum += i; });

 $ g++ -std=c++11 main.cpp -o test -O3 && ./test $ duration 34.21 ms mseconds

Variants with an explicit cycle give a similar result.
Why did the speed after optimization become the same? In order to answer this question, let's look at the STL and see what the for_each function is:

 template<typename _InputIterator, typename _Function> for_each(_InputIterator __first, _InputIterator __last, _Function __f) { for (; __first != __last; ++__first) __f(*__first); return__f; }

As you can see, for_each is the loop , the only optimization - the compiler makes the for_each function built-in:

 for (; __first != __last; ++__first) [&sum](int i) { sum += i; }(*__first);

Here it seems reasonable to the compiler and make the lambda integrated. Vector iterators are inherently pointers, so in the end, the final assembly code looks like this:

 .L4: addq $1, %rax paddd (%rcx), %xmm0 addq $16, %rcx cmpq %r9, %rax jb .L4

Even if I wrote the code manually, I would not have done better . Since the program is small and there are very few variables, the compiler placed them in registers. No unnecessary copying or function calls - excellent, even addition takes place at once by two elements.

Now it becomes obvious why the “manual” optimization of loops and even the register keyword do not affect the speed of the final code.

Is the speed always the same?

Instead of int, let's summarize a simple classic the size of sizeof (int):

 class A { int v = 0; public: A() {} A(int i); A(const A& a); operator int(); A & operator +=(const A& a); A operator +(const A& a); };

UPDATE . Thank you 0xd34df00d, kmu1990 for the hint - the + operator = should return the link, but in this case it does not matter, we do not use the result of the operator.

And we will put its implementation in a separate file.

a.cpp

 A::A(int i) : v(i) {} A::A(const A &a) : v(av) {} A::operator int() { return v; } A & A::operator +=(const A &a){ v += av; return *this; } AA::operator +(const A &a) { return av + v; }

Now the option with for_each:

 for_each(vec.begin(), vec.end(), [&](A i) { sum += i; });

Compile and run:

 $ g++ -std=c++11 main.cpp a.cpp -o test -O3 && ./test $ duration 372.84 ms mseconds

And a simple loop:

 for(int i = 0; i < COUNT; ++i) { sum += vec[i]; };

Run:

 $ g++ -std=c++11 main.cpp a.cpp -o test -O3 && ./test $ duration 240.57 ms mseconds

Are the cycles still faster? Actually, no, just the correct version with for_each should now look like this:

 for_each(vec.begin(), vec.end(), [&](A &i) { sum += i; });

Then:

 $ g++ -std=c++11 main.cpp a.cpp -o test -O3 && ./test $ duration 240.8 ms mseconds

The fact is that the compiler does not have the right to simply remove the copying of the argument, since it does not know what kind of good things we are doing in the copy operator we wrote.

The speed of work for_each equaled the speed of the cycle, but the variant with accumulate:

 sum = accumulate(vec.begin(), vec.end(), A(0));

Still lags behind:

 $ g++ -std=c++11 main.cpp a.cpp -o test -O3 && ./test $ duration 410.52 ms mseconds

Why is that? Let's look at the assembler code of the variant with for_each:

 .L12: leaq 160(%rsp), %rsi leaq 112(%rsp), %rdi movq %rbx, %rdx call _ZN1ApLERKS_ addq $4, %rbx cmpq %rbp, %rbx jne .L12

And compare it with the option code with accumulate:

 .L7: leaq 144(%rsp), %rsi leaq 192(%rsp), %rdi movq %rbx, %rdx call _ZN1AplERKS_ movl 192(%rsp), %eax addq $4, %rbx cmpq %rbx, %rbp movl %eax, 144(%rsp) jne .L7

This is due to the fact that the compiler from the "+ =" operator generates a lighter code than from the sum assignment:

 template<typename _InputIterator, typename _Tp> accumulate(_InputIterator __first, _InputIterator __last, _Tp __init) { for (; __first != __last; ++__first) __init = __init + *__first; return __init; }

Hence the extra move operations.
By the way, does not it seem strange to you that a specialized function for calculating the sum gives a worse result than a cycle or the use of for_each?

Conclusion

The compiler is not as smart as a man. And it’s easy enough to change the code so that the compiler gets confused and gives us not exactly what we would like or expect.

If we want to get a guaranteed result , we need to do everything “by ourselves”, or each time we check what the compiler has improved and what has deteriorated there. The standard does not guarantee that for_each will be optimized as well as the cycle, and this is important if you are writing portable code.

If the speed of work is not critical - always choose STL. The code is more readable, and on average, the STL code is faster .

Source: https://habr.com/ru/post/260193/

All Articles

Fun starts 2: for_each vs accumulate

Formulation of the problem

Why is everything the same?

Is the speed always the same?

Conclusion

More articles: