
On the question of "lost time"

We had a wonderful opportunity to conduct a small but extremely instructive tactical exercise.


Unfortunately, the optimization of programs that perform a significant amount of computation is not well covered in the literature and, as a rule, the coverage boils down to general principles whose validity is perfectly obvious even before reading the author's arguments, not only after. Since the post mentioned above (search for the quoted words) proposed a rather interesting computational problem that makes it possible to demonstrate these principles, and specific optimization techniques, in action, this post came into being. It deviates somewhat from my favorite direction (I can well imagine solving this problem on a microcontroller of the Cortex-M3 class or even on an Arduino — do try it — although microcontrollers are really intended for somewhat different purposes), but it nevertheless fits into the concept of an MCU programming course.

So we begin.

Let me recall the task: we need to find five non-negative integers such that the sum of the fifth powers of the first four equals the fifth power of the last. The wording is not quite correct and will be refined later, but it will do for a start. A natural algorithm for solving this problem consists of enumerating possible sets of numbers and checking whether the condition holds for each of them. Let us refine the condition right away: the numbers must not exceed a certain limit, otherwise the search may never end if no such set exists at all (we cannot seriously hope to check ALL sets of five non-negative integers; I do not know about the readers, but I will definitely not live to see the end of that search). We write down this natural solution straight away in order to note one important feature of modern programming, and here it is.

Let us agree right away that we run all our programs compiled with MinGW version 6.2 on an x86 machine with an i3-2120 at 3.3 GHz, so that there are no questions about where the numbers come from. If your system parameters differ from mine, the absolute time figures will be somewhat different, although the relative running times of the different variants should be close.
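The naive solution itself is easy to sketch; below is a minimal reconstruction in the spirit of the later listings rather than the author's original code (the author's version apparently used a 32-bit result type, as the later remarks about overflow suggest; long long is used here to keep the sketch correct):

#include "time.h"
#include "stdio.h"

/* A sketch of the naive full enumeration: five nested loops over all values
   from 1 to MAX, checking the condition for every ordered tuple. */
int main(void)
{
    const int MAX = 100;
    clock_t t = clock();
    unsigned long long v = 0;   /* number of checked tuples  */
    int r = 0;                  /* number of solutions found */
    for (int i1 = 1; i1 <= MAX; i1++)
     for (int i2 = 1; i2 <= MAX; i2++)
      for (int i3 = 1; i3 <= MAX; i3++)
       for (int i4 = 1; i4 <= MAX; i4++)
        for (int i5 = 1; i5 <= MAX; i5++)
        {
            v++;
            long long s = (long long)i1*i1*i1*i1*i1 + (long long)i2*i2*i2*i2*i2
                        + (long long)i3*i3*i3*i3*i3 + (long long)i4*i4*i4*i4*i4;
            if (s == (long long)i5*i5*i5*i5*i5)
            {
                r++;
                printf("%4d %4d %4d %4d %4d \n", i1, i2, i3, i4, i5);
            }
        }
    t = clock() - t;
    printf("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC);
    printf(" %llu %d\n", v, r);
    return 0;
}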
What we see: among sets of numbers below 100 no set satisfying the condition was found (in fact there are many, but they do not suit us, because the numbers must be distinct and non-zero... did I not mention that before? One more argument in favor of a precisely written specification), and in the process of enumeration we checked about 10**10 combinations. The program ran for 63 seconds.

We have already understood that the program is incorrect because the conditions were stated incompletely, but before we start modifying it, let us talk about possible ways to speed up the computation with the same algorithm. In my time there were books giving recommendations on machine-independent optimization of execution time, and for this algorithm the following ones would be appropriate:

- move part of the computation out of the inner loop and form an intermediate sum of the first three numbers;
- run the inner loop with a decreasing index rather than an increasing one;
- organize the inner loop as do {} while rather than for;
- make the index variable register;
- optimize the computation of the fifth power, which reduces the number of multiplications from 4 to 3 (see the sketch after this list);
- interleave the computation of the powers so as not to stall the pipeline, and so on...
Initially I was going to apply all these changes and show their effect on speed, demonstrating how tens of percent turn into whole-number factors of speedup, but then I abandoned the idea.
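To illustrate the point about the fifth power (my sketch, not code from the post): two squarings plus one more multiplication do the job.

#include "stdio.h"

/* Fifth power with three multiplications instead of four. */
static long long pow5(long long x)
{
    long long x2 = x * x;    /* first multiplication:  x^2 */
    long long x4 = x2 * x2;  /* second multiplication: x^4 */
    return x4 * x;           /* third multiplication:  x^5 */
}

int main(void)
{
    printf("%lld\n", pow5(144));   /* prints 61917364224 */
    return 0;
}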

The fact is that these recommendations are somewhat outdated, and nowadays the only thing required of the programmer is to let the compiler optimize at level 3. Yes, you need to know how to do this — in my case the optimization control is hidden in Project / Settings ... / Compile / Optimization — but that is all you need to know. Regular readers of my posts (I flatter myself with the hope that there are some) are already used to seeing in my opuses laments about the decline of morals and about how much greener the grass was when I entered the profession, and while I am not retracting any of the opinions expressed earlier, I must now state with complete certainty that modern compilers are much better than their predecessors and almost perfect in terms of generated code, provided they are allowed to optimize.

Such a strong statement requires proof, and here it is: we compile our program with optimization turned on and, on the same processor, see an execution time of 20 seconds, that is, a threefold performance gain (mainly due to the first recommendation, although the others contributed their share as well) without the slightest effort on our part — which is awesome. Those interested can try to apply all of the speedup techniques I have listed by hand and look at the result, but I strongly recommend simply looking at the generated assembly code and realizing that it is almost impossible to improve. Joining the author of the original post, I can warmly recommend the site gcc.godbolt.org, which lets you see the code generated by the gcc compiler for different architectures and different optimization modes.

The performance gain is significant and the topic of algorithm-independent optimization is exhausted, but the post does not end here — is there really anything left to do? Yes, there is, and quite a lot, to improve the program (increase its speed), and here the compiler is no longer our assistant, because it does not understand the essence of the actions performed in the code and cannot suggest the necessary changes; only the programmer can do that. In my opinion this is a stroke of luck, since even modern development tools leave room for creativity, although my young colleague found a certain inconvenience in advanced compilers: "It was easier for you — you knew the compiler would not optimize anything, did not count on it and did everything by hand, while we have to decide at what point our part of the work begins." Well, I will note that if you do all the necessary optimization by hand, things will not get any worse (they will not get better either, since the work will be duplicated), so nothing prevents you from going all the way; I merely suggested a shorter path.

We return to our program and look for ways to speed it up. Or rather, before looking for such ways, we should make sure that the effort is necessary at all. We check the execution time for different values of the boundary and find that for the values 60 and 70 the times differ by more than a factor of 2, which confirms our hypothesis that the running time of the algorithm is O(N**5): (70/60)**5 = 1.17**5 ≈ 2.2. Using this formula to estimate the execution time of our program for a boundary of 1000, we get 19.93 s * (1000/100)**5 ≈ 2 million seconds, or about 553 hours, which somewhat exceeds our expectations, and for larger numbers the time only grows — so it makes sense to optimize.

The following sections are headed by principles for solving engineering problems from the TRIZ system, as far as I remember it.

The principle of exclusion, or removing repetitions


So far we do not see a way to lower the resulting order-of-growth estimate of the algorithm, but we can reduce the coefficient in front of the dominant term. Note that we iterate over all ordered tuples from a given set of natural numbers; if we take into account that the numbers can be ranked and that the result does not depend on the order of the operands, we conclude that it is enough to consider combinations of elements of the set. Enumerating combinations is easily implemented by changing the initial value of the loop variable for each next element, which we do, and the speed of the program grows dramatically, by two orders of magnitude. Well, so it should be: we counted on a gain of 5! times, and that is exactly 120.
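A minimal sketch of this combination enumeration (my reconstruction: it uses the LOOP macro from the listings further below and differs from them only in that the fifth powers are still computed in place; a 64-bit result type is used here, while the author's "Ours" timings were apparently taken with int, the long long times being given in parentheses):

#include "time.h"
#include "stdio.h"
/* Each loop starts one past the previous index, so only ascending
   combinations i1 < i2 < i3 < i4 < i5 are enumerated. */
#define LOOP(n,m) for (int n=m+1; n<=MAX; n++)
#define POW5(i)  ((rez_t)(i)*(i)*(i)*(i)*(i))
typedef long long rez_t;

int main(void)
{
    for (int MAX=50; MAX<=150; MAX+=50)
    {
        clock_t t = clock();
        unsigned long long v=0;   /* checked combinations */
        int r=0;                  /* solutions found */
        LOOP(i1,0) LOOP(i2,i1) LOOP(i3,i2)
        {
            rez_t Sum = POW5(i1)+POW5(i2)+POW5(i3);
            LOOP(i4,i3) LOOP(i5,i4)
            {
                v++;
                if (Sum+POW5(i4)==POW5(i5))
                {
                    r++;
                    printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5);
                }
            }
        }
        t = clock() - t;
        printf("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC);
        printf(" %llu %d\n",v,r);
    }
    return 0;
}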

A small digression: in the original post the author arrives at a similar idea after writing out various combinations of indices on a piece of paper and examining the result, since he honestly admits that he is not strong in combinatorics. Paying tribute to his efforts, I must declare that combinatorics and the elements of probability theory (namely the elements, since probability theory in all its beauty and power cannot be called simple) are the part of mathematics without which analyzing algorithms is counterproductive. Of course I am glad that the author managed to re-derive a method of generating combinations, but this is simply something one needs to know, like the multiplication table. This is my contribution to the occasionally arising disputes about whether a programmer needs mathematics — but it is my personal opinion, expressed in a private conversation, and it may be erroneous.

Returning to the main topic: this is exactly the kind of optimization that the compiler will not do, because here you need to understand the essence of the problem and of the algorithm that implements it, and the compiler is not yet capable of that (the compiler to which you give just two commands still remains an unrealized ideal, although who knows what comes next). And one more important point: the benefit from this kind of optimization usually far exceeds any possible benefit from optimization performed by the compiler, which is confirmed in this case, since 120 >> 3. Let us summarize the results and try to extrapolate them to some larger values; we get:

Border    Without optimization    Compiler      Ours        Both
  10             0.00                 0.00        0.00        0.00
  50             2.32                 0.66        0.02        0.01
 100            62.91                19.93        0.49        0.16 (1.43)
 150           476.20               150.67        3.80        1.21 (10.99)
1000          1800 h*               553 h*       14 h*        5 h* (35 h*)
(times are in seconds unless marked h for hours; an asterisk marks extrapolated values; the figures in parentheses are the times when the result type is long long)

We see that while the original version was practically unusable for boundary values above 200, the last variant is applicable up to 1000. And we even found a solution — two of them, in fact, although one is false because of overflow, but more on that later; for now let us try to improve the speed further.

Principle of preliminary execution


The obvious solution — compute the fifth powers in advance and pick them from a table — certainly gave a result, but frankly a somewhat smaller one than I had hoped for: apparently x86 handles multiplication quite well, while fetching from an array is worse. Still, a gain of almost 8 times is very pleasant. (For some reason there is no performance gain for the variant with result type int.) Purely theoretically I can imagine a compiler that would carry out such an optimization, but so far I have not seen one. Note that in this case we had to pay for performance with additional memory, which is typical: the two directions of optimization, for time and for memory, usually contradict each other.

#include "time.h" #include "stdio.h" #define LOOP(n,m) for (int n=m+1; n<=MAX; n++) #define REZ(i) (Pow5[i]) typedef long long int rez_t; int main(void) { rez_t Pow5[1001]; for (int i=0; i<=1000; i++) Pow5[i]=(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i; for (int MAX=50; MAX<=500; MAX+=50) { clock_t t = clock(); unsigned long long v=0; int r=0; LOOP(i1,0) LOOP(i2,i1) LOOP(i3,i2) { rez_t Sum = REZ(i1)+REZ(i2)+REZ(i3); LOOP(i4,i3) LOOP(i5,i4) { v++; if (Sum+REZ(i4)==REZ(i5)) { r++; printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5); }; }; }; t = clock() - t; printf ("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC); printf(" %llu %d\n",v,r); }; return 1; }; 

The principle of restriction, or exclude inappropriate options


We go further and look for additional paths, for which we slightly reformulate the task: we need to find a number whose fifth power equals some specific number (which happens to be the sum of the fifth powers of four other numbers, but that does not matter). This formulation lets us focus on the word "find" and immediately suggests a direction of optimization — speeding up the search. So far it has been linear, iterating over all possible values, which is not the best method for an ordered data set. The first thought is binary search, and that direction could give a significant gain, but we will go another way: to begin with, we simply stop considering candidates that obviously cannot be a solution — since we look through the values in ascending order, we can stop as soon as the sought value has been exceeded. We could expect roughly a twofold speedup (an exact analytical estimate is hardly possible), and we actually get almost 5 times.
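For reference, since binary search keeps coming up: a sketch of it over the precomputed table of fifth powers (my illustration, not the author's code; like the later listings it assumes a sorted table of non-overflowing values — see the note below):

#include "stdio.h"

typedef long long rez_t;

/* Binary search in the sorted table Pow5[0..n]: returns the index i
   with Pow5[i] == target, or -1 if target is not a fifth power in range. */
static int find_pow5(const rez_t *Pow5, int n, rez_t target)
{
    int lo = 0, hi = n;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (Pow5[mid] == target)      return mid;
        else if (Pow5[mid] < target)  lo = mid + 1;
        else                          hi = mid - 1;
    }
    return -1;
}

int main(void)
{
    rez_t Pow5[1001];
    for (int i = 0; i <= 1000; i++)
        Pow5[i] = (rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i;
    /* Example: the sum 27^5 + 84^5 + 110^5 + 133^5 is itself a fifth power. */
    rez_t sum = Pow5[27] + Pow5[84] + Pow5[110] + Pow5[133];
    printf("%d\n", find_pow5(Pow5, 1000, sum));   /* prints 144 */
    return 0;
}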

A small note: in order to move to this method we had to change the result type to a double-width integer. The point is that the proposed method requires the fifth powers of the candidate numbers to be ordered, and, to my deepest regret, n > m does not imply n mod k > m mod k once n or m exceeds k. You may ask where k came from, since we are simply working with integers; the point is that in any practically implemented architecture, integers (if arithmetic on them is to be reasonably fast) have a limited width, and therefore the result of every operation is effectively taken modulo the maximum representable number. If the result is 32-bit, that number is 2**32 - 1 ~ 2**32 = 4 * (2**10)**3 ~ 4 * (10**3)**3 = 4 * 10**9, and the fifth root of it is a number close to 100 (the exact value is 84). For a 64-bit result the largest admissible boundary becomes 7130, and for a 128-bit result about 5 * 10**7.

 #include "time.h" #include "stdio.h" #define LOOP(n,m) for (int n=m+1; n<=MAX; n++) #define REZ(i) (Pow5[i]) typedef long long rez_t; int main(void) { rez_t Pow5[1001]; for (int i=0; i<=1000; i++) Pow5[i]=(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i; for (int MAX=50; MAX<=500; MAX+=50) { clock_t t = clock(); unsigned long long v=0; int r=0; LOOP(i1,0) LOOP(i2,i1) LOOP(i3,i2) { rez_t Sum3 = REZ(i1)+REZ(i2)+REZ(i3); int i4=i3+1; do { rez_t Sum4 = Sum3 + REZ(i4); int i5=i4+1; do { v++; if (Sum4==REZ(i5)) { r++; printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5); }; i5++; } while ((Sum4>=REZ(i5)) && (i5<=MAX)); i4++; } while (i4<=MAX); }; t = clock() - t; printf ("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC); printf(" %llu %d\n",v,r); }; return 1; }; 

The principle of using previous results (the caterpillar)


Not bad, although binary search should do much better; but let us take into account another consideration that makes this method preferable: we can significantly reduce the search time if we do not revisit numbers smaller than the one found in the previous cycle — since they were smaller than the previous sum, they are guaranteed to be smaller than the current one, which exceeds the previous. It turns out that we can work on the "caterpillar" (two-pointer) principle for the fourth and fifth numbers, and with this method the number of iterations is guaranteed not to exceed 2*n, where n is the length of the interval; compared with the previous n*(n-1)/2 iterations we win by a factor of (n-1)/4, which is very, very significant for large numbers. At the same time the number of iterations is even better than for binary search, which requires n*log2(n) and exceeds our value already for n > 2**2 = 4.

 #include "time.h" #include "stdio.h" #define LOOP(n,m) for (int n=m+1; n<=MAX; n++) #define REZ(i) (Pow5[i]) typedef long long rez_t; int main(void) { rez_t Pow5[1001]; for (int i=0; i<=1000; i++) Pow5[i]=(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i; for (int MAX=50; MAX<=500; MAX+=50) { clock_t t = clock(); unsigned long long v=0; int r=0; LOOP(i1,0) LOOP(i2,i1) LOOP(i3,i2) { rez_t Sum3 = REZ(i1)+REZ(i2)+REZ(i3); int i4=i3+1; int i5=i4+1; do { rez_t Sum4 = Sum3 + REZ(i4); if (Sum4 >= REZ(i5)) do { v++; if (Sum4==REZ(i5)) { r++; printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5); }; i5++; } while ((Sum4>=REZ(i5)) && (i5<=MAX)); i4++; } while ((i4<=MAX) && (i5<MAX)); }; t = clock() - t; printf ("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC); printf(" %llu %d\n",v,r); }; return 1; }; 

The remarkable fact is that for the first time we managed to lower not the coefficient in front of the dominant term but its order, which gives a very substantial speedup for large values of the boundary. We again summarize the results in a table and rejoice at the values obtained.

Border    Table of powers    Restriction    Caterpillar    Inversion
 100           0.3               0.1            0.02           0.04
 150           1.8               0.6            0.14           0.28
 200           6.6               2.3            0.49           0.36
 300          50.4              16.1            2.14           2.26
1000           6 h*              1.5 h*       210.0           880*
5000           2 y*              0.5 y*         36 h*
(times in seconds unless marked: h — hours, y — years; an asterisk marks extrapolated values)

Further calculation shows that for a boundary of 10 thousand we can expect a result within 24 days, which is practically acceptable, although we will not actually go that far.

Further development of the topic was planned around a remarkable property of fifth powers, namely the small number of their residues modulo 11, but this topic is well covered in the next post on Euler's conjecture and I will not repeat it; I will only note that I found no simple way to exploit this property (and still do not see one), while the complicated ways require an indecent amount of memory.
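For the curious (my illustration, not part of the author's argument): the property is easy to verify — by Fermat's little theorem x**10 = 1 (mod 11) for x not divisible by 11, so a fifth power can only be congruent to 0, 1 or 10 modulo 11.

#include "stdio.h"

/* Print the distinct values of x^5 mod 11; only 0, 1 and 10 ever appear. */
int main(void)
{
    int seen[11] = {0};
    for (int x = 0; x < 11; x++)
    {
        long long p = (long long)x*x*x*x*x;   /* x^5, at most 10^5 here */
        seen[p % 11] = 1;
    }
    for (int r = 0; r < 11; r++)
        if (seen[r]) printf("%d ", r);        /* prints: 0 1 10 */
    printf("\n");
    return 0;
}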

Inversion method


Therefore we continue to consider the question from the other side and reformulate the problem as follows: find four natural numbers the sum of whose fifth powers is the fifth power of a natural number. At first glance nothing much has changed, but now we can start the task from the end: we fix the required sum and try to compose it. Such an approach immediately lets us formulate some restrictions on suitable numbers. For example, for the largest of the four numbers we know that it cannot exceed the target number (which is trivial) and that its fifth power cannot be less than one fourth of the fifth power of the target number (which is not so trivial, but obvious), which narrows the range of valid candidates from n-1 to roughly n*(1 - 1/1.32), where 1.32 is the fifth root of 4 — that is, by about a factor of four. Further, having fixed a candidate for the fourth number, we can immediately compute the remaining difference and choose the candidate for the third number with similar considerations applied to it. I personally do not undertake to estimate the effect of such optimization analytically, so we simply write the program and look at the result.

 #include "time.h" #include "stdio.h" #define LOOP(n,m) for (int n=m+1; n<=MAX; n++) #define REZ(i) (Pow5[i]) typedef long long rez_t; int main(void) { rez_t Pow5[1001]; for (int i=0; i<=1000; i++) { Pow5[i]=(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i; }; for (int MAX=100; MAX<=1000; MAX+=50) { clock_t t = clock(); unsigned long long v=0; int r=0; int i5=MAX; do { int i4=i5-1; do { rez_t Ost1=REZ(i5)-REZ(i4); int i3=i4-1; if (REZ(i4)*4 > Ost1) do { rez_t Ost2 = Ost1-REZ(i3); int i2=i3-1; if ((REZ(i3)*3 > Ost2) && (Ost2>0)) do { rez_t Ost3=Ost2-REZ(i2); int i1=i2-1; if ((REZ(i2)*2 > Ost3) && (Ost3>0)) do { v++; if (Ost3==REZ(i1)) { r++; printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5); }; i1--; } while (i1>0); i2--; } while (i2>0); i3--; } while (i3>0); i4--; } while (i4>0); i5--;} while (i5>0); t = clock() - t; printf ("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC); printf(" %llu %d\n",v,r); }; return 1; }; 

Not bad: we managed to reduce the coefficient significantly, but the order remained the same, and for large values we lose to the previous program.

And here comes the insight: we can use the new method to drastically reduce the enumeration of the two larger terms, while the two smaller ones are searched for by the caterpillar method, since we already know their required sum!!! (The number of exclamation marks corresponds to the power of the idea.) That is, we take the smaller number equal to one and the other equal to the maximum possible given the larger terms, and then move the lower or the upper end of the interval depending on the result of the comparison.

 #include "time.h" #include "stdio.h" #define LOOP(n,m) for (int n=m+1; n<=MAX; n++) #define REZ(i) (Pow5[i]) typedef long long rez_t; int main(void) { rez_t Pow5[1001]; for (int i=0; i<=1000; i++) { Pow5[i]=(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i*(rez_t)i; }; for (int MAX=100; MAX<=1000; MAX+=50) { clock_t t = clock(); unsigned long long v=0; int r=0; int i5=MAX; do { int i4=i5-1; do { rez_t Ost1=REZ(i5)-REZ(i4); int i3=i4-1; if (REZ(i4)*4 > Ost1) do { rez_t Ost2 = Ost1-REZ(i3); int i2=i3-1; int i1=1; if ((REZ(i3)*3 > Ost2) && (Ost2>0)) do { v++; rez_t Sum2=REZ(i1)+REZ(i2); if (Ost2==Sum2) { r++; printf("%4d %4d %4d %4d %4d \n",i1,i2,i3,i4,i5); }; if (Sum2>=Ost2) i2--; else i1++; } while (i1<=i2); i3--; } while (i3>0); i4--; } while (i4>0); i5--;} while (i5>0); t = clock() - t; printf ("MAX=%d time is %f seconds.\n", MAX, ((double)t)/CLOCKS_PER_SEC); printf(" %llu %d\n",v,r); }; return 1; }; 

Here is the program, and here is the result: for values up to a thousand it takes only 21.7 seconds — 10 times faster than the previous version, and there is no point even comparing with the initial one. With such a program one can already aim at values on the order of several thousand; I leave that to the inquisitive reader.

Happy New Year everyone.

Source: https://habr.com/ru/post/318816/

