Optimization of long arithmetic in C ++

Happy New Year! I will describe a classic plot - optimization of long arithmetic in C ++ with the help of assembly inserts. However, on Habré it was not there yet, so after some hesitation I decided to post it here, you will forgive if you yourself once wrote the same thing and advanced further than me :-)

Suppose that in the BigInt class, which represents a long integer, the fields void * words are declared - mass words, int wc is the number of words in the array (the word length in bytes is specified at the compilation stage). Then the implementation of addition on pure C ++ can be like this:

... typedef unsigned int uInt; typedef unsigned long long int uLong; #define MAX_uInt ((uInt)(-1)) #define MUL ((uLong)MAX_uInt) // MUL: Max_uInt in Unsigned Long int BigInt::WS = sizeof(uInt); BigInt & BigInt::operator+=(const BigInt &rhs) { ... //  ,   uLong carry = 0; for ( uInt *th = (uInt*)words, *oz = (uInt*)rhs.words, *limit = oz + rhs.wc, *thl = th + wc; oz < limit || carry && th < thl; oz++, th++) { uLong t = carry + *th; t += oz < limit ? *oz : 0; if (t > MUL) { carry = 1; t &= MUL; } else carry = 0; *th = (uInt)t; } if (carry) { ... //     } return *this; }

It can be seen that C is not very suitable for solving our problem: it is impossible to add 8 bytes at a time, because there is no space left for a single bit of overflow.

We will consider how many clock cycles the processor spends on adding each 4 bytes from the array of words of a long integer. Like that:

 struct timespec t1, t2; BigInt *a = ..., *b = ...; //    - 400 clock_gettime(CLOCK_MONOTONIC_RAW, &t1); for (int j = 0; j < 10000000; j++) { *a += *b; *a -= *b; } clock_gettime(CLOCK_MONOTONIC_RAW, &t2); printf(...); //   t1  t2

Here is the progress on the GCC optimization keys:

-O1	-O2	-O3
7	6	6

')
Rewrite the body of the addition method using assembly inserts:

 // uInt  int  long long int BigInt & BigInt::operator+=(const BigInt &rhs) { ... //  ,   uInt *th = (uInt*)words, *oz = (uInt*)rhs.words; asm goto ( "clc\n" "l1:\t" "mov\t(%[oz]), %[ax]\n\t" "adc\t%[ax], (%[th])\n\t" "lahf\n\t" "add\t%[ws], %[th]\n\t" "add\t%[ws], %[oz]\n\t" "sahf\n\t" "loop\tl1\n\t" "jnc\t%l[nocarry]\n" : : [th] "r" (th), [oz] "r" (oz), [ws] "I" (sizeof(uInt)), [ax] "a" ((uInt)0), "c" (wc) : : nocarry ); ... //     nocarry: return *this; }

In the future, I will immediately write the result of the compilation (for uInt ≡ long long int):

 ... clc lo: mov (%rbx), %rax adc %rax, (%rdx) lahf add $8, %rdx add $8, %rbx sahf loop lo jnc nocarry ...

Briefly about what is happening here: in the bx and dx registers there are pointers to the current word, adc is a very convenient instruction that adds the source, the receiver and the overflow flag, what you need! lahf and sahf save and load the base flags — these instructions are needed because add clears the overflow flag. loop iterates as many times as it is in the cx register.

This is exactly (honestly) what I wrote first. This variant is performed in 7 cycles in both variants, respectively in 3.5 cycles for every four bytes in the “long version”. GCC is still high, but the resulting assembler is still optimized and optimized.

.

Bad rumors go around loop instructions. We must try without it, especially since it costs almost nothing:

 ... clc lo: mov (%rbx), %rax adc %rax, (%rdx) lahf add $8, %rdx add $8, %rbx sahf dec %rcx jnz lo jnc nocarry ...

The result is 5 (2.5) cycles! Never use loop - it saves the line, but it spends as much as 2 bold cycles on each iteration.

.

I would like to deal with the increase in pointers - is there any way not to touch the CF (carry flag - flag “carry”, overflow)? Fortunately, there is - the lea instruction calculates the reference to the memory and puts it into the receiver register, without touching the flags. Here's how to use it for pointer increments:

 ... clc lo: mov (%rbx), %rax adc %rax, (%rdx) lea 8(%rdx), %rdx lea 8(%rbx), %rbx dec %rcx jnz lo jnc nocarry ...

In fact

  lea 8 (% rdx),% rdx

does the same thing as

  add $ 8,% rdx

- adds the word size to the register.

The processor spends 2 clocks on this variant (that is, 1 clock per 4 bytes). Frankly, I did not expect that lahf / sahf take so much time.

.

What to do next? SSE , unfortunately, are not helpers here: as long as the 64-bit architecture is used, a nonexistent padc xmm instruction , xmm would generate essentially the same flow of microinstructions as the sequence of two ordinary additions, it is impossible to parallelize addition with overflow. Similarly, it means that it is necessary to write two additions in a row, to unfold the assembler cycle. Golden reception optimization.

 // WS = 16 ... clc lo: mov (%rbx), %rax adc %rax, (%rdx) mov 8(%rbx), %rax adc %rax, 8(%rdx) lea 16(%rbx), %rbx lea 16(%rdx), %rdx dec %rcx jnz lo jnc nocarry ...

Now, one iteration is performed in 3 clocks, which corresponds to 0.75 clocks per 4 bytes.

Hooray! GCC -O2 made 8 times! There is no time to think further! Once again, everyone with the coming!

Source: https://habr.com/ru/post/135590/

All Articles

Optimization of long arithmetic in C ++

More articles: