
How updating to Rust 1.26 sped up my code more than three times

I want to share a little story about the power of LLVM and the advantages of high-level languages over assembly.

I work at Parity Technologies, the company that maintains the Parity Ethereum client. In this client we need fast 256-bit arithmetic, which we have to emulate in software, because no CPU supports it natively.

For a long time we have maintained two implementations of this arithmetic in parallel: one in Rust for stable builds, and one using inline assembly (which is used automatically by the nightly compiler). We do this because we store 256-bit numbers as arrays of 64-bit numbers, and in Rust there was no way to multiply two 64-bit numbers and get a result wider than 64 bits (Rust's integer types only went up to u64). This is despite the fact that x86_64, our main target platform, natively supports 128-bit results when multiplying 64-bit numbers. So we split each 64-bit number into two 32-bit halves (because you can multiply two 32-bit numbers and get a 64-bit result).
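The types involved look roughly like this. This is only a sketch inferred from the snippets below (the real definitions live in Parity's bigint crate and carry many more trait implementations):

// Sketch of the types used throughout this post; the names and the limb
// layout are taken from the code below, everything else is an assumption.
const U256_SIZE: usize = 4;
const U512_SIZE: usize = 8;

/// A 256-bit integer stored as four 64-bit words, least significant first.
pub struct U256(pub [u64; U256_SIZE]);

/// A 512-bit integer: the result type of a full 256 x 256 multiplication.
pub struct U512(pub [u64; U512_SIZE]);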

impl U256 {
    fn full_mul(self, other: Self) -> U512 {
        let U256(ref me) = self;
        let U256(ref you) = other;
        let mut ret = [0u64; U512_SIZE];

        for i in 0..U256_SIZE {
            let mut carry = 0u64;
            // `split` splits a 64-bit number into upper and lower halves
            let (b_u, b_l) = split(you[i]);

            for j in 0..U256_SIZE {
                // This process is so slow that it's faster to check for 0 and skip
                // it if possible.
                if me[j] != 0 || carry != 0 {
                    let a = split(me[j]);

                    // `mul_u32` multiplies a 64-bit number that's been split into
                    // an `(upper, lower)` pair by a 32-bit number to get a 96-bit
                    // result. Yes, 96-bit (it returns a `(u32, u64)` pair).
                    let (c_l, overflow_l) = mul_u32(a, b_l, ret[i + j]);

                    // Since we have to multiply by a 64-bit number, we have to do
                    // this twice.
                    let (c_u, overflow_u) = mul_u32(a, b_u, c_l >> 32);
                    ret[i + j] = (c_l & 0xffffffff) + (c_u << 32);

                    // Then we have to do this complex logic to set the result. Gross.
                    let res = (c_u >> 32) + (overflow_u << 32);
                    let (res, o1) = res.overflowing_add(overflow_l + carry);
                    let (res, o2) = res.overflowing_add(ret[i + j + 1]);
                    ret[i + j + 1] = res;
                    carry = (o1 | o2) as u64;
                }
            }
        }

        U512(ret)
    }
}
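The `split` and `mul_u32` helpers referenced above aren't shown in the post. Judging by the comments, they might look roughly like this (an illustrative sketch, not Parity's actual code):

// Split a 64-bit word into (upper, lower) 32-bit halves, both widened to u64.
fn split(x: u64) -> (u64, u64) {
    (x >> 32, x & 0xffffffff)
}

// Multiply a split word `a` by a 32-bit value `b` (assumed to fit in 32 bits,
// since it comes from `split`), adding in an existing 64-bit word. Returns the
// low 64 bits plus the overflow, i.e. the 96-bit result as a (low, carry) pair.
fn mul_u32(a: (u64, u64), b: u64, existing: u64) -> (u64, u64) {
    let upper = b * a.0;
    let lower = b * a.1;

    let (res1, overflow1) = lower.overflowing_add(upper << 32);
    let (res2, overflow2) = res1.overflowing_add(existing);

    let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
    (res2, carry)
}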

You don't even need to understand the code to see how sub-optimal it is. Checking the compiler output confirms that the generated assembly is far from optimal: it does a lot of extra work, essentially just to get around the limitations of Rust. So we wrote our own version in inline assembly. Using our own assembly matters because x86_64 natively supports multiplying two 64-bit values to get a 128-bit result. When Rust evaluates a * b, where a and b are u64, the processor actually multiplies them into a 128-bit result, and Rust then simply throws away the top 64 bits. We want to keep those upper 64 bits, and the only way to access them efficiently was inline assembly.
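In other words (a trivial illustration, not code from the post), plain u64 multiplication only ever gives you the low half:

fn low_half_only(a: u64, b: u64) -> u64 {
    // As the post notes, the CPU computes a full 128-bit product, but this
    // expression only observes the low 64 bits; `wrapping_mul` just makes the
    // truncation explicit (a plain `a * b` would also panic on overflow in
    // debug builds).
    a.wrapping_mul(b)
}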
As you might guess, our assembly implementation was much faster:

 name            u64.bench ns/iter  inline_asm.bench ns/iter  diff ns/iter  diff %   speedup
 u256_full_mul   243,159            197,396                   -45,763       -18.82%  x 1.23
 u256_mul        268,750            95,843                    -172,907      -64.34%  x 2.80
 u256_mul_small  1,608              789                       -819          -50.93%  x 2.04

u256_full_mul benchmarks the function above, u256_mul multiplies two 256-bit numbers to get a 256-bit result (in Rust you simply produce a 512-bit result and discard the upper half, but in assembly we have a separate implementation), and u256_mul_small multiplies two small 256-bit numbers. As you can see, the assembly code is up to 2.8 times faster. That is much, much better. Unfortunately, it only works on the nightly compiler, and even then only on x86_64. And it wasn't for lack of trying: a lot of effort went into unsuccessful attempts to get the Rust code even within a factor of two of the assembly. There simply was no good way to give the compiler the information it needed.
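For reference, the numbers in these tables come from ordinary `#[bench]` functions. A hypothetical benchmark of the function above might look like this (the constructor and input values here are assumptions; the real benchmarks live in the bigint crate):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn u256_full_mul(b: &mut Bencher) {
    b.iter(|| {
        // Non-trivial inputs, so the optimizer can't shortcut the work.
        let x = U256([0x12345678_9abcdef0, 0x0fedcba9_87654321, !0u64, 1]);
        let y = U256([!0u64, 42, 0xdead_beef, 7]);
        black_box(x.full_mul(y))
    });
}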

Everything changed with Rust 1.26. Now we can write a as u128 * b as u128, and the compiler will use x86_64's native u64-to-u128 multiplication (even though you cast both numbers to u128, it understands that they are "really" just u64 and that you simply want a u128 result). This means our code now looks like this:

impl U256 {
    fn full_mul(self, other: Self) -> U512 {
        let U256(ref me) = self;
        let U256(ref you) = other;
        let mut ret = [0u64; U512_SIZE];

        for i in 0..U256_SIZE {
            let mut carry = 0u64;
            let b = you[i];

            for j in 0..U256_SIZE {
                let a = me[j];

                // This compiles down to just use x86's native 128-bit arithmetic
                let (hi, low) = split_u128(a as u128 * b as u128);

                let overflow = {
                    let existing_low = &mut ret[i + j];
                    let (low, o) = low.overflowing_add(*existing_low);
                    *existing_low = low;
                    o
                };

                carry = {
                    let existing_hi = &mut ret[i + j + 1];
                    let hi = hi + overflow as u64;
                    let (hi, o0) = hi.overflowing_add(carry);
                    let (hi, o1) = hi.overflowing_add(*existing_hi);
                    *existing_hi = hi;
                    (o0 | o1) as u64
                }
            }
        }

        U512(ret)
    }
}
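The `split_u128` helper isn't shown either; all it needs to do is cut the 128-bit product back into two 64-bit words, something like this (same caveat: an illustrative sketch):

// Split a 128-bit value into (high, low) 64-bit halves. On x86_64 this is
// essentially free: the two halves already live in separate registers.
fn split_u128(x: u128) -> (u64, u64) {
    ((x >> 64) as u64, x as u64)
}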

Although this is almost certainly still slower than using LLVM's native i256 type would be, it is a huge improvement. Here is the comparison with the original Rust implementation:

 name            u64.bench ns/iter  u128.bench ns/iter  diff ns/iter  diff %   speedup
 u256_full_mul   243,159            73,416              -169,743      -69.81%  x 3.31
 u256_mul        268,750            85,797              -182,953      -68.08%  x 3.13
 u256_mul_small  1,608              558                 -1,050        -65.30%  x 2.88

Best of all, the speedup now happens on the stable compiler. Since we only ship client binaries built with stable, previously only users who compiled the client from source got the faster code, so this improvement reaches far more users. But hey, that's not all! The new code is also significantly faster than the inline-assembly implementation, even in the benchmark that multiplies two 256-bit numbers to get a 256-bit result, and that despite the fact that the Rust code still computes the full 512-bit result and discards the upper half, while the assembly implementation does not:

 name            inline_asm.bench ns/iter  u128.bench ns/iter  diff ns/iter  diff %   speedup
 u256_full_mul   197,396                   73,416              -123,980      -62.81%  x 2.69
 u256_mul        95,843                    85,797              -10,046       -10.48%  x 1.12
 u256_mul_small  789                       558                 -231          -29.28%  x 1.41

For the full multiplication this is a seriously powerful improvement, especially considering that the original code was highly optimized assembly incantations. The faint of heart may want to look away now, because I'm going to dive into the generated assembly.

Here is the hand-written assembly. I present it without comments, because I want to comment on the version that the compiler actually produces (since, as you will see, the asm! macro hides more than you might expect):

impl U256 {
    /// Multiplies two 256-bit integers to produce full 512-bit integer
    /// No overflow possible
    pub fn full_mul(self, other: U256) -> U512 {
        let self_t: &[u64; 4] = &self.0;
        let other_t: &[u64; 4] = &other.0;
        let mut result: [u64; 8] = unsafe { ::core::mem::uninitialized() };
        unsafe {
            asm!("
                mov $8, %rax
                mulq $12
                mov %rax, $0
                mov %rdx, $1

                mov $8, %rax
                mulq $13
                add %rax, $1
                adc $$0, %rdx
                mov %rdx, $2

                mov $8, %rax
                mulq $14
                add %rax, $2
                adc $$0, %rdx
                mov %rdx, $3

                mov $8, %rax
                mulq $15
                add %rax, $3
                adc $$0, %rdx
                mov %rdx, $4

                mov $9, %rax
                mulq $12
                add %rax, $1
                adc %rdx, $2
                adc $$0, $3
                adc $$0, $4
                xor $5, $5
                adc $$0, $5
                xor $6, $6
                adc $$0, $6
                xor $7, $7
                adc $$0, $7

                mov $9, %rax
                mulq $13
                add %rax, $2
                adc %rdx, $3
                adc $$0, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $9, %rax
                mulq $14
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $9, %rax
                mulq $15
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $12
                add %rax, $2
                adc %rdx, $3
                adc $$0, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $13
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $14
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $15
                add %rax, $5
                adc %rdx, $6
                adc $$0, $7

                mov $11, %rax
                mulq $12
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $11, %rax
                mulq $13
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $11, %rax
                mulq $14
                add %rax, $5
                adc %rdx, $6
                adc $$0, $7

                mov $11, %rax
                mulq $15
                add %rax, $6
                adc %rdx, $7
                "
            : /* $0 */ "={r8}"(result[0]),  /* $1 */ "={r9}"(result[1]),
              /* $2 */ "={r10}"(result[2]), /* $3 */ "={r11}"(result[3]),
              /* $4 */ "={r12}"(result[4]), /* $5 */ "={r13}"(result[5]),
              /* $6 */ "={r14}"(result[6]), /* $7 */ "={r15}"(result[7])
            : /* $8 */  "m"(self_t[0]),  /* $9 */  "m"(self_t[1]),
              /* $10 */ "m"(self_t[2]),  /* $11 */ "m"(self_t[3]),
              /* $12 */ "m"(other_t[0]), /* $13 */ "m"(other_t[1]),
              /* $14 */ "m"(other_t[2]), /* $15 */ "m"(other_t[3])
            : "rax", "rdx"
            :
            );
        }
        U512(result)
    }
}
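If the operand constraints above look cryptic, here is the same technique boiled down to a single 64 x 64 -> 128 multiplication, written in the same old nightly-only asm! dialect this post uses (an illustrative sketch; this syntax has since been replaced in modern Rust):

#![feature(asm)]

fn mul_64_to_128(a: u64, b: u64) -> (u64, u64) {
    let mut lo: u64 = 0;
    let mut hi: u64 = 0;
    unsafe {
        asm!("
            mov $2, %rax
            mulq $3
            mov %rax, $0
            mov %rdx, $1
            "
            : /* $0 */ "=r"(lo), /* $1 */ "=r"(hi)   // outputs
            : /* $2 */ "m"(a),   /* $3 */ "m"(b)     // inputs, forced into memory
            : "rax", "rdx"                           // clobbered registers
            );
    }
    (lo, hi)
}

Note the "m" constraints on the inputs: as discussed below, they are exactly what forces the compiler to spill values it already had in registers back out to memory.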

And here is what it generates. I have commented the code extensively so that you can follow what is going on even if you have never worked with assembly in your life, though you will need some basic low-level concepts, such as the difference between memory and registers. If you need a primer on how CPUs are built, the Wikipedia article on the structure and implementation of processors is a good place to start:

bigint::U256::full_mul:
    /// The function prologue generated by Rust
    pushq  %r15
    pushq  %r14
    pushq  %r13
    pushq  %r12
    subq   $0x40, %rsp
    /// Load both inputs from the stack into registers...
    movq   0x68(%rsp), %rax
    movq   0x70(%rsp), %rcx
    movq   0x78(%rsp), %rdx
    movq   0x80(%rsp), %rsi
    movq   0x88(%rsp), %r8
    movq   0x90(%rsp), %r9
    movq   0x98(%rsp), %r10
    movq   0xa0(%rsp), %r11
    /// ...and then immediately store them right back onto the stack. This
    /// busywork comes from how we wrote the `asm!` block, not from Rust
    /// itself: our constraints demand that the inputs be in memory, so the
    /// compiler dutifully spills them before the real work can begin.
    movq   %rax, 0x38(%rsp)
    movq   %rcx, 0x30(%rsp)
    movq   %rdx, 0x28(%rsp)
    movq   %rsi, 0x20(%rsp)
    /// The second input gets exactly the same pointless treatment.
    movq   %r8, 0x18(%rsp)
    movq   %r9, 0x10(%rsp)
    movq   %r10, 0x8(%rsp)
    movq   %r11, (%rsp)
    /// Only now does the real work start. The loops have been fully
    /// unrolled, so what follows is the equivalent of the two nested loops
    /// in the Rust version:
    ///
    /// for i in 0..U256_SIZE {
    ///     for j in 0..U256_SIZE {
    ///         /* Loop body */
    ///     }
    /// }
    ///
    /// Each step starts by loading a word of `self` into `%rax`, because
    /// `mulq` always multiplies by `%rax`. These moves are part of the
    /// string we passed to the `asm!` macro, not something the compiler
    /// chose to emit.
    movq   0x38(%rsp), %rax
    /// Multiply by a word of `other`. Note that the operand is read from
    /// memory rather than from a register, which is slow. More on that
    /// later.
    mulq   0x18(%rsp)
    /// `mulq` leaves the 128-bit product of the two 64-bit operands in
    /// `%rdx:%rax`. The lower 64 bits go to `%r8` (word 0 of the 512-bit
    /// result), the upper 64 bits to `%r9` (the running total for word 1).
    movq   %rax, %r8
    movq   %rdx, %r9
    /// Now the same for `i = 0, j = 1`
    movq   0x38(%rsp), %rax
    mulq   0x10(%rsp)
    /// Add the low half of the product to the running total for this word...
    addq   %rax, %r9
    /// Adding 0 looks pointless, but the previous addition set the carry
    /// flag if it overflowed (and cleared it otherwise), and `adc` folds
    /// that extra 1 into `%rdx`. This is how carries are propagated.
    adcq   $0x0, %rdx
    /// ...and move the upper 64 bits (plus any carry) into the register
    /// accumulating the next 64-bit word.
    movq   %rdx, %r10
    /// The same again for `j = 2` and `j = 3`
    movq   0x38(%rsp), %rax
    mulq   0x8(%rsp)
    addq   %rax, %r10
    adcq   $0x0, %rdx
    movq   %rdx, %r11
    movq   0x38(%rsp), %rax
    mulq   (%rsp)
    addq   %rax, %r11
    adcq   $0x0, %rdx
    movq   %rdx, %r12
    /// And now the iterations for `i = 1`, `i = 2` and `i = 3`
    movq   0x30(%rsp), %rax
    mulq   0x18(%rsp)
    addq   %rax, %r9
    adcq   %rdx, %r10
    adcq   $0x0, %r11
    adcq   $0x0, %r12
    /// The `xor`s zero out `%r13`, `%r14` and `%r15` before carries are
    /// added into them. Forget to do this (as we once did) and the code
    /// fails only sometimes, producing seemingly random values.
    xorq   %r13, %r13
    adcq   $0x0, %r13
    xorq   %r14, %r14
    adcq   $0x0, %r14
    xorq   %r15, %r15
    adcq   $0x0, %r15
    movq   0x30(%rsp), %rax
    mulq   0x10(%rsp)
    addq   %rax, %r10
    adcq   %rdx, %r11
    adcq   $0x0, %r12
    adcq   $0x0, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x30(%rsp), %rax
    mulq   0x8(%rsp)
    addq   %rax, %r11
    adcq   %rdx, %r12
    adcq   $0x0, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x30(%rsp), %rax
    mulq   (%rsp)
    addq   %rax, %r12
    adcq   %rdx, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x28(%rsp), %rax
    mulq   0x18(%rsp)
    addq   %rax, %r10
    adcq   %rdx, %r11
    adcq   $0x0, %r12
    adcq   $0x0, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x28(%rsp), %rax
    mulq   0x10(%rsp)
    addq   %rax, %r11
    adcq   %rdx, %r12
    adcq   $0x0, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x28(%rsp), %rax
    mulq   0x8(%rsp)
    addq   %rax, %r12
    adcq   %rdx, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x28(%rsp), %rax
    mulq   (%rsp)
    addq   %rax, %r13
    adcq   %rdx, %r14
    adcq   $0x0, %r15
    movq   0x20(%rsp), %rax
    mulq   0x18(%rsp)
    addq   %rax, %r11
    adcq   %rdx, %r12
    adcq   $0x0, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x20(%rsp), %rax
    mulq   0x10(%rsp)
    addq   %rax, %r12
    adcq   %rdx, %r13
    adcq   $0x0, %r14
    adcq   $0x0, %r15
    movq   0x20(%rsp), %rax
    mulq   0x8(%rsp)
    addq   %rax, %r13
    adcq   %rdx, %r14
    adcq   $0x0, %r15
    movq   0x20(%rsp), %rax
    mulq   (%rsp)
    addq   %rax, %r14
    adcq   %rdx, %r15
    /// Finally, write the result registers out to the caller-provided
    /// memory for the return value
    movq   %r8, (%rdi)
    movq   %r9, 0x8(%rdi)
    movq   %r10, 0x10(%rdi)
    movq   %r11, 0x18(%rdi)
    movq   %r12, 0x20(%rdi)
    movq   %r13, 0x28(%rdi)
    movq   %r14, 0x30(%rdi)
    movq   %r15, 0x38(%rdi)
    movq   %rdi, %rax
    addq   $0x40, %rsp
    popq   %r12
    popq   %r13
    popq   %r14
    popq   %r15
    retq

As the comments show, this code has plenty of flaws. The multiplications take their operands from memory rather than from registers, redundant stores and loads are performed, and the CPU is forced through a long run of stores and loads before it can get to the "real" work (the multiply-add chain). That matters: although the CPU can execute stores and loads in parallel with computation, the code is written so that no computation starts until everything has been loaded. This is because the asm! macro hides a lot of details. Essentially, you tell the compiler to put the inputs wherever it likes, and it then splices them into your assembly via string manipulation. The compiler had everything in registers, but our "m" constraints on the inputs told it to put the arrays in memory, so it stores everything back out only to load it again. There are ways to optimize this code, but they are difficult even for an experienced professional. Such code is also error prone: if you forget to zero the output registers with the series of xor instructions, the code fails only sometimes, producing seemingly random values that depend on the internal state of the calling function. It could probably be sped up by replacing "m" with "r" (I haven't tested this, because I only noticed it while investigating, for this article, why the old assembly was so much slower), but that is far from obvious from the source; only someone with deep knowledge of LLVM's assembly constraint syntax would spot it immediately.

For comparison, the Rust code using u128 reads completely naturally: what you write is what you get. Even if you weren't trying to optimize anything, you would probably write something very similar as the simplest solution. At the same time, the code LLVM generates for it is of very high quality. You may notice that it is not that different from the handwritten version, but it fixes some of the problems above (commented below) and also includes a couple of optimizations I wouldn't even have thought of. I could not find any significant optimization that it missed.

Here is the generated assembly:

bigint::U256::full_mul:
    /// The prologue
    pushq  %rbp
    movq   %rsp, %rbp
    pushq  %r15
    pushq  %r14
    pushq  %r13
    pushq  %r12
    pushq  %rbx
    subq   $0x48, %rsp
    movq   0x10(%rbp), %r11
    movq   0x18(%rbp), %rsi
    movq   %rsi, -0x38(%rbp)
    /// Notice that instead of loading everything up front, each value is
    /// loaded just before the multiplication that needs it (here
    /// `movq 0x30(%rbp), %rcx` followed by `movq %rcx, %rax`, so `%rax` is
    /// ready for `mulq`). This lets later loads overlap with the
    /// arithmetic: the CPU is never stuck waiting for memory before it can
    /// start doing useful work.
    movq   0x30(%rbp), %rcx
    movq   %rcx, %rax
    /// Unlike the handwritten code, LLVM multiplies by a register here
    mulq   %r11
    /// LLVM keeps `%rdx` (the upper half of the product) in a register,
    /// since it will be needed again shortly, but spills `%rax` (the lower
    /// half) to the stack: that value is already the finished word 0 of the
    /// result and won't be needed until it is written out at the very end.
    movq   %rdx, %r9
    movq   %rax, -0x70(%rbp)
    movq   %rcx, %rax
    mulq   %rsi
    movq   %rax, %rbx
    movq   %rdx, %r8
    movq   0x20(%rbp), %rsi
    movq   %rcx, %rax
    mulq   %rsi
    /// LLVM also keeps a copy of this operand in `%r13`, because it will be
    /// needed again later, and saves another copy on the stack.
    movq   %rsi, %r13
    movq   %r13, -0x40(%rbp)
    /// Again, values that are about to be used stay in registers rather
    /// than being pushed out to memory; LLVM is careful about which is which.
    movq   %rax, %r10
    movq   %rdx, %r14
    movq   0x28(%rbp), %rdx
    movq   %rdx, -0x48(%rbp)
    movq   %rcx, %rax
    mulq   %rdx
    movq   %rax, %r12
    movq   %rdx, -0x58(%rbp)
    movq   0x38(%rbp), %r15
    movq   %r15, %rax
    mulq   %r11
    addq   %r9, %rbx
    adcq   %r8, %r10
    /// Save the flags (including the carry from the addition above) into `%rcx`
    pushfq
    popq   %rcx
    addq   %rax, %rbx
    movq   %rbx, -0x68(%rbp)
    adcq   %rdx, %r10
    /// And save the carry from this second addition chain into `%r8`.
    pushfq
    popq   %r8
    /// LLVM now restores the flags it stashed in `%rcx`, so the `adc` below
    /// picks up the carry from that earlier addition. Clever. Interleaving
    /// several carry chains like this is exactly the kind of bookkeeping I
    /// would never dare attempt by hand.
    ///
    /// Arguably LLVM misses a minor optimization around this
    /// `popq %rcx`/`pushq %rcx` shuffle, but it is unlikely to make any
    /// measurable difference.
    pushq  %rcx
    popfq
    adcq   %r14, %r12
    pushfq
    popq   %rax
    movq   %rax, -0x50(%rbp)
    movq   %r15, %rax
    movq   -0x38(%rbp), %rsi
    mulq   %rsi
    movq   %rdx, %rbx
    movq   %rax, %r9
    addq   %r10, %r9
    adcq   $0x0, %rbx
    pushq  %r8
    popfq
    adcq   $0x0, %rbx
    /// `setb` writes a 1 to its operand (here a byte on the stack) if the
    /// previous addition overflowed, and a 0 otherwise. It's a cheap way to
    /// materialize the carry flag so it can be added back in later.
    setb   -0x29(%rbp)
    addq   %r12, %rbx
    setb   %r10b
    movq   %r15, %rax
    mulq   %r13
    movq   %rax, %r12
    movq   %rdx, %r8
    movq   0x40(%rbp), %r14
    movq   %r14, %rax
    mulq   %r11
    movq   %rdx, %r13
    movq   %rax, %rcx
    movq   %r14, %rax
    mulq   %rsi
    movq   %rdx, %rsi
    addq   %r9, %rcx
    movq   %rcx, -0x60(%rbp)
    /// Here `%r12` and `%rbx` are added together and the sum lands in
    /// `%rcx`, without disturbing the carry flag. `leaq` is nominally an
    /// address calculation: it computes `&((void*)first)[second]`, which is
    /// simply `first + second` in C terms, so compilers use it as a free
    /// three-operand add. Not a trick I would have come up with myself.
    leaq   (%r12,%rbx), %rcx
    /// The sum is then folded, together with the pending carry, into the
    /// running total.
    adcq   %rcx, %r13
    pushfq
    popq   %rcx
    addq   %rax, %r13
    adcq   $0x0, %rsi
    pushq  %rcx
    popfq
    adcq   $0x0, %rsi
    setb   -0x2a(%rbp)
    orb    -0x29(%rbp), %r10b
    addq   %r12, %rbx
    movzbl %r10b, %ebx
    adcq   %r8, %rbx
    setb   %al
    movq   -0x50(%rbp), %rcx
    pushq  %rcx
    popfq
    adcq   -0x58(%rbp), %rbx
    setb   %r8b
    orb    %al, %r8b
    movq   %r15, %rax
    mulq   -0x48(%rbp)
    movq   %rdx, %r12
    movq   %rax, %rcx
    addq   %rbx, %rcx
    movzbl %r8b, %eax
    adcq   %rax, %r12
    addq   %rsi, %rcx
    setb   %r10b
    movq   %r14, %rax
    mulq   -0x40(%rbp)
    movq   %rax, %r8
    movq   %rdx, %rsi
    movq   0x48(%rbp), %r15
    movq   %r15, %rax
    mulq   %r11
    movq   %rdx, %r9
    movq   %rax, %r11
    movq   %r15, %rax
    mulq   -0x38(%rbp)
    movq   %rdx, %rbx
    addq   %r13, %r11
    leaq   (%r8,%rcx), %rdx
    adcq   %rdx, %r9
    pushfq
    popq   %rdx
    addq   %rax, %r9
    adcq   $0x0, %rbx
    pushq  %rdx
    popfq
    adcq   $0x0, %rbx
    setb   %r13b
    orb    -0x2a(%rbp), %r10b
    addq   %r8, %rcx
    movzbl %r10b, %ecx
    adcq   %rsi, %rcx
    setb   %al
    addq   %r12, %rcx
    setb   %r8b
    orb    %al, %r8b
    movq   %r14, %rax
    movq   -0x48(%rbp), %r14
    mulq   %r14
    movq   %rdx, %r10
    movq   %rax, %rsi
    addq   %rcx, %rsi
    movzbl %r8b, %eax
    adcq   %rax, %r10
    addq   %rbx, %rsi
    setb   %cl
    orb    %r13b, %cl
    movq   %r15, %rax
    mulq   -0x40(%rbp)
    movq   %rdx, %rbx
    movq   %rax, %r8
    addq   %rsi, %r8
    movzbl %cl, %eax
    adcq   %rax, %rbx
    setb   %al
    addq   %r10, %rbx
    setb   %cl
    orb    %al, %cl
    movq   %r15, %rax
    mulq   %r14
    addq   %rbx, %rax
    movzbl %cl, %ecx
    adcq   %rcx, %rdx
    movq   -0x70(%rbp), %rcx
    movq   %rcx, (%rdi)
    movq   -0x68(%rbp), %rcx
    movq   %rcx, 0x8(%rdi)
    movq   -0x60(%rbp), %rcx
    movq   %rcx, 0x10(%rdi)
    movq   %r11, 0x18(%rdi)
    movq   %r9, 0x20(%rdi)
    movq   %r8, 0x28(%rdi)
    movq   %rax, 0x30(%rdi)
    movq   %rdx, 0x38(%rdi)
    movq   %rdi, %rax
    addq   $0x48, %rsp
    popq   %rbx
    popq   %r12
    popq   %r13
    popq   %r14
    popq   %r15
    popq   %rbp
    retq

Although the LLVM-generated version contains a few more instructions, it keeps the slowest ones (loads and stores) to a minimum. For the most part LLVM avoids redundant work, and it applies a number of clever optimizations on top. As a result, the code runs significantly faster.

This is not the first time that carefully written Rust has beaten our assembly. A few months ago I rewrote the addition and subtraction implementations in Rust, making them 20% and 15% faster, respectively, than the assembly versions. Beating the assembly there didn't require 128-bit arithmetic; to get full hardware support from Rust you only need u64::checked_add/checked_sub. Although who knows, maybe in the future we'll switch those to 128-bit arithmetic too and squeeze out even more speed.
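To give a flavour of what that looks like, here is a minimal carry-propagating 256-bit addition in stable Rust (an illustrative sketch using overflowing_add rather than the checked_* methods mentioned above; Parity's actual implementation differs):

// Add two 256-bit numbers stored as four little-endian u64 limbs, returning
// the sum and whether the whole addition overflowed. Each limb addition maps
// directly onto the hardware's add/adc carry chain.
fn add_256(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 || c2;
    }
    (out, carry)
}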

The code from this post can be seen here, and the addition/subtraction work is here. I should note that although the latter's benchmarks already showed multiplication beating the assembly implementation, that was really because the benchmark was mostly multiplying by zero. Hmm. If there is a lesson to be drawn from this, it's that informed optimization is impossible without representative benchmarks.


Source: https://habr.com/ru/post/358974/

