`u64`). This is despite the fact that x86_64 (our main target platform) natively supports multiplying two 64-bit numbers into a full 128-bit result. So we split each 64-bit number into two 32-bit halves, because multiplying two 32-bit numbers gives at most a 64-bit result, which does fit in a `u64`:

```rust
impl U256 {
    fn full_mul(self, other: Self) -> U512 {
        let U256(ref me) = self;
        let U256(ref you) = other;
        let mut ret = [0u64; U512_SIZE];

        for i in 0..U256_SIZE {
            let mut carry = 0u64;
            // `split` splits a 64-bit number into upper and lower halves.
            let (b_u, b_l) = split(you[i]);

            for j in 0..U256_SIZE {
                // This process is so slow that it's faster to check for 0
                // and skip it if possible.
                if me[j] != 0 || carry != 0 {
                    let a = split(me[j]);

                    // `mul_u32` multiplies a 64-bit number that's been split into
                    // an `(upper, lower)` pair by a 32-bit number to get a 96-bit
                    // result. Yes, 96-bit (it returns a `(u32, u64)` pair).
                    let (c_l, overflow_l) = mul_u32(a, b_l, ret[i + j]);

                    // Since we have to multiply by a 64-bit number, we have to
                    // do this twice.
                    let (c_u, overflow_u) = mul_u32(a, b_u, c_l >> 32);
                    ret[i + j] = (c_l & 0xffffffff) + (c_u << 32);

                    // Then we have to do this complex logic to set the result. Gross.
                    let res = (c_u >> 32) + (overflow_u << 32);
                    let (res, o1) = res.overflowing_add(overflow_l + carry);
                    let (res, o2) = res.overflowing_add(ret[i + j + 1]);
                    ret[i + j + 1] = res;

                    carry = (o1 | o2) as u64;
                }
            }
        }

        U512(ret)
    }
}
```
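The `split` and `mul_u32` helpers aren't shown in the excerpt above. A minimal sketch of what they could look like, written by me for illustration rather than taken from the crate:

```rust
// Hypothetical reconstruction of the helpers used above.
const U256_SIZE: usize = 4; // number of 64-bit limbs in a U256
const U512_SIZE: usize = 8; // number of 64-bit limbs in a U512

/// Split a 64-bit number into its (upper, lower) 32-bit halves,
/// each stored in a u64 so they can be multiplied without overflow.
fn split(v: u64) -> (u64, u64) {
    (v >> 32, v & 0xffff_ffff)
}

/// Multiply a split 64-bit number `a = (upper, lower)` by a 32-bit value `b`,
/// add `carry`, and return the low 64 bits of the result plus the bits
/// that overflowed out of them.
fn mul_u32(a: (u64, u64), b: u64, carry: u64) -> (u64, u64) {
    let upper = b * a.0;
    let lower = b * a.1;
    let (res1, overflow1) = lower.overflowing_add(upper << 32);
    let (res2, overflow2) = res1.overflowing_add(carry);
    let carry_out = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
    (res2, carry_out)
}
```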
When the processor multiplies `a * b`, where `a` and `b` are `u64` values, it actually produces a full 128-bit result, and Rust simply discards the top 64 bits. We want to keep those upper 64 bits, and the only efficient way to get at them is inline assembly. Here is how the inline-assembly implementation compares with the `u64` code above:

| name | u64.bench ns/iter | inline_asm.bench ns/iter | diff ns/iter | diff% | speedup |
|---|---|---|---|---|---|
| u256_full_mul | 243,159 | 197,396 | -45,763 | -18.82% | x 1.23 |
| u256_mul | 268,750 | 95,843 | -172,907 | -64.34% | x 2.80 |
| u256_mul_small | 1,608 | 789 | -819 | -50.93% | x 2.04 |
`u256_full_mul` benchmarks the function above, `u256_mul` multiplies two 256-bit numbers into a 256-bit result (in Rust you simply compute the full 512-bit result and throw away the upper half, while the assembly has a separate implementation), and `u256_mul_small` multiplies two small 256-bit numbers. As you can see, the assembly code is up to 2.8 times faster. That is much, much better. Unfortunately, it only works on the nightly compiler, and even there only on x86_64. The truth is that it took a lot of effort and many unsuccessful attempts to end up with Rust code that is "as much as" twice as fast as the assembly: there simply was no good way to give the compiler the information it needed. Enter `u128`. Now you can simply write
`a as u128 * b as u128`, and the compiler will use the u64-to-u128 multiplication that is native to x86_64 (even though you cast both numbers to `u128`, it understands that they are "really" just `u64` values and that you only need a `u128` result). This means that our code now looks like this:

```rust
impl U256 {
    fn full_mul(self, other: Self) -> U512 {
        let U256(ref me) = self;
        let U256(ref you) = other;
        let mut ret = [0u64; U512_SIZE];

        for i in 0..U256_SIZE {
            let mut carry = 0u64;
            let b = you[i];

            for j in 0..U256_SIZE {
                let a = me[j];

                // This compiles down to x86_64's native 64x64 -> 128-bit multiplication.
                let (hi, low) = split_u128(a as u128 * b as u128);

                let overflow = {
                    let existing_low = &mut ret[i + j];
                    let (low, o) = low.overflowing_add(*existing_low);
                    *existing_low = low;
                    o
                };

                carry = {
                    let existing_hi = &mut ret[i + j + 1];
                    let hi = hi + overflow as u64;
                    let (hi, o0) = hi.overflowing_add(carry);
                    let (hi, o1) = hi.overflowing_add(*existing_hi);
                    *existing_hi = hi;
                    (o0 | o1) as u64
                }
            }
        }

        U512(ret)
    }
}
```
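The `split_u128` helper isn't shown either; presumably it just splits a `u128` into its high and low 64-bit halves. A minimal sketch (my assumption, not necessarily the crate's exact code):

```rust
/// Split a 128-bit value into (high, low) 64-bit halves.
/// Hypothetical reconstruction of the helper used above.
fn split_u128(v: u128) -> (u64, u64) {
    ((v >> 64) as u64, v as u64)
}
```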
This still isn't ideal, because there is no `i256` type native to LLVM, but the speed has increased enormously. Here is the comparison with the original Rust implementation:

| name | u64.bench ns/iter | u128.bench ns/iter | diff ns/iter | diff% | speedup |
|---|---|---|---|---|---|
| u256_full_mul | 243,159 | 73,416 | -169,743 | -69.81% | x 3.31 |
| u256_mul | 268,750 | 85,797 | -182,953 | -68.08% | x 3.13 |
| u256_mul_small | 1,608 | 558 | -1,050 | -65.30% | x 2.88 |
And here is the comparison with the inline-assembly implementation:

| name | inline_asm.bench ns/iter | u128.bench ns/iter | diff ns/iter | diff% | speedup |
|---|---|---|---|---|---|
| u256_full_mul | 197,396 | 73,416 | -123,980 | -62.81% | x 2.69 |
| u256_mul | 95,843 | 85,797 | -10,046 | -10.48% | x 1.12 |
| u256_mul_small | 789 | 558 | -231 | -29.28% | x 1.41 |
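For reference, numbers like these come from Rust's unstable built-in benchmark harness (`cargo bench` on nightly). Here is a self-contained sketch of how such a benchmark might look; it uses a free-standing copy of the u128 routine over plain limb arrays rather than the crate's actual `U256` type, and the benchmark inputs are made up:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Free-standing version of the u128-based full multiplication, operating
// directly on [u64; 4] limbs so the snippet is self-contained.
fn full_mul(me: [u64; 4], you: [u64; 4]) -> [u64; 8] {
    let mut ret = [0u64; 8];
    for i in 0..4 {
        let mut carry = 0u64;
        let b = you[i];
        for j in 0..4 {
            let a = me[j];
            let wide = a as u128 * b as u128;
            let (hi, low) = ((wide >> 64) as u64, wide as u64);
            let (low, o) = low.overflowing_add(ret[i + j]);
            ret[i + j] = low;
            let hi = hi + o as u64;
            let (hi, o0) = hi.overflowing_add(carry);
            let (hi, o1) = hi.overflowing_add(ret[i + j + 1]);
            ret[i + j + 1] = hi;
            carry = (o0 | o1) as u64;
        }
    }
    ret
}

#[bench]
fn u256_full_mul(bench: &mut Bencher) {
    // Arbitrary example inputs; the article's real benchmarks may differ.
    let a = [u64::max_value(); 4];
    let b = [0x1234_5678_9abc_def0u64; 4];
    bench.iter(|| full_mul(black_box(a), black_box(b)));
}
```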
Here is the old implementation built on the `asm!` macro (`asm!` hides more than you would expect):

```rust
impl U256 {
    /// Multiplies two 256-bit integers to produce full 512-bit integer
    /// No overflow possible
    pub fn full_mul(self, other: U256) -> U512 {
        let self_t: &[u64; 4] = &self.0;
        let other_t: &[u64; 4] = &other.0;
        let mut result: [u64; 8] = unsafe { ::core::mem::uninitialized() };
        unsafe {
            asm!("
                mov $8, %rax
                mulq $12
                mov %rax, $0
                mov %rdx, $1

                mov $8, %rax
                mulq $13
                add %rax, $1
                adc $$0, %rdx
                mov %rdx, $2

                mov $8, %rax
                mulq $14
                add %rax, $2
                adc $$0, %rdx
                mov %rdx, $3

                mov $8, %rax
                mulq $15
                add %rax, $3
                adc $$0, %rdx
                mov %rdx, $4

                mov $9, %rax
                mulq $12
                add %rax, $1
                adc %rdx, $2
                adc $$0, $3
                adc $$0, $4
                xor $5, $5
                adc $$0, $5
                xor $6, $6
                adc $$0, $6
                xor $7, $7
                adc $$0, $7

                mov $9, %rax
                mulq $13
                add %rax, $2
                adc %rdx, $3
                adc $$0, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $9, %rax
                mulq $14
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $9, %rax
                mulq $15
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $12
                add %rax, $2
                adc %rdx, $3
                adc $$0, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $13
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $14
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $10, %rax
                mulq $15
                add %rax, $5
                adc %rdx, $6
                adc $$0, $7

                mov $11, %rax
                mulq $12
                add %rax, $3
                adc %rdx, $4
                adc $$0, $5
                adc $$0, $6
                adc $$0, $7

                mov $11, %rax
                mulq $13
                add %rax, $4
                adc %rdx, $5
                adc $$0, $6
                adc $$0, $7

                mov $11, %rax
                mulq $14
                add %rax, $5
                adc %rdx, $6
                adc $$0, $7

                mov $11, %rax
                mulq $15
                add %rax, $6
                adc %rdx, $7
                "
            : /* $0 */ "={r8}"(result[0]),  /* $1 */ "={r9}"(result[1]),
              /* $2 */ "={r10}"(result[2]), /* $3 */ "={r11}"(result[3]),
              /* $4 */ "={r12}"(result[4]), /* $5 */ "={r13}"(result[5]),
              /* $6 */ "={r14}"(result[6]), /* $7 */ "={r15}"(result[7])
            : /* $8 */  "m"(self_t[0]),  /* $9 */  "m"(self_t[1]),
              /* $10 */ "m"(self_t[2]),  /* $11 */ "m"(self_t[3]),
              /* $12 */ "m"(other_t[0]), /* $13 */ "m"(other_t[1]),
              /* $14 */ "m"(other_t[2]), /* $15 */ "m"(other_t[3])
            : "rax", "rdx"
            :
            );
        }
        U512(result)
    }
}
```
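Both the hand-written assembly above and the Rust versions compute the same schoolbook product over 64-bit limbs. Writing the inputs as sums of limbs, the full 512-bit result is just sixteen 128-bit partial products shifted into place:

$$
a \cdot b = \left(\sum_{j=0}^{3} a_j\,2^{64 j}\right)\left(\sum_{i=0}^{3} b_i\,2^{64 i}\right) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_j\, b_i\, 2^{64 (i + j)}
$$

Each partial product \(a_j b_i\) fits in 128 bits, which is exactly what `mulq` (and `a as u128 * b as u128`) produces; the rest of the work is adding the pieces together with carry propagation.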
Here is the disassembly of this version, with annotations:

```
bigint::U256::full_mul:
    /// The function prologue generated by Rust: save the callee-saved
    /// registers and reserve stack space.
    pushq %r15
    pushq %r14
    pushq %r13
    pushq %r12
    subq $0x40, %rsp
    /// ...
    /// Load the eight 64-bit input limbs from the caller's stack frame
    /// into registers...
    movq 0x68(%rsp), %rax
    movq 0x70(%rsp), %rcx
    movq 0x78(%rsp), %rdx
    movq 0x80(%rsp), %rsi
    movq 0x88(%rsp), %r8
    movq 0x90(%rsp), %r9
    movq 0x98(%rsp), %r10
    movq 0xa0(%rsp), %r11
    /// ...
    /// ...and then, because we asked the `asm!` block for memory operands
    /// ("m" constraints), Rust immediately spills them all back onto the
    /// stack. First the limbs of `self`...
    movq %rax, 0x38(%rsp)
    movq %rcx, 0x30(%rsp)
    movq %rdx, 0x28(%rsp)
    movq %rsi, 0x20(%rsp)
    /// ...and then the limbs of `other`. This shuffling is pure overhead
    /// caused by the constraints we chose.
    movq %r8, 0x18(%rsp)
    movq %r9, 0x10(%rsp)
    movq %r10, 0x8(%rsp)
    movq %r11, (%rsp)
    /// From here on it is our hand-written assembly. There is no loop:
    /// the nested
    ///
    /// for i in 0..U256_SIZE {
    ///     for j in 0..U256_SIZE {
    ///         /* Loop body */
    ///     }
    /// }
    ///
    /// has been unrolled by hand. `mulq` always multiplies by "%rax",
    /// so every step first has to move one operand into `%rax`;
    /// the `asm!` macro hides this kind of detail from you.
    movq 0x38(%rsp), %rax
    /// Multiply by the first limb of `other`, read from memory.
    /// Memory operands are slower than register operands, which is
    /// another cost of the "m" constraints.
    mulq 0x18(%rsp)
    /// `mulq` multiplies its 64-bit operand by `%rax`, leaving the low
    /// 64 bits of the product in `%rax` and the high 64 bits in `%rdx`.
    /// Store them into `%r8` (the lowest 64 bits of the 512-bit result)
    /// and `%r9` (the next 64 bits).
    movq %rax, %r8
    movq %rdx, %r9
    /// `i = 0, j = 1`
    movq 0x38(%rsp), %rax
    mulq 0x10(%rsp)
    /// Add the low half of this partial product to the result limb
    /// accumulated so far.
    addq %rax, %r9
    /// If that addition overflowed, the CPU set the carry flag; adding 0
    /// with `adc` (add-with-carry) then adds 1 to `%rdx`, propagating
    /// the carry into the high half.
    adcq $0x0, %rdx
    /// The high 64 bits (plus any carry) become the next result limb.
    movq %rdx, %r10
    /// The same for `j = 2` and `j = 3`
    movq 0x38(%rsp), %rax
    mulq 0x8(%rsp)
    addq %rax, %r10
    adcq $0x0, %rdx
    movq %rdx, %r11
    movq 0x38(%rsp), %rax
    mulq (%rsp)
    addq %rax, %r11
    adcq $0x0, %rdx
    movq %rdx, %r12
    /// And now the rows for `i = 1`, `i = 2` and `i = 3`
    movq 0x30(%rsp), %rax
    mulq 0x18(%rsp)
    addq %rax, %r9
    adcq %rdx, %r10
    adcq $0x0, %r11
    adcq $0x0, %r12
    /// The `xor` zeroes `%r13` (and, below, the remaining result
    /// registers). Without this they would contain whatever garbage the
    /// calling function left in them, which is exactly the "sometimes
    /// fails, producing seemingly random values" trap described in the
    /// text.
    xorq %r13, %r13
    adcq $0x0, %r13
    xorq %r14, %r14
    adcq $0x0, %r14
    xorq %r15, %r15
    adcq $0x0, %r15
    movq 0x30(%rsp), %rax
    mulq 0x10(%rsp)
    addq %rax, %r10
    adcq %rdx, %r11
    adcq $0x0, %r12
    adcq $0x0, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x30(%rsp), %rax
    mulq 0x8(%rsp)
    addq %rax, %r11
    adcq %rdx, %r12
    adcq $0x0, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x30(%rsp), %rax
    mulq (%rsp)
    addq %rax, %r12
    adcq %rdx, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x28(%rsp), %rax
    mulq 0x18(%rsp)
    addq %rax, %r10
    adcq %rdx, %r11
    adcq $0x0, %r12
    adcq $0x0, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x28(%rsp), %rax
    mulq 0x10(%rsp)
    addq %rax, %r11
    adcq %rdx, %r12
    adcq $0x0, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x28(%rsp), %rax
    mulq 0x8(%rsp)
    addq %rax, %r12
    adcq %rdx, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x28(%rsp), %rax
    mulq (%rsp)
    addq %rax, %r13
    adcq %rdx, %r14
    adcq $0x0, %r15
    movq 0x20(%rsp), %rax
    mulq 0x18(%rsp)
    addq %rax, %r11
    adcq %rdx, %r12
    adcq $0x0, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x20(%rsp), %rax
    mulq 0x10(%rsp)
    addq %rax, %r12
    adcq %rdx, %r13
    adcq $0x0, %r14
    adcq $0x0, %r15
    movq 0x20(%rsp), %rax
    mulq 0x8(%rsp)
    addq %rax, %r13
    adcq %rdx, %r14
    adcq $0x0, %r15
    movq 0x20(%rsp), %rax
    mulq (%rsp)
    addq %rax, %r14
    adcq %rdx, %r15
    /// Finally, write the eight result limbs to the output and restore
    /// the saved registers.
    movq %r8, (%rdi)
    movq %r9, 0x8(%rdi)
    movq %r10, 0x10(%rdi)
    movq %r11, 0x18(%rdi)
    movq %r12, 0x20(%rdi)
    movq %r13, 0x28(%rdi)
    movq %r14, 0x30(%rdi)
    movq %r15, 0x38(%rdi)
    movq %rdi, %rax
    addq $0x40, %rsp
    popq %r12
    popq %r13
    popq %r14
    popq %r15
    retq
```
The `asm!` macro hides a lot of details. Essentially, you tell the compiler where you want each input, and it splices them into your assembly template with string substitution. The compiler keeps everything in registers, but we then tell it to put the input arrays into memory (the `"m"` constraints on the input parameters), so it spills everything back to memory only to load it again inside the asm block. There is a way to optimize this code, but it is hard even for an experienced professional, and code like this is error-prone: if you forget to zero the output registers with the series of `xor` instructions, the code will fail sometimes, but not always, producing seemingly random values that depend on the internal state of the calling function. It could probably be sped up by replacing `"m"` with `"r"` (I did not test this, because I only discovered it when, for this article, I started investigating why the old assembly code is so much slower), but that is far from obvious from looking at the source; only a specialist with deep knowledge of LLVM's inline-assembly syntax would spot it immediately.
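To illustrate the difference, here is a tiny example of the two constraint kinds in the old, constraint-based `asm!` syntax from the article's era (it only compiled on an old nightly toolchain and has since been replaced); this is my own toy example, not code from the article or from bigint:

```rust
#![feature(asm)] // old LLVM-style asm!, as used at the time of the article

fn add_one_reg(x: u64) -> u64 {
    let out: u64;
    unsafe {
        // "r": the compiler picks a register for `x`, so the instruction
        // operates on registers directly.
        asm!("leaq 1($1), $0" : "=r"(out) : "r"(x));
    }
    out
}

fn add_one_mem(x: u64) -> u64 {
    let out: u64;
    unsafe {
        // "m": the compiler must place `x` in memory first, so the code
        // reads its operand through a load, just like the `mulq`s above.
        asm!("movq $1, $0\n\taddq $$1, $0" : "=r"(out) : "m"(x) : "cc");
    }
    out
}
```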
The `u128` version, by contrast, looks completely understandable: what you write is what you get. Even if you were not trying to optimize, you would probably write something very much like it as the simplest solution. At the same time, the code LLVM generates for it is of very high quality. You may notice that it is not too different from the hand-written assembly, but it avoids some of its problems (see the comments below) and includes a couple of optimizations I would not even have thought of. I could not find any significant optimization that it missed.

```
bigint::U256::full_mul:
    /// The prologue: save callee-saved registers and reserve stack space.
    pushq %rbp
    movq %rsp, %rbp
    pushq %r15
    pushq %r14
    pushq %r13
    pushq %r12
    pushq %rbx
    subq $0x48, %rsp
    movq 0x10(%rbp), %r11
    movq 0x18(%rbp), %rsi
    movq %rsi, -0x38(%rbp)
    /// Note that the operand is not loaded straight into `%rax` (which
    /// would be `movq 0x30(%rbp), %rax`); it goes into `%rcx` and is only
    /// copied into `%rax` for the `mulq`. That is because the same limb is
    /// needed again for the following multiplications, so keeping it in
    /// `%rcx` avoids reloading it from memory.
    movq 0x30(%rbp), %rcx
    movq %rcx, %rax
    /// LLVM keeps the second operand in a register as well, so the
    /// multiplication never has to touch memory.
    mulq %r11
    /// LLVM keeps the high half (`%rdx`) in a register because it will be
    /// needed again very soon, but spills the low half (`%rax`): it is
    /// already the final lowest limb of the result and will not be needed
    /// until the very end, so holding a register for it would be a waste.
    movq %rdx, %r9
    movq %rax, -0x70(%rbp)
    movq %rcx, %rax
    mulq %rsi
    movq %rax, %rbx
    movq %rdx, %r8
    movq 0x20(%rbp), %rsi
    movq %rcx, %rax
    mulq %rsi
    /// This limb is kept in `%r13` for an upcoming multiplication and also
    /// saved on the stack, because `%r13` itself will be reused later.
    movq %rsi, %r13
    movq %r13, -0x40(%rbp)
    /// As before: partial results that are needed soon stay in registers,
    /// the rest go to the stack. LLVM tracks all of this bookkeeping.
    movq %rax, %r10
    movq %rdx, %r14
    movq 0x28(%rbp), %rdx
    movq %rdx, -0x48(%rbp)
    movq %rcx, %rax
    mulq %rdx
    movq %rax, %r12
    movq %rdx, -0x58(%rbp)
    movq 0x38(%rbp), %r15
    movq %r15, %rax
    mulq %r11
    addq %r9, %rbx
    adcq %r8, %r10
    /// Save the CPU flags into `%rcx`.
    pushfq
    popq %rcx
    addq %rax, %rbx
    movq %rbx, -0x68(%rbp)
    adcq %rdx, %r10
    /// ...and the flags from this addition into `%r8`.
    pushfq
    popq %r8
    /// Here LLVM restores the flags it stashed in `%rcx`, so that the
    /// following `adcq` continues the carry chain of the earlier addition.
    /// Saving and restoring the whole flags register (the `popq %rcx`
    /// above paired with this `pushq %rcx`/`popfq`) lets it interleave
    /// several independent carry chains instead of finishing one before
    /// starting the next.
    pushq %rcx
    popfq
    adcq %r14, %r12
    pushfq
    popq %rax
    movq %rax, -0x50(%rbp)
    movq %r15, %rax
    movq -0x38(%rbp), %rsi
    mulq %rsi
    movq %rdx, %rbx
    movq %rax, %r9
    addq %r10, %r9
    adcq $0x0, %rbx
    pushq %r8
    popfq
    adcq $0x0, %rbx
    /// `setb` materializes the carry flag as a byte: it writes 1 to its
    /// operand if the carry is set and 0 otherwise. Here it writes
    /// straight to the stack, no separate `mov` required; one of those
    /// tricks I would not have thought of.
    setb -0x29(%rbp)
    addq %r12, %rbx
    setb %r10b
    movq %r15, %rax
    mulq %r13
    movq %rax, %r12
    movq %rdx, %r8
    movq 0x40(%rbp), %r14
    movq %r14, %rax
    mulq %r11
    movq %rdx, %r13
    movq %rax, %rcx
    movq %r14, %rax
    mulq %rsi
    movq %rdx, %rsi
    addq %r9, %rcx
    movq %rcx, -0x60(%rbp)
    /// Add `%r12` and `%rbx` into `%rcx` without disturbing the flags:
    /// `leaq` computes an address, essentially `&((void*)first)[second]`,
    /// which is just `first + second` in C, so it doubles as a
    /// three-operand add that leaves the carry flag untouched.
    leaq (%r12,%rbx), %rcx
    /// From here on the same patterns repeat for the remaining partial
    /// products.
    adcq %rcx, %r13
    pushfq
    popq %rcx
    addq %rax, %r13
    adcq $0x0, %rsi
    pushq %rcx
    popfq
    adcq $0x0, %rsi
    setb -0x2a(%rbp)
    orb -0x29(%rbp), %r10b
    addq %r12, %rbx
    movzbl %r10b, %ebx
    adcq %r8, %rbx
    setb %al
    movq -0x50(%rbp), %rcx
    pushq %rcx
    popfq
    adcq -0x58(%rbp), %rbx
    setb %r8b
    orb %al, %r8b
    movq %r15, %rax
    mulq -0x48(%rbp)
    movq %rdx, %r12
    movq %rax, %rcx
    addq %rbx, %rcx
    movzbl %r8b, %eax
    adcq %rax, %r12
    addq %rsi, %rcx
    setb %r10b
    movq %r14, %rax
    mulq -0x40(%rbp)
    movq %rax, %r8
    movq %rdx, %rsi
    movq 0x48(%rbp), %r15
    movq %r15, %rax
    mulq %r11
    movq %rdx, %r9
    movq %rax, %r11
    movq %r15, %rax
    mulq -0x38(%rbp)
    movq %rdx, %rbx
    addq %r13, %r11
    leaq (%r8,%rcx), %rdx
    adcq %rdx, %r9
    pushfq
    popq %rdx
    addq %rax, %r9
    adcq $0x0, %rbx
    pushq %rdx
    popfq
    adcq $0x0, %rbx
    setb %r13b
    orb -0x2a(%rbp), %r10b
    addq %r8, %rcx
    movzbl %r10b, %ecx
    adcq %rsi, %rcx
    setb %al
    addq %r12, %rcx
    setb %r8b
    orb %al, %r8b
    movq %r14, %rax
    movq -0x48(%rbp), %r14
    mulq %r14
    movq %rdx, %r10
    movq %rax, %rsi
    addq %rcx, %rsi
    movzbl %r8b, %eax
    adcq %rax, %r10
    addq %rbx, %rsi
    setb %cl
    orb %r13b, %cl
    movq %r15, %rax
    mulq -0x40(%rbp)
    movq %rdx, %rbx
    movq %rax, %r8
    addq %rsi, %r8
    movzbl %cl, %eax
    adcq %rax, %rbx
    setb %al
    addq %r10, %rbx
    setb %cl
    orb %al, %cl
    movq %r15, %rax
    mulq %r14
    addq %rbx, %rax
    movzbl %cl, %ecx
    adcq %rcx, %rdx
    /// Write the eight result limbs to the output and restore the saved
    /// registers.
    movq -0x70(%rbp), %rcx
    movq %rcx, (%rdi)
    movq -0x68(%rbp), %rcx
    movq %rcx, 0x8(%rdi)
    movq -0x60(%rbp), %rcx
    movq %rcx, 0x10(%rdi)
    movq %r11, 0x18(%rdi)
    movq %r9, 0x20(%rdi)
    movq %r8, 0x28(%rdi)
    movq %rax, 0x30(%rdi)
    movq %rdx, 0x38(%rdi)
    movq %rdi, %rax
    addq $0x48, %rsp
    popq %rbx
    popq %r12
    popq %r13
    popq %r14
    popq %r15
    popq %rbp
    retq
```
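The `adc`/`setb` chains above are the machine-level counterpart of the `overflowing_add` pattern in the Rust source. The same idiom works for the other arithmetic operations too; for example, a 256-bit addition over four little-endian limbs could be sketched like this (a generic illustration of the technique, not the crate's actual code):

```rust
/// Add two 256-bit numbers represented as four little-endian u64 limbs,
/// returning the 256-bit sum and a final carry bit.
/// A generic sketch of limb-wise carry propagation, not bigint's actual code.
fn add_256(a: [u64; 4], b: [u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        // Each limb addition can carry into the next one, just like `adc`.
        let (sum, c1) = a[i].overflowing_add(b[i]);
        let (sum, c2) = sum.overflowing_add(carry as u64);
        out[i] = sum;
        carry = c1 || c2;
    }
    (out, carry)
}
```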
Addition and subtraction, meanwhile, still use plain 64-bit operations (`u64::checked_add`/`checked_sub`), although who knows, maybe in the future we will move them to 128-bit arithmetic as well and squeeze out even more speed.

Source: https://habr.com/ru/post/358974/