Description
This is an enhancement request. While u16, u32 and u64 numbers get divided by a small constant divisor using a fast algorithm, the same isn't true for u128 numbers:
fn digits_sum0(mut n: u16) -> u16 {
    let mut total = 0;
    while n != 0 {
        total += n % 10;
        n /= 10;
    }
    total
}

fn digits_sum1(mut n: u32) -> u32 {
    let mut total = 0;
    while n != 0 {
        total += n % 10;
        n /= 10;
    }
    total
}

fn digits_sum2(mut n: u64) -> u64 {
    let mut total = 0;
    while n != 0 {
        total += n % 10;
        n /= 10;
    }
    total
}

fn digits_sum3(mut n: u128) -> u128 {
    let mut total = 0;
    while n != 0 {
        total += n % 10;
        n /= 10;
    }
    total
}
Generated asm (with -O):
digits_sum0:
xor eax, eax
test di, di
je .LBB0_2
.LBB0_1:
movzx ecx, di
imul edx, ecx, 52429
shr edx, 19
lea esi, [rdx + rdx]
lea esi, [rsi + 4*rsi]
sub edi, esi
add eax, edi
mov edi, edx
cmp ecx, 10
jae .LBB0_1
.LBB0_2:
ret
digits_sum1:
xor eax, eax
test edi, edi
je .LBB0_3
mov r8d, 3435973837
.LBB0_2:
mov edx, edi
imul rdx, r8
shr rdx, 35
lea esi, [rdx + rdx]
lea esi, [rsi + 4*rsi]
mov ecx, edi
sub ecx, esi
add eax, ecx
cmp edi, 10
mov edi, edx
jae .LBB0_2
.LBB0_3:
ret
digits_sum2:
xor ecx, ecx
test rdi, rdi
je .LBB1_3
movabs r8, -3689348814741910323
.LBB1_2:
mov rax, rdi
mul r8
shr rdx, 3
lea rax, [rdx + rdx]
lea rax, [rax + 4*rax]
mov rsi, rdi
sub rsi, rax
add rcx, rsi
cmp rdi, 10
mov rdi, rdx
jae .LBB1_2
.LBB1_3:
mov rax, rcx
ret
digits_sum3:
push r15
push r14
push r13
push r12
push rbx
mov rax, rdi
or rax, rsi
je .LBB2_1
mov rbx, rsi
mov r15, rdi
xor r14d, r14d
mov r13d, 10
xor r12d, r12d
.LBB2_4:
mov edx, 10
xor ecx, ecx
mov rdi, r15
mov rsi, rbx
call __udivti3@PLT
mov rcx, rax
mov rsi, rdx
mul r13
lea rdi, [rsi + 4*rsi]
lea rdx, [rdx + 2*rdi]
mov rdi, r15
sub rdi, rax
mov rax, rbx
sbb rax, rdx
add r14, rdi
adc r12, rax
cmp r15, 10
sbb rbx, 0
mov r15, rcx
mov rbx, rsi
jae .LBB2_4
jmp .LBB2_2
.LBB2_1:
xor r14d, r14d
xor r12d, r12d
.LBB2_2:
mov rax, r14
mov rdx, r12
pop rbx
pop r12
pop r13
pop r14
pop r15
ret
The faster algorithm is short enough that it could be added to rustc:
http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html
http://ridiculousfish.com/blog/posts/labor-of-division-episode-ii.html
http://libdivide.com/
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
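For reference, a minimal sketch in Rust of the transform for the u64 case, mirroring the 0xCCCCCCCCCCCCCCCD multiply and shift in the digits_sum2 asm above (div10_u64 is just an illustrative name, not anything the compiler emits):

fn div10_u64(n: u64) -> u64 {
    // 0xCCCC_CCCC_CCCC_CCCD == ceil(2^67 / 10); the quotient is the 128-bit
    // product shifted down by 67 (take the high half, then shift by 3 more).
    const MAGIC: u128 = 0xCCCC_CCCC_CCCC_CCCD;
    ((n as u128 * MAGIC) >> 67) as u64
}

fn main() {
    for n in [0u64, 9, 10, 12_345_678_901_234_567_890, u64::MAX] {
        assert_eq!(div10_u64(n), n / 10);
    }
}

The rewrite is exact because ceil(2^67 / 10) * 10 overshoots 2^67 by only 2, well within the slack absorbed by the final shift; this issue asks for the same trick one width up, for u128.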
Activity
nagisa commented on Oct 6, 2018
No point in doing it in rustc, if it should be done in LLVM.
leonardo-m commented on Oct 8, 2018
See also: https://crates.io/crates/specialized-div-rem
scottmcm commented on Oct 11, 2018
Do you have benchmarks showing that the multiply algorithm is actually faster for u128? Note that the 32-bit one is already using a 64-bit multiply (two-argument form, with r registers). So it's possible that the u128 version would need a 256-bit multiply, and it isn't obvious to me whether that, and the corresponding fixups, would be faster overall than __udivti3.
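(For reference, a sketch of the 32-bit reduction being referred to, written out in Rust; 0xCCCC_CCCD is ceil(2^35 / 10), i.e. the 3435973837 constant and the shift by 35 in digits_sum1 above. div10_u32 is just an illustrative name.)

fn div10_u32(n: u32) -> u32 {
    // One 32x32 -> 64 multiply plus a shift; divisor 10 needs no extra fixup.
    const MAGIC: u64 = 0xCCCC_CCCD;
    ((n as u64 * MAGIC) >> 35) as u32
}

fn main() {
    for n in [0u32, 9, 10, u32::MAX] {
        assert_eq!(div10_u32(n), n / 10);
    }
}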
leonardo-m commented on Oct 11, 2018
I have no proof, but this is worth exploring. The specialized-div-rem crate shows that we could go much faster than the current div and rem operations.
nagisa commented on Oct 11, 2018
@scottmcm I’m fairly sure that it would need at most 3 multiplications and a few additions to calculate the value necessary for shifting.
AaronKutch commented on Sep 13, 2020
The fact that LLVM cannot reduce the division to multiplications by a constant might be caused by #44545, or it might be an independent issue. I extracted the small divisor path from my algorithms, and the compiler was able to reduce the divisions to multiplications by a constant.
I have checked for correctness with my fuzz tester.
nagisa commented on Sep 13, 2020
I investigated this recently for #76017 (comment).
It is a bug in LLVM, and it isn’t like it is impossible for it to strength-reduce; it’s just that doing so:
a) requires the upper-half of the multiplication result (i.e. for 128-bit multiplication it requires the upper 128-bits of a 256-bit result); and
b) calculating that cannot be easily made less conservative because there are a couple of bad backends in LLVM – namely RISCV and 32-bit ARM.
I think I know an alternative way to resolve it, though, just haven’t had the time to get back to it.
Turns out it isn’t on 32-bit targets!
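(For reference, a sketch of what the strength-reduced u128 / 10 could look like if the upper 128 bits of the 256-bit product are assembled from 64x64 -> 128 partial products. mulhi_u128 and div10_u128 are illustrative names, not anything LLVM or libcore exposes; MAGIC is ceil(2^131 / 10), the 128-bit analogue of the 0xCCCC...CCCD constants above.)

// Upper 128 bits of the full 256-bit product a * b.
fn mulhi_u128(a: u128, b: u128) -> u128 {
    const LO: u128 = (1u128 << 64) - 1;
    let (a_lo, a_hi) = (a & LO, a >> 64);
    let (b_lo, b_hi) = (b & LO, b >> 64);

    let lo_lo = a_lo * b_lo; // only its carry into the middle column matters
    let hi_lo = a_hi * b_lo;
    let lo_hi = a_lo * b_hi;
    let hi_hi = a_hi * b_hi;

    // Sum everything landing in bits 64..128, then carry into the high half.
    let mid = (lo_lo >> 64) + (hi_lo & LO) + (lo_hi & LO);
    hi_hi + (hi_lo >> 64) + (lo_hi >> 64) + (mid >> 64)
}

fn div10_u128(n: u128) -> u128 {
    const MAGIC: u128 = 0xCCCC_CCCC_CCCC_CCCC_CCCC_CCCC_CCCC_CCCD;
    mulhi_u128(n, MAGIC) >> 3
}

fn main() {
    for n in [0u128, 9, 10, u64::MAX as u128, u128::MAX] {
        assert_eq!(div10_u128(n), n / 10);
    }
}

Whether this beats __udivti3 in the digit-sum loop above is exactly the benchmark question raised earlier; the sketch only shows that the arithmetic fits in ordinary 128-bit operations.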
nagisa commented on Sep 19, 2020
I’ve drafted an LLVM differential to fix this: https://reviews.llvm.org/D87976.
leonardo-m commented on Mar 8, 2021
In the meantime GCC had implemented it:
Rustc 1.52.0-nightly (152f660 2021-02-17):
GCC trunk 11.0.1 20210307 (experimental):
nagisa commented on Mar 8, 2021
Yes, the primary couple of concerns I've seen blocking the LLVM diff are:
and the regression on codegen quality in certain corner cases.
The former can probably be resolved by some sort of a target property. The latter is harder, I suspect.
est31 commented on Apr 8, 2023
Btw, clang is also able to optimize this for C (godbolt):
gives:
Still an issue for Rust though.
cc also #103126, where the libcore size reductions from that PR are probably due to this issue (see also this zulip discussion).
nikic commented on Apr 8, 2023
[citation needed]
This should be fixed with the LLVM 16 update in nightly, and as far as I can tell it is: https://rust.godbolt.org/z/E85djEnTY