Closed
Description
Feature gate: #![feature(isqrt)]
This is a tracking issue for the functions {u8,u16,u32,u64,u128,usize}::isqrt and {i8,i16,i32,i64,i128,isize}::{isqrt,checked_isqrt}, which compute the integer square root, addressing issue #89273.
Public API
For every suffix N among 8, 16, 32, 64, 128, and size, the feature isqrt introduces the methods
const fn uN::isqrt() -> uN;
const fn iN::isqrt() -> iN;
const fn iN::checked_isqrt() -> Option<iN>;
Expand to see the full API
const fn u8::isqrt() -> u8;
const fn i8::isqrt() -> i8;
const fn i8::checked_isqrt() -> Option<i8>;
const fn u16::isqrt() -> u16;
const fn i16::isqrt() -> i16;
const fn i16::checked_isqrt() -> Option<i16>;
const fn u32::isqrt() -> u32;
const fn i32::isqrt() -> i32;
const fn i32::checked_isqrt() -> Option<i32>;
const fn u64::isqrt() -> u64;
const fn i64::isqrt() -> i64;
const fn i64::checked_isqrt() -> Option<i64>;
const fn u128::isqrt() -> u128;
const fn i128::isqrt() -> i128;
const fn i128::checked_isqrt() -> Option<i128>;
const fn usize::isqrt() -> usize;
const fn isize::isqrt() -> isize;
const fn isize::checked_isqrt() -> Option<isize>;
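For illustration, a minimal usage sketch, assuming a nightly toolchain with the feature gate above (the panic on negative signed input mirrors the ilog methods and is discussed in the activity below):

```rust
#![feature(isqrt)]

fn main() {
    // Unsigned: isqrt always succeeds and rounds down.
    assert_eq!(10u32.isqrt(), 3);
    assert_eq!(u64::MAX.isqrt(), u32::MAX as u64);

    // Signed: checked_isqrt returns None for negative inputs...
    assert_eq!(17i32.isqrt(), 4);
    assert_eq!((-4i32).checked_isqrt(), None);
    // ...while isqrt on a negative value panics, mirroring the ilog
    // methods (see the discussion further down in this thread).
}
```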
Steps / History
- Final comment period (FCP)
- Stabilization PR
Unresolved Questions
- None yet.
Activity
ChaiTRex commented on Sep 29, 2023
Once #116176 is merged, I would like to submit improved tests and compiler optimization hints for signed integers with tighter upper bounds than with the corresponding unsigned integers.
Separately, I'm going to try to use f32 and f64 to speed up isqrt, as I think that, up to 64-bit integers, the square root function on those might be faster and any errors in the outputs can be adjusted for. This would have the downsides of making the functions non-const and multiplying the written implementations, as doing this effectively would require different code for different bit sizes (for example, u32 could use f64's 53-bit mantissa with no errors, while u64 would need to use f64 with an error adjustment).
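A rough sketch of that idea (illustrative only, not the benchmarked code): a u32 converts to f64 exactly and the truncated hardware result is already correct, while a u64 may lose precision in the conversion and needs an integer correction afterwards.

```rust
// Every u32 is exactly representable in f64's 53-bit mantissa, so the
// truncated hardware square root is already the integer square root.
fn isqrt_u32_via_f64(n: u32) -> u32 {
    (n as f64).sqrt() as u32
}

// A u64 may not convert to f64 exactly, so the candidate can be off by a
// small amount in either direction and is corrected with integer arithmetic.
fn isqrt_u64_via_f64(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    // Step down while r is too large (or r * r would overflow).
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    // Step up while the next candidate still fits.
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}
```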
ChaiTRex commented on Sep 30, 2023
In a benchmarking repository, I've implemented some methods based on floating-point instructions. I get the following speed improvements with AMD Ryzen 9 5900X:
Speed of original and floating-point methods
Edit: Rebenchmarked because I used the Karatsuba square root algorithm for 128-bit integers to get a major speedup.
Cross-referenced: cfg(target_feature) support for hardware floating point detection (#64514)

ryanavella commented on Jan 17, 2024
@ChaiTRex I noticed in your benchmarking repository that you've commented out (what I presume was previously) a lookup table implementation for u8 and i8; how did that perform?

I benchmarked a few naive lookup tables, and at least on my machine (ThinkPad T490, i5-8365U) the results look comparable, arguably even competitive.
Benchmarks
Here are my unsigned implementations for what I'm currently calling lut16, lut32, lut64, and lut256, which offer different tradeoffs between table size and branchiness:
(note: for the signed benchmarks I delegated to the unsigned implementation, on the assumption that we don't want to bloat up binaries with redundant lookup tables)
lut16
lut32
lut64
lut256
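As a rough illustration of the general shape (not the actual code behind the collapsed blocks above), a lut256-style implementation for u8 could look like this, with i8 delegating to the same table:

```rust
// Illustrative sketch of a lut256-style implementation: a full 256-entry
// table mapping every u8 to its integer square root, built at compile time.
const ISQRT_U8: [u8; 256] = {
    let mut table = [0u8; 256];
    let mut n = 0usize;
    while n < 256 {
        // Largest r with r * r <= n.
        let mut r = 0usize;
        while (r + 1) * (r + 1) <= n {
            r += 1;
        }
        table[n] = r as u8;
        n += 1;
    }
    table
};

fn isqrt_u8(n: u8) -> u8 {
    // One memory access, no branches.
    ISQRT_U8[n as usize]
}

fn checked_isqrt_i8(n: i8) -> Option<i8> {
    // Delegate to the unsigned table to avoid a redundant signed table.
    if n < 0 {
        None
    } else {
        Some(ISQRT_U8[n as usize] as i8)
    }
}
```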
ChaiTRex commented on Jan 25, 2024
@ryanavella

I didn't include either the table or libgmp versions because of copyright concerns. table uses fred_sqrt from part 3c of Paul Hsieh's Square Root page (note that the 256-entry u8 table there contains fixed-point square roots where the integer part is the most significant four bits). This table is then used for u32's isqrt, with fred_sqrt edited by me to use its repetitiveness to significantly reduce the number of branches.

Benchmarks, including commented out implementations

It should be noted that libgmp was somewhat faster than table last time I checked, but now they're essentially equivalent.

Since the concern is reducing the amount of memory used, another consideration is code size:
lut16
lut32
lut64
lut256
Since each value in your tables takes up four bits, two of them can fit in a u8, so maybe that can help as well.

It should be noted that lut256 can probably be inlined into one memory access with a small computation of the address.
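A quick sketch of the nibble-packing idea mentioned above (again only illustrative):

```rust
// Two 4-bit square roots packed per byte: the table for u8 shrinks from 256
// bytes to 128 bytes at the cost of a shift and a mask on lookup.
const PACKED_ISQRT_U8: [u8; 128] = {
    let mut table = [0u8; 128];
    let mut n = 0usize;
    while n < 256 {
        // Largest r with r * r <= n; always fits in four bits for a u8 input.
        let mut r = 0usize;
        while (r + 1) * (r + 1) <= n {
            r += 1;
        }
        // Even inputs go in the low nibble, odd inputs in the high nibble.
        table[n / 2] |= (r as u8) << ((n % 2) * 4);
        n += 1;
    }
    table
};

fn isqrt_u8_packed(n: u8) -> u8 {
    (PACKED_ISQRT_U8[(n / 2) as usize] >> ((n % 2) * 4)) & 0x0F
}
```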
tfpf commented on Feb 4, 2024
Is there an estimate on when this might become available in stable?
ryanavella commented on Feb 4, 2024
I'm not sure if these are blocking stabilization necessarily, but these are some outstanding questions that I'd like to see answered:

- whether a panicking API is acceptable here, or whether users should instead have to opt in explicitly (e.g. x.checked_sqrt().unwrap())
- whether floating point should be used to speed things up (blocked on cfg(target_feature) support for hardware floating point detection, #64514)

Also, we could always use more testers and more benchmarks.
tfpf commented on Feb 16, 2024
My two cents, as a relatively new user (started using Rust just a year ago), who is absolutely not an expert in compiler/library design.
- x.ilog2(), which has been stabilised, panics if x <= 0. (I assume that's not the same as 'having a panicking API'?) Whatever is true of it should also be true of x.isqrt() when x < 0.
- On lookup tables, it may be worth checking whether musl has an integer square root or uses lookup tables for anything similar.

ryanavella commented on Feb 16, 2024
Maybe I'm reading the musl sources wrong, but I don't see an implementation for integer square root? I did find this lookup table used for (floating-point) sqrt. Keep in mind that this implementation is probably only relevant to platforms without hardware floating point.

tfpf commented on Feb 16, 2024
Nice find; I didn’t check thoroughly. Yes, they don’t have an integer square root function, but I was trying to see whether they used any lookup tables at all, since musl is a relatively new library (compared to how old C is) comparable in age to Rust. (Probably not fair to compare a library and a language, though.)
ChaiTRex commented on Feb 18, 2024
@ryanavella With regard to your questions:

I agree with what @tfpf said about maintaining consistency with the behavior of the ilog functions.

I think that, for a first release, it's not necessary to have floating point support, as a decently fast, non-FP (because some targets have no FP at all or slow FP square root) method for u8 and the Karatsuba method for everything else will be pretty fast.

As far as a proof, a rough one is available in the 34th comment of this linked discussion.
If it matters, I also looked into the rounding mode used and found that Rust uses and requires the default rounding mode and that the default rounding mode for IEEE-754 is "round-to-nearest-ties-to-even".
It would be nice if the s and z optimization settings were available as a cfg flag so that, just by the developer choosing in the standard way whether to optimize for speed or size, the isqrt implementations would then be able to use that to choose between the smallest code+table size or the fastest available implementation.

That and FP-detection may make thoroughly testing it difficult, however.
I'd caution that licensing issues should be considered before using, or starting from and modifying, musl's code, so that the two licenses for the Rust standard library (including who's given credit for the code) remain all that's needed to use the code.
ChaiTRex commented on Feb 18, 2024
In my last comment in this thread, I pointed out a discussion I had started on internals.rust-lang.org.
I also started a discussion on the Rust Zulip, where I was informed that core::intrinsics::const_eval_select can be used to take two implementations of isqrt, one const fn and a faster one that can't be const fn (like here with the floating point operations), and meld them into a combined const fn that uses the fast version except in a const context.

The const fn version could use a large table to speed itself up without affecting the executable, as that version is only used at compile time.
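A schematic sketch of how those pieces could fit together, assuming a recent nightly toolchain (const_eval_select is an internal intrinsic whose feature gates and safety requirements have shifted between versions, so the details below may need adjusting):

```rust
#![feature(core_intrinsics, const_eval_select)]
#![allow(internal_features)]

use core::intrinsics::const_eval_select;

// const-compatible, integer-only digit-by-digit (binary) square root.
const fn isqrt_compiletime(n: u64) -> u64 {
    let mut x = n;
    let mut result = 0u64;
    let mut bit = 1u64 << 62; // highest power of four representable in u64
    while bit > n {
        bit >>= 2;
    }
    while bit != 0 {
        if x >= result + bit {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    result
}

// Faster runtime-only version using the hardware f64 square root plus an
// integer correction (see the floating-point sketch earlier in the thread).
fn isqrt_runtime(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}

// The melded const fn: the portable version in const contexts, the fast
// version everywhere else. Both must agree on every input.
pub const fn isqrt(n: u64) -> u64 {
    // Some nightlies require an `unsafe` block around this call.
    const_eval_select((n,), isqrt_compiletime, isqrt_runtime)
}

fn main() {
    const AT_COMPILE_TIME: u64 = isqrt(1 << 40); // goes through isqrt_compiletime
    assert_eq!(AT_COMPILE_TIME, 1 << 20);
    assert_eq!(isqrt(99), 9); // goes through isqrt_runtime
}
```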