Closed
Description
Feature gate: #![feature(isqrt)]
This is a tracking issue for the functions {u8,u16,u32,u64,u128,usize}::isqrt and {i8,i16,i32,i64,i128,isize}::{isqrt,checked_isqrt}, which compute the integer square root, addressing issue #89273.
Public API
For every suffix N among 8, 16, 32, 64, 128, and size, the feature isqrt introduces the methods
const fn uN::isqrt() -> uN;
const fn iN::isqrt() -> iN;
const fn iN::checked_isqrt() -> Option<iN>;
Expand to see the full API
const fn u8::isqrt() -> u8;
const fn i8::isqrt() -> i8;
const fn i8::checked_isqrt() -> Option<i8>;
const fn u16::isqrt() -> u16;
const fn i16::isqrt() -> i16;
const fn i16::checked_isqrt() -> Option<i16>;
const fn u32::isqrt() -> u32;
const fn i32::isqrt() -> i32;
const fn i32::checked_isqrt() -> Option<i32>;
const fn u64::isqrt() -> u64;
const fn i64::isqrt() -> i64;
const fn i64::checked_isqrt() -> Option<i64>;
const fn u128::isqrt() -> u128;
const fn i128::isqrt() -> i128;
const fn i128::checked_isqrt() -> Option<i128>;
const fn usize::isqrt() -> usize;
const fn isize::isqrt() -> isize;
const fn isize::checked_isqrt() -> Option<isize>;
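For illustration, a minimal usage sketch, assuming a nightly toolchain with the feature gate above (the panic on negative signed input mirrors the ilog methods and is discussed in the activity below):

```rust
#![feature(isqrt)]

fn main() {
    // Unsigned: isqrt always succeeds and rounds down.
    assert_eq!(10u32.isqrt(), 3);
    assert_eq!(u64::MAX.isqrt(), u32::MAX as u64);

    // Signed: checked_isqrt returns None for negative inputs...
    assert_eq!(17i32.isqrt(), 4);
    assert_eq!((-4i32).checked_isqrt(), None);
    // ...while isqrt on a negative value panics, mirroring the ilog
    // methods (see the discussion further down in this thread).
}
```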
Steps / History
- Final comment period (FCP)
- Stabilization PR
Unresolved Questions
- None yet.
Activity
ChaiTRex commented on Sep 29, 2023
Once #116176 is merged, I would like to submit improved tests and compiler optimization hints for signed integers with tighter upper bounds than with the corresponding unsigned integers.
Separately, I'm going to try to use f32 and f64 to speed up isqrt, as I think that, up to 64-bit integers, the square root function on those might be faster and any errors in the outputs can be adjusted for. This would have the downsides of making the functions non-const and multiplying the written implementations, as doing this effectively would require different code for different bit sizes (for example, u32 could use f64's 53-bit mantissa with no errors, while u64 would need to use f64 with an error adjustment).
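A rough sketch of that idea (illustrative only, not the benchmarked code): a u32 converts to f64 exactly and the truncated hardware result is already correct, while a u64 may lose precision in the conversion and needs an integer correction afterwards.

```rust
// Every u32 is exactly representable in f64's 53-bit mantissa, so the
// truncated hardware square root is already the integer square root.
fn isqrt_u32_via_f64(n: u32) -> u32 {
    (n as f64).sqrt() as u32
}

// A u64 may not convert to f64 exactly, so the candidate can be off by a
// small amount in either direction and is corrected with integer arithmetic.
fn isqrt_u64_via_f64(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    // Step down while r is too large (or r * r would overflow).
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    // Step up while the next candidate still fits.
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}
```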
ChaiTRex commented on Sep 30, 2023
In a benchmarking repository, I've implemented some methods based on floating-point instructions. I get the following speed improvements with AMD Ryzen 9 5900X:
Speed of original and floating-point methods
Edit: Rebenchmarked because I used the Karatsuba square root algorithm for 128-bit integers to get a major speedup.
Cross-referenced: cfg(target_feature) support for hardware floating point detection (#64514)

ryanavella commented on Jan 17, 2024
@ChaiTRex I noticed in your benchmarking repository that you've commented out (what I presume was previously) a lookup table implementation for u8 and i8; how did that perform?

I benchmarked a few naive lookup tables, and at least on my machine (ThinkPad T490, i5-8365U) the results look comparable, arguably even competitive.
Benchmarks
Here are my unsigned implementations for what I'm currently calling lut16, lut32, lut64, and lut256, which offer different tradeoffs between table size and branchiness:
(note: for the signed benchmarks I delegated to the unsigned implementation, on the assumption that we don't want to bloat up binaries with redundant lookup tables)
lut16
lut32
lut64
lut256
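As a rough illustration of the general shape (not the actual code behind the collapsed blocks above), a lut256-style implementation for u8 could look like this, with i8 delegating to the same table:

```rust
// Illustrative sketch of a lut256-style implementation: a full 256-entry
// table mapping every u8 to its integer square root, built at compile time.
const ISQRT_U8: [u8; 256] = {
    let mut table = [0u8; 256];
    let mut n = 0usize;
    while n < 256 {
        // Largest r with r * r <= n.
        let mut r = 0usize;
        while (r + 1) * (r + 1) <= n {
            r += 1;
        }
        table[n] = r as u8;
        n += 1;
    }
    table
};

fn isqrt_u8(n: u8) -> u8 {
    // One memory access, no branches.
    ISQRT_U8[n as usize]
}

fn checked_isqrt_i8(n: i8) -> Option<i8> {
    // Delegate to the unsigned table to avoid a redundant signed table.
    if n < 0 {
        None
    } else {
        Some(ISQRT_U8[n as usize] as i8)
    }
}
```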
ChaiTRex commented on Jan 25, 2024
@ryanavella

I didn't include either the table or libgmp versions because of copyright concerns. table uses fred_sqrt from part 3c of Paul Hsieh's Square Root page (note that the 256-entry u8 table there contains fixed-point square roots where the integer part is the most significant four bits). This table is then used for u32's isqrt, with fred_sqrt edited by me to use its repetitiveness to significantly reduce the number of branches.

Benchmarks, including commented out implementations

It should be noted that libgmp was somewhat faster than table last time I checked, but now they're essentially equivalent.

Since the concern is reducing the amount of memory used, another consideration is code size:
lut16
lut32
lut64
lut256
Since each value in your tables takes up four bits, two of them can fit in a u8, so maybe that can help as well.

It should be noted that lut256 can probably be inlined into one memory access with a small computation of the address.
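A quick sketch of the nibble-packing idea mentioned above (again only illustrative):

```rust
// Two 4-bit square roots packed per byte: the table for u8 shrinks from 256
// bytes to 128 bytes at the cost of a shift and a mask on lookup.
const PACKED_ISQRT_U8: [u8; 128] = {
    let mut table = [0u8; 128];
    let mut n = 0usize;
    while n < 256 {
        // Largest r with r * r <= n; always fits in four bits for a u8 input.
        let mut r = 0usize;
        while (r + 1) * (r + 1) <= n {
            r += 1;
        }
        // Even inputs go in the low nibble, odd inputs in the high nibble.
        table[n / 2] |= (r as u8) << ((n % 2) * 4);
        n += 1;
    }
    table
};

fn isqrt_u8_packed(n: u8) -> u8 {
    (PACKED_ISQRT_U8[(n / 2) as usize] >> ((n % 2) * 4)) & 0x0F
}
```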
tfpf commented on Feb 4, 2024
Is there an estimate on when this might become available in stable?
ryanavella commented on Feb 4, 2024
I'm not sure if these are blocking stabilization necessarily, but these are some outstanding questions that I'd like to see answered:

- whether a panicking API is acceptable here, or whether users should instead have to opt in explicitly (e.g. x.checked_sqrt().unwrap())
- whether floating point should be used to speed things up (blocked on cfg(target_feature) support for hardware floating point detection, #64514)

Also, we could always use more testers and more benchmarks.
tfpf commented on Feb 16, 2024
My two cents, as a relatively new user (started using Rust just a year ago), who is absolutely not an expert in compiler/library design.
- x.ilog2(), which has been stabilised, panics if x <= 0. (I assume that's not the same as 'having a panicking API'?) Whatever is true of it should also be true of x.isqrt() when x < 0.
- On lookup tables, it may be worth checking whether musl has an integer square root or uses lookup tables for anything similar.

ryanavella commented on Feb 16, 2024
Maybe I'm reading the musl sources wrong, but I don't see an implementation for integer square root? I did find this lookup table used for (floating-point) sqrt. Keep in mind that this implementation is probably only relevant to platforms without hardware floating point.

tfpf commented on Feb 16, 2024
Nice find; I didn’t check thoroughly. Yes, they don’t have an integer square root function, but I was trying to see whether they used any lookup tables at all, since musl is a relatively new library (compared to how old C is) comparable in age to Rust. (Probably not fair to compare a library and a language, though.)
ChaiTRex commented on Feb 18, 2024
@ryanavella With regard to your questions:

I agree with what @tfpf said about maintaining consistency with the behavior of the ilog functions.

I think that, for a first release, it's not necessary to have floating point support, as a decently fast, non-FP (because some targets have no FP at all or slow FP square root) method for u8 and the Karatsuba method for everything else will be pretty fast.

As far as a proof, a rough one is available in the 34th comment of this linked discussion.
If it matters, I also looked into the rounding mode used and found that Rust uses and requires the default rounding mode and that the default rounding mode for IEEE-754 is "round-to-nearest-ties-to-even".
It would be nice if the s and z optimization settings were available as a cfg flag so that, just by the developer choosing in the standard way whether to optimize for speed or size, the isqrt implementations would then be able to use that to choose between the smallest code+table size or the fastest available implementation.

That and FP-detection may make thoroughly testing it difficult, however.
I'd caution that licensing issues should be considered before using, or starting from and modifying, musl's code, so that the two licenses for the Rust standard library (including who's given credit for the code) remain all that's needed to use the code.
ChaiTRex commented on Feb 18, 2024
In my last comment in this thread, I pointed out a discussion I had started on internals.rust-lang.org.
I also started a discussion on the Rust Zulip, where I was informed that core::intrinsics::const_eval_select can be used to take two implementations of isqrt, one const fn and a faster one that can't be const fn (like here with the floating point operations), and meld them into a combined const fn that uses the fast version except in a const context.

The const fn version could use a large table to speed itself up without affecting the executable, as that version is only used at compile time.
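A schematic sketch of how those pieces could fit together, assuming a recent nightly toolchain (const_eval_select is an internal intrinsic whose feature gates and safety requirements have shifted between versions, so the details below may need adjusting):

```rust
#![feature(core_intrinsics, const_eval_select)]
#![allow(internal_features)]

use core::intrinsics::const_eval_select;

// const-compatible, integer-only digit-by-digit (binary) square root.
const fn isqrt_compiletime(n: u64) -> u64 {
    let mut x = n;
    let mut result = 0u64;
    let mut bit = 1u64 << 62; // highest power of four representable in u64
    while bit > n {
        bit >>= 2;
    }
    while bit != 0 {
        if x >= result + bit {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    result
}

// Faster runtime-only version using the hardware f64 square root plus an
// integer correction (see the floating-point sketch earlier in the thread).
fn isqrt_runtime(n: u64) -> u64 {
    let mut r = (n as f64).sqrt() as u64;
    while r.checked_mul(r).map_or(true, |sq| sq > n) {
        r -= 1;
    }
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
        r += 1;
    }
    r
}

// The melded const fn: the portable version in const contexts, the fast
// version everywhere else. Both must agree on every input.
pub const fn isqrt(n: u64) -> u64 {
    // Some nightlies require an `unsafe` block around this call.
    const_eval_select((n,), isqrt_compiletime, isqrt_runtime)
}

fn main() {
    const AT_COMPILE_TIME: u64 = isqrt(1 << 40); // goes through isqrt_compiletime
    assert_eq!(AT_COMPILE_TIME, 1 << 20);
    assert_eq!(isqrt(99), 9); // goes through isqrt_runtime
}
```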