Rewrite is_ascii using slice::as_chunks #144837

Open: Kmeakin wants to merge 1 commit into master from km/optimize-is-ascii

Conversation

Kmeakin
Contributor

@Kmeakin Kmeakin commented Aug 2, 2025

Generalize the x86-64+sse2 version of is_ascii to be architecture-neutral, and rewrite it using slice::as_chunks. The new version is both shorter (in terms of Rust source code) and smaller (in terms of produced assembly).

Compare the assembly generated before and after:
https://godbolt.org/z/MWKdnaYoK

@rustbot
Collaborator

rustbot commented Aug 2, 2025

r? @tgross35

rustbot has assigned @tgross35.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Aug 2, 2025
@Kmeakin Kmeakin force-pushed the km/optimize-is-ascii branch from c43111d to 7cbbbc6 Compare August 2, 2025 18:08
@hanna-kruppe
Contributor

The simpler source code and shorter assembly seem to boil down to two changes:

  1. Using unaligned loads for every full chunk, instead of trying to align all loads except possibly the first and the last one.
  2. Always using the simple byte-by-byte loop for the last bytes.len() % CHUNK_SIZE bytes, instead of trying to handle it with an unaligned load that overlaps with the preceding chunk.

The first one seems quite reasonable in many cases. It probably causes a huge performance regression for targets that don't have efficient unaligned loads, but to be fair, those are becoming less common and less important over time.

The second change may be quite problematic for some common input sizes, though. Try benchmarking before vs. after on an input that's 2 * CHUNK_SIZE - 1 bytes long, or with random short input lengths that make the branches and iteration counts less predictable.
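
The overlapping-tail technique mentioned in point 2 can be sketched as follows. This is a minimal illustration under assumed names (is_ascii_overlapping_tail, window_is_ascii) and an assumed CHUNK_SIZE of 16, not the implementation that was removed. It uses chunks_exact so that it also compiles on older toolchains.

```rust
// Hypothetical chunk width, chosen only for illustration.
const CHUNK_SIZE: usize = 16;

// True if no byte in the window has its high bit set.
fn window_is_ascii(window: &[u8]) -> bool {
    window.iter().fold(0u8, |acc, &b| acc | b) & 0x80 == 0
}

fn is_ascii_overlapping_tail(bytes: &[u8]) -> bool {
    // Inputs shorter than one chunk have no preceding chunk to overlap with.
    if bytes.len() < CHUNK_SIZE {
        return bytes.iter().all(u8::is_ascii);
    }

    // Check every full chunk.
    let mut chunks = bytes.chunks_exact(CHUNK_SIZE);
    for chunk in &mut chunks {
        if !window_is_ascii(chunk) {
            return false;
        }
    }

    if chunks.remainder().is_empty() {
        return true;
    }

    // Instead of looping over the remainder byte by byte, re-check the final
    // CHUNK_SIZE bytes as one window. It overlaps the previous chunk, which
    // is harmless because the check is idempotent.
    window_is_ascii(&bytes[bytes.len() - CHUNK_SIZE..])
}
```

With CHUNK_SIZE = 16 and a 31-byte input (2 * CHUNK_SIZE - 1), this version does two window checks, while a byte-by-byte tail does one window check plus 15 scalar iterations.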

@okaneco
Contributor

okaneco commented Aug 2, 2025

There are some benchmarks in library/core/benches/ascii/is_ascii.rs. I added more in #130733, along with a codegen test.

When I originally made that PR, new uses of const_eval_select seemed to be discouraged when making a function const, and then the situation was a little different by the time it was reviewed and merged.

However, the usize-aligned path is probably still needed for targets without SIMD like i586-unknown-linux-gnu since it can do SWAR ASCII checks instead of byte at a time.
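
As a rough illustration of what a SWAR ("SIMD within a register") ASCII check means here, the sketch below tests one usize worth of bytes per iteration with a single mask. The function name is hypothetical, and unlike the usize-aligned path discussed above it copies bytes into a word rather than performing aligned loads.

```rust
use std::mem::size_of;

fn is_ascii_swar_sketch(bytes: &[u8]) -> bool {
    const WORD: usize = size_of::<usize>();
    // 0x80 repeated in every byte position of a usize.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        // Assemble one machine word from WORD bytes. Byte order does not
        // matter because the mask is the same in every byte position.
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(chunk);
        let word = usize::from_ne_bytes(buf);

        // Any set high bit means some byte in this word is not ASCII.
        if word & HIGH_BITS != 0 {
            return false;
        }
    }

    // Fewer than WORD bytes remain; check them one at a time.
    chunks.remainder().iter().all(u8::is_ascii)
}
```

On a 32-bit target without SIMD this still tests four bytes per comparison, which is the advantage over a pure byte-at-a-time loop.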

@Kmeakin Kmeakin force-pushed the km/optimize-is-ascii branch from 7cbbbc6 to ebb9522 Compare August 6, 2025 00:35
@Kmeakin Kmeakin marked this pull request as draft August 6, 2025 00:36
@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 6, 2025
@Kmeakin Kmeakin force-pushed the km/optimize-is-ascii branch from ebb9522 to e4222eb Compare August 6, 2025 01:10
@Kmeakin
Contributor Author

Kmeakin commented Aug 6, 2025

Benchmark results

AArch64 (Apple M3):

benchmarks:
    ascii::is_ascii::long::is_ascii_simd_08                  13.42ns/iter +/- 1.01
    ascii::is_ascii::long::is_ascii_simd_16                   7.41ns/iter +/- 0.33
    ascii::is_ascii::long::is_ascii_simd_32                   9.56ns/iter +/- 0.19
    ascii::is_ascii::long::is_ascii_swar_1                   15.10ns/iter +/- 0.16
    ascii::is_ascii::long::is_ascii_swar_2                   10.05ns/iter +/- 2.03
    ascii::is_ascii::long::is_ascii_swar_4                    6.53ns/iter +/- 0.86
    ascii::is_ascii::medium::is_ascii_simd_08                 2.87ns/iter +/- 0.97
    ascii::is_ascii::medium::is_ascii_simd_16                 1.87ns/iter +/- 0.08
    ascii::is_ascii::medium::is_ascii_simd_32                 1.15ns/iter +/- 0.01
    ascii::is_ascii::medium::is_ascii_swar_1                  4.00ns/iter +/- 0.63
    ascii::is_ascii::medium::is_ascii_swar_2                  3.75ns/iter +/- 0.10
    ascii::is_ascii::medium::is_ascii_swar_4                  2.92ns/iter +/- 0.08
    ascii::is_ascii::medium_15::is_ascii_simd_08              0.68ns/iter +/- 0.02
    ascii::is_ascii::medium_15::is_ascii_simd_16              0.63ns/iter +/- 0.06
    ascii::is_ascii::medium_15::is_ascii_simd_32              2.27ns/iter +/- 0.03
    ascii::is_ascii::medium_15::is_ascii_swar_1               3.18ns/iter +/- 0.04
    ascii::is_ascii::medium_15::is_ascii_swar_2               2.38ns/iter +/- 0.03
    ascii::is_ascii::medium_15::is_ascii_swar_4               2.92ns/iter +/- 0.04
    ascii::is_ascii::short::is_ascii_simd_08                  2.12ns/iter +/- 0.04
    ascii::is_ascii::short::is_ascii_simd_16                  2.92ns/iter +/- 0.06
    ascii::is_ascii::short::is_ascii_simd_32                  3.71ns/iter +/- 0.04
    ascii::is_ascii::short::is_ascii_swar_1                   2.16ns/iter +/- 0.04
    ascii::is_ascii::short::is_ascii_swar_2                   2.16ns/iter +/- 0.04
    ascii::is_ascii::short::is_ascii_swar_4                   2.16ns/iter +/- 0.05
    ascii::is_ascii::unaligned_both_long::is_ascii_simd_08   13.25ns/iter +/- 0.37
    ascii::is_ascii::unaligned_both_long::is_ascii_simd_16    7.42ns/iter +/- 0.07
    ascii::is_ascii::unaligned_both_long::is_ascii_simd_32    9.62ns/iter +/- 0.09
    ascii::is_ascii::unaligned_both_long::is_ascii_swar_1    15.00ns/iter +/- 0.24
    ascii::is_ascii::unaligned_both_long::is_ascii_swar_2     9.97ns/iter +/- 0.12
    ascii::is_ascii::unaligned_both_long::is_ascii_swar_4     5.77ns/iter +/- 0.05
    ascii::is_ascii::unaligned_both_medium::is_ascii_simd_08  1.43ns/iter +/- 0.05
    ascii::is_ascii::unaligned_both_medium::is_ascii_simd_16  2.13ns/iter +/- 0.10
    ascii::is_ascii::unaligned_both_medium::is_ascii_simd_32  4.61ns/iter +/- 0.10
    ascii::is_ascii::unaligned_both_medium::is_ascii_swar_1   3.18ns/iter +/- 0.02
    ascii::is_ascii::unaligned_both_medium::is_ascii_swar_2   2.39ns/iter +/- 0.03
    ascii::is_ascii::unaligned_both_medium::is_ascii_swar_4   2.91ns/iter +/- 0.04
    ascii::is_ascii::unaligned_head_long::is_ascii_simd_08   13.53ns/iter +/- 0.12
    ascii::is_ascii::unaligned_head_long::is_ascii_simd_16    7.16ns/iter +/- 0.07
    ascii::is_ascii::unaligned_head_long::is_ascii_simd_32    9.53ns/iter +/- 0.09
    ascii::is_ascii::unaligned_head_long::is_ascii_swar_1    15.51ns/iter +/- 0.27
    ascii::is_ascii::unaligned_head_long::is_ascii_swar_2    10.22ns/iter +/- 0.10
    ascii::is_ascii::unaligned_head_long::is_ascii_swar_4     6.28ns/iter +/- 0.15
    ascii::is_ascii::unaligned_head_medium::is_ascii_simd_08  3.09ns/iter +/- 0.03
    ascii::is_ascii::unaligned_head_medium::is_ascii_simd_16  5.84ns/iter +/- 0.28
    ascii::is_ascii::unaligned_head_medium::is_ascii_simd_32  9.53ns/iter +/- 0.14
    ascii::is_ascii::unaligned_head_medium::is_ascii_swar_1   3.71ns/iter +/- 0.04
    ascii::is_ascii::unaligned_head_medium::is_ascii_swar_2   2.65ns/iter +/- 0.06
    ascii::is_ascii::unaligned_head_medium::is_ascii_swar_4   2.92ns/iter +/- 0.03
    ascii::is_ascii::unaligned_tail_long::is_ascii_simd_08   13.77ns/iter +/- 0.14
    ascii::is_ascii::unaligned_tail_long::is_ascii_simd_16    7.67ns/iter +/- 0.33
    ascii::is_ascii::unaligned_tail_long::is_ascii_simd_32    9.55ns/iter +/- 0.09
    ascii::is_ascii::unaligned_tail_long::is_ascii_swar_1    15.15ns/iter +/- 0.10
    ascii::is_ascii::unaligned_tail_long::is_ascii_swar_2     9.97ns/iter +/- 0.10
    ascii::is_ascii::unaligned_tail_long::is_ascii_swar_4     5.93ns/iter +/- 0.08
    ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_08  1.55ns/iter +/- 0.04
    ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_16  2.25ns/iter +/- 0.04
    ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_32  4.76ns/iter +/- 0.07
    ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_1   3.18ns/iter +/- 0.10
    ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_2   2.65ns/iter +/- 0.02
    ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_4   2.65ns/iter +/- 0.01

x86 (AMD Ryzen 9 9950X):

benchmarks:
   ascii::is_ascii::long::is_ascii_simd_08                   8.57ns/iter +/- 0.33
   ascii::is_ascii::long::is_ascii_simd_16                   3.81ns/iter +/- 0.01
   ascii::is_ascii::long::is_ascii_simd_32                   2.85ns/iter +/- 0.07
   ascii::is_ascii::long::is_ascii_swar_1                   10.87ns/iter +/- 5.34
   ascii::is_ascii::long::is_ascii_swar_2                    6.32ns/iter +/- 0.02
   ascii::is_ascii::long::is_ascii_swar_4                    4.81ns/iter +/- 0.08
   ascii::is_ascii::medium::is_ascii_simd_08                 1.20ns/iter +/- 0.40
   ascii::is_ascii::medium::is_ascii_simd_16                 0.92ns/iter +/- 0.00
   ascii::is_ascii::medium::is_ascii_simd_32                 0.65ns/iter +/- 0.06
   ascii::is_ascii::medium::is_ascii_swar_1                  2.76ns/iter +/- 0.05
   ascii::is_ascii::medium::is_ascii_swar_2                  2.59ns/iter +/- 0.02
   ascii::is_ascii::medium::is_ascii_swar_4                  2.47ns/iter +/- 0.06
   ascii::is_ascii::medium_15::is_ascii_simd_08              0.58ns/iter +/- 0.00
   ascii::is_ascii::medium_15::is_ascii_simd_16              0.45ns/iter +/- 0.00
   ascii::is_ascii::medium_15::is_ascii_simd_32              1.62ns/iter +/- 0.00
   ascii::is_ascii::medium_15::is_ascii_swar_1               2.40ns/iter +/- 0.08
   ascii::is_ascii::medium_15::is_ascii_swar_2               2.22ns/iter +/- 0.01
   ascii::is_ascii::medium_15::is_ascii_swar_4               2.22ns/iter +/- 0.00
   ascii::is_ascii::short::is_ascii_simd_08                  1.45ns/iter +/- 0.00
   ascii::is_ascii::short::is_ascii_simd_16                  1.99ns/iter +/- 1.05
   ascii::is_ascii::short::is_ascii_simd_32                  1.50ns/iter +/- 1.24
   ascii::is_ascii::short::is_ascii_swar_1                   1.27ns/iter +/- 0.00
   ascii::is_ascii::short::is_ascii_swar_2                   1.18ns/iter +/- 0.01
   ascii::is_ascii::short::is_ascii_swar_4                   1.27ns/iter +/- 0.00
   ascii::is_ascii::unaligned_both_long::is_ascii_simd_08    6.31ns/iter +/- 0.51
   ascii::is_ascii::unaligned_both_long::is_ascii_simd_16    3.43ns/iter +/- 0.16
   ascii::is_ascii::unaligned_both_long::is_ascii_simd_32    3.75ns/iter +/- 0.12
   ascii::is_ascii::unaligned_both_long::is_ascii_swar_1     9.19ns/iter +/- 0.02
   ascii::is_ascii::unaligned_both_long::is_ascii_swar_2     6.49ns/iter +/- 0.02
   ascii::is_ascii::unaligned_both_long::is_ascii_swar_4     4.87ns/iter +/- 0.06
   ascii::is_ascii::unaligned_both_medium::is_ascii_simd_08  1.03ns/iter +/- 0.03
   ascii::is_ascii::unaligned_both_medium::is_ascii_simd_16  1.45ns/iter +/- 0.01
   ascii::is_ascii::unaligned_both_medium::is_ascii_simd_32  3.24ns/iter +/- 0.01
   ascii::is_ascii::unaligned_both_medium::is_ascii_swar_1   2.22ns/iter +/- 0.00
   ascii::is_ascii::unaligned_both_medium::is_ascii_swar_2   2.05ns/iter +/- 0.00
   ascii::is_ascii::unaligned_both_medium::is_ascii_swar_4   2.40ns/iter +/- 0.01
   ascii::is_ascii::unaligned_head_long::is_ascii_simd_08    8.88ns/iter +/- 0.22
   ascii::is_ascii::unaligned_head_long::is_ascii_simd_16    4.01ns/iter +/- 0.01
   ascii::is_ascii::unaligned_head_long::is_ascii_simd_32    4.00ns/iter +/- 0.43
   ascii::is_ascii::unaligned_head_long::is_ascii_swar_1     9.68ns/iter +/- 0.05
   ascii::is_ascii::unaligned_head_long::is_ascii_swar_2     6.42ns/iter +/- 0.02
   ascii::is_ascii::unaligned_head_long::is_ascii_swar_4     5.76ns/iter +/- 0.06
   ascii::is_ascii::unaligned_head_medium::is_ascii_simd_08  2.05ns/iter +/- 0.10
   ascii::is_ascii::unaligned_head_medium::is_ascii_simd_16  5.01ns/iter +/- 0.02
   ascii::is_ascii::unaligned_head_medium::is_ascii_simd_32  6.08ns/iter +/- 0.39
   ascii::is_ascii::unaligned_head_medium::is_ascii_swar_1   2.58ns/iter +/- 0.01
   ascii::is_ascii::unaligned_head_medium::is_ascii_swar_2   2.46ns/iter +/- 0.02
   ascii::is_ascii::unaligned_head_medium::is_ascii_swar_4   2.59ns/iter +/- 0.01
   ascii::is_ascii::unaligned_tail_long::is_ascii_simd_08    5.68ns/iter +/- 0.07
   ascii::is_ascii::unaligned_tail_long::is_ascii_simd_16    3.15ns/iter +/- 0.04
   ascii::is_ascii::unaligned_tail_long::is_ascii_simd_32    3.04ns/iter +/- 0.07
   ascii::is_ascii::unaligned_tail_long::is_ascii_swar_1     9.58ns/iter +/- 0.02
   ascii::is_ascii::unaligned_tail_long::is_ascii_swar_2     6.38ns/iter +/- 0.01
   ascii::is_ascii::unaligned_tail_long::is_ascii_swar_4     4.86ns/iter +/- 0.07
   ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_08  1.10ns/iter +/- 0.00
   ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_16  1.54ns/iter +/- 0.00
   ascii::is_ascii::unaligned_tail_medium::is_ascii_simd_32  3.23ns/iter +/- 0.04
   ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_1   2.59ns/iter +/- 0.16
   ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_2   2.40ns/iter +/- 0.01
   ascii::is_ascii::unaligned_tail_medium::is_ascii_swar_4   2.32ns/iter +/- 0.02

@Kmeakin Kmeakin force-pushed the km/optimize-is-ascii branch from e4222eb to 6ac4350 Compare August 6, 2025 01:26
@Kmeakin Kmeakin force-pushed the km/optimize-is-ascii branch from 6ac4350 to 5edf425 Compare August 6, 2025 21:03
@Kmeakin Kmeakin marked this pull request as ready for review August 6, 2025 21:03
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 6, 2025
Contributor

@tgross35 tgross35 left a comment

This is indeed much more straightforward and I think the results speak reasonably well for themselves.

The previous implementation had a lot of useful info; could you keep / add some more docs here? E.g.:

  • How dispatching to simd vs. swar was chosen
  • What we expect is_ascii_simd to turn into, or what is required for it to become efficient simd
  • How UNROLL_FACTOR and CHUNK_SIZE were picked

It would also be good to rebase and rerun the benchmarks/codegen demos; we just got an LLVM upgrade, so the exact optimizations may be slightly different.

Comment on lines -330 to -338
/// ASCII test *without* the chunk-at-a-time optimizations.
///
/// This is carefully structured to produce nice small code -- it's smaller in
/// `-O` than what the "obvious" ways produces under `-C opt-level=s`. If you
/// touch it, be sure to run (and update if needed) the assembly test.
#[unstable(feature = "str_internals", issue = "none")]
#[doc(hidden)]
#[inline]
pub const fn is_ascii_simple(mut bytes: &[u8]) -> bool {

Could you check the code size for this compared to the other implementations? I wonder if it may be worth using for cfg(feature = "optimize_for_size").

Also, any idea which test it is talking about?

Comment on lines +361 to +364
#[inline(always)]
fn is_ascii_scalar(bytes: &[u8]) -> bool {
    bytes.iter().all(u8::is_ascii)
}

is_ascii_const (formerly is_ascii_simple) seems to optimize pretty well based on the comments, does this get better codegen?
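
To make the comparison concrete, the two styles of scalar loop could be put side by side on a compiler explorer. The sketch below is an assumption about what such a pair might look like, with hypothetical names. The first function is a const-compatible loop in the spirit of the doc comment quoted above, and the second mirrors the iterator form of is_ascii_scalar. Neither is copied from the PR.

```rust
/// Const-compatible scalar check: peels bytes off the end of the slice
/// until it is empty or a non-ASCII byte is found.
const fn is_ascii_const_sketch(mut bytes: &[u8]) -> bool {
    while let [rest @ .., last] = bytes {
        if !last.is_ascii() {
            break;
        }
        bytes = rest;
    }
    // The slice is empty only if every byte passed the check.
    bytes.is_empty()
}

/// Iterator-based scalar check; not usable in const contexts.
fn is_ascii_iter_sketch(bytes: &[u8]) -> bool {
    bytes.iter().all(u8::is_ascii)
}
```

Both return the same result for every input, so any difference would show up only in code size and speed.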

// sufficient alignment for `usize`, because it's a weird edge case.
if len < USIZE_SIZE || len < align_offset || USIZE_SIZE < align_of::<usize>() {
return is_ascii_simple(s);
if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {

i686 also has sse2, would there be any advantage here?
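
One way to read this question: testing only the target feature, without the target_arch condition, would also select the chunked path on i686 builds that enable SSE2. The sketch below shows that kind of compile-time selection. The function name, the returned labels, and the inclusion of neon are all assumptions for illustration, not the PR's dispatch logic.

```rust
/// Reports which path a feature-only check would select for the current
/// compilation target. Purely illustrative.
fn selected_path() -> &'static str {
    // cfg! expands to a compile-time boolean, so the untaken branch is
    // trivially removed by the optimizer.
    if cfg!(any(target_feature = "sse2", target_feature = "neon")) {
        "chunked"
    } else {
        "swar"
    }
}
```

Whether the chunked path is actually faster on 32-bit x86 would still need to be measured with the existing benchmarks.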
