-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Rewriteis_ascii
using slice::as_chunks
#144837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
c43111d
to
7cbbbc6
Compare
The simpler source code and shorter assembly seem to boil down to two changes:
The first one seems quite reasonable in many cases. It probably causes a huge performance regression for targets that don't have efficient unaligned loads, but to be fair, those are becoming less common and less important over time. The second change may be quite problematic for some common input sizes, though. Try benchmarking before vs. after on an input that's |
There are some benchmarks in When I originally made that PR, new uses of However, the usize-aligned path is probably still needed for targets without SIMD like |
7cbbbc6
to
ebb9522
Compare
This comment has been minimized.
This comment has been minimized.
ebb9522
to
e4222eb
Compare
Benchmark resultsAArch64 (Apple M3):
x86 (AMD Ryzen 9 9950X):
|
This comment has been minimized.
This comment has been minimized.
e4222eb
to
6ac4350
Compare
This comment has been minimized.
This comment has been minimized.
6ac4350
to
5edf425
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed much more straightforward and I think the results speak reasonably well for themselves.
The previous implementation had a lot useful of info, could you keep / add some more docs here? E.g.:
- How dispatching to simd vs. swar was chosen
- What we expect
is_ascii_simd
to turn into, or what is required for it to become efficient simd - How
UNROLL_FACTOR
andCHUNK_SIZE
were picked
It would also be good to rebase and rerun the benchmarks/codegen demos, we just got a LLVM upgrade so the exact optimizations may be slightly different.
/// ASCII test *without* the chunk-at-a-time optimizations. | ||
/// | ||
/// This is carefully structured to produce nice small code -- it's smaller in | ||
/// `-O` than what the "obvious" ways produces under `-C opt-level=s`. If you | ||
/// touch it, be sure to run (and update if needed) the assembly test. | ||
#[unstable(feature = "str_internals", issue = "none")] | ||
#[doc(hidden)] | ||
#[inline] | ||
pub const fn is_ascii_simple(mut bytes: &[u8]) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check the code size for this compared to the other implementations? I wonder if it may be worth using for cfg(feature = "optimize_for_size")
.
Also, any idea which test it is talking about?
#[inline(always)] | ||
fn is_ascii_scalar(bytes: &[u8]) -> bool { | ||
bytes.iter().all(u8::is_ascii) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_ascii_const
(formerly is_ascii_simple
) seems to optimize pretty well based on the comments, does this get better codegen?
// sufficient alignment for `usize`, because it's a weird edge case. | ||
if len < USIZE_SIZE || len < align_offset || USIZE_SIZE < align_of::<usize>() { | ||
return is_ascii_simple(s); | ||
if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i686 also has sse2, would there be any advantage here?
Generalize the x86-64+sse2 version of
is_ascii
to be architecture-neutral, and rewrite it usingslice::as_chunks
. The new version is both shorter (in terms of Rust source code) and smaller (in terms of produced assembly).Compare the assembly generated before and after:
https://godbolt.org/z/MWKdnaYoK