Compare→bitmask SIMD lowering: where custom SIMD actually helps#8280
Draft
joseph-isaacs wants to merge 10 commits into
Draft
Compare→bitmask SIMD lowering: where custom SIMD actually helps#8280joseph-isaacs wants to merge 10 commits into
joseph-isaacs wants to merge 10 commits into
Conversation
…lowering Classify every collect-bool / bitmask-pack site in the workspace by whether it can benefit from a vector compare -> opmask -> kmov lowering (vptestmb/vpcmpd) instead of the scalar `packed |= (pred as u64) << i` idiom, which LLVM's SLP vectorizer turns into a slow vpsllvq shift-OR reduction. Findings: the chokepoint is BitBufferMut::collect_bool -> collect_bool_word; the hottest contiguous callers are primitive `between` and FastLanes `stream_predicate`. PEXT/PDEP/VBMI paths (count_ones, bool filter, bit_transpose, intersect_by_rank) are already SIMD and left untouched. Includes compiler-pass provenance (rustc emits scalar IR; SLPVectorizer is the culprit; X86 ISel knows the good lowering) and measured speedups (~20x at L1 for collect_bool). Analysis only; no kernels changed. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…itmask Adds `pack_nonzero_bytes` (runtime-dispatched AVX-512BW vptestmb / AVX2 movemask / scalar fallback) and routes `BitBufferMut::from(&[u8])` and `From<&[bool]>` through it via a new `from_nonzero_bytes` constructor. The previous path went through the closure-generic `collect_bool` -> `collect_bool_word` scalar `packed |= (b != 0) << i` idiom, which LLVM's SLP vectorizer lowers to a slow vpsllvq shift-OR reduction instead of a mask-compare. Benchmarks on the default (baseline) build show the real `BitBufferMut::from(&[u8])` API going from ~910us to ~27us at 1 MiB (~34x; 1.16 GB/s -> 39 GB/s), and 12-25x even when the scalar path is allowed to auto-vectorize with target-cpu=native. Adds a `matches_reference` unit test and three divan benches (pack_truthy_bytes, pack_truthy_bytes_simd, bitbuffer_from_u8). All 784 vortex-buffer tests pass; lib clippy clean. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds between_i32_scalar (the exact `collect_bool_words` predicate that `primitive between` uses) vs between_i32_simd (AVX-512 vpcmpd + kmovw), showing 29-71x for the typed-comparison generalization of pack_nonzero_bytes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ndings Adds offsets_eq_scalar/offsets_eq_simd (the varbin compare_offsets_to_empty shape: off[i]==off[i+1] -> vpcmpeqd+kmovw), measuring 5.4-14x. Updates the audit doc with: the three patterns unified as one parameterized shape; the std::simd finding (portable_simd to_bitmask lowers byte-identically to the intrinsic, ~zero perf delta, but nightly-only); the in-tree benchmark table for all three shapes; and the full ranked typed-comparison site list from the exhaustive search (28 sites, top-3 contiguous targets). Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
An adversarial re-review of the slowness claims found two corrections: - The in-tree scalar gap is TWO stacked factors, not one: a bounds-checked closure (~5.5x under AVX-512, a cheap safe source fix) AND the SLP shift-OR idiom (~8x residual, the real LLVM gap). Prior text understated bounds-checks. - "std::simd byte-identical to the intrinsic" was overstated: the hot loop is identical with equivalent perf, but from_slice adds a bounds-check branch. Confirmed unchanged: it is NOT a build-flag artifact (no target-cpu flag gets scalar within 2x of the intrinsic; asm shows vpsllvq reduction even with full AVX-512 and a known trip count), and the default-build magnitude (40-70x) holds. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds `pack_slice_predicate(words, values, pred)`: a safe, stable (no `unsafe`, no SIMD intrinsics) packer that takes a slice + per-element predicate and iterates via `chunks_exact(64)`. Unlike the index-closure `collect_bool_words` (`|i| values[i] > k`), there is no per-element bounds check, so LLVM can auto-vectorize the scalar shift-OR loop the index-closure form blocks. Quantified against the bounds-checked closure and the AVX-512 intrinsic (new `*_chunked` benches; see dev-notes/collect-bool-simd-audit.md): - i32 `between`: recovers ~6x at the default baseline build and ~9.6-11.5x under target-cpu=native; at 1Mi/native it lands within 1.04x of hand AVX-512. - byte `!=0`: ~no change at baseline, ~4x under native. This isolates the bounds-checked index closure as the dominant, cheaply-fixable factor for the realistic compare->bitmask shapes, separate from the residual LLVM SLP shift-OR gap that only intrinsics/std::simd close. Includes an rstest correctness test against `collect_bool_words` across tail sizes (caught and fixed a zip over-consumption bug in the remainder write). Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…e slice packer
Add `BitBuffer{,Mut}::collect_bool_slice(values, pred)` (wrapping
`pack_slice_predicate`) and use it in `primitive` and `decimal` `between`,
replacing the `BitBuffer::collect_bool` index closure that did
`unsafe { *slice.get_unchecked(idx) }`. This drops the `unsafe`.
Performance: NEUTRAL. Validated three independent ways (clean standalone builds
calling the real APIs, Valgrind callgrind instruction counts, and the
codspeed-divan benches run locally):
- SSE2 baseline: 839 -> 835 us/1Mi; callgrind 260.4M -> 262.3M instructions.
- target-cpu=native: 174 -> 173 us/1Mi; 1.90 -> 1.79 us/16Ki.
The i32 compare already auto-vectorizes in production (direct, fully-inlined
calls), so this is a safety/readability cleanup, not a speedup. NOTE: the
"recovers ~6-11x" figures in 0320bbe were a divan-harness artifact -- the bench
harness blocks auto-vectorization of the index-closure form only; production is
unaffected. The genuine win from this line of work is `pack_nonzero_bytes`
(byte != 0 pack), which does not auto-vectorize and is ~50-100x faster than
develop's scalar `collect_bool` at the SSE2 baseline.
Adds A/B benches (between_i32_unchecked, between_bitbuffer_{original,new}) used
for this validation.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
Wrap the multi-arg `#[allow(...)]` across lines per `cargo +nightly fmt`. No behavior change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
Add `compare_lowering` bench comparing the best portable scalar form against runtime-dispatched SIMD (avx512->avx2->scalar) for three compare->bitmask lowerings, all writing the same u64 bitmask: * u8 != 0 (byte truthiness pack) * i32 > 5 (single comparison) * 5 < i32 < 10 (between) SIMD paths always do real vector work under CodSpeed's `+avx2` build (u8 reuses the production `pack_nonzero_bytes`). A `verify()` step cross-checks every variant against the scalar reference before benchmarking so a miscompiled lowering fails loudly instead of reporting fast-but-wrong. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
- Remove dev-notes/collect-bool-simd-audit.md: it lacked a REUSE/SPDX header (reuse-check failure) and its numbers are superseded by the validated findings in PR #8280's description. - Allow `clippy::many_single_char_names` in the vortex_bitbuffer bench (terse SIMD/math names), matching compare_lowering, to fix the `-D warnings` lint job. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
Merging this PR will degrade performance by 16.61%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | varbinview_zip_block_mask |
2.9 ms | 3.7 ms | -21.59% |
| ❌ | Simulation | varbinview_zip_fragmented_mask |
6.1 ms | 6.9 ms | -11.31% |
| 🆕 | Simulation | between_scalar_pack[1048576] |
N/A | 2.6 ms | N/A |
| 🆕 | Simulation | between_scalar_pack[16384] |
N/A | 41.2 µs | N/A |
| 🆕 | Simulation | between_simd_bench[1048576] |
N/A | 2.3 ms | N/A |
| 🆕 | Simulation | between_simd_bench[16384] |
N/A | 36.3 µs | N/A |
| 🆕 | Simulation | gt_scalar_pack[1048576] |
N/A | 2.5 ms | N/A |
| 🆕 | Simulation | gt_scalar_pack[16384] |
N/A | 39.7 µs | N/A |
| 🆕 | Simulation | gt_simd_bench[1048576] |
N/A | 2.2 ms | N/A |
| 🆕 | Simulation | gt_simd_bench[16384] |
N/A | 34.9 µs | N/A |
| 🆕 | Simulation | u8_scalar_pack[1048576] |
N/A | 979.9 µs | N/A |
| 🆕 | Simulation | u8_scalar_pack[16384] |
N/A | 16.1 µs | N/A |
| 🆕 | Simulation | u8_scalar_swar[1048576] |
N/A | 934.2 µs | N/A |
| 🆕 | Simulation | u8_scalar_swar[16384] |
N/A | 15.2 µs | N/A |
| 🆕 | Simulation | u8_simd[1048576] |
N/A | 610.9 µs | N/A |
| 🆕 | Simulation | u8_simd[16384] |
N/A | 10 µs | N/A |
| 🆕 | Simulation | between_bitbuffer_new[1024] |
N/A | 5.7 µs | N/A |
| 🆕 | Simulation | between_bitbuffer_new[1048576] |
N/A | 2.8 ms | N/A |
| 🆕 | Simulation | between_bitbuffer_new[16384] |
N/A | 47 µs | N/A |
| 🆕 | Simulation | between_bitbuffer_new[262144] |
N/A | 692.4 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/vector-bitpack-lowering-yKOkC (5d476ef) with develop (e06d80b)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch investigates where hand-written SIMD genuinely beats compiler
auto-vectorization for compare → bitmask lowerings, and lands the one place
it clearly does plus a safety cleanup. It is intentionally draft — the
headline is the analysis, validated by the
compare_loweringbenchmark (and,I expect, by the CodSpeed run on this PR).
What's here
pack_nonzero_bytes(already on the branch): runtime-dispatched(avx512 → avx2 → scalar) SIMD pack of
u8 != 0into a bitmask, wired intothe
BitBuffer/BitBufferMutFrom<&[u8]>/From<&[bool]>paths.betweensafety cleanup: route primitive/decimalbetweenthrough a newsafe
collect_bool_slice(drops anunsafe { *slice.get_unchecked(idx) }).This is perf-neutral, not a speedup — see below.
compare_loweringbench: best portable scalar vs runtime-dispatched SIMDfor three lowerings (
u8 != 0,i32 > 5,5 < i32 < 10), all writing thesame
u64bitmask, with averify()correctness gate.The finding: it depends entirely on build target and working set
Vortex ships baseline x86-64 (SSE2) by default (no
target-cpuin.cargo/config.toml); CodSpeed builds with+avx2; the wall-clock bench CIuses
-C target-cpu=native(AVX-512). Whether scalar matches SIMD changesacross all three.
Wall-clock (
taskset -c 0, median µs), all writing au64bitmask:u8 != 0— SIMD always wins, no scalar matchesThe compiler can't synthesize the byte→bit pack; only a SIMD movemask/
vptestmbdoes it well. (The "best scalar" here is a carry-free SWAR — the textbook
haszerotrick is wrong for an exact per-byte mask due to inter-byte borrow.)i32 > 5and5 < i32 < 10— scalar matches SIMD only with AVX-512Compare→bitmask needs mask registers (AVX-512
vpcmpd→k) to vectorize well.At SSE2/AVX2 the compiler leaves it scalar, so dispatched SIMD wins 2.7–11× in
cache. At native the compiler auto-vectorizes and scalar matches SIMD; at
DRAM sizes the op is memory-bandwidth-bound and the two converge regardless.
Honest caveats
betweenrewrite is a safety change (removesunsafe), not a speedup;an earlier "win" I reported was a divan harness artifact (it pessimizes the
index-closure form; production calls these directly and vectorizes).
bandwidth. So I expect it to show SIMD doing far fewer instructions than scalar
for all three ops at all sizes — including the
i32DRAM cases wherewall-clock is at parity. That corroborates "SIMD is more work-efficient" but
does not contradict the wall-clock memory-bound parity above.
Tests / checks
cargo nextest run -p vortex-array between(30 passed)compare_loweringverify()gate passes (all SIMD/SWAR variants match scalar)cargo clippy -p vortex-buffer --bench compare_loweringcleancargo +nightly fmt -p vortex-bufferBenchmarks run locally on one pinned core at SSE2 /
+avx2/target-cpu=native.Generated by Claude Code