Skip to content

Compare→bitmask SIMD lowering: where custom SIMD actually helps#8280

Draft
joseph-isaacs wants to merge 10 commits into
developfrom
claude/vector-bitpack-lowering-yKOkC
Draft

Compare→bitmask SIMD lowering: where custom SIMD actually helps#8280
joseph-isaacs wants to merge 10 commits into
developfrom
claude/vector-bitpack-lowering-yKOkC

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

Summary

This branch investigates where hand-written SIMD genuinely beats compiler
auto-vectorization for compare → bitmask lowerings, and lands the one place
it clearly does plus a safety cleanup. It is intentionally draft — the
headline is the analysis, validated by the compare_lowering benchmark (and,
I expect, by the CodSpeed run on this PR).

What's here

  1. pack_nonzero_bytes (already on the branch): runtime-dispatched
    (avx512 → avx2 → scalar) SIMD pack of u8 != 0 into a bitmask, wired into
    the BitBuffer/BitBufferMut From<&[u8]> / From<&[bool]> paths.
  2. between safety cleanup: route primitive/decimal between through a new
    safe collect_bool_slice (drops an unsafe { *slice.get_unchecked(idx) }).
    This is perf-neutral, not a speedup — see below.
  3. compare_lowering bench: best portable scalar vs runtime-dispatched SIMD
    for three lowerings (u8 != 0, i32 > 5, 5 < i32 < 10), all writing the
    same u64 bitmask, with a verify() correctness gate.

The finding: it depends entirely on build target and working set

Vortex ships baseline x86-64 (SSE2) by default (no target-cpu in
.cargo/config.toml); CodSpeed builds with +avx2; the wall-clock bench CI
uses -C target-cpu=native (AVX-512). Whether scalar matches SIMD changes
across all three.

Wall-clock (taskset -c 0, median µs), all writing a u64 bitmask:

u8 != 0 — SIMD always wins, no scalar matches

target best scalar @16ki SIMD @16ki best scalar @1mi SIMD @1mi
SSE2 swar 2.5 0.13 (19×) swar 159 29 (5.4×)
AVX2 swar 1.8 0.13 (14×) swar 117 30 (3.9×)
native swar 1.4 0.13 (11×) swar 87 30 (2.9×)

The compiler can't synthesize the byte→bit pack; only a SIMD movemask/vptestmb
does it well. (The "best scalar" here is a carry-free SWAR — the textbook
haszero trick is wrong for an exact per-byte mask due to inter-byte borrow.)

i32 > 5 and 5 < i32 < 10 — scalar matches SIMD only with AVX-512

target scalar @16ki SIMD @16ki scalar @1mi SIMD @1mi
SSE2 12.2 1.1 (11×) 783 164 (4.8×)
AVX2 3.0 1.1 (2.7×) 197 165 (1.2×)
native 1.6 1.1 (1.4×) 161 161 (≈1.0×)

Compare→bitmask needs mask registers (AVX-512 vpcmpd→k) to vectorize well.
At SSE2/AVX2 the compiler leaves it scalar, so dispatched SIMD wins 2.7–11× in
cache. At native the compiler auto-vectorizes and scalar matches SIMD; at
DRAM sizes the op is memory-bandwidth-bound and the two converge regardless.

Honest caveats

  • The between rewrite is a safety change (removes unsafe), not a speedup;
    an earlier "win" I reported was a divan harness artifact (it pessimizes the
    index-closure form; production calls these directly and vectorizes).
  • CodSpeed runs in simulation/instruction-count mode, which ignores memory
    bandwidth. So I expect it to show SIMD doing far fewer instructions than scalar
    for all three ops at all sizes — including the i32 DRAM cases where
    wall-clock is at parity. That corroborates "SIMD is more work-efficient" but
    does not contradict the wall-clock memory-bound parity above.

Tests / checks

  • cargo nextest run -p vortex-array between (30 passed)
  • compare_lowering verify() gate passes (all SIMD/SWAR variants match scalar)
  • cargo clippy -p vortex-buffer --bench compare_lowering clean
  • cargo +nightly fmt -p vortex-buffer

Benchmarks run locally on one pinned core at SSE2 / +avx2 / target-cpu=native.


Generated by Claude Code

…lowering

Classify every collect-bool / bitmask-pack site in the workspace by whether it
can benefit from a vector compare -> opmask -> kmov lowering (vptestmb/vpcmpd)
instead of the scalar `packed |= (pred as u64) << i` idiom, which LLVM's SLP
vectorizer turns into a slow vpsllvq shift-OR reduction.

Findings: the chokepoint is BitBufferMut::collect_bool -> collect_bool_word; the
hottest contiguous callers are primitive `between` and FastLanes
`stream_predicate`. PEXT/PDEP/VBMI paths (count_ones, bool filter, bit_transpose,
intersect_by_rank) are already SIMD and left untouched. Includes compiler-pass
provenance (rustc emits scalar IR; SLPVectorizer is the culprit; X86 ISel knows
the good lowering) and measured speedups (~20x at L1 for collect_bool).

Analysis only; no kernels changed.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…itmask

Adds `pack_nonzero_bytes` (runtime-dispatched AVX-512BW vptestmb / AVX2
movemask / scalar fallback) and routes `BitBufferMut::from(&[u8])` and
`From<&[bool]>` through it via a new `from_nonzero_bytes` constructor.

The previous path went through the closure-generic `collect_bool` ->
`collect_bool_word` scalar `packed |= (b != 0) << i` idiom, which LLVM's SLP
vectorizer lowers to a slow vpsllvq shift-OR reduction instead of a
mask-compare. Benchmarks on the default (baseline) build show the real
`BitBufferMut::from(&[u8])` API going from ~910us to ~27us at 1 MiB (~34x;
1.16 GB/s -> 39 GB/s), and 12-25x even when the scalar path is allowed to
auto-vectorize with target-cpu=native.

Adds a `matches_reference` unit test and three divan benches
(pack_truthy_bytes, pack_truthy_bytes_simd, bitbuffer_from_u8). All 784
vortex-buffer tests pass; lib clippy clean.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds between_i32_scalar (the exact `collect_bool_words` predicate that
`primitive between` uses) vs between_i32_simd (AVX-512 vpcmpd + kmovw),
showing 29-71x for the typed-comparison generalization of pack_nonzero_bytes.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ndings

Adds offsets_eq_scalar/offsets_eq_simd (the varbin compare_offsets_to_empty
shape: off[i]==off[i+1] -> vpcmpeqd+kmovw), measuring 5.4-14x.

Updates the audit doc with: the three patterns unified as one parameterized
shape; the std::simd finding (portable_simd to_bitmask lowers byte-identically
to the intrinsic, ~zero perf delta, but nightly-only); the in-tree benchmark
table for all three shapes; and the full ranked typed-comparison site list from
the exhaustive search (28 sites, top-3 contiguous targets).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
An adversarial re-review of the slowness claims found two corrections:
- The in-tree scalar gap is TWO stacked factors, not one: a bounds-checked
  closure (~5.5x under AVX-512, a cheap safe source fix) AND the SLP shift-OR
  idiom (~8x residual, the real LLVM gap). Prior text understated bounds-checks.
- "std::simd byte-identical to the intrinsic" was overstated: the hot loop is
  identical with equivalent perf, but from_slice adds a bounds-check branch.

Confirmed unchanged: it is NOT a build-flag artifact (no target-cpu flag gets
scalar within 2x of the intrinsic; asm shows vpsllvq reduction even with full
AVX-512 and a known trip count), and the default-build magnitude (40-70x) holds.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds `pack_slice_predicate(words, values, pred)`: a safe, stable (no `unsafe`,
no SIMD intrinsics) packer that takes a slice + per-element predicate and
iterates via `chunks_exact(64)`. Unlike the index-closure `collect_bool_words`
(`|i| values[i] > k`), there is no per-element bounds check, so LLVM can
auto-vectorize the scalar shift-OR loop the index-closure form blocks.

Quantified against the bounds-checked closure and the AVX-512 intrinsic
(new `*_chunked` benches; see dev-notes/collect-bool-simd-audit.md):
- i32 `between`: recovers ~6x at the default baseline build and ~9.6-11.5x under
  target-cpu=native; at 1Mi/native it lands within 1.04x of hand AVX-512.
- byte `!=0`: ~no change at baseline, ~4x under native.

This isolates the bounds-checked index closure as the dominant, cheaply-fixable
factor for the realistic compare->bitmask shapes, separate from the residual
LLVM SLP shift-OR gap that only intrinsics/std::simd close.

Includes an rstest correctness test against `collect_bool_words` across tail
sizes (caught and fixed a zip over-consumption bug in the remainder write).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…e slice packer

Add `BitBuffer{,Mut}::collect_bool_slice(values, pred)` (wrapping
`pack_slice_predicate`) and use it in `primitive` and `decimal` `between`,
replacing the `BitBuffer::collect_bool` index closure that did
`unsafe { *slice.get_unchecked(idx) }`. This drops the `unsafe`.

Performance: NEUTRAL. Validated three independent ways (clean standalone builds
calling the real APIs, Valgrind callgrind instruction counts, and the
codspeed-divan benches run locally):
- SSE2 baseline: 839 -> 835 us/1Mi; callgrind 260.4M -> 262.3M instructions.
- target-cpu=native: 174 -> 173 us/1Mi; 1.90 -> 1.79 us/16Ki.

The i32 compare already auto-vectorizes in production (direct, fully-inlined
calls), so this is a safety/readability cleanup, not a speedup. NOTE: the
"recovers ~6-11x" figures in 0320bbe were a divan-harness artifact -- the bench
harness blocks auto-vectorization of the index-closure form only; production is
unaffected. The genuine win from this line of work is `pack_nonzero_bytes`
(byte != 0 pack), which does not auto-vectorize and is ~50-100x faster than
develop's scalar `collect_bool` at the SSE2 baseline.

Adds A/B benches (between_i32_unchecked, between_bitbuffer_{original,new}) used
for this validation.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
Wrap the multi-arg `#[allow(...)]` across lines per `cargo +nightly fmt`.
No behavior change.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
Add `compare_lowering` bench comparing the best portable scalar form against
runtime-dispatched SIMD (avx512->avx2->scalar) for three compare->bitmask
lowerings, all writing the same u64 bitmask:

  * u8 != 0       (byte truthiness pack)
  * i32 > 5       (single comparison)
  * 5 < i32 < 10  (between)

SIMD paths always do real vector work under CodSpeed's `+avx2` build (u8 reuses
the production `pack_nonzero_bytes`). A `verify()` step cross-checks every
variant against the scalar reference before benchmarking so a miscompiled
lowering fails loudly instead of reporting fast-but-wrong.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 7, 2026 — with Claude
- Remove dev-notes/collect-bool-simd-audit.md: it lacked a REUSE/SPDX header
  (reuse-check failure) and its numbers are superseded by the validated findings
  in PR #8280's description.
- Allow `clippy::many_single_char_names` in the vortex_bitbuffer bench (terse
  SIMD/math names), matching compare_lowering, to fix the `-D warnings` lint job.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 7, 2026

Merging this PR will degrade performance by 16.61%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 2 regressed benchmarks
✅ 1511 untouched benchmarks
🆕 62 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation varbinview_zip_block_mask 2.9 ms 3.7 ms -21.59%
Simulation varbinview_zip_fragmented_mask 6.1 ms 6.9 ms -11.31%
🆕 Simulation between_scalar_pack[1048576] N/A 2.6 ms N/A
🆕 Simulation between_scalar_pack[16384] N/A 41.2 µs N/A
🆕 Simulation between_simd_bench[1048576] N/A 2.3 ms N/A
🆕 Simulation between_simd_bench[16384] N/A 36.3 µs N/A
🆕 Simulation gt_scalar_pack[1048576] N/A 2.5 ms N/A
🆕 Simulation gt_scalar_pack[16384] N/A 39.7 µs N/A
🆕 Simulation gt_simd_bench[1048576] N/A 2.2 ms N/A
🆕 Simulation gt_simd_bench[16384] N/A 34.9 µs N/A
🆕 Simulation u8_scalar_pack[1048576] N/A 979.9 µs N/A
🆕 Simulation u8_scalar_pack[16384] N/A 16.1 µs N/A
🆕 Simulation u8_scalar_swar[1048576] N/A 934.2 µs N/A
🆕 Simulation u8_scalar_swar[16384] N/A 15.2 µs N/A
🆕 Simulation u8_simd[1048576] N/A 610.9 µs N/A
🆕 Simulation u8_simd[16384] N/A 10 µs N/A
🆕 Simulation between_bitbuffer_new[1024] N/A 5.7 µs N/A
🆕 Simulation between_bitbuffer_new[1048576] N/A 2.8 ms N/A
🆕 Simulation between_bitbuffer_new[16384] N/A 47 µs N/A
🆕 Simulation between_bitbuffer_new[262144] N/A 692.4 µs N/A
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/vector-bitpack-lowering-yKOkC (5d476ef) with develop (e06d80b)

Open in CodSpeed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant