Compare→bitmask SIMD lowering: where custom SIMD actually helps by joseph-isaacs · Pull Request #8280 · vortex-data/vortex

joseph-isaacs · 2026-06-07T00:20:57Z

Summary

This branch investigates where hand-written SIMD genuinely beats compiler
auto-vectorization for compare → bitmask lowerings, and lands the one place
it clearly does plus a safety cleanup. It is intentionally draft — the
headline is the analysis, validated by the compare_lowering benchmark (and,
I expect, by the CodSpeed run on this PR).

What's here

- pack_nonzero_bytes (already on the branch): runtime-dispatched
  (avx512 → avx2 → scalar) SIMD pack of u8 != 0 into a bitmask, wired into
  the BitBuffer/BitBufferMut From<&[u8]> / From<&[bool]> paths.
- between safety cleanup: route primitive/decimal between through a new
  safe collect_bool_slice (drops an unsafe { *slice.get_unchecked(idx) }).
  This is perf-neutral, not a speedup; see below.
- compare_lowering bench: best portable scalar vs runtime-dispatched SIMD
  for three lowerings (u8 != 0, i32 > 5, 5 < i32 < 10), all writing the
  same u64 bitmask, with a verify() correctness gate.
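
For readers who have not seen the shape before, here is a minimal sketch of a runtime-dispatched byte pack. It assumes only an AVX2 tier plus a scalar fallback; the names `scalar_word`, `pack_nonzero_scalar`, `pack_nonzero_avx2` and `pack_nonzero_dispatch` are hypothetical, and this is not the code on the branch (which also has an AVX-512 tier). It only illustrates the compare + movemask idea.

```rust
/// Packs up to 64 bytes into one word: bit i is set iff bytes[i] != 0.
fn scalar_word(bytes: &[u8]) -> u64 {
    debug_assert!(bytes.len() <= 64);
    bytes
        .iter()
        .enumerate()
        .fold(0u64, |w, (i, &b)| w | (u64::from(b != 0) << i))
}

/// Portable reference implementation.
pub fn pack_nonzero_scalar(bytes: &[u8]) -> Vec<u64> {
    bytes.chunks(64).map(scalar_word).collect()
}

/// AVX2 tier (hypothetical sketch): two 32-byte compares + movemask per word.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn pack_nonzero_avx2(bytes: &[u8]) -> Vec<u64> {
    use std::arch::x86_64::*;
    let mut words = vec![0u64; bytes.len().div_ceil(64)];
    let chunks = bytes.chunks_exact(64);
    let tail = chunks.remainder();
    let (full, last) = words.split_at_mut(bytes.len() / 64);
    // SAFETY: every chunk is exactly 64 bytes, so both unaligned 32-byte
    // loads stay in bounds; the caller guarantees AVX2 is available.
    unsafe {
        let zero = _mm256_setzero_si256();
        for (word, chunk) in full.iter_mut().zip(chunks) {
            let lo = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
            let hi = _mm256_loadu_si256(chunk.as_ptr().add(32) as *const __m256i);
            // cmpeq yields 0xFF where the byte is zero; movemask gathers the
            // top bit of each byte, so these masks mark the ZERO bytes.
            let lo_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(lo, zero)) as u32;
            let hi_zero = _mm256_movemask_epi8(_mm256_cmpeq_epi8(hi, zero)) as u32;
            // Invert to get the "!= 0" mask.
            *word = !(u64::from(lo_zero) | (u64::from(hi_zero) << 32));
        }
    }
    // The partial last word (if any) is packed with the scalar helper.
    if let Some(word) = last.first_mut() {
        *word = scalar_word(tail);
    }
    words
}

/// Picks the widest available tier at runtime, falling back to scalar.
pub fn pack_nonzero_dispatch(bytes: &[u8]) -> Vec<u64> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked.
            return unsafe { pack_nonzero_avx2(bytes) };
        }
    }
    pack_nonzero_scalar(bytes)
}
```

Calling `pack_nonzero_dispatch(&[0, 1, 0, 2])` yields a single word with bits 1 and 3 set. Because the dispatch happens at runtime, the vector path is taken even in a baseline (SSE2) build, which is the property the numbers below depend on.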

The finding: it depends entirely on build target and working set

Vortex ships baseline x86-64 (SSE2) by default (no target-cpu in
.cargo/config.toml); CodSpeed builds with +avx2; the wall-clock bench CI
uses -C target-cpu=native (AVX-512). Whether scalar matches SIMD changes
across all three.

Wall-clock (taskset -c 0, median µs), all writing a u64 bitmask:

`u8 != 0` — SIMD always wins, no scalar matches

target	best scalar @16ki	SIMD @16ki	best scalar @1mi	SIMD @1mi
SSE2	swar 2.5	0.13 (19×)	swar 159	29 (5.4×)
AVX2	swar 1.8	0.13 (14×)	swar 117	30 (3.9×)
native	swar 1.4	0.13 (11×)	swar 87	30 (2.9×)

The compiler can't synthesize the byte→bit pack; only a SIMD movemask/vptestmb
does it well. (The "best scalar" here is a carry-free SWAR — the textbook
haszero trick is wrong for an exact per-byte mask due to inter-byte borrow.)
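
To make the borrow problem concrete, here is one possible carry-free SWAR next to the textbook haszero expression. This is a sketch, not necessarily the variant used in the bench; the constant and function names are hypothetical. The 7-bit add cannot carry out of a byte (at most 0x7f + 0x7f = 0xfe), whereas the subtraction in haszero can borrow from a neighbouring byte.

```rust
const LOW7: u64 = 0x7f7f_7f7f_7f7f_7f7f;
const HIGH: u64 = 0x8080_8080_8080_8080;
const ONES: u64 = 0x0101_0101_0101_0101;
// Multiplier that moves the flag of byte i (at bit 8*i) to bit 56 + i.
const GATHER: u64 = 0x0102_0408_1020_4080;

/// Textbook haszero: correct for "is ANY byte zero", but the subtraction
/// borrows across bytes, so individual flags can be wrong.
pub fn haszero_flags(v: u64) -> u64 {
    v.wrapping_sub(ONES) & !v & HIGH
}

/// Carry-free: the high bit of each byte is set iff that byte is non-zero.
pub fn nonzero_flags(v: u64) -> u64 {
    (((v & LOW7) + LOW7) | v) & HIGH
}

/// Packs 8 bytes into an 8-bit mask: bit i is set iff bytes[i] != 0.
pub fn nonzero_mask8(bytes: [u8; 8]) -> u8 {
    let flags = nonzero_flags(u64::from_le_bytes(bytes)) >> 7;
    // Each partial product lands on a distinct bit, so the multiply acts as
    // a gather with no carries; the packed mask ends up in the top byte.
    (flags.wrapping_mul(GATHER) >> 56) as u8
}
```

For the input bytes `[0x00, 0x01, 0, 0, 0, 0, 0, 0]`, `haszero_flags` flags all eight bytes as zero, because the borrow out of byte 0 turns byte 1 into 0xff. `nonzero_mask8` returns `0b10` for the same input.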

`i32 > 5` and `5 < i32 < 10` — scalar matches SIMD only with AVX-512

target	scalar @16ki	SIMD @16ki	scalar @1mi	SIMD @1mi
SSE2	12.2	1.1 (11×)	783	164 (4.8×)
AVX2	3.0	1.1 (2.7×)	197	165 (1.2×)
native	1.6	1.1 (1.4×)	161	161 (≈1.0×)

Compare→bitmask needs mask registers (AVX-512 vpcmpd→k) to vectorize well.
At SSE2/AVX2 the compiler leaves it scalar, so dispatched SIMD wins 2.7–11× in
cache. At native the compiler auto-vectorizes and scalar matches SIMD; at
DRAM sizes the op is memory-bandwidth-bound and the two converge regardless.
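
As an illustration of what a dispatched path for `i32 > k` can look like without mask registers, here is a hedged AVX2 sketch: compare eight lanes, then pull one bit per lane with a float movemask. The names `gt_word`, `gt_scalar`, `gt_avx2` and `gt_dispatch` are hypothetical and the bench's actual kernels may differ. The AVX-512 `vpcmpd` → k-register form is omitted here.

```rust
/// Packs up to 64 comparisons into one word: bit i is set iff values[i] > k.
fn gt_word(values: &[i32], k: i32) -> u64 {
    debug_assert!(values.len() <= 64);
    values
        .iter()
        .enumerate()
        .fold(0u64, |w, (i, &v)| w | (u64::from(v > k) << i))
}

/// Portable reference implementation.
pub fn gt_scalar(values: &[i32], k: i32) -> Vec<u64> {
    values.chunks(64).map(|c| gt_word(c, k)).collect()
}

/// AVX2 tier (hypothetical sketch): 8 compares of 8 lanes each per word.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gt_avx2(values: &[i32], k: i32) -> Vec<u64> {
    use std::arch::x86_64::*;
    let mut words = vec![0u64; values.len().div_ceil(64)];
    let chunks = values.chunks_exact(64);
    let tail = chunks.remainder();
    let (full, last) = words.split_at_mut(values.len() / 64);
    // SAFETY: each chunk holds exactly 64 i32 values, so the eight unaligned
    // 8-lane loads stay in bounds; the caller guarantees AVX2 is available.
    unsafe {
        let kv = _mm256_set1_epi32(k);
        for (word, chunk) in full.iter_mut().zip(chunks) {
            let mut packed = 0u64;
            for group in 0..8 {
                let v = _mm256_loadu_si256(chunk.as_ptr().add(group * 8) as *const __m256i);
                // All-ones in every lane where v > k (signed compare).
                let gt = _mm256_cmpgt_epi32(v, kv);
                // The float movemask extracts the sign bit of each 32-bit lane.
                let bits = _mm256_movemask_ps(_mm256_castsi256_ps(gt)) as u32;
                packed |= u64::from(bits) << (group * 8);
            }
            *word = packed;
        }
    }
    // The partial last word (if any) is packed with the scalar helper.
    if let Some(word) = last.first_mut() {
        *word = gt_word(tail, k);
    }
    words
}

/// Runtime dispatch with a scalar fallback.
pub fn gt_dispatch(values: &[i32], k: i32) -> Vec<u64> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked.
            return unsafe { gt_avx2(values, k) };
        }
    }
    gt_scalar(values, k)
}
```

The `between` case has the same structure with two compares ANDed together before the movemask.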

Honest caveats

- The between rewrite is a safety change (removes unsafe), not a speedup;
  an earlier "win" I reported was a divan harness artifact (it pessimizes the
  index-closure form; production calls these directly and vectorizes).
- CodSpeed runs in simulation/instruction-count mode, which ignores memory
  bandwidth. So I expect it to show SIMD doing far fewer instructions than scalar
  for all three ops at all sizes, including the i32 DRAM cases where
  wall-clock is at parity. That corroborates "SIMD is more work-efficient" but
  does not contradict the wall-clock memory-bound parity above.
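
A quick way to read the 1 Mi columns above is to convert them into implied input throughput. The helper below is a hypothetical sketch that only does the arithmetic (input bytes divided by median time); it counts input bytes only, and whether the resulting figure is the bandwidth ceiling of the machine used is not something the tables themselves establish.

```rust
/// Implied input throughput in GB/s (10^9 bytes per second) for an op that
/// reads `input_bytes` bytes in `median_micros` microseconds.
pub fn implied_gb_per_s(input_bytes: u64, median_micros: f64) -> f64 {
    input_bytes as f64 / (median_micros * 1e-6) / 1e9
}
```

With the table values: 1 Mi `i32` values are 4 MiB, so 161 µs at native corresponds to roughly 26 GB/s for both scalar and SIMD, while the SSE2 scalar row (783 µs) is about 5.4 GB/s. For the byte pack, 1 MiB in 30 µs is about 35 GB/s.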

Tests / checks

- cargo nextest run -p vortex-array between (30 passed)
- compare_lowering verify() gate passes (all SIMD/SWAR variants match scalar)
- cargo clippy -p vortex-buffer --bench compare_lowering clean
- cargo +nightly fmt -p vortex-buffer

Benchmarks run locally on one pinned core at SSE2 / +avx2 / target-cpu=native.

Generated by Claude Code

…lowering Classify every collect-bool / bitmask-pack site in the workspace by whether it can benefit from a vector compare -> opmask -> kmov lowering (vptestmb/vpcmpd) instead of the scalar `packed |= (pred as u64) << i` idiom, which LLVM's SLP vectorizer turns into a slow vpsllvq shift-OR reduction. Findings: the chokepoint is BitBufferMut::collect_bool -> collect_bool_word; the hottest contiguous callers are primitive `between` and FastLanes `stream_predicate`. PEXT/PDEP/VBMI paths (count_ones, bool filter, bit_transpose, intersect_by_rank) are already SIMD and left untouched. Includes compiler-pass provenance (rustc emits scalar IR; SLPVectorizer is the culprit; X86 ISel knows the good lowering) and measured speedups (~20x at L1 for collect_bool). Analysis only; no kernels changed. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
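
For reference, the scalar idiom this commit message refers to has roughly the following source shape. This is a sketch with a hypothetical name, not the body of `collect_bool_word`; it shows only the `packed |= (pred as u64) << i` loop, not what LLVM does with it.

```rust
/// Packs `len` (at most 64) predicate results into one word via an
/// index closure: bit i is set iff pred(i) is true.
pub fn collect_bool_word_sketch(len: usize, pred: impl Fn(usize) -> bool) -> u64 {
    assert!(len <= 64);
    let mut packed = 0u64;
    for i in 0..len {
        // One shift-OR per element; this is the pattern under discussion.
        packed |= (pred(i) as u64) << i;
    }
    packed
}
```

A caller would typically close over a slice, for example `collect_bool_word_sketch(values.len(), |i| values[i] != 0)`, which is where the per-element bounds check discussed in later commits comes from.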

…itmask Adds `pack_nonzero_bytes` (runtime-dispatched AVX-512BW vptestmb / AVX2 movemask / scalar fallback) and routes `BitBufferMut::from(&[u8])` and `From<&[bool]>` through it via a new `from_nonzero_bytes` constructor. The previous path went through the closure-generic `collect_bool` -> `collect_bool_word` scalar `packed |= (b != 0) << i` idiom, which LLVM's SLP vectorizer lowers to a slow vpsllvq shift-OR reduction instead of a mask-compare. Benchmarks on the default (baseline) build show the real `BitBufferMut::from(&[u8])` API going from ~910us to ~27us at 1 MiB (~34x; 1.16 GB/s -> 39 GB/s), and 12-25x even when the scalar path is allowed to auto-vectorize with target-cpu=native. Adds a `matches_reference` unit test and three divan benches (pack_truthy_bytes, pack_truthy_bytes_simd, bitbuffer_from_u8). All 784 vortex-buffer tests pass; lib clippy clean. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Adds between_i32_scalar (the exact `collect_bool_words` predicate that `primitive between` uses) vs between_i32_simd (AVX-512 vpcmpd + kmovw), showing 29-71x for the typed-comparison generalization of pack_nonzero_bytes. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…ndings Adds offsets_eq_scalar/offsets_eq_simd (the varbin compare_offsets_to_empty shape: off[i]==off[i+1] -> vpcmpeqd+kmovw), measuring 5.4-14x. Updates the audit doc with: the three patterns unified as one parameterized shape; the std::simd finding (portable_simd to_bitmask lowers byte-identically to the intrinsic, ~zero perf delta, but nightly-only); the in-tree benchmark table for all three shapes; and the full ranked typed-comparison site list from the exhaustive search (28 sites, top-3 contiguous targets). Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
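
The `off[i] == off[i+1]` shape mentioned here can be sketched in scalar form as below. The function name and the `u32` offset type are assumptions made for illustration (the message only implies 32-bit lanes via `vpcmpeqd`); this is not the benchmarked `offsets_eq_scalar`.

```rust
/// For n + 1 offsets, produces n bits: bit i is set iff
/// offsets[i] == offsets[i + 1], i.e. element i is empty.
pub fn offsets_eq_sketch(offsets: &[u32]) -> Vec<u64> {
    let n = offsets.len().saturating_sub(1);
    let mut words = vec![0u64; n.div_ceil(64)];
    for (i, pair) in offsets.windows(2).enumerate() {
        words[i / 64] |= u64::from(pair[0] == pair[1]) << (i % 64);
    }
    words
}
```

For offsets `[0, 0, 3, 3, 5]` the four elements have lengths 0, 3, 0 and 2, so bits 0 and 2 are set.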

An adversarial re-review of the slowness claims found two corrections:
- The in-tree scalar gap is TWO stacked factors, not one: a bounds-checked closure (~5.5x under AVX-512, a cheap safe source fix) AND the SLP shift-OR idiom (~8x residual, the real LLVM gap). Prior text understated bounds-checks.
- "std::simd byte-identical to the intrinsic" was overstated: the hot loop is identical with equivalent perf, but from_slice adds a bounds-check branch.

Confirmed unchanged: it is NOT a build-flag artifact (no target-cpu flag gets scalar within 2x of the intrinsic; asm shows vpsllvq reduction even with full AVX-512 and a known trip count), and the default-build magnitude (40-70x) holds.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Adds `pack_slice_predicate(words, values, pred)`: a safe, stable (no `unsafe`, no SIMD intrinsics) packer that takes a slice + per-element predicate and iterates via `chunks_exact(64)`. Unlike the index-closure `collect_bool_words` (`|i| values[i] > k`), there is no per-element bounds check, so LLVM can auto-vectorize the scalar shift-OR loop the index-closure form blocks.

Quantified against the bounds-checked closure and the AVX-512 intrinsic (new `*_chunked` benches; see dev-notes/collect-bool-simd-audit.md):
- i32 `between`: recovers ~6x at the default baseline build and ~9.6-11.5x under target-cpu=native; at 1Mi/native it lands within 1.04x of hand AVX-512.
- byte `!=0`: ~no change at baseline, ~4x under native.

This isolates the bounds-checked index closure as the dominant, cheaply-fixable factor for the realistic compare->bitmask shapes, separate from the residual LLVM SLP shift-OR gap that only intrinsics/std::simd close. Includes an rstest correctness test against `collect_bool_words` across tail sizes (caught and fixed a zip over-consumption bug in the remainder write).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
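
A minimal sketch of a slice packer of this kind is shown below, assuming the signature given in the message. The name `pack_slice_predicate_sketch` and the exact body are hypothetical, not the code on the branch. It iterates with `chunks_exact(64)`, so the inner loop has no index into `values` and therefore no per-element bounds check.

```rust
/// Writes one bit per element of `values` into `words`:
/// bit (i % 64) of words[i / 64] is set iff pred(values[i]) is true.
/// `words` must hold exactly ceil(values.len() / 64) entries.
pub fn pack_slice_predicate_sketch<T: Copy>(
    words: &mut [u64],
    values: &[T],
    pred: impl Fn(T) -> bool,
) {
    assert_eq!(words.len(), values.len().div_ceil(64));

    // Packs at most 64 elements into a single word.
    let pack = |chunk: &[T]| -> u64 {
        let mut packed = 0u64;
        for (i, &v) in chunk.iter().enumerate() {
            packed |= u64::from(pred(v)) << i;
        }
        packed
    };

    let chunks = values.chunks_exact(64);
    // Take the remainder before the iterator is consumed by `zip`.
    let tail = chunks.remainder();
    // Split the output so full words and the tail word never alias.
    let (full, last) = words.split_at_mut(values.len() / 64);
    for (word, chunk) in full.iter_mut().zip(chunks) {
        *word = pack(chunk);
    }
    if let Some(word) = last.first_mut() {
        *word = pack(tail);
    }
}
```

Splitting `words` into full words and an optional tail word up front is one way to keep the remainder write independent of how far a `zip` has advanced either iterator.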

…e slice packer

Add `BitBuffer{,Mut}::collect_bool_slice(values, pred)` (wrapping `pack_slice_predicate`) and use it in `primitive` and `decimal` `between`, replacing the `BitBuffer::collect_bool` index closure that did `unsafe { *slice.get_unchecked(idx) }`. This drops the `unsafe`.

Performance: NEUTRAL. Validated three independent ways (clean standalone builds calling the real APIs, Valgrind callgrind instruction counts, and the codspeed-divan benches run locally):
- SSE2 baseline: 839 -> 835 us/1Mi; callgrind 260.4M -> 262.3M instructions.
- target-cpu=native: 174 -> 173 us/1Mi; 1.90 -> 1.79 us/16Ki.

The i32 compare already auto-vectorizes in production (direct, fully-inlined calls), so this is a safety/readability cleanup, not a speedup.

NOTE: the "recovers ~6-11x" figures in 0320bbe were a divan-harness artifact -- the bench harness blocks auto-vectorization of the index-closure form only; production is unaffected. The genuine win from this line of work is `pack_nonzero_bytes` (byte != 0 pack), which does not auto-vectorize and is ~50-100x faster than develop's scalar `collect_bool` at the SSE2 baseline.

Adds A/B benches (between_i32_unchecked, between_bitbuffer_{original,new}) used for this validation.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED

Wrap the multi-arg `#[allow(...)]` across lines per `cargo +nightly fmt`. No behavior change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED

Add `compare_lowering` bench comparing the best portable scalar form against runtime-dispatched SIMD (avx512->avx2->scalar) for three compare->bitmask lowerings, all writing the same u64 bitmask:
* u8 != 0 (byte truthiness pack)
* i32 > 5 (single comparison)
* 5 < i32 < 10 (between)

SIMD paths always do real vector work under CodSpeed's `+avx2` build (u8 reuses the production `pack_nonzero_bytes`). A `verify()` step cross-checks every variant against the scalar reference before benchmarking so a miscompiled lowering fails loudly instead of reporting fast-but-wrong.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED

- Remove dev-notes/collect-bool-simd-audit.md: it lacked a REUSE/SPDX header (reuse-check failure) and its numbers are superseded by the validated findings in PR #8280's description.
- Allow `clippy::many_single_char_names` in the vortex_bitbuffer bench (terse SIMD/math names), matching compare_lowering, to fix the `-D warnings` lint job.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01FyAXXAdpt5hbmZRDgAs3ED

codspeed-hq · 2026-06-07T00:30:20Z

Merging this PR will degrade performance by 16.61%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 2 regressed benchmarks
✅ 1511 untouched benchmarks
🆕 62 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`varbinview_zip_block_mask`	2.9 ms	3.7 ms	-21.59%
❌	Simulation	`varbinview_zip_fragmented_mask`	6.1 ms	6.9 ms	-11.31%
🆕	Simulation	`between_scalar_pack[1048576]`	N/A	2.6 ms	N/A
🆕	Simulation	`between_scalar_pack[16384]`	N/A	41.2 µs	N/A
🆕	Simulation	`between_simd_bench[1048576]`	N/A	2.3 ms	N/A
🆕	Simulation	`between_simd_bench[16384]`	N/A	36.3 µs	N/A
🆕	Simulation	`gt_scalar_pack[1048576]`	N/A	2.5 ms	N/A
🆕	Simulation	`gt_scalar_pack[16384]`	N/A	39.7 µs	N/A
🆕	Simulation	`gt_simd_bench[1048576]`	N/A	2.2 ms	N/A
🆕	Simulation	`gt_simd_bench[16384]`	N/A	34.9 µs	N/A
🆕	Simulation	`u8_scalar_pack[1048576]`	N/A	979.9 µs	N/A
🆕	Simulation	`u8_scalar_pack[16384]`	N/A	16.1 µs	N/A
🆕	Simulation	`u8_scalar_swar[1048576]`	N/A	934.2 µs	N/A
🆕	Simulation	`u8_scalar_swar[16384]`	N/A	15.2 µs	N/A
🆕	Simulation	`u8_simd[1048576]`	N/A	610.9 µs	N/A
🆕	Simulation	`u8_simd[16384]`	N/A	10 µs	N/A
🆕	Simulation	`between_bitbuffer_new[1024]`	N/A	5.7 µs	N/A
🆕	Simulation	`between_bitbuffer_new[1048576]`	N/A	2.8 ms	N/A
🆕	Simulation	`between_bitbuffer_new[16384]`	N/A	47 µs	N/A
🆕	Simulation	`between_bitbuffer_new[262144]`	N/A	692.4 µs	N/A
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/vector-bitpack-lowering-yKOkC (5d476ef) with develop (e06d80b)

joseph-isaacs added 9 commits June 6, 2026 15:27

style(vortex-buffer): rustfmt the pack_slice_predicate test attribute

6f994bf

joseph-isaacs added the changelog/performance A performance improvement label Jun 7, 2026 — with Claude

joseph-isaacs wants to merge 10 commits into develop from claude/vector-bitpack-lowering-yKOkC
