perf: branchless primitive zip kernel #8270

Open
joseph-isaacs wants to merge 4 commits into develop from claude/primitive-branchless-zip

@joseph-isaacs joseph-isaacs commented Jun 5, 2026

Summary

Adds a dedicated primitive zip kernel that selects values branchlessly per row.

The generic zip path copies runs of if_true/if_false between mask boundaries — fast for clustered masks but degrading to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared zip_validity helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps.
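The per-row blend described above can be sketched as follows. This is a minimal illustration, not the kernel in this PR: the function name `select_values_sketch`, the fixed `u32` element type, and the mask layout (plain `u64` words, with bit `i % 64` of word `i / 64` belonging to row `i`) are all assumptions made for the example.

```rust
/// Hypothetical sketch: pick `if_true[i]` where mask bit `i` is set, else `if_false[i]`.
/// Assumes the mask is passed as 64-bit words, least-significant bit first.
fn select_values_sketch(mask_words: &[u64], if_true: &[u32], if_false: &[u32]) -> Vec<u32> {
    assert_eq!(if_true.len(), if_false.len());
    let len = if_true.len();
    assert!(mask_words.len() * 64 >= len, "mask too short for the inputs");

    let mut out = Vec::with_capacity(len);
    // Each mask word covers up to 64 rows; the last chunk may be a shorter remainder.
    for ((word, t), f) in mask_words
        .iter()
        .zip(if_true.chunks(64))
        .zip(if_false.chunks(64))
    {
        for (bit, (&tv, &fv)) in t.iter().zip(f).enumerate() {
            // All-ones when the mask bit is set, all-zeros otherwise.
            let keep = 0u32.wrapping_sub(((word >> bit) & 1) as u32);
            // Bitwise blend: no branch depends on the mask contents.
            out.push((tv & keep) | (fv & !keep));
        }
    }
    out
}
```

Because the inner loop does the same work for every row, its cost in this sketch does not depend on how fragmented the mask is.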

The branchless boolean zip kernel (#8275) this builds on has now merged into develop; this branch has been rebased on top of it, so the diff here is primitive-only.

Changes

  • vortex-array/src/arrays/primitive/compute/zip.rs: branchless per-row value blend; validity via the shared zip_validity; tests spanning the 64-bit chunk boundary + remainder, non-nullable and nullable.
  • vortex-array/src/arrays/primitive/compute/mod.rs, .../vtable/kernel.rs: register the kernel.
  • vortex-array/benches/primitive_zip.rs: a small divan bench — one Fragmented (alternating) mask, non-nullable and nullable.

Performance (divan, local walltime, median)

At LEN = 65_536 (matching the original measurements), the nullable Fragmented case — which routes validity through the shared zip_validity → boolean zip — drops from ~43 ms (generic builder, before the bool kernel) to ~73 µs; non-nullable is ~58 µs. The committed bench uses LEN = 16_384 so each case stays well under a few hundred microseconds under CodSpeed's instruction-count simulation.

Testing

  • vortex-array zip tests pass (25, incl. the new primitive cases); cargo +nightly fmt, clippy -D warnings (default + all-features), and cargo doc -D warnings clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

@joseph-isaacs joseph-isaacs added the changelog/performance label (A performance improvement) Jun 5, 2026, with Claude

codspeed-hq Bot commented Jun 5, 2026


Merging this PR will improve performance by 21.69%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
✅ 1516 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.8 µs | 31.7 µs | +47.6% |
| Simulation | varbinview_zip_block_mask | 3.7 ms | 2.9 ms | +27.5% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 198.5 µs | 161.6 µs | +22.87% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 213.3 µs | 176.9 µs | +20.59% |
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 309.8 µs | 273.6 µs | +13.21% |
| Simulation | varbinview_zip_fragmented_mask | 6.9 ms | 6.1 ms | +12.74% |
| Simulation | chunked_varbinview_into_canonical[(100, 100)] | 362.6 µs | 326.6 µs | +11.03% |
| 🆕 Simulation | nonnull | N/A | 252.8 µs | N/A |
| 🆕 Simulation | nullable | N/A | 276.5 µs | N/A |

Comparing claude/primitive-branchless-zip (336271c) with develop (efd3e9b)

@joseph-isaacs joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from 241e46d to bdc1c4b Compare June 5, 2026 18:29
@joseph-isaacs joseph-isaacs requested a review from a team June 5, 2026 18:29
@joseph-isaacs joseph-isaacs changed the base branch from develop to claude/bool-branchless-zip June 5, 2026 18:29
Base automatically changed from claude/bool-branchless-zip to develop June 9, 2026 13:14
Add a dedicated `ZipKernel for Primitive` that selects values branchlessly per
64-bit mask chunk instead of routing through the generic run-copy builder. For
each chunk it blends `if_true`/`if_false` per row without a data-dependent
branch, so the inner loop is auto-vectorizable and mask-shape-independent
(memory-bandwidth-bound): up to ~900x faster on fragmented masks and ~8x on
clustered masks, so no density-adaptive fallback is needed. Covers every native
ptype.

Result validity is computed by zipping the two boolean validity arrays with the
same mask -- `Validity::Array(zip(mask, if_true_valid, if_false_valid))` -- reusing
the zip machinery rather than re-deriving the mask algebra, with fast paths when
both sides' validity already agrees.
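
The validity selection above reduces to a bitwise blend when both validity bitmaps and the mask are viewed as packed words. A minimal sketch under that assumption follows; `zip_validity_words` is a hypothetical name for illustration and is not the shared `zip_validity` helper.

```rust
/// Hypothetical sketch: for each bit, take the `if_true` validity where the
/// mask bit is set and the `if_false` validity where it is clear.
/// Assumes all three bitmaps are packed into the same number of `u64` words.
fn zip_validity_words(mask: &[u64], true_valid: &[u64], false_valid: &[u64]) -> Vec<u64> {
    assert_eq!(mask.len(), true_valid.len());
    assert_eq!(mask.len(), false_valid.len());
    mask.iter()
        .zip(true_valid)
        .zip(false_valid)
        // Word-wise select: 64 validity bits are resolved per operation.
        .map(|((&m, &t), &f)| (m & t) | (!m & f))
        .collect()
}
```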

Adds a `primitive_zip` divan benchmark across fragmented/block/sparse/dense masks
for nullable and non-nullable inputs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from bdc1c4b to eb806b8 Compare June 9, 2026 13:30
claude added 2 commits June 9, 2026 13:58
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

@robert3005 robert3005 left a comment

The only comment I have is that `unaligned_chunks` is a faster iterator than `chunks`, but we can leave that for later.

Capture review feedback that `unaligned_chunks` is a faster single-buffer
iterator than `chunks` for the mask walk in `select_values`. No behavior change.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>