perf: branchless primitive zip kernel by joseph-isaacs · Pull Request #8270 · vortex-data/vortex

joseph-isaacs · 2026-06-05T15:38:25Z

Summary

Adds a dedicated primitive zip kernel that selects values branchlessly per row.

The generic zip path copies runs of if_true/if_false between mask boundaries — fast for clustered masks but degrading to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared zip_validity helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps.
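The per-row blend described above can be sketched as follows. This is a minimal illustration and not the code in `zip.rs`: the function name `zip_branchless`, the fixed `u32` element type, and the LSB-first `&[u64]` mask packing are all assumptions made for the example.

```rust
/// Hypothetical sketch: out[i] = if mask bit i is set { if_true[i] } else { if_false[i] }.
/// Assumes `mask_words` packs the mask LSB-first, 64 rows per u64.
fn zip_branchless(mask_words: &[u64], if_true: &[u32], if_false: &[u32]) -> Vec<u32> {
    let len = if_true.len();
    assert_eq!(len, if_false.len(), "inputs must have equal length");
    assert!(mask_words.len() * 64 >= len, "mask too short");

    let mut out = vec![0u32; len];
    // Walk 64 rows at a time; the final chunk may be shorter (the remainder).
    for (chunk_idx, ((out_chunk, t_chunk), f_chunk)) in out
        .chunks_mut(64)
        .zip(if_true.chunks(64))
        .zip(if_false.chunks(64))
        .enumerate()
    {
        let word = mask_words[chunk_idx];
        for (bit, ((o, &t), &f)) in out_chunk.iter_mut().zip(t_chunk).zip(f_chunk).enumerate() {
            // All-ones when the mask bit is set, all-zeros otherwise.
            // The selection is pure arithmetic, so no branch depends on the mask data.
            let m = 0u32.wrapping_sub(((word >> bit) & 1) as u32);
            *o = (t & m) | (f & !m);
        }
    }
    out
}
```

The inner loop does the same work whether the mask is clustered or alternating, which is the property the description relies on. Whether a given compiler actually vectorizes this exact loop is not something the sketch demonstrates.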

The branchless boolean zip kernel (#8275) this builds on has now merged into develop; this branch has been rebased on top of it, so the diff here is primitive-only.

Changes

vortex-array/src/arrays/primitive/compute/zip.rs: branchless per-row value blend; validity via the shared zip_validity; tests spanning the 64-bit chunk boundary + remainder, non-nullable and nullable.
vortex-array/src/arrays/primitive/compute/mod.rs, .../vtable/kernel.rs: register the kernel.
vortex-array/benches/primitive_zip.rs: a small divan bench — one Fragmented (alternating) mask, non-nullable and nullable.

Performance (divan, local walltime, median)

At LEN = 65_536 (matching the original measurements), the nullable Fragmented case — which routes validity through the shared zip_validity → boolean zip — drops from ~43 ms (generic builder, before the bool kernel) to ~73 µs; non-nullable is ~58 µs. The committed bench uses LEN = 16_384 so each case stays well under a few hundred microseconds under CodSpeed's instruction-count simulation.

Testing

vortex-array zip tests pass (25, incl. the new primitive cases); cargo +nightly fmt, clippy -D warnings (default + all-features), and cargo doc -D warnings clean.

https://claude.ai/code/session_01N5ivPiCJy7dGQjMEP7ips9

codspeed-hq · 2026-06-05T15:47:31Z

Merging this PR will improve performance by 21.69%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
✅ 1516 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | `chunked_bool_canonical_into[(1000, 10)]` | 46.8 µs | 31.7 µs | +47.6% |
| ⚡ | Simulation | `varbinview_zip_block_mask` | 3.7 ms | 2.9 ms | +27.5% |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(1000, 10)]` | 198.5 µs | 161.6 µs | +22.87% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(1000, 10)]` | 213.3 µs | 176.9 µs | +20.59% |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 309.8 µs | 273.6 µs | +13.21% |
| ⚡ | Simulation | `varbinview_zip_fragmented_mask` | 6.9 ms | 6.1 ms | +12.74% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(100, 100)]` | 362.6 µs | 326.6 µs | +11.03% |
| 🆕 | Simulation | `nonnull` | N/A | 252.8 µs | N/A |
| 🆕 | Simulation | `nullable` | N/A | 276.5 µs | N/A |

Comparing claude/primitive-branchless-zip (336271c) with develop (efd3e9b)

Add a dedicated `ZipKernel for Primitive` that selects values branchlessly per 64-bit mask chunk instead of routing through the generic run-copy builder. For each chunk it blends `if_true`/`if_false` per row without a data-dependent branch, so the inner loop is auto-vectorizable and mask-shape-independent (memory-bandwidth-bound): up to ~900x faster on fragmented masks and ~8x on clustered masks, so no density-adaptive fallback is needed. Covers every native ptype.

Result validity is computed by zipping the two boolean validity arrays with the same mask -- `Validity::Array(zip(mask, if_true_valid, if_false_valid))` -- reusing the zip machinery rather than re-deriving the mask algebra, with fast paths when both sides' validity already agrees.

Adds a `primitive_zip` divan benchmark across fragmented/block/sparse/dense masks for nullable and non-nullable inputs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
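To illustrate what zipping the two validity bitmaps with the mask computes, here is a small word-level sketch. It is not the shared `zip_validity` helper, whose signature is not shown here: the name `select_validity_words` and the packed `&[u64]` representation are assumptions for the example, and the sketch spells out the bit identity directly where the kernel instead delegates to the boolean zip.

```rust
/// Hypothetical sketch: a result row is valid exactly when the side chosen by
/// the mask is valid. All three slices are assumed to share the same LSB-first
/// packing; bits past the logical length are left unspecified.
fn select_validity_words(mask: &[u64], true_valid: &[u64], false_valid: &[u64]) -> Vec<u64> {
    assert_eq!(mask.len(), true_valid.len());
    assert_eq!(mask.len(), false_valid.len());
    mask.iter()
        .zip(true_valid)
        .zip(false_valid)
        // Take `true_valid` bits where the mask is set, `false_valid` bits elsewhere.
        .map(|((&m, &t), &f)| (m & t) | (!m & f))
        .collect()
}
```

When both inputs are fully valid, every selected bit is 1 regardless of the mask, which is one case where a fast path can skip this work entirely.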

robert3005

the only comment I have is that unaligned_chunks is a faster iterator than chunks but we can leave that for later

joseph-isaacs added the changelog/performance label (A performance improvement) Jun 5, 2026, with Claude

joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from 241e46d to bdc1c4b June 5, 2026 18:29

joseph-isaacs requested a review from a team June 5, 2026 18:29

joseph-isaacs changed the base branch from develop to claude/bool-branchless-zip June 5, 2026 18:29

Base automatically changed from claude/bool-branchless-zip to develop June 9, 2026 13:14

joseph-isaacs force-pushed the claude/primitive-branchless-zip branch from bdc1c4b to eb806b8 June 9, 2026 13:30

claude added 2 commits June 9, 2026 13:58

Merge develop into primitive-branchless-zip

afe6cd2

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Add TODO for skewed-mask fast path in primitive zip kernel

d00e3c4

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

robert3005 approved these changes Jun 9, 2026

Add TODO to use unaligned_chunks in primitive zip kernel

336271c

Capture review feedback that `unaligned_chunks` is a faster single-buffer iterator than `chunks` for the mask walk in `select_values`. No behavior change. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs wants to merge 4 commits into develop from claude/primitive-branchless-zip

3 participants