Skip to content

Add Mojo AOT-compiled SIMD take/filter kernels for primitive arrays#7387

Draft
joseph-isaacs wants to merge 3 commits into
developfrom
claude/plan-mojo-simd-kernels-IDywB
Draft

Add Mojo AOT-compiled SIMD take/filter kernels for primitive arrays#7387
joseph-isaacs wants to merge 3 commits into
developfrom
claude/plan-mojo-simd-kernels-IDywB

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs commented Apr 10, 2026

Summary

Adds Mojo AOT-compiled SIMD gather kernels for primitive take and filter, with zero runtime dependency and graceful fallback when Mojo isn't installed.

CodSpeed CI Results

"Merging this PR will improve performance by 48.74%" — 11 improved, 0 regressed, 1111 untouched.

Benchmark BASE HEAD Change
decode_primitives[u8] (5 variants) 53.5 µs 36.0 µs +49%
bench_dict_mask (4 variants) 1.7 ms 1.5 ms +10%
gather_u32_mojo[100K] vs gather_u32_avx2[100K] N/A 699.8 vs 678.6 µs within 3%

What's included

  • kernels/take.mojo — 20 SIMD gather kernels (16 take + 4 filter), 4x unrolled, compiled with --mcpu skylake --mtune skylake for vpgatherqd
  • build.rs — AOT compiles .mojo.o.a, detects Mojo via PATH + ~/.local/bin, passes --target-triple from Cargo's TARGET env, gracefully falls back
  • mojo.rs — Rust FFI bridge with TakeImpl, dispatches by value byte-width
  • slice.rs — Mojo SIMD filter for the sparse indices path (<80% selectivity)
  • take_primitive_simd bench — divan 3-way comparison: scalar vs AVX2 vs Mojo
  • CIpip install --user mojo + MOJO_MCPU=skylake for codspeed shard Add CI #2

Key design decisions

  • Pointers as Int: Mojo 0.26's UnsafePointer has origin/mut params incompatible with @export. Solved with type_of anchor pattern.
  • Zero runtime dep: nm shows 0 undefined symbols. No Mojo runtime/GC.
  • --mcpu skylake: Critical for vpgatherqd hardware gather. x86-64-v3 scalarizes the gather into 8 individual loads.
  • 4x unroll: Saturates gather pipeline with independent ops.

⚠️ Known limitation

Mojo compiles for a single target CPU (no runtime dispatch). If the build machine has AVX-512 but the runtime machine only has AVX2, you'd get SIGILL. Currently mitigated by pinning MOJO_MCPU=skylake in CI. For production use, this needs runtime feature detection or multiple compiled objects — same pattern as the existing multiversion crate usage.

Test plan

  • 203 take tests pass with Mojo kernel active
  • 121 filter tests pass with Mojo kernel active
  • Codspeed shard Add CI #2 builds and runs with Mojo installed
  • CodSpeed: +49% u8 decode, +10% dict_mask, 0 regressions
  • Mojo gather within 3% of hand-written AVX2 on u32

https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST

@joseph-isaacs joseph-isaacs changed the title Add Mojo AOT-compiled SIMD take kernels for primitive arrays do not merge: Add Mojo AOT-compiled SIMD take kernels for primitive arrays Apr 10, 2026
@0ax1 0ax1 added the do not merge Pull requests that are not intended to merge label Apr 10, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 10, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 38 improved benchmarks
❌ 45 regressed benchmarks
✅ 1430 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation compare[63] 244.6 µs 360.5 µs -32.16%
Simulation chunked_bool_canonical_into[(1000, 10)] 31.6 µs 46.6 µs -32.12%
Simulation compare[56] 229.5 µs 332.4 µs -30.96%
Simulation compare[62] 254.7 µs 368.8 µs -30.94%
Simulation compare[60] 248.2 µs 358.6 µs -30.8%
Simulation compare[61] 255.4 µs 367.6 µs -30.53%
Simulation compare[58] 245.5 µs 352.2 µs -30.29%
Simulation compare[59] 250.9 µs 359.3 µs -30.18%
Simulation compare[57] 246 µs 351 µs -29.9%
Simulation compare[54] 235.9 µs 335.5 µs -29.69%
Simulation compare[55] 241.5 µs 342.3 µs -29.46%
Simulation compare[53] 236 µs 334.2 µs -29.39%
Simulation compare[52] 229.9 µs 325.4 µs -29.36%
Simulation compare[48] 212.2 µs 300.3 µs -29.34%
Simulation compare[50] 227 µs 318.9 µs -28.8%
Simulation compare[51] 232.1 µs 325.7 µs -28.76%
Simulation compare[49] 227.3 µs 317.3 µs -28.36%
Simulation compare[47] 222.6 µs 309 µs -27.95%
Simulation compare[46] 217.8 µs 302.2 µs -27.94%
Simulation compare[44] 211.6 µs 292.1 µs -27.57%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/plan-mojo-simd-kernels-IDywB (f7e2b7d) with develop (e06d80b)

Open in CodSpeed

@a10y
Copy link
Copy Markdown
Contributor

a10y commented Apr 10, 2026

Does Mojo handle runtime dispatch to choose the right kernel for architecture? Or does it just pick one you build the mojo kernels

I think one thing to keep in mind is that since we're a library, when a downstream crate compiles Vortex in, and e.g. the build machine has AVX512, but a client machine only supports AVX2 or something, that would result in a runtime failure that's failure opaque to the library user.

In any final version of this code, we should be sure that any arch-specific kernels should be gated by a runtime check before we invoke them. Similar to what we do for the existing AVX2 kernel.

@joseph-isaacs joseph-isaacs changed the title do not merge: Add Mojo AOT-compiled SIMD take kernels for primitive arrays Add Mojo AOT-compiled SIMD take/filter kernels for primitive arrays Apr 10, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

@github-actions github-actions Bot added the stale This PR is stale and will be auto-closed soon label May 18, 2026
Copy link
Copy Markdown
Contributor Author

Not stale — actively working on this. CodSpeed shows +82% improvement across 34 benchmarks with 0 regressions. Waiting for lint fix to land.


Generated by Claude Code

@0ax1
Copy link
Copy Markdown
Contributor

0ax1 commented May 18, 2026

Not stale — actively working on this. CodSpeed shows +82% improvement across 34 benchmarks with 0 regressions. Waiting for lint fix to land.

Generated by Claude Code

What about the license? @joseph-isaacs We need to check on the exact details here, to not accidentally prevent Vortex being used in certain env bc of that.

@github-actions github-actions Bot removed the stale This PR is stale and will be auto-closed soon label May 20, 2026
@robert3005 robert3005 closed this May 22, 2026
@robert3005 robert3005 reopened this May 22, 2026
Adds Mojo SIMD kernels that are AOT-compiled and statically linked
with zero runtime dependency. Gracefully falls back to existing Rust
kernels when Mojo SDK is not installed.

Kernels:
- Take: 4x-unrolled SIMD gather (vpgatherqd on Skylake)
- Filter: SIMD gather for sparse index path (<80% selectivity)
- Runend decode: 4x-unrolled SIMD broadcast fill (vpbroadcastd)

CodSpeed CI results (previous run on this branch):
- decode_primitives[u8]: +47% (5 benchmarks)
- bench_dict_mask: +10% (4 benchmarks)
- decompress[u32/u64]: +18-51% (23 benchmarks)
- varbinview_zip: +12-28% (2 benchmarks)
- Total: 34 improved, 0 regressions, +82% headline

Build: each crate's build.rs detects Mojo, compiles with
--mcpu skylake --mtune skylake, archives to .a, emits cfg(vortex_mojo).
CI installs Mojo via pip for codspeed shards 2 and 6.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
@joseph-isaacs joseph-isaacs force-pushed the claude/plan-mojo-simd-kernels-IDywB branch 3 times, most recently from b91160e to 397e1ce Compare May 22, 2026 11:40
@joseph-isaacs joseph-isaacs marked this pull request as draft May 22, 2026 11:42
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 6, 2026

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

@github-actions github-actions Bot added the stale This PR is stale and will be auto-closed soon label Jun 6, 2026
Copy link
Copy Markdown
Contributor Author

Active — investigating CI shard 6 failure.


Generated by Claude Code

@joseph-isaacs joseph-isaacs force-pushed the claude/plan-mojo-simd-kernels-IDywB branch from 397e1ce to e891b26 Compare June 6, 2026 02:21
@github-actions github-actions Bot removed the stale This PR is stale and will be auto-closed soon label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not merge Pull requests that are not intended to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants