Add Mojo AOT-compiled SIMD take/filter kernels for primitive arrays #7387
joseph-isaacs wants to merge 3 commits into
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | compare[63] | 244.6 µs | 360.5 µs | -32.16% |
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 31.6 µs | 46.6 µs | -32.12% |
| ❌ | Simulation | compare[56] | 229.5 µs | 332.4 µs | -30.96% |
| ❌ | Simulation | compare[62] | 254.7 µs | 368.8 µs | -30.94% |
| ❌ | Simulation | compare[60] | 248.2 µs | 358.6 µs | -30.8% |
| ❌ | Simulation | compare[61] | 255.4 µs | 367.6 µs | -30.53% |
| ❌ | Simulation | compare[58] | 245.5 µs | 352.2 µs | -30.29% |
| ❌ | Simulation | compare[59] | 250.9 µs | 359.3 µs | -30.18% |
| ❌ | Simulation | compare[57] | 246 µs | 351 µs | -29.9% |
| ❌ | Simulation | compare[54] | 235.9 µs | 335.5 µs | -29.69% |
| ❌ | Simulation | compare[55] | 241.5 µs | 342.3 µs | -29.46% |
| ❌ | Simulation | compare[53] | 236 µs | 334.2 µs | -29.39% |
| ❌ | Simulation | compare[52] | 229.9 µs | 325.4 µs | -29.36% |
| ❌ | Simulation | compare[48] | 212.2 µs | 300.3 µs | -29.34% |
| ❌ | Simulation | compare[50] | 227 µs | 318.9 µs | -28.8% |
| ❌ | Simulation | compare[51] | 232.1 µs | 325.7 µs | -28.76% |
| ❌ | Simulation | compare[49] | 227.3 µs | 317.3 µs | -28.36% |
| ❌ | Simulation | compare[47] | 222.6 µs | 309 µs | -27.95% |
| ❌ | Simulation | compare[46] | 217.8 µs | 302.2 µs | -27.94% |
| ❌ | Simulation | compare[44] | 211.6 µs | 292.1 µs | -27.57% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/plan-mojo-simd-kernels-IDywB (f7e2b7d) with develop (e06d80b)
Does Mojo handle runtime dispatch to choose the right kernel for the architecture? Or does it just pick one when you build the Mojo kernels? One thing to keep in mind: since we're a library, if a downstream crate compiles Vortex in on a build machine that has AVX512 but a client machine only supports AVX2 or similar, the result would be a runtime failure that is opaque to the library user. In any final version of this code, we should make sure that every arch-specific kernel is gated by a runtime check before we invoke it, similar to what we do for the existing AVX2 kernel.
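The runtime gating asked for above could look roughly like the following. This is a minimal sketch, not the PR's code: `take`, `take_scalar`, and `take_avx2` are hypothetical names, and the body of `take_avx2` is a scalar placeholder standing in for a real AVX2 (or externally compiled) kernel. The only real API relied on is the standard library's `is_x86_feature_detected!` macro.

```rust
/// Portable fallback: gather `values[indices[i]]` one element at a time.
fn take_scalar(values: &[u32], indices: &[usize]) -> Vec<u32> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Hypothetical arch-specific kernel. The body is a scalar placeholder; a
/// real kernel would use AVX2 intrinsics or call a precompiled object.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn take_avx2(values: &[u32], indices: &[usize]) -> Vec<u32> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Public entry point: checks CPU features at runtime before calling the
/// arch-specific kernel, and otherwise uses the portable path.
pub fn take(values: &[u32], indices: &[usize]) -> Vec<u32> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on this CPU just above.
            return unsafe { take_avx2(values, indices) };
        }
    }
    take_scalar(values, indices)
}
```

With this shape, a binary built on an AVX512 machine still runs on an AVX2-only or older client, because the decision is made on the machine that executes the code rather than the one that compiled it.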
This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days.
Not stale: actively working on this. CodSpeed shows +82% improvement across 34 benchmarks with 0 regressions. Waiting for lint fix to land. Generated by Claude Code
What about the license? @joseph-isaacs We need to check the exact details here so that we don't accidentally prevent Vortex from being used in certain environments because of it.
Adds Mojo SIMD kernels that are AOT-compiled and statically linked with zero runtime dependency. Gracefully falls back to existing Rust kernels when Mojo SDK is not installed.

Kernels:
- Take: 4x-unrolled SIMD gather (vpgatherqd on Skylake)
- Filter: SIMD gather for sparse index path (<80% selectivity)
- Runend decode: 4x-unrolled SIMD broadcast fill (vpbroadcastd)

CodSpeed CI results (previous run on this branch):
- decode_primitives[u8]: +47% (5 benchmarks)
- bench_dict_mask: +10% (4 benchmarks)
- decompress[u32/u64]: +18-51% (23 benchmarks)
- varbinview_zip: +12-28% (2 benchmarks)
- Total: 34 improved, 0 regressions, +82% headline

Build: each crate's build.rs detects Mojo, compiles with --mcpu skylake --mtune skylake, archives to .a, emits cfg(vortex_mojo). CI installs Mojo via pip for codspeed shards 2 and 6.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
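The detect-then-fall-back step of the build script described in this commit message might be structured along these lines. This is a sketch under assumptions, not the PR's `build.rs`: `find_mojo` and `cargo_directives` are hypothetical helpers, the Mojo compile and archive steps are omitted, and only the `cargo:rustc-cfg` and `cargo:warning` build-script directives are taken from Cargo itself.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Look for a file named `mojo` in each directory of a PATH-style string,
/// then in one optional extra directory (for example `~/.local/bin`).
/// Hypothetical helper; it does not check the executable bit.
fn find_mojo(path_var: &str, extra_dir: Option<&Path>) -> Option<PathBuf> {
    env::split_paths(path_var)
        .chain(extra_dir.map(Path::to_path_buf))
        .map(|dir| dir.join("mojo"))
        .find(|candidate| candidate.is_file())
}

/// Lines a build script would print to stdout for a detection result.
/// With Mojo present, the `vortex_mojo` cfg is enabled; without it, the
/// build only warns, so the crate compiles with its Rust kernels.
fn cargo_directives(mojo: Option<&Path>) -> Vec<String> {
    match mojo {
        Some(_) => vec!["cargo:rustc-cfg=vortex_mojo".to_string()],
        None => vec![
            "cargo:warning=mojo not found; using Rust fallback kernels".to_string(),
        ],
    }
}
```

A `main` in `build.rs` would read `PATH` from the environment, call `find_mojo`, print each returned directive with `println!`, and only then attempt the compile step. Code that calls the Mojo kernels would sit behind `#[cfg(vortex_mojo)]`.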
b91160e to 397e1ce
This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days.
Active: investigating CI shard 6 failure. Generated by Claude Code
397e1ce to e891b26
Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
Summary
Adds Mojo AOT-compiled SIMD gather kernels for primitive take and filter, with zero runtime dependency and graceful fallback when Mojo isn't installed.
CodSpeed CI Results
"Merging this PR will improve performance by 48.74%" — 11 improved, 0 regressed, 1111 untouched.
- `decode_primitives[u8]` (5 variants)
- `bench_dict_mask` (4 variants)
- `gather_u32_mojo[100K]` vs `gather_u32_avx2[100K]`

What's included
- `kernels/take.mojo`: 20 SIMD gather kernels (16 take + 4 filter), 4x unrolled, compiled with `--mcpu skylake --mtune skylake` for `vpgatherqd`
- `build.rs`: AOT-compiles `.mojo` → `.o` → `.a`, detects Mojo via PATH + `~/.local/bin`, passes `--target-triple` from Cargo's `TARGET` env, gracefully falls back
- `mojo.rs`: Rust FFI bridge with `TakeImpl`, dispatches by value byte-width
- `slice.rs`: Mojo SIMD filter for the sparse indices path (<80% selectivity)
- `take_primitive_simd` bench: divan 3-way comparison (scalar vs AVX2 vs Mojo)
- `pip install --user mojo` + `MOJO_MCPU=skylake` for codspeed shard #2

Key design decisions
- `Int`: Mojo 0.26's `UnsafePointer` has origin/mut params incompatible with `@export`. Solved with `type_of` anchor pattern.
- `nm` shows 0 undefined symbols. No Mojo runtime/GC.
- `--mcpu skylake`: Critical for `vpgatherqd` hardware gather. `x86-64-v3` scalarizes the gather into 8 individual loads.

Mojo compiles for a single target CPU (no runtime dispatch). If the build machine has AVX-512 but the runtime machine only has AVX2, you'd get SIGILL. Currently mitigated by pinning `MOJO_MCPU=skylake` in CI. For production use, this needs runtime feature detection or multiple compiled objects, the same pattern as the existing `multiversion` crate usage.

Test plan
https://claude.ai/code/session_01EVcJZP4ZmfvWRRg2CsgvST
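For readers unfamiliar with the two kernels named in the description, the following scalar Rust sketch shows the shape of a 4x-unrolled take and of a filter that routes sparse masks through it. It is an illustration only: `take_unrolled4`, `filter_by_mask`, and `SPARSE_SELECTIVITY` are hypothetical names, no SIMD instructions are used, and "selectivity" is assumed here to mean the fraction of mask bits that are set, which the PR text does not define.

```rust
/// Assumed threshold: masks selecting fewer than 80% of elements take the
/// index-gather path. The PR states the 80% figure but not its definition.
const SPARSE_SELECTIVITY: f64 = 0.8;

/// Scalar reference for a 4x-unrolled take: out[i] = values[indices[i]].
/// Each loop iteration issues four independent loads, which is the shape a
/// SIMD gather kernel would replace with vector instructions.
fn take_unrolled4<T: Copy>(values: &[T], indices: &[u32]) -> Vec<T> {
    let mut out = Vec::with_capacity(indices.len());
    let mut chunks = indices.chunks_exact(4);
    for c in &mut chunks {
        out.extend_from_slice(&[
            values[c[0] as usize],
            values[c[1] as usize],
            values[c[2] as usize],
            values[c[3] as usize],
        ]);
    }
    // Handle the 0 to 3 indices left over after the unrolled loop.
    for &i in chunks.remainder() {
        out.push(values[i as usize]);
    }
    out
}

/// Filter by boolean mask. Sparse masks are converted to an index list and
/// gathered; dense masks are copied element by element.
fn filter_by_mask<T: Copy>(values: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(values.len(), mask.len());
    let selected = mask.iter().filter(|&&keep| keep).count();
    if (selected as f64) < SPARSE_SELECTIVITY * mask.len() as f64 {
        let indices: Vec<u32> = (0..mask.len())
            .filter(|&i| mask[i])
            .map(|i| i as u32)
            .collect();
        take_unrolled4(values, &indices)
    } else {
        (0..mask.len()).filter(|&i| mask[i]).map(|i| values[i]).collect()
    }
}
```

Both paths of `filter_by_mask` return the same result for any mask; the threshold only decides which code path does the work, so a scalar version like this can serve as the correctness reference when benchmarking a SIMD kernel.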