perf[gpu]: coalesce streaming kernels #8357
Switch the streaming list_view and decimal_cast kernels from per-thread element ranges to block-stride loops for coalesced warp access. On GH200 the 10M-list contiguous-offsets check drops from 718us to 80us (9x, ~1.4 TB/s) and the take-based rebuild improves by 35%. The data-dependent rebuild gather kernel is unchanged.

Also enqueue the Arrow Binary status and total-bytes copies before awaiting either, merging two stream round-trips into one.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@claude fix build issues commit push then review
Claude finished @0ax1's task in 4m 58s
Build status & review
No build issues to fix
I checked CI for this PR. Every build/lint/test job is green, including the GPU ones:
The only red check is "Validate Changelog Label" → "Extract and Check Labels" (run nvcc isn't available in this environment and
Code review
This is a clean, well-scoped perf change. I read all three files in full and the conversion is correct.
The transformation from contiguous per-thread ranges (
The key invariant holds: the launch config does not need to change. Old grid covered
Each index is still processed exactly once, so the order-sensitive special cases stay correct:
Dropping the old
The change enqueues both D2H copies before awaiting, merging two stream round-trips into one:

    let status_copy = status.try_to_host()?;
    let total_copy = output_offsets.slice_typed::<i32>(len..len + 1).try_to_host()?;
    let (status_value, total_value) = join(status_copy, total_copy).await;
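
As a rough illustration of why the ordering matters, here is a toy model of an in-order stream in plain Rust. It is not the API used in this PR: `ToyStream`, `enqueue_copy`, `sync`, and `read` are hypothetical names, and the model assumes only that one wait on the stream completes every copy queued before it.

```rust
/// Toy in-order stream (hypothetical, for illustration only): copies are
/// queued and become visible on the host only after a sync.
struct ToyStream {
    queued: Vec<(usize, i32)>, // (host slot, value) pairs waiting to be copied
    host: Vec<Option<i32>>,    // host-side destinations
    syncs: usize,              // number of round-trips paid so far
}

impl ToyStream {
    fn new() -> Self {
        ToyStream { queued: Vec::new(), host: Vec::new(), syncs: 0 }
    }

    /// Enqueue a device-to-host copy and return the host slot it will fill.
    fn enqueue_copy(&mut self, device_value: i32) -> usize {
        self.host.push(None);
        let slot = self.host.len() - 1;
        self.queued.push((slot, device_value));
        slot
    }

    /// One round-trip: completes everything queued so far.
    fn sync(&mut self) {
        self.syncs += 1;
        for (slot, value) in self.queued.drain(..) {
            self.host[slot] = Some(value);
        }
    }

    /// Read a slot, syncing first only if its copy has not landed yet.
    fn read(&mut self, slot: usize) -> i32 {
        if self.host[slot].is_none() {
            self.sync();
        }
        self.host[slot].expect("copy completed by sync")
    }
}

/// Copy, wait, copy, wait: the second copy is queued after the first wait.
fn sequential(status: i32, total: i32) -> (i32, i32, usize) {
    let mut s = ToyStream::new();
    let a = s.enqueue_copy(status);
    let status_value = s.read(a);
    let b = s.enqueue_copy(total);
    let total_value = s.read(b);
    (status_value, total_value, s.syncs)
}

/// Queue both copies first, then wait: the first wait covers both.
fn batched(status: i32, total: i32) -> (i32, i32, usize) {
    let mut s = ToyStream::new();
    let a = s.enqueue_copy(status);
    let b = s.enqueue_copy(total);
    let status_value = s.read(a);
    let total_value = s.read(b);
    (status_value, total_value, s.syncs)
}

fn main() {
    println!("sequential: {:?}", sequential(0, 4096)); // (0, 4096, 2)
    println!("batched:    {:?}", batched(0, 4096));    // (0, 4096, 1)
}
```

Both orderings return the same values; only the number of waits differs, which is the saving the change is after.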
Nits (non-blocking):
Looks good to merge once the changelog label gate is satisfied. ✅
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 161.8 µs | 198.1 µs | -18.31% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 308.5 µs | 273.5 µs | +12.8% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing perf/arrow-export-list-view-offsets (5303c27) with develop (729e17c)
@claude fix the lint issue
Claude finished @0ax1's task in 2m 38s
Investigating the lint failure
Finding: no code lint issue; the job was killed by a spot-instance interruption
I pulled the log for the red
The failure conclusion comes from the runner being reclaimed mid-job, not from a lint diagnostic:
The runner was a spot instance (
What this means
There is nothing to fix in the code.
This just needs the job re-run: from the run page, click Re-run failed jobs, or push any trivial update to re-trigger CI. I can't trigger a re-run or modify workflow files myself (GitHub App permissions), so that step is on you.
Convert the streaming list_view kernels (offsets check, rebuild init scan, offsets validation) and decimal_cast from per-thread-contiguous element ranges to block-stride loops so warp accesses stay coalesced. On GH200 the contiguous-offsets check on 10M lists drops from 718us to 80us (~1.4 TB/s, 9x) and the take-based rebuild path improves by 35%. The rebuild gather kernel keeps its per-list layout since its access pattern is data-dependent.
Also enqueue the status and total-bytes device-to-host copies in the Arrow Binary export before awaiting either, so both readbacks complete in one stream round-trip instead of two.
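
To make the coalescing argument concrete, here is a CPU-side Rust sketch of the two index layouts. It is not the kernel code from this change: the constants `THREADS_PER_BLOCK` and `ELEMS_PER_THREAD` and every function name are assumptions for illustration, and it assumes that "block-stride" means each block keeps its tile of elements while threads inside the block step by the block size. Under that assumption the grid size is the same for both layouts and every index is visited exactly once, but the lanes of a warp touch adjacent elements instead of elements `ELEMS_PER_THREAD` apart.

```rust
// Hypothetical launch shape; the real constants are not shown in this thread.
const THREADS_PER_BLOCK: usize = 256;
const ELEMS_PER_THREAD: usize = 8;
const TILE: usize = THREADS_PER_BLOCK * ELEMS_PER_THREAD; // elements owned by one block

/// Old layout: each thread walks its own contiguous run of ELEMS_PER_THREAD elements.
fn per_thread_range(block: usize, thread: usize, n: usize) -> Vec<usize> {
    let start = block * TILE + thread * ELEMS_PER_THREAD;
    (start..start + ELEMS_PER_THREAD).filter(|&i| i < n).collect()
}

/// New layout: same block tile, but the thread steps by THREADS_PER_BLOCK.
fn block_stride(block: usize, thread: usize, n: usize) -> Vec<usize> {
    let start = block * TILE + thread;
    (0..ELEMS_PER_THREAD)
        .map(|k| start + k * THREADS_PER_BLOCK)
        .filter(|&i| i < n)
        .collect()
}

/// Counts how many times each index in 0..n is visited across the whole grid.
fn coverage(layout: fn(usize, usize, usize) -> Vec<usize>, n: usize) -> Vec<u32> {
    let blocks = (n + TILE - 1) / TILE; // identical grid size for both layouts
    let mut hits = vec![0u32; n];
    for block in 0..blocks {
        for thread in 0..THREADS_PER_BLOCK {
            for i in layout(block, thread, n) {
                hits[i] += 1;
            }
        }
    }
    hits
}

/// First index touched by each of the 32 lanes of warp 0 in block 0.
fn first_warp_access(layout: fn(usize, usize, usize) -> Vec<usize>, n: usize) -> Vec<usize> {
    (0..32).map(|lane| layout(0, lane, n)[0]).collect()
}

fn main() {
    let n = 10_000; // deliberately not a multiple of TILE, to exercise the tail
    assert!(coverage(per_thread_range, n).iter().all(|&h| h == 1));
    assert!(coverage(block_stride, n).iter().all(|&h| h == 1));
    // Lanes 0..4 on the first loop iteration:
    println!("per-thread:   {:?}", &first_warp_access(per_thread_range, n)[..4]); // [0, 8, 16, 24]
    println!("block-stride: {:?}", &first_warp_access(block_stride, n)[..4]); // [0, 1, 2, 3]
}
```

In the strided layout a warp's 32 loads fall on 32 consecutive elements, which is the access pattern the hardware can serve with the fewest memory transactions. The sketch says nothing about the gather kernel, whose indices depend on the data.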