Skip to content

perf[gpu]: coalesce streaming kernels#8357

Merged
0ax1 merged 1 commit into
developfrom
perf/arrow-export-list-view-offsets
Jun 11, 2026
Merged

perf[gpu]: coalesce streaming kernels#8357
0ax1 merged 1 commit into
developfrom
perf/arrow-export-list-view-offsets

Conversation

@0ax1

@0ax1 0ax1 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Convert the streaming list_view kernels (offsets check, rebuild init scan, offsets validation) and decimal_cast from per-thread-contiguous element ranges to block-stride loops so warp accesses stay coalesced. On GH200 the contiguous-offsets check on 10M lists drops from 718us to 80us (~1.4 TB/s, 9x) and the take-based rebuild path improves by 35%. The rebuild gather kernel keeps its per-list layout since its access pattern is data-dependent.

Also enqueue the status and total-bytes device-to-host copies in the Arrow Binary export before awaiting either, so both readbacks complete in one stream round-trip instead of two.

Switch the streaming list_view and decimal_cast kernels from
per-thread element ranges to block-stride loops for coalesced warp
access. On GH200 the 10M-list contiguous-offsets check drops from
718us to 80us (9x, ~1.4 TB/s) and the take-based rebuild improves by
35%. The data-dependent rebuild gather kernel is unchanged.

Also enqueue the Arrow Binary status and total-bytes copies before
awaiting either, merging two stream round-trips into one.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 requested review from a team, onursatici and robert3005 June 11, 2026 13:39
@0ax1 0ax1 added the changelog/performance A performance improvement label Jun 11, 2026
@0ax1

0ax1 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@claude fix build issues commit push then review

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 4m 58s —— View job


Build status & review

  • Gather context & reproduce build issues
  • Check CI for build failures
  • Review the PR changes

No build issues to fix

I checked CI for this PR. Every build/lint/test job is green, including the GPU ones:

Job Result
CUDA build & lint ✅ success
CUDA tests / memcheck / synccheck / initcheck / cudf ✅ success
Linux musl ✅ success
Codspeed Benchmarking ✅ success
Publish Dry Runs, Vortex Web, Typo, REUSE ✅ success

The only red check is "Validate Changelog Label" → "Extract and Check Labels" (run 27350939774). That is a PR-metadata/labeling gate, not a compile/build failure, and a later PR Labels run (27350977671) already passed once the label was applied — so the failure is stale. It lives under .github/workflows, which I can't modify, and it isn't something a code change would fix.

nvcc isn't available in this environment and cargo is sandbox-restricted here, so I relied on the CUDA CI job (which compiles + lints the kernels) rather than a local build. Nothing was committed/pushed because there was nothing to fix.


Code review

This is a clean, well-scoped perf change. I read all three files in full and the conversion is correct.

list_view.cu / decimal_cast.cu — per-thread → block-stride

The transformation from contiguous per-thread ranges (start_elem/stop_elem) to block-stride loops (idx = block_start + threadIdx.x; idx += blockDim.x) is the same pattern already used by runend.cu, dict.cu, arrow_binary.cu, etc., so it's consistent with the codebase.

The key invariant holds: the launch config does not need to change. Old grid covered ceil(len/EPT) threads → ceil(ceil(len/EPT)/blockDim) blocks; new grid needs ceil(len/(blockDim*EPT)) blocks. These are equal, so the same grid fully covers len. ✅

Each index is still processed exactly once, so the order-sensitive special cases stay correct:

  • list_view.cu:52 idx == 0 && offset != 0 fires on exactly one thread.
  • list_view.cu:56 idx + 1 == list_len writes output[list_len] exactly once — no double write, no race.
  • validate_offsets reads output_offsets[idx + 1] up to output_offsets[list_len], which is in bounds (list_len + 1 entries). ✅
  • rebuild_init_scan still guards idx >= list_len against scan_len. ✅

Dropping the old if (startElem >= array_len) return; guard in decimal_cast is fine — block_stop is clamped to array_len, so out-of-range threads simply don't enter the loop. The data-dependent list_view_rebuild_primitive_device gather kernel is correctly left on its per-list layout, matching the PR description.

start_elem/stop_elem/ELEMENTS_PER_THREAD remain defined in config.cuh and are still used by other kernels, so no dangling references.

canonical.rs — overlapped readbacks

The change enqueues both D2H copies before awaiting, merging two stream round-trips into one:

let status_copy = status.try_to_host()?;
let total_copy = output_offsets.slice_typed::<i32>(len..len + 1).try_to_host()?;
let (status_value, total_value) = join(status_copy, total_copy).await;
  • try_to_host() returns a BoxFuture<'static, …> that owns its data, so the temporary slice_typed(...) handle being dropped at the ; does not create a borrow issue. ✅
  • Using join (not try_join) is the right call here and matches the comment's intent: both copies complete before any ?, so an early error can't drop the sibling copy mid-flight on the stream. ✅
  • Stream ordering guarantees both copies observe the completed exclusive_sum/validate_offsets kernels. ✅

Nits (non-blocking):

  • const uint64_t elements_per_block = blockDim.x * ELEMENTS_PER_THREAD; computes the product in 32-bit before widening. Safe at the current THREADS_PER_BLOCK=64 × EPT=32 = 2048, and it's consistent with the other kernels, so no action needed — just noting it's relying on those constants staying small.

Looks good to merge once the changelog label gate is satisfied. ✅

@codspeed-hq

codspeed-hq Bot commented Jun 11, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1530 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_canonical_into[(1000, 10)] 161.8 µs 198.1 µs -18.31%
Simulation chunked_varbinview_canonical_into[(100, 100)] 308.5 µs 273.5 µs +12.8%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing perf/arrow-export-list-view-offsets (5303c27) with develop (729e17c)

Open in CodSpeed

@0ax1 0ax1 changed the title perf[gpu]: coalesce streaming kernels & overlap binary export readbacks perf[gpu]: coalesce streaming kernels Jun 11, 2026
@0ax1 0ax1 enabled auto-merge (squash) June 11, 2026 13:49
@0ax1 0ax1 disabled auto-merge June 11, 2026 13:49
@0ax1

0ax1 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@claude fix the lint issue

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 2m 38s —— View job


Investigating the lint failure

  • Identify the failing lint check in CI
  • Inspect the Rust (lint) job log
  • Determine whether code changes are needed

Finding: no code lint issue — the job was killed by a spot-instance interruption

I pulled the log for the red Rust (lint) job (job 80812326879). Every lint step actually passed before the job died:

Step Result
cargo +nightly fmt --all --check ✅ no diff
cargo check --all-features --all-targets (-D warnings) ✅ Finished
cargo check --all-features --all-targets --release ✅ Finished
cargo clippy --all-features --all-targets -- -D warnings ✅ Finished
cargo clippy --all-targets -- -D warnings ✅ Finished

The failure conclusion comes from the runner being reclaimed mid-job, not from a lint diagnostic:

13:43:09  ##[error]The runner has received a shutdown signal. This can happen when the
          runner service is stopped, or a manually started runner is canceled.
13:43:17  ##[error]The operation was canceled.

The runner was a spot instance (m8azn.6xlarge, RUNS_ON_INSTANCE_LIFECYCLE: spot) and AWS reclaimed it right after the last clippy step finished. That surfaces as a failure rather than a clean cancelled, which is why it looks like a lint error.

What this means

There is nothing to fix in the codefmt, cargo check, and both clippy invocations all completed without warnings on this exact commit (5303c27). I made no changes and pushed nothing.

This just needs the job re-run: from the run page, click Re-run failed jobs, or push any trivial update to re-trigger CI. I can't trigger a re-run or modify workflow files myself (GitHub App permissions), so that step is on you.

Note: cargo is sandbox-restricted and nvcc/GPU aren't available in this environment, so I relied on the CI log rather than a local clippy run — but the log already shows every lint step finishing green before the interruption.

@0ax1 0ax1 enabled auto-merge (squash) June 11, 2026 14:01
@0ax1 0ax1 merged commit 0dd6db7 into develop Jun 11, 2026
97 of 101 checks passed
@0ax1 0ax1 deleted the perf/arrow-export-list-view-offsets branch June 11, 2026 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants