perf[gpu]: coalesce streaming kernels by 0ax1 · Pull Request #8357 · vortex-data/vortex

0ax1 · 2026-06-11T13:39:52Z

Convert the streaming list_view kernels (offsets check, rebuild init scan, offsets validation) and decimal_cast from per-thread-contiguous element ranges to block-stride loops so warp accesses stay coalesced. On GH200 the contiguous-offsets check on 10M lists drops from 718us to 80us (~1.4 TB/s, 9x) and the take-based rebuild path improves by 35%. The rebuild gather kernel keeps its per-list layout since its access pattern is data-dependent.

Also enqueue the status and total-bytes device-to-host copies in the Arrow Binary export before awaiting either, so both readbacks complete in one stream round-trip instead of two.

Switch the streaming list_view and decimal_cast kernels from per-thread element ranges to block-stride loops for coalesced warp access. On GH200 the 10M-list contiguous-offsets check drops from 718us to 80us (9x, ~1.4 TB/s) and the take-based rebuild improves by 35%. The data-dependent rebuild gather kernel is unchanged. Also enqueue the Arrow Binary status and total-bytes copies before awaiting either, merging two stream round-trips into one. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 · 2026-06-11T13:43:59Z

@claude fix build issues commit push then review

github-actions · 2026-06-11T13:44:25Z

Claude finished @0ax1's task in 4m 58s —— View job

Build status & review

Gather context & reproduce build issues
Check CI for build failures
Review the PR changes

No build issues to fix

I checked CI for this PR. Every build/lint/test job is green, including the GPU ones:

Job	Result
CUDA build & lint	✅ success
CUDA tests / memcheck / synccheck / initcheck / cudf	✅ success
Linux musl	✅ success
Codspeed Benchmarking	✅ success
Publish Dry Runs, Vortex Web, Typo, REUSE	✅ success

The only red check is "Validate Changelog Label" → "Extract and Check Labels" (run 27350939774). That is a PR-metadata/labeling gate, not a compile/build failure, and a later PR Labels run (27350977671) already passed once the label was applied — so the failure is stale. It lives under .github/workflows, which I can't modify, and it isn't something a code change would fix.

nvcc isn't available in this environment and cargo is sandbox-restricted here, so I relied on the CUDA CI job (which compiles + lints the kernels) rather than a local build. Nothing was committed/pushed because there was nothing to fix.

Code review

This is a clean, well-scoped perf change. I read all three files in full and the conversion is correct.

list_view.cu / decimal_cast.cu — per-thread → block-stride

The transformation from contiguous per-thread ranges (start_elem/stop_elem) to block-stride loops (idx = block_start + threadIdx.x; idx += blockDim.x) is the same pattern already used by runend.cu, dict.cu, arrow_binary.cu, etc., so it's consistent with the codebase.

The key invariant holds: the launch config does not need to change. Old grid covered ceil(len/EPT) threads → ceil(ceil(len/EPT)/blockDim) blocks; new grid needs ceil(len/(blockDim*EPT)) blocks. These are equal, so the same grid fully covers len. ✅

Each index is still processed exactly once, so the order-sensitive special cases stay correct:

list_view.cu:52 idx == 0 && offset != 0 fires on exactly one thread.
list_view.cu:56 idx + 1 == list_len writes output[list_len] exactly once — no double write, no race.
validate_offsets reads output_offsets[idx + 1] up to output_offsets[list_len], which is in bounds (list_len + 1 entries). ✅
rebuild_init_scan still guards idx >= list_len against scan_len. ✅

Dropping the old if (startElem >= array_len) return; guard in decimal_cast is fine — block_stop is clamped to array_len, so out-of-range threads simply don't enter the loop. The data-dependent list_view_rebuild_primitive_device gather kernel is correctly left on its per-list layout, matching the PR description.

start_elem/stop_elem/ELEMENTS_PER_THREAD remain defined in config.cuh and are still used by other kernels, so no dangling references.

canonical.rs — overlapped readbacks

The change enqueues both D2H copies before awaiting, merging two stream round-trips into one:

let status_copy = status.try_to_host()?;
let total_copy = output_offsets.slice_typed::<i32>(len..len + 1).try_to_host()?;
let (status_value, total_value) = join(status_copy, total_copy).await;

try_to_host() returns a BoxFuture<'static, …> that owns its data, so the temporary slice_typed(...) handle being dropped at the ; does not create a borrow issue. ✅
Using join (not try_join) is the right call here and matches the comment's intent: both copies complete before any ?, so an early error can't drop the sibling copy mid-flight on the stream. ✅
Stream ordering guarantees both copies observe the completed exclusive_sum/validate_offsets kernels. ✅

Nits (non-blocking):

const uint64_t elements_per_block = blockDim.x * ELEMENTS_PER_THREAD; computes the product in 32-bit before widening. Safe at the current THREADS_PER_BLOCK=64 × EPT=32 = 2048, and it's consistent with the other kernels, so no action needed — just noting it's relying on those constants staying small.

Looks good to merge once the changelog label gate is satisfied. ✅

codspeed-hq · 2026-06-11T13:47:00Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1530 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	161.8 µs	198.1 µs	-18.31%
⚡	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	308.5 µs	273.5 µs	+12.8%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing perf/arrow-export-list-view-offsets (5303c27) with develop (729e17c)}

0ax1 · 2026-06-11T13:51:16Z

@claude fix the lint issue

github-actions · 2026-06-11T13:51:40Z

Claude finished @0ax1's task in 2m 38s —— View job

Investigating the lint failure

Identify the failing lint check in CI
Inspect the Rust (lint) job log
Determine whether code changes are needed

Finding: no code lint issue — the job was killed by a spot-instance interruption

I pulled the log for the red Rust (lint) job (job 80812326879). Every lint step actually passed before the job died:

Step	Result
`cargo +nightly fmt --all --check`	✅ no diff
`cargo check --all-features --all-targets` (`-D warnings`)	✅ Finished
`cargo check --all-features --all-targets --release`	✅ Finished
`cargo clippy --all-features --all-targets -- -D warnings`	✅ Finished
`cargo clippy --all-targets -- -D warnings`	✅ Finished

The failure conclusion comes from the runner being reclaimed mid-job, not from a lint diagnostic:

13:43:09  ##[error]The runner has received a shutdown signal. This can happen when the
          runner service is stopped, or a manually started runner is canceled.
13:43:17  ##[error]The operation was canceled.

The runner was a spot instance (m8azn.6xlarge, RUNS_ON_INSTANCE_LIFECYCLE: spot) and AWS reclaimed it right after the last clippy step finished. That surfaces as a failure rather than a clean cancelled, which is why it looks like a lint error.

What this means

There is nothing to fix in the code — fmt, cargo check, and both clippy invocations all completed without warnings on this exact commit (5303c27). I made no changes and pushed nothing.

This just needs the job re-run: from the run page, click Re-run failed jobs, or push any trivial update to re-trigger CI. I can't trigger a re-run or modify workflow files myself (GitHub App permissions), so that step is on you.

Note: cargo is sandbox-restricted and nvcc/GPU aren't available in this environment, so I relied on the CI log rather than a local clippy run — but the log already shows every lint step finishing green before the interruption.

0ax1 requested review from a team, onursatici and robert3005 June 11, 2026 13:39

0ax1 added the changelog/performance A performance improvement label Jun 11, 2026

robert3005 approved these changes Jun 11, 2026

View reviewed changes

0ax1 changed the title ~~perf[gpu]: coalesce streaming kernels & overlap binary export readbacks~~ perf[gpu]: coalesce streaming kernels Jun 11, 2026

0ax1 enabled auto-merge (squash) June 11, 2026 13:49

0ax1 disabled auto-merge June 11, 2026 13:49

0ax1 enabled auto-merge (squash) June 11, 2026 14:01

0ax1 merged commit 0dd6db7 into develop Jun 11, 2026
97 of 101 checks passed

0ax1 deleted the perf/arrow-export-list-view-offsets branch June 11, 2026 14:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf[gpu]: coalesce streaming kernels#8357

perf[gpu]: coalesce streaming kernels#8357
0ax1 merged 1 commit into
developfrom
perf/arrow-export-list-view-offsets

0ax1 commented Jun 11, 2026

Uh oh!

0ax1 commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 11, 2026

Uh oh!

0ax1 commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

0ax1 commented Jun 11, 2026

Uh oh!

0ax1 commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Build status & review

No build issues to fix

Code review

Uh oh!

codspeed-hq Bot commented Jun 11, 2026

Merging this PR will not alter performance

Performance Changes

Uh oh!

0ax1 commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Investigating the lint failure

Finding: no code lint issue — the job was killed by a spot-instance interruption

What this means

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 11, 2026 •

edited

Loading