
[GPU] New QIPC ops for subgroups #676

Open

hughperkins wants to merge 82 commits into main from hp/new-qipc-ops-subgroup

Conversation


@hughperkins hughperkins commented May 10, 2026

Summary

Adds a single, consistent set of new SIMT-subgroup query / inter-processor-communication ops to qd.simt.subgroup, all working portably across CUDA, AMDGPU, and SPIR-V (Vulkan / Metal):

  • Reductions (lane 0 / broadcast): reduce_min, reduce_max (lane 0 of each 2**log2_size group), reduce_all_min, reduce_all_max (broadcast to every lane).
    • Built on the existing shuffle_down tree / shuffle_xor butterfly patterns; same shape as reduce_add / reduce_all_add.
    • NaN handling for floats is implementation-defined and is documented.
  • Ballot: subgroup.ballot(predicate) returns a u32 bitmask (bit i set iff lane i's predicate is non-zero).
  • Segmented reduce: subgroup.segmented_reduce_add / segmented_reduce_min / segmented_reduce_max (value, head_flag, log2_size) run a per-lane inclusive scan that resets at every non-zero head_flag, scoped to 2**log2_size consecutive lanes.
    • One ballot to materialise the head bitmask, one clz to find each lane's segment head, then a Hillis-Steele inclusive scan bounded by distance >= offset. Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size ops.
    • No identity argument is required (unlike exclusive_min / exclusive_max): the per-lane distance >= offset guard ensures the scan never crosses a segment boundary, so a partner from another segment is never combined with the local value.
    • The group's first lane is treated as an implicit head, so a segment that starts mid-group is still reduced correctly.
  • Lane masks: subgroup.lanemask_lt(lane_id) / _le / _eq / _gt / _ge: closed-form u32 masks parametrised by a lane id (sketched after this list), mirroring CUDA's __lanemask_{lt,le,eq,gt,ge} but generalised to take any lane_id (pass invocation_id() for the CUDA built-in form).
    • Pure @qd.func arithmetic — no backend intrinsic, no shuffle, no ballot — so per-lane-varying lane_id works the same as a uniform one.
    • Caller contract: lane_id in [0, 31]; on AMDGPU CDNA wave64 the mask covers only the low 32 lanes (build a 64-bit mask from two u32 ballots if needed).
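
A minimal sketch of the closed-form arithmetic (pure-Python model, assuming u32 wrap-around; the shipped @qd.func versions do the same arithmetic on traced expressions):

    def lanemask_lt(lane_id):   # bits of all lanes strictly below lane_id
        return ((1 << lane_id) - 1) & 0xFFFFFFFF
    def lanemask_le(lane_id):   # bits of all lanes at or below lane_id
        return ((1 << (lane_id + 1)) - 1) & 0xFFFFFFFF
    def lanemask_eq(lane_id):   # only lane_id's own bit
        return (1 << lane_id) & 0xFFFFFFFF
    def lanemask_ge(lane_id):   # bits of all lanes at or above lane_id
        return 0xFFFFFFFF ^ lanemask_lt(lane_id)
    def lanemask_gt(lane_id):   # bits of all lanes strictly above lane_id
        return 0xFFFFFFFF ^ lanemask_le(lane_id)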

Drive-by fixes (required by segmented_reduce_*, but useful in their own right)

The new segmented_reduce_* ops are the first users of qd.clz in the codebase, and they exposed bugs on every backend:

  • CUDA: __nv_clz / __nv_clzll are declared on signed types but operate on the underlying bit pattern; route u32 / u64 through them so qd.clz(u32(...)) no longer hits QD_NOT_IMPLEMENTED.
  • AMDGPU: emit_extra_unary had no clz case; map it to LLVM's Intrinsic::ctlz with is_zero_undef=0.
  • SPIR-V: split GLSL.std.450 FindSMsb (instruction 74, signed) and FindUMsb (instruction 75, unsigned). The unsigned form is required for u32 / u64 inputs whose top bit may be set; FindSMsb is undefined for those (it treats them as negative, and a "most-significant 0-bit" doesn't exist for 0xFFFFFFFF). Cast the result back to i32 before the 32 - msb - 1 subtraction so SPIR-V's strict-type sub is happy.

Stacking

This PR is stacked on top of #665 (hp/cross-gpu-subgroup).

It supersedes #600 (hp/cross-gpu-ballot), whose three commits are cherry-picked here unchanged. #600 can be closed once this lands (or once #665 lands, whichever is convenient).

Test plan

  • Local pre-commit (black, clang-format, ruff, pylint) passes.
  • pyright (project config) clean for new code; pre-existing errors in untouched files unchanged.
  • CUDA (cluster, RTX PRO 6000 Blackwell): all 103 new lanemask + segmented_reduce_min/max tests pass; full test_simt.py (586 passed, 1 skipped) green.
  • Vulkan (same cluster node): all 103 new tests pass; full test_simt.py (567 passed, 20 skipped) green.
  • AMDGPU (amddesktop, Radeon RX 7900 XTX, gfx1100, ROCm): all 103 new tests pass; full test_subgroup_* (566 passed, 21 deselected) green.
  • Comment / docstring widths audited against the project's 120c target via find_underwrapped.py.

Made with Cursor

hughperkins and others added 30 commits May 7, 2026 15:16
Aligns the subgroup scope with `block.sync()` and the planned
`block.mem_fence()` / `grid.mem_fence()` naming. The old names remain
as thin aliases that forward to the new ones and emit a
DeprecationWarning on first use (per-alias one-shot guard, plus the
existing `warnings.filterwarnings("once", DeprecationWarning, ...)`
in `quadrants.lang.misc`).

Updates `docs/source/user_guide/subgroup.md` to describe the renames
as done (with deprecation aliases) rather than planned.
Brings the four previously partial / TODO data-movement ops up to full
CUDA + AMDGPU + SPIR-V coverage:

* shuffle_up: add CUDA + AMDGPU lowerings.
  - CUDA: new `cuda_shuffle_up_{i32,f32,i64,f64}` runtime helpers in
    runtime_module/runtime.cpp (mirroring `cuda_shuffle_down_*`), built
    on the already-patched `cuda_shfl_up_sync_{i32,f32}` NVVM intrinsics.
    Codegen branch + `emit_cuda_shuffle_up` in codegen/cuda/codegen_cuda.cpp.
  - AMDGPU: new `amdgpu_shuffle_up_{i32,f32,i64,f64}` runtime helpers
    using the existing `ds_bpermute` path (same FIXME re: DPP fast-path
    as `shuffle_down`). Codegen branch + `emit_amdgpu_shuffle_up`.

* shuffle_xor and broadcast_first: replace TODO `pass` stubs with
  portable `@qd.func` wrappers that inline into the calling kernel:
  - `shuffle_xor(value, mask)` ≡ `shuffle(value, u32(lane) ^ mask)`
  - `broadcast_first(value)` ≡ `broadcast(value, u32(0))`
  No backend codegen / runtime changes required: every backend that
  lowers `shuffle` / `broadcast` now lowers these too.
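
  For reference, a sketch of the wrapper shape (illustrative pseudocode; the
  shipped definitions live in subgroup.py and may differ in exact casts):

    @func
    def shuffle_xor(value, mask):
        return shuffle(value, u32(invocation_id()) ^ mask)

    @func
    def broadcast_first(value):
        return broadcast(value, u32(0))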

Tests:

* test_subgroup_shuffle_up (mirror of test_subgroup_shuffle_down)
* test_subgroup_shuffle_xor (uses the new wrapper directly; the
  existing `_pattern` test continues to verify the manual emulation)
* test_subgroup_broadcast_first

Doc: refresh `docs/source/user_guide/subgroup.md` data-movement
support matrix + per-op semantics + performance notes to reflect
universal coverage. Drop the now-stale "fail to link on CUDA / AMDGPU"
paragraph from the `shuffle_up` section.
Adds the missing test coverage for the rename half of this PR:

* test_subgroup_sync (vulkan): smoke that subgroup.sync() — the renamed
  subgroup.barrier() — traces and runs.
* test_subgroup_mem_fence (vulkan): same for subgroup.mem_fence().
* test_subgroup_barrier_deprecation_warn_once: pure-Python unit test
  asserting subgroup.barrier() emits exactly one DeprecationWarning
  across multiple calls and forwards to sync(); monkeypatches sync to
  a no-op so no kernel context is required and the test runs on every
  arch.
* test_subgroup_memory_barrier_deprecation_warn_once: mirror for
  subgroup.memory_barrier() / subgroup.mem_fence().
… + SPIR-V

The data-movement ops in qd.simt.subgroup require uniform control flow
with all lanes active (already documented in subgroup.md). Under that
contract subgroups (warps / waves) execute in lockstep on CUDA and
AMDGPU, so an intra-subgroup control barrier or memory fence is a
no-op on those backends. The SPIR-V backend keeps the real
OpControlBarrier / OpMemoryBarrier emission because Vulkan / Metal
subgroups can diverge.

Lower subgroupBarrier / subgroupMemoryBarrier to a placeholder i32 0
(matching the SPIR-V codegen's return convention) on the CUDA and
AMDGPU codegen, so calling subgroup.sync() / subgroup.mem_fence()
from a kernel succeeds on every GPU backend.

The smoke tests for sync()/mem_fence() are now arch=qd.gpu rather than
arch=qd.vulkan and confirm tracing + running on each backend.

Doc: matrix updated to yes/yes/yes (with a footnote explaining the
no-op-on-CUDA/AMDGPU semantics) and the per-op section rewritten to
describe the universal lowering.
…+ AMDGPU + SPIR-V"

This reverts commit 233b08c.

The "no-op on CUDA / AMDGPU" lowering conflated control-flow lockstep
with memory ordering. The two are not equivalent:

* `sync()` (control barrier) under our uniform-CF + all-lanes-active
  contract really is a no-op on CUDA / AMDGPU, because warps / waves
  are already at the same program point. That part was defensible.

* `mem_fence()` (memory fence) is NOT a no-op. Lockstep execution does
  not order memory operations: the compiler may reorder loads / stores
  across the call, and the SM may buffer writes. A correct CUDA
  lowering would need at minimum an LLVM `fence` intrinsic with the
  appropriate scope (or `__threadfence_block()` as an over-strict
  fallback). That was not done.

Rather than ship a half-correct lowering, restore the previous status:
both ops remain SPIR-V only, the doc keeps its original "warps are
lockstep, these are typically unnecessary; use __syncwarp under
divergent control flow" guidance, and the smoke tests stay on
arch=qd.vulkan. Implementing real CUDA / AMDGPU lowerings can be a
separate, properly-thought-through change.
…GPU + SPIR-V

Replaces the earlier (reverted) attempt that lowered these to no-ops on CUDA / AMDGPU
"because warps are lockstep", which was wrong about what the user contract guarantees:
sync() must reconverge lanes that have been split by independent thread scheduling
(Volta+) and mem_fence() must actually order memory.  This change wires real backend
primitives into the lowering and fixes a long-standing SPIR-V mem_fence() bug.

Per-backend lowerings
---------------------

sync() (subgroupBarrier):
  * SPIR-V  : already correct - OpControlBarrier(Subgroup, Subgroup, 0).
  * CUDA    : warp_barrier(0xFFFFFFFF), reusing the existing runtime helper that is
              patched to llvm.nvvm.bar.warp.sync (i.e. __syncwarp).  This is the
              precise warp-scope reconvergence primitive Volta+ needs and is a no-op
              under uniform CF on Pascal.
  * AMDGPU  : llvm.amdgcn.wave.barrier - LLVM's wave-scope sync primitive.  Acts as a
              compiler reordering barrier on GCN (lockstep) and emits a real wave
              barrier on RDNA where waves can span multiple SIMDs.

mem_fence() (subgroupMemoryBarrier):
  * SPIR-V  : was emitting OpMemoryBarrier(Subgroup, 0).  The Memory Semantics operand
              must have an ordering bit AND at least one storage class, so 0 is
              invalid; drivers that accept it treat the instruction as a no-op.  Now
              emits AcquireRelease | UniformMemory | WorkgroupMemory, matching what
              workgroupMemoryBarrier does (just at Subgroup scope).
  * CUDA    : block_memfence(), patched to llvm.nvvm.membar.cta (__threadfence_block).
              Workgroup-scope, hence over-strict for the subgroup-scope ask but
              correct - a CTA-scope fence orders memory across the whole CTA, of
              which the subgroup is a strict subset.
  * AMDGPU  : LLVM 'fence syncscope("workgroup") seq_cst' - lowers to the appropriate
              s_waitcnt / cache-flush sequence.  Same workgroup-scope over-strictness
              note.

Tests
-----

test_subgroup_sync and test_subgroup_mem_fence flip from arch=qd.vulkan to
arch=qd.gpu and now run on every GPU backend.  They are smoke tests: they verify
the kernel traces, codegens, and runs without error.  We do not attempt to
construct a producer/consumer race that only the fence makes legal - that kind of
test is hard to write portably and easy to make flaky.

Doc updates
-----------

The Identification-and-control table now shows yes for sync() / mem_fence() across
all backends, with a footnote on mem_fence() pointing out the workgroup-scope
over-strictness on CUDA / AMDGPU.  The semantics section spells out the per-backend
lowering and the uniform-CF caller contract.
…s CUDA + AMDGPU + SPIR-V

Closes the last two `no` cells in the Identification-and-control matrix in subgroup.md.
Both ops now lower correctly on every GPU backend.

group_size()
------------

  * CUDA: returns the static constant 32 (warp size on every supported NVIDIA arch).
  * AMDGPU: emits llvm.amdgcn.wavefrontsize; the AMDGPU backend folds it to 32 or 64
    based on the function's +wavefrontsize32/+wavefrontsize64 target feature.
  * SPIR-V: unchanged - was already querying OpSubgroupSize.

elect()
-------

Reimplemented as a @qd.func wrapper:

    @func
    def elect():
        return i32(invocation_id() == 0)

Inlines at trace time into compare + zext on every backend.  Replaces the SPIR-V-only
OpGroupNonUniformElect path with a portable definition.

Semantic change worth flagging
------------------------------

OpGroupNonUniformElect is allowed to elect any *active* lane and may pick a different
lane on different invocations.  The new wrapper deterministically elects lane 0.
Under qd.simt.subgroup's documented uniform-CF + all-lanes-active contract this is
strictly compatible (lane 0 is always active and is a legal SPIR-V choice), and it
makes the behaviour identical across backends.  Grepped the codebase before changing -
no internal caller depends on the broader OpGroupNonUniformElect semantics.

Tests
-----

  * test_subgroup_group_size: every lane writes group_size() into a buffer; the result
    must be uniform across lanes and in {32, 64}.
  * test_subgroup_elect: writes elect(), invocation_id(), and group_size() into per-lane
    slots, then asserts (a) elect() is in {0, 1}, (b) elected lanes are exactly the
    invocation_id == 0 lanes, and (c) the elected count equals N / group_size.

Both parametrized over arch=qd.gpu so they run on every available GPU backend.

Doc
---

subgroup.md matrix flips both rows to yes-on-all.  Semantics sections describe each
backend lowering and call out the elect() lane-0-pinning narrowing of SPIR-V.
… + AMDGPU + SPIR-V

Replaces the SPIR-V-only `subgroup.inclusive_add(v)` with a portable sized variant
implemented as a `@qd.func` Hillis-Steele scan over `shuffle_up`.  This is the
first slice of the planned migration of the inclusive_* / exclusive_* ops to a
universal sized API; the other 6 inclusive_* ops still take `(value)` and lower
via OpGroupNonUniformInclusiveScan on SPIR-V only.

Implementation
--------------

  @func
  def inclusive_add(value, log2_size: template()):
      lane_in_group = invocation_id() & ((1 << log2_size) - 1)
      for i in static(range(log2_size)):
          offset = static(1 << i)
          partner = shuffle_up(value, u32(offset))
          if lane_in_group >= offset:
              value = value + partner
      return value

  * `shuffle_up` is in uniform CF (every lane participates) - matches its
    documented contract on every backend.
  * The `if lane_in_group >= offset` is per-lane arithmetic - no subgroup op
    inside the conditional.
  * Cross-group `shuffle_up` partners are masked off by the lane_in_group guard,
    so groups smaller than the full subgroup compose correctly when
    log2_size < log2(group_size).

Backend cleanup
---------------

  * Dropped `subgroupInclusiveAdd` from the SPIR-V codegen `inclusive_scan_ops`
    set in `quadrants/codegen/spirv/spirv_codegen.cpp` - that path is now
    unreachable for `inclusive_add`.  The other 6 inclusive ops still go through
    that branch.
  * Dropped `PER_INTERNAL_OP(subgroupInclusiveAdd)` from internal_ops.inc.h and
    `POLY_OP(subgroupInclusiveAdd, ...)` from type_system.cpp.  No SPIR-V fast
    path left to keep alive.

Internal caller fix
-------------------

`quadrants.algorithms.PrefixSumExecutor` was passing `subgroup.inclusive_add` as a
template-callable to `scan_add_inclusive`, which invokes it as `inclusive_add(val)`
with one argument.  After the API change this would TypeError.  Added a single-arg
adapter `subgroup_inclusive_add_warp_i32` next to `warp_shfl_up_i32` in `_kernels.py`
that calls `subgroup.inclusive_add(val, 5)` (log2_size=5 -> 32-lane warp/wave scan,
matching WARP_SZ in the kernel), and routed the Vulkan branch to the adapter.  The
CUDA branch still uses `warp_shfl_up_i32` for now.

Tests
-----

`test_subgroup_inclusive_add` (arch=qd.gpu, parametrized over `log2_size in 1..5`
and `dtype in {i32, i64, u64, f32, f64}`): runs the scan and verifies each lane's
result against a Python running sum.

Doc
---

  * Matrix flips `inclusive_add` row to yes-on-all (with the same `*` AMDGPU
    perf-asterisk as `reduce_add`).
  * Top-of-section text and "Performance notes" updated to reflect that
    `inclusive_add` now has a portable sized form, while the other inclusive_*
    ops are still mid-migration.
  * The "Inclusive scan on SPIR-V" example now uses `inclusive_add(v, 5)` and
    works on every GPU backend.
… AMDGPU + SPIR-V

Slice 2 of the inclusive_* / exclusive_* migration: extends the same portable
@qd.func Hillis-Steele pattern from `inclusive_add` (slice 1) to the other six
inclusive ops, sharing a single `_inclusive_scan` helper.

Implementation
--------------

  @func
  def _inclusive_scan(value, op: template(), log2_size: template()):
      lane_in_group = invocation_id() & ((1 << log2_size) - 1)
      for i in static(range(log2_size)):
          offset = static(1 << i)
          partner = shuffle_up(value, u32(offset))
          if lane_in_group >= offset:
              value = op(value, partner)
      return value

  @func def inclusive_add(v, log2_size): return _inclusive_scan(v, _bin_add, log2_size)
  @func def inclusive_mul(v, log2_size): return _inclusive_scan(v, _bin_mul, log2_size)
  ... (min / max / and / or / xor follow the same one-line pattern)

The seven `_bin_*` are tiny @func wrappers around `+`, `*`, `min(a,b)`, `max(a,b)`,
`a & b`, `a | b`, `a ^ b`.  Each is passed as a template-callable to `_inclusive_scan`
and gets inlined at trace time, so the public API has the same cost as the slice 1
inline scan: log2_size shuffle+op pairs, no runtime indirection.

This refactors the existing `inclusive_add` (which lived inline in slice 1) onto the
shared helper at the same time, so all seven scans live in one place.  The
externally-observable behaviour of `inclusive_add` is unchanged.

Backend cleanup
---------------

  * Removed the entire `inclusive_scan_ops` / `OpGroupNonUniformInclusiveScan`
    branch from `quadrants/codegen/spirv/spirv_codegen.cpp` - all seven ops now go
    through the portable Python path on every backend, including SPIR-V.
  * Removed the six remaining `subgroupInclusive{Mul,Min,Max,And,Or,Xor}` entries
    from `internal_ops.inc.h` and `type_system.cpp`.

Tests
-----

  * Added `test_subgroup_inclusive_{mul,min,max,and,or,xor}` (arch=qd.gpu),
    each parametrized over `log2_size in 1..5` and a per-op dtype list:
      - `_mul`: i32, f32, f64 (inputs clamped to [1, 4] so 32-way product fits i32).
      - `_min` / `_max`: i32, f32, f64 (varied non-monotonic inputs).
      - `_and` / `_or` / `_xor`: i32, i64, u64 (bit-varied inputs).
  * Refactored the existing `test_subgroup_inclusive_add` to share a small
    `_check_inclusive_scan` helper with the new tests; the dtype matrix is
    unchanged (i32, i64, u64, f32, f64).

Doc
---

  * Matrix flips all six remaining `inclusive_*` rows to yes-on-all (with `*` for
    AMDGPU - same ds_bpermute perf note as `inclusive_add`).
  * Section header collapses the seven ops into a single block: same shape, only
    the operator differs.
  * Performance notes call out that `OpGroupNonUniformInclusiveScan` is no longer
    used on SPIR-V even though it was supported - the trade-off is uniform cost
    across backends.

The `exclusive_*` ops are still TODO stubs - that's slice 3.
…s i32

The previous `(i % 4) + 1` pattern produced cycles of 1*2*3*4 = 24 per group of 4;
over 28 lanes that's 24^7 ≈ 4.6e9, which overflows i32 (and was the only failure in
the cuda-side slice 2 run).  Replace with `2 if i % 4 == 0 else 1`: max 8 twos in
32 lanes → product ≤ 2**8 == 256, well within i32 and exact in f32.
Slice 3 (final) of the inclusive_* / exclusive_* migration: replaces the seven
TODO-stub `exclusive_*` functions with portable @qd.func implementations layered
on top of the inclusive scans from slice 2.

Implementation
--------------

  @func
  def _exclusive_scan(value, op: template(), identity, log2_size: template()):
      inc = _inclusive_scan(value, op, log2_size)
      shifted = shuffle_up(inc, u32(1))
      lane_in_group = invocation_id() & ((1 << log2_size) - 1)
      result = shifted
      if lane_in_group == 0:
          result = identity
      return result

The lane-0 substitution is required: `shuffle_up` with offset 1 is
implementation-defined at lane 0 (and `OpGroupNonUniformShuffleUp` calls it
undefined outright), so we cannot rely on whatever the hardware happens to
produce there.

Identity per op is supplied as a runtime expression in `value`'s dtype, derived
from `value` itself so the wrapper does not need to inspect the dtype:

  add: value - value          (zero)
  mul: value - value + 1      (one - the literal +1 takes value's dtype)
  or:  value ^ value          (zero)
  xor: value ^ value          (zero)
  and: ~(value ^ value)       (all bits set)

For `min` and `max` there is no portable type-extreme that can be derived from
`value` alone, so those two ops take an explicit `identity` argument:

  exclusive_min(v, log2_size, identity)   # pass +inf or dtype max
  exclusive_max(v, log2_size, identity)   # pass -inf or dtype min

Cost per call: one inclusive scan (`log2_size` shuffle+op pairs) plus one extra
`shuffle_up` and a per-lane select.
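
A sketch of how the per-op wrappers plug those identities into `_exclusive_scan`
(illustrative; the shipped one-liners may differ in casts):

  @func
  def exclusive_add(v, log2_size: template()):
      return _exclusive_scan(v, _bin_add, v - v, log2_size)

  @func
  def exclusive_min(v, log2_size: template(), identity):
      return _exclusive_scan(v, _bin_min, identity, log2_size)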

Tests
-----

  * Added `test_subgroup_exclusive_{add,mul,min,max,and,or,xor}` (arch=qd.gpu),
    each parametrized over `log2_size in 1..5` and a per-op dtype list:
      - `_add`: i32, i64, u64, f32, f64
      - `_mul`: i32, f32, f64 (inputs bounded so 32-way product fits i32)
      - `_min` / `_max`: i32, f32, f64 (caller passes explicit identity)
      - `_and` / `_or` / `_xor`: i32, i64, u64
  * Shared `_check_exclusive_scan` helper drives the kernel launch, dtype skip,
    and per-lane verification: lane 0 must equal the supplied identity, lane k>0
    must equal the op-reduce of `src[0..k]`.

Doc
---

  * Matrix gains all seven `exclusive_*` rows, all yes-on-all (with `*` for
    AMDGPU same as inclusive_*).
  * New section describes the shared shuffle_up + select pattern, the per-op
    identity expressions, and why min/max take explicit identities.
  * The old "exclusive_*, all_true, any_true, all_equal" TODO-stub section is
    trimmed down to just the three remaining stubs.
… scans

Both `_check_inclusive_scan` and `_check_exclusive_scan` previously verified only
the first group's worth of lanes (lanes 0..group_size-1).  Two coverage gaps:

  1. For log2_size < 5, multiple independent groups of 2**log2_size lanes share
     a single 32-lane subgroup.  The `lane_in_group >= offset` mask is what
     isolates them from each other - and that mask was completely untested.
     A bug there would have silently passed.

  2. The 64-lane launch produces two independent 32-lane subgroups (lanes 0-31
     and 32-63) running the same scan side by side.  Cross-subgroup leakage
     in the underlying shuffle_up (e.g. an AMDGPU ds_bpermute with the wrong
     mask) would not have been caught.

Both helpers now iterate over every (group, in-group-lane) pair across the full
64-lane launch and verify the expected per-lane value, recomputing the running
op-reduce from `src[group_base..]` at each group boundary.

Coverage delta: with log2_size=1 the old test verified 2 of 64 lanes; the new
test verifies all 64.  At log2_size=3, 8 of 64 -> 64 of 64.  At log2_size=5,
32 of 64 -> 64 of 64 (still the same group_size, but the second subgroup is
now exercised).

Validated on the cluster: all 230 scan tests (115 inclusive + 115 exclusive)
pass with the extended verification on CUDA and on Vulkan; the slice 1/2/3
implementations were already correct, this just closes the test gap.
…al fix)

`exclusive_*` scans all fail on the Metal backend (via MoltenVK), with the
`got` value at lane 1 of each group being whatever the inclusive scan would
produce *if the lane-0 conditional update had been applied unconditionally*
(eg. `inc[0] = src[0] op src[0]` instead of `inc[0] = src[0]`).  For
non-idempotent ops this is visibly wrong; for `and`/`or` it accidentally
matches at group 0 because `x op x = x`.  Inclusive scans pass because nothing
downstream re-reads `inc[0]` across lanes.

Root cause is reconvergence in MoltenVK's SPIR-V → MSL lowering of the
pattern `if lane_in_group >= offset: value = op(value, partner)` followed by
another subgroup op (the next loop iteration's `shuffle_up`, or the
`shuffle_up(inc, 1)` inside `_exclusive_scan`): lanes that took the false
branch end up reading stale register state from the subsequent shuffle.

Fix: replace both conditional updates (`if`-then-assignment) with
`qd.select`, which lowers to `OpSelect` and keeps every lane in straight-line
code.  `op(value, partner)` is pure so unconditional evaluation is safe.
Adds a comment explaining the choice.

Validated:
- CUDA simt scans:    280/280 pass
- Vulkan simt scans:  280/280 pass
- CUDA scan+sort:      65/65  pass
- Vulkan scan+sort:    65/65  pass
Replaces the long-standing TODO stubs with portable @qd.func implementations
plus a CUDA fast path at full-warp size.

API:
- `subgroup.all_true(predicate, log2_size)` -- AND-reduce `predicate != 0`
  across each `2**log2_size` group, returns `i32(0|1)` broadcast to every
  lane.
- `subgroup.any_true(predicate, log2_size)` -- OR-reduce, same shape.
- `subgroup.all_equal(value, log2_size)` -- broadcast group-lane-0's value,
  AND-reduce per-lane equality bit. Equality is the backend's native `==`
  (NaN != NaN, +0.0 == -0.0), matching SPIR-V `OpGroupNonUniformAllEqual`.

CUDA shortcut: at trace time, `qd.static()` on `current_cfg().arch` plus the
compile-time `log2_size` selects `cuda_all_sync_i32` / `cuda_any_sync_i32`
when `log2_size == 5`, so a full-warp call lowers to a single `vote.all` /
`vote.any` instruction with no branch in the IR. `all_equal` inherits the
shortcut transitively via `all_true`. We deliberately do not wire
`__match_all_sync` because it requires sm_70+ and uses bit-equality on
floats, contradicting the documented `OpGroupNonUniformAllEqual` semantics.

Every other backend (Vulkan, Metal, AMDGPU), and CUDA at `log2_size < 5`,
falls back to a portable `shuffle_xor` butterfly: `log2_size` shuffles plus
`log2_size` ANDs / ORs, fully unrolled into the calling kernel's IR (same
shape as `reduce_all_add`). No C++ codegen changes.
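
A hedged sketch of that fallback shape (illustrative; `any_true` shown,
`all_true` swaps the `|` for `&`):

  @func
  def _any_true_butterfly(predicate, log2_size: template()):
      p = i32(predicate != 0)
      for i in static(range(log2_size)):
          mask = static(1 << i)
          p = p | shuffle_xor(p, u32(mask))
      return p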

Tests cover all-true / all-false / one-odd-lane-in-one-group /
sparse-pattern scenarios for `all_true` and `any_true`, and
all-same / all-distinct / same-per-group / one-outlier-per-group for
`all_equal`. Each scenario verifies every group across the full 64-lane
launch (so the launch spans two CUDA / Metal / RDNA subgroups, exercising
both partial-subgroup multi-group and cross-subgroup behaviour).

Validated:
- CUDA simt:        369/370 (+ 1 expected skip)
- Vulkan simt:      350/370 (+20 expected MoltenVK skips)
- CUDA scan+sort:    65/65
- Vulkan scan+sort:  65/65

Doc: `docs/source/user_guide/subgroup.md` updated -- support matrix,
dedicated section per op, and CUDA-shortcut rationale.
The previous commit replaced `if` with `qd.select` in the scan helpers,
but `OpSelect` on MoltenVK/Metal silently returns the false-branch value
when an operand is an f32 produced by a shuffle intrinsic.  Revert
`_inclusive_scan` back to `if`, which works correctly on its own.

For `_exclusive_scan`, restructure to shift the input before the
inclusive scan (shuffle_up → fill lane 0 with identity → inclusive scan)
instead of running the inclusive scan then shuffling the result.  The
old pattern triggered a separate Metal SPIR-V misoptimization where the
register holding the inclusive result was clobbered when only consumed
by a shuffle intrinsic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two coverage gaps surfaced during a post-merge audit:

* `all_true` / `any_true` were only ever exercised with predicate values 0
  or 1, so the `i32(predicate != 0)` cast was untested.  Adds a
  `nonbinary-mixed` scenario (`[((i*17) % 13) - 6 for i in range(N)]` --
  mixes 0, positives, and negatives) to both tests.
* `all_equal` on floats was documented as "NaN != NaN, +0.0 == -0.0"
  (matching `OpGroupNonUniformAllEqual`) but no test pinned the contract
  down.  Adds `test_subgroup_all_equal_float_contract` (f32 + f64 x
  log2_size 1..5) covering: ±0 mixed in every group -> 1; NaN at every
  group start -> 0; NaN at a single lane -> only that group is 0; all NaN
  -> every group 0.  These also lock the door against a future refactor
  swapping in `__match_all_sync` on CUDA (which would silently regress to
  bit-equality on floats).

Validated: 45/45 voting tests on CUDA and Vulkan (was 35/35 + 10 new from
the float contract scenarios).
* black auto-reformats in `subgroup.py` and `test_simt.py` (line-length=120
  per `.pre-commit-config.yaml`).
* clang-format auto-reformats in `codegen_amdgpu.cpp` and `spirv_codegen.cpp`.
* Drop unused `from quadrants.lang.simt import subgroup` from `_algorithms.py`
  (left over after the switch to `subgroup_inclusive_add_warp_i32`); ruff
  re-sorts the remaining import block.
* Extend the file-level pyright comment in `subgroup.py` from
  `reportInvalidTypeForm=false` to also disable `reportOperatorIssue` so that
  `p & shuffle_xor(...)` / `p | shuffle_xor(...)` in the new voting ops don't
  trip pyright on `Expr` operator overloads — same false-positive class the
  existing suppression already covers.

Pre-commit (black, clang-format, ruff, pylint, trailing-whitespace,
end-of-file) clean.  Pyright is down to 6 pre-existing errors in files this
branch does not touch (`_tensor_wrapper.py`, `_func_base.py`,
`_metal_interop.py`, all from PR #618 / streams work) — net 0 new errors
attributable to this branch.
The voting / scan / data-movement work landed with prose wrapped at the
AI-default ~80-95c instead of the project's 120c (per `pre-commit` black
config `-l 120`).  Reflow the affected runs in:

* `python/quadrants/lang/simt/subgroup.py` — module-level voting / inclusive
  / exclusive backend-strategy comments, plus `elect`, `all_true`, `any_true`,
  `all_equal`, `broadcast_first`, `_inclusive_scan`, all `inclusive_*` /
  `exclusive_*` op docstrings, and `_exclusive_scan` / `shuffle_xor`.
* `tests/python/test_simt.py` — voting / scan section comments, scan
  verification rationale, voting predicate-truthy / float-contract notes,
  `test_subgroup_sync` / `_mem_fence` / `_group_size` / `_elect` /
  `_barrier_deprecation_warn_once` / `_memory_barrier_deprecation_warn_once`
  docstrings.
* `python/quadrants/_kernels.py` — `subgroup_inclusive_add_warp_i32` adapter
  docstring.
* `python/quadrants/algorithms/_algorithms.py` — comment explaining the
  warp-i32 adapter usage in `PrefixSumExecutor`.

No semantic changes; black / pre-commit / pyright still clean.  Audited via
`find_underwrapped --diff origin/main`: remaining flagged runs are all at
~110-120c (only minor packing imbalance, max ≤ 123c) — no AI-default 80c
under-wrapping in this branch's diff.
The CI wrap-checker flagged three C++ comment blocks in PR #665 still wrapped
near ~80c (`runtime.cpp:1033`, `runtime.cpp:1136`, `codegen_amdgpu.cpp:507`).
While in there I audited the rest of the new C++ subgroup commentary and the
per-op intrinsic notes, and reflowed them to the project's 120c target.

Also tightened a couple of Python lines that crept past 120c (one f-string
docstring, one explanatory comment in test_simt.py).

No semantic changes.
CI wrap-checker on PR #665 flagged three more docstring blocks wrapping at
83-87c instead of 120c (`exclusive_add`, `test_subgroup_sync`,
`test_subgroup_mem_fence`). Reflow those.

No semantic changes.
Stale carry-over from the days when several ops were one-backend stubs;
no longer applies now that everything in the doc is universal.
Stacked on hp/cross-gpu-subgroup; same shape as the existing
`reduce_add` / `reduce_all_add` pair:

* `reduce_min(v, log2_size)` / `reduce_max(v, log2_size)` — `shuffle_down`
  tree, result valid in lane 0 of each `2**log2_size` group.
* `reduce_all_min(v, log2_size)` / `reduce_all_max(v, log2_size)` —
  `shuffle_xor` butterfly, result broadcast to every lane.

Both forms unroll into exactly `log2_size` shuffle+min (or `+max`) pairs
in the calling kernel's IR — no kernel-launch overhead, no separate
runtime symbol. Lowers to backend-specific min/max intrinsics
(`fminnm` / `fmaxnm` on PTX, `llvm.minnum` / `llvm.maxnum` on AMDGPU,
`OpFMin` / `OpFMax` on SPIR-V); float-NaN handling is documented as
implementation-defined.
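
A sketch of the two shapes (illustrative; mirrors the existing reduce_add /
reduce_all_add helpers, exact casts may differ):

  @func
  def reduce_max(value, log2_size: template()):
      for i in static(range(log2_size)):
          offset = static(1 << i)
          value = max(value, shuffle_down(value, u32(offset)))  # valid in lane 0 of each group
      return value

  @func
  def reduce_all_max(value, log2_size: template()):
      for i in static(range(log2_size)):
          mask = static(1 << i)
          value = max(value, shuffle_xor(value, u32(mask)))  # broadcast to every lane
      return value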

Tests: parametrized as `qd.gpu` over `i32` / `i64` / `u64` / `f32` /
`f64` and `log2_size` in `[1..5]`, verifying every group across the
full 64-lane launch.

Doc: new rows in the `subgroup.md` Reductions/scans table; new per-op
sections; the "removed" note is updated to drop `reduce_min` / `reduce_max`
(now portable).
Implement a portable ballot operation that returns a u32 bitmask where
bit i is set if lane i's predicate is non-zero. Works across CUDA
(__ballot_sync), AMDGPU (amdgcn_ballot.i32), and SPIR-V/Vulkan
(OpGroupNonUniformBallot).

Follows the same cross-backend pattern as subgroup.shuffle: a single
Python API (subgroup.ballot) dispatches to the appropriate backend
intrinsic at codegen time. On AMDGPU CDNA with 64-wide wavefronts only
the low 32 bits are returned, consistent with the u32 return type.
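
An illustrative kernel-body snippet (hypothetical variable names, not from the
shipped tests):

  mask = subgroup.ballot(x > 0)            # u32; bit i set iff lane i's x > 0
  any_set = i32(mask != u32(0))            # whole-subgroup any_true in one compare
  all_set = i32(mask == u32(0xFFFFFFFF))   # whole-subgroup all_true when all 32 lanes are active
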
Mac OS X build was failing because spirv_codegen.cpp was accessing
IRBuilder::t_v4_uint_ directly, which is a private member. Add a
public v4_u32_type() accessor following the existing pattern
(u32_type(), bool_type(), etc.) and use it from the ballot lowering.
Per-lane inclusive sum scoped to 2**log2_size lanes, where every lane with
head_flag != 0 resets the running sum.  Useful for stream compaction and
sparse / variable-length records.

Implementation: one subgroup.ballot(head_flag != 0) to materialise a u32
of head positions, then a Hillis-Steele inclusive sum bounded by
`distance >= offset` (distance = lane - segment_head, segment_head from
31 - clz(effective_mask & ((1 << (lane + 1)) - 1)) with a virtual head
OR-injected at group_base so lower is always non-zero).
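
A pure-Python model of that head computation for one 32-lane subgroup
(illustrative only; the shipped @qd.func version does the same arithmetic on
traced values via subgroup.ballot and qd.clz):

  def segment_head_distance(head_flags, lane, log2_size):
      group_base = lane & ~((1 << log2_size) - 1)       # first lane of this group
      heads = sum((hf != 0) << i for i, hf in enumerate(head_flags))  # = ballot(head_flag != 0)
      effective = heads | (1 << group_base)             # virtual head at the group start
      lower = effective & ((1 << (lane + 1)) - 1)       # heads at or below this lane
      head = lower.bit_length() - 1                     # = 31 - clz(lower) for a u32
      return lane - head                                # bounds the Hillis-Steele scan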

Cost: 1 ballot + 1 clz + log2_size shuffles + log2_size adds, fully
unrolled.  Same shape as inclusive_add with a single-instruction setup.

Tests: parametrized over the standard dtypes (i32 / i64 / u64 / f32 /
f64) and log2_size in [0..5], plus three contract tests (no head flags
-> equivalent to inclusive_add; every lane is a head -> output equals
input; truthy non-binary head_flag values).

Doc: new row in the Reductions/scans table; new per-op section after
reduce_all_min / reduce_all_max.
`qd.clz(u32_value)` was emitting QD_NOT_IMPLEMENTED on CUDA and produced
undefined results on SPIR-V (GLSL.std.450 FindSMsb is undefined for the
all-bits-set case).  The new `subgroup.segmented_reduce_add` is the
first user of `clz` in the codebase and exposed both bugs.

* CUDA: route u32 / u64 inputs through the same `__nv_clz` / `__nv_clzll`
  intrinsics used for i32 / i64 — the underlying bit pattern is what
  matters, the C declaration on signed types is a header-level
  convention.
* SPIR-V: dispatch to FindUMsb (#75) for unsigned inputs and FindSMsb
  (#74) for signed.  The two GLSL.std.450 instructions return a value
  of the same type as their operand, so add an explicit OpBitcast back
  to i32 before the `32 - msb - 1` subtraction (otherwise SPIR-V's
  strict-type `sub` asserts on mixed i32 / u32).
* Python: in `segmented_reduce_add`, wrap `clz`'s result in `i32(...)`
  so the subsequent arithmetic is uniformly signed-32-bit (the trace-
  time tracer would otherwise propagate u32 from the input through to
  the subtraction, hitting SPIR-V's same-type assertion).

Tests: `subgroup.segmented_reduce_add` tests now pass on CUDA + Vulkan
across i32 / i64 / u64 / f32 / f64 and `log2_size` in [0..5], including
the all-heads, no-heads, and truthy-predicate edge cases.

@hughperkins hughperkins changed the title from [feat] New QIPC ops for subgroups to [GPU] New QIPC ops for subgroups May 12, 2026

Base automatically changed from hp/cross-gpu-subgroup to main May 12, 2026 16:34
# Conflicts:
#	docs/source/user_guide/subgroup.md
#	python/quadrants/lang/simt/subgroup.py
#	quadrants/codegen/amdgpu/codegen_amdgpu.cpp
#	quadrants/codegen/cuda/codegen_cuda.cpp
#	quadrants/codegen/spirv/spirv_codegen.cpp
#	tests/python/test_simt.py
`subgroupBallotU32` regressed back to `amdgpu_ballot_i32` (the original PR #676
codegen) after commit 57234b0 restored the codegen lost during a cross-gpu-subgroup
merge.  That restore picked up the pre-workaround version, accidentally reverting
the a5319f6 fix that switched the lowering to `amdgpu_ballot_u64 + trunc to i32`.

Symptom on AMD CI (PR #676, gfx942 wave64):
  RuntimeError: Cannot select: i32 = AMDGPUISD::SETCC ..., setne
    i32 = zero_extend
      i1 = CopyFromReg ..., Register:i1 %8
  In function: foo_c94_0_kernel_0_range_for

10+ `test_subgroup_ballot_first_n_*` tests crash with the LLVM AMDGPU isel bug
that the workaround was written for; ballot.i32 with a non-constant i1 predicate
still fails to select on gfx942 in LLVM 20 and 22.1.0 despite PR #71556's i32-on-
wave64 documentation.

Verbatim re-application of a5319f6's codegen change — comment and codegen body
identical, working tree matches a5319f6 in the ballot region.  Identical assembly
to the (broken) i32 form: same v_cmp_*_e64 + low-half store, since LLVM's CSE folds
the i64 ballot's high half away as soon as the trunc is observed.  Already verified
on MI300X wave64 in the original commit.

…ize()

Mirror the table from perso_hugh/doc/subgroup_size.md: a quick at-a-glance
view of how group_size() ends up in the final artifact on each backend
(literal 32 on CUDA, literal 64 on AMDGPU, runtime OpLoad on SPIR-V), with
the existing per-backend prose retained as supporting notes below the table.

Also note that callers needing a Python int at trace time should use
qd.template() / qd.static() with the literal 32/64 they're targeting,
rather than group_size() (which isn't a compile-time constant on SPIR-V).

…docstrings

'Trace time' is an internal-implementation term (the trace step of kernel
compilation, where Python runs the kernel body to emit Quadrants IR).  Users
don't think in those terms — they think in 'compile time vs runtime'.  Replace
the term throughout subgroup.md, block.md, atomics.md, algorithms.md, and
the subgroup.py docstrings.

The semantic meaning is identical (everything previously labelled 'trace time'
happens during kernel compilation, before the first dispatch), so this is a
pure terminology cleanup.

Note: decompositions.md also has one 'trace time' occurrence, but that file is
being renamed/split on hp/new-qipc-ops-linalg, so leaving it untouched to
avoid a merge conflict there.
Missed this one in the previous commit (was inappropriately gated on a
'leave untouched to avoid a merge conflict with hp/new-qipc-ops-linalg' note).
The terminology fix is trivial and applies just as much here as elsewhere;
the rename branch can resolve the one-line conflict.

…e,log2_group_size}() returning Python int

Lift subgroup-size querying from an internal IR op to a compile-time Python int. Lets callers feed
`group_size()` / `log2_group_size()` into `qd.template()` arguments for the upcoming `_full`-suffixed
reductions, where the right `log2_size` differs per backend (32 on CUDA, 64 on AMDGPU, 32 on every Vulkan
/ Metal device we've tested but device-probed on SPIR-V).

C++ side:
- New `DeviceCapability::spirv_subgroup_size`; populated from `VkPhysicalDeviceSubgroupProperties.subgroupSize`
  on Vulkan and hard-coded to 32 on Metal.
- New `Program::subgroup_size()` returning 32 / 64 / probed-from-caps per arch; exposed via pybind.
- Removed dead `subgroupSize` internal op + its CUDA / AMDGPU / SPIR-V codegen branches (the value is now
  folded into the IR as a literal at trace time). The SPIR-V `IRBuilder::get_subgroup_size()` helper and
  its `BuiltInSubgroupSize` global go away too.

Python side:
- `qd.simt.subgroup.group_size()` returns `prog.subgroup_size()` as a Python `int`; callable from kernel or
  host scope.
- New `qd.simt.subgroup.log2_group_size()` asserts power-of-two then returns `bit_length() - 1`.
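
  Roughly (illustrative sketch of the Python side):

    def log2_group_size():
        n = group_size()                   # Python int: 32, 64, or the probed SPIR-V size
        assert n & (n - 1) == 0, "subgroup size must be a power of two"
        return n.bit_length() - 1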

Cache keys:
- SPIR-V cache key already serialized `devcaps`, so adding `spirv_subgroup_size` to the cap map gives us
  fe-ll cache differentiation between wave32 and wave64 SPIR-V devices for free (future-proofing).
- CUDA / AMDGPU keep using `subgroup_size(Arch)` in `offline_cache_util.cpp`; updated comment to make the
  SPIR-V path explicit.

…ll variants for every log2_size op

Two related changes, batched so the new _full wrappers for segmented_reduce_* land with the cap that
makes them work on AMDGPU.

`_segment_head_distance` rewrite (`log2_size <= 6`):
- Compile-time `impl.static(log2_size <= 5)` branch picks between the historical u32-bitmask path
  (`log2_size <= 5`, identical IR to the wave32-only impl — zero overhead for CUDA / SPIR-V /
  Vulkan-wave32) and a new u64-bitmask path (`log2_size == 6`, only reachable on AMDGPU wave64).
- `segmented_reduce_{add,min,max}` each gain `qd.static_assert(log2_size <= 6)` and their docstrings
  cap at `2**log2_size <= group_size()` rather than the old `<= 32`.

`_full` variants for every windowed op:
- `reduce_*_full`, `reduce_all_*_full`, `inclusive_*_full`, `exclusive_*_full`, `all_true_full`,
  `any_true_full`, `all_equal_full`, `segmented_reduce_*_full`. Each is a plain Python one-liner that
  passes `log2_size=subgroup.log2_group_size()` (a Python int resolved at trace time) into the base
  `@qd.func`. Compiles down to identical IR to a hand-rolled `reduce_add(v, 5)` on CUDA or
  `reduce_add(v, 6)` on AMDGPU, but lets callers write "operate over the whole subgroup" without
  gating on arch.
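
  As a sketch, each wrapper is just (illustrative name shown):

    def reduce_add_full(value):
        # log2_group_size() is a Python int at compile time, so this traces to
        # reduce_add(value, 5) on CUDA / wave32 SPIR-V and reduce_add(value, 6) on AMDGPU wave64.
        return reduce_add(value, log2_group_size())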

Tests:
- Smoke tests for `subgroup.group_size()` / `subgroup.log2_group_size()` returning Python `int`,
  folding into the kernel IR, and feeding into `qd.template()` args.
- Smoke tests for each `_full` variant comparing its result lane-by-lane against the matching base
  call with `log2_size=log2_group_size()`.
- AMDGPU-gated tests for `segmented_reduce_{add,max}` at `log2_size=6` (the wave64-only u64 path).

Docs: `subgroup.md` `group_size()` section rewritten to reflect the Python-int return type, a new
"Full-subgroup (`_full`) variants" section enumerating every wrapper, and the `segmented_reduce_*`
description updated to call out the lifted cap and the compile-time-selected u32/u64 paths.

…dentity arg, float dtype on _full variants, group_size() stability across qd.reset+qd.init

Closes the gaps from the audit:
- Direct coverage for every `_full` wrapper that was previously only smoke-tested transitively (mul/and/or/xor variants of inclusive/exclusive, segmented_reduce_min/max_full, all_equal_full).
- Dedicated tests for `exclusive_min_full(value, identity)` / `exclusive_max_full(value, identity)` - the only `_full` variants that take an extra arg.
- Float-dtype (f32/f64) parametrization on `reduce_add_full`, `inclusive_add_full`, and `segmented_reduce_add_full` to catch a future regression that would accidentally cast through i32 inside a wrapper.
- `test_subgroup_group_size_stable_across_reinit` exercises 3 qd.reset+qd.init cycles and asserts `group_size()` / `log2_group_size()` stay stable - guards against a stale-cached subgroup size on `Program` surviving reset.

The shared helper `_check_full_variant_matches_base` gained a `dtype` parameter and an optional `host_init` callable for ops where the in-kernel `src[i] = i + 1` would overflow (mul) or collapse to a single bit (and/or/xor).
…init test

The previous version read `qd.lang.impl.current_cfg().arch` once before any `qd.reset()` and reused that enum value across the loop. On at least the AMDGPU backend the reused enum reads back as garbage after reset (the live config object is torn down with the runtime), and `qd.init` then falls back to x64 where `Program::subgroup_size()` returns 0. Take `req_arch` straight from the fixture parameter so it survives reset, mirroring `test_qd_tensor_across_reset_and_reinit`.

Background: every AMDGPU shuffle (incl. shuffle_xor) lowers through
amdgpu_shuffle_i32 / _shuffle_down_i32 / _shuffle_up_i32, which all dispatch
to llvm.amdgcn.ds_bpermute. On CDNA the SIMD is 64-wide so ds_bpermute
addresses the entire wave64; on RDNA gfx10+ the SIMD is 32 lanes wide and
ds_bpermute is scoped to a single SIMD32 cluster -- a lane in the bottom
half cannot read the top half even though Quadrants forces wave64.

Result: cross-half shuffles (mask=32, offset>=32) silently wrap mod 32 and
corrupt log2_size=6 reductions / scans / votes -- the
all_true_full / any_true_full / all_equal_full failures observed on
RX 7900 XTX.

Fix: introduce a wave64-aware cross-half helper that pairs ds_bpermute with
llvm.amdgcn.permlane64 (single-instruction SIMD swap) and picks the right
copy based on which half the target lane sits in. amdgpu_shuffle_i32,
_shuffle_down_i32 and _shuffle_up_i32 now all route through it; the i64,
f32 and f64 wrappers inherit the fix for free.

CDNA path is unchanged in observable behaviour: ds_bpermute already covers
the whole wave64, so the permlane64-branch is dead-code-selected. Cost
there is one extra v_permlane64 + v_cndmask per i32 shuffle; in practice
this is dwarfed by the existing ~50-cycle LDS-permute latency.

Patches llvm.amdgcn.permlane64 in llvm_context.cpp so the new stub gets
linked at module load.
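
A pure-Python model of the cross-half selection (illustrative only; the real
helper is a C++ runtime function emitting ds_bpermute / permlane64 / select):

  def cross_half_shuffle(values, targets):
      # values, targets: 64-entry per-lane arrays for one wave64.
      # ds_bpermute on RDNA reads only within the reader's own SIMD32 half:
      same_half = [values[(l & 32) | (targets[l] & 31)] for l in range(64)]
      # permlane64 swaps the two halves; bpermute of the swapped copy reads the other half:
      swapped = [values[l ^ 32] for l in range(64)]
      other_half = [swapped[(l & 32) | (targets[l] & 31)] for l in range(64)]
      # pick per lane based on which half the target lane sits in:
      return [other_half[l] if (targets[l] & 32) != (l & 32) else same_half[l]
              for l in range(64)]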

llvm.amdgcn.permlane64 is overloaded on its element type (signature
``T -> T`` with ``T`` matching the input); CreateIntrinsic needs the
explicit type list to resolve the mangled declaration, otherwise it
segfaults inside getDeclaration when patching at module-init.

Mirrors the existing amdgpu_ballot_w32 / amdgpu_ballot_w64 patches.
Verified on amddesktop (RX 7900 XTX, gfx1100 wave64): the 5
previously-failing tests now pass --
all_{true,any,equal}_full[arch=amdgpu] and
segmented_reduce_{add,max}_log2_size_6[arch=amdgpu].
…m target

Iterated on amdgpu_cross_half_shuffle_i32 until it works for every shuffle
shape on RDNA wave64. Key empirical findings (verified on RX 7900 XTX,
gfx1100 wave64):

* ``ds_bpermute_b32`` on RDNA is SIMD32-scoped: a lane in the top half
  cannot read the bottom half (and vice versa) when the byte address is
  per-lane. Need ``permlane64`` to bring the partner SIMD's payload
  across and pick between the two reads based on which half the target
  lane is in.

* When the byte address is uniform across the wave, LLVM's uniformity
  analysis can lower ``llvm.amdgcn.ds.bpermute`` to a ``v_readlane_b32``
  style instruction that addresses lanes 0..31 wave-globally rather than
  SIMD32-locally -- which silently breaks cross-half reads with
  compile-time-constant target lanes (broadcast / shuffle with a u32
  literal). The fix is an empty ``+v`` inline asm marking ``byte`` as an
  opaque per-lane VGPR, which forces LLVM to emit the real
  ``ds_bpermute_b32`` and gets the SIMD-local semantics our
  ``permlane64`` pairing relies on.

Net cost: zero on the per-lane path (we'd be issuing real
``ds_bpermute_b32`` anyway); zero on the uniform path (same instruction
shape, one extra ``v_permlane64_b32`` + ``v_cndmask_b32`` we already
emit). On CDNA the wave is one SIMD64 so the cross-half branch is dead
but ``permlane64`` is still available and free.

Adds five regression tests exercising the previously-broken paths on
wave64 hardware (skipped on wave32 backends where the cross-half partner
is out of range):
* shuffle_xor with mask=32
* shuffle_down with offset>=32
* shuffle_up with offset>=32
* shuffle to absolute lanes in the top half (constant target, exercises
  the inline-asm fence)
* reduce_add_full absolute correctness over the full subgroup
Update subgroup.md to reflect the new permlane64-based lowering for
shuffle / shuffle_down / shuffle_up on AMDGPU wave64:

* Drop the implication that cross-half shuffles are broken on RDNA --
  they now work uniformly on every AMDGPU target via the
  permlane64 + ds_bpermute + select sequence described in a new
  ``AMDGPU wave64 cross-half lowering`` section.
* Update the support-matrix footnote, the per-op (shuffle /
  shuffle_down / shuffle_up / broadcast) backend notes, and the
  performance bullet to point at the new section.
* Mention the inline-asm fence that pins ds_bpermute to its
  SIMD32-local form when the byte address is uniform, since anyone
  reading the generated IR will see it.

No code changes -- this is doc-only.
…e / scan

Add four wave64-only tests that hit the cross-half (offset>=32) shuffle
step in each of the four shuffle-tree families that the AMDGPU wave64
cross-half lowering touches:

* ``test_subgroup_reduce_add_log2_size_6`` -- ``shuffle_down`` tree at
  offset 32 (the last step on a 64-lane reduction).
* ``test_subgroup_reduce_all_add_log2_size_6`` -- ``shuffle_xor``
  butterfly with ``mask = 32`` (the broadcast-to-all form's last step).
* ``test_subgroup_inclusive_add_log2_size_6`` -- ``shuffle_up`` Hillis-
  Steele scan, final step at offset 32.
* ``test_subgroup_exclusive_add_log2_size_6`` -- same shape as the
  inclusive scan plus an extra ``shuffle_up`` and the lane-0 identity
  substitution.

All four are gated to ``arch=qd.amdgpu`` (the only currently-forced-
wave64 target) and use ``i32`` only -- the dtype-lowering and the
unroll-depth code paths are orthogonal, so testing every dtype at
``log2_size = 6`` would be redundant with the existing parameterized
matrix at ``log2_size = 5`` (every dtype) plus these new tests at
``log2_size = 6`` (one dtype). The matching ``_full`` variant tests
already cover every dtype at ``log2_size = log2_group_size()``.

Each test is a one-line wrapper around the existing per-family helper
(``_check_reduce_lane0`` / ``_check_reduce_all`` / ``_check_inclusive_scan`` /
``_check_exclusive_scan``), which at ``log2_size = 6`` and ``N = 64`` reduces
to a single window spanning the whole wave64 wave and absolute-checks every
lane against a CPU oracle.
Replace the cartesian ``parametrize(dtype) x parametrize(log2_size,
[1..5])`` matrix on the 20 sized reduce / scan tests with a single
hand-picked ``parametrize(dtype, log2_size, ...)`` table per category.

The two axes are orthogonal in the lowering:
* log2_size controls unroll depth of the same shuffle tree -- the tree
  shape is dtype-agnostic, so once we've verified the tree at one dtype
  we mostly just need spot checks at the other dtypes.
* dtype controls the lowering inside ``subgroup.shuffle*`` (64-bit
  values split into two 32-bit shuffles on AMDGPU, f64/i64 skipped on
  Metal/MoltenVK) -- that lowering is independent of log2_size.

Each scenario table holds ``log2_size = 1`` (shortest tree, boundary)
and ``log2_size = 5`` (full wave32 / 32-lane window on wave64) for
``i32``, plus one row per non-i32 dtype at ``log2_size = 5``.

Three tables for the three dtype-list shapes:
* ``_SCENARIOS_FULL_DTYPE`` (6 cases) -- the 8 reduce_* / inclusive_add
  / exclusive_add tests that accept all five dtypes.
* ``_SCENARIOS_I32_AND_FLOATS`` (4 cases) -- the 6 inclusive_mul /
  inclusive_min / inclusive_max / exclusive_mul / exclusive_min /
  exclusive_max tests.
* ``_SCENARIOS_INT`` (4 cases) -- the 6 bitwise inclusive_and / or /
  xor / exclusive_and / or / xor tests.

Net effect: ~96 cases across the 20 tests instead of the previous ~380
cartesian cases (~4x reduction). Full wave64 absolute-correctness is
already covered by the dedicated ``test_subgroup_*_log2_size_6`` tests
+ the ``_full`` variant tests added earlier in this branch, so the
trimmed bulk matrix doesn't lose meaningful coverage.