[GPU] New QIPC ops for block #684
Conversation
… reduce)
Ports qipc's `qipc/_src/core/block/reduce.py` into Quadrants as
`qd.simt.block.reduce_{add,min,max}` (lane-0-only) and
`qd.simt.block.reduce_all_{add,min,max}` (broadcast). Closes one of the
three block-tier rows in `qipc_gaps_block.md`.
Strategy is CUB's `BLOCK_REDUCE_WARP_REDUCTIONS`: per-warp `shuffle_down`
tree reduce -> lane 0 of each warp publishes the warp aggregate into a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> thread 0
sequentially folds the warp aggregates with the same operator. The
`reduce_all_*` flavour adds one extra `block.sync()` plus a one-slot
shared-memory broadcast. When the block is exactly one warp the
shared-memory path is short-circuited at trace time via a
`impl.static(NUM_WARPS == 1)` guard, so single-warp blocks pay only the
per-warp tree.
Generic `block.reduce(value, tid, block_dim, log2_warp, op, dtype)` and
`block.reduce_all(...)` accept an arbitrary template binary `op` for
custom monoids; the `_add`/`_min`/`_max` shortcuts wire in the binary
ops already exported by `simt/subgroup.py`.
API mirrors qipc: the caller passes `tid` explicitly (Quadrants does
not yet have a portable in-kernel `block.thread_idx()` on CUDA /
AMDGPU; pass `i % block_dim`). `block_dim` and `log2_warp` are
`template()` so the call lowers to a fully-unrolled IR with no runtime
branches on warp size.
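For reference, a minimal sketch of that shape (the `@qd.func` decorator, the `qd.static(range(...))` unrolling helper, and the exact argument spellings are assumptions, not the shipped code; this signature also predates the `tid` / `log2_warp` removals later in this thread):

```python
# Sketch of BLOCK_REDUCE_WARP_REDUCTIONS as described above.
@qd.func
def reduce(value, tid, block_dim: qd.template(), log2_warp: qd.template(),
           op: qd.template(), dtype: qd.template()):
    WARP_SIZE = impl.static(1 << log2_warp)
    NUM_WARPS = impl.static(block_dim // WARP_SIZE)
    # Stage 1: per-warp shuffle_down tree, offsets WARP_SIZE/2 .. 1.
    for i in qd.static(range(log2_warp)):
        value = op(value, subgroup.shuffle_down(value, impl.static(WARP_SIZE >> (i + 1))))
    if impl.static(NUM_WARPS == 1):
        return value                 # single-warp short-circuit: no shared memory
    # Stage 2: lane 0 of each warp publishes its aggregate.
    partials = block.SharedArray((NUM_WARPS,), dtype)
    lane = tid % WARP_SIZE
    warp_id = tid // WARP_SIZE
    if lane == 0:
        partials[warp_id] = value
    block.sync()
    # Thread 0 folds the warp aggregates with the same operator.
    result = partials[0]
    for w in qd.static(range(1, NUM_WARPS)):
        result = op(result, partials[w])
    return result                    # lane-0-only contract: valid on thread 0
```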
Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, covering the single-warp
short-circuit (32) and the multi-warp shared-mem path (128, 256).
Reduce-min/max use a permuted (non-monotone) initialiser so the result
genuinely depends on every lane, not just the first or last. Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.
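For illustration, a sketch of the permuted-initialiser pattern (test and harness helpers here are hypothetical, not the file's literal code):

```python
# Hypothetical test shape; make_ndarray and the kernel spelling are assumptions.
@pytest.mark.parametrize("block_dim", [32, 128, 256])
def test_block_reduce_min(block_dim):
    @qd.kernel
    def k(out: qd.types.ndarray()):
        qd.loop_config(block_dim=block_dim)
        for i in range(block_dim):
            tid = i % block_dim
            v = (tid * 37 + 11) % block_dim   # permutation: min depends on every lane
            r = qd.simt.block.reduce_min(v, tid, block_dim=block_dim, log2_warp=5)
            if tid == 0:
                out[0] = r                    # lane-0-only result

    out = make_ndarray(qd.i32, 1)             # hypothetical allocation helper
    k(out)
    assert out[0] == min((t * 37 + 11) % block_dim for t in range(block_dim))
```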
Stacks on `hp/new-qipc-ops-subgroup` (which provides sized
`subgroup.reduce_*`, `_bin_*`, `shuffle_down`, `invocation_id`).
Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.reduce_{add,min,max}` and
`block.reduce_all_{add,min,max}` sections explain the contract, list
the args, and show a worked example.
… scan)
Ports qipc's `qipc/_src/core/block/scan.py` into Quadrants as
`qd.simt.block.inclusive_{add,min,max}` and
`qd.simt.block.exclusive_{add,min,max}`, plus the generic
`block.{inclusive,exclusive}_scan(..., op, [identity,] dtype)` for
custom monoids. Closes the second of three block-tier rows in
`qipc_gaps_block.md`.
Strategy is CUB's `BLOCK_SCAN_WARP_SCANS`: per-warp Hillis-Steele scan
via `subgroup.{_inclusive_scan, _exclusive_scan}` -> last lane of
every warp publishes the warp aggregate to a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> every thread
sequentially folds the cross-warp prefix and applies its own warp's
prefix to its scan value. Cross-warp prefix is computed redundantly
on every thread to avoid a second barrier (same trade-off CUB makes).
Inclusive: warp aggregate at the last lane is just the inclusive
value, written directly. Exclusive: warp aggregate is recovered as
`op(exclusive[last_lane], value[last_lane])`, since the exclusive
scan does not include the last lane's input. When the block is
exactly one warp the cross-warp shared-mem path is short-circuited at
trace time.
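A sketch of the exclusive flavour under the same assumed spellings (the per-warp scan helper comes from `simt/subgroup.py`; decorator and unrolling helper are assumptions):

```python
# Sketch of BLOCK_SCAN_WARP_SCANS (exclusive flavour) as described above.
@qd.func
def exclusive_scan(value, tid, block_dim: qd.template(), log2_warp: qd.template(),
                   op: qd.template(), identity, dtype: qd.template()):
    WARP_SIZE = impl.static(1 << log2_warp)
    NUM_WARPS = impl.static(block_dim // WARP_SIZE)
    lane = tid % WARP_SIZE
    warp_id = tid // WARP_SIZE
    excl = subgroup._exclusive_scan(value, op, log2_warp)  # per-warp Hillis-Steele
    if impl.static(NUM_WARPS == 1):
        return excl                                        # single-warp short-circuit
    aggs = block.SharedArray((NUM_WARPS,), dtype)
    if lane == WARP_SIZE - 1:
        # exclusive scan omits the last lane's own input, so the warp
        # aggregate is recovered as op(exclusive[last], value[last])
        aggs[warp_id] = op(excl, value)
    block.sync()
    # every thread folds the cross-warp prefix redundantly (saves a 2nd barrier)
    prefix = identity
    for w in qd.static(range(NUM_WARPS - 1)):
        if w < warp_id:
            prefix = op(prefix, aggs[w])
    return op(prefix, excl)
```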
API mirrors the new block reduce: caller passes `tid` explicitly,
`block_dim` and `log2_warp` are `template()`. `exclusive_add`
derives the additive identity from `value - value` (matches
`subgroup.exclusive_add`); `exclusive_min` / `exclusive_max` take an
explicit `identity` because no portable type-extreme is derivable
from `value` alone (matches `subgroup.exclusive_min`).
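Assumed call shapes, for illustration only:

```python
s = block.exclusive_add(v, tid, block_dim=128, log2_warp=5)   # identity derived as v - v
m = block.exclusive_min(v, tid, block_dim=128, log2_warp=5,
                        identity=2**31 - 1)                   # caller supplies the type-extreme
```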
Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, asserting per-thread
against a sequential CPU oracle. Min / max use a permuted
(non-monotone) input so the scan result genuinely depends on every
prefix step, not just the trailing or leading element. Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.
Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.inclusive_*` and `block.exclusive_*`
sections explain the contract, identity rules, and the cost profile.
Ports qipc's `qipc/_src/core/sort/_block_radix_rank.py` into Quadrants
as `qd.simt.block.radix_rank_match_atomic_or`. Closes the third (and
final) row in `qipc_gaps_block.md`.
Faithful port of CUB's
`BlockRadixRankMatchEarlyCounts<WARP_MATCH_ATOMIC_OR>` from
`cub/block/block_radix_rank.cuh` (the SM90 onesweep policy):
1. Per-warp digit histograms via shared-memory atomic_add.
2. Per-thread column-sum upsweep across warps (rewrites warp
histograms into per-warp running prefixes; yields per-thread
bin_count for digit == tid).
3. Block exclusive scan on bin_count (uses our newly-added
`block.exclusive_add`).
4. Downsweep: fold the block-wide exclusive prefix into every warp's
offset entry.
5. Per-key match via shared-memory atomic_or on a per-digit lane
   mask; leader (highest set lane) does a single atomic_add on the
   warp offset and broadcasts via `subgroup.shuffle`; each thread's
   rank = warp_offset + popc(bin_mask & lanemask_le) - 1 (sketched
   after this list).
6. Publish bins + exclusive_digit_prefix to caller-supplied
SharedArray outparams; one block.sync() retires before exit so
the caller can read them without an extra barrier.
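A sketch of step 5 under the `ops` / `subgroup` spellings used elsewhere in this PR (wave32; `smem_match` / `smem_offsets` names and the shared-array layout are assumptions):

```python
# Step 5 sketch only — names and layout are illustrative.
RADIX_DIGITS = impl.static(1 << radix_bits)
digit = (key >> bit_start) & ((1 << num_bits) - 1)
lane = tid % 32
# publish my lane bit into this warp's mask for my digit
ops.atomic_or(smem_match[warp_id * RADIX_DIGITS + digit], 1 << lane)
subgroup.sync()   # later wrapped in _warp_sync_fence(); see the fix commits below
bin_mask = smem_match[warp_id * RADIX_DIGITS + digit]
leader = 31 - ops.clz(ops.cast(bin_mask, qd.u32))  # highest set lane (clz on the u32; see the FindUMsb fix below)
warp_offset = 0
if lane == leader:                                 # one atomic_add per match group
    warp_offset = ops.atomic_add(
        smem_offsets[warp_id * RADIX_DIGITS + digit], ops.popcnt(bin_mask))
warp_offset = subgroup.shuffle(warp_offset, leader)
rank = warp_offset + ops.popcnt(bin_mask & subgroup.lanemask_le(lane)) - 1
```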
Improvements over the qipc port:
- WARP_SYNC (qipc's `warp.ballot(qd.cast(1, qd.i32))` hack) is
replaced by `subgroup.sync()` which lowers to `__syncwarp` /
`OpControlBarrier(ScopeSubgroup, ...)` / `s_barrier` on the
appropriate backend.
- CUB's `LaneMaskLe()` PTX intrinsic + qipc's `_lane_mask_le`
workaround function are replaced by the new portable
`subgroup.lanemask_le(invocation_id())` (saves ~13 LoC and
sidesteps the lane==31 overflow special case).
- `BlockScan::ExclusiveSum` is replaced by the just-added
`block.exclusive_add`, which drops the qipc kernel's reliance on
a private sequential cross-warp scan.
API: `radix_rank_match_atomic_or(key, tid, block_dim, log2_warp,
radix_bits, bit_start, num_bits, bins, excl_prefix)` — `key` is one
u32 per thread; `bins` and `excl_prefix` are
`block.SharedArray((1 << radix_bits,), qd.i32)` outparams. Returns
the per-thread stable rank. Currently constrained to
`block_dim == 1 << radix_bits`, `log2_warp == 5` (wave32), and
`items_per_thread == 1`; multi-item and wave64 paths are future work.
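An illustrative call site under those constraints (the surrounding kernel scaffolding, `key` and `tid`, are assumed to exist in the caller):

```python
# Hypothetical call site: 256 threads, 8 radix bits, one u32 key per thread.
bins = qd.simt.block.SharedArray((1 << 8,), qd.i32)
excl_prefix = qd.simt.block.SharedArray((1 << 8,), qd.i32)
rank = qd.simt.block.radix_rank_match_atomic_or(
    key, tid, block_dim=256, log2_warp=5,
    radix_bits=8, bit_start=0, num_bits=8,
    bins=bins, excl_prefix=excl_prefix)
# a block.sync() has already retired inside: bins / excl_prefix are readable
```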
Tests in `tests/python/test_simt.py` exercise three input
distributions against a CPU oracle:
- low_entropy: 16 distinct digits each repeated 16 times (heavy
match-path traffic with several lanes per warp colliding).
- uniform: full 16-bit uniform random.
- uniform_high_bits: digit drawn from bits [8, 16) (covers
bit_start > 0).
The oracle checks bins, excl_prefix, and the per-thread ranks
(uniqueness as a permutation of [0, 256) plus value match). Run on
`arch=qd.gpu`.
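The oracle amounts to a stable counting rank; a NumPy equivalent for reference (not the test file's literal code):

```python
import numpy as np

def radix_rank_oracle(keys, bit_start, num_bits):
    """Stable rank of each key's digit; mirrors bins / excl_prefix / ranks."""
    digits = ((keys >> bit_start) & ((1 << num_bits) - 1)).astype(np.int64)
    bins = np.bincount(digits, minlength=1 << num_bits)
    excl_prefix = np.concatenate(([0], np.cumsum(bins)[:-1]))  # exclusive prefix
    seen = np.zeros(1 << num_bits, dtype=np.int64)
    ranks = np.empty(len(keys), dtype=np.int64)
    for i, d in enumerate(digits):
        ranks[i] = excl_prefix[d] + seen[d]   # earlier equal digits rank first
        seen[d] += 1
    return bins, excl_prefix, ranks
```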
Documentation updated in `docs/source/user_guide/block.md`: support
table grows one row, new section explains the constraints, args,
cost, and shows a worked example.
`subgroup.sync()` lowers to `OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)` on SPIR-V — i.e. the Memory Semantics operand is **0**. Bare control barriers without memory semantics do not publish prior shared-memory writes to other lanes, so on Vulkan / Metal the radix-rank algorithm silently saw stale `atomic_or` / `atomic_add` results across the warp, producing off-by-one ranks (e.g. `actual_ranks[14] == 15` vs `expected[14] == 14` on the uniform_high_bits pattern).

Wrap the WARP_SYNC pattern in `_warp_sync_fence()`, which issues both `subgroup.sync()` and `subgroup.mem_fence()`. On CUDA, the explicit mem_fence is a redundant `__threadfence_block` — slight overhead but always correct. On SPIR-V, the mem_fence emits a real `OpMemoryBarrier(ScopeSubgroup, AcquireRelease | UniformMemory | WorkgroupMemory)` and restores CUB's `__syncwarp` shared-memory visibility invariant that the algorithm depends on.

Replaces three `subgroup_sync()` call-sites in `radix_rank_match_atomic_or`: the histogram-zero retire, the post-`atomic_or` retire, and the leader-clear retire.

The longer-term fix is to emit the proper memory semantics on `subgroupBarrier` in `spirv_codegen.cpp` (matching the pattern used by `subgroupMemoryBarrier`), but that lives on the subgroup branch and needs its own runtime rebuild — handled here in user code so this PR stays self-contained.
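The helper itself is two lines (sketch; import spellings assumed):

```python
@qd.func
def _warp_sync_fence():
    # CUDA: __syncwarp plus a redundant __threadfence_block — always correct.
    # SPIR-V: OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0) plus a real
    # OpMemoryBarrier, restoring __syncwarp's shared-memory visibility.
    subgroup.sync()
    subgroup.mem_fence()
```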
The Vulkan / Metal radix-rank failures were not the SPIR-V subgroupBarrier-without-memory-semantics quirk after all: pairing sync() with mem_fence() was correct for general robustness but did not fix the test.

Real bug: `leader = 31 - clz(cast(bin_mask, i32))` triggered SPIR-V's GLSL.std.450 `FindSMsb` on the i32, which for negative values (top bit set) returns the index of the most-significant **0**-bit, not the 1-bit. Concretely, when only the highest lane in a warp has a given digit, bin_mask = 0x80000000; `FindSMsb` on (i32)-2147483648 returns 30 (bit 30 is the highest 0-bit), so the leader was elected as lane `31 - 30 = 1` instead of lane 31. All non-leader lanes in the match group then read warp_offset from lane 1 (which had a stale 0 because lane 1 was *not* the actual leader and never atomic_add'd), producing the observed last-lane-of-warp off-by-one ranks.

Fix: pass the u32 directly to `clz` so SPIR-V dispatches `FindUMsb` (which returns the MSB-of-1 unconditionally and is already cross-backend on the subgroup branch). This matches what CUB's `__clz` does on CUDA and what AMDGPU emits via the new amdgcn-clz lowering.

Restore the `_warp_sync_fence` (subgroup.sync + subgroup.mem_fence) calls in the match phase — they're orthogonal to the clz fix but remain the right thing for memory visibility across the warp on SPIR-V (whose `OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)` has no memory semantics; pairing it with an explicit `OpMemoryBarrier(ScopeSubgroup, AcqRel | UniformMem | WorkgroupMem)` restores CUDA `__syncwarp`'s shared-memory visibility invariant).

CUDA was unaffected because `__clz` ignores signedness (it counts leading zeros bit-pattern-wise) — the failure only showed up on the SPIR-V GLSL extended instruction set.
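Before/after of the leader election (one line; the `ops.cast` spelling is assumed):

```python
leader = 31 - ops.clz(ops.cast(bin_mask, qd.i32))  # bug: i32 dispatches FindSMsb on SPIR-V
leader = 31 - ops.clz(bin_mask)                    # fix: u32 dispatches FindUMsb everywhere
```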
💡 Codex Review
Reviewed commit: d3e73bab45
```python
WARP_SIZE = impl.static(1 << log2_warp)
NUM_WARPS = impl.static(block_dim // WARP_SIZE)
...
inclusive = _subgroup_inclusive_scan(value, op, log2_warp)
```
Preserve operand order for generic scans
For custom `block.inclusive_scan` / `exclusive_scan` users with an associative but non-commutative operator, this produces the wrong prefix order: the reused subgroup scan folds the shuffled lower-lane value as the second operand (`op(current, previous)`) rather than `op(previous, current)`. The docs for these new generic APIs promise `op(v[0], ..., v[i])` and "custom monoids", so matrix-multiply-style monoids will return reversed/scrambled prefixes within each warp; the added add/min/max tests don't catch this because those ops are commutative.
…vocation_id()

On wave32 hardware (CUDA, Metal, RDNA AMDGPU, most Vulkan compute) the two are equivalent: the hardware wave size equals the logical 32-lane warp, so `invocation_id()` runs 0..31 within each warp.

On wave64 hardware (CDNA AMDGPU) the hardware wave is 64 lanes wide but our log2_warp=5 partitions a block into logical 32-lane warps — so within one hardware wave we have *two* logical warps (lanes 0..31 and 32..63 of the wave). `invocation_id()` runs 0..63 across that pair, meaning `invocation_id() == 0` only fires once per hardware wave instead of once per logical warp. The publish step `if lane_id == 0: shared[warp_id] = warp_agg` then skips every odd logical warp's aggregate, which on a 256-thread block produced sums missing 4 of 8 warps — exactly the failure mode the local amdcloud run hit (got 14400 vs expected 32896, the difference being warps 1,3,5,7).

Replace `invocation_id()` with `tid & (WARP_SIZE-1)` so the lane index always means "lane within the logical 32-lane warp". This matches CUB's `threadIdx.x & 31` recipe on CUDA wave32 and stays correct when the hardware wave is wider than the logical warp.

Note: this fix alone does *not* make `block.reduce` correct on wave64 with log2_warp=5 — the per-warp `_warp_reduce` / `subgroup.reduce_*` step still does 5-iteration shuffle_down trees with offsets 16..1 inside a 64-lane wave, and lane 48's offset-16 shuffle wraps to lane 0 (ds_bpermute address modulo wave bytes), so warp 1 of every wave gets a contaminated aggregate. Making block reduce/scan fully wave64-correct requires either passing log2_warp=6 on CDNA (so the warp size equals the hardware wave) or a width-clip on shuffle_down — both out of scope for this PR. Documented in the test parametrisation; CDNA support is a follow-up.

The radix_rank match phase still uses `invocation_id()` because it deliberately operates on the hardware wave (the bin_mask atomic_or + clz + popcnt leader pattern is wave-wide by construction; making it wave64-correct would need a u64 bin_mask and is also out of scope).
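The change itself is one line:

```python
lane_id = subgroup.invocation_id()   # bug: runs 0..63 across a wave64 hardware wave
lane_id = tid & (WARP_SIZE - 1)      # fix: 0..31 within the logical 32-lane warp
```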
| `block.SharedArray(shape, dtype)` | yes | yes | yes |
| `block.global_thread_idx()` | yes | yes | — |
| `block.thread_idx()` | no | no | yes |
| `block.reduce_{add,min,max}(v, tid, ...)` | yes | yes | yes |
> why do we need to pass in tid?
| `block.inclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.exclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.radix_rank_match_atomic_or(...)` | yes | yes | yes |
| `grid.memfence()` (device-scope, see below) | yes | no | no |
> why is this still in block.md?
- `value`: per-thread input.
- `tid`: calling thread's block-local index. Pass `i % block_dim` from a `qd.loop_config(block_dim=...)` kernel, or `qd.simt.block.thread_idx()` on backends that expose it.
- `block_dim`: threads per block (compile-time `template()`; must be a multiple of `2**log2_warp`).
> do we need to pass in block_dim?
Force-pushed 7f2538b to 009aba4.
Pulls in the upstream subgroup-branch refresh (block.sync_*_nonzero now emulated cross-GPU on AMDGPU/Vulkan/Metal, mem_sync renamed to mem_fence, block.thread_idx ported to CUDA/AMDGPU, the underscore-prefixed import convention) and merges with the new block reduce / scan / radix_rank work on top. Conflict resolution:

- python/quadrants/lang/simt/block.py: union the two import sets and align everything I added on the base branch's underscore convention (`func as _func`, `i32 as _i32`, `u32 as _u32`, `ops as _ops`, `subgroup as _subgroup`). All call sites in the new code now go through `_ops.atomic_add` / `_ops.cast` / `_ops.clz` / `_ops.popcnt` / etc. and `_subgroup.shuffle` / `_subgroup.invocation_id` / `_subgroup.lanemask_le` / `_subgroup.sync` / `_subgroup.mem_fence`. No behavioural change.
- docs/source/user_guide/block.md: take the base's wider support table (CUDA / AMDGPU / Vulkan / Metal columns) plus the `yes*` cells for the now-emulated sync_*_nonzero ops, and add the new rows for reduce / scan / radix_rank. The new rows carry `yes**` for AMDGPU with a follow-up footnote explaining the wave32-vs-wave64 log2_warp contract; radix_rank specifically is wave32-only and the cell says so. Also drop the stale `## Grid-scope fence` heading + body that an earlier round had left behind in this file (grid-scope is documented in grid.md, not here).
- tests/python/test_simt.py: auto-merged.
Force-pushed 009aba4 to 351017b.
…p/new-qipc-ops-block
…p/new-qipc-ops-block
… ops

block.reduce*, block.inclusive_*, block.exclusive_*, and block.radix_rank_match_atomic_or now read the calling thread's block-local index via block.thread_idx() internally rather than taking it as a parameter. The change is correctness-neutral and either neutral or a small win performance-wise: on every backend block.thread_idx() lowers to a single builtin register read that LLVM / spirv-opt / Metal-compiler CSE reliably, while non-power-of-two block_dim cases avoid a runtime modulo at the call site.

Also collapse the redundant _reduce / _reduce_all / _inclusive_block / _exclusive_block private helpers into the public reduce / reduce_all / inclusive_scan / exclusive_scan: the prior structure had each public op as a thin wrapper that only added the now-removed tid parameter, so the private layer no longer earns its keep. reduce_add / reduce_min / reduce_max etc. continue to delegate to the generic reduce / scan ops.
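Illustrative call-site diff (keyword spellings assumed):

```python
r = block.reduce_add(v, i % block_dim, block_dim=block_dim, log2_warp=5)  # before
r = block.reduce_add(v, block_dim=block_dim, log2_warp=5)                 # after: tid via block.thread_idx()
```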
…ally

Every block-scope reduction, scan, and the radix-rank primitive previously took log2_warp as a template parameter, requiring callers to plumb in 5 on wave32 backends or 6 on wave64. Now that subgroup.group_size() and subgroup.log2_group_size() return compile-time Python ints (32 on CUDA / Metal / Vulkan-on-NVIDIA, 64 on AMDGPU), each block op reads the subgroup size itself and the parameter is no longer plumbed at the call site.

Adds a compile-time guard (impl.static_assert) at the top of every block op that block_dim is a positive multiple of subgroup.group_size() -- catches misconfigurations like block_dim=32 on AMDGPU (wave64) with a clear error message instead of a silent NUM_WARPS=0 miscompile. The radix-rank op also gains a compile-time assert that subgroup.group_size() is 32: the atomic-OR match path is built on 32-lane i32 ballot masks, and the wave64 path is not yet implemented (parallel to the existing constraint in the docstring).
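Sketch of the intended guard (assertion wording taken from the failure message quoted later in this thread; the naive arithmetic here is what the follow-up commit below has to wrap in `impl.static`):

```python
WARP_SIZE = subgroup.group_size()            # compile-time Python int: 32 or 64
impl.static_assert(
    block_dim > 0 and block_dim % WARP_SIZE == 0,
    "block.reduce: block_dim must be a positive multiple of subgroup size")
```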
The base branch already did this sweep across most user-facing docs and docstrings; these mentions slipped in after the original cleanup and from test_atomic.py which the sweep missed. No semantic change.
…es Python ints

Inside a @qd.func body, plain Python ops like block_dim % WARP_SIZE get traced into Quadrants Expressions rather than evaluated as Python ints, so the new compile-time guards in reduce / scan / radix_rank were failing with "Static assert with non-static condition" on every backend.

Restoring the impl.static(...) wrappers around the template-int arithmetic (WARP_SIZE, NUM_WARPS, BLOCK_WARPS, the static_assert condition, SharedArray shape tuples, and the per-iteration shared-memory indices) forces evaluation at trace time and the constants reach static_assert / SharedArray as plain Python ints again, matching the pre-API-cleanup shape. Pure correctness fix; semantically equivalent to the previous shape.
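The restored shape, per the commit (sketch; the earlier guard sketched above traced `block_dim % WARP_SIZE` into an Expression):

```python
WARP_SIZE = impl.static(subgroup.group_size())
NUM_WARPS = impl.static(block_dim // WARP_SIZE)
impl.static_assert(
    impl.static(block_dim > 0 and block_dim % WARP_SIZE == 0),  # a plain Python bool at trace time
    "block.reduce: block_dim must be a positive multiple of subgroup size")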
…k.py uses it

Mirrors the existing subgroup._inclusive_scan / _exclusive_scan private helpers in shape -- generic op: template(), log2_size: template(), private because the generic-op contract is fragile and we don't want to invite ad-hoc subgroup reductions from arbitrary kernels. Nothing in subgroup.py routes through the new helper; the typed reduce_add / reduce_min / reduce_max keep their existing hard-coded bodies, so no existing IR moves.

block.reduce drops its local _warp_reduce helper and calls subgroup._reduce instead. This brings block.py's three per-subgroup steps into a consistent shape (subgroup._reduce / _inclusive_scan / _exclusive_scan), removes the duplicated 4-line shuffle tree, and lines up with the existing convention where block.py only calls into private subgroup helpers for the generic-op paths. Identical IR; pure code motion + one call-site rewrite.
block.py is cross-GPU code -- its abstractions (block, subgroup, lane, shuffle, ballot) come from the SPIR-V / Vulkan / portable subgroup model, not from CUDA's warp model. Calling the per-subgroup step a "warp" leaks CUDA terminology into a portable surface. This commit sweeps every cross-GPU identifier and prose mention from "warp" to "subgroup", and leaves only the genuinely CUDA-specific terms intact (`__syncwarp` in comments documenting CUDA's primitive, and the legacy `qd.simt.warp.*` test module which is CUDA-only).

block.py identifier renames:
- WARP_SIZE -> SUBGROUP_SIZE
- WARP_THREADS -> SUBGROUP_THREADS (radix-rank local)
- log2_warp -> log2_subgroup
- NUM_WARPS -> NUM_SUBGROUPS
- BLOCK_WARPS -> BLOCK_SUBGROUPS
- warp_id -> subgroup_id
- warp_idx -> subgroup_idx
- warp_agg -> subgroup_agg
- warp_prefix -> subgroup_prefix
- warp_count -> subgroup_count
- warp_offset -> subgroup_offset
- j_warp -> j_subgroup (radix-rank loop var)
- _warp_sync_fence -> _subgroup_sync_fence

CUB stage names: ComputeHistogramsWarp / ComputeOffsetsWarpUpsweep / ComputeOffsetsWarpDownsweep -> ...Subgroup / ...SubgroupUpsweep / ...SubgroupDownsweep (algorithm-internal names, not exported anywhere).

block.md: deletes the stale `**` footnote paragraph that claimed AMDGPU callers must pass log2_warp=6 -- the parameter no longer exists; subgroup size is read from subgroup.group_size() at compile time, and the same source compiles correctly on wave32 and wave64. Sweeps prose "warp" -> "subgroup" in the algorithm descriptions and cost expressions. Keeps "subgroup (warp / wavefront)" in the dual-term glossary line.

test_simt.py: sweeps prose in our block-reduce / block-scan / block-radix-rank section headers. Rewrites the stale `log2_warp is pinned to 5` comment to reflect the current compile-time auto-detection. Leaves the legacy qd.simt.warp.* tests, the segmented-reduce tests, and the `__syncwarp` mention in test_block_sync untouched.

Pure rename + doc cleanup; identical IR.
Force-pushed 5a2f552 to 5d1d17e.
Force-pushed 5d1d17e to f85c12c.
Splits the previously unioned temp-storage into two SharedArrays: smem_offsets (i32, always) and smem_match (i32 on wave32, i64 on wave64). The match phase (step 5) is now compile-time-gated by subgroup size: wave32 keeps the existing 32-bit ballot mask + u32 atomic_or / clz / popcnt; wave64 uses a 64-bit ballot mask + u64 atomic_or (native on AMDGPU LDS) + u64 clz / popcnt, with an inline u64 lanemask_le (subgroup.lanemask_le is u32-only by contract). Wave32 backends never see the i64 path, so portability does not depend on SPIR-V / Metal supporting 64-bit threadgroup atomics.
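Sketch of the compile-time gate (identifier names follow this thread's post-rename convention; `RADIX_DIGITS` and the layout are assumptions):

```python
# Only one branch is materialised at trace time, so wave32 backends
# never see the i64 path.
if impl.static(subgroup.group_size() == 32):
    smem_match = block.SharedArray((NUM_SUBGROUPS * RADIX_DIGITS,), qd.i32)
    # 32-bit ballot mask: u32 atomic_or / clz / popcnt, subgroup.lanemask_le
else:
    smem_match = block.SharedArray((NUM_SUBGROUPS * RADIX_DIGITS,), qd.i64)
    # 64-bit ballot mask: u64 atomic_or (native on AMDGPU LDS), u64 clz /
    # popcnt, inline u64 lanemask_le (subgroup.lanemask_le is u32-only)
```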
Force-pushed f85c12c to cd9e546.
Brings in the AMDGPU RDNA-wave64 cross-half shuffle fix in the subgroup runtime (permlane64-based), plus log2_size=6 absolute-correctness tests for subgroup reduce / scan. Unblocks amddesktop validation of the wave64 path in block.radix_rank_match_atomic_or.
Block reduce / scan / radix-rank impls require ``block_dim`` to be a positive multiple of the subgroup size (enforced by ``impl.static_assert``). AMDGPU is wave64 in Quadrants, so the existing ``block_dim=32`` parametrization is degenerate there: it fails at compile time with ``block.reduce: block_dim must be a positive multiple of subgroup size``. Add ``_skip_if_block_dim_lt_subgroup`` and call it from all 12 block-* dim-parametrized tests. CUDA / Vulkan / Metal still cover the full ``[32, 128, 256]`` matrix; AMDGPU exercises ``[128, 256]``. Also refresh the section-header comment to reflect the wave64 regime.
Picks up '[subgroup] test: lean parameterization for sized reduce / scan tests'.
…block_dim by arch

Previous parametrization `block_dim ∈ [32, 128, 256]` was wave32-shaped: on wave64 `block_dim=32` was unsupported and got skipped, and the single-subgroup short-circuit path (BLOCK_SUBGROUPS == 1) was never exercised on AMDGPU because no parameter value landed at `block_dim == 64`.

Switch the parameter axis to `sg_per_block ∈ [1, 4, 8]` and derive `block_dim` inside each test body from a host-side `_arch_subgroup_size()`. Each arch now covers its own canonical regimes:

- wave32 (CUDA / Vulkan-NVIDIA / Metal): block_dim ∈ [32, 128, 256]
- wave64 (AMDGPU): block_dim ∈ [64, 256, 512]

Both arches now exercise the single-subgroup short-circuit + multi-subgroup paths. The `_skip_if_block_dim_lt_subgroup` helper is no longer needed (no degenerate parameter values reach the kernel) and is removed.
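Sketch of the new axis (`_arch_subgroup_size` is named in the commit; its body here is a hypothetical probe):

```python
def _arch_subgroup_size():
    # hypothetical probe: wave64 on AMDGPU, wave32 elsewhere
    return 64 if current_arch() == "amdgpu" else 32

@pytest.mark.parametrize("sg_per_block", [1, 4, 8])
def test_block_reduce_add(sg_per_block):
    # wave32: 32 / 128 / 256; wave64: 64 / 256 / 512 — both regimes hit the
    # single-subgroup short-circuit (sg_per_block == 1) and multi-subgroup paths
    block_dim = sg_per_block * _arch_subgroup_size()
    ...
```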
Three assertion-message wrappings updated by black 25.1.0 (the version pinned in .pre-commit-config.yaml). Pure formatting; no semantic changes.
…ve64

block.md still said radix_rank requires subgroup_size == 32 and that AMDGPU (wave64) is unsupported; that's been fixed. Update the constraints bullet to reflect the wave32 / wave64 dispatch and refresh the shared-memory footprint breakdown to call out the wave-size-specific match-mask region (4 KiB i32 on wave32, 8 KiB i64 on wave64).
Summary
Closes the block-level QIPC gaps listed in perso_hugh/doc/qipc/qipc_gaps_block.md by adding three new families of block-scope primitives under `qd.simt.block`, stacked on the new subgroup primitives branch (hp/new-qipc-ops-subgroup).

- `block.reduce_{add,min,max}` / `block.reduce_all_{add,min,max}` — CUB-style two-stage warp-reduction tree (per-warp `subgroup.reduce_*` into shared memory, second pass reduces the per-warp partials). `_all` variants broadcast the result to every thread via shared memory.
- `block.{inclusive,exclusive}_{add,min,max}` — CUB-style block scan via warp scans (`subgroup.{inclusive,exclusive}_*` per warp, exclusive scan of warp aggregates in shared memory, fold prefix back into per-thread results).
- `block.radix_rank_match_atomic_or` — CUB's `BlockRadixRankMatchEarlyCounts` ATOMIC_OR path: per-warp histogram via `atomic_add` on shared memory, block-wide exclusive prefix over digit columns, intra-warp match via `atomic_or` on a per-digit lane mask, leader election via `clz` / `popcnt`, leader's `atomic_add` plus `subgroup.shuffle` broadcast for the warp offset.

Files
- `python/quadrants/lang/simt/block.py` — implementations (~370 LoC)
- `tests/python/test_simt.py` — 78 new parametrized tests (36 reduce + 36 scan + 6 radix-rank), all `arch=qd.gpu`
- `docs/source/user_guide/block.md` — user-facing docs with API tables, semantics, costs, and examples for each new primitive

Notes / gotchas resolved during development
- `LLVMRuntime function subgroupShuffleUp not found`: `rm -rf ~/.cache/quadrants/qdcache/*` fixes it; not a code bug.
- `FindSMsb` vs `FindUMsb` (commit d3e73bab4). `leader = 31 - clz(cast(bin_mask, i32))` produced wrong leaders on Vulkan when `bin_mask` had bit 31 set (e.g., a single-lane match in lane 31): `FindSMsb` on a negative i32 returns the position of the highest 0-bit, not the 1-bit, so the leader was elected as lane 1 instead of lane 31 and non-leader lanes read warp_offset 0 from a non-leader lane. Passing `u32` directly to `clz` dispatches `FindUMsb` and matches CUDA's `__clz`.
- `subgroupBarrier` has no memory semantics. `OpControlBarrier(ScopeSubgroup, ScopeSubgroup, MemorySemantics=0)` does not publish shared-memory writes, unlike CUDA's `__syncwarp`. `radix_rank` pairs every warp-sync inside the match phase with `subgroup.mem_fence()` (`_warp_sync_fence` helper) so the lane-mask publish/clear sequence has correct release/acquire ordering on Vulkan / Metal.
- `lane_id = invocation_id()` was wrong on wave64 (commit bac3f9f41). On CDNA AMDGPU the hardware wave is 64 lanes wide but log2_warp=5 partitions a block into 32-lane logical warps, so `invocation_id() == 0` only fires once per hardware wave instead of once per logical warp — every other logical warp's aggregate was silently dropped. Replaced with `tid & (WARP_SIZE-1)` (CUB's `threadIdx.x & 31` recipe). Wave32 backends are unaffected.

Tests
- `-k block` passing
- `-k block`

AMDGPU CDNA wave64 follow-up (not blocking this PR)
- `block.reduce` / `block.scan` work on AMDGPU only when the block is exactly one logical warp (`block_dim == 32` with log2_warp=5). Multi-warp blocks fail because `shuffle_down` on AMDGPU's `ds_bpermute` lowering wraps OOB indices modulo the wave size — i.e. on a wave64 wave, `shuffle_down(value, 16)` from lane 48 reads lane 0 instead of returning the lane's own value (CUDA's `__shfl_down_sync` semantics with `width=32`). `subgroup.reduce_*` / `subgroup.{inclusive,exclusive}_*` inherit the same constraint; the parent subgroup branch documents that `log2_size` must equal the hardware wave size on CDNA.
- `block.radix_rank_match_atomic_or` is wave32-only by construction: the bin_mask atomic_or + `clz` + `popcnt` leader pattern is keyed on a u32 lane mask. Wave64 would need a u64 bin_mask, u64 `clz` / `popcnt`, and 64-lane `subgroup.shuffle` for the warp_offset broadcast.

Both follow-ups are tracked separately and are not in scope for closing the QIPC block gaps (QIPC currently consumes these on CUDA and Vulkan).
Test plan
- `QD_WANTED_ARCHS=cuda pytest tests/python/test_simt.py -k block`
- `QD_WANTED_ARCHS=vulkan pytest tests/python/test_simt.py -k block`