Changes from all commits (92 commits)
923614f
[API] Rename qd.simt.subgroup.barrier/memory_barrier -> sync/mem_fence
hughperkins May 7, 2026
c6ea97f
[API] Make qd.simt.subgroup data-movement ops universal across backends
hughperkins May 7, 2026
cb0a91d
[Tests] Cover the rename + alias / sync / mem_fence additions
hughperkins May 7, 2026
233b08c
[API] Make subgroup.sync()/mem_fence() universal across CUDA + AMDGPU…
hughperkins May 7, 2026
2b3410f
Revert "[API] Make subgroup.sync()/mem_fence() universal across CUDA …
hughperkins May 7, 2026
f4403c8
[API] Implement subgroup.sync()/mem_fence() universally on CUDA + AMD…
hughperkins May 7, 2026
4f1eb99
[API] Make subgroup.group_size() and subgroup.elect() universal acros…
hughperkins May 7, 2026
3890ba9
[API] Make subgroup.inclusive_add(v, log2_size) universal across CUDA…
hughperkins May 8, 2026
4b5ae0e
[API] Make all seven subgroup.inclusive_* ops universal across CUDA +…
hughperkins May 8, 2026
71d3199
[test] Bound test_subgroup_inclusive_mul inputs so 32-way product fit…
hughperkins May 8, 2026
2fffd73
[API] Implement subgroup.exclusive_* across CUDA + AMDGPU + SPIR-V
hughperkins May 8, 2026
99c0087
[test] Verify all groups + both subgroups for inclusive_*/exclusive_*…
hughperkins May 8, 2026
a2a2917
[fix] Replace `if` with `qd.select` in inclusive/exclusive scans (Met…
hughperkins May 8, 2026
38927ba
[API] Implement subgroup.{all_true,any_true,all_equal} universally
hughperkins May 8, 2026
9638acf
[doc] Drop qd.simt.warp pointer from subgroup voting ops section
hughperkins May 8, 2026
fbdc91a
[fix] Fix Metal inclusive/exclusive scan miscompilation
hughperkins May 8, 2026
785bb27
[test] Cover predicate-truthy + float-equality contracts for voting ops
hughperkins May 8, 2026
2f94298
Merge branch 'main' into hp/cross-gpu-subgroup
hughperkins May 9, 2026
1519823
[chore] Make pre-commit + pyright pass
hughperkins May 9, 2026
214a0cf
[doc] Reflow new subgroup comments / docstrings to project's 120c width
hughperkins May 9, 2026
4a91b8b
[doc] Reflow more C++ subgroup comments to 120c
hughperkins May 9, 2026
b0fb964
[doc] Reflow more Python subgroup docstrings to 120c
hughperkins May 9, 2026
1525626
[doc] Drop 'Cells marked no' sentence from subgroup.md
hughperkins May 9, 2026
c2aa6dd
Merge branch 'main' into hp/cross-gpu-subgroup
hughperkins May 9, 2026
6ed3aad
[feat] Add subgroup.reduce_min / _max / _all_min / _all_max
hughperkins May 10, 2026
6b88c00
Add cross-GPU subgroup.ballot(predicate) primitive
hughperkins Apr 30, 2026
119b3fa
Apply pre-commit formatting (black, clang-format)
May 9, 2026
35d6c9c
[SPIR-V] Use public accessor for v4_u32 type in ballot codegen
May 9, 2026
4029944
[feat] Add subgroup.segmented_reduce_add on top of subgroup.ballot
hughperkins May 10, 2026
61a89eb
[fix] Make `qd.clz` work on u32 / u64 across all backends
hughperkins May 10, 2026
1f4383a
[fix] AMDGPU: lower `qd.clz` via LLVM `ctlz` intrinsic
hughperkins May 10, 2026
abfed77
[doc] Reflow new subgroup comments / docstrings closer to 120c
hughperkins May 10, 2026
0f83e23
[feat] Add subgroup.lanemask_{lt,le,eq,gt,ge}(lane_id)
hughperkins May 10, 2026
0567225
[feat] Add subgroup.segmented_reduce_min / _max
hughperkins May 10, 2026
89772e8
[feat] Replace subgroup.ballot(p) with ballot_first_n + ballot_full_s…
hughperkins May 10, 2026
62667a9
[doc] Cite LLVM ballot.iN wave-mismatch lowering in AMDGPU codegen + …
hughperkins May 10, 2026
25a9217
[doc] Update math.md clz support matrix; drop AMDGPU clz xfail; add u…
hughperkins May 10, 2026
ddb8e25
[fix] Make segmented_reduce_* correct on wave64 (lanes 32..63)
hughperkins May 10, 2026
a5319f6
[fix] Work around LLVM AMDGPU isel bug for ballot.i32 on wave64
hughperkins May 10, 2026
bc049db
Merge branch 'hp/cross-gpu-subgroup' into hp/new-qipc-ops-subgroup
hughperkins May 10, 2026
2e6788e
Merge branch 'main' into hp/cross-gpu-subgroup
hughperkins May 10, 2026
25dd3cb
Merge branch 'hp/cross-gpu-subgroup' into hp/new-qipc-ops-subgroup
hughperkins May 10, 2026
64d3a24
[doc] Document log2_size windowing across the full subgroup
hughperkins May 11, 2026
9832309
[doc] Document log2_size windowing across the full subgroup
hughperkins May 11, 2026
c7aba60
Merge remote-tracking branch 'origin/hp/cross-gpu-subgroup' into hp/n…
hughperkins May 11, 2026
e61906f
[doc] Correct shuffle_down / shuffle_up windowing framing
hughperkins May 11, 2026
9e3e043
[doc] Drop redundant wave64 gotcha sentence from reduce result-placem…
hughperkins May 11, 2026
0f4e180
[doc] Trim per-op windowing mechanism breakdown from subgroup windowi…
hughperkins May 11, 2026
eedaca2
[doc] State explicitly that voting / predicate ops are windowed
hughperkins May 11, 2026
705147c
[doc] State explicitly that reductions / scans are windowed
hughperkins May 11, 2026
5835883
Merge remote-tracking branch 'origin/main' into hp/cross-gpu-subgroup
hughperkins May 11, 2026
0382cb3
[doc] Note subgroup.group_size() is for use inside kernels only
hughperkins May 11, 2026
86e7631
[doc] Trim host-side hardcoding aside from subgroup.group_size() note
hughperkins May 11, 2026
1e21230
[doc] Trim CPython-vs-IR aside from subgroup.group_size() note
hughperkins May 11, 2026
88f0e18
[doc] Fix broken #how-log2_size-windowing-works anchor links in subgr…
hughperkins May 11, 2026
566a47b
[doc] Address PR #665 review: refresh subgroup-op deletion comment
hughperkins May 11, 2026
ec1a65d
[style] clang-format wrap on internal_ops.inc.h subgroup comment
hughperkins May 11, 2026
b61a5c3
Merge branch 'hp/cross-gpu-subgroup' into hp/new-qipc-ops-subgroup
hughperkins May 12, 2026
7fa26de
[doc] Fix broken #how-log2_size-windowing-works anchor in reduce_min/…
hughperkins May 12, 2026
57234b0
[fix] Restore subgroupBallotU32/U64 codegen lost in cross-gpu-subgrou…
hughperkins May 12, 2026
dd97253
[doc] Drop stale sync()-after-inclusive-scan advice in _inclusive_sca…
hughperkins May 12, 2026
6e58b1e
Merge branch 'main' into hp/cross-gpu-subgroup
hughperkins May 12, 2026
aac7551
[fix] CUDA subgroup.mem_fence(): rename stale block_memfence -> block…
hughperkins May 12, 2026
fcbac8f
Merge branch 'hp/cross-gpu-subgroup' into hp/new-qipc-ops-subgroup
hughperkins May 12, 2026
ce98908
Merge remote-tracking branch 'origin/main' into hp/cross-gpu-subgroup
hughperkins May 12, 2026
7d18244
Merge branch 'hp/cross-gpu-subgroup' into hp/new-qipc-ops-subgroup
hughperkins May 12, 2026
453600b
Merge remote-tracking branch 'origin/main' into hp/new-qipc-ops-subgroup
hughperkins May 12, 2026
122451d
[fix] Re-apply AMDGPU ballot.i32 isel workaround lost in merge restore
hughperkins May 12, 2026
379ab96
[doc] Add backend / final-compiled-artifact table to subgroup.group_s…
hughperkins May 12, 2026
81a9b8c
[doc] Replace 'trace time' with 'compile time' in user-facing docs / …
hughperkins May 12, 2026
3a7d06e
[doc] Also replace 'trace time' with 'compile time' in decompositions.md
hughperkins May 12, 2026
ee0f722
[subgroup] Add Program::subgroup_size() + qd.simt.subgroup.{group_siz…
hughperkins May 12, 2026
13b5915
[subgroup] Lift segmented_reduce_* to log2_size=6 on wave64 + add _fu…
hughperkins May 12, 2026
62e4c17
[subgroup] pre-commit: clang-format + replace else-after-return in _s…
hughperkins May 12, 2026
14b5553
[subgroup] tests: full _full variant matrix, exclusive_min/max_full i…
hughperkins May 12, 2026
00d2f25
[subgroup] test: read req_arch/req_options from fixture for stable-re…
hughperkins May 12, 2026
1e15bdc
[subgroup] amdgpu: fix wave64 cross-half shuffle on RDNA via permlane64
hughperkins May 12, 2026
3c78160
[subgroup] amdgpu: pass i32 overload type to permlane64 patch_intrinsic
hughperkins May 12, 2026
7342caf
[subgroup] amdgpu: fix cross-half helper for both per-lane and unifor…
hughperkins May 12, 2026
da2506d
[subgroup] docs: document AMDGPU wave64 cross-half shuffle lowering
hughperkins May 12, 2026
d1ea772
[subgroup] test: add log2_size=6 absolute-correctness tests for reduc…
hughperkins May 12, 2026
6fefebe
[subgroup] test: lean parameterization for sized reduce / scan tests
hughperkins May 12, 2026
7b8682c
[subgroup] test: extend lean parameterization to segmented_reduce, al…
hughperkins May 13, 2026
05d490c
[subgroup] test: black auto-format wave64 cross-half shuffle asserts
hughperkins May 13, 2026
8432ce7
[subgroup] fix: gate AMDGPU VGPR asm fence on ARCH_amdgpu
hughperkins May 13, 2026
36102a8
Merge branch 'main' into hp/new-qipc-ops-subgroup
hughperkins May 13, 2026
802d31c
[subgroup] style: wrap docstrings exceeding 120c
hughperkins May 13, 2026
d7c4002
[subgroup] doc: address PR #676 review comments on subgroup.md
hughperkins May 13, 2026
501659f
[subgroup] doc: drop misleading same-half overhead claim
hughperkins May 13, 2026
81f9fbe
[subgroup] doc: mention _full variants in voting / reductions intro p…
hughperkins May 13, 2026
f923dac
[subgroup] api: rename ballot_full_subgroup -> ballot_full
hughperkins May 13, 2026
d07644e
[subgroup] api: rename to _tiled suffix convention (breaking change)
hughperkins May 13, 2026
4 changes: 2 additions & 2 deletions docs/source/user_guide/algorithms.md
@@ -15,7 +15,7 @@ Device-wide algorithms — primitives that consume and produce whole arrays, exe

### `qd.algorithms.parallel_sort(keys, values=None)`

-In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `qd.field` — `parallel_sort` reaches into `snode.ptr.offset` internally, so `ndarray` is **not** supported and will fail at trace time with an `AttributeError`.
+In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `qd.field` — `parallel_sort` reaches into `snode.ptr.offset` internally, so `ndarray` is **not** supported and will fail at compile time with an `AttributeError`.

```python
import quadrants as qd
@@ -61,7 +61,7 @@ Constraints:

- **Dtype:** `qd.i32` only. Calling with any other dtype raises `RuntimeError("Only qd.i32 type is supported for prefix sum.")`.
- **Inclusive only.** No exclusive variant exposed. To convert to exclusive, post-process: `exclusive[i] = inclusive[i] - input_original[i]`.
-- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at trace time.
+- **Backend coverage.** CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at compile time.

The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. Those per-block totals need scratch storage, and the executor allocates that buffer once and reuses it across calls — which is why it's a class (allocate once, run many times) rather than a free function.
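A usage sketch of that allocate-once pattern, including the exclusive conversion from the constraints list above. The executor class name `PrefixSumExecutor`, its `run()` method (and in-place behavior), the `@qd.kernel` decorator, and `copy_from()` are all assumptions for illustration — this page only states that the scan is exposed as a class:

```python
import quadrants as qd

qd.init(arch=qd.cuda)  # backend coverage: CUDA and Vulkan only

n = 1024
a = qd.field(qd.i32, shape=n)     # qd.i32 is the only supported dtype
orig = qd.field(qd.i32, shape=n)  # copy kept around for the exclusive conversion

# Allocate once, run many times -- class/method names assumed for illustration.
executor = qd.algorithms.PrefixSumExecutor(n)

@qd.kernel  # decorator name assumed
def to_exclusive():
    for i in a:
        a[i] -= orig[i]  # exclusive[i] = inclusive[i] - input_original[i]

orig.copy_from(a)  # preserve the original input (copy_from assumed)
executor.run(a)    # in-place inclusive scan (in-place behavior assumed)
to_exclusive()     # a now holds the exclusive scan
```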

2 changes: 1 addition & 1 deletion docs/source/user_guide/atomics.md
@@ -59,7 +59,7 @@ Atomically writes back `min(x, y)` (resp. `max(x, y)`); returns the old value of

### `qd.atomic_and(x, y)` / `qd.atomic_or(x, y)` / `qd.atomic_xor(x, y)`

-Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at trace time.
+Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type error at compile time.
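A minimal sketch of both contracts, assuming the usual `@qd.kernel` decorator (not shown on this page) and zero-initialized fields: the bitwise atomic returns the pre-update value, and the fields must stay integer-typed to avoid the compile-time type error just described.

```python
import quadrants as qd

qd.init()

flags = qd.field(qd.i32, shape=())   # integer dtype: required for bitwise atomics
lowest = qd.field(qd.i32, shape=())

@qd.kernel  # decorator name assumed
def mark(lane_bit: qd.i32, candidate: qd.i32) -> qd.i32:
    old = qd.atomic_or(flags[None], lane_bit)  # set a bit, get the bits set before
    qd.atomic_min(lowest[None], candidate)     # writes back min(...), returns the old value
    return old

lowest[None] = 1 << 30
print(mark(0b100, 17))  # 0: no flags were set before the first call
print(mark(0b010, 99))  # 4 (0b100): the previous call's bit is visible
```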

### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`

2 changes: 1 addition & 1 deletion docs/source/user_guide/block.md
@@ -2,7 +2,7 @@

Block-level primitives operate on the threads of a single CUDA thread block (CTA) / AMDGPU workgroup / Vulkan or Metal workgroup. They include thread barriers, memory fences, shared memory, and per-thread indexing helpers — the building blocks for cooperation among threads of the same block.

-Block ops live under `qd.simt.block`. They are written so the same Python source compiles to the right vendor primitive on each backend. As of this writing every op on this page is portable across CUDA, AMDGPU, Vulkan, and Metal; the only remaining caveat (called out in the support-table footnote below) is a perf trade-off for the emulated `block.sync_*_nonzero` ops on non-CUDA backends, not a correctness gap. If a future op is added that is not yet portable, the Python layer will raise `ValueError` at trace time on the unsupported backend.
+Block ops live under `qd.simt.block`. They are written so the same Python source compiles to the right vendor primitive on each backend. As of this writing every op on this page is portable across CUDA, AMDGPU, Vulkan, and Metal; the only remaining caveat (called out in the support-table footnote below) is a perf trade-off for the emulated `block.sync_*_nonzero` ops on non-CUDA backends, not a correctness gap. If a future op is added that is not yet portable, the Python layer will raise `ValueError` at compile time on the unsupported backend.

The closely related device-scope memory fence is documented separately in [grid](grid.md). Users picking between a block-scope and a device-scope fence should read that page for the device-scope side.
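A minimal sketch of block-level cooperation in this style. Only the `qd.simt.block` namespace and the existence of barriers and shared memory come from this page; the helper names `qd.simt.block.SharedArray`, `qd.simt.block.sync()`, `qd.loop_config(block_dim=...)`, and the `@qd.kernel` decorator are assumptions for illustration:

```python
import quadrants as qd

qd.init(arch=qd.cuda)  # the same source should compile on AMDGPU / Vulkan / Metal

N = 1024
BLOCK = 128
src = qd.field(qd.f32, shape=N)
out = qd.field(qd.f32, shape=N)

@qd.kernel  # decorator name assumed
def box_blur():
    qd.loop_config(block_dim=BLOCK)  # helper name assumed
    for i in range(N):
        tile = qd.simt.block.SharedArray((BLOCK,), qd.f32)  # name assumed
        t = i % BLOCK  # this thread's slot within its block
        tile[t] = src[i]
        qd.simt.block.sync()  # barrier: all writes to tile are visible past here
        lo = max(t - 1, 0)
        hi = min(t + 1, BLOCK - 1)
        out[i] = (tile[lo] + tile[t] + tile[hi]) / 3.0
```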

2 changes: 1 addition & 1 deletion docs/source/user_guide/decompositions.md
@@ -16,7 +16,7 @@ All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.polar_decompose`, `qd

A few patterns to note:

-- **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at trace time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box.
+- **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at compile time (`"SVD only supports 2D and 3D matrices."`, etc.). Larger matrices need a different path — typically a Jacobi-style sweep applied iteratively, which Quadrants does not currently provide out of the box.
- **FIXME (message wording):** these exception strings are misleading — "2D matrix" / "3D matrix" conventionally means "rank-2 / rank-3 tensor" (any matrix is rank-2), but here the intent is "matrix of shape 2×2 / 3×3". They should be updated to e.g. `"SVD only supports 2×2 and 3×3 matrices."`. This page reproduces the messages as they are emitted today.
- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant.
- **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs.
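A minimal sketch tying these points together. The `qd.Matrix` constructor, the `@qd.kernel` decorator, and the `(U, S, V)` return convention with `A ≈ U @ S @ V.transpose()` are assumptions beyond what this page states:

```python
import quadrants as qd

qd.init()  # default_fp is qd.f32 unless overridden here

@qd.kernel  # decorator name assumed
def svd_residual() -> qd.f32:
    A = qd.Matrix([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 1.0],
                   [0.0, 1.0, 2.0]])  # 3x3: one of the two supported shapes
    U, S, V = qd.svd(A, dt=qd.f64)    # dt overrides the default_fp fallback
    # Output shapes match the input: U, S, V are all 3x3 here.
    return (U @ S @ V.transpose() - A).norm()

print(svd_residual())  # ~0 up to rounding
```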