Changes from all commits
44 commits
cd61958
[device-algos] add quadrants-level scratch field (single Field(u32), …
hughperkins May 13, 2026
e5aea0c
[device-algos] add qd.algorithms.device_reduce_{add,min,max}
hughperkins May 13, 2026
1ccb553
[device-algos] add qd.algorithms.device_exclusive_scan_{add,min,max} …
hughperkins May 13, 2026
5801bda
[device-algos] add qd.algorithms.device_select (stream compaction)
hughperkins May 13, 2026
dd4b689
[device-algos] add bit_cast scratch microbench
hughperkins May 13, 2026
00bb511
[device-algos] add qd.algorithms.device_radix_sort + deprecate parall…
hughperkins May 13, 2026
a98e7cf
[device-algos] add qd.algorithms.device_reduce_by_key_add
hughperkins May 13, 2026
b856e5a
Merge branch 'hp/new-qipc-ops-block' into hp/new-qipc-ops-device
hughperkins May 13, 2026
3e732f2
[device-algos] reset _scratch_bytes on qd.reset; add coverage tests
hughperkins May 13, 2026
1cd1c4a
[device-algos] reflow comments + docstrings from ~80c to 120c
hughperkins May 13, 2026
da89293
[device-algos] more coverage: reduce/scan @ N=1M + oversized, reset r…
hughperkins May 13, 2026
c030a80
[amdgpu] fence-wrap block_barrier to match HIP __syncthreads
hughperkins May 13, 2026
e1f6c0c
Merge remote-tracking branch 'origin/main' into hp/new-qipc-ops-device
hughperkins May 13, 2026
e912e8b
[style] replace 'trace time' with 'compile time' and em-dashes with A…
hughperkins May 13, 2026
0c44835
[device-algos] mark N=1M tests run_in_serial to fix amdgpu flake
hughperkins May 13, 2026
a57cd32
Merge origin/hp/new-qipc-ops-block (proper base branch) into hp/new-q…
hughperkins May 13, 2026
347946b
[device-algos] revise n_1m radix sort docstring with actual flake-inv…
hughperkins May 14, 2026
3ed880c
[device-algos] revert run_in_serial markers on 1M tests (no verified …
hughperkins May 14, 2026
695e169
[device-algos] device_reduce_{add,min,max} on f64 / i64 / u64
hughperkins May 14, 2026
d88c234
[device-algos] device_radix_sort on u64 / i64 / f64 keys
hughperkins May 14, 2026
5f52e33
[device-algos] device_select on 64-bit scalars and struct dtypes
hughperkins May 14, 2026
eec45ef
[device-algos] device_exclusive_scan_{add,min,max} on 64-bit scalars
hughperkins May 14, 2026
b39fe89
Merge remote-tracking branch 'origin/hp/new-qipc-ops-block' into hp/n…
hughperkins May 14, 2026
bb95f06
[device-algos] move _bin_{add,min,max} imports to reductions module
hughperkins May 14, 2026
1e0731d
[style] revert em-dash → ASCII-hyphen substitution in non-device-algo…
hughperkins May 14, 2026
40708c1
[style] revert em-dash drift in more non-device-algos files
hughperkins May 14, 2026
3790228
[device-algos] move bit_cast scratch microbench out of Quadrants
hughperkins May 14, 2026
8d3bd36
[device-algos] doc: fix sentence fragment in algorithms.md intro
hughperkins May 14, 2026
40e4985
[device-algos] doc: correct the algorithms.md intro framing
hughperkins May 14, 2026
e931cb6
[device-algos] simplify min/max API + drop keyword-only marker
hughperkins May 14, 2026
5f3bf87
[device-algos] doc: drop AMDGPU / Metal coverage disclaimer
hughperkins May 14, 2026
c1daf2b
[device-algos] doc: drop Onesweep follow-up note from radix sort
hughperkins May 14, 2026
d0722fa
[device-algos] doc: add a dedicated Scratch space section
hughperkins May 14, 2026
fbe3dd9
[device-algos] bump default scratch budget from 1 MB to 5 MB
hughperkins May 14, 2026
8231a2b
[device-algos] rename public-API param input -> arr
hughperkins May 14, 2026
6d2a3ef
[runtime/llvm] pin NVPTX64 data layout when linking libdevice
hughperkins May 14, 2026
165ff37
[device-algos] drop Field(f64) scratch workaround now that bit_cast f…
hughperkins May 14, 2026
7feb669
[device-algos] doc: fuse {add,min,max} signatures in algorithms.md
hughperkins May 14, 2026
4d664da
Merge remote-tracking branch 'origin/hp/new-qipc-ops-block' into hp/n…
hughperkins May 14, 2026
2fc3908
[device-algos] CI fixes: lint, line-wrap, test_api snapshot, Metal/Mo…
hughperkins May 14, 2026
ef9b40e
[device-algos] CI fixes: line wrap, Apple-GPU radix-sort skips, sqrt(…
hughperkins May 15, 2026
2a65ba9
[device-algos] address PR 693 review on test_algorithms helpers and c…
hughperkins May 15, 2026
b313341
[device-algos] reflow under-wrapped docstrings in test_algorithms.py …
hughperkins May 15, 2026
083606e
[device-algos] reflow remaining ~78c prose runs across PR-touched files
hughperkins May 15, 2026
281 changes: 262 additions & 19 deletions docs/source/user_guide/algorithms.md

Large diffs are not rendered by default.

105 changes: 105 additions & 0 deletions python/quadrants/_scratch.py
@@ -0,0 +1,105 @@
"""Quadrants-level scratch buffer for device-wide algorithms.

Two scratch fields - one ``Field(u32)`` and one ``Field(u64)`` - shared by every ``qd.algorithms.*`` device kernel.
Algorithms ``qd.bit_cast`` to / from these buffers to cover every supported scalar dtype: 4-byte ``i32`` / ``u32``
/ ``f32`` go through the u32 scratch; 8-byte ``i64`` / ``u64`` / ``f64`` go through the u64 scratch. Sized to
comfortably cover device-wide reduce, exclusive scan, select / compact, radix sort, and reduce-by-key on inputs up
to ``N = 1M`` out of the box (qipc's hot path), per the design doc at
``perso_hugh/doc/qipc/qipc_device_algos_design.md``.

Sizing rationale: ``device_select`` / ``device_radix_sort`` need ~``N`` u32 slots per call (one write index /
tile-histogram entry per input element). At ``N = 1M`` that is 4 MB of u32 slots; we round up to 5 MB to leave
headroom for the recursion overhead (``ceil(N / BLOCK_DIM)`` extra slots) and the second-level scan partials.
``device_reduce_*`` / ``device_exclusive_scan_*`` need only ~``N / BLOCK_DIM`` u32 slots, so the same 5 MB
covers them well past ``N = 64M``. The u64 scratch sees half as many slots at the same byte budget.

Allocation strategy: lazy on first use, invalidated on ``qd.reset()`` via the ``impl.on_reset`` hook. This avoids
paying the 5 MB-per-width allocation cost in programs that never touch ``qd.algorithms``, and avoids coupling
``qd.init()``'s argument surface to the device-algos work for the first land. Programs that only touch 4-byte
algorithms never pay for the u64 buffer. A future change can add ``qd.init(scratch_bytes=...)`` if a caller needs
to override the default before any allocation has happened.
"""

from quadrants.lang.impl import field, on_reset
from quadrants.types.primitive_types import u32, u64

DEFAULT_SCRATCH_BYTES: int = 5 * (1 << 20)

_scratch_field = None
_scratch_field_u64 = None
_scratch_bytes: int = DEFAULT_SCRATCH_BYTES


def set_scratch_bytes(scratch_bytes: int) -> None:
"""Set the scratch capacity in bytes for the next allocation.

Must be called before the first ``get_scratch_u32()`` / ``get_scratch_u64()`` call in the current runtime cycle.
Has no effect on an already-allocated scratch field; to enlarge an existing scratch, ``qd.reset()`` and
``qd.init()`` again, then re-call ``set_scratch_bytes`` (capacity reverts to ``DEFAULT_SCRATCH_BYTES`` on every
``qd.reset()``).
"""
global _scratch_bytes
if _scratch_field is not None or _scratch_field_u64 is not None:
raise RuntimeError(
"set_scratch_bytes called after scratch was already allocated; "
"call before any qd.algorithms.* op runs, or qd.reset() first"
)
if scratch_bytes <= 0 or scratch_bytes % 8 != 0:
raise ValueError(f"scratch_bytes must be a positive multiple of 8; got {scratch_bytes}")
_scratch_bytes = scratch_bytes


def get_scratch_u32():
"""Return the shared scratch ``Field(u32)``, allocating on first use.

The field is invalidated automatically by the ``impl.on_reset`` hook registered below, so a subsequent call
after ``qd.reset()`` will reallocate against the fresh runtime.
"""
global _scratch_field
if _scratch_field is None:
_scratch_field = field(u32, shape=_scratch_bytes // 4)
return _scratch_field


def get_scratch_u64():
"""Return the shared scratch ``Field(u64)``, allocating on first use.

Used by 8-byte algorithms (``i64`` / ``u64`` / ``f64`` reduce + scan, ``u64`` radix-sort keys). Lives alongside
the u32 scratch rather than overlaying it: a u64 backing aliasing into u32-sized half-cells would require
dtype-punning fields, which Quadrants doesn't expose. Same byte budget, half as many slots.
"""
global _scratch_field_u64
if _scratch_field_u64 is None:
_scratch_field_u64 = field(u64, shape=_scratch_bytes // 8)
return _scratch_field_u64


def scratch_capacity_u32() -> int:
"""Return the scratch capacity in u32 slots for the *next* allocation."""
return _scratch_bytes // 4


def scratch_capacity_u64() -> int:
"""Return the scratch capacity in u64 slots for the *next* allocation."""
return _scratch_bytes // 8


def _invalidate() -> None:
"""Drop the cached scratch handles *and* reset the capacity back to ``DEFAULT_SCRATCH_BYTES``.

Registered as an ``impl.on_reset`` hook so every ``qd.reset()`` -> ``qd.init()`` transition is a clean slate: the
next ``get_scratch_*()`` call reallocates against the fresh runtime at the default capacity, and any prior
``set_scratch_bytes(...)`` bump has to be re-applied before the new runtime's first algorithm call.

The persistence-vs-clean-slate trade-off was explicitly resolved in favour of clean slate: ``qd.init`` /
``qd.reset`` is meant to be "free to use whenever, no constraints", which only holds if all module state tied to
a runtime cycle (resource handles *and* runtime-scoped config) goes away on reset. Apps that want a persistent
bump (or persistent shrink, for apps that know their N is small and don't want to pay 10 MB across the two
scratch fields) should call ``set_scratch_bytes`` immediately after each ``qd.init``.
"""
global _scratch_field, _scratch_field_u64, _scratch_bytes
_scratch_field = None
_scratch_field_u64 = None
_scratch_bytes = DEFAULT_SCRATCH_BYTES


on_reset(_invalidate)
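Since `_scratch.py` can only run against a live Quadrants runtime, here is a dependency-free sketch of the lazy-allocation / reset-invalidation pattern it implements. A `bytearray` stands in for the real `Field(u32)`, and `invalidate()` plays the role of the registered `on_reset` hook; the names mirror the module, but this is an illustration under those stand-in assumptions, not the shipped code:

```python
# Sketch of the lazy-allocation / clean-slate-on-reset pattern from _scratch.py.
# A bytearray stands in for Field(u32); invalidate() models the on_reset hook.

DEFAULT_SCRATCH_BYTES = 5 * (1 << 20)  # 5 MB default, as in the module

_scratch_u32 = None
_scratch_bytes = DEFAULT_SCRATCH_BYTES


def set_scratch_bytes(n: int) -> None:
    """Set the capacity for the *next* allocation; rejected once allocated."""
    global _scratch_bytes
    if _scratch_u32 is not None:
        raise RuntimeError("scratch already allocated; reset first")
    if n <= 0 or n % 8 != 0:
        raise ValueError(f"scratch_bytes must be a positive multiple of 8; got {n}")
    _scratch_bytes = n


def get_scratch_u32():
    """Lazy: the first caller pays the allocation cost, later callers share it."""
    global _scratch_u32
    if _scratch_u32 is None:
        _scratch_u32 = bytearray(_scratch_bytes)
    return _scratch_u32


def invalidate() -> None:
    """What the on_reset hook does: drop the handle AND revert the capacity."""
    global _scratch_u32, _scratch_bytes
    _scratch_u32 = None
    _scratch_bytes = DEFAULT_SCRATCH_BYTES
```

Note how `set_scratch_bytes` after allocation raises rather than silently doing nothing, and how `invalidate()` discards any earlier capacity bump, matching the "clean slate" choice described in `_invalidate`'s docstring.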
23 changes: 23 additions & 0 deletions python/quadrants/algorithms/__init__.py
@@ -1,3 +1,26 @@
# type: ignore

from ._algorithms import *
from ._radix_sort import device_radix_sort
from ._reduce import device_reduce_add, device_reduce_max, device_reduce_min
from ._reduce_by_key import device_reduce_by_key_add
from ._scan import (
device_exclusive_scan_add,
device_exclusive_scan_max,
device_exclusive_scan_min,
)
from ._select import device_select

__all__ = [
"PrefixSumExecutor",
"device_exclusive_scan_add",
"device_exclusive_scan_max",
"device_exclusive_scan_min",
"device_radix_sort",
"device_reduce_add",
"device_reduce_by_key_add",
"device_reduce_max",
"device_reduce_min",
"device_select",
"parallel_sort",
]
39 changes: 37 additions & 2 deletions python/quadrants/algorithms/_algorithms.py
@@ -16,12 +16,28 @@


def parallel_sort(keys, values=None):
"""Odd-even merge sort
"""Odd-even merge sort (deprecated).

.. deprecated::
Prefer ``qd.algorithms.device_radix_sort(keys, *, tmp_keys, values=..., tmp_values=...)``. The new
functional API is asymptotically ``O(N log_radix N)`` rather than ``O(N log^2 N)``, supports
``{u32, i32, f32}`` keys across CUDA / AMDGPU / Vulkan / Metal, and takes a caller-supplied tmp buffer so
the call stays fully async. ``parallel_sort`` is kept for one release cycle for backward compat and will be
removed thereafter. See ``docs/source/user_guide/algorithms.md`` for the migration recipe.

References:
https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting
https://en.wikipedia.org/wiki/Batcher_odd%E2%80%93even_mergesort
"""
import warnings # pylint: disable=import-outside-toplevel

warnings.warn(
"qd.algorithms.parallel_sort is deprecated. Use "
"qd.algorithms.device_radix_sort(keys, tmp_keys=..., values=..., tmp_values=...) "
"instead. See docs/source/user_guide/algorithms.md for migration.",
DeprecationWarning,
stacklevel=2,
)
N = keys.shape[0]

num_stages = 0
@@ -42,7 +58,14 @@ def parallel_sort(keys, values=None):

@data_oriented
class PrefixSumExecutor:
"""Parallel Prefix Sum (Scan) Helper
"""Parallel Prefix Sum (Scan) Helper.

.. deprecated::
Prefer ``qd.algorithms.device_exclusive_scan_add(arr, out)``. The new functional API supports
``{i32, u32, f32}`` on every backend (CUDA, AMDGPU, Vulkan, Metal) and runs the exclusive variant directly.
``PrefixSumExecutor`` is inclusive-only, ``i32``-only, and limited to CUDA / Vulkan; it is kept for one
release cycle for backward compat and will be removed thereafter. See ``docs/source/user_guide/algorithms.md``
for the migration recipe.

Use this helper to perform an inclusive, in-place parallel prefix sum.

Expand All @@ -52,6 +75,18 @@ class PrefixSumExecutor:
"""

def __init__(self, length):
import warnings # pylint: disable=import-outside-toplevel

warnings.warn(
"qd.algorithms.PrefixSumExecutor is deprecated. Use "
"qd.algorithms.device_exclusive_scan_add(arr, out) instead. "
"See docs/source/user_guide/algorithms.md for migration.",
DeprecationWarning,
stacklevel=2,
)
self._init(length)

def _init(self, length):
self.sorting_length = length

BLOCK_SZ = 64