## Summary
Quadrants has no volatile load or atomic load primitive. This makes it impossible to correctly and efficiently implement spin-wait patterns (e.g. decoupled-look-back scans) that require repeatedly reading a memory location until it changes.
## Problem
A spin-wait loop like:
```python
while flags[prev] == STATE_INVALID:
    pass
```
relies on the compiler re-reading `flags[prev]` from global memory on each iteration. Without a volatile load, the compiler may hoist the load out of the loop or cache the value in a register, turning the spin into an infinite loop.
Current workarounds are all suboptimal:
| Workaround | Correct? | Performance |
| --- | --- | --- |
| `grid.memfence()` inside the loop | Yes (acts as compiler barrier) | Bad — full device-scope cache drain per iteration |
| `atomic_add(flags[prev], 0)` | Yes (forces memory round-trip) | Bad — read-modify-write overhead, contention |
| Do nothing and hope the compiler doesn't optimize | Fragile | N/A |
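For concreteness, the atomic workaround replaces the plain read with an add of zero. This is a sketch in Quadrants-style pseudocode; the exact `atomic_add` spelling and return-value convention are assumptions, not confirmed API:

```python
# Sketch only: assumes atomic_add returns the old value at flags[prev].
# Adding 0 leaves the flag unchanged but forces a real memory round-trip,
# at the cost of a read-modify-write (and cache-line contention) per spin.
while atomic_add(flags[prev], 0) == STATE_INVALID:
    pass
```

This is correct but turns every poll into an RMW, which is exactly the overhead the proposed primitive avoids.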
## Proposed solution
Add a `qd.volatile_load(target, *indices)` primitive (or equivalent) that guarantees the load goes to memory on every call. The implementation maps cleanly to existing backend primitives:
- **LLVM IR (CUDA / AMDGPU):** emit `load volatile` instead of `load` — LLVM guarantees it cannot be eliminated, merged, or hoisted. On CUDA, LLVM lowers this to `ld.volatile.global` in PTX.
- **SPIR-V (Vulkan / Metal):** emit `OpLoad` with the `Volatile` Memory Access bit (`0x1`) — prevents expression forwarding and value caching.
No new hardware capability is needed; every backend already supports this at the instruction level.
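With the proposed primitive, the spin loop from the problem statement could be written as follows. This is a sketch against the signature proposed above; `qd.volatile_load` does not exist yet:

```python
# Sketch only: qd.volatile_load is the primitive proposed in this issue.
# Every iteration is guaranteed to issue a fresh load from global memory,
# with no fence or read-modify-write overhead.
while qd.volatile_load(flags, prev) == STATE_INVALID:
    pass  # spin until the producing block publishes a valid state
```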
## Use cases
- Decoupled-look-back scans (Onesweep-style) — spin on a flag array
- Producer-consumer patterns between blocks via global memory
- Any kernel that polls a shared memory location written by another thread/block
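The producer-consumer use case has a familiar CPU analogue, sketched below with Python threads. CPython re-reads shared state on every loop iteration, so a plain read suffices here; on a GPU backend with an optimizing compiler, the polling read in this pattern is exactly what would need `qd.volatile_load`. All names here (`STATE_INVALID`, `flags`, `payload`) are illustrative:

```python
import threading

STATE_INVALID = 0
STATE_VALID = 1

flags = [STATE_INVALID]   # publication flag, analogous to flags[prev]
payload = [None]          # data published alongside the flag

def producer():
    payload[0] = 42           # write the data first...
    flags[0] = STATE_VALID    # ...then publish the flag

t = threading.Thread(target=producer)
t.start()

while flags[0] == STATE_INVALID:  # spin-wait on the flag
    pass

t.join()
print(payload[0])
```

The write order (payload before flag) is what makes the spinning reader safe to release once the flag flips.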
## Found during
Review of #641 (docs for `qd.simt.grid.*`), where the `lookback_scan` example relies on re-reading `flags[prev]` in a spin loop but has no way to guarantee it.