## Summary
Quadrants has no volatile load or atomic load primitive. This makes it impossible to correctly and efficiently implement spin-wait patterns (e.g. decoupled-look-back scans) that require repeatedly reading a memory location until it changes.
## Problem
A spin-wait loop like:
```python
while flags[prev] == STATE_INVALID:
    pass
```
relies on the compiler re-reading `flags[prev]` from global memory on each iteration. Without a volatile load, the compiler may hoist the load out of the loop or cache the value in a register, turning the spin into an infinite loop.
Current workarounds are all suboptimal:
| Workaround | Correct? | Performance |
| --- | --- | --- |
| `grid.memfence()` inside the loop | Yes (acts as compiler barrier) | Bad — full device-scope cache drain per iteration |
| `atomic_add(flags[prev], 0)` | Yes (forces memory round-trip) | Bad — read-modify-write overhead, contention |
| Do nothing and hope the compiler doesn't optimize | Fragile | N/A |
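For concreteness, the atomic workaround replaces the plain read with an add of zero. This is a sketch in Quadrants-style pseudocode; the exact `atomic_add` spelling and return-value convention are assumptions, not confirmed API:

```python
# Sketch only: assumes atomic_add returns the old value at flags[prev].
# Adding 0 leaves the flag unchanged but forces a real memory round-trip,
# at the cost of a read-modify-write (and cache-line contention) per spin.
while atomic_add(flags[prev], 0) == STATE_INVALID:
    pass
```

This is correct but turns every poll into an RMW, which is exactly the overhead the proposed primitive avoids.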
## Proposed solution
Add a `qd.volatile_load(target, *indices)` primitive (or equivalent) that guarantees the load goes to memory on every call. The implementation maps cleanly to existing backend primitives:
- **LLVM IR (CUDA / AMDGPU):** emit `load volatile` instead of `load` — LLVM guarantees it cannot be eliminated, merged, or hoisted. On CUDA, LLVM lowers this to `ld.volatile.global` in PTX.
- **SPIR-V (Vulkan / Metal):** emit `OpLoad` with the `Volatile` Memory Access bit (`0x1`) — prevents expression forwarding and value caching.
No new hardware capability is needed; every backend already supports this at the instruction level.
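With the proposed primitive, the spin loop from the problem statement could be written as follows. This is a sketch against the signature proposed above; `qd.volatile_load` does not exist yet:

```python
# Sketch only: qd.volatile_load is the primitive proposed in this issue.
# Every iteration is guaranteed to issue a fresh load from global memory,
# with no fence or read-modify-write overhead.
while qd.volatile_load(flags, prev) == STATE_INVALID:
    pass  # spin until the producing block publishes a valid state
```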
## Use cases
- Decoupled-look-back scans (Onesweep-style) — spin on a flag array
- Producer-consumer patterns between blocks via global memory
- Any kernel that polls a shared memory location written by another thread/block
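The producer-consumer use case has a familiar CPU analogue, sketched below with Python threads. CPython re-reads shared state on every loop iteration, so a plain read suffices here; on a GPU backend with an optimizing compiler, the polling read in this pattern is exactly what would need `qd.volatile_load`. All names here (`STATE_INVALID`, `flags`, `payload`) are illustrative:

```python
import threading

STATE_INVALID = 0
STATE_VALID = 1

flags = [STATE_INVALID]   # publication flag, analogous to flags[prev]
payload = [None]          # data published alongside the flag

def producer():
    payload[0] = 42           # write the data first...
    flags[0] = STATE_VALID    # ...then publish the flag

t = threading.Thread(target=producer)
t.start()

while flags[0] == STATE_INVALID:  # spin-wait on the flag
    pass

t.join()
print(payload[0])
```

The write order (payload before flag) is what makes the spinning reader safe to release once the flag flips.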
## Found during
Review of #641 (docs for `qd.simt.grid.*`), where the `lookback_scan` example relies on re-reading `flags[prev]` in a spin loop but has no way to guarantee it.