[AMDGPU] Emit atomicrmw fmin/fmax for f32/f64 atomic_min/atomic_max to fix syncscope gap

## Background

PR #672 fixes AMDGPU CAS livelock for `atomic_xor`/`atomic_add` by setting `syncscope("agent")` on kernel-facing `atomicrmw` instructions. However, **f32/f64 `atomic_min`/`atomic_max`** are not covered — they still go through runtime-bitcode CAS loops at system scope.

## Current behaviour

`codegen_llvm.cpp:real_type_atomic()` dispatches f32/f64 min/max via `call("atomic_min_f32", ...)` (lines 1366–1373), which calls into `runtime_module/atomic.h` `DEFINE_ATOMIC_OP_COMP_EXCH`. Those CAS loops use `__atomic_compare_exchange(..., seq_cst, seq_cst)` → system-scope `cmpxchg`. On AMDGPU this means per-iteration cache flushes for host-CPU coherence that is never needed.

This is a **performance** issue, not a correctness one: min/max are convergent operations and will not livelock like xor does. But under high contention the system-scope cache-flush overhead per CAS iteration is significant.

## Proposed fix

Add `min`/`max` cases to the existing `switch` in `real_type_atomic()`, mirroring the `FAdd` pattern:

```cpp
    case AtomicOpType::min:
      return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMin,
          llvm_val[stmt->dest], llvm_val[stmt->val],
          llvm::MaybeAlign(0),
          llvm::AtomicOrdering::SequentiallyConsistent,
          kernel_atomic_syncscope(llvm_context, current_arch()));
    case AtomicOpType::max:
      return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMax,
          llvm_val[stmt->dest], llvm_val[stmt->val],
          llvm::MaybeAlign(0),
          llvm::AtomicOrdering::SequentiallyConsistent,
          kernel_atomic_syncscope(llvm_context, current_arch()));
```

After this change, the `call("atomic_min_f32", ...)` block and `DEFINE_ATOMIC_OP_COMP_EXCH(min/max, f32/f64)` in `atomic.h` become dead code and can be removed.

## Expected results by target

| Target | Result |
|--------|--------|
| GFX10+ (RDNA) | Native `global_atomic_fmin`/`fmax` — single instruction |
| GFX9 (Vega) | CAS loop, but `cmpxchg` uses agent scope (no per-iteration cache flush) |
| CUDA / CPU | No change (they already use System scope, which lowers correctly) |

## Things to address

1. **NaN semantics change**: current CAS uses `b > a ? a : b` (propagates NaN); `atomicrmw fmin/fmax` uses `minNum`/`maxNum` semantics (returns non-NaN value). This aligns f32/f64 with the f16 path (`CreateMinNum`/`CreateMaxNum` at `codegen_llvm.cpp:1342–1346`) but is user-visible — needs CHANGELOG entry and doc update.
2. **Test coverage**: add contention tests for f32/f64 min/max on AMDGPU, including NaN and -0 edge cases.
3. **Backend verification**: confirm correct lowering on all LLVM backends (CPU, CUDA, AMDGPU, SPIR-V).
4. **Dead-code removal**: verify no other callers of `atomic_min_f32` etc. before deleting.

## References

- PR #672 discussion: https://github.com/Genesis-Embodied-AI/quadrants/pull/672#discussion_r3219453634
- `AtomicRMWInst::FMin`/`FMax` added in LLVM 15: [D127041](https://reviews.llvm.org/D127041), [LLVM 15 release notes](https://releases.llvm.org/15.0.0/docs/ReleaseNotes.html)
- GFX10 native `global_atomic_fmin/fmax`: [LLVM commit e381833](https://github.com/llvm/llvm-project/commit/e381833ba5794fde6d36e3fbcc9b22ffbe438d17)
- GFX9 lacks f32 fmin/fmax: [LLVM PR #95592](https://github.com/llvm/llvm-project/pull/95592)
- CI uses ROCm 6.4 = [LLVM 19](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.2/index.html)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Emit atomicrmw fmin/fmax for f32/f64 atomic_min/atomic_max to fix syncscope gap #686

Background

Current behaviour

Proposed fix

Expected results by target

Things to address

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Target	Result
GFX10+ (RDNA)	Native `global_atomic_fmin`/`fmax` — single instruction
GFX9 (Vega)	CAS loop, but `cmpxchg` uses agent scope (no per-iteration cache flush)
CUDA / CPU	No change (they already use System scope, which lowers correctly)

[AMDGPU] Emit atomicrmw fmin/fmax for f32/f64 atomic_min/atomic_max to fix syncscope gap #686

Description

Background

Current behaviour

Proposed fix

Expected results by target

Things to address

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions