Background
PR #672 fixes AMDGPU CAS livelock for atomic_xor/atomic_add by setting syncscope("agent") on kernel-facing atomicrmw instructions. However, f32/f64 atomic_min/atomic_max are not covered — they still go through runtime-bitcode CAS loops at system scope.
Current behaviour
codegen_llvm.cpp:real_type_atomic() dispatches f32/f64 min/max via call("atomic_min_f32", ...) (lines 1366–1373), which calls into runtime_module/atomic.h DEFINE_ATOMIC_OP_COMP_EXCH. Those CAS loops use __atomic_compare_exchange(..., seq_cst, seq_cst) → system-scope cmpxchg. On AMDGPU this means per-iteration cache flushes for host-CPU coherence that is never needed.
This is a performance issue, not a correctness one: min/max are convergent operations and will not livelock like xor does. But under high contention the system-scope cache-flush overhead per CAS iteration is significant.
Proposed fix
Add min/max cases to the existing switch in real_type_atomic(), mirroring the FAdd pattern:
case AtomicOpType::min:
return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMin,
llvm_val[stmt->dest], llvm_val[stmt->val],
llvm::MaybeAlign(0),
llvm::AtomicOrdering::SequentiallyConsistent,
kernel_atomic_syncscope(llvm_context, current_arch()));
case AtomicOpType::max:
return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMax,
llvm_val[stmt->dest], llvm_val[stmt->val],
llvm::MaybeAlign(0),
llvm::AtomicOrdering::SequentiallyConsistent,
kernel_atomic_syncscope(llvm_context, current_arch()));
After this change, the call("atomic_min_f32", ...) block and DEFINE_ATOMIC_OP_COMP_EXCH(min/max, f32/f64) in atomic.h become dead code and can be removed.
Expected results by target
| Target |
Result |
| GFX10+ (RDNA) |
Native global_atomic_fmin/fmax — single instruction |
| GFX9 (Vega) |
CAS loop, but cmpxchg uses agent scope (no per-iteration cache flush) |
| CUDA / CPU |
No change (they already use System scope, which lowers correctly) |
Things to address
- NaN semantics change: current CAS uses
b > a ? a : b (propagates NaN); atomicrmw fmin/fmax uses minNum/maxNum semantics (returns non-NaN value). This aligns f32/f64 with the f16 path (CreateMinNum/CreateMaxNum at codegen_llvm.cpp:1342–1346) but is user-visible — needs CHANGELOG entry and doc update.
- Test coverage: add contention tests for f32/f64 min/max on AMDGPU, including NaN and -0 edge cases.
- Backend verification: confirm correct lowering on all LLVM backends (CPU, CUDA, AMDGPU, SPIR-V).
- Dead-code removal: verify no other callers of
atomic_min_f32 etc. before deleting.
References
Background
PR #672 fixes AMDGPU CAS livelock for
atomic_xor/atomic_addby settingsyncscope("agent")on kernel-facingatomicrmwinstructions. However, f32/f64atomic_min/atomic_maxare not covered — they still go through runtime-bitcode CAS loops at system scope.Current behaviour
codegen_llvm.cpp:real_type_atomic()dispatches f32/f64 min/max viacall("atomic_min_f32", ...)(lines 1366–1373), which calls intoruntime_module/atomic.hDEFINE_ATOMIC_OP_COMP_EXCH. Those CAS loops use__atomic_compare_exchange(..., seq_cst, seq_cst)→ system-scopecmpxchg. On AMDGPU this means per-iteration cache flushes for host-CPU coherence that is never needed.This is a performance issue, not a correctness one: min/max are convergent operations and will not livelock like xor does. But under high contention the system-scope cache-flush overhead per CAS iteration is significant.
Proposed fix
Add
min/maxcases to the existingswitchinreal_type_atomic(), mirroring theFAddpattern:After this change, the
call("atomic_min_f32", ...)block andDEFINE_ATOMIC_OP_COMP_EXCH(min/max, f32/f64)inatomic.hbecome dead code and can be removed.Expected results by target
global_atomic_fmin/fmax— single instructioncmpxchguses agent scope (no per-iteration cache flush)Things to address
b > a ? a : b(propagates NaN);atomicrmw fmin/fmaxusesminNum/maxNumsemantics (returns non-NaN value). This aligns f32/f64 with the f16 path (CreateMinNum/CreateMaxNumatcodegen_llvm.cpp:1342–1346) but is user-visible — needs CHANGELOG entry and doc update.atomic_min_f32etc. before deleting.References
AtomicRMWInst::FMin/FMaxadded in LLVM 15: D127041, LLVM 15 release notesglobal_atomic_fmin/fmax: LLVM commit e381833