Skip to content

[AMDGPU] Emit atomicrmw fmin/fmax for f32/f64 atomic_min/atomic_max to fix syncscope gap #686

@alanray-tech

Description

@alanray-tech

Background

PR #672 fixes AMDGPU CAS livelock for atomic_xor/atomic_add by setting syncscope("agent") on kernel-facing atomicrmw instructions. However, f32/f64 atomic_min/atomic_max are not covered — they still go through runtime-bitcode CAS loops at system scope.

Current behaviour

codegen_llvm.cpp:real_type_atomic() dispatches f32/f64 min/max via call("atomic_min_f32", ...) (lines 1366–1373), which calls into runtime_module/atomic.h DEFINE_ATOMIC_OP_COMP_EXCH. Those CAS loops use __atomic_compare_exchange(..., seq_cst, seq_cst) → system-scope cmpxchg. On AMDGPU this means per-iteration cache flushes for host-CPU coherence that is never needed.

This is a performance issue, not a correctness one: min/max are convergent operations and will not livelock like xor does. But under high contention the system-scope cache-flush overhead per CAS iteration is significant.

Proposed fix

Add min/max cases to the existing switch in real_type_atomic(), mirroring the FAdd pattern:

    case AtomicOpType::min:
      return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMin,
          llvm_val[stmt->dest], llvm_val[stmt->val],
          llvm::MaybeAlign(0),
          llvm::AtomicOrdering::SequentiallyConsistent,
          kernel_atomic_syncscope(llvm_context, current_arch()));
    case AtomicOpType::max:
      return builder->CreateAtomicRMW(llvm::AtomicRMWInst::FMax,
          llvm_val[stmt->dest], llvm_val[stmt->val],
          llvm::MaybeAlign(0),
          llvm::AtomicOrdering::SequentiallyConsistent,
          kernel_atomic_syncscope(llvm_context, current_arch()));

After this change, the call("atomic_min_f32", ...) block and DEFINE_ATOMIC_OP_COMP_EXCH(min/max, f32/f64) in atomic.h become dead code and can be removed.

Expected results by target

Target Result
GFX10+ (RDNA) Native global_atomic_fmin/fmax — single instruction
GFX9 (Vega) CAS loop, but cmpxchg uses agent scope (no per-iteration cache flush)
CUDA / CPU No change (they already use System scope, which lowers correctly)

Things to address

  1. NaN semantics change: current CAS uses b > a ? a : b (propagates NaN); atomicrmw fmin/fmax uses minNum/maxNum semantics (returns non-NaN value). This aligns f32/f64 with the f16 path (CreateMinNum/CreateMaxNum at codegen_llvm.cpp:1342–1346) but is user-visible — needs CHANGELOG entry and doc update.
  2. Test coverage: add contention tests for f32/f64 min/max on AMDGPU, including NaN and -0 edge cases.
  3. Backend verification: confirm correct lowering on all LLVM backends (CPU, CUDA, AMDGPU, SPIR-V).
  4. Dead-code removal: verify no other callers of atomic_min_f32 etc. before deleting.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions