OpenMP offload: firstprivate of a small fixed-size array spills ~35 KB to scratch and collapses occupancy (gfx90a) #2909

Description

@sbryngelson

On gfx90a OpenMP offload, putting a small fixed-size integer array in a firstprivate clause on a register-heavy target teams distribute parallel do kernel makes the kernel spill ~35 KB/work-item to scratch, pins AGPRs at the hardware maximum, and drops occupancy to a single wave per SIMD. The kernel runs 30-50x slower. Passing the same data as two scalars, or as a plain private array initialized from those scalars, avoids the spill entirely and keeps occupancy at the baseline level. The array here is 8 bytes (integer, dimension(2)).

I hit this in a production CFD code (MFC) and reduced it to a single self-contained file. Full reproducer, build/run scripts, and raw traces:
https://github.com/sbryngelson/compiler-bugs/tree/main/amd/flang-firstprivate-array-occupancy

Environment

  • Hardware: AMD MI250X (gfx90a), one GCD, OLCF Frontier.
  • Compilers (reproduces on all three; see below for the per-version surface behavior):
    • amdflang ROCm 7.2.0, LLVM 22.0.0git (public release)
    • AMD AFAR drop 23.1.0, LLVM 23.0.0git (03/12/26)
    • AMD AFAR drop 23.2.0, LLVM 23.0.0git (04/18/26, latest available)
  • Flags:
    compile: -fopenmp --offload-arch=gfx90a -O3 \
             -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
    link:    -fopenmp --offload-arch=gfx90a
    run:     OMP_TARGET_OFFLOAD=MANDATORY  LIBOMPTARGET_KERNEL_TRACE=1
    

What I see

The reproducer is one source file built five ways (one -DVARIANT_* each). The kernel arithmetic is byte-identical in all five — a register-heavy blob (~90 private real scalars and a few small private arrays through a long dependent sqrt/sign/divide chain so nothing folds away). The only thing that differs is how the two small integers reach the kernel. LIBOMPTARGET_KERNEL_TRACE=1 on afar 23.2.0:

variant                          ns/elem   scratch   AGPR  SGPR-spill  VGPR-spill  occ
A  baseline, no clause            0.135       0 B       0        0           0      50%
B  firstprivate(re)   [int(2)]    6.330   35424 B     256     1155         451      12%   <- 47x
C  firstprivate(re1, re2)         0.196       0 B       0        0           0      50%
D  firstprivate(re), const index  6.347   35424 B     256     1155         451      12%   <- 47x
E  private(repriv) + fp scalars   0.203       0 B       0        0           0      50%
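
The "<- 47x" annotations can be re-derived from the ns/elem column. A minimal Python sketch, with the timings copied from the table above; the dictionary name and layout are only for illustration and are not part of the reproducer:

```python
# ns/elem per variant, copied from the AFAR 23.2.0 trace table above.
ns_per_elem = {"A": 0.135, "B": 6.330, "C": 0.196, "D": 6.347, "E": 0.203}

# Slowdown of each variant relative to the no-clause baseline (variant A).
baseline = ns_per_elem["A"]
slowdown = {variant: t / baseline for variant, t in ns_per_elem.items()}

for variant, factor in slowdown.items():
    print(f"{variant}: {factor:5.1f}x vs. baseline")
```

B and D land at roughly 47x the baseline, while C and E stay within about 1.5x of it.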

It's firstprivate of an array specifically — not the indexing

The natural guess is "a runtime-indexed private array can't stay in registers, so it spills — expected." That's wrong here. A 2x2 over {clause} x {how the array is indexed} rules it out:

                                           read re(i) (dynamic)     read merge(re(1),re(2),..) (constant)
firstprivate(re)                           B: spills, 12% occ       D: spills, 12% occ
private, seeded from firstprivate scalars  E: 0 scratch, 50% occ    C (scalars): 0 scratch, 50% occ
  • D reads the firstprivate array with constant indices and spills just as hard as B, so the dynamic index isn't the cause.
  • E reads a private array with a dynamic index and is perfectly fine, so a dynamically-indexed private array isn't the cause either.
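
The elimination argument can be written out mechanically. The sketch below encodes the four observed cells of the 2x2 and asks, for each factor, whether fixing that factor alone fixes the outcome. The labels and the helper name `determines` are hypothetical, chosen only for this illustration:

```python
# (clause, indexing) -> does the kernel spill?  Cells taken from the 2x2 above.
observed = {
    ("firstprivate_array", "dynamic"): True,       # B
    ("firstprivate_array", "constant"): True,      # D
    ("no_firstprivate_array", "dynamic"): False,   # E
    ("no_firstprivate_array", "constant"): False,  # C
}

def determines(factor_index):
    """True if every value of this factor maps to exactly one outcome."""
    outcomes = {}
    for key, spills in observed.items():
        outcomes.setdefault(key[factor_index], set()).add(spills)
    return all(len(seen) == 1 for seen in outcomes.values())

print("clause determines spill:  ", determines(0))
print("indexing determines spill:", determines(1))
```

The clause alone predicts the spill; the indexing style does not, since both dynamic and constant indexing appear with and without spills.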

E is the interesting one: it expresses the exact semantics of firstprivate(re) by hand — a per-work-item private array seeded from the original values (carried in as two firstprivate scalars) — and you lower that to zero-scratch, full-occupancy code. So the back end is fully capable of generating good code for the semantics; it only goes wrong when the clause is spelled firstprivate(<array>).

Likely mechanism

The same source on the public ROCm 7.2.0 release doesn't even link — the firstprivate-array variants leave an undefined device symbol:

ld.lld: error: undefined symbol: _FortranAAssign
>>> referenced by ...__omp_offloading_..._run_sweep...

_FortranAAssign is the Fortran runtime's descriptor-assignment helper. So the firstprivate(array) copy-in appears to be lowered through the general array-assignment runtime path rather than as a plain value copy. On 7.2.0 that helper isn't present on the device (link error); on the 23.x drops it's inlined into the kernel as a large scratch-spilling blob. Same root cause, two surface failures. The scalar and plain-private forms don't take that path. (Log: results/rocm-7.2.0-link-failure.txt in the repo.)

Not a stale-drop artifact

The bug is present on the newest AFAR drop (23.2.0), and the spill is larger there than on 23.1.0 (scratch 20.8 KB -> 35.4 KB, AGPR 128 -> 256, and 23.2.0 picks up 1155 SGPR + 451 VGPR spills that 23.1.0 didn't have). The trace for both drops is in results/kernel_trace.txt.

Cray ftn and nvfortran offload builds of the same code are unaffected.

Reproduce

git clone https://github.com/sbryngelson/compiler-bugs
cd compiler-bugs/amd/flang-firstprivate-array-occupancy
./build.sh          # builds fp_A..fp_E; also prints the .llvm.offloading section size
sbatch run.sbatch   # or, interactively, run each fp_* with LIBOMPTARGET_KERNEL_TRACE=1

A quick static fingerprint without running: the embedded GPU code object (.llvm.offloading section) is ~37x larger for the firstprivate-array variants (871,504 vs 23,544 bytes on afar 23.1.0). Use the llvm-objcopy from the same drop.
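
For reference, the ~37x figure follows directly from the two section sizes quoted above. The sketch below also wraps the comparison in a small check; the helper name `looks_bloated` and its 10x threshold are assumptions made for illustration, not something the reproducer scripts define:

```python
# .llvm.offloading section sizes in bytes, as quoted above for AFAR 23.1.0.
fp_array_bytes = 871_504   # firstprivate-array variants
baseline_bytes = 23_544    # the other variants

ratio = fp_array_bytes / baseline_bytes
print(f"size ratio: {ratio:.1f}x")

def looks_bloated(size, reference, threshold=10.0):
    """Hypothetical heuristic: flag a code object far larger than the reference."""
    return size / reference >= threshold

print("firstprivate-array variant flagged:", looks_bloated(fp_array_bytes, baseline_bytes))
```

Any threshold well below 37x would separate the two groups here; 10x is an arbitrary choice.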

Workaround

Carry the value in as scalars and firstprivate those, or as a plain private array seeded from firstprivate scalars (variants C and E). Both stay register-resident at full occupancy. Posting in case the firstprivate-of-an-array lowering is straightforward to route away from _FortranAAssign toward a value copy — happy to test patches or grab more traces (IR, --save-temps, rocprof) on Frontier.
