On gfx90a OpenMP offload, putting a small fixed-size integer array in a firstprivate clause on a register-heavy target teams distribute parallel do kernel makes the kernel spill ~35 KB/work-item to scratch, pins AGPRs at the hardware maximum, and drops occupancy to a single wave per SIMD. The kernel runs 30-50x slower. Passing the same data as two scalars, or as a plain private array initialized from those scalars, produces no scratch and no occupancy loss. The array here is 8 bytes (integer, dimension(2)).
I hit this in a production CFD code (MFC) and reduced it to a single self-contained file. Full reproducer, build/run scripts, and raw traces:
https://github.com/sbryngelson/compiler-bugs/tree/main/amd/flang-firstprivate-array-occupancy
Environment
- Hardware: AMD MI250X (gfx90a), one GCD, OLCF Frontier.
- Compilers (reproduces on all three; see below for the per-version surface behavior):
  - amdflang ROCm 7.2.0, LLVM 22.0.0git (public release)
  - AMD AFAR drop 23.1.0, LLVM 23.0.0git (03/12/26)
  - AMD AFAR drop 23.2.0, LLVM 23.0.0git (04/18/26, latest available)
- Flags:
  compile: -fopenmp --offload-arch=gfx90a -O3 \
           -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
  link:    -fopenmp --offload-arch=gfx90a
  run:     OMP_TARGET_OFFLOAD=MANDATORY LIBOMPTARGET_KERNEL_TRACE=1
What I see
The reproducer is one source file built five ways (one -DVARIANT_* each). The kernel arithmetic is byte-identical in all five — a register-heavy blob (~90 private real scalars and a few small private arrays through a long dependent sqrt/sign/divide chain so nothing folds away). The only thing that differs is how the two small integers reach the kernel. LIBOMPTARGET_KERNEL_TRACE=1 on afar 23.2.0:
variant                            ns/elem  scratch  AGPR  SGPR-spill  VGPR-spill  occ
A  baseline, no clause               0.135      0 B     0           0           0  50%
B  firstprivate(re) [int(2)]         6.330  35424 B   256        1155         451  12%  <- 47x
C  firstprivate(re1, re2)            0.196      0 B     0           0           0  50%
D  firstprivate(re), const index     6.347  35424 B   256        1155         451  12%  <- 47x
E  private(repriv) + fp scalars      0.203      0 B     0           0           0  50%
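To make the "<- 47x" annotations concrete, here is a small Python sketch that recomputes each variant's slowdown against baseline A from the ns/elem column above. It also converts the reported occupancy into waves per SIMD. The figure of 8 wave slots per SIMD on gfx90a is an assumption, chosen because it matches reading 12% as a single wave per SIMD. The dictionary and variable names are illustrative only.

```python
# ns/elem and occupancy (%) copied from the trace table above (afar 23.2.0).
NS_PER_ELEM = {"A": 0.135, "B": 6.330, "C": 0.196, "D": 6.347, "E": 0.203}
OCCUPANCY_PCT = {"A": 50, "B": 12, "C": 50, "D": 12, "E": 50}

# Assumption: 8 wave slots per SIMD on gfx90a, so 12% (truncated 12.5%) is 1 wave.
WAVE_SLOTS_PER_SIMD = 8

baseline = NS_PER_ELEM["A"]
slowdown = {v: t / baseline for v, t in NS_PER_ELEM.items()}
waves = {v: round(p / 100 * WAVE_SLOTS_PER_SIMD) for v, p in OCCUPANCY_PCT.items()}

for v in sorted(NS_PER_ELEM):
    print(f"{v}: {slowdown[v]:5.1f}x vs A, ~{waves[v]} wave(s)/SIMD")
```

B and D come out at roughly 47x, while C and E stay within about 1.5x of the baseline time per element.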
It's firstprivate of an array specifically — not the indexing
The natural guess is "a runtime-indexed private array can't stay in registers, so it spills — expected." That's wrong here. A 2x2 over {clause} x {how the array is indexed} rules it out:
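|                                           | read re(i) (dynamic)  | read merge(re(1),re(2),..) (constant) |
|-------------------------------------------|-----------------------|---------------------------------------|
| firstprivate(re)                          | B: spills, 12% occ    | D: spills, 12% occ                    |
| private, seeded from firstprivate scalars | E: 0 scratch, 50% occ | C (scalars): 0 scratch, 50%           |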
- D reads the firstprivate array with constant indices and spills just as hard as B, so the dynamic index isn't the cause.
- E reads a private array with a dynamic index and is perfectly fine, so a dynamically-indexed private array isn't the cause either.
E is the interesting one: it expresses the exact semantics of firstprivate(re) by hand — a per-work-item private array seeded from the original values (carried in as two firstprivate scalars) — and you lower that to zero-scratch, full-occupancy code. So the back end is fully capable of generating good code for the semantics; it only goes wrong when the clause is spelled firstprivate(<array>).
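The 2x2 argument can also be checked mechanically. The Python sketch below encodes the four variants and asks, for each factor, whether all variants sharing a value of that factor have the same spill outcome. The labels and the helper name explains_spill are hypothetical and exist only for this illustration; the outcomes are the ones reported in the table above.

```python
# (clause form, index kind, spills?) for the four variants in the 2x2 above.
VARIANTS = {
    "B": ("firstprivate(array)", "dynamic", True),
    "D": ("firstprivate(array)", "constant", True),
    "E": ("private array", "dynamic", False),
    "C": ("firstprivate scalars", "constant", False),
}

def explains_spill(factor):
    """True if every variant sharing a value of `factor` has the same outcome."""
    outcomes = {}
    for row in VARIANTS.values():
        outcomes.setdefault(row[factor], set()).add(row[2])
    return all(len(seen) == 1 for seen in outcomes.values())

CLAUSE, INDEX = 0, 1  # positions of the two factors in each tuple
print("clause form explains spill:", explains_spill(CLAUSE))
print("index kind explains spill:", explains_spill(INDEX))
```

Only the clause form separates the spilling variants from the clean ones; both index kinds appear on both sides.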
Likely mechanism
The same source on the public ROCm 7.2.0 release doesn't even link — the firstprivate-array variants leave an undefined device symbol:
ld.lld: error: undefined symbol: _FortranAAssign
>>> referenced by ...__omp_offloading_..._run_sweep...
_FortranAAssign is the Fortran runtime's descriptor-assignment helper. So the firstprivate(array) copy-in appears to be lowered through the general array-assignment runtime path rather than as a plain value copy. On 7.2.0 that helper isn't present on the device (link error); on the 23.x drops it's inlined into the kernel as a large scratch-spilling blob. Same root cause, two surface failures. The scalar and plain-private forms don't take that path. (Log: results/rocm-7.2.0-link-failure.txt in the repo.)
Not a stale-drop artifact
The bug is present on the newest afar drop (23.2.0), and the spill is larger there than on 23.1.0 (scratch 20.8 KB -> 35.4 KB, AGPR 128 -> 256, and 23.2.0 picks up 1155 SGPR + 451 VGPR spills that 23.1.0 didn't have). The trace for both drops is in results/kernel_trace.txt.
Cray ftn and nvfortran offload builds of the same code are unaffected.
Reproduce
git clone https://github.com/sbryngelson/compiler-bugs
cd compiler-bugs/amd/flang-firstprivate-array-occupancy
./build.sh # builds fp_A..fp_E; also prints the .llvm.offloading section size
sbatch run.sbatch # or, interactively, run each fp_* with LIBOMPTARGET_KERNEL_TRACE=1
A quick static fingerprint without running: the embedded GPU code object (.llvm.offloading section) is ~37x larger for the firstprivate-array variants (871,504 vs 23,544 bytes on afar 23.1.0). Use the llvm-objcopy from the same drop.
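As a rough illustration of using that size difference as a check, the Python sketch below computes the ratio from the two byte counts quoted above and flags a build whose device code object is much larger than the baseline. The helper is_bloated and its 10x threshold are hypothetical, not a measured cutoff; in practice the inputs would be the section sizes that build.sh prints.

```python
# Section sizes in bytes quoted above for afar 23.1.0.
FP_ARRAY_SECTION_BYTES = 871_504
BASELINE_SECTION_BYTES = 23_544

def is_bloated(section_bytes, baseline_bytes, threshold=10.0):
    """Flag a build whose device code object is far larger than baseline.

    The 10x default threshold is an arbitrary illustration.
    """
    return section_bytes / baseline_bytes >= threshold

ratio = FP_ARRAY_SECTION_BYTES / BASELINE_SECTION_BYTES
flagged = is_bloated(FP_ARRAY_SECTION_BYTES, BASELINE_SECTION_BYTES)
print(f"ratio: {ratio:.1f}x, bloated: {flagged}")
```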
Workaround
Carry the value in as scalars and firstprivate those, or as a plain private array seeded from firstprivate scalars (variants C and E). Both stay register-resident at full occupancy. Posting in case the firstprivate-of-an-array lowering is straightforward to route away from _FortranAAssign toward a value copy — happy to test patches or grab more traces (IR, --save-temps, rocprof) on Frontier.