OpenMP offload: `firstprivate` of a small fixed-size array spills ~35 KB to scratch and collapses occupancy (gfx90a) #2909

On gfx90a OpenMP offload, putting a small fixed-size integer array in a `firstprivate` clause on a register-heavy `target teams distribute parallel do` kernel makes the kernel spill ~35 KB per work-item to scratch, pins AGPRs at the hardware maximum (256), and drops occupancy to a single wave per SIMD (12%). The kernel runs 30-50x slower. Passing the *same data* as two scalars, or as a plain `private` array initialized from those scalars, spills nothing and keeps occupancy at the baseline 50%. The array here is 8 bytes (`integer, dimension(2)`).

I hit this in a production CFD code (MFC) and reduced it to a single self-contained file. Full reproducer, build/run scripts, and raw traces:
https://github.com/sbryngelson/compiler-bugs/tree/main/amd/flang-firstprivate-array-occupancy

## Environment

- Hardware: AMD MI250X (gfx90a), one GCD, OLCF Frontier.
- Compilers (all three are affected; 7.2.0 fails at link time and the AFAR drops spill, see "Likely mechanism" below):
  - `amdflang` ROCm 7.2.0, LLVM 22.0.0git (public release)
  - AMD AFAR drop 23.1.0, LLVM 23.0.0git (03/12/26)
  - AMD AFAR drop 23.2.0, LLVM 23.0.0git (04/18/26, latest available)
- Flags:
  ```
  compile: -fopenmp --offload-arch=gfx90a -O3 \
           -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
  link:    -fopenmp --offload-arch=gfx90a
  run:     OMP_TARGET_OFFLOAD=MANDATORY  LIBOMPTARGET_KERNEL_TRACE=1
  ```

## What I see

The reproducer is one source file built five ways (one `-DVARIANT_*` each). The kernel arithmetic is byte-identical in all five: a register-heavy blob that pushes ~90 private real scalars and a few small private arrays through a long dependent sqrt/sign/divide chain so nothing folds away. The only thing that differs is how the two small integers reach the kernel. Output of `LIBOMPTARGET_KERNEL_TRACE=1` on AFAR 23.2.0 (`occ` is occupancy):

```
variant                          ns/elem   scratch   AGPR  SGPR-spill  VGPR-spill  occ
A  baseline, no clause            0.135       0 B       0        0           0      50%
B  firstprivate(re)   [int(2)]    6.330   35424 B     256     1155         451      12%   <- 47x
C  firstprivate(re1, re2)         0.196       0 B       0        0           0      50%
D  firstprivate(re), const index  6.347   35424 B     256     1155         451      12%   <- 47x
E  private(repriv) + fp scalars   0.203       0 B       0        0           0      50%
```
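The slowdown figures and the "single wave per SIMD" reading can be cross-checked from the table with a few lines of shell. This is a minimal sketch: the `ns/elem` values are copied from the trace above, the helper names `slowdown` and `waves_per_simd` are made up for illustration, and the limit of 8 resident waves per SIMD on gfx90a is an assumption stated in the code.

```bash
#!/usr/bin/env bash

# slowdown BASE T: print T / BASE to one decimal place.
slowdown() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", t / base }'
}

# waves_per_simd OCC_PERCENT: convert an occupancy percentage to waves per SIMD.
# Assumption: gfx90a allows at most 8 resident waves per SIMD.
waves_per_simd() {
  awk -v occ="$1" 'BEGIN { printf "%.0f\n", occ * 8 / 100 }'
}

# ns/elem for variants B..E, compared against variant A (0.135 ns/elem).
while read -r name t; do
  echo "$name: $(slowdown 0.135 "$t")x of variant A"
done <<'EOF'
B 6.330
C 0.196
D 6.347
E 0.203
EOF

echo "12% occupancy ~ $(waves_per_simd 12) wave(s) per SIMD"
echo "50% occupancy ~ $(waves_per_simd 50) wave(s) per SIMD"
```

Under that assumption, 12% corresponds to one wave per SIMD and 50% to four. The ratios also show that C and E run at about 1.5x the baseline time, so the scalar forms cost a little run time but no scratch or occupancy.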

## It's `firstprivate` of an array specifically — not the indexing

The natural guess is "a runtime-indexed private array can't stay in registers, so it spills — expected." That's wrong here. A 2x2 over {clause} x {how the array is indexed} rules it out:

|                                        | read `re(i)` (dynamic) | read `merge(re(1),re(2),..)` (constant) |
|----------------------------------------|------------------------|------------------------------------------|
| `firstprivate(re)`                     | **B: spills, 12% occ** | **D: spills, 12% occ**                   |
| `private`, seeded from `firstprivate` scalars | **E: 0 scratch, 50% occ** | C (scalars): 0 scratch, 50% occ  |

- **D** reads the firstprivate array with *constant* indices and spills just as hard as B, so the dynamic index isn't the cause.
- **E** reads a `private` array with a *dynamic* index and is perfectly fine, so a dynamically-indexed private array isn't the cause either.

E is the interesting one: it expresses the exact semantics of `firstprivate(re)` by hand, as a per-work-item private array seeded from the original values (carried in as two `firstprivate` scalars), and the compiler lowers it to zero-scratch code at the baseline 50% occupancy. So the back end is fully capable of generating good code for these semantics; it only goes wrong when the clause is spelled `firstprivate(<array>)`.

## Likely mechanism

The same source on the public ROCm 7.2.0 release doesn't even link — the firstprivate-array variants leave an undefined device symbol:

```
ld.lld: error: undefined symbol: _FortranAAssign
>>> referenced by ...__omp_offloading_..._run_sweep...
```

`_FortranAAssign` is the Fortran runtime's descriptor-assignment helper. So the `firstprivate(array)` copy-in appears to be lowered through the general array-assignment runtime path rather than as a plain value copy. On 7.2.0 that helper isn't present on the device (link error); on the 23.x drops it appears to be inlined into the kernel as a large scratch-spilling blob. If that is right, it is one root cause with two surface failures. The scalar and plain-`private` forms don't take that path. (Log: `results/rocm-7.2.0-link-failure.txt` in the repo.)

## Not a stale-drop artifact

The bug is present on the newest AFAR drop (23.2.0), and the spill is *larger* than on 23.1.0: scratch 20.8 KB -> 35.4 KB, AGPR 128 -> 256, and 23.2.0 picks up 1155 SGPR + 451 VGPR spills that 23.1.0 didn't have. The trace for both drops is in `results/kernel_trace.txt`.

Cray `ftn` and `nvfortran` offload builds of the same code are unaffected.

## Reproduce

```bash
git clone https://github.com/sbryngelson/compiler-bugs
cd compiler-bugs/amd/flang-firstprivate-array-occupancy
./build.sh          # builds fp_A..fp_E; also prints the .llvm.offloading section size
sbatch run.sbatch   # or, interactively, run each fp_* with LIBOMPTARGET_KERNEL_TRACE=1
```

A quick static fingerprint without running: the embedded GPU code object (`.llvm.offloading` section) is ~37x larger for the firstprivate-array variants (871,504 vs 23,544 bytes on AFAR 23.1.0). `build.sh` prints this size; to extract the section by hand, use the `llvm-objcopy` from the same drop.
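A minimal sketch of that manual check, assuming an `llvm-objcopy` from the matching drop is on `PATH`. The function names are illustrative and not part of the repo, and which file carries the `.llvm.offloading` section (object or final binary) depends on the build, so the path is left as an argument.

```bash
#!/usr/bin/env bash

# offload_section_size FILE: print the size in bytes of FILE's .llvm.offloading section.
# Works on a scratch copy so FILE itself is never rewritten.
offload_section_size() {
  local file="$1" tmpdir size
  tmpdir="$(mktemp -d)" || return 1
  if ! llvm-objcopy --dump-section=.llvm.offloading="$tmpdir/offload.bin" \
        "$file" "$tmpdir/copy"; then
    rm -rf "$tmpdir"
    return 1
  fi
  size=$(wc -c < "$tmpdir/offload.bin")
  rm -rf "$tmpdir"
  echo $((size))
}

# size_ratio BIG SMALL: print BIG / SMALL to one decimal place.
size_ratio() {
  awk -v big="$1" -v small="$2" 'BEGIN { printf "%.1f\n", big / small }'
}

# Example (paths are placeholders for whatever your build produces):
#   size_ratio "$(offload_section_size path/to/variant_B)" \
#              "$(offload_section_size path/to/variant_A)"
```

With the AFAR 23.1.0 sizes quoted above, `size_ratio 871504 23544` prints `37.0`.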

## Workaround

Carry the values in as scalars and `firstprivate` those (variant C), or use a plain `private` array seeded from `firstprivate` scalars (variant E). Both stay register-resident at the baseline 50% occupancy. Posting in case the `firstprivate`-of-an-array lowering is straightforward to route away from `_FortranAAssign` toward a value copy. Happy to test patches or grab more traces (IR, `--save-temps`, rocprof) on Frontier.
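For readers who want the shape of the variant E workaround without opening the repo, here is a hypothetical sketch written out as a standalone file. The routine name `sweep_sketch`, the array `q`, and the loop body are invented for illustration and are not copied from the reproducer; only the clause spelling follows the description above.

```bash
#!/usr/bin/env bash
# Write a hypothetical Fortran sketch of the variant E pattern to a file.
# All names are illustrative, not taken from the reproducer.
cat > fp_workaround_sketch.f90 <<'EOF'
subroutine sweep_sketch(n, re, q)
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, intent(in) :: n
  integer, dimension(2), intent(in) :: re
  real(real64), intent(inout) :: q(n)
  integer :: i, re1, re2
  integer, dimension(2) :: repriv

  ! Unpack on the host so only scalars appear in the firstprivate clause.
  re1 = re(1)
  re2 = re(2)

  !$omp target teams distribute parallel do map(tofrom: q) &
  !$omp& firstprivate(re1, re2) private(repriv)
  do i = 1, n
    ! Seed the private array by hand instead of writing firstprivate(re).
    repriv(1) = re1
    repriv(2) = re2
    ! Dynamic index into the private array, as in variant E.
    q(i) = q(i) + real(repriv(1 + mod(i, 2)), real64)
  end do
  !$omp end target teams distribute parallel do
end subroutine sweep_sketch
EOF

# Compile for the device with the flags from the Environment section, e.g.:
#   amdflang -fopenmp --offload-arch=gfx90a -O3 -c fp_workaround_sketch.f90
```

This toy kernel is not register-heavy, so it only illustrates the clause spelling; the spill itself needs the full reproducer. Dropping `repriv` and reading `re1`/`re2` directly gives the variant C form.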


