[AMDGPU/RDNA4] End-to-end support for global_load_tr (global memory transpose load) #24454

## Background

RDNA4 (gfx1200+) introduces the `global_load_tr_b128` / `global_load_tr_b64` ISA
instructions that load 8 contiguous N-direction elements per lane and perform a
hardware 8×8 wave-level cross-lane transpose in a single operation. For BF16/F16
matmuls where the B matrix is stored in row-major order (N-innermost) but the MMA
intrinsic expects K-innermost register layout, this instruction replaces a
`buffer_load` + shared memory transpose with a single global load, saving LDS
bandwidth and reducing barrier pressure.
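
The layout change can be pictured with a small Python model. This is a simplified sketch, not the hardware's exact lane assignment (the ISA reference is authoritative): it assumes 8 lanes, each loading 8 contiguous N elements from one K row of a row-major B block, and models the transpose as a plain 8×8 exchange between lanes. The helper names are illustrative only.

```python
# Simplified model of one 8x8 block of row-major B, indexed as b[k][n].
TILE = 8

def load_rows(b, k0, n0):
    """Lane i loads 8 contiguous N-direction elements from row k0 + i."""
    return [[b[k0 + lane][n0 + j] for j in range(TILE)] for lane in range(TILE)]

def cross_lane_transpose(regs):
    """After the exchange, lane i holds element i taken from every source lane."""
    return [[regs[src][lane] for src in range(TILE)] for lane in range(TILE)]

# Tag every element with its (k, n) coordinate so the layout is visible.
b = [[(k, n) for n in range(TILE)] for k in range(TILE)]
before = load_rows(b, 0, 0)            # lane 3 holds (3,0) .. (3,7): N-innermost
after = cross_lane_transpose(before)   # lane 3 holds (0,3) .. (7,3): K-innermost
```

In this model each lane ends up with a K-direction slice for a single N column, which is the register layout the text says the MMA intrinsic expects.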

The MLIR `amdgpu.global_transpose_load` op and its lowering to ROCDL intrinsics
were upstreamed in llvm/llvm-project#195287. This issue tracks the IREE-side work
to enable the instruction end-to-end through the GPU codegen pipeline.

## Planned PR Sequence

The work is structured bottom-up so each PR is independently reviewable and testable.

---

### PR 1 — Teach `ROCDLPrefetchSharedMemoryCopy` to recognize `amdgpu.global_transpose_load`

The software-pipelining pass that hoists global memory reads for double-buffering
currently only recognizes `vector.transfer_read` as a global read root. Extend it
to also treat `amdgpu::GlobalTransposeLoadOp` as a pipeable global memory read so
that future PRs get prefetching for free.

**Scope:**
- `ROCDLPrefetchSharedMemoryCopy.cpp`: extend `analyzeIfOp` and
  `identifyRootOperations` to handle `GlobalTransposeLoadOp`
- Test: lit test showing a kernel with `amdgpu.global_transpose_load` gets
  double-buffered the same way a `transfer_read` kernel does
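
As a rough illustration of what "recognizing a root" means here, the sketch below models the loop body as a list of op names and the pass's matching as a set lookup. It is a toy under stated assumptions: `identify_roots` and `double_buffer_schedule` are hypothetical stand-ins, not the pass's actual functions, and the schedule is the generic prefetch-one-iteration-ahead shape rather than the pass's real output.

```python
# Op names come from the text; everything else is illustrative.
ROOTS_BEFORE = {"vector.transfer_read"}
ROOTS_AFTER = ROOTS_BEFORE | {"amdgpu.global_transpose_load"}

def identify_roots(loop_body, roots):
    """Return positions of ops treated as pipeable global reads."""
    return [i for i, op in enumerate(loop_body) if op in roots]

def double_buffer_schedule(trip_count):
    """Prologue loads iteration 0; each step prefetches i + 1, then computes i."""
    schedule = [("load", 0)]
    for i in range(trip_count):
        if i + 1 < trip_count:
            schedule.append(("load", i + 1))
        schedule.append(("compute", i))
    return schedule

body = ["amdgpu.global_transpose_load", "vector.transfer_write", "mma"]
roots_before = identify_roots(body, ROOTS_BEFORE)  # no root found, no pipelining
roots_after = identify_roots(body, ROOTS_AFTER)    # load at position 0 is a root
```

With the extended root set, a kernel built around `amdgpu.global_transpose_load` becomes eligible for the same load-ahead schedule as a `transfer_read` kernel.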

---

### PR 2 — Pattern match `vector.transfer_read + vector.transpose` → `amdgpu.global_transpose_load`

Add a pattern to `ROCDLLoadToTransposeLoad` that matches a `vector<1x8xT>`
transfer_read from flat global memory followed by a `[1,0]` transpose and replaces
it with `amdgpu.global_transpose_load` plus a corrected contiguous write to the
shared memory allocation.

The corrected write indices use K-inner addressing:
`N_new = N_base + K_single % N`, `K_new = (K_single / N) * N`
so the 8 lanes write a contiguous K-direction slice rather than a strided N slice.
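
A worked instance of the index formula may help. The sketch assumes that the divisor `N` in the formula is the 8-element transpose tile width and that `/` is integer division; both are readings of the text, not confirmed details of the pattern. The function name is hypothetical.

```python
def corrected_write_indices(n_base, k_single, tile=8):
    """Apply N_new = N_base + K_single % N and K_new = (K_single / N) * N.

    `tile` plays the role of N in the formula (assumed to be 8).
    """
    n_new = n_base + k_single % tile
    k_new = (k_single // tile) * tile
    return n_new, k_new

# Sweep K_single over two tiles for a fixed N_base of 16.
table = [corrected_write_indices(16, k) for k in range(16)]
```

Under these assumptions, `K_single` values 0 through 7 map to `K_new = 0` with `N_new` running from 16 to 23, and values 8 through 15 map to `K_new = 8` with the same `N_new` range.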

**Scope:**
- `ROCDLLoadToTransposeLoad.cpp`: new rewrite pattern, gated on gfx1200+
- `Passes.cpp`: enable the pass for RDNA4 targets in addition to the existing gfx950 path
- Test: lit test showing the `transfer_read + transpose` chain is replaced with
  `amdgpu.global_transpose_load` and the write indices are correct

---

### PR 3 — New `UseGlobalTransposeLoad` promotion attribute and specialized operand promotion

Introduce `IREEGPU_UseGlobalTransposeLoad` attr implementing both
`IREEGPU_PromotionAttr` and `IREECodegen_LoweringConfigAttrInterface`.
Extend `GPUPromoteMatmulOperands` with a specialized promotion path that:
- Strips `amdgpu.fat_raw_buffer_cast` to expose a flat global pointer
- Creates a `linalg.generic` copy with K-inner thread mapping (tile `[N=8, K=1]`,
  K→lane) so the 8×8 wave transpose semantics are correct
- Tags the copy with `UseGlobalTransposeLoadAttr` as its lowering config
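
The `[N=8, K=1]` tiling with K mapped to lanes can be sketched as a tile enumeration. This is an illustration only: it assumes the K loop is innermost so that consecutive tile indices (and therefore consecutive lanes) step along K, and `tile_copy` is a hypothetical helper, not part of `GPUPromoteMatmulOperands`.

```python
def tile_copy(n_size, k_size, tile_sizes=(8, 1)):
    """Enumerate (n_start, k_start) tile origins with K varying fastest."""
    tile_n, tile_k = tile_sizes
    return [
        (n0, k0)
        for n0 in range(0, n_size, tile_n)
        for k0 in range(0, k_size, tile_k)
    ]

# An 8x8 region: tile i (assumed to be handled by lane i) covers
# N elements 0..7 at K index i.
tiles = tile_copy(8, 8)
```

Each tile is 8 contiguous N elements at a single K index, so 8 adjacent lanes together cover one 8×8 block, which is the shape the wave-level transpose operates on.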

**Scope:**
- `IREEGPUAttrs.td/.cpp`: new attr, `getStaticTilingLevelSizes` returns `{8, 1}`
- `DerivedConfigUtils.h/.cpp`: `globalTransposeLoadTileSizes`
- `GPUPromoteMatmulOperands.cpp`: `transposePromoteOperand`, dispatch from
  `promoteOperand` when attr is `UseGlobalTransposeLoadAttr`
- Test: lit test showing the promoted copy has the correct K-inner indexing maps
  and lowering config

---

### PR 4 — Enable `UseGlobalTransposeLoad` in kernel config and `ConfigUtils`

Wire the new promotion attr into the kernel configuration path:
- `ConfigUtils.cpp`: add a `supportsGlobalTransposeLoad` lambda (accepts f16, bf16,
  i16, i8, and fp8 variants). When the `isRDNA4` check passes,
  `UseGlobalTransposeLoadAttr` is selected for the RHS when `!transposedRhs` and
  for the LHS when `transposedLhs`. The whole path is gated on a new
  `useGlobalTransposeLoad` bool parameter (default false)
- `KernelConfig.cpp`: new hidden flag
  `--iree-llvmgpu-use-global-transpose-load` (default off) passed through to
  `setMatmulLoweringConfig` and `setIGEMMConvolutionLoweringConfig`
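
The selection logic described above can be summarized in a few lines. This is a sketch of the decision table only, with several assumptions: promotion kinds are plain strings, `"default"` is a placeholder for whatever promotion is used today, and the fp8 check is a placeholder prefix match because the text does not list the exact fp8 variants.

```python
def supports_global_transpose_load(elem_type):
    """Element types named in the text; the fp8 match is a placeholder."""
    return elem_type in {"f16", "bf16", "i16", "i8"} or elem_type.startswith("f8")

def select_promotions(is_rdna4, use_global_transpose_load, elem_type,
                      transposed_lhs, transposed_rhs):
    """Return (lhs_promotion, rhs_promotion) as illustrative strings."""
    lhs = rhs = "default"
    enabled = (use_global_transpose_load and is_rdna4
               and supports_global_transpose_load(elem_type))
    if enabled:
        if transposed_lhs:
            lhs = "UseGlobalTransposeLoad"
        if not transposed_rhs:
            rhs = "UseGlobalTransposeLoad"
    return lhs, rhs

# Row-major B (not transposed) on RDNA4 with the flag on: only the RHS changes.
example = select_promotions(True, True, "bf16", False, False)
```

Because the flag defaults to off, every existing configuration keeps its current promotion until the option is passed explicitly.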

**Scope:**
- `ConfigUtils.h/.cpp`: new parameter, promotion array selection logic
- `KernelConfig.cpp`: new `clUseGlobalTransposeLoad` flag
- Test: end-to-end lit test compiling a BF16 matmul for gfx1201 with the flag
  enabled and checking that `global_load_tr_b128` appears in the generated ISA

---

## Testing Strategy

Each PR includes a targeted lit test. PR 4 additionally requires a numerical
correctness check (`iree-run-module` with structured inputs) and an assembly-level
check that `global_load_tr_b128` is present and VGPR spill count is within budget.
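
The assembly-level check could be scripted along these lines. This is a minimal sketch: it assumes the spill count is read from the `.vgpr_spill_count` entry in the kernel metadata of the emitted assembly, the `max_vgpr_spills` budget is a hypothetical parameter, and the sample text is synthetic, written for illustration rather than taken from compiler output.

```python
import re

def check_isa(asm_text, max_vgpr_spills=0):
    """Return (has_transpose_load, spill_count, passed) for an assembly dump."""
    has_tr = "global_load_tr_b128" in asm_text
    match = re.search(r"\.vgpr_spill_count:\s*(\d+)", asm_text)
    spills = int(match.group(1)) if match else None
    passed = has_tr and spills is not None and spills <= max_vgpr_spills
    return has_tr, spills, passed

# Synthetic sample, not real compiler output.
SAMPLE = """
    global_load_tr_b128 v[0:3], v[4:5], off
    .vgpr_spill_count: 0
"""
result = check_isa(SAMPLE)
```

A missing metadata entry is treated as a failure rather than as zero spills, so a malformed dump cannot pass silently.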

## Related

- LLVM upstream op: llvm/llvm-project#195287
- ISA reference: GFX12 ISA, `global_load_tr_b128` / `global_load_tr_b64`
- RDNA4 chipsets: gfx1200, gfx1201
