[AMDGPU/RDNA4] End-to-end support for global_load_tr (global memory transpose load) #24454

@nirvedhmeshram

Background

RDNA4 (gfx1200+) introduces the global_load_tr_b128 / global_load_tr_b64 ISA
instructions that load 8 contiguous N-direction elements per lane and perform a
hardware 8×8 wave-level cross-lane transpose in a single operation. For BF16/F16
matmuls where the B matrix is stored in row-major order (N-innermost) but the MMA
intrinsic expects K-innermost register layout, this instruction replaces a
buffer_load + shared memory transpose with a single global load, saving LDS
bandwidth and reducing barrier pressure.
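As an illustration of the data movement described above, here is a toy model of an 8×8 transpose across 8 lanes. It is only a sketch: the lane-to-row assignment, the helper names (`load_rows`, `cross_lane_transpose`), and the sample matrix are assumptions made for the example, and the real instruction's lane mapping may differ.

```python
# Toy model only: NOT the hardware's exact lane mapping.
TILE = 8

def load_rows(b, k0, n0):
    """Each of 8 lanes reads 8 contiguous N-direction elements of row-major B.

    Assumption: lane l owns row k0 + l.
    """
    return [[b[k0 + lane][n0 + j] for j in range(TILE)] for lane in range(TILE)]

def cross_lane_transpose(regs):
    """After the transpose, lane l holds the 8 K-direction elements of column l."""
    return [[regs[k][lane] for k in range(TILE)] for lane in range(TILE)]

# Row-major B with B[k][n] = 100*k + n, so each value shows where it came from.
B = [[100 * k + n for n in range(16)] for k in range(16)]
before = load_rows(B, k0=0, n0=8)      # lane l: N-contiguous slice of row l
after = cross_lane_transpose(before)   # lane l: K-contiguous slice of column 8 + l
```

In this model `before[2]` is an N-contiguous slice of row 2, while `after[2]` walks column 10 along K, which is the K-innermost register layout the MMA intrinsic expects.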

The MLIR amdgpu.global_transpose_load op and its lowering to ROCDL intrinsics
were upstreamed in LLVM: llvm/llvm-project#195287. This issue tracks the IREE-side
work to enable this instruction end-to-end through the GPU codegen pipeline.

Planned PR Sequence

The work is structured bottom-up so each PR is independently reviewable and testable.


PR 1 — Teach ROCDLPrefetchSharedMemoryCopy to recognize amdgpu.global_transpose_load

The software-pipelining pass that hoists global memory reads for double-buffering
currently only recognizes vector.transfer_read as a global read root. Extend it
to also treat amdgpu::GlobalTransposeLoadOp as a pipeable global memory read so
that future PRs get prefetching for free.

Scope:

  • ROCDLPrefetchSharedMemoryCopy.cpp: extend analyzeIfOp and
    identifyRootOperations to handle GlobalTransposeLoadOp
  • Test: lit test showing a kernel with amdgpu.global_transpose_load gets
    double-buffered the same way a transfer_read kernel does
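
The double-buffering behaviour the test should observe can be sketched as a toy schedule. This is not the pass's implementation: `run_pipelined`, `load`, and `compute` are hypothetical names, and `load` stands in for any recognized global read root (a transfer_read today, a global_transpose_load after this PR).

```python
def run_pipelined(num_tiles, load, compute):
    """Issue the read for tile i+1 before tile i is consumed."""
    trace = []
    if num_tiles == 0:
        return trace
    current = load(0)                  # prologue: first tile is loaded up front
    trace.append(("load", 0))
    for i in range(num_tiles):
        nxt = None
        if i + 1 < num_tiles:
            nxt = load(i + 1)          # prefetch the next tile
            trace.append(("load", i + 1))
        compute(current)               # consume the tile loaded one step earlier
        trace.append(("compute", i))
        current = nxt
    return trace

results = []
trace = run_pipelined(3, load=lambda i: i * 10, compute=results.append)
```

The resulting trace interleaves each load one iteration ahead of its compute, which is the shape a lit test would expect regardless of which op performs the global read.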

PR 2 — Pattern match vector.transfer_read + vector.transpose → amdgpu.global_transpose_load

Add a pattern to ROCDLLoadToTransposeLoad that matches a vector<1x8xT>
transfer_read from flat global memory followed by a [1,0] transpose and replaces
it with amdgpu.global_transpose_load plus a corrected contiguous write to the
shared memory allocation.

The corrected write indices use K-inner addressing:
N_new = N_base + K_single % N, K_new = (K_single / N) * N
so the 8 lanes write a contiguous K-direction slice rather than a strided N slice.
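
A minimal sketch of the index arithmetic above, assuming `/` means integer (floor) division and that `N` is the 8-element tile width. The issue does not define `N` explicitly, so it is a parameter here; the function name is hypothetical.

```python
def corrected_write_indices(n_base, k_single, n=8):
    """Return (N_new, K_new) for the corrected shared memory write.

    Assumptions: integer division, and n defaults to the 8-wide tile.
    """
    n_new = n_base + k_single % n      # offset within the tile moves to N
    k_new = (k_single // n) * n        # K is rounded down to the tile boundary
    return n_new, k_new
```

For example, with `n_base = 16` and `k_single = 13` this gives `(21, 8)`: the remainder 5 is added to the N index and K is rounded down to 8.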

Scope:

  • ROCDLLoadToTransposeLoad.cpp: new rewrite pattern, gated on gfx1200+
  • Passes.cpp: enable the pass for RDNA4 targets in addition to the existing gfx950 path
  • Test: lit test showing the transfer_read + transpose chain is replaced with
    global_load_tr_b128 and the write indices are correct

PR 3 — New UseGlobalTransposeLoad promotion attribute and specialized operand promotion

Introduce IREEGPU_UseGlobalTransposeLoad attr implementing both
IREEGPU_PromotionAttr and IREECodegen_LoweringConfigAttrInterface.
Extend GPUPromoteMatmulOperands with a specialized promotion path that:

  • Strips amdgpu.fat_raw_buffer_cast to expose a flat global pointer
  • Creates a linalg.generic copy with K-inner thread mapping (tile [N=8, K=1],
    K→lane) so the 8×8 wave transpose semantics are correct
  • Tags the copy with UseGlobalTransposeLoadAttr as its lowering config

Scope:

  • IREEGPUAttrs.td/.cpp: new attr, getStaticTilingLevelSizes returns {8, 1}
  • DerivedConfigUtils.h/.cpp: globalTransposeLoadTileSizes
  • GPUPromoteMatmulOperands.cpp: transposePromoteOperand, dispatch from
    promoteOperand when attr is UseGlobalTransposeLoadAttr
  • Test: lit test showing the promoted copy has the correct K-inner indexing maps
    and lowering config

PR 4 — Enable UseGlobalTransposeLoad in kernel config and ConfigUtils

Wire the new promotion attr into the kernel configuration path:

  • ConfigUtils.cpp: supportsGlobalTransposeLoad lambda (accepts f16, bf16, i16,
    i8, fp8 variants); when the isRDNA4 check passes, select
    UseGlobalTransposeLoadAttr for the RHS when !transposedRhs and for the LHS
    when transposedLhs; all of this is gated on a new useGlobalTransposeLoad
    bool parameter (default false)
  • KernelConfig.cpp: new hidden flag
    --iree-llvmgpu-use-global-transpose-load (default off) passed through to
    setMatmulLoweringConfig and setIGEMMConvolutionLoweringConfig
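
The selection logic described above can be modeled roughly as follows. This is a sketch, not the ConfigUtils.cpp code: the Python names are hypothetical, the string spelling of element types is illustrative, matching fp8 variants by an `f8` prefix is an assumption, and so is checking each operand's element type separately.

```python
def supports_global_transpose_load(elem_type):
    """Assumed type filter: f16, bf16, i16, i8, plus anything spelled 'f8...'."""
    return elem_type in {"f16", "bf16", "i16", "i8"} or elem_type.startswith("f8")

def select_transpose_load_operands(is_rdna4, use_global_transpose_load,
                                   lhs_type, rhs_type,
                                   transposed_lhs, transposed_rhs):
    """Return which operands would get the UseGlobalTransposeLoad promotion."""
    enabled = is_rdna4 and use_global_transpose_load   # target check and flag
    return {
        # LHS qualifies only when it is transposed.
        "lhs": enabled and transposed_lhs
               and supports_global_transpose_load(lhs_type),
        # RHS qualifies only when it is NOT transposed.
        "rhs": enabled and not transposed_rhs
               and supports_global_transpose_load(rhs_type),
    }
```

With the flag left at its default (off), both entries are false, so existing configurations are unaffected.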

Scope:

  • ConfigUtils.h/.cpp: new parameter, promotion array selection logic
  • KernelConfig.cpp: new clUseGlobalTransposeLoad flag
  • Test: end-to-end lit test compiling a BF16 matmul for gfx1201 with the flag
    enabled and checking that global_load_tr_b128 appears in the generated ISA

Testing Strategy

Each PR includes a targeted lit test. PR 4 additionally requires a numerical
correctness check (iree-run-module with structured inputs) and an assembly-level
check that global_load_tr_b128 is present and VGPR spill count is within budget.
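
One possible shape for the assembly-level check is sketched below. It assumes the dumped assembly includes AMDGPU code object metadata with `.vgpr_spill_count` entries. The function name, the default budget of zero spills, and the sample text are hypothetical and not taken from real compiler output.

```python
import re

def check_isa(asm_text, max_vgpr_spills=0):
    """Return (has_transpose_load, worst_spill_count, within_budget).

    worst_spill_count is None when no .vgpr_spill_count entry is found,
    in which case within_budget is False so the gap is not silently ignored.
    """
    has_tr = "global_load_tr_b128" in asm_text
    spills = [int(s) for s in re.findall(r"\.vgpr_spill_count:\s*(\d+)", asm_text)]
    worst = max(spills) if spills else None
    within = worst is not None and worst <= max_vgpr_spills
    return has_tr, worst, within

# Illustrative text only; the operands are placeholders.
sample = """
    global_load_tr_b128 v[0:3], v[4:5], off
    .vgpr_spill_count: 0
"""
report = check_isa(sample)
```

In a lit test the same two conditions would more naturally be expressed as FileCheck lines; the script form is shown only to make the pass/fail criteria explicit.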
