Background
RDNA4 (gfx1200+) introduces the global_load_tr_b128 / global_load_tr_b64 ISA
instructions that load 8 contiguous N-direction elements per lane and perform a
hardware 8×8 wave-level cross-lane transpose in a single operation. For BF16/F16
matmuls where the B matrix is stored in row-major order (N-innermost) but the MMA
intrinsic expects K-innermost register layout, this instruction replaces a
buffer_load + shared memory transpose with a single global load, saving LDS
bandwidth and reducing barrier pressure.
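As a rough mental model of the transpose described above, the following C++ sketch treats one 8×8 tile as 8 "lanes" that each load 8 contiguous N-direction elements, then swaps rows and columns. This is a conceptual illustration only: the names `Tile`, `makeRowMajorTile`, and `transposeTile` are hypothetical, and the sketch does not claim to reproduce the exact lane-to-element mapping the hardware uses across a full wave.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kTile = 8;
using Tile = std::array<std::array<std::uint16_t, kTile>, kTile>;

// Builds a tile as if lane k had loaded 8 contiguous N-direction elements
// from row k of a row-major (N-innermost) B matrix. Each value encodes its
// source coordinate as k * 100 + n so the result is easy to inspect.
Tile makeRowMajorTile() {
  Tile t{};
  for (int k = 0; k < kTile; ++k)
    for (int n = 0; n < kTile; ++n)
      t[k][n] = static_cast<std::uint16_t>(k * 100 + n);
  return t;
}

// Conceptual 8x8 cross-lane transpose (assumed model, not the ISA definition):
// afterwards lane n holds 8 K-direction elements, i.e. out[n][k] == in[k][n].
Tile transposeTile(const Tile &in) {
  Tile out{};
  for (int lane = 0; lane < kTile; ++lane)
    for (int i = 0; i < kTile; ++i)
      out[lane][i] = in[i][lane];
  return out;
}

// Usage example: element (k=5, n=3) ends up in lane 3, slot 5, and
// transposing twice restores the original tile.
void checkTransposeModel() {
  const Tile loaded = makeRowMajorTile();
  const Tile regs = transposeTile(loaded);
  assert(regs[3][5] == 503);
  assert(transposeTile(regs) == loaded);
}
```

In this model each lane ends up holding a K-innermost slice, which is the layout the text says the MMA intrinsic expects.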
The MLIR amdgpu.global_transpose_load op and its lowering to ROCDL intrinsics
were upstreamed in LLVM: llvm/llvm-project#195287. This issue tracks the IREE-side
work to enable this instruction end-to-end through the GPU codegen pipeline.
Planned PR Sequence
The work is structured bottom-up so each PR is independently reviewable and testable.
PR 1 — Teach ROCDLPrefetchSharedMemoryCopy to recognize amdgpu.global_transpose_load
The software-pipelining pass that hoists global memory reads for double-buffering
currently only recognizes vector.transfer_read as a global read root. Extend it
to also treat amdgpu::GlobalTransposeLoadOp as a pipeable global memory read so
that future PRs get prefetching for free.
Scope:
- ROCDLPrefetchSharedMemoryCopy.cpp: extend analyzeIfOp and
  identifyRootOperations to handle GlobalTransposeLoadOp
- Test: lit test showing that a kernel with amdgpu.global_transpose_load gets
  double-buffered the same way a transfer_read kernel does
PR 2 — Pattern match vector.transfer_read + vector.transpose → amdgpu.global_transpose_load
Add a pattern to ROCDLLoadToTransposeLoad that matches a vector<1x8xT>
transfer_read from flat global memory followed by a [1,0] transpose and replaces
it with amdgpu.global_transpose_load plus a corrected contiguous write to the
shared memory allocation.
The corrected write indices use K-inner addressing:
N_new = N_base + K_single % N, K_new = (K_single / N) * N
so the 8 lanes write a contiguous K-direction slice rather than a strided N slice.
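The index correction above can be written out as a small C++ helper. This is a minimal sketch that applies the formula literally with integer division and remainder. The struct `WriteIndex` and the function `correctedWriteIndex` are hypothetical names, and treating N as a positive extent and K_single as non-negative is an assumption, since the text does not define the symbols further.

```cpp
#include <cassert>
#include <cstdint>

// Corrected shared-memory write position (hypothetical helper type).
struct WriteIndex {
  std::int64_t n; // N_new
  std::int64_t k; // K_new
};

// Applies the formula from the text:
//   N_new = N_base + K_single % N
//   K_new = (K_single / N) * N
// Assumes n > 0 and kSingle >= 0 so that / and % behave as floor and modulo.
WriteIndex correctedWriteIndex(std::int64_t nBase, std::int64_t kSingle,
                               std::int64_t n) {
  assert(n > 0 && kSingle >= 0);
  return WriteIndex{nBase + kSingle % n, (kSingle / n) * n};
}
```

For example, with N_base = 16, K_single = 13, and N = 8, the helper yields N_new = 21 and K_new = 8: the remainder of K_single moves into the N coordinate, and K_new is rounded down to a multiple of N.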
Scope:
- ROCDLLoadToTransposeLoad.cpp: new rewrite pattern, gated on gfx1200+
- Passes.cpp: enable the pass for RDNA4 targets in addition to the existing
  gfx950 path
- Test: lit test showing the transfer_read + transpose chain is replaced with
  global_load_tr_b128 and the write indices are correct
PR 3 — New UseGlobalTransposeLoad promotion attribute and specialized operand promotion
Introduce IREEGPU_UseGlobalTransposeLoad attr implementing both
IREEGPU_PromotionAttr and IREECodegen_LoweringConfigAttrInterface.
Extend GPUPromoteMatmulOperands with a specialized promotion path that:
- Strips amdgpu.fat_raw_buffer_cast to expose a flat global pointer
- Creates a linalg.generic copy with K-inner thread mapping (tile [N=8, K=1],
  K→lane) so the 8×8 wave transpose semantics are correct
- Tags the copy with UseGlobalTransposeLoadAttr as its lowering config
Scope:
- IREEGPUAttrs.td/.cpp: new attr, getStaticTilingLevelSizes returns {8, 1}
- DerivedConfigUtils.h/.cpp: globalTransposeLoadTileSizes
- GPUPromoteMatmulOperands.cpp: transposePromoteOperand, dispatch from
  promoteOperand when attr is UseGlobalTransposeLoadAttr
- Test: lit test showing the promoted copy has the correct K-inner indexing maps
  and lowering config
PR 4 — Enable UseGlobalTransposeLoad in kernel config and ConfigUtils
Wire the new promotion attr into the kernel configuration path:
- ConfigUtils.cpp: supportsGlobalTransposeLoad lambda (accepts f16, bf16, i16,
  i8, fp8 variants); isRDNA4 check selects UseGlobalTransposeLoadAttr for the
  RHS when !transposedRhs and for the LHS when transposedLhs; gated on a new
  useGlobalTransposeLoad bool parameter (default false)
- KernelConfig.cpp: new hidden flag
  --iree-llvmgpu-use-global-transpose-load (default off) passed through to
  setMatmulLoweringConfig and setIGEMMConvolutionLoweringConfig
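The selection rule above can be sketched as standalone C++, independent of the real IREE types. Everything here is illustrative: the enums `ElemType` and `Promotion`, the struct `OperandPromotions`, and the function `selectPromotions` are hypothetical stand-ins. Checking the element type per operand is also an assumption, since the text does not say how supportsGlobalTransposeLoad is applied.

```cpp
#include <cassert>

// Hypothetical stand-ins for the element types and promotion attrs.
enum class ElemType { F16, BF16, I16, I8, FP8, F32 };
enum class Promotion { Default, UseGlobalTransposeLoad };

struct OperandPromotions {
  Promotion lhs = Promotion::Default;
  Promotion rhs = Promotion::Default;
};

// Mirrors the accepted types listed in the text (fp8 variants collapsed to FP8).
bool supportsGlobalTransposeLoad(ElemType t) {
  switch (t) {
  case ElemType::F16:
  case ElemType::BF16:
  case ElemType::I16:
  case ElemType::I8:
  case ElemType::FP8:
    return true;
  case ElemType::F32:
    return false;
  }
  return false;
}

// RHS is selected when it is not transposed, LHS when it is transposed;
// both require an RDNA4 target and the opt-in flag (off by default).
OperandPromotions selectPromotions(bool isRDNA4, bool useGlobalTransposeLoad,
                                   ElemType lhsType, ElemType rhsType,
                                   bool transposedLhs, bool transposedRhs) {
  OperandPromotions result;
  if (!isRDNA4 || !useGlobalTransposeLoad)
    return result;
  if (transposedLhs && supportsGlobalTransposeLoad(lhsType))
    result.lhs = Promotion::UseGlobalTransposeLoad;
  if (!transposedRhs && supportsGlobalTransposeLoad(rhsType))
    result.rhs = Promotion::UseGlobalTransposeLoad;
  return result;
}
```

Under these assumptions, a BF16 matmul with a row-major (non-transposed) RHS on an RDNA4 target picks the transpose-load promotion for the RHS only, and nothing changes unless the flag is enabled.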
Scope:
- ConfigUtils.h/.cpp: new parameter, promotion array selection logic
- KernelConfig.cpp: new clUseGlobalTransposeLoad flag
- Test: end-to-end lit test compiling a BF16 matmul for gfx1201 with the flag
  enabled and checking that global_load_tr_b128 appears in the generated ISA
Testing Strategy
Each PR includes a targeted lit test. PR 4 additionally requires a numerical
correctness check (iree-run-module with structured inputs) and an assembly-level
check that global_load_tr_b128 is present and VGPR spill count is within budget.
Related