[AMDGPU] Add hot block register renaming pass #371
base: amd-staging
Problem
A performance regression was observed in register-intensive kernels (e.g., rocRAND MTGP32) due to high register pressure in hot basic blocks. The greedy register allocator tends to reuse the same physical registers for multiple short-lived values within a basic block, which creates false WAW (Write-After-Write) dependencies. These false dependencies prevent the Post-RA scheduler from reordering instructions effectively, leading to suboptimal scheduling around barriers and memory operations.
Solution
This patch introduces a new post-allocation optimization pass (
Key Features
Algorithm
Technical Details
Pass Placement
The pass runs in the pre-rewrite phase, after greedy register allocation but before VirtRegRewriter:
Implementation
API Usage
The pass uses standard LLVM register allocation infrastructure:
Testing
Lit Tests
Three comprehensive test cases in
All tests pass with both legacy and new pass managers.
Regression Testing
Performance Results
Tested on rocRAND MTGP32 kernel (register-intensive workload):
Statistics (on MTGP32 kernel)
- 18 hot blocks processed
The most critical block (BB#31) had 34 values moved from 8 high-density registers to free registers, which allowed the Post-RA scheduler to better reorder instructions around barriers.
Future Work
Potential enhancements (not included in this patch):
Reviewers
Please review with focus on:
This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false WAW dependencies by moving local values to unused physical registers.
The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above a frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.
Key features:
- Respects tied operands and register allocation constraints
- Honors occupancy-based VGPR limits to avoid spilling
- Disabled by default (enable with -amdgpu-enable-hot-block-reg-renaming)
- Includes comprehensive lit tests
Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32.
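The selection step described above can be sketched as a standalone C++ model. This is a minimal sketch under stated assumptions, not the pass's implementation: `LocalValue`, `isHotBlock`, `computeValueDensity`, `renameHotBlock` and the `MaxVGPRs` budget are hypothetical names; a register counts as free if no value in the block is assigned to it (the real pass checks live-interval interference instead); and tied-operand handling is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified model of one block after greedy allocation.
// These types are illustrative; they are not the pass's data structures.
struct LocalValue {
  unsigned VirtReg;    // virtual register id
  unsigned PhysReg;    // VGPR index chosen by the greedy allocator
  unsigned SizeInBits; // only 32-bit values are candidates
  bool LocalToBlock;   // live range starts and ends inside this block
};

// Hypothetical hotness test: a block is hot if its frequency exceeds a threshold.
bool isHotBlock(std::uint64_t BlockFreq, std::uint64_t Threshold) {
  return BlockFreq > Threshold;
}

// Value density: how many values the allocator packed into each VGPR.
std::map<unsigned, unsigned>
computeValueDensity(const std::vector<LocalValue> &Values) {
  std::map<unsigned, unsigned> Density;
  for (const LocalValue &V : Values)
    ++Density[V.PhysReg];
  return Density;
}

// Move candidates off shared VGPRs onto VGPRs that no value in the block uses,
// never going above MaxVGPRs (a stand-in for the occupancy-based limit).
// Simplification: "unused in this block" replaces a real interference check.
// Returns a map VirtReg -> new PhysReg.
std::map<unsigned, unsigned>
renameHotBlock(const std::vector<LocalValue> &Values, unsigned MaxVGPRs) {
  std::map<unsigned, unsigned> Density = computeValueDensity(Values);
  std::vector<unsigned> Free;
  for (unsigned R = 0; R < MaxVGPRs; ++R)
    if (Density.count(R) == 0)
      Free.push_back(R);

  std::map<unsigned, unsigned> Moves;
  std::size_t NextFree = 0;
  for (const LocalValue &V : Values) {
    if (NextFree == Free.size())
      break; // no free VGPRs left within the budget
    if (V.SizeInBits != 32 || !V.LocalToBlock)
      continue; // conservative: only 32-bit block-local values
    unsigned &Count = Density[V.PhysReg];
    if (Count <= 1)
      continue; // the last value keeps its original register
    --Count;
    Moves[V.VirtReg] = Free[NextFree++];
  }
  return Moves;
}
```

With three values packed into VGPR0, the model moves two of them to unused VGPRs and leaves the third in place; 64-bit and non-local values are never touched, and a tight budget stops the renaming early.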
- Rename canMoveValue to isVirtRegMovable for clarity
- Add assertions to verify single-value precondition
- Restore VRM->getPhys check: NOT redundant due to register aliasing (register units are shared between aliased registers like VGPR0 and VGPR0_VGPR1)
- Improve tied operand check to verify tied source register compatibility
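The aliasing point in the commit above can be illustrated with a toy register-unit table. In this sketch, the names, the `RegUnits` table and its unit numbering are hypothetical (not LLVM's generated tables): a candidate that looks free when compared by register name is actually occupied because an aliasing tuple register covers one of its units.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy register-unit table: aliasing registers share units, so VGPR0_VGPR1
// covers the units of both VGPR0 and VGPR1. Illustrative numbering only.
const std::map<std::string, std::vector<unsigned>> RegUnits = {
    {"VGPR0", {0}},          {"VGPR1", {1}},          {"VGPR2", {2}},
    {"VGPR0_VGPR1", {0, 1}}, {"VGPR1_VGPR2", {1, 2}},
};

// Naive check: compares register names only, so it misses aliases.
bool isFreeByName(const std::string &Cand, const std::set<std::string> &Assigned) {
  return Assigned.count(Cand) == 0;
}

// Unit-based check: the candidate is free only if none of its units is
// covered by any already-assigned register.
bool isFreeByUnits(const std::string &Cand, const std::set<std::string> &Assigned) {
  std::set<unsigned> Busy;
  for (const std::string &R : Assigned)
    for (unsigned U : RegUnits.at(R))
      Busy.insert(U);
  for (unsigned U : RegUnits.at(Cand))
    if (Busy.count(U) != 0)
      return false;
  return true;
}
```

With `VGPR0_VGPR1` assigned, `VGPR1` passes the name check but fails the unit check, which is the kind of case a register-identity test alone would miss.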
This flips the default of -amdgpu-enable-hot-block-reg-renaming to true to exercise the pass across large CI/CT builds. This is a temporary enablement to flush out issues; users can still disable it with -mllvm -amdgpu-enable-hot-block-reg-renaming=false.
d07ab94 to 25b7541
Fix two assertion failures discovered during CI/CT testing with rocBLAS kernels:
1. isVirtRegMovable() crashed on PHI nodes with multiple value definitions. Converted the assertions to early-return checks, allowing the pass to skip unmovable registers instead of crashing on legitimate IR patterns.
2. tryMoveValue() assumed that a LiveIntervalUnion contains only virtual registers, but it can contain physical registers after allocation. Added an isVirtual() check before calling VirtRegMap::getPhys() to prevent assertion failures.
Both fixes improve robustness without affecting correctness or performance.
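The two fixes above follow a common pattern: reject an unsupported input with an early return instead of an assertion, and consult the virtual-to-physical map only for virtual registers. The sketch below models that pattern with hypothetical stand-ins (`RegModel`, `VirtToPhysModel`, `isMovableModel`, `resolvePhys`); in LLVM the corresponding checks would go through `Register::isVirtual()` and `VirtRegMap::getPhys()`, which this model does not call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>

// Hypothetical stand-in for a register operand.
struct RegModel {
  unsigned Id;
  bool Virtual; // plays the role of Register::isVirtual()
};

// Hypothetical stand-in for a virtual-to-physical assignment map; it only
// holds entries for virtual registers.
using VirtToPhysModel = std::map<unsigned, unsigned>;

// Early return instead of assert(NumValNums == 1): a live range with several
// value numbers (e.g. one involving a PHI) is skipped rather than crashing.
bool isMovableModel(std::size_t NumValNums) { return NumValNums == 1; }

// Resolve an interval-union entry to a physical register. Physical entries
// are returned as-is; only virtual entries are looked up in the map.
std::optional<unsigned> resolvePhys(const RegModel &R, const VirtToPhysModel &VRM) {
  if (!R.Virtual)
    return R.Id; // already physical: nothing to look up
  auto It = VRM.find(R.Id);
  if (It == VRM.end())
    return std::nullopt; // virtual register without an assignment
  return It->second;
}
```

Returning `std::optional` keeps the "no assignment" case explicit for the caller instead of turning it into an assertion failure.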
25b7541 to 6cb509c
Three critical correctness fixes for the Hot Block Register Renaming pass:
Fix #0 (Kernel-Only): Restrict the pass to kernel functions. Post-RA passes cannot safely modify non-kernel functions because they have no mechanism to update the RegMask operands on the callers' call instructions, which would lead to inter-procedural register corruption.
Fix #1a (Redefinitions): Check that the target free register is not redefined by any instruction within the virtual register's live range. Without this check, moving a value to a register that gets overwritten mid-range causes segfaults.
Fix #1b (Call Clobbers): Use LiveIntervals::checkRegMaskInterference() to verify that the target register is not clobbered by any call instruction within the live range. This prevents incorrect register assignments across function calls.
All fixes were verified on the aomp-complex test case (segfault fixed) and on the rocRAND MTGP32 kernel (117 values remapped, original optimization preserved).
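The checks from Fix #1a and Fix #1b can be modelled on a single block. This is a hypothetical sketch: `InstrModel`, its `Slot` field (standing in for a slot index), `contains` and `isTargetSafe` are illustrative names, the live range is reduced to the open interval between its def and its last use, and a call's register mask is reduced to an explicit list of preserved registers. The real pass uses `LiveIntervals::checkRegMaskInterference()` for the call check.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical instruction record for one block, in program order.
struct InstrModel {
  unsigned Slot;                       // position in the block (slot-index stand-in)
  std::vector<unsigned> DefRegs;       // physical registers written
  bool IsCall = false;                 // call carrying a register mask
  std::vector<unsigned> PreservedRegs; // for calls: registers the mask preserves
};

static bool contains(const std::vector<unsigned> &Regs, unsigned R) {
  return std::find(Regs.begin(), Regs.end(), R) != Regs.end();
}

// True if a value defined at DefSlot and last used at LastUseSlot can be
// moved into Target: no instruction strictly inside the range redefines
// Target (Fix #1a), and no call inside the range clobbers it (Fix #1b).
bool isTargetSafe(const std::vector<InstrModel> &Block, unsigned DefSlot,
                  unsigned LastUseSlot, unsigned Target) {
  for (const InstrModel &MI : Block) {
    if (MI.Slot <= DefSlot || MI.Slot >= LastUseSlot)
      continue; // outside the open interval (DefSlot, LastUseSlot)
    if (contains(MI.DefRegs, Target))
      return false; // Fix #1a: Target is overwritten mid-range
    if (MI.IsCall && !contains(MI.PreservedRegs, Target))
      return false; // Fix #1b: the call's mask does not preserve Target
  }
  return true;
}
```

For a value live from slot 0 to slot 6, register 7 (written at slot 2) and register 9 (not preserved by the call at slot 4) are rejected, while register 8, which the call preserves, is accepted.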
NB: The pass is enabled by default for testing purposes. DO NOT MERGE!
This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false dependencies by moving local values to unused physical registers.
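To make the false-dependency argument concrete, the toy model below counts the ordering constraints that exist only because a register is reused: WAW (output) dependencies, plus WAR (anti) dependencies, which register reuse creates as well. `InstModel` and `countFalseDeps` are hypothetical names, each instruction writes at most one register, and true (RAW) dependencies are deliberately not counted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical instruction: at most one physical register written.
struct InstModel {
  int Def;               // physical register written, or -1 if none
  std::vector<int> Uses; // physical registers read
};

// Count WAW and WAR edges in a block, i.e. orderings caused only by reuse.
std::size_t countFalseDeps(const std::vector<InstModel> &Block) {
  std::set<int> Written;                      // registers written so far
  std::map<int, std::size_t> ReadsSinceWrite; // readers of the current value
  std::size_t Count = 0;
  for (const InstModel &I : Block) {
    if (I.Def >= 0) {
      if (Written.count(I.Def) != 0)
        ++Count; // WAW: must stay after the previous write
      Count += ReadsSinceWrite[I.Def]; // WAR: must stay after earlier reads
      Written.insert(I.Def);
      ReadsSinceWrite[I.Def] = 0;
    }
    for (int U : I.Uses)
      if (U != I.Def) // a redefining instruction reads the old value itself
        ++ReadsSinceWrite[U];
  }
  return Count;
}
```

In the reused sequence (load into register 0, store it, load into register 0 again, store it), the second load must wait for both the first load (WAW) and the first store's read (WAR). Renaming the second value to register 1 removes both constraints, leaving the scheduler free to hoist it.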
The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above a frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.
Key features:
Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32 on top of fixing the 5% regression.