The AMD register allocator writes a `num_vgprs` value to the kernel descriptor (KD) that is lower than what the emitted code actually references: `tests/stress.cu` declares 8 but uses registers up to v13, which requires at least 14. On hardware, the kernel therefore touches registers it never allocated. The likely cause is that the `max_vgpr` accounting in `emit.c` does not count register pairs or VGPRs introduced after register allocation.
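To illustrate the suspected register-pair miscount, here is a minimal sketch in C. It is not the code in `emit.c`: the `vgpr_ref` type and both function names are hypothetical, and it assumes each operand records its first VGPR index and the number of consecutive registers it spans.

```c
#include <stddef.h>

/* Hypothetical operand record (not taken from emit.c): the lowest VGPR
 * index and how many consecutive VGPRs the operand covers
 * (1 for a 32-bit value, 2 for a 64-bit pair such as v[12:13]). */
typedef struct {
    unsigned first;
    unsigned width;
} vgpr_ref;

/* Count needed VGPRs as one past the highest register actually touched,
 * taking the full width of each operand into account. */
unsigned vgprs_needed(const vgpr_ref *refs, size_t n)
{
    unsigned needed = 0;
    for (size_t i = 0; i < n; i++) {
        if (refs[i].width == 0)
            continue; /* operand touches no VGPR */
        unsigned end = refs[i].first + refs[i].width;
        if (end > needed)
            needed = end;
    }
    return needed;
}

/* For comparison: an accounting that only looks at the first register
 * of each operand, so the upper half of a pair is never counted. */
unsigned vgprs_needed_first_only(const vgpr_ref *refs, size_t n)
{
    unsigned needed = 0;
    for (size_t i = 0; i < n; i++) {
        if (refs[i].width == 0)
            continue;
        if (refs[i].first + 1 > needed)
            needed = refs[i].first + 1;
    }
    return needed;
}
```

For the operands v0, v7, and the pair v[12:13], the width-aware count is 14, while the first-register-only count is 13. If the accounting in `emit.c` has this shape, the second function would under-report in the same direction as the bug, though this sketch alone does not explain the declared value of 8.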