[WaveTransform] Use S_AND_SAVEEXEC for the EXEC update at divergent OriginBranch nodes #3040

Open
lalaniket8 wants to merge 1 commit into amd-feature/wave-transform from amd/dev/lalaniket8/wt-use-sandsaveexec

@lalaniket8

At each divergent OriginBranch, the Wave Transform pass previously emitted a separate XOR to compute the rejoin (exit) lane delta and an S_MOV to write the primary-successor mask into EXEC. This replaces that pair with a single fused S_AND_SAVEEXEC, which both writes the primary mask into EXEC and saves the old EXEC value in one instruction. The saved value is then reused directly as the rejoin contribution, eliminating the explicit S_XOR.

This also resolves a redundant double-XOR sequence that arose when the divergence condition was itself an inverted (XOR'd) mask — both XORs and the AND-to-EXEC now collapse into the single S_AND_SAVEEXEC.

Codegen change

Case 1 — a divergent branch

Before:

%cond      = V_CMP...                      ; loop-continue condition
%PrimAcc   = %PrimAcc OR %cond             ; build primary-successor mask
%Rejoin    = S_XOR $exec, %PrimAcc         ; exit lanes = exec & ~%PrimAcc
%RejoinAcc = S_OR  %RejoinAcc, %Rejoin     ; accumulate exit lanes (reads $exec via %Rejoin)
$exec      = S_MOV %PrimAcc                ; write primary mask into EXEC
SI_WAVE_CF_EDGE / S_CBRANCH_EXECZ %exit / S_BRANCH %body

After:

%cond      = V_CMP...                      ; loop-continue condition
%PrimAcc   = %PrimAcc OR %cond             ; build primary-successor mask
%SavedExec = S_AND_SAVEEXEC %PrimAcc       ; %SavedExec = old EXEC; $exec = $exec & %PrimAcc
%RejoinAcc = S_OR %RejoinAcc, %SavedExec   ; accumulate (placed after EXEC is updated)
SI_WAVE_CF_EDGE / S_CBRANCH_EXECZ %exit / S_BRANCH %body
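
The effect of the two sequences on the mask registers can be sketched with plain integers standing in for lane masks. This is only an illustrative model, not the pass's code: the helper names `s_and_saveexec`, `before_sequence`, and `after_sequence` are hypothetical, SCC is ignored, and `prim_acc` is assumed to be a subset of the incoming EXEC, as the description states.

```python
def s_and_saveexec(exec_mask: int, src: int):
    """Model of S_AND_SAVEEXEC: returns (saved old EXEC, new EXEC = EXEC & src).

    SCC is not modelled in this sketch.
    """
    return exec_mask, exec_mask & src


def before_sequence(exec_mask: int, prim_acc: int, rejoin_acc: int):
    """Old codegen: S_XOR for the exit lanes, S_OR to accumulate, S_MOV into EXEC."""
    rejoin = exec_mask ^ prim_acc      # exit lanes (equals exec & ~prim_acc for a subset)
    rejoin_acc |= rejoin               # accumulate exit lanes
    exec_mask = prim_acc               # S_MOV: write primary mask into EXEC
    return exec_mask, rejoin_acc


def after_sequence(exec_mask: int, prim_acc: int, rejoin_acc: int):
    """New codegen: one S_AND_SAVEEXEC, then S_OR of the saved EXEC."""
    saved_exec, exec_mask = s_and_saveexec(exec_mask, prim_acc)
    rejoin_acc |= saved_exec           # accumulate the full old EXEC
    return exec_mask, rejoin_acc


# Four active lanes, lanes 0 and 2 take the primary edge.
print(before_sequence(0b1111, 0b0101, 0))  # (5, 10)
print(after_sequence(0b1111, 0b0101, 0))   # (5, 15)
```

In this model both forms leave the same value in EXEC; they differ only in whether the accumulated rejoin mask also carries the primary-successor lanes.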

Case 2 — a block that rejoins lanes, then diverges again

Before:

$exec      = S_OR $exec, %RejoinAcc        ; rejoin lanes back in
%cond2     = COPY %cond / V_CMP...         ; new divergence condition
%PrimAcc2  = %PrimAcc2 OR %cond2           ;   build primary-successor mask
%Rejoin2   = S_XOR $exec, %PrimAcc2        ;   exit lanes = exec & ~%PrimAcc2
%RejoinAcc = S_OR  %RejoinAcc, %Rejoin2    ;   accumulate exit lanes
$exec      = S_MOV %PrimAcc2               ;   write primary mask into EXEC
SI_WAVE_CF_EDGE / S_CBRANCH_EXECZ / S_BRANCH

After:

$exec       = S_OR $exec, %RejoinAcc       ; rejoin lanes back in
%cond2      = COPY %cond / V_CMP...        ; new divergence condition
%PrimAcc2   = %PrimAcc2 OR %cond2          ;   build primary-successor mask
%SavedExec2 = S_AND_SAVEEXEC %PrimAcc2     ;   %SavedExec2 = old EXEC; $exec = $exec & %PrimAcc2
%RejoinAcc  = S_OR %RejoinAcc, %SavedExec2 ;   accumulate (placed after EXEC is updated)
SI_WAVE_CF_EDGE / S_CBRANCH_EXECZ / S_BRANCH

How it works

        [OriginBranch]
         /          \
   primary lanes   (other lanes)
        |               |
   [PrimarySucc]        |
         \             /
          \           /
        [SecSucc]  <-- rejoin happens here

From the OriginBranch, the primary lanes flow to PrimarySucc; all the lanes come back together (rejoin) at SecSucc. The only thing that changes is what we accumulate into %RejoinAcc:

  • Before: Rejoin = exec & ~PrimarySuccLanes (only the lanes that are not taking the primary edge).
  • After: Rejoin = exec (the full old EXEC, saved by S_AND_SAVEEXEC).

Since PrimarySuccLanes is a subset of exec, the rejoin step exec = exec OR RejoinAcc at SecSucc produces the same final mask either way: the extra primary-successor lanes carried in the "after" form are already present in exec at the rejoin point, so OR-ing them back in changes nothing. The fused form simply defers adding those lanes instead of masking them out up front.
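
The subset argument above can be checked exhaustively on a tiny wave. The sketch below is a model under explicit assumptions, not a proof about the pass: the function name `rejoin_masks_agree` is hypothetical, the wave is only 4 lanes wide, and EXEC arriving at SecSucc is assumed to equal the primary-successor mask (no lanes exit or diverge further on the primary path).

```python
from itertools import product


def rejoin_masks_agree(wave_bits: int = 4) -> bool:
    """Compare the final EXEC at SecSucc for the 'before' and 'after' forms.

    Enumerates every old EXEC, every primary mask that is a subset of it,
    and every prior value of the rejoin accumulator.
    """
    full = 1 << wave_bits
    for exec_mask, prim, acc in product(range(full), repeat=3):
        if prim & ~exec_mask:
            continue                           # keep only prim that is a subset of exec
        acc_before = acc | (exec_mask ^ prim)  # accumulates exec & ~prim
        acc_after = acc | exec_mask            # accumulates the full saved EXEC
        exec_at_secsucc = prim                 # assumption: primary lanes arrive unchanged
        final_before = exec_at_secsucc | acc_before
        final_after = exec_at_secsucc | acc_after
        if final_before != final_after:
            return False
    return True


print(rejoin_masks_agree())
```

If the assumption about EXEC at SecSucc does not hold, for example if some primary lanes are retired before the rejoin, the two forms would need to be compared again under that scenario.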

@rocm-cciapp

rocm-cciapp Bot commented Jun 24, 2026

@lalaniket8 lalaniket8 force-pushed the amd/dev/lalaniket8/wt-use-sandsaveexec branch from d047b7f to 8452297 Compare June 25, 2026 13:10
@rocm-cciapp

rocm-cciapp Bot commented Jun 25, 2026

@vg0204 vg0204 left a comment

Just reviewed the code; still in the process of reviewing the LIT test.

// Donot mask CurReg if CurReg = S_AND_SAVEEXEC(_TERM) Reg
// Contributions from this Opc implies we are building the rejoin merge at
// secondary block and the contribution should be used as is , without EXEC
// AND masking.

Can you break this comment into shorter, properly punctuated sentences so that it conveys the point more clearly?

This comment should clearly mention all the places where SavedExec is used (the divergent block and the rejoin block), right?

I cannot see any S_OR using this SavedExec between the S_AND_SAVEEXEC and the branch instruction, as depicted in the divergent-block example. Why?

Author

It's used at L2345; the updater constructs the OR instruction when we call Updater.getValueInMiddleOfBlock().

Comment on lines 244 to 247

Author

Double XOR redundancy removed here

@cdevadas

This work needs a proper discussion before proceeding with further discussions on the PR. We have to collect all the missing optimizations and further cleanups discussed so far, and have a clear plan for how we want to proceed. It is crucial for how we want to shape the exec mask operations inserted by the wave transform pass at various blocks. Let's do that after we fix the full LIT test cases and the extended PSDB is launched.

How are these equivalent?
