[ENHANCEMENT] Add support for Apex RMSNorm for use in qk-norm #1261
base: main
Conversation
```python
k_layernorm=FusedLayerNorm if qk_layernorm else IdentityOp,
# for QKLayerNorm; we instead use the Apex implementation (or pytorch
# one if Apex is not installed).
q_layernorm=LNImpl if qk_layernorm else IdentityOp,
```
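For context, `LNImpl` in these layer specs is typically resolved at import time with an Apex-first, Torch-fallback pattern, roughly like the sketch below (the fallback import path and warning text are assumptions for illustration, not a verbatim copy of the file):

```python
# Hedged sketch: resolve LNImpl by preferring the Apex-backed fused norm and
# falling back to a pure-PyTorch implementation when Apex is not installed.
try:
    from megatron.core.fusions.fused_layer_norm import FusedLayerNorm

    LNImpl = FusedLayerNorm  # Apex-backed fused norm wrapper
except ImportError:
    import warnings

    # Assumed fallback module path, shown only for illustration.
    from megatron.core.transformer.torch_norm import WrappedTorchNorm

    warnings.warn("Apex is not installed. Falling back to Torch Norm.")
    LNImpl = WrappedTorchNorm  # pure-PyTorch fallback
```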
In my case, I made exactly the same patch in my own Megatron fork, so the changes in this PR look good to me. But I think we should clarify why this is happening: like you said, some people still use TENorm for qk-norm and their models converge.
@SeunghyunSEO thanks!
Regarding the clarification on why this is happening, do you mean that we should check why the TE implementation is diverging? (I didn't try it myself; I just assumed it does based on your PR and also based on the comment in this commit.)
I mean that when an additional feature is added, we should at least know whether it is necessary or not.
Do any Megatron or TE maintainers know why TENorm for qk-norm sometimes diverges? I'm cc'ing @deepakn94 because he is the only one I communicate with! (Sorry for the wrong tagging; I'd ask you to tag an expert in numerical precision issues.)
I see, that makes sense, I agree 👍 Thanks for tagging @deepakn94 🙏
Also, I think Mike Chrzanowski and Shanmugam Ramasamy could be tagged if NVIDIA folks know how to reach them (I couldn't find their GitHub handles), because they are the ones who created this commit which prevented the use of TENorm, and Mike Chrzanowski also wrote a paper using qk-layernorm 👍
This PR allows using qk-layernorm even when one has set normalization: "RMSNorm" (which threw an error before, cf. the message from @SeunghyunSEO here, from the original PR here: indeed, only LayerNorm was allowed when using qk-layernorm in GPT, not RMSNorm, since this commit).

What this PR does:
First, it adds a FusedRMSNorm in megatron/core/fusions/fused_layer_norm.py, which serves as a wrapper around Apex's FusedRMSNormAffineFunction (the same way the existing megatron/core/fusions/fused_layer_norm.py:FusedLayerNorm is a wrapper around Apex's FusedLayerNormAffineFunction). Note that while FusedLayerNorm can also use the persistent version of fused layer norm (Apex's contrib.layer_norm.FastLayerNorm), there is no such persistent version in Apex for RMSNorm, so we just use the non-persistent one.
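A minimal sketch of what such a wrapper can look like, assuming Apex exposes FusedRMSNormAffineFunction in apex.normalization.fused_layer_norm and that its apply() takes (input, weight, normalized_shape, eps); the class in this PR follows the existing FusedLayerNorm more closely:

```python
import numbers

import torch
from torch.nn import init
from torch.nn.parameter import Parameter

# Assumed Apex import path and autograd-function argument order; check them
# against the installed Apex version.
from apex.normalization.fused_layer_norm import FusedRMSNormAffineFunction


class FusedRMSNorm(torch.nn.Module):
    """Hedged sketch of an Apex-backed RMSNorm wrapper (scale, no bias)."""

    def __init__(self, hidden_size, eps=1e-5):
        super().__init__()
        if isinstance(hidden_size, numbers.Integral):
            hidden_size = (hidden_size,)
        self.normalized_shape = torch.Size(hidden_size)
        self.eps = eps
        self.weight = Parameter(torch.empty(*self.normalized_shape))
        self.reset_parameters()

    def reset_parameters(self):
        init.ones_(self.weight)

    def forward(self, input):
        return FusedRMSNormAffineFunction.apply(
            input, self.weight, self.normalized_shape, self.eps
        )
```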
Second, it adds an equivalent of TENorm but for Apex, which we call ApexNorm (TENorm is a wrapper that gets transformed into either te.pytorch.LayerNorm or te.pytorch.RMSNorm, depending on whether normalization: 'RMSNorm' or normalization: 'LayerNorm' is used in the config). For that we add an ApexFusedNorm which gets transformed into either megatron/core/fusions/fused_layer_norm.py:FusedLayerNorm or the fused_layer_norm.py:FusedRMSNorm just added above.
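For reference, TENorm is a thin class whose __new__ returns the appropriate TE module based on config.normalization; below is a hedged sketch of the Apex-side analogue (constructor arguments are simplified, and the FusedRMSNorm import only exists once this PR is applied):

```python
# Hedged sketch of an Apex-side analogue of TENorm: dispatch on
# config.normalization and return the matching fused wrapper.
# FusedRMSNorm is the class added by this PR.
from megatron.core.fusions.fused_layer_norm import FusedLayerNorm, FusedRMSNorm


class ApexNorm:
    def __new__(cls, config, hidden_size, eps=1e-5):
        if config.normalization == "LayerNorm":
            return FusedLayerNorm(config=config, hidden_size=hidden_size, eps=eps)
        elif config.normalization == "RMSNorm":
            return FusedRMSNorm(hidden_size=hidden_size, eps=eps)
        else:
            raise Exception("Only LayerNorm and RMSNorm are currently supported.")
```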
Advantage: this way, if we specify LayerNorm or RMSNorm for --normalization, we'll use that for qk-normalization too; it will first try the Apex implementation and fall back to the PyTorch one if Apex is not installed (we don't use the TE one as it seems to be unstable, as noted in the comments in the original code and as enforced by this commit).
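For reference, the pure-PyTorch fallback needs nothing more than the RMSNorm formula itself, y = w * x / sqrt(mean(x^2) + eps); a minimal sketch (not the actual fallback class used by Megatron) looks like this:

```python
import torch


class TorchRMSNorm(torch.nn.Module):
    """Minimal pure-PyTorch RMSNorm, shown only to illustrate the fallback."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in fp32 for numerical stability, then cast back.
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        x_normed = x.float() * torch.rsqrt(variance + self.eps)
        return (self.weight * x_normed).to(x.dtype)
```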
Note: for the implementation of FusedRMSNorm I just copy/pasted the code from FusedLayerNorm and changed it to compute an RMSNorm.

Tagging @SeunghyunSEO and @ftgreat, as you could be interested in this PR given that PR of yours. Tagging @jaredcasper and @jon-barker; the authors of this commit (Mike Chrzanowski and Shanmugam Ramasamy) could also be interested.