The reason we apply RoPE rotations after QK-layernorm (not before) is mostly numerical stability. The layernorm constrains the scale of the query and key representations before the rotational encoding is applied.
You may also have noticed that we apply QKV clipping after RoPE to clamp the values to a fixed range. This ordering, normalize first, then rotate, then clip, helps keep everything stable both during training and when running the model.
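To make the ordering concrete, here is a minimal PyTorch sketch of that pipeline. It is only an illustration, not the repository's actual code: the rotate-half RoPE formulation, the module name `QKNormThenRope`, and the `clip_value` argument are all assumptions made for the example.

```python
import torch
import torch.nn as nn

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # Per-pair frequencies, duplicated across the two halves of the head dim.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq, head_dim // 2)
    emb = torch.cat([freqs, freqs], dim=-1)   # (seq, head_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # x: (batch, seq, heads, head_dim); cos/sin broadcast over batch and heads.
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]
    return x * cos + rotate_half(x) * sin

class QKNormThenRope(nn.Module):
    """Illustrative ordering: QK-layernorm -> RoPE -> clipping."""
    def __init__(self, head_dim, clip_value=8.0):
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)
        self.clip_value = clip_value  # assumed clamp range for the example

    def forward(self, q, k, cos, sin):
        # 1) Normalize queries and keys first, constraining their scale.
        q, k = self.q_norm(q), self.k_norm(k)
        # 2) Rotate the already-normalized vectors with RoPE.
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # 3) Clamp to a fixed range as an extra numerical safeguard.
        q = q.clamp(-self.clip_value, self.clip_value)
        k = k.clamp(-self.clip_value, self.clip_value)
        return q, k

# Example usage with arbitrary shapes: (batch=2, seq=16, heads=8, head_dim=64).
q = k = torch.randn(2, 16, 8, 64)
cos, sin = build_rope_cache(seq_len=16, head_dim=64)
q, k = QKNormThenRope(head_dim=64)(q, k, cos, sin)
```

The key point the sketch tries to show is that the rotation only ever sees normalized inputs, so the clipping at the end acts as a final safeguard rather than the primary stabilizer.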
❓ The question
I'm just curious whether there were motivations behind applying RoPE rotations after the QK-layernorm rather than the other way around.