
Why QK-layer norm before ROPE ? #806

Open
dhia680 opened this issue Mar 4, 2025 · 1 comment
Labels
type/question An issue that's a question

Comments

dhia680 commented Mar 4, 2025

❓ The question

I'm just curious whether there were motivations behind applying RoPE rotations after the QK-layernorm and not the other way around.

Member

aman-17 commented Mar 7, 2025

The reason we apply RoPE rotations after QK-layernorm (not before) is mostly about keeping things numerically stable: the layernorm constrains the scale of the query and key representations before the rotational encoding is applied.

You probably noticed we also do QKV clipping after RoPE to clamp the values to a fixed range. This whole process, normalizing first and then rotating, helps keep everything stable both during training and when running the model.
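
For concreteness, here is a minimal PyTorch sketch of that ordering (layernorm on Q and K, then RoPE, then clamping). This is not the OLMo code: `apply_rope`, `QKNormThenRope`, and the `clip_value` of 8.0 are illustrative assumptions, and the real model applies these steps per attention head with its own norm and clipping settings.

```python
import torch
import torch.nn as nn


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotary embedding (rotate-half convention): rotate channel pairs by position-dependent angles."""
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin


class QKNormThenRope(nn.Module):
    """Illustrative ordering: QK-layernorm -> RoPE -> clipping."""

    def __init__(self, head_dim: int, clip_value: float = 8.0):  # clip_value is a made-up example
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)
        self.clip_value = clip_value

    def forward(self, q, k, cos, sin):
        # 1) Normalize queries and keys so their scale is constrained first.
        q, k = self.q_norm(q), self.k_norm(k)
        # 2) Rotate; RoPE is a pure rotation, so it preserves the scale the norm just set.
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # 3) Clamp to a fixed range after RoPE.
        q = q.clamp(-self.clip_value, self.clip_value)
        k = k.clamp(-self.clip_value, self.clip_value)
        return q, k


# Toy usage: one head of size 64 over a sequence of length 16.
head_dim, seq_len = 64, 16
q = torch.randn(2, seq_len, head_dim)
k = torch.randn(2, seq_len, head_dim)
pos = torch.arange(seq_len, dtype=torch.float32)[:, None]
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
angles = pos * inv_freq                                 # (seq_len, head_dim // 2)
cos = torch.cat((angles.cos(), angles.cos()), dim=-1)   # duplicated across both halves
sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
q_out, k_out = QKNormThenRope(head_dim)(q, k, cos, sin)
```

Since RoPE is a pure rotation, it roughly preserves the scale the layernorm has just established, so the values entering the clipping step are already well constrained, which matches the stability reasoning above.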
