The reason we apply RoPE rotations after QK-layernorm (not before) is mostly numerical stability. The layernorm constrains the scale of the query and key representations before the rotational encoding is applied.
You may also have noticed that we apply QKV clipping after RoPE to clamp the values to a fixed range. This ordering, normalize first, then rotate, then clip, helps keep everything stable both during training and when running the model.
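To make the ordering concrete, here is a minimal PyTorch sketch of that pipeline. It is only an illustration, not the repository's actual code: the rotate-half RoPE formulation, the module name `QKNormThenRope`, and the `clip_value` argument are all assumptions made for the example.

```python
import torch
import torch.nn as nn

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # Per-pair frequencies, duplicated across the two halves of the head dim.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq, head_dim // 2)
    emb = torch.cat([freqs, freqs], dim=-1)   # (seq, head_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # x: (batch, seq, heads, head_dim); cos/sin broadcast over batch and heads.
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]
    return x * cos + rotate_half(x) * sin

class QKNormThenRope(nn.Module):
    """Illustrative ordering: QK-layernorm -> RoPE -> clipping."""
    def __init__(self, head_dim, clip_value=8.0):
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)
        self.clip_value = clip_value  # assumed clamp range for the example

    def forward(self, q, k, cos, sin):
        # 1) Normalize queries and keys first, constraining their scale.
        q, k = self.q_norm(q), self.k_norm(k)
        # 2) Rotate the already-normalized vectors with RoPE.
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # 3) Clamp to a fixed range as an extra numerical safeguard.
        q = q.clamp(-self.clip_value, self.clip_value)
        k = k.clamp(-self.clip_value, self.clip_value)
        return q, k

# Example usage with arbitrary shapes: (batch=2, seq=16, heads=8, head_dim=64).
q = k = torch.randn(2, 16, 8, 64)
cos, sin = build_rope_cache(seq_len=16, head_dim=64)
q, k = QKNormThenRope(head_dim=64)(q, k, cos, sin)
```

The key point the sketch tries to show is that the rotation only ever sees normalized inputs, so the clipping at the end acts as a final safeguard rather than the primary stabilizer.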
❓ The question
I'm just curious whether there were motivations behind applying RoPE rotations after the QK-layernorm rather than the other way around.