fix: add missing output projection in MultiHeadAttention and optimize training #13
Merged
FareedKhan-dev merged 1 commit on Jun 4, 2026
FareedKhan-dev added a commit that referenced this pull request on Jun 4, 2026:
- update printed parameter count for the added per-block output projection
- add gradient clipping to the README inline training loop to match the script
- refresh MultiHeadAttention docstrings to mention the final projection

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FareedKhan-dev added a commit that referenced this pull request on Jun 4, 2026:
…tics
PR #15's rewrite dropped the explanatory comments and the estimate_loss docstring. This keeps all of its runtime diagnostics (device/VRAM report, step timing, throughput, peak-memory report, checkpoint metadata) and the gradient clipping from #13, while restoring the tutorial comments so the core training script stays beginner-friendly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This PR makes two fixes to the core Transformer code, one structural and one for training stability, so that the tutorial matches the original architecture and trains more reliably for students learning to build an LLM from scratch.
Changes Included:
- Added the missing output projection (`self.proj`) to the `MultiHeadAttention` module in `src/models/attention.py`. Previously, head outputs were only concatenated. In the "Attention Is All You Need" paper, the concatenated heads pass through a final linear projection, which mixes the representations from different attention heads before the residual connection is added. Without it, the module does not match the paper's architecture and lacks the parameters that combine information across heads.
- Added `torch.nn.utils.clip_grad_norm_` to the training loop in `scripts/train_transformer.py`. Gradient-norm clipping is a standard technique for guarding against exploding gradients, a common issue when training deep Transformers.
- Updated `README.md` so that the theoretical explanation of Multi-Head Attention matches the corrected code and accurately explains the role of the projection layer to students.

Why this matters for educators and students:
With the core components aligned with the original paper, students won't learn "anti-patterns". Gradient clipping also reduces the risk of unexpected `NaN` losses, which can save beginners hours of frustrating debugging.
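
For readers who want to see the shape of the two code changes described above, here is a minimal sketch. It is not the repository's actual code. The fused `qkv` layer, the causal mask, the `train_step` helper, the MSE loss, and `max_norm=1.0` are all assumptions made for illustration. Only `self.proj` and `torch.nn.utils.clip_grad_norm_` come from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Illustrative causal multi-head self-attention with an output projection."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        # Assumption: one fused layer producing queries, keys and values.
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        # The final projection that mixes information across heads.
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape (B, T, C) -> (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention with a causal (lower-triangular) mask.
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        att = F.softmax(att, dim=-1)

        out = att @ v  # (B, n_head, T, head_dim)
        # Concatenate heads back into (B, T, C).
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        # Without this line, the heads would only be concatenated.
        return self.proj(out)


def train_step(model, optimizer, x, target, max_norm: float = 1.0):
    """One optimization step with gradient-norm clipping (hypothetical helper)."""
    optimizer.zero_grad(set_to_none=True)
    loss = F.mse_loss(model(x), target)  # placeholder loss for the sketch
    loss.backward()
    # Rescale gradients in place so their global L2 norm is at most max_norm.
    # The return value is the total norm measured before clipping.
    norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), norm_before.item()
```

The clipping call sits between `loss.backward()` and `optimizer.step()`, so the optimizer only ever sees the rescaled gradients. The value returned by `clip_grad_norm_` can be logged to spot steps where the gradient norm spikes.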