Skip to content

fix: add missing output projection in MultiHeadAttention and optimize training#13

Merged
FareedKhan-dev merged 1 commit into
FareedKhan-dev:mainfrom
Chamath-Adithya:improve-llm-core
Jun 4, 2026
Merged

fix: add missing output projection in MultiHeadAttention and optimize training#13
FareedKhan-dev merged 1 commit into
FareedKhan-dev:mainfrom
Chamath-Adithya:improve-llm-core

Conversation

@Chamath-Adithya
Copy link
Copy Markdown
Contributor

This PR introduces essential structural and stability improvements to the core Transformer architecture, making the tutorial mathematically accurate and robust for students learning to build an LLM from scratch.

Changes Included:

  • Mathematical Accuracy in Multi-Head Attention: Added the missing linear projection layer (self.proj) to the MultiHeadAttention module in src/models/attention.py. Previously, head outputs were only concatenated. According to the Attention is All You Need paper, a final linear projection is strictly required to mix the representations from different attention heads before adding the residual connection. This significantly boosts model capacity.
  • Training Loop Stability: Added torch.nn.utils.clip_grad_norm_ to the training loop in scripts/train_transformer.py. This is a standard optimization in PyTorch that prevents exploding gradients—a very common issue when training deep Transformers.
  • Documentation Alignment: Updated README.md to ensure the theoretical explanation of Multi-Head Attention matches the corrected code, accurately explaining the role of the projection layer to students.

Why this matters for educators and students:
By ensuring the core components align perfectly with the original paper, students won't learn "anti-patterns". The addition of gradient clipping will also prevent unexpected NaN losses, saving beginners hours of frustrating debugging.

FareedKhan-dev added a commit that referenced this pull request Jun 4, 2026
- update printed parameter count for the added per-block output projection
- add gradient clipping to the README inline training loop to match the script
- refresh MultiHeadAttention docstrings to mention the final projection

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FareedKhan-dev FareedKhan-dev merged commit b6d12a5 into FareedKhan-dev:main Jun 4, 2026
FareedKhan-dev added a commit that referenced this pull request Jun 4, 2026
…tics

PR #15's rewrite dropped the explanatory comments and the estimate_loss
docstring. This keeps all of its runtime diagnostics (device/VRAM report,
step timing, throughput, peak-memory report, checkpoint metadata) and the
gradient clipping from #13, while restoring the tutorial comments so the
core training script stays beginner-friendly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants