fix: add missing output projection in MultiHeadAttention and optimize training by Chamath-Adithya · Pull Request #13 · FareedKhan-dev/train-llm-from-scratch

Chamath-Adithya · 2026-05-29T20:23:32Z

This PR introduces essential structural and stability improvements to the core Transformer architecture, making the tutorial mathematically accurate and robust for students learning to build an LLM from scratch.

Changes Included:

Mathematical Accuracy in Multi-Head Attention: Added the missing linear projection layer (self.proj) to the MultiHeadAttention module in src/models/attention.py. Previously, head outputs were only concatenated. According to the Attention is All You Need paper, a final linear projection is strictly required to mix the representations from different attention heads before adding the residual connection. This significantly boosts model capacity.
Training Loop Stability: Added torch.nn.utils.clip_grad_norm_ to the training loop in scripts/train_transformer.py. This is a standard optimization in PyTorch that prevents exploding gradients—a very common issue when training deep Transformers.
Documentation Alignment: Updated README.md to ensure the theoretical explanation of Multi-Head Attention matches the corrected code, accurately explaining the role of the projection layer to students.

Why this matters for educators and students:
By ensuring the core components align perfectly with the original paper, students won't learn "anti-patterns". The addition of gradient clipping will also prevent unexpected NaN losses, saving beginners hours of frustrating debugging.

… training

- update printed parameter count for the added per-block output projection - add gradient clipping to the README inline training loop to match the script - refresh MultiHeadAttention docstrings to mention the final projection Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tics PR #15's rewrite dropped the explanatory comments and the estimate_loss docstring. This keeps all of its runtime diagnostics (device/VRAM report, step timing, throughput, peak-memory report, checkpoint metadata) and the gradient clipping from #13, while restoring the tutorial comments so the core training script stays beginner-friendly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix: add missing output projection in MultiHeadAttention and optimize…

dfb9a11

… training

FareedKhan-dev merged commit b6d12a5 into FareedKhan-dev:main Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add missing output projection in MultiHeadAttention and optimize training#13

fix: add missing output projection in MultiHeadAttention and optimize training#13
FareedKhan-dev merged 1 commit into
FareedKhan-dev:mainfrom
Chamath-Adithya:improve-llm-core

Chamath-Adithya commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Chamath-Adithya commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants