A comprehensive, step-by-step journey into building Large Language Models from the ground up
Author: Solomon Eshun
This repository contains the complete source code, explanations, and visualizations for the "Building LLMs from Scratch" series. Whether you're a beginner curious about how ChatGPT works or an experienced developer wanting to understand transformer architecture deeply, this series will guide you through every component step by step.
This educational series breaks down the complexity of Large Language Models into digestible, hands-on tutorials. Each part builds upon the previous one, gradually constructing a complete transformer-based language model from scratch using PyTorch.
🎯 Learning Objectives:
- Understand the fundamental architecture of transformer models
- Implement each component (tokenization, embeddings, attention, etc.) from scratch
- Gain practical experience with PyTorch and deep learning concepts
- Learn best practices for training and evaluating language models
- Explore modern techniques used in state-of-the-art LLMs
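To give a flavor of the hands-on objectives above, here is a minimal sketch of one such component: token embeddings combined with learned positional embeddings in PyTorch. The dimensions and variable names are illustrative assumptions for this README, not the values the series settles on.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the series derives its own configuration step by step.
vocab_size, context_len, emb_dim = 50_257, 8, 256

token_emb = nn.Embedding(vocab_size, emb_dim)   # token ID -> dense vector
pos_emb = nn.Embedding(context_len, emb_dim)    # one learned vector per position

token_ids = torch.randint(0, vocab_size, (4, context_len))     # dummy batch of 4 sequences
x = token_emb(token_ids) + pos_emb(torch.arange(context_len))  # broadcast add -> (4, 8, 256)
print(x.shape)  # torch.Size([4, 8, 256])
```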
👥 Target Audience:
- Students and researchers in AI/ML
- Software engineers interested in NLP
- Anyone curious about how LLMs actually work
- Developers wanting to build custom language models
| Part | Topic | Status | Article | Code |
|---|---|---|---|---|
| 01 | The Complete Theoretical Foundation | ✅ Complete | Medium | N/A |
| 02 | Tokenization | ✅ Complete | Medium | Code |
| 03 | Data Pipeline (Input-Target Pairs) | ✅ Complete | Medium | Code |
| 04 | Token Embeddings & Positional Encoding | ✅ Complete | Medium | Code |
| 05 | Complete Data Preprocessing Pipeline | ✅ Complete | Medium | Code |
| 06 | The Attention Mechanism | ✅ Complete | Medium | Code |
| 07 | Self-Attention with Trainable Weights | ✅ Complete | Medium | Code |
| 08 | Causal Attention | ✅ Complete | Medium | Code |
| 09 | Multi-Head Attention | 🔄 In Progress | Medium | Code |
| 10 | Transformer Blocks & Architecture | ⏳ Planned | Medium | Code |
| 11 | Training Loop & Optimization | ⏳ Planned | Medium | Code |
| 12 | Model Evaluation & Fine-tuning | ⏳ Planned | Medium | Code |
Legend: ✅ Complete | 🔄 In Progress | ⏳ Planned
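To preview where Parts 06 through 09 are headed, below is a minimal sketch of single-head causal self-attention in PyTorch. It is illustrative only: the class name, dimensions, and overall structure are assumptions made for this README, not the series' final implementation (which builds up to multi-head attention).

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative sketch only)."""

    def __init__(self, d_in, d_out, context_len):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        # Upper-triangular mask hides "future" positions from each token.
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_len, context_len), diagonal=1).bool()
        )

    def forward(self, x):                       # x: (batch, seq_len, d_in)
        seq_len = x.shape[1]
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5   # scaled dot-product scores
        scores = scores.masked_fill(self.mask[:seq_len, :seq_len], float("-inf"))
        weights = torch.softmax(scores, dim=-1)               # attention weights
        return weights @ v                      # (batch, seq_len, d_out)

x = torch.randn(2, 6, 16)                       # dummy batch: 2 sequences of 6 tokens
print(CausalSelfAttention(16, 16, context_len=8)(x).shape)  # torch.Size([2, 6, 16])
```

The upper-triangular mask is what makes the attention causal: each token can only attend to itself and to earlier positions in the sequence.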
- Python 3.8 or higher
- Basic understanding of Python and neural networks
- Familiarity with PyTorch (helpful but not required)
1. Clone the repository:

   ```bash
   git clone https://github.com/soloeinsteinmit/llm-from-scratch.git
   cd llm-from-scratch
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the first code example (from Part 2):

   ```bash
   python src/part02_tokenization.py
   ```
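The command above runs the Part 02 script. As a quick preview of what that part covers, here is a toy word-level tokenizer; the regular expression, class name, and vocabulary handling are illustrative assumptions, not the repository's actual code.

```python
import re

class SimpleTokenizer:
    """Toy word-level tokenizer for illustration; not the series' implementation."""

    def __init__(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)              # words and punctuation
        vocab = sorted(set(tokens))
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        return [self.str_to_id[tok] for tok in re.findall(r"\w+|[^\w\s]", text)]

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)

tok = SimpleTokenizer("Hello, world. Hello again!")
print(tok.encode("Hello, world."))                 # [3, 1, 5, 2] given this tiny vocabulary
print(tok.decode(tok.encode("Hello, world.")))     # Hello , world .
```

Production LLMs typically use subword tokenizers (for example, byte-pair encoding) rather than whole words, but the encode/decode round trip shown here is the same basic idea.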
```
llm-from-scratch/
├── README.md                                # You are here!
├── requirements.txt                         # Python dependencies
├── LICENSE                                  # MIT License
│
├── notebooks/                               # Jupyter notebooks for interactive learning
│   ├── part02_tokenization.ipynb
│   └── ...
│
├── animations/                              # Manim visualizations and diagrams
│   └── part-02-WordTokenizationScene.mp4    # Generated animation files
│
└── src/                                     # Source code for each part
    ├── part02_tokenization.py
    └── utils/                               # Helper functions and utilities
```
- Start with Part 01 on Medium for the theoretical foundation.
- Follow Part 02 and subsequent parts for hands-on coding.
- Run the code to see practical implementation.
- Experiment with the parameters and try modifications.
- Check the notebooks for interactive exploration.
For educators:
- Use the code examples in your courses
- Reference the visualizations for explanations
- Adapt the materials for your curriculum
- Contribute improvements and additional examples
For developers and researchers:
- Use as a foundation for your own model implementations
- Reference the clean, well-documented code structure
- Build upon the base architecture for your experiments
This series includes custom Manim animations that visualize complex concepts:
- 🔄 Attention mechanisms - See how tokens "attend" to each other
- 📊 Data flow - Understand how information moves through the model
- 🧮 Matrix operations - Visualize the math behind transformers
- 📈 Training dynamics - Watch the model learn in real-time
Animations are generated with Manim and are available in the `animations/` directory.
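If you want to render or extend the visuals yourself, the sketch below shows the general shape of a Manim scene and the render command, assuming Manim Community Edition is installed. The scene class and text are placeholders, not one of the repository's actual animations.

```python
from manim import Scene, Text, Write  # Manim Community Edition

class TokenizationTeaser(Scene):
    """Placeholder scene for illustration; not one of the repository's real scenes."""

    def construct(self):
        title = Text("Tokenization: text -> token IDs")
        self.play(Write(title))  # animate the text being drawn on screen
        self.wait(1)

# Render a low-quality preview from the command line:
#   manim -pql this_file.py TokenizationTeaser
```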
We welcome contributions from the community! This is an open-source educational project aimed at making LLM understanding accessible to everyone.
Ways to contribute:
- 🐛 Report bugs or suggest improvements
- 📝 Improve documentation and explanations
- 🎨 Create additional visualizations
- 🔧 Add new features or optimizations
- 🌍 Translate content to other languages
- The Illustrated Transformer by Jay Alammar
- Attention Is All You Need - Original Transformer Paper
- GPT-3 Paper - Language Models are Few-Shot Learners
This project is licensed under the MIT License - see the LICENSE file for details.
- Community: Thanks to all contributors and learners who make this project better
- Inspiration: Built upon the excellent work of researchers and educators in the field
- Tools: Created with PyTorch, Manim, and lots of coffee ☕
- 📝 Medium: Follow the series on Medium
- 💼 LinkedIn: Connect and discuss on LinkedIn
- 🐙 GitHub: Star this repo and follow for updates
⭐ If you find this helpful, please give it a star! It helps others discover this resource.
Built with ❤️ for the open-source community