A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.
- BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
- Dataset Handling: Preprocessing and management of text datasets
- Model Training: Train your own GPT model on custom data
- Clean Architecture: Modular design for easy understanding and extension
chatgpt_from_scratch/
├── assets/
│   ├── data/
│   │   └── tokenizer.json      # Trained tokenizer
│   └── dataset/
│       └── dataset.json        # Preprocessed dataset
│
└── fr/mrqsdf/gptlike/
    ├── resource/
    │   └── Pair.java           # Pair class
    ├── utils/
    │   ├── ColoredLogger       # Custom logger with colored output
    │   ├── Dataset.java        # Dataset class
    │   ├── BPETokenizer        # BPE tokenizer implementation
    │   └── DatasetLoader       # Dataset loading utilities
    └── Main.java               # Main class for tokenizer training
The dataset module handles loading and preprocessing text data. By default, it uses a French discussion dataset.
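As a rough illustration of what such a module does, here is a minimal sketch of a dataset holder with a train/validation split. The class and method names are illustrative, not the project's actual `Dataset`/`DatasetLoader` API, and it assumes one text sample per line in a plain-text file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the project's Dataset and DatasetLoader
// classes may load and preprocess data differently.
public class DatasetSketch {
    private final List<String> samples;

    public DatasetSketch(List<String> samples) {
        this.samples = new ArrayList<>(samples);
    }

    // Assumes one text sample per line (hypothetical file format).
    public static DatasetSketch fromFile(Path path) throws IOException {
        return new DatasetSketch(Files.readAllLines(path));
    }

    // Split into train and validation portions for model training.
    public List<List<String>> split(double trainFraction) {
        int cut = (int) (samples.size() * trainFraction);
        return List.of(
            new ArrayList<>(samples.subList(0, cut)),
            new ArrayList<>(samples.subList(cut, samples.size()))
        );
    }

    public int size() { return samples.size(); }
}
```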
A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what's used in models like GPT. The tokenizer:
- Splits text into initial tokens
- Iteratively merges the most frequent adjacent token pairs
- Builds a vocabulary of subword units
- Provides encoding and decoding functionality
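The merge loop at the heart of BPE training can be sketched as follows. This is a simplified illustration, not the project's `BPETokenizer` class: it starts from character-level tokens and repeatedly merges the most frequent adjacent pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified BPE merge loop (illustrative; the real BPETokenizer
// also builds a vocabulary and supports encode/decode).
public class BpeSketch {

    // Count adjacent token pairs, keyed as "left\0right".
    static Map<String, Integer> countPairs(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + "\u0000" + tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Replace every occurrence of the pair (left, right) with one merged token.
    static List<String> mergePair(List<String> tokens, String left, String right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()
                    && tokens.get(i).equals(left)
                    && tokens.get(i + 1).equals(right)) {
                out.add(left + right);
                i += 2;
            } else {
                out.add(tokens.get(i++));
            }
        }
        return out;
    }

    // Split text into characters, then run numMerges BPE iterations.
    static List<String> train(String text, int numMerges) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) tokens.add(String.valueOf(c));
        for (int m = 0; m < numMerges; m++) {
            Map<String, Integer> counts = countPairs(tokens);
            if (counts.isEmpty()) break;
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            String[] parts = best.split("\u0000", 2);
            tokens = mergePair(tokens, parts[0], parts[1]);
        }
        return tokens;
    }
}
```

For example, `train("abababcd", 2)` first merges the most frequent pair `a`+`b` into `ab`, then `ab`+`ab` into `abab`, yielding the tokens `[abab, ab, c, d]`.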
The model implements the GPT architecture: a decoder-only Transformer trained for next-token prediction.
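The core operation of each Transformer block is causal scaled dot-product attention, softmax(QKᵀ/√d)·V, where each position may only attend to earlier positions. The following is a minimal single-head sketch of that formula, not the project's actual model code:

```java
// Illustrative single-head causal attention; a real GPT block adds
// learned projections, multiple heads, residuals, and layer norm.
public class AttentionSketch {

    // attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with a causal mask.
    static double[][] attend(double[][] q, double[][] k, double[][] v) {
        int n = q.length, d = q[0].length;
        double[][] out = new double[n][v[0].length];
        for (int i = 0; i < n; i++) {
            // Scores against positions 0..i only (causal: no looking ahead).
            double[] scores = new double[i + 1];
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j <= i; j++) {
                double s = 0;
                for (int t = 0; t < d; t++) s += q[i][t] * k[j][t];
                scores[j] = s / Math.sqrt(d);
                max = Math.max(max, scores[j]);
            }
            // Numerically stable softmax over the visible positions.
            double sum = 0;
            for (int j = 0; j <= i; j++) {
                scores[j] = Math.exp(scores[j] - max);
                sum += scores[j];
            }
            // Output is the attention-weighted sum of value vectors.
            for (int j = 0; j <= i; j++)
                for (int t = 0; t < v[0].length; t++)
                    out[i][t] += (scores[j] / sum) * v[j][t];
        }
        return out;
    }
}
```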
- Attention Is All You Need - The original Transformer paper
- Improving Language Understanding with Unsupervised Learning - OpenAI's GPT approach
- Original Creator - the original project this implementation is based on
This project is licensed under the MIT License - see the LICENSE file for details.
Created by PixelCrafted
Translated to Java by MrQsdf