
A custom implementation of a GPT model built from scratch. This project demonstrates the fundamental concepts behind large language models like ChatGPT by implementing each component step by step.

🌟 Features

  • BPE Tokenizer: Custom implementation of Byte Pair Encoding for text tokenization
  • Dataset Handling: Preprocessing and management of text datasets
  • Model Training: Train your own GPT model on custom data
  • Clean Architecture: Modular design for easy understanding and extension

πŸ“Š Project Structure

chatgpt_from_scratch/
├── assets/
│   ├── data/
│   │   └── tokenizer.json        # Trained tokenizer
│   └── dataset/
│       └── dataset.json          # Preprocessed dataset
├── fr/mrqsdf/gptlike/
│   ├── resource/
│   │   └── Pair.java             # Pair class
│   ├── utils/
│   │   ├── ColoredLogger.java    # Custom logger with color support
│   │   ├── Dataset.java          # Dataset class
│   │   ├── BPETokenizer.java     # BPE tokenizer implementation
│   │   └── DatasetLoader.java    # Dataset loading utilities
│   └── Main.java                 # Entry point: trains the tokenizer

πŸ” Components

Dataset

The dataset module handles loading and preprocessing text data. By default, it uses a French discussion dataset.
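As an illustrative sketch only (the class name, method, and the exact cleanup steps below are assumptions, not the repository's actual `Dataset`/`DatasetLoader` API), preprocessing a raw text sample might look like this:

```java
import java.util.List;
import java.util.stream.Collectors;

public class DatasetSketch {
    // Normalize one raw text sample: collapse runs of whitespace
    // (including newlines) into single spaces and trim the ends.
    static String preprocess(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical raw samples from a French discussion dataset.
        List<String> raw = List.of(
                "  Bonjour,\n comment  ça va ?  ",
                "Très  bien !");
        List<String> cleaned = raw.stream()
                .map(DatasetSketch::preprocess)
                .collect(Collectors.toList());
        cleaned.forEach(System.out::println);
    }
}
```

The repository's real preprocessing may apply additional steps before writing `dataset.json`.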

Tokenizer

A custom implementation of Byte Pair Encoding (BPE) tokenization, similar to what's used in models like GPT. The tokenizer:

  • Splits text into initial tokens
  • Iteratively merges the most frequent adjacent token pairs
  • Builds a vocabulary of subword units
  • Provides encoding and decoding functionality
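The core merge loop above can be sketched in a few lines of Java. This is a minimal, self-contained illustration of BPE training (class and method names are invented for this sketch and do not reflect the project's `BPETokenizer` API):

```java
import java.util.*;

public class BpeSketch {
    // Count frequencies of all adjacent token pairs.
    static Map<List<String>, Integer> pairCounts(List<String> tokens) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(List.of(tokens.get(i), tokens.get(i + 1)), 1, Integer::sum);
        }
        return counts;
    }

    // Replace every occurrence of the given pair with a single merged token.
    static List<String> merge(List<String> tokens, List<String> pair) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i + 1 < tokens.size()
                    && tokens.get(i).equals(pair.get(0))
                    && tokens.get(i + 1).equals(pair.get(1))) {
                out.add(pair.get(0) + pair.get(1));
                i++; // skip the second half of the merged pair
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Start from the characters of "aaabdaaabac" and apply three merges,
        // each time merging the most frequent adjacent pair.
        List<String> tokens = new ArrayList<>();
        for (char c : "aaabdaaabac".toCharArray()) tokens.add(String.valueOf(c));
        for (int step = 0; step < 3; step++) {
            Map<List<String>, Integer> counts = pairCounts(tokens);
            List<String> best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            tokens = merge(tokens, best);
        }
        System.out.println(tokens); // after 3 merges: [aaab, d, aaab, a, c]
    }
}
```

A real tokenizer also records each merge in order, so encoding replays the merges on new text and decoding concatenates the subword units back into a string.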

Model (Coming Soon)

The GPT model architecture is not yet implemented; only the tokenizer and dataset components are available so far.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘¨β€πŸ’» Author

Created by PixelCrafted

Translated to Java by MrQsdf