
🚀 Transformer Large Language Model

A simple yet powerful Transformer LLM implementation built with PyTorch, designed for clarity, modularity, and extensibility.


✨ Features

What This Code Does

  • 🔀 Train a BPE tokenizer from scratch – Build your own vocabulary
  • 🔥 Load the Mistral tokenizer – Use a proven, production-ready BPE tokenizer
  • 📊 Dataset tokenization – Efficient pre-processing for large-scale data
  • 🧠 Transformer training – Train models from the ground up
  • 💬 Text generation – Generate text from trained checkpoints
  • 🎓 Supervised Fine-Tuning (SFT) – Instruction / chat model fine-tuning
  • 🔀 Mixture-of-Experts – Efficient scaling via MoE routing
  • 🎯 Multi-Head Latent Attention – DeepSeek-inspired attention mechanism
  • 🖥️ Multi-GPU training – Distributed training using PyTorch DDP
  • ⚡ Mixed-precision training – FP16/BF16 speed-ups with less memory

Current Limitations

  • ❌ No HuggingFace model loading
  • ❌ No RLHF pipeline
  • ❌ BPE-only tokenization
  • ❌ No safetensors support
  • ❌ Many advanced features still in progress

πŸ› οΈ Quick Start

Prerequisites

Install required dependencies:

pip install -r requirements.txt

Tip: All scripts include sensible defaults; run them without arguments to get started fast.


🔀 Tokenizer Setup

Choose one of the two paths:


Option A: Use Mistral's Pre-trained Tokenizer

Pass the --load_mistral_tokenizer flag to the tokenization, training, and generation scripts.

Setup: Download only the tokenizer.json from Mistral-Nemo-Base-2407 and place it in your project directory.

💡 Skip tokenizer training entirely. Ideal for production or rapid prototyping.
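
If you prefer to script the download, a minimal sketch using the huggingface_hub client is below. The mistralai/Mistral-Nemo-Base-2407 repo id (and any access token the Hub may require) are assumptions; the manual download described above works just as well.

from huggingface_hub import hf_hub_download

# Fetch only tokenizer.json and place it next to the training scripts.
# Assumes the tokenizer lives in the mistralai/Mistral-Nemo-Base-2407 Hub repo.
hf_hub_download(
    repo_id="mistralai/Mistral-Nemo-Base-2407",
    filename="tokenizer.json",
    local_dir=".",
)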


Option B: Train Your Own Tokenizer

python train_tokenizer.py

Tokenizer Training Options:

  • --dataset_path_huggingface
  • --dataset_sub_set
  • --tokenizer_file_name
  • --tokenizer_train_shard_size
  • --trust_remote_code
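
For orientation, byte-level BPE training with the Hugging Face tokenizers library looks roughly like the sketch below. This is a generic illustration rather than the exact contents of train_tokenizer.py; the vocabulary size and special token are placeholder assumptions.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE can represent any string, so no <unk> token is required.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                    # placeholder; match your model config
    special_tokens=["<|endoftext|>"],     # placeholder special token
)

def text_iterator():
    # Yield raw text, e.g. streamed shard by shard from a HuggingFace dataset.
    yield from ["example document one", "example document two"]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")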

Pre-trained Resources (Custom Tokenizer Only):

⚠️ These resources only work with the custom tokenizer, not Mistral.


📊 Training Pipeline

Pre-training Workflow


1️⃣ Tokenize Your Dataset

python tokenize_data.py
# or
python tokenize_data.py --load_mistral_tokenizer

Data Tokenization Options:

  • --dataset_path_huggingface
  • --dataset_sub_set
  • --tokenizer_file_name
  • --data_file_name
  • --encoded_dataset_shard_size
  • --load_mistral_tokenizer

Pre-tokenized Dataset (Custom Only): FineWeb-Edu 10B subset.

⚠️ Only compatible with the custom tokenizer.
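
Conceptually, this step streams text through the tokenizer and writes fixed-size shards of token ids to disk so training never has to hold the whole corpus in memory. The sketch below shows that pattern under stated assumptions (the dataset name, shard size, and uint16 ids are illustrative, not necessarily what tokenize_data.py does):

import numpy as np
from datasets import load_dataset
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
shard_size = 10_000_000                    # tokens per shard (assumption)
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)   # example dataset

buffer, shard_idx = [], 0
for example in dataset:
    buffer.extend(tokenizer.encode(example["text"]).ids)
    while len(buffer) >= shard_size:
        shard = np.array(buffer[:shard_size], dtype=np.uint16)  # assumes vocab < 65536
        shard.tofile(f"data_shard_{shard_idx:04d}.bin")
        buffer, shard_idx = buffer[shard_size:], shard_idx + 1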


2️⃣ Train the Model

python train.py --compile_model --use_autocast
# or
python train.py --compile_model --use_autocast --load_mistral_tokenizer

Training Options:

  • --steps
  • --eval_rate
  • --eval_steps
  • --save_rate
  • --warm_up
  • --total_batch_size
  • --batch_size
  • --seed
  • --promissed_flops
  • --lr
  • --min_lr
  • --weight_decay
  • --beta1
  • --beta2
  • --backend
  • --save_file_name
  • --data_file_name
  • --tokenizer_file_name
  • --dtype
  • --compile_model
  • --use_autocast
  • --load_mistral_tokenizer
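
The flags above map onto a fairly standard PyTorch loop: gradient accumulation bridges --batch_size and --total_batch_size, --warm_up/--lr/--min_lr describe a warmup-plus-cosine schedule, and --compile_model/--use_autocast toggle torch.compile and mixed precision. The sketch below shows that shape with a toy model and placeholder hyperparameters; it is not this repo's train.py.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in model so the sketch runs end to end; the real model is this repo's Transformer.
model = nn.Sequential(nn.Embedding(256, 64), nn.Linear(64, 256))
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# model = torch.compile(model)                           # --compile_model

steps, warm_up, lr, min_lr = 100, 10, 3e-4, 3e-5         # placeholder hyperparameters
batch_size, total_batch_size = 8, 32                     # micro-batch vs. global batch
grad_accum = total_batch_size // batch_size              # gradient-accumulation factor
optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_at(step):                                         # linear warmup + cosine decay
    if step < warm_up:
        return lr * (step + 1) / warm_up
    t = (step - warm_up) / max(1, steps - warm_up)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * t))

for step in range(steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum):                          # accumulate micro-batches
        x = torch.randint(0, 256, (batch_size, 128), device=device)   # fake token ids
        y = torch.randint(0, 256, (batch_size, 128), device=device)   # fake targets
        with torch.autocast(device_type=device, dtype=torch.bfloat16):  # --use_autocast
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, 256), y.view(-1)) / grad_accum
        loss.backward()
    optimizer.step()

For the multi-GPU DDP path, a launch along the lines of torchrun --standalone --nproc_per_node=<num_gpus> train.py --compile_model --use_autocast is the usual pattern, assuming the script initializes the process group itself.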

3️⃣ Generate Text

python generate.py --input_text "Hello" --num_tokens_to_generate 20 --compile_model
# or
python generate.py --input_text "Hello" --num_tokens_to_generate 20 --load_mistral_tokenizer --compile_model

Generation Options:

  • --input_text
  • --num_tokens_to_generate
  • --temperature
  • --top_p
  • --save_file_name
  • --backend
  • --tokenizer_file_name
  • --compile_model
  • --load_mistral_tokenizer
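
--temperature and --top_p control how the next token is drawn from the model's output distribution. The sketch below shows the standard temperature-scaled nucleus (top-p) sampling they usually imply; it is a generic illustration, not necessarily the exact logic inside generate.py.

import torch

def sample_next_token(logits, temperature=1.0, top_p=0.9):
    """logits: (vocab_size,) unnormalized scores for the next token."""
    probs = torch.softmax(logits / max(temperature, 1e-8), dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability covers top_p.
    cutoff = cumulative - sorted_probs > top_p
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice]

# Example: draw one token id from a toy 32k-entry distribution.
next_id = sample_next_token(torch.randn(32_000), temperature=0.8, top_p=0.95)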

🎓 Supervised Fine-Tuning (SFT)

Fine-tune your model to follow instructions or engage in conversation.

When to Use SFT

Perfect for:

  • 💬 Chatbots
  • 📝 Instruction models
  • 🎯 Domain-specific tuning
  • 🔄 Behavior alignment

SFT Workflow

1️⃣ Prepare the SFT Dataset

python tokenize_sft_data.py
# or
python tokenize_sft_data.py --load_mistral_tokenizer

SFT Tokenization Options:

  • --sft_dataset_path_huggingface
  • --sft_dataset_sub_set
  • --tokenizer_file_name
  • --data_file_name
  • --encoded_dataset_shard_size
  • --load_mistral_tokenizer
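
SFT data preparation typically renders each example into a chat/instruction template and masks the prompt portion of the labels, so the loss is computed only on the response tokens. The sketch below illustrates that convention with made-up template strings; the repo's actual format and special tokens may differ.

from tokenizers import Tokenizer

IGNORE_INDEX = -100                       # the default ignore_index of cross-entropy
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_sft_example(prompt, response):
    # Hypothetical template; real special tokens depend on the tokenizer in use.
    prompt_ids = tokenizer.encode(f"<|user|>{prompt}<|assistant|>").ids
    response_ids = tokenizer.encode(f"{response}<|endoftext|>").ids
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids   # loss only on the response
    return input_ids, labels

input_ids, labels = encode_sft_example("What is a tokenizer?",
                                       "A tokenizer maps text to integer ids.")

During fine-tuning, labels masked this way are skipped by torch.nn.functional.cross_entropy (whose ignore_index defaults to -100), which is why prompt tokens are masked rather than dropped.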

2️⃣ Fine-tune the Model

python train_sft.py --compile_model --use_autocast
# or
python train_sft.py --load_mistral_tokenizer --compile_model --use_autocast

SFT Training Options:

  • --steps
  • --eval_rate
  • --eval_steps
  • --save_rate
  • --warm_up
  • --total_batch_size
  • --batch_size
  • --promissed_flops
  • --lr
  • --min_lr
  • --weight_decay
  • --beta1
  • --beta2
  • --backend
  • --save_file_name
  • --data_file_name
  • --tokenizer_file_name
  • --dtype
  • --compile_model
  • --use_autocast
  • --load_mistral_tokenizer

3️⃣ Test Your Instruction Model

python generate.py --input_text "Python is" --num_tokens_to_generate 100 --save_file_name lilgpt_inst --compile_model
# or
python generate.py --input_text "Python is" --num_tokens_to_generate 100 --save_file_name lilgpt_inst --load_mistral_tokenizer --compile_model

⚠️ Important Notes

Tokenizer Consistency

You must use the same tokenizer for:

  • Pre-training
  • SFT
  • Generation

Mixing tokenizers will break compatibility.


Which Tokenizer Should You Use?

  • 🚀 Most users: Custom tokenizer
  • ⚡ Quick testing: Mistral OR custom
  • 🏭 Production: Mistral tokenizer
  • 🎓 SFT / Chat models: Mistral OR custom (better special tokens)
  • 🔬 Research / learning: Custom tokenizer
  • 🌍 Non-English text: Custom tokenizer
  • 📚 Domain-specific content: Custom tokenizer

πŸ—ΊοΈ Architecture Highlights

  • πŸ”₯ DeepSeek Multi-Head Latent Attention
  • βš–οΈ Mixture-of-Experts
  • ⚑ PyTorch Distributed Data Parallel
  • 🎯 Mixed Precision (FP16/BF16)
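
To make the Mixture-of-Experts bullet concrete, here is a minimal top-k routing layer in PyTorch. It is an illustrative toy (a dense loop over experts, no load-balancing loss, no latent attention), not this repo's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts feed-forward block with top-k token routing."""
    def __init__(self, dim, hidden, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts, bias=False)   # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, seq, dim)
        scores = self.gate(x)                      # (batch, seq, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(dim=64, hidden=256, num_experts=4, k=2)
y = moe(torch.randn(2, 16, 64))                    # same (batch, seq, dim) shape out

Each token is scored by the router, sent to its k highest-scoring expert FFNs, and the expert outputs are combined with the renormalized router weights.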

🤝 Contributing

Contributions are welcome!

  • πŸ› Bug reports
  • πŸ’‘ Feature ideas
  • πŸ”§ Pull requests
  • πŸ“– Documentation improvements

📄 License

GNU Affero General Public License (AGPL).


πŸ™ Acknowledgments

  • Mistral AI
  • HuggingFace
  • DeepSeek
  • PyTorch Team
