A simple yet powerful Transformer LLM implementation built with PyTorch, designed for clarity, modularity, and extensibility.
- Train a BPE tokenizer from scratch – build your own vocabulary
- Load the Mistral tokenizer – use a proven, production-ready BPE tokenizer
- Dataset tokenization – efficient pre-processing for large-scale data
- Transformer training – train models from the ground up
- Text generation – generate text using trained checkpoints
- Supervised Fine-Tuning (SFT) – instruction / chat model fine-tuning
- Mixture-of-Experts – efficient scaling via MoE routing
- Multi-Head Latent Attention – DeepSeek-inspired attention mechanism
- Multi-GPU training – distributed training using PyTorch DDP
- Mixed-precision training – FP16/BF16 speed-ups with less memory

Current limitations:

- No HuggingFace model loading
- No RLHF pipeline
- BPE-only tokenization
- No safetensors support
- Many advanced features are still in progress

Install the required dependencies:

    pip install -r requirements.txt

Tip: all scripts include sensible defaults – run them without arguments to get started fast.

Choose one of two tokenizer paths.

Path 1 – Mistral tokenizer. Use the --load_mistral_tokenizer flag in the training and generation steps.

Setup: Download only the tokenizer.json from Mistral-Nemo-Base-2407 and place it in your project directory.

Skip tokenizer training entirely – ideal for production or rapid prototyping.
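
If you take this path, the standalone tokenizer.json can presumably be loaded with the HuggingFace tokenizers library. A minimal sketch of that idea (the repo's own loading code may differ):

```python
# Minimal sketch: load a standalone tokenizer.json with the `tokenizers` library.
# Assumes the file sits in the project root; the repo's loader may differ.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")   # Mistral-Nemo-Base-2407 tokenizer file
ids = tok.encode("Hello world").ids           # list of token IDs
print(tok.decode(ids))                        # round-trips back to text
```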
Path 2 – custom tokenizer. Train your own:

    python train_tokenizer.py

Tokenizer Training Options:

| Parameter |
|---|
| --dataset_path_huggingface |
| --dataset_sub_set |
| --tokenizer_file_name |
| --tokenizer_train_shard_size |
| --trust_remote_code |

Pre-trained Resources (Custom Tokenizer Only):

Note: these resources work only with the custom tokenizer, not with the Mistral one.
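
Under the hood, train_tokenizer.py presumably trains a byte-level BPE vocabulary over a streamed HuggingFace dataset (which is what --dataset_path_huggingface and --dataset_sub_set suggest). A minimal sketch of that approach using the tokenizers and datasets libraries; the dataset name, document count, vocabulary size, and special token are illustrative assumptions, not the script's actual defaults:

```python
# Sketch only: byte-level BPE training over a streamed HF dataset.
# Dataset name, document count, vocab size, and special tokens are placeholders.
from datasets import load_dataset
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

def text_iter(n_docs=100_000):
    for i, row in enumerate(ds):
        if i >= n_docs:
            break
        yield row["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32_000,
                              special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(text_iter(), trainer=trainer)
tokenizer.save("tokenizer.json")
```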
Tokenize the pre-training dataset:

    python tokenize_data.py
    # or
    python tokenize_data.py --load_mistral_tokenizer

Data Tokenization Options:

| Parameter |
|---|
| --dataset_path_huggingface |
| --dataset_sub_set |
| --tokenizer_file_name |
| --data_file_name |
| --encoded_dataset_shard_size |
| --load_mistral_tokenizer |

Pre-tokenized Dataset (Custom Only): FineWeb-Edu 10B subset.

Note: only compatible with the custom tokenizer.
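
The --encoded_dataset_shard_size flag suggests the script encodes documents and writes fixed-size shards of token IDs to disk. A rough sketch of that pattern; the shard size, file naming, dtype, and end-of-text token are assumptions:

```python
# Sketch: encode a streamed dataset into fixed-size token-ID shards on disk.
# Shard size, file names, uint16 dtype, and the <|endoftext|> token are assumptions.
import numpy as np
from datasets import load_dataset
from tokenizers import Tokenizer

SHARD_SIZE = 100_000_000                      # tokens per shard
tok = Tokenizer.from_file("tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")        # assumes this special token exists

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

buf, shard_idx = [], 0
for row in ds:
    buf.extend(tok.encode(row["text"]).ids + [eot])
    while len(buf) >= SHARD_SIZE:
        shard = np.array(buf[:SHARD_SIZE], dtype=np.uint16)   # fine for vocab < 65536
        np.save(f"data_shard_{shard_idx:04d}.npy", shard)
        buf, shard_idx = buf[SHARD_SIZE:], shard_idx + 1
```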
Train the model:

    python train.py --compile_model --use_autocast
    # or
    python train.py --compile_model --use_autocast --load_mistral_tokenizer

Training Options:

| Parameter |
|---|
| --steps |
| --eval_rate |
| --eval_steps |
| --save_rate |
| --warm_up |
| --total_batch_size |
| --batch_size |
| --seed |
| --promissed_flops |
| --lr |
| --min_lr |
| --weight_decay |
| --beta1 |
| --beta2 |
| --backend |
| --save_file_name |
| --data_file_name |
| --tokenizer_file_name |
| --dtype |
| --compile_model |
| --use_autocast |
| --load_mistral_tokenizer |

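
The flags above map onto a fairly standard pre-training loop: AdamW configured by --lr, --beta1, --beta2, and --weight_decay; linear warm-up for --warm_up steps followed by cosine decay to --min_lr; gradient accumulation so micro-batches of --batch_size add up to --total_batch_size; plus optional torch.compile and autocast. The condensed sketch below shows that shape with a toy stand-in model and random data; it is not the repo's actual train.py:

```python
# Condensed sketch of the loop implied by the flags above (toy model, random data).
import math
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
vocab, seq_len = 32_000, 256
lr, min_lr, warm_up, steps = 6e-4, 6e-5, 20, 100        # toy values, not defaults
micro_batch, accum_steps = 8, 4                         # --batch_size / accumulation

# Stand-in model: embedding + projection, NOT the repo's Transformer.
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 128),
                            torch.nn.Linear(128, vocab)).to(device)
model = torch.compile(model)                            # --compile_model
opt = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95),
                        weight_decay=0.1)               # --beta1 / --beta2 / --weight_decay

def lr_at(step):                                        # --warm_up, --lr, --min_lr
    if step < warm_up:
        return lr * (step + 1) / warm_up
    t = (step - warm_up) / max(1, steps - warm_up)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * t))

for step in range(steps):
    opt.zero_grad(set_to_none=True)
    for _ in range(accum_steps):                        # reach --total_batch_size
        x = torch.randint(vocab, (micro_batch, seq_len), device=device)
        y = torch.roll(x, -1, dims=1)                   # toy next-token targets
        with torch.autocast(device_type=device, dtype=torch.bfloat16):  # --use_autocast
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, vocab), y.view(-1))
        (loss / accum_steps).backward()
    for g in opt.param_groups:
        g["lr"] = lr_at(step)
    opt.step()
```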
Generate text from a trained checkpoint:

    python generate.py --input_text "Hello" --num_tokens_to_generate 20 --compile_model
    # or
    python generate.py --input_text "Hello" --num_tokens_to_generate 20 --load_mistral_tokenizer --compile_model

Generation Options:

| Parameter |
|---|
| --input_text |
| --num_tokens_to_generate |
| --temperature |
| --top_p |
| --save_file_name |
| --backend |
| --tokenizer_file_name |
| --compile_model |
| --load_mistral_tokenizer |

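
--temperature and --top_p refer to standard temperature scaling and nucleus (top-p) sampling. A self-contained sketch of that decoding rule (generate.py may differ in details such as batching or KV caching):

```python
# Sketch of temperature + top-p (nucleus) sampling for a single logit vector.
import torch

def sample_next_token(logits, temperature=1.0, top_p=0.9):
    # Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
    probs = torch.softmax(logits / max(temperature, 1e-8), dim=-1)
    # Nucleus: keep the smallest set of tokens whose cumulative mass reaches top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p      # the top token is always kept
    kept = sorted_probs * keep
    kept = kept / kept.sum()                      # renormalise over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice].item()

next_id = sample_next_token(torch.randn(32_000), temperature=0.8, top_p=0.95)
```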
Fine-tune your model to follow instructions or engage in conversation.
Perfect for:
- Chatbots
- Instruction models
- Domain-specific tuning
- Behavior alignment

Tokenize the SFT dataset:

    python tokenize_sft_data.py
    # or
    python tokenize_sft_data.py --load_mistral_tokenizer

SFT Tokenization Options:

| Parameter |
|---|
| --sft_dataset_path_huggingface |
| --sft_dataset_sub_set |
| --tokenizer_file_name |
| --data_file_name |
| --encoded_dataset_shard_size |
| --load_mistral_tokenizer |

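
SFT data is typically a prompt plus a response packed into a single sequence, with the loss computed only on the response tokens. A hedged sketch of that masking idea; the instruction template and the -100 ignore index are common conventions, not necessarily the repo's exact format:

```python
# Sketch: build one SFT example with the prompt masked out of the loss.
# The template below and the -100 ignore index are assumptions, not the repo's format.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

prompt = "### Instruction:\nExplain BPE in one sentence.\n\n### Response:\n"
response = "BPE merges frequent byte pairs into a fixed-size subword vocabulary."

prompt_ids = tok.encode(prompt).ids
response_ids = tok.encode(response).ids

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids   # -100 is skipped by cross-entropy
# Later: F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
```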
Run supervised fine-tuning:

    python train_sft.py --compile_model --use_autocast
    # or
    python train_sft.py --load_mistral_tokenizer --compile_model --use_autocast

SFT Training Options:

| Parameter |
|---|
| --steps |
| --eval_rate |
| --eval_steps |
| --save_rate |
| --warm_up |
| --total_batch_size |
| --batch_size |
| --promissed_flops |
| --lr |
| --min_lr |
| --weight_decay |
| --beta1 |
| --beta2 |
| --backend |
| --save_file_name |
| --data_file_name |
| --tokenizer_file_name |
| --dtype |
| --compile_model |
| --use_autocast |
| --load_mistral_tokenizer |

Generate with the fine-tuned checkpoint:

    python generate.py --input_text "Python is" --num_tokens_to_generate 100 --save_file_name lilgpt_inst --compile_model
    # or
    python generate.py --input_text "Python is" --num_tokens_to_generate 100 --save_file_name lilgpt_inst --load_mistral_tokenizer --compile_model

You must use the same tokenizer for:
- Pre-training
- SFT
- Generation

Mixing tokenizers will break compatibility.

| Use Case | Recommended Option |
|---|---|
| Most users | Custom tokenizer |
| Quick testing | Mistral or custom |
| Production | Mistral tokenizer |
| SFT / chat models | Mistral or custom (better special tokens) |
| Research / learning | Custom tokenizer |
| Non-English text | Custom tokenizer |
| Domain-specific content | Custom tokenizer |

Advanced features:

- DeepSeek Multi-Head Latent Attention
- Mixture-of-Experts (see the routing sketch below)
- PyTorch Distributed Data Parallel (DDP)
- Mixed precision (FP16/BF16)

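
The core of the Mixture-of-Experts block is a learned router that sends each token to its top-k experts and mixes their outputs with the router weights. A minimal top-2 routing sketch; expert count, layer sizes, and the lack of a load-balancing loss are simplifications, and this is not the repo's implementation:

```python
# Minimal top-2 MoE routing sketch (illustrative sizes, no load-balancing loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)        # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                       # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 256))                               # 10 tokens through the block
```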
Contributions are welcome!
- Bug reports
- Feature ideas
- Pull requests
- Documentation improvements

Licensed under the GNU Affero General Public License (AGPL).

Acknowledgements:
- Mistral AI
- HuggingFace
- DeepSeek
- PyTorch Team