`experiments/exp4_18b_moe_training/CHECKPOINT_GUIDE.md` (new file, +185 lines)

# Checkpoint Loading and Resume Training Guide

This guide explains how to load checkpoints and resume training for the MoE model.

## Quick Start

### Resume Training from a Checkpoint

```bash
# Resume from the latest checkpoint
python experiments/exp4_18b_moe_training/run_experiment.py \
--gpu 4090 \
--resume experiments/exp4_18b_moe_training/checkpoints/checkpoint_latest.pt

# Resume from a specific step checkpoint
python experiments/exp4_18b_moe_training/run_experiment.py \
--gpu 4090 \
--resume experiments/exp4_18b_moe_training/checkpoints/checkpoint_step_5000.pt
```

### Load Model for Inference Only

```python
from experiments.exp4_18b_moe_training.trainer_18b import load_checkpoint
from experiments.exp4_18b_moe_training.models_18b import MoE18BLLM
import torch

# Load the checkpoint on CPU first, just to read the config
checkpoint = torch.load('checkpoints/checkpoint_latest.pt', map_location='cpu')
config = checkpoint['config']

# Create model
model = MoE18BLLM(config)

# Load weights (no optimizers needed for inference)
step, config, metrics = load_checkpoint(
    'checkpoints/checkpoint_latest.pt',
    model,
    device='cuda'
)

model.eval() # Set to evaluation mode
print(f"Loaded model from step {step}")
```

## Checkpoint Structure

Each checkpoint contains the following entries (a saving sketch follows the list):
- `step`: Training step number
- `model_state_dict`: Model weights
- `optimizer_states`: Optimizer states (for resuming)
- `scheduler_states`: Learning rate scheduler states (for resuming)
- `config`: Model configuration
- `metrics`: Training metrics at that step
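
For orientation, here is a minimal sketch of how such a checkpoint dictionary could be written. `save_moe_checkpoint` is a hypothetical helper and the directory default is illustrative; the actual saving code in `trainer_18b.py` may differ:

```python
import os
import torch

def save_moe_checkpoint(step, model, optimizers, schedulers, config, metrics,
                        ckpt_dir='experiments/exp4_18b_moe_training/checkpoints'):
    """Write a periodic checkpoint and refresh checkpoint_latest.pt."""
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_states': [opt.state_dict() for opt in optimizers],
        'scheduler_states': [sch.state_dict() for sch in schedulers],
        'config': config,
        'metrics': metrics,
    }
    torch.save(state, os.path.join(ckpt_dir, f'checkpoint_step_{step}.pt'))
    torch.save(state, os.path.join(ckpt_dir, 'checkpoint_latest.pt'))
```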

## Checkpoint Locations

Checkpoints are saved in: `experiments/exp4_18b_moe_training/checkpoints/`

Two types of checkpoints are saved:
1. **Periodic checkpoints**: `checkpoint_step_<STEP>.pt` (saved every 5000 steps)
2. **Latest checkpoint**: `checkpoint_latest.pt` (always updated with the most recent state)

## Resume Training Behavior

When you resume training:
1. ✅ Model weights are restored
2. ✅ Optimizer states are restored (momentum, adaptive learning rates, etc.)
3. ✅ Learning rate scheduler is restored
4. ✅ Training resumes from the saved step number
5. ✅ Token count continues from where it left off

This ensures a **seamless continuation** of training with no loss of optimizer state; a minimal sketch of the restore logic follows.
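
Under the hood, the restore step amounts to calling `load_state_dict` on each component. A minimal sketch, with an illustrative helper name (the actual `load_checkpoint` in `trainer_18b.py` may differ in details):

```python
import torch

def restore_training_state(path, model, optimizers=None, schedulers=None, device='cuda'):
    """Restore model weights and, when given, optimizer/scheduler state."""
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    if optimizers is not None:
        for opt, state in zip(optimizers, checkpoint['optimizer_states']):
            opt.load_state_dict(state)
    if schedulers is not None:
        for sch, state in zip(schedulers, checkpoint['scheduler_states']):
            sch.load_state_dict(state)
    # Training resumes from the saved step; the token count is derived from it
    return checkpoint['step'], checkpoint['config'], checkpoint['metrics']
```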

## Example Use Cases

### 1. Training Interrupted? Resume It!

```bash
# Your training crashed at step 8,234?
python experiments/exp4_18b_moe_training/run_experiment.py \
--gpu 4090 \
--resume checkpoints/checkpoint_latest.pt
```

### 2. Load Model to Evaluate on Custom Data

```python
import torch

from experiments.exp4_18b_moe_training.load_and_use_checkpoint import load_model_for_inference

model, config, step = load_model_for_inference('checkpoints/checkpoint_step_10000.pt')

# Now use the model for evaluation or inference
# (input_tokens: a LongTensor of token IDs on the model's device)
with torch.no_grad():
    logits, _ = model(input_tokens, return_aux_loss=False)
```

### 3. Continue Training with Different Settings

```python
# Load a checkpoint, modify the config, and continue training
from experiments.exp4_18b_moe_training.trainer_18b import train_18b_model
from experiments.exp4_18b_moe_training.config_4090 import MoE4090Config

# Build and modify the config (must match the checkpointed architecture)
config = MoE4090Config(vocab_size=49152)
config.muon_lr = 0.005  # Lower learning rate for fine-tuning

# Resume training with the new config
# (train_loader and val_loader are your existing DataLoaders)
model, metrics = train_18b_model(
    config,
    train_loader,
    val_loader,
    checkpoint_path='checkpoints/checkpoint_step_25000.pt'
)
```

## Helper Script

A helper script is provided: `load_and_use_checkpoint.py`

```bash
# Load model for inference
python experiments/exp4_18b_moe_training/load_and_use_checkpoint.py --mode inference

# Example of resuming training
python experiments/exp4_18b_moe_training/load_and_use_checkpoint.py --mode resume
```

## GPU Compatibility

Checkpoints saved with one GPU config can be loaded with another (see the sketch after this list):
- A model trained on 4090 can be loaded on B200 (and vice versa)
- Just make sure the **model architecture** matches (same config file)
- The `--gpu` flag only affects NEW training runs, not loading
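
For example, a checkpoint can be loaded onto whatever device is available by passing `map_location` to `torch.load`. A sketch using the model class shown earlier:

```python
import torch
from experiments.exp4_18b_moe_training.models_18b import MoE18BLLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# map_location keeps the load from assuming the GPU layout the checkpoint was written on
checkpoint = torch.load('checkpoints/checkpoint_latest.pt', map_location=device)

model = MoE18BLLM(checkpoint['config'])  # architecture must match the checkpoint
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
```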

## Troubleshooting

### "RuntimeError: Error(s) in loading state_dict"
- Make sure the config matches the checkpoint
- Check that you're using the same model architecture

### "CUDA out of memory" when loading
- The checkpoint was trained on a larger GPU
- Use a smaller config or reduce batch size

### Training starts from step 0 even with --resume
- Check that the checkpoint path is correct
- Verify the checkpoint file exists

## Best Practices

1. **Rely on `checkpoint_latest.pt`**: It is updated automatically and is the easiest file to resume from
2. **Keep periodic checkpoints**: They allow you to go back to earlier states if needed
3. **Test checkpoint loading**: After saving, verify you can load it before deleting old checkpoints
4. **Document your training runs**: Note which checkpoint corresponds to which experiment

## Advanced: Manual Checkpoint Loading

```python
import torch

# Load the checkpoint manually on CPU for inspection (no GPU memory needed)
checkpoint = torch.load('checkpoints/checkpoint_step_5000.pt', map_location='cpu')

print(f"Checkpoint from step: {checkpoint['step']}")
print(f"Training metrics: {checkpoint['metrics']}")
print(f"Model config: {checkpoint['config']}")

# Access model weights
model_weights = checkpoint['model_state_dict']

# Access optimizer state
optimizer_state = checkpoint['optimizer_states'][0] # Muon optimizer
```

## Summary

✅ **Resume training**: Use `--resume <checkpoint_path>`
✅ **Load for inference**: Use `load_checkpoint()` with no optimizer/scheduler
✅ **Checkpoints are saved automatically** every 5000 steps
✅ **Full training state is preserved**: weights, optimizer, scheduler, step number

For more examples, see `load_and_use_checkpoint.py`!

`experiments/exp4_18b_moe_training/EXPERIMENT_SUMMARY.md` (new file, +211 lines)

# Experiment 4: 18B MoE Training - Summary

## Objective
Train an 18-billion-parameter Mixture-of-Experts (MoE) language model optimized for the NVIDIA B200 GPU (192 GB VRAM), using gradient checkpointing and the Muon optimizer for memory efficiency.

## Key Innovations

### 1. **Gradient Checkpointing Implementation**
- ✅ Implemented at transformer block level
- ✅ Checkpoints both attention and MoE FFN layers
- ✅ Reduces activation memory by ~40-50%
- ✅ Uses `use_reentrant=False` for better stability (see the sketch below)
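
For reference, a minimal sketch of block-level checkpointing with `torch.utils.checkpoint`. The attribute names (`block.attention`, `block.moe_ffn`, `block.norm1/2`) are illustrative, not necessarily the exact ones in `models_18b.py`:

```python
from torch.utils.checkpoint import checkpoint

def checkpointed_block_forward(block, x):
    """Recompute the block's attention + MoE FFN activations during backward."""
    def run_block(hidden):
        hidden = hidden + block.attention(block.norm1(hidden))
        moe_out, aux_loss = block.moe_ffn(block.norm2(hidden))
        return hidden + moe_out, aux_loss

    # use_reentrant=False is the recommended, more robust checkpointing mode
    return checkpoint(run_block, x, use_reentrant=False)
```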

### 2. **Memory-Optimized Architecture**
- **Total Params**: 18B
- **Active Params**: 4.5B per token (25% due to sparse routing)
- **Memory Breakdown**:
  - Model: 36 GB
  - Gradients: 36 GB
  - Optimizer: 36 GB (Muon)
  - Activations: 40-50 GB (with checkpointing)
  - **Total: ~150-160 GB** (fits in 192 GB with headroom; see the arithmetic check below)
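
A back-of-the-envelope check of these figures, assuming 2 bytes per value (FP16) for the weights, gradients, and the single Muon momentum buffer, and ignoring FP32 master copies and other overheads:

```python
params = 18e9
bytes_per_value = 2  # FP16

model_gb = params * bytes_per_value / 1e9   # ~36 GB
grads_gb = params * bytes_per_value / 1e9   # ~36 GB
muon_gb = params * bytes_per_value / 1e9    # ~36 GB (single momentum buffer)
activations_gb = (40, 50)                   # estimated range with checkpointing

total_low = model_gb + grads_gb + muon_gb + activations_gb[0]   # ~148 GB
total_high = model_gb + grads_gb + muon_gb + activations_gb[1]  # ~158 GB
print(f"Estimated total: {total_low:.0f}-{total_high:.0f} GB of 192 GB")
```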

### 3. **Sparse MoE Design**
- 8 experts total, top-2 active per token
- 75% of expert parameters "inactive" but still in memory
- Load balancing auxiliary loss to prevent expert collapse
- Enables 4x the capacity of a dense model with the same active parameter count (a routing sketch follows)
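
A compact sketch of the routing idea (top-2 over 8 experts with a Switch-style load-balancing auxiliary loss). The balance weight shown is an arbitrary placeholder, and the real MoE layer in `models_18b.py` is more involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    def __init__(self, d_model=4096, n_experts=8, balance_weight=0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.balance_weight = balance_weight

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.gate(x)                   # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)

        # Load-balancing auxiliary loss: fraction of tokens routed to each
        # expert times the mean router probability for that expert
        routed = F.one_hot(top2_idx, self.n_experts).float().sum(dim=1)
        load = routed.mean(dim=0)
        importance = probs.mean(dim=0)
        aux_loss = self.balance_weight * self.n_experts * (load * importance).sum()

        return top2_idx, top2_probs, aux_loss
```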

## Technical Specifications

| Component | Configuration |
|-----------|---------------|
| Architecture | Transformer with MoE FFN |
| Hidden Size | 4096 |
| Layers | 40 |
| Attention Heads | 32 |
| FFN Size (per expert) | 11,008 |
| Experts | 8 total, 2 active |
| Sequence Length | 4096 tokens |
| Vocab Size | ~50,000 (dataset dependent) |
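
Expressed as a configuration object, the table corresponds roughly to the sketch below; the field names are illustrative and may not match `config_18b.py` exactly:

```python
from dataclasses import dataclass

@dataclass
class MoE18BConfig:
    d_model: int = 4096       # hidden size
    n_layers: int = 40
    n_heads: int = 32
    d_ff: int = 11008         # FFN size per expert
    n_experts: int = 8
    top_k: int = 2            # experts active per token
    max_seq_len: int = 4096
    vocab_size: int = 50000   # dataset dependent
```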

## Training Setup

| Parameter | Value |
|-----------|-------|
| Optimizer | Muon (weights) + AdamW (embeddings) |
| Learning Rate | 0.01 (Muon), 0.001 (AdamW) |
| Batch Size | 4 |
| Gradient Accumulation | 8 steps |
| Effective Batch Size | 32 (131k tokens/step) |
| Mixed Precision | FP16 AMP |
| Gradient Clipping | 1.0 |
| Max Steps | 50,000 |
| Total Tokens | ~6.5 billion |
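
The two-optimizer split assigns parameters by role. A sketch under the assumption that a `Muon` implementation with a `torch.optim`-style constructor is passed in; the real setup lives in `trainer_18b.py`:

```python
import torch

def build_optimizers(model, muon_cls, muon_lr=0.01, adamw_lr=0.001):
    """muon_cls: any Muon implementation with a torch.optim-style constructor."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        # Heuristic split (an assumption): Muon handles 2D weight matrices,
        # AdamW handles embeddings and 1D parameters such as norms and biases
        if p.ndim >= 2 and 'embed' not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [muon_cls(muon_params, lr=muon_lr),
            torch.optim.AdamW(adamw_params, lr=adamw_lr)]
```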

## Memory Optimizations Applied

1. **Gradient Checkpointing** ✅
   - Saves 40-50% activation memory
   - ~10-15% training slowdown (acceptable tradeoff)

2. **Muon Optimizer** ✅
   - 30% less memory than AdamW
   - Stores only a momentum buffer (no separate first/second moment estimates as in AdamW)

3. **Mixed Precision (FP16)** ✅
   - 50% reduction in activation memory
   - Maintains FP32 master weights

4. **Sparse MoE** ✅
   - 25% parameter utilization per forward pass
   - 4x capacity increase vs dense model

## Expected Performance

### Memory Usage
- **Estimated**: 150-160 GB
- **Available**: 192 GB
- **Headroom**: ~30-40 GB (20%)

### Throughput
- **Tokens/sec**: 5,000-10,000 (hardware dependent)
- **Training Time**: roughly 180-360 hours for 50k steps at that throughput (see the arithmetic sketch below)
- **Tokens Processed**: ~6.5 billion
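
These estimates follow directly from the batch geometry; a quick arithmetic sketch:

```python
tokens_per_step = 4 * 8 * 4096           # batch x grad accumulation x sequence length = 131,072
total_tokens = 50_000 * tokens_per_step  # ~6.55 billion tokens over the full run

for tok_per_sec in (5_000, 10_000):
    hours = total_tokens / tok_per_sec / 3600
    print(f"{tok_per_sec} tok/s -> {hours:,.0f} hours for 50k steps")
# 5,000 tok/s  -> ~364 hours
# 10,000 tok/s -> ~182 hours
```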

### Model Quality
- **Baseline**: Should match a 4-5B-parameter dense model
- **Potential**: Better due to expert specialization
- **Target Perplexity**: < 50 (dataset dependent)

## Files Created

```
experiments/exp4_18b_moe_training/
├── __init__.py            # Module exports
├── config_18b.py          # 18B model configuration
├── models_18b.py          # Model with gradient checkpointing
├── trainer_18b.py         # Training loop with memory monitoring
├── run_experiment.py      # Main entry point
├── test_setup.py          # Setup verification script
├── README.md              # Comprehensive documentation
├── EXPERIMENT_SUMMARY.md  # This file
└── checkpoints/           # Will be created during training
    ├── checkpoint_step_*.pt
    ├── checkpoint_latest.pt
    ├── training_results.json
    └── training_curves.png
```

## Git Branch
- **Branch**: `exp_18b_moe_training`
- **Created from**: `main`
- **Purpose**: Isolate 18B training experiment

## How to Run

### 1. Verify Setup
```bash
cd experiments/exp4_18b_moe_training
python test_setup.py
```

### 2. Start Training
```bash
python run_experiment.py
```

### 3. Monitor Progress
- Watch console for real-time metrics
- Check `checkpoints/training_results.json` for detailed metrics
- View `checkpoints/training_curves.png` for visualizations

## Scaling Guide

### For Different Hardware:

**80GB GPU (A100/H100):**
```python
d_model = 2048 # ~5B params
n_layers = 32
batch_size = 8
```

**40GB GPU (A100):**
```python
d_model = 1536 # ~2B params
n_layers = 24
batch_size = 8
```

**Larger Models (H200/B200+):**
```python
d_model = 5120 # ~25B params
n_layers = 48
batch_size = 2
```

## Monitoring Commands

```bash
# Watch GPU usage
watch -n 1 nvidia-smi

# Monitor training
tail -f checkpoints/training_results.json

# Check peak memory (run from inside the training process, e.g. at the end of a run)
python -c "import torch; print(f'{torch.cuda.max_memory_allocated()/1e9:.1f} GB')"
```

## Success Criteria

✅ Model fits in 192GB VRAM
✅ Training completes without OOM errors
✅ Gradient checkpointing reduces memory by ~40%
✅ Validation perplexity decreases over training
✅ Expert load balancing maintains reasonable distribution
✅ Checkpoints save successfully every 5000 steps

## Next Steps

After successful training:
1. Analyze expert specialization patterns
2. Compare to dense model baseline
3. Experiment with different routing strategies (top-1, top-3)
4. Test different load balancing weights
5. Scale to larger models (25B+) if memory allows

## Research Questions to Explore

1. **Do experts specialize?** Analyze which tokens route to which experts
2. **What's the optimal top-k?** Compare top-1, top-2, top-3 routing
3. **Load balancing tradeoff?** Test different balancing weights
4. **Scaling efficiency?** Compare to dense model at same active param count
5. **Gradient checkpointing impact?** Measure speed vs memory tradeoff

## References

- Mixture of Experts: [Shazeer et al., 2017](https://arxiv.org/abs/1701.06538)
- Switch Transformers: [Fedus et al., 2021](https://arxiv.org/abs/2101.03961)
- Gradient Checkpointing: [Chen et al., 2016](https://arxiv.org/abs/1604.06174)
- Muon Optimizer: Custom memory-efficient optimizer

---

**Created**: October 9, 2025
**Branch**: `exp_18b_moe_training`
**Hardware**: Optimized for NVIDIA B200 (192GB)
**Status**: Ready for training ✅
