Fast, efficient benchmarking of 17 small language models (4M-410M parameters) for sentiment analysis using LoRA fine-tuning.
```bash
# Install dependencies
uv sync

# Train models (1 hour, 5 models)
uv run moodbench train-all --dataset amazon --device=mps \
  --models BERT-tiny BERT-mini BERT-small ELECTRA-small MiniLM-L12

# Evaluate models
uv run moodbench benchmark --dataset amazon

# View results
uv run moodbench report --results-dir experiments/results
```

```bash
# Install additional dependencies
uv add gradio

# Launch web UI
python gradio_app.py

# Open http://localhost:7860 in your browser
```

The web interface provides modular tabs for training, benchmarking, analysis, NPS estimation, and methodology documentation.
- Quick Reference - Commands, model lists, and common workflows (start here!)
- Model Configuration Guide - Complete guide to all 18 models and configurations
- Gradio Web UI - Interactive web interface with modular tabs for training, benchmarking, analysis, NPS estimation, and methodology
- CLAUDE.md - Architecture, technical implementation, and development guide
- Documentation Index - Navigate all documentation by role and use case
MoodBench is an automated benchmarking framework that fine-tunes, evaluates, and compares small language models for sentiment analysis. It uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA to enable efficient training on consumer hardware.
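As a rough sketch of the underlying approach (using the Hugging Face `peft` API directly rather than MoodBench's CLI), LoRA wraps a pretrained classifier so that only small low-rank adapter matrices are trained; the hyperparameters below mirror the BERT-tiny entry shown later in `config/models.yaml`, but the code itself is illustrative, not MoodBench internals:

```python
# Minimal LoRA fine-tuning setup with PEFT; illustrative sketch, not MoodBench code.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2
)
lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,                                # adapter rank
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()      # only a small fraction of the base weights
```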
Key Features:
- 17 optimized models from 4M to 410M parameters
- Fast benchmarking - Ultra-tiny models train in 5-15 minutes
- Memory efficient - All models <6GB on Apple Silicon M4
- Comprehensive metrics - Accuracy, F1, balanced accuracy, latency percentiles, throughput, memory, statistical significance, robustness
- Production ready - CI/CD-friendly, reproducible benchmarks
- Web interface - Interactive Gradio UI for all operations
Supported models:

```
BERT-tiny   BERT-mini   ELECTRA-small   BERT-small   MiniLM-L12
DistilBERT-base   Pythia-70m   DistilRoBERTa   DeBERTa-v3-small   BERT-base   GPT2-small   RoBERTa-base   Pythia-160m   DialoGPT-small   DistilGPT2
Gemma-2-2B   Pythia-410m
```
See Quick Reference for full details and benchmarks.
Quick smoke test on a small dataset:

```bash
export MOODBENCH_TEST_MODE=1
uv run moodbench train --model BERT-tiny --dataset imdb --device=mps
```

Train the mid-size distilled and small models:

```bash
uv run moodbench train-all --dataset amazon --device=mps \
  --models DistilBERT-base DistilRoBERTa DeBERTa-v3-small RoBERTa-base
```

Compare BERT-family models across sizes:

```bash
uv run moodbench train-all --dataset amazon --device=mps \
  --models BERT-tiny BERT-mini BERT-small BERT-base DistilBERT-base RoBERTa-base
```

The framework is organized as a pipeline:

```
Data Pipeline → Training Engine → Evaluation Engine → Comparison Module → Visualization
      ↓                ↓                  ↓                    ↓                 ↓
   Loader          LoRA/QLoRA          Metrics            Statistical        Dashboard
   Preprocessor    4-bit Quant         Speed Benchmark    Analysis           Reports
   Tokenizer       Multi-Device        Memory Profile     Ranking            Charts
```
Web Interface: Modular Gradio UI with dedicated tabs for training, benchmarking, analysis, NPS estimation, and methodology documentation.
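The Evaluation Engine's speed and memory numbers boil down to timing forward passes and reading allocator statistics. A simplified sketch of that kind of measurement in plain PyTorch; the function below is illustrative and is not part of MoodBench's API:

```python
# Illustrative latency measurement for a sentiment classifier; p50/p95/p99 mirror the
# "latency percentiles" metric listed above. Not MoodBench's internal implementation.
import time
import numpy as np
import torch

def latency_percentiles(model, tokenizer, texts, device="cpu", runs=50):
    model.to(device).eval()
    timings_ms = []
    with torch.no_grad():
        for _ in range(runs):
            batch = tokenizer(
                texts, return_tensors="pt", padding=True, truncation=True
            ).to(device)
            start = time.perf_counter()
            model(**batch)
            # on GPU backends, call torch.cuda.synchronize() / torch.mps.synchronize()
            # here before reading the clock
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {p: float(np.percentile(timings_ms, p)) for p in (50, 95, 99)}
```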
Supported datasets:

- IMDB - Movie reviews (50K samples)
- SST2 - Stanford Sentiment Treebank (67K sentences)
- Amazon - Product reviews (4M samples)
- Yelp - Business reviews (650K samples)
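All four corpora are available on the Hugging Face Hub, so inspecting the raw data takes only a couple of lines; the dataset and split names below follow the public Hub versions and may differ from MoodBench's internal configuration:

```python
# Peek at two of the benchmark datasets via the `datasets` library.
from datasets import load_dataset

imdb = load_dataset("imdb")           # train/test movie reviews with binary labels
sst2 = load_dataset("glue", "sst2")   # Stanford Sentiment Treebank sentences

example = imdb["train"][0]
print(example["label"], example["text"][:120])
```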
| Platform | Status | Optimizations |
|---|---|---|
| CUDA (NVIDIA) | ✅ Full support | 4-bit quantization, fp16 |
| MPS (Apple Silicon) | ✅ Full support | Dynamic batching, gradient checkpointing |
| CPU | ✅ Supported | Optimized for ultra-tiny models |
Recommended:
- CUDA: 16GB+ RAM, 8GB+ VRAM
- MPS: M2/M3 with 32GB+ unified memory
- CPU: 32GB+ RAM (ultra-tiny models only)
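On CUDA, the 4-bit path referenced in the table above typically goes through bitsandbytes. A minimal sketch of loading a classifier that way; the model name and settings are illustrative, not MoodBench defaults:

```python
# Illustrative 4-bit (NF4) loading on a CUDA GPU; requires the bitsandbytes package.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # matches the fp16 compute noted above
)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    quantization_config=bnb,
    device_map="auto",
)
```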
Install with uv:

```bash
git clone https://github.com/yourusername/moodbench.git
cd moodbench
uv sync
```

Or with pip:

```bash
git clone https://github.com/yourusername/moodbench.git
cd moodbench
pip install -e .
```

Requirements:
- Python 3.12+
- PyTorch 2.1+
- 50GB+ storage for datasets and models
```bash
# Train single model
uv run moodbench train --model <model-name> --dataset <dataset>

# Train multiple models
uv run moodbench train-all --dataset <dataset> --models <model1> <model2> ...

# Evaluate model
uv run moodbench evaluate --model <model> --dataset <dataset> --checkpoint <path>

# Run benchmarks
uv run moodbench benchmark --models BERT-tiny DistilBERT-base --datasets imdb sst2

# Generate reports
uv run moodbench report --results-dir experiments/results
```

See Quick Reference for detailed usage.
```
moodbench/
├── config/            # Model, dataset, and training configurations
├── src/               # Core framework code
│   ├── data/          # Dataset loading and preprocessing
│   ├── models/        # Model registry and LoRA configurations
│   ├── training/      # Training engine and optimizers
│   ├── evaluation/    # Metrics and benchmarking
│   ├── comparison/    # Result aggregation and ranking
│   ├── ui/            # Modular Gradio web interface components
│   └── visualization/ # Dashboard and reporting
├── experiments/       # Training logs, checkpoints, results
├── notebooks/         # Jupyter notebooks for analysis
├── tests/             # Unit and integration tests
├── scripts/           # Shell scripts for common tasks
└── docs/              # Comprehensive documentation
```
Models are configured in `config/models.yaml`:

```yaml
- name: "prajjwal1/bert-tiny"
  alias: "BERT-tiny"
  size_params: "4M"
  architecture: "encoder-only"
  lora:
    rank: 4
    alpha: 8
    dropout: 0.05
    target_modules: ["query", "value"]
  recommended_batch_size:
    cuda: 64
    mps: 32
    cpu: 16
  memory_requirements:
    cuda_4bit: "0.1GB"
    mps_fp32: "0.5GB"
    cpu: "1GB"
```

See Model Configuration Guide for details on adding custom models.
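The `lora` block of each entry maps directly onto a PEFT `LoraConfig`. A hedged sketch of that mapping, assuming `config/models.yaml` is a top-level list of entries like the one above; the loading code is illustrative and is not the actual ModelRegistry implementation:

```python
# Illustrative: build a LoraConfig from a models.yaml entry.
import yaml
from peft import LoraConfig, TaskType

with open("config/models.yaml") as f:
    entries = yaml.safe_load(f)

entry = next(e for e in entries if e["alias"] == "BERT-tiny")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=entry["lora"]["rank"],
    lora_alpha=entry["lora"]["alpha"],
    lora_dropout=entry["lora"]["dropout"],
    target_modules=entry["lora"]["target_modules"],
)
```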
If you hit out-of-memory errors on MPS:

```bash
# Use smaller models
--models BERT-tiny BERT-mini BERT-small DistilBERT-base

# Or allow higher memory usage
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
```

If training is too slow:

```bash
# Enable test mode with small dataset
export MOODBENCH_TEST_MODE=1

# Start with ultra-tiny models
--models BERT-tiny BERT-mini
```

If a model name is not recognized:

```bash
# List available models
uv run python -c "from src.models.model_registry import ModelRegistry; \
print('\n'.join(ModelRegistry().list_models()))"
```

See Model Configuration Guide - Troubleshooting for more solutions.
| Model | Size | Accuracy | F1 | Latency (ms) | Throughput (tok/s) | Memory (MB) |
|---|---|---|---|---|---|---|
| BERT-tiny | 4M | 0.823 | 0.815 | 8.2 | 5000+ | 500 |
| DistilBERT-base | 66M | 0.915 | 0.910 | 18.5 | 2500 | 2000 |
| RoBERTa-base | 125M | 0.932 | 0.928 | 32.1 | 1800 | 3000 |
| DeBERTa-v3-small | 86M | 0.935 | 0.931 | 24.3 | 2100 | 2500 |
Results on IMDB dataset, Apple Silicon M3 Max, 1 epoch
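Gaps between neighbouring models (e.g., RoBERTa-base vs. DeBERTa-v3-small) are where the statistical-significance checks listed under key features matter. One common approach is a paired bootstrap over per-example correctness; the sketch below is illustrative and not necessarily MoodBench's exact procedure:

```python
# Paired bootstrap: probability that model A's accuracy beats model B's on resamples.
import numpy as np

def paired_bootstrap_win_rate(correct_a, correct_b, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)  # 1.0 if model A got example i right
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample examples with replacement
    return float(np.mean(a[idx].mean(axis=1) > b[idx].mean(axis=1)))
```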
We welcome contributions! Areas of interest:
- Adding new models to the registry
- Supporting additional datasets
- Improving benchmarking metrics
- Enhancing visualization
- Documentation improvements
MIT License - See LICENSE file for details
Built with:
- Transformers - Model implementations
- PEFT - LoRA fine-tuning
- PyTorch - Deep learning framework
- Gradio - Interactive web interface
For detailed documentation, see:
- Quick Reference - Get started in 5 minutes
- Model Configuration Guide - Complete technical guide
- CLAUDE.md - Architecture and implementation details
- Documentation Index - Navigate all docs
Project version: 0.1.0 · Last updated: 2025-11-24