MoodBench

Multi-LLM Sentiment Analysis Benchmark Framework

Fast, efficient benchmarking of 17 small language models (4M-410M parameters) for sentiment analysis using LoRA fine-tuning.


🚀 Quick Start

Command Line Interface

# Install dependencies
uv sync

# Train 5 models (~1 hour total)
uv run moodbench train-all --dataset amazon --device=mps \
  --models BERT-tiny BERT-mini BERT-small ELECTRA-small MiniLM-L12

# Evaluate models
uv run moodbench benchmark --dataset amazon

# View results
uv run moodbench report --results-dir experiments/results

Web Interface (Alternative)

# Install additional dependencies
uv add gradio

# Launch web UI
python gradio_app.py

# Open http://localhost:7860 in your browser

The web interface provides modular tabs for training, benchmarking, analysis, NPS estimation, and methodology documentation.
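
As background for the NPS estimation tab: Net Promoter Score is conventionally the percentage of promoters minus the percentage of detractors. Below is a minimal, hypothetical sketch of how per-review sentiment predictions could be folded into an approximate NPS; the mapping MoodBench actually applies may differ.

def approximate_nps(sentiments: list[str]) -> float:
    """Approximate NPS in [-100, 100] from per-review sentiment labels.

    Assumption (illustrative only): positive reviews count as promoters,
    negative reviews as detractors, and anything else as passives.
    """
    promoters = sum(s == "positive" for s in sentiments)
    detractors = sum(s == "negative" for s in sentiments)
    return 100.0 * (promoters - detractors) / len(sentiments)

print(approximate_nps(["positive", "positive", "negative", "neutral"]))  # 25.0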

📚 Documentation

Getting Started

User Interfaces

  • Gradio Web UI - Interactive web interface with modular tabs for training, benchmarking, analysis, NPS estimation, and methodology

Technical Details

  • CLAUDE.md - Architecture, technical implementation, and development guide
  • Documentation Index - Navigate all documentation by role and use case

🎯 What is MoodBench?

MoodBench is an automated benchmarking framework that fine-tunes, evaluates, and compares small language models for sentiment analysis. It uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA to enable efficient training on consumer hardware. As a proof of concept, it also estimates an approximate Net Promoter Score (NPS) from review sentiment.
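
As a rough illustration of the PEFT approach, the sketch below attaches a LoRA adapter to one of the supported checkpoints with the Hugging Face peft library, using the BERT-tiny hyperparameters from config/models.yaml shown later in this README. It is a minimal example, not MoodBench's actual training engine (see src/training/).

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Base checkpoint with a freshly initialized 2-class sentiment head
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2
)

# LoRA hyperparameters matching the BERT-tiny entry in config/models.yaml
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["query", "value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train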

Key Features:

  • πŸƒ 17 optimized models from 4M to 410M parameters
  • ⚑ Fast benchmarking - Ultra-tiny models train in 5-15 minutes
  • πŸ’Ύ Memory efficient - All models <6GB on Apple Silicon M4
  • πŸ“Š Comprehensive metrics - Accuracy, F1, balanced accuracy, latency percentiles, throughput, memory, statistical significance, robustness
  • πŸ”§ Production ready - CI/CD-friendly, reproducible benchmarks
  • 🌐 Web Interface - Interactive Gradio UI for all operations
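
The quality metrics above are the standard classification measures; a minimal sketch of how they could be reproduced with scikit-learn (MoodBench's evaluation engine in src/evaluation/ adds latency, throughput, memory, and significance testing on top):

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical gold labels and model predictions (0 = negative, 1 = positive)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))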

📊 Available Models

Ultra-Tiny (4M-30M) - Fastest

BERT-tiny, BERT-mini, ELECTRA-small, BERT-small, MiniLM-L12

Tiny (60M-170M) - Production Quality

DistilBERT-base, Pythia-70m, DistilRoBERTa, DeBERTa-v3-small, BERT-base, GPT2-small, RoBERTa-base, Pythia-160m, DialoGPT-small, DistilGPT2

Medium (200M-500M) - Research Quality

Gemma-2-2B, Pythia-410m

See Quick Reference for full details and benchmarks.

💡 Common Use Cases

Quick Validation

export MOODBENCH_TEST_MODE=1
uv run moodbench train --model BERT-tiny --dataset imdb --device=mps

Production Model Selection

uv run moodbench train-all --dataset amazon --device=mps \
  --models DistilBERT-base DistilRoBERTa DeBERTa-v3-small RoBERTa-base

Research Comparison

uv run moodbench train-all --dataset amazon --device=mps \
  --models BERT-tiny BERT-mini BERT-small BERT-base DistilBERT-base RoBERTa-base

🎨 Architecture

Data Pipeline → Training Engine → Evaluation Engine → Comparison Module → Visualization
     ↓               ↓                    ↓                    ↓               ↓
   Loader      LoRA/QLoRA           Metrics             Statistical      Dashboard
Preprocessor   4-bit Quant     Speed Benchmark          Analysis          Reports
 Tokenizer     Multi-Device     Memory Profile           Ranking          Charts

Web Interface: Modular Gradio UI with dedicated tabs for training, benchmarking, analysis, NPS estimation, and methodology documentation.

📦 Supported Datasets

  • IMDB - Movie reviews (50K samples)
  • SST2 - Stanford Sentiment Treebank (67K sentences)
  • Amazon - Product reviews (4M samples)
  • Yelp - Business reviews (650K samples)
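
All four are available on the Hugging Face Hub; for a quick look at the raw data outside the framework, something like the following works (MoodBench's own loader in src/data/ handles splitting, preprocessing, and tokenization):

from datasets import load_dataset

# IMDB: binary movie-review sentiment (50K labeled reviews in total)
imdb = load_dataset("imdb")
sample = imdb["train"][0]
print(sample["label"], sample["text"][:120])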

🖥️ Hardware Support

Platform             Status           Optimizations
CUDA (NVIDIA)        ✅ Full support   4-bit quantization, fp16
MPS (Apple Silicon)  ✅ Full support   Dynamic batching, gradient checkpointing
CPU                  ✅ Supported      Optimized for ultra-tiny models

Recommended:

  • CUDA: 16GB+ RAM, 8GB+ VRAM
  • MPS: M2/M3 with 32GB+ unified memory
  • CPU: 32GB+ RAM (ultra-tiny models only)
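
The platforms above map onto PyTorch device backends; a small sketch for checking which one is available locally before choosing a --device value:

import torch

# Pick the strongest available backend for the --device flag
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Recommended --device value: {device}")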

πŸ› οΈ Installation

Using uv (Recommended)

git clone https://github.com/andrewmarconi/MoodBench.git
cd MoodBench
uv sync

Using pip

git clone https://github.com/andrewmarconi/MoodBench.git
cd MoodBench
pip install -e .

Requirements:

  • Python 3.12+
  • PyTorch 2.1+
  • 50GB+ storage for datasets and models

📖 CLI Commands

# Train single model
uv run moodbench train --model <model-name> --dataset <dataset>

# Train multiple models
uv run moodbench train-all --dataset <dataset> --models <model1> <model2> ...

# Evaluate model
uv run moodbench evaluate --model <model> --dataset <dataset> --checkpoint <path>

# Run benchmarks
uv run moodbench benchmark --models BERT-tiny DistilBERT-base --datasets imdb sst2

# Generate reports
uv run moodbench report --results-dir experiments/results

See Quick Reference for detailed usage.

πŸ“ Project Structure

moodbench/
├── config/              # Model, dataset, and training configurations
├── src/                 # Core framework code
│   ├── data/            # Dataset loading and preprocessing
│   ├── models/          # Model registry and LoRA configurations
│   ├── training/        # Training engine and optimizers
│   ├── evaluation/      # Metrics and benchmarking
│   ├── comparison/      # Result aggregation and ranking
│   ├── ui/              # Modular Gradio web interface components
│   └── visualization/   # Dashboard and reporting
├── experiments/         # Training logs, checkpoints, results
├── notebooks/           # Jupyter notebooks for analysis
├── tests/               # Unit and integration tests
├── scripts/             # Shell scripts for common tasks
└── docs/               # Comprehensive documentation

🔧 Configuration

Models are configured in config/models.yaml:

- name: "prajjwal1/bert-tiny"
  alias: "BERT-tiny"
  size_params: "4M"
  architecture: "encoder-only"
  lora:
    rank: 4
    alpha: 8
    dropout: 0.05
    target_modules: ["query", "value"]
  recommended_batch_size:
    cuda: 64
    mps: 32
    cpu: 16
  memory_requirements:
    cuda_4bit: "0.1GB"
    mps_fp32: "0.5GB"
    cpu: "1GB"

See Model Configuration Guide for details on adding custom models.

πŸ› Troubleshooting

Out of Memory (MPS)

# Use smaller models
--models BERT-tiny BERT-mini BERT-small DistilBERT-base

# Or allow higher memory usage
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

Training Too Slow

# Enable test mode with small dataset
export MOODBENCH_TEST_MODE=1

# Start with ultra-tiny models
--models BERT-tiny BERT-mini

Model Not Found

# List available models
uv run python -c "from src.models.model_registry import ModelRegistry; \
  print('\n'.join(ModelRegistry().list_models()))"

See Model Configuration Guide - Troubleshooting for more solutions.

📊 Example Results

Model             Size  Accuracy  F1     Latency (ms)  Throughput (tok/s)  Memory (MB)
BERT-tiny         4M    0.823     0.815  8.2           5000+               500
DistilBERT-base   66M   0.915     0.910  18.5          2500                2000
RoBERTa-base      125M  0.932     0.928  32.1          1800                3000
DeBERTa-v3-small  86M   0.935     0.931  24.3          2100                2500

Results on IMDB dataset, Apple Silicon M3 Max, 1 epoch
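
Latency and throughput numbers like these are hardware- and batch-size-dependent; below is a rough sketch of how single-model latency percentiles can be measured with plain transformers and PyTorch (MoodBench's speed benchmark in src/evaluation/ is the authoritative implementation and also tracks throughput and memory):

import time
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2).eval()

# A small fixed batch of reviews; real benchmarks sweep batch sizes and lengths
batch = tokenizer(["Great movie, would watch again!"] * 8,
                  return_tensors="pt", padding=True)

latencies_ms = []
with torch.no_grad():
    for _ in range(50):
        start = time.perf_counter()
        model(**batch)
        latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(latencies_ms, 50):.1f} ms, "
      f"p95: {np.percentile(latencies_ms, 95):.1f} ms")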

🤝 Contributing

We welcome contributions! Areas of interest:

  • Adding new models to the registry
  • Supporting additional datasets
  • Improving benchmarking metrics
  • Enhancing visualization
  • Documentation improvements

📄 License

MIT License - See LICENSE file for details

πŸ™ Acknowledgments

Built with:

  • PyTorch
  • Hugging Face PEFT (LoRA fine-tuning)
  • Gradio (web interface)
For detailed documentation, see the docs/ directory and the Documentation Index.

Project version: 0.1.0 | Last updated: 2025-11-24
