OTTO is a complete end-to-end pipeline for training specialized Small Language Models (SLMs) on custom business data. It enables organizations to train domain-specific language models without relying on expensive LLM fine-tuning or external APIs.
OTTO provides a streamlined workflow: Upload → Process → Preprocess → Train → Evaluate
The pipeline automatically handles file processing, data cleaning, tokenization, model training, and evaluation to produce specialized language models tailored to your specific use case.
Works Best With:
- Call transcripts and customer conversations
- Text-based business documents
- Natural language content (emails, reports, reviews)
- Conversational data and dialog systems
Limited Support For:
- Structured data (CSV, spreadsheets) - produces incoherent output
- Mixed media content requiring vision/audio processing
- Highly technical formats requiring specialized preprocessing
# Clone the repository
git clone <repository-url>
cd otto
# Install dependencies
uv sync
# Install system dependencies (optional, for better file type detection)
brew install libmagic # macOS
# sudo apt-get install libmagic1 # Ubuntu

# Train a model on your data
uv run src/otto/cli_runner.py your_data.txt
# Custom training parameters
uv run src/otto/cli_runner.py your_data.txt \
--max-iters 2000 \
--batch-size 32 \
--block-size 128
# Skip training, only preprocess
uv run src/otto/cli_runner.py your_data.txt --no-training
# Run inference and evaluation
uv run src/otto/inference.py model_outputs/best_model.pt --interactive

- FileUploadManager - Handles file uploads and basic validation
- FileTypeDetector - Intelligent file type detection with fallback chains
- ProcessedFileSet - Manages archive extraction and temporary file handling
- FileProcessor - Coordinates document processing pipeline
- DocumentProcessor - Converts documents to training-ready format
- SLMTrainer - Trains GPT-style transformer models
- SLMInference - Handles model loading and text generation
Raw Files → Upload → Type Detection → Archive Extraction → Document Loading →
Text Cleaning → Tokenization → Binary Files → Model Training → Evaluation
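The Text Cleaning → Tokenization → Binary Files stages can be sketched roughly as follows. This is an illustrative character-level version, not OTTO's actual implementation:

```python
import re

import numpy as np


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces (simplified cleaning step)."""
    return re.sub(r"\s+", " ", raw).strip()


def tokenize_to_binary(text: str, out_path: str) -> dict:
    """Character-level tokenization written out as uint16 token IDs,
    the same kind of flat binary layout a trainer can memory-map later."""
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
    ids.tofile(out_path)  # e.g. training_data/train.bin
    return stoi
```

A real run would also split the token IDs into train and validation files before training.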
- Text: .txt, .md, .rst
- Structured: .csv, .tsv, .json, .jsonl
- Archives: .zip, .tar, .tar.gz, .tar.bz2, .gz
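The fallback-chain idea behind FileTypeDetector can be approximated with the standard library alone. A minimal sketch (the real detector can additionally use libmagic content sniffing when it is installed):

```python
import mimetypes
from pathlib import Path


def detect_file_type(path: str) -> str:
    """Guess a MIME type from the file extension, falling back to a small
    extra map for extensions the stdlib table may not know, then to a
    generic binary default."""
    mime, _ = mimetypes.guess_type(path)
    if mime:
        return mime
    extra = {".jsonl": "application/json", ".rst": "text/x-rst"}
    return extra.get(Path(path).suffix.lower(), "application/octet-stream")
```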
- GPT-style transformer with causal attention
- Configurable layers, heads, and embedding dimensions
- Support for mixed precision training (FP16/BF16)
- Automatic vocabulary size detection
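The causal (masked) self-attention at the heart of a GPT-style block can be sketched as a single head in NumPy. This is illustrative only; OTTO's model is a standard multi-head transformer:

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: each position attends only to
    itself and earlier positions (illustrative sketch)."""
    T, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Position t's output depends only on positions up to t, which is what makes next-token training possible.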
# Small model for testing
n_layer=4, n_head=4, n_embd=256, block_size=64
# Production model
n_layer=6, n_head=6, n_embd=384, block_size=128

- Gradient accumulation for large effective batch sizes
- Learning rate warmup and cosine decay
- Automatic checkpointing and best model saving
- Memory-efficient data loading with memory mapping
- Progress tracking and loss monitoring
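Linear warmup followed by cosine decay can be sketched as below. The specific numbers are illustrative defaults, not necessarily OTTO's:

```python
import math


def lr_schedule(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, max_steps=2000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```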
# Train on call transcripts
uv run src/otto/cli_runner.py customer_calls.txt \
--max-iters 5000 \
--batch-size 32

# Train on legal documents (longer context for complex documents)
uv run src/otto/cli_runner.py legal_corpus.txt \
--block-size 256 \
--n-layer 8

# Test your trained model
uv run src/otto/inference.py model_outputs/best_model.pt --interactive
# Evaluate model performance
uv run src/otto/inference.py model_outputs/best_model.pt \
--evaluate --test-data training_data/val.bin

Signs of Healthy Training:
- Perplexity: 10-100 range
- Generation: Coherent, domain-relevant text
- Training Loss: Decreases from ~10 to 2-4 range
Signs of Poor Training:
- Perplexity: >1000 indicates poor learning
- Generation: Random token sequences
- Loss: Not decreasing or very high (>8)
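Perplexity is simply the exponential of the mean cross-entropy loss (in nats), which is how the loss and perplexity ranges above line up:

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """exp(loss): the model's effective branching factor per token."""
    return math.exp(mean_cross_entropy_loss)


# A loss in the healthy 2-4 range corresponds to perplexity of roughly 7-55,
# while a loss above 8 already means perplexity near 3000 or worse.
```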
- At least ~100k tokens for meaningful training
- Text should be naturally flowing (not structured tables)
- Works best with conversational or narrative content
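A quick pre-flight sanity check against the ~100k-token guideline. This is a hypothetical helper, not part of the OTTO CLI, and the real tokenizer's count will differ:

```python
def rough_token_count(path: str) -> int:
    """Rough whitespace-delimited token count for sizing a dataset."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return sum(len(line.split()) for line in f)


# if rough_token_count("your_data.txt") < 100_000:
#     print("warning: dataset may be too small and prone to overfitting")
```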
- CPU training only (GPU support planned)
- No distributed training
- Limited to text-only data
- No fine-tuning from pretrained models
- CSV data produces incoherent output without preprocessing
- Very small datasets lead to overfitting
- No support for multi-modal data
- CSV-to-natural-language converter
- Template-based data description generation
- Configurable data formatting strategies
- Support for tabular data relationships
- CUDA acceleration for training
- Multi-GPU support for larger models
- Memory optimization for large datasets
- Automatic device detection and optimization
- Domain-specific text cleaning
- Advanced tokenization strategies
- Support for code and technical documentation
- Multi-language text handling
- Support for different transformer variants
- Configurable attention mechanisms
- Model size recommendations based on data
- Pretrained model fine-tuning capabilities
- Distributed training across multiple machines
- Curriculum learning strategies
- Advanced optimization techniques
- Hyperparameter auto-tuning
- Task-specific model heads (classification, sentiment)
- Domain adaptation techniques
- Privacy-preserving training options
- Model compression and quantization
- Web-based training interface
- Real-time training monitoring
- Model comparison tools
- Automated report generation
- API server for model serving
- Docker containerization
- Cloud deployment templates
- Integration with business tools
- Domain-specific evaluation metrics
- A/B testing framework
- Model drift detection
- Performance monitoring dashboard
# Install development dependencies
uv sync --dev
# Run tests
pytest tests/
# Format code
black src/
isort src/

To add a new file type:

- Create a loader in src/otto/data_loaders/
- Add a MIME type mapping in FileTypeDetector
- Add tests for the new loader
- Update documentation

To add a new model architecture:

- Implement the model in src/otto/models/
- Update SLMTrainer to support the new architecture
- Add configuration validation
- Test training and inference
MIT License
If you use OTTO in your research or business applications, please cite:
@software{otto_slm_pipeline,
title={OTTO: Small Language Model Training Pipeline},
author={Rosemary Nwosu-Ihueze},
year={2025},
url={https://github.com/Nwosu-Ihueze/otto}
}
- Issues: Report bugs and feature requests via GitHub Issues
Note: This is the first iteration of OTTO. While functional for text-based training, significant improvements are planned for structured data support and production deployment features.