MLE-STAR Workflow
The MLE-STAR (Machine Learning Engineering via Search and Targeted Refinement) workflow is Claude-Flow's flagship automation feature, representing a revolutionary approach to ML engineering that combines web search, foundation building, targeted optimization, intelligent ensembles, and comprehensive validation.
MLE-STAR is a systematic methodology that transforms how machine learning systems are built by:
- 🔍 Starting with Search: Web research to find state-of-the-art approaches
- 🏗️ Building Foundations: Creating robust baseline models and infrastructure
- 🎯 Targeted Refinement: Deep optimization of high-impact components only
- 🏛️ Intelligent Ensembles: Advanced model combination beyond simple averaging
- ✅ Rigorous Validation: Comprehensive testing with data leakage prevention
"Don't guess, search. Don't optimize everything, focus. Don't average models, orchestrate them."
Traditional ML approaches often involve:
- ❌ Starting from scratch without research
- ❌ Random optimization of all components
- ❌ Simple model averaging for ensembles
- ❌ Basic validation without leakage detection
MLE-STAR transforms this into:
- ✅ Research-driven foundation building
- ✅ Ablation-guided targeted optimization
- ✅ Sophisticated ensemble orchestration
- ✅ Production-ready validation and deployment
```mermaid
graph TB
    subgraph "Phase 1: Discovery & Foundation (Parallel)"
        A[🔍 Web Search Agent<br/>Research SOTA approaches<br/>Find winning solutions]
        B[📊 Foundation Agent<br/>Analyze dataset<br/>Build baseline models]
    end
    subgraph "Phase 2: Analysis & Refinement (Sequential)"
        C[🔬 Ablation Agent<br/>Identify high-impact<br/>components via testing]
        D[⚡ Refinement Agent<br/>Deep optimization of<br/>critical components]
    end
    subgraph "Phase 3: Ensemble & Validation (Parallel)"
        E[🏛️ Ensemble Agent<br/>Intelligent combination<br/>Meta-learning strategies]
        F[✅ Validation Agent<br/>Comprehensive testing<br/>Production readiness]
    end
    G[🚀 Production Deployment<br/>Monitoring & Documentation]

    A --> C
    B --> C
    C --> D
    D --> E
    D --> F
    E --> G
    F --> G

    style A fill:#e3f2fd
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#e0f2f1
    style G fill:#ffebee
```
```bash
# Basic MLE-STAR execution
claude-flow automation mle-star --dataset your-data.csv --target your-target --claude

# With custom configuration
claude-flow automation mle-star \
  --dataset sales-data.csv \
  --target revenue \
  --output models/revenue/ \
  --name "revenue-prediction" \
  --search-iterations 5 \
  --refinement-iterations 8 \
  --max-agents 8 \
  --claude
```

Prerequisites:
- Dataset: CSV file with features and target column
- Claude Integration: the `--claude` flag for actual execution
- Python Environment: for ML model execution
- Sufficient Resources: 8GB+ RAM, 10GB+ storage recommended
```
claude-flow automation mle-star [options]

Required:
  --dataset <path>             Path to dataset file (default: ./data/dataset.csv)
  --target <column>            Target column name (default: target)

Core Options:
  --claude                     Enable Claude CLI integration (recommended)
  --output <dir>               Model output directory (default: ./models/)
  --name <experiment>          Experiment name for tracking

Iteration Control:
  --search-iterations <n>      Web search iterations (default: 3)
  --refinement-iterations <n>  Refinement cycles (default: 5)
  --max-agents <n>             Maximum agents to spawn (default: 6)

Execution Modes:
  --interactive                Use interactive mode with master coordinator
  --non-interactive            Force non-interactive mode (default)
  --output-format <format>     Output format (stream-json enables chaining)

Advanced:
  --chaining                   Enable stream-json chaining between agents
  --no-chaining                Disable stream-json chaining
  --timeout <ms>               Execution timeout (default: 14400000 ms, i.e. 4 hours)
  --verbose                    Enable detailed logging
  --no-claude-warning          Suppress Claude integration warnings
```

Purpose: Research state-of-the-art approaches and winning solutions
Activities:
- Search for latest ML models for the problem domain
- Find winning Kaggle solutions and benchmarks
- Identify promising architectures and techniques
- Document implementation examples and model cards
- Focus on proven, recent approaches with good performance
Example Search Queries:
- "latest machine learning models {problem_type} 2024"
- "kaggle winning solutions {problem_type}"
- "state-of-the-art {problem_type} benchmarks"
- "feature engineering techniques {problem_type}"
Output:
- Recommended approaches with performance metrics
- Model cards and implementation guides
- Benchmark results and leaderboard positions
- Code examples and proven techniques
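To make the query templates concrete, here is a minimal sketch of how they might be instantiated; the `PROBLEM_TYPE` value is an assumed example, not something produced by the workflow itself:

```python
# Hypothetical instantiation of the search-agent query templates above.
PROBLEM_TYPE = "customer churn classification"  # assumed example value

QUERY_TEMPLATES = [
    "latest machine learning models {problem_type} 2024",
    "kaggle winning solutions {problem_type}",
    "state-of-the-art {problem_type} benchmarks",
    "feature engineering techniques {problem_type}",
]

for template in QUERY_TEMPLATES:
    print(template.format(problem_type=PROBLEM_TYPE))
```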
Purpose: Analyze dataset and build robust baseline models
Activities:
- Comprehensive dataset analysis and profiling
- Data quality assessment and issue identification
- Problem type classification (regression, classification, etc.)
- Baseline model implementation using researched approaches
- Preprocessing pipeline creation
- Initial performance benchmarking
Dataset Analysis:
```
# Example analysis performed
- Data shape and types
- Missing values and outliers
- Feature correlations and distributions
- Target variable analysis
- Class balance (for classification)
- Time series patterns (if applicable)
```

Baseline Models:
- Simple models (linear regression, decision trees)
- Standard algorithms (random forest, gradient boosting)
- Research-informed models (based on search results)
- Ensemble baseline (simple voting/averaging)
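For intuition, here is a minimal scikit-learn sketch of what this foundation phase might look like. The file path, target column, and model choices are illustrative assumptions (matching the CLI defaults above); the real agent builds much richer pipelines informed by the search results:

```python
# A hedged sketch of the foundation phase: train a few baselines plus a simple
# voting ensemble and compare cross-validated AUC (assumes a binary target).
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("your-data.csv")                 # path from --dataset
X, y = df.drop(columns=["target"]), df["target"]  # column from --target

baselines = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
# Simple ensemble baseline: soft voting over the individual models.
baselines["voting"] = VotingClassifier(list(baselines.items()), voting="soft")

for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:18s} AUC {scores.mean():.3f} ± {scores.std():.3f}")
```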
Purpose: Systematically identify which components have highest performance impact
Methodology:
- Component Isolation: Test each pipeline component independently
- Impact Measurement: Quantify performance change when removing/changing components
- Statistical Significance: Ensure changes are meaningful, not noise
- Ranking Creation: Order components by impact magnitude
- Resource Allocation: Focus refinement on top-impact components
Components Tested:
- Data preprocessing steps
- Feature engineering techniques
- Model selection choices
- Hyperparameter ranges
- Cross-validation strategies
- Ensemble combination methods
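A hedged sketch of the ablation loop itself, using a toy pipeline on synthetic data: each component is switched off in turn, and the change in cross-validated score is attributed to it. The component names and pipeline are illustrative, not the agent's actual internals:

```python
# Toggle one pipeline component at a time and measure its CV-score impact.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

def build_pipeline(scaling=True, poly=True, selection=True):
    """Assemble the pipeline with individual components toggled on or off."""
    steps = []
    if scaling:
        steps.append(("scale", StandardScaler()))
    if poly:
        steps.append(("poly", PolynomialFeatures(degree=2, include_bias=False)))
    if selection:
        steps.append(("select", SelectKBest(k=20)))
    steps.append(("model", GradientBoostingClassifier(random_state=0)))
    return Pipeline(steps)

full = cross_val_score(build_pipeline(), X, y, cv=5, scoring="roc_auc").mean()
for component in ("scaling", "poly", "selection"):
    ablated = cross_val_score(build_pipeline(**{component: False}),
                              X, y, cv=5, scoring="roc_auc").mean()
    # Positive delta means removing the component costs that much AUC.
    print(f"{component:10s} impact: {full - ablated:+.4f} AUC")
```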
Example Ablation Results:
```
Component Impact Analysis:
├── Feature Engineering: +12.3% performance impact (HIGH)
├── Model Selection: +8.7% performance impact (HIGH)
├── Hyperparameter Tuning: +4.2% performance impact (MEDIUM)
├── Data Preprocessing: +2.1% performance impact (LOW)
└── Cross-validation: +0.8% performance impact (LOW)

Recommendation: Focus refinement on Feature Engineering and Model Selection
```
Purpose: Deep, focused optimization of highest-impact components
Targeted Optimization Strategy:
- Focus on Top Components: Only optimize high-impact elements
- Iterative Improvement: Multiple refinement cycles with feedback
- Advanced Techniques: Use sophisticated optimization methods
- Performance Tracking: Monitor improvements and prevent overfitting
- Resource Efficiency: Spend time where it matters most
Optimization Techniques:
- Feature Engineering: Advanced transformations, interactions, selections
- Hyperparameter Optimization: Bayesian optimization, evolutionary algorithms
- Architecture Search: Neural architecture search for deep learning
- Regularization: Advanced regularization techniques
- Data Augmentation: Synthetic data generation where appropriate
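As one concrete, deliberately simplified example of a refinement cycle: suppose ablation flagged the model's hyperparameters as a high-impact component. A focused randomized search might look like the sketch below, with scipy's distributions standing in for the more sophisticated Bayesian search the workflow can apply:

```python
# One refinement cycle: randomized search over a focused hyperparameter space.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 600),
        "learning_rate": uniform(0.01, 0.2),
        "max_depth": randint(2, 6),
        "subsample": uniform(0.6, 0.4),
    },
    n_iter=40,            # one cycle's search budget
    cv=5,
    scoring="roc_auc",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(f"Best CV AUC: {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")
```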
Example Refinement Process:
```
Iteration 1: Feature Engineering Focus
├── Created polynomial features: +3.2% improvement
├── Applied PCA transformation: +1.8% improvement
├── Added interaction terms: +2.7% improvement
└── Total Phase Improvement: +7.7%

Iteration 2: Model Architecture Focus
├── Grid searched tree parameters: +2.1% improvement
├── Tested ensemble weights: +1.9% improvement
├── Applied stacking strategy: +3.4% improvement
└── Total Phase Improvement: +7.4%

Overall Refinement Gain: +15.1% from baseline
```
Purpose: Create sophisticated model combinations beyond simple averaging
Advanced Ensemble Strategies:
```
# Multi-level stacking approach
Level 1: Base models (RF, XGB, Neural Net, SVM)
Level 2: Meta-learner (learns optimal combination)
Level 3: Final predictor (calibrated output)
```

```
# Context-aware model weighting
- Instance-specific weights based on confidence
- Feature-space adaptive weighting
- Time-varying weights for temporal data
- Performance-based dynamic adjustment
```

```
# Principled uncertainty quantification
- Posterior model probabilities
- Uncertainty-aware predictions
- Calibrated confidence intervals
- Risk-adjusted decision making
```

```
# Specialized models for different regions
- Gating network determines expert selection
- Experts specialized for data subspaces
- Smooth transitions between experts
- Hierarchical mixture structures
```

Ensemble Performance Optimization:
- Diversity Maximization: Ensure models make different types of errors
- Correlation Minimization: Reduce redundancy between predictions
- Calibration: Ensure confidence scores reflect true probabilities
- Robustness: Test ensemble stability across different conditions
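A minimal sketch of the stacking idea plus a simple diversity check on synthetic data; the base models and the correlation heuristic are illustrative assumptions, not the ensemble agent's actual strategy:

```python
# Stacking with a meta-learner, preceded by a pairwise-correlation diversity check:
# highly correlated out-of-fold predictions add little ensemble value.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Out-of-fold probability predictions for each base model.
oof = {name: cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
       for name, m in base_models}
for n1, p1 in oof.items():
    for n2, p2 in oof.items():
        if n1 < n2:
            print(f"corr({n1}, {n2}) = {np.corrcoef(p1, p2)[0, 1]:.3f}")

# Level-2 meta-learner learns the combination from out-of-fold predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000), cv=5)
print("Stacked AUC:", cross_val_score(stack, X, y, cv=5, scoring="roc_auc").mean())
```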
Purpose: Comprehensive testing with production-readiness focus
Rigorous Validation Framework:
```
# Appropriate CV for different scenarios
- Stratified K-Fold: For classification
- Time Series Split: For temporal data
- Group K-Fold: For grouped/clustered data
- Nested CV: For hyperparameter selection
- Monte Carlo CV: For robust estimates
```

```
# Systematic leakage prevention
- Feature leakage detection
- Temporal leakage identification
- Target encoding validation
- Data split integrity checks
- Information leakage quantification
```

```
# Comprehensive error characterization
- Error distribution analysis
- Failure mode identification
- Bias and fairness assessment
- Robustness testing
- Edge case handling
```

```
# Deployment preparation
- Model serialization/deserialization
- Inference speed benchmarking
- Memory usage profiling
- API endpoint testing
- Monitoring setup validation
```
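To give a flavor of what leakage detection can involve, here is one simple, hedged probe (not the workflow's actual implementation): any single feature that predicts the target almost perfectly on its own deserves a manual audit. A leaky column is deliberately planted so the demo fires:

```python
# Single-feature leakage probe: flag features with suspiciously high solo AUC.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X["leaky"] = y + np.random.default_rng(0).normal(0, 0.01, len(y))  # planted leak

for col in X.columns:
    auc = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                          X[[col]], y, cv=5, scoring="roc_auc").mean()
    if auc > 0.95:  # assumed threshold; tune per problem
        print(f"⚠️  possible leakage: '{col}' alone scores {auc:.3f} AUC")
```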
Validation Report Example:

```
📊 MLE-STAR Validation Results

Cross-Validation Performance:
├── 5-Fold CV Score: 0.847 ± 0.012
├── Time Series Split: 0.834 ± 0.019
├── Out-of-sample Test: 0.851
└── Performance Stability: ✅ STABLE

Data Leakage Assessment:
├── Feature Leakage: ✅ NONE DETECTED
├── Temporal Leakage: ✅ NONE DETECTED
├── Target Encoding: ✅ PROPERLY VALIDATED
└── Information Integrity: ✅ VERIFIED

Production Readiness:
├── Inference Speed: 23ms (Target: <50ms) ✅
├── Memory Usage: 245MB (Budget: 512MB) ✅
├── API Response: 45ms (Target: <100ms) ✅
└── Monitoring: ✅ CONFIGURED

Final Recommendation: ✅ READY FOR PRODUCTION
```
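The "Inference Speed" line in such a report comes down to a latency benchmark. A minimal sketch, assuming a fitted scikit-learn model; a production setup would benchmark the serialized artifact behind the real API endpoint instead:

```python
# Per-prediction latency benchmark for a fitted model.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

single_row = X[:1]
for _ in range(10):               # warm-up to avoid first-call overhead
    model.predict(single_row)

timings = []
for _ in range(200):
    start = time.perf_counter()
    model.predict(single_row)
    timings.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"p50 latency: {np.percentile(timings, 50):.1f} ms")
print(f"p99 latency: {np.percentile(timings, 99):.1f} ms (target: <50 ms)")
```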
```
# Automated deployment package creation
├── Model artifacts (pickled models, weights)
├── Preprocessing pipelines (scalers, encoders)
├── Feature engineering code (transformations)
├── API endpoint definitions (FastAPI/Flask)
├── Docker configuration (containerization)
├── Monitoring dashboards (metrics, alerts)
└── Documentation (usage, API reference)
```

```
# Production monitoring setup
├── Performance drift detection
├── Data drift monitoring
├── Feature importance tracking
├── Prediction confidence analysis
├── Error rate monitoring
├── Business metric correlation
└── Automated retraining triggers
```

Example execution output:

```
🧠 MLE-STAR: Machine Learning Engineering via Search and Targeted Refinement
🎯 This is the flagship automation workflow for ML engineering tasks
📋 Workflow: MLE-STAR Machine Learning Engineering Workflow
📄 Description: Complete Machine Learning Engineering workflow using MLE-STAR methodology
🎓 Methodology: Search → Foundation → Refinement → Ensemble → Validation
⏱️ Expected Runtime: 2-4 hours
📊 Configuration:
Dataset: customer-churn.csv
Target: churn
Output: ./models/churn/
Claude Integration: Enabled
Execution Mode: Non-interactive (default)
Stream Chaining: Enabled
🔗 Stream chaining enabled: Agent outputs will be piped to dependent agents
📋 Executing 6 tasks in 3 phases...
🔄 Phase 1: 2 concurrent tasks
╔═══════════════════════════════════════════════════════════════╗
║ 🤖 CONCURRENT TASK STATUS ║
╠═══════════════════════════════════════════════════════════════╣
║ ⠋ RUNNING (2 agents): ║
║ 🔍 Web Search Research [██████████] 4m 23s ║
║ 📊 Dataset Analysis [████████░░] 3m 15s ║
║ ✅ COMPLETED (0): ║
║ ⏳ QUEUED: 4 tasks waiting ║
╠═══════════════════════════════════════════════════════════════╣
║ 📊 Progress: 33% (2/6) │ ⚡ Active: 2 │ ❌ Failed: 0 ║
╚═══════════════════════════════════════════════════════════════╝
🔄 Phase 2: 2 concurrent tasks
🔗 Enabling stream chaining from web_search_phase to foundation_building
🔗 Enabling stream chaining from dataset_analysis to ablation_analysis
🚀 Starting: Foundation Model Creation
Agent: foundation_agent
🔗 Chaining: Piping output from search_agent
🚀 Starting: Ablation Analysis
Agent: refinement_agent
🔗 Chaining: Piping output from foundation_agent
🔄 Phase 3: 2 concurrent tasks
🔗 Enabling stream chaining from targeted_refinement to ensemble_creation
🔗 Enabling stream chaining from ensemble_creation to validation_and_debugging
🎉 MLE-STAR workflow completed successfully!
📊 Results: 6/6 tasks completed
⏱️ Duration: 2h 47m 18s
🆔 Execution ID: workflow-exec-1704067200000
📈 Key Results:
✅ web_search_phase: Found 15 SOTA approaches, 8 Kaggle winners
✅ dataset_analysis: Identified 3 data quality issues, 12 key features
✅ foundation_building: Built 4 baseline models, best: 0.734 AUC
✅ ablation_analysis: Feature engineering +12.3% impact (highest)
✅ targeted_refinement: Optimized top components, +18.7% improvement
✅ ensemble_creation: Stacking approach, final: 0.891 AUC
✅ validation_and_debugging: No leakage detected, production ready
💡 Next Steps:
• Check models in: ./models/churn/
• Review experiment: customer-churn-prediction
• Validate results with your test data
• Deploy using generated API endpoints
```

Quality Gates:
- Improvement over Baseline: >5% performance gain required
- Statistical Significance: p-value < 0.05 for improvements
- Stability: CV variance < 5% of mean performance
- Generalization: Out-of-sample performance within 3% of CV
- Data Leakage: Zero tolerance policy
- Overfitting: Train/validation gap < 5%
- Bias Detection: Fairness metrics within acceptable ranges
- Robustness: Performance stable across data subsets
- Inference Speed: <100ms per prediction (configurable)
- Memory Usage: Within specified resource budgets
- API Reliability: >99.9% uptime in testing
- Monitoring: All key metrics instrumented
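A hedged sketch of how these gates could be checked in code; the thresholds mirror the defaults listed above, and the example inputs reuse numbers from the sample run on this page (baseline 0.734 AUC, out-of-sample 0.851):

```python
# Evaluate the listed quality gates against metrics from earlier phases.
import numpy as np

def passes_quality_gates(baseline, cv_scores, test_score, train_score, val_score):
    cv_mean = np.mean(cv_scores)
    gates = {
        "improvement over baseline > 5%": (cv_mean - baseline) / baseline > 0.05,
        "CV variance < 5% of mean": np.std(cv_scores) / cv_mean < 0.05,
        "out-of-sample within 3% of CV": abs(test_score - cv_mean) / cv_mean < 0.03,
        "train/validation gap < 5%": (train_score - val_score) / train_score < 0.05,
    }
    for name, ok in gates.items():
        print(f"{'✅' if ok else '❌'} {name}")
    return all(gates.values())

passes_quality_gates(baseline=0.734,
                     cv_scores=[0.84, 0.85, 0.86, 0.85, 0.84],  # example inputs
                     test_score=0.851, train_score=0.91, val_score=0.88)
```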
Customize workflow variables:

```json
{
"variables": {
"dataset_path": "custom-data.csv",
"target_column": "outcome",
"experiment_name": "custom-experiment",
"model_output_dir": "models/custom/",
"search_iterations": 5,
"refinement_iterations": 10,
"quality_threshold": 0.85,
"max_training_time": 3600
}
}
```

Define custom agents:

```json
{
"agents": [
{
"id": "custom_search_agent",
"type": "researcher",
"config": {
"search_depth": "comprehensive",
"domains": ["arxiv", "papers_with_code", "kaggle"],
"focus_areas": ["time_series", "anomaly_detection"]
}
}
]
}
```

Override task dependencies:

```json
{
"dependencies": {
"foundation_building": ["web_search_phase", "dataset_analysis"],
"ablation_analysis": ["foundation_building"],
"targeted_refinement": ["ablation_analysis"],
"ensemble_creation": ["targeted_refinement"],
"validation_and_debugging": ["ensemble_creation"],
"model_deployment_prep": ["validation_and_debugging"]
}
}
```

| Metric | Traditional ML | MLE-STAR | Improvement |
|---|---|---|---|
| Model Performance | 0.78 AUC | 0.89 AUC | +14.1% |
| Development Time | 2-3 weeks | 2-4 hours | 21x faster |
| Research Quality | Manual/limited | Comprehensive | 5x more approaches |
| Optimization Focus | All components | High-impact only | 3x more efficient |
| Production Readiness | Manual validation | Automated checks | 100% coverage |
| Reproducibility | 60% success | 95% success | +58% |
```
💻 Computational Requirements:
├── CPU: 4+ cores recommended
├── Memory: 8GB+ RAM (16GB for large datasets)
├── Storage: 10GB+ available space
├── Network: Stable internet for web search
└── Time: 2-4 hours (varies by dataset size)

⚡ Performance Characteristics:
├── Scales to datasets: 10K - 10M rows
├── Supports features: 10 - 10K columns
├── Model types: All scikit-learn + deep learning
├── Problem types: Classification, regression, time series
└── Deployment: API, batch, streaming
```
❌ Problem: Web search returns empty results
🔍 Diagnosis: Check internet connection, verify search terms
🔧 Solution:
- Ensure stable internet connection
- Broaden search terms in workflow configuration
- Use `--search-iterations 5` for more attempts

❌ Problem: All baseline models show low performance
🔍 Diagnosis: Data quality or problem complexity issues
🔧 Solution:
- Review dataset analysis output
- Check for data leakage or preprocessing errors
- Consider problem reformulation or additional features

❌ Problem: Targeted refinement doesn't improve performance
🔍 Diagnosis: Already near optimal or wrong components targeted
🔧 Solution:
- Increase --refinement-iterations to 10+
- Review ablation analysis results
- Consider different optimization strategies

❌ Problem: Ensemble AUC lower than individual model performance
🔍 Diagnosis: Models too similar or poor combination strategy
🔧 Solution:
- Increase model diversity in foundation phase
- Try different ensemble strategies (stacking vs voting)
- Check for overfitting in ensemble training

```bash
# Reduce iterations for faster execution
claude-flow automation mle-star \
--dataset data.csv \
--target label \
--claude \
--search-iterations 1 \
--refinement-iterations 3 \
--max-agents 4
```

```bash
# Maximum quality configuration
claude-flow automation mle-star \
--dataset data.csv \
--target label \
--claude \
--search-iterations 7 \
--refinement-iterations 15 \
--max-agents 12 \
--timeout 14400000  # 4 hours
```

Planned enhancements:
- Multi-Modal Support: Text, image, and structured data combination
- AutoML Integration: Automatic architecture search and optimization
- Federated Learning: Distributed training across multiple datasets
- Causal Inference: Integration of causal discovery and inference
- Active Learning: Intelligent data labeling and acquisition
- MLOps Integration: Full CI/CD pipeline for model deployment
- Meta-Learning: Learning to learn from previous MLE-STAR runs
- Neural Architecture Search: Automated deep learning architecture design
- Automated Feature Engineering: AI-driven feature creation and selection
- Explanation Generation: Automated model interpretability and documentation
- Fairness Optimization: Built-in bias detection and mitigation
See also:
- Automation Commands - Complete automation command reference
- Stream Chaining - Real-time agent output piping
- Agent System Overview - Understanding the 64-agent ecosystem
- Non-Interactive Mode - CI/CD and automation setup
- GitHub Integration - Repository management with ML workflows
🧠 MLE-STAR represents the future of machine learning engineering: research-driven, systematically optimized, and production-ready. Transform your ML development process from weeks to hours.
Get started now: `claude-flow automation mle-star --dataset your-data.csv --target your-target --claude`