Summary
Orchestrates ReAlign curation workflows based on data type, enforces best practices, and validates quality gates.
Agent Workflow
STEP 1: Data Type Detection
→ User request: "Curate DPO data from book.pdf"
→ Detection: Keywords (DPO, preference, SRF, SFT, RLVR, anti-hallucination)
→ Action: Activate corresponding workflow skill
STEP 2: Workflow Selection
→ DPO → realign-dpo-workflow (#296)
→ SRF/SFT → realign-srf-workflow (#297)
→ RLVR → realign-rlvr-workflow (#298)
→ Anti-hallucination → realign-antihallucination-workflow (#299)
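Steps 1 and 2 can be sketched as a keyword lookup. This is a minimal illustration, not the agent's actual implementation: `detect_workflow` is a hypothetical helper, and routing "preference" to the DPO workflow is an assumption, since the keyword list does not say which workflow each keyword maps to.

```python
import re
from typing import Optional

# Keyword groups -> workflow skill. Skill names come from Step 2;
# mapping "preference" to DPO is an assumption.
WORKFLOWS = [
    (("dpo", "preference"), "realign-dpo-workflow"),
    (("srf", "sft"), "realign-srf-workflow"),
    (("rlvr",), "realign-rlvr-workflow"),
    (("anti-hallucination",), "realign-antihallucination-workflow"),
]


def detect_workflow(request: str) -> Optional[str]:
    """Return the first workflow whose keyword appears as a whole token."""
    text = request.lower()
    for keywords, workflow in WORKFLOWS:
        for kw in keywords:
            # Require non-alphanumeric neighbours so "sft" does not match "softer".
            pattern = rf"(?<![a-z0-9]){re.escape(kw)}(?![a-z0-9])"
            if re.search(pattern, text):
                return workflow
    return None  # no known data type: the caller should ask the user
```

Returning `None` instead of guessing a default leaves the choice to the caller when the request names no known data type.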
STEP 3: Performance Configuration (AUTOMATIC)
→ Detect model size from user request
→ Select machine: M4 Max (≤30B, preferred up to 70B unless >128GB memory is needed), M3 Ultra (70-200B), or EXO across both machines (200B+)
→ Set batch size: 32 (M4) or 4 (M3)
→ Configure work split: 65/35 if distributed
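"Detect model size from user request" could be done with a small regular expression. The sketch below is an assumption about how the size appears in requests (a number followed by `B`, as in "Qwen 7B"); `parse_model_size_b` is a hypothetical name.

```python
import re
from typing import Optional

# A number followed by "B", not glued to a preceding word or version dot,
# so "Llama 3.1 70B" yields 70 rather than 3.1 or 1.
_SIZE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*B\b", re.IGNORECASE)


def parse_model_size_b(request: str) -> Optional[float]:
    """Return the model size in billions of parameters, or None if absent."""
    match = _SIZE.search(request)
    return float(match.group(1)) if match else None
```

For the request in Example Usage this returns `7.0`. A request with no size returns `None`, which the agent would have to resolve before choosing a machine.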
STEP 4: Workflow Execution
→ Execute selected workflow step-by-step
→ Validate gates after each step
→ Block if validation fails
STEP 5: Quality Gate Validation
→ Check against realign-quality-gates (#302)
→ Tier: HIGH/MEDIUM/REJECT
→ Action: Accept HIGH, warn MEDIUM, block REJECT
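The accept/warn/block rule in Step 5 can be sketched as below. The tier thresholds live in realign-quality-gates (#302) and are not reproduced here. `Tier`, `GateBlocked` and `apply_gate` are hypothetical names used only for illustration.

```python
import warnings
from enum import Enum


class Tier(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    REJECT = "REJECT"


class GateBlocked(RuntimeError):
    """Raised when a step lands in the REJECT tier."""


def apply_gate(step: str, tier: Tier) -> bool:
    """Accept HIGH, warn on MEDIUM, block (raise) on REJECT."""
    if tier is Tier.REJECT:
        raise GateBlocked(f"{step}: quality tier REJECT, workflow blocked")
    if tier is Tier.MEDIUM:
        warnings.warn(f"{step}: quality tier MEDIUM, review before use")
    return True
```

Raising an exception on REJECT stops the workflow unless the caller handles it explicitly, which matches "Block if validation fails" in Step 4.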
STEP 6: Output Summary
→ Report: data type, workflow used, quality tier, examples processed
→ Log: machine, batch size, time, cost
→ Output: Curated dataset path
Performance Optimization (ENFORCED BY AGENT)
The agent AUTOMATICALLY applies:
- Machine selection (M4 Max vs M3 Ultra)
- Batch size (32 vs 4)
- Work distribution (65/35)
- Environment variables
Claude does NOT need to remember these - the agent enforces them.
Example Usage
User: "Curate DPO preference pairs from /data/books/ml_textbook.pdf for Qwen 7B"
Agent executes:
- Detects: Data type = DPO, Model = Qwen 7B (≤30B)
- Activates: realign-dpo-workflow
- Configures: M4 Max, batch_size=32
- Executes: 6-step DPO workflow with gates
- Validates: Quality tier = HIGH (preference_gap=0.22, agreement=92%)
- Outputs: /data/curated/dpo_ml_textbook_20260128/train.jsonl
Acceptance Criteria
Performance Optimization (COMPLETE - Claude Always Forgets)
Machine Selection (by model size)
| Model Size | Machine | Reason |
|---|---|---|
| ≤30B | M4 Max | 1.9-5.1x faster (speed) |
| 30-70B | M4 Max preferred | Faster, unless need >128GB memory |
| 70-200B | M3 Ultra | Needs 512GB memory |
| 200B+ | EXO | Distributed across both machines |
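The machine selection table can be read as a small decision function. This is a sketch under stated assumptions: `select_machine` is a hypothetical helper, exactly 70B goes to M3 Ultra and exactly 200B goes to EXO, and a 30-70B model that needs more than 128GB is assumed to fall back to M3 Ultra.

```python
def select_machine(params_b: float, memory_gb: float = 0.0) -> str:
    """Pick a machine from model size (billions of params) and memory need (GB)."""
    if params_b >= 200:
        return "EXO"  # distributed across both machines
    if params_b >= 70:
        return "M3 Ultra"  # needs the 512GB machine
    if params_b > 30 and memory_gb > 128:
        return "M3 Ultra"  # assumed fallback for the 30-70B row
    return "M4 Max"  # faster whenever the model fits
```

For the Qwen 7B request in Example Usage, `select_machine(7)` returns `"M4 Max"`.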
Validated Benchmarks
- M4 Max: 12,956 GFLOPS (324 per core)
- M3 Ultra: 4,599 GFLOPS (57 per core)
- Real scoring: M4 Max 3.86 ex/s vs M3 Ultra 0.76 ex/s (5.1x faster)
Batch Size Configuration
→ M4 Max: batch_size=32 (peaks at 776 ex/s)
→ M3 Ultra: batch_size=4 (peaks at 278 ex/s - PEAKS EARLIER, don't increase!)
Work Distribution (65/35 NOT 50/50!)
→ M4 Max: 65.5% of work (M4_RATIO = 0.655)
→ M3 Ultra: 34.5% of work (M3_RATIO = 0.345)
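A minimal sketch of the 65/35 split, using the ratios and batch sizes stated above. `split_examples` is a hypothetical helper. A contiguous slice is only one way to divide the work, and the real agent may shard differently.

```python
M4_RATIO = 0.655
M3_RATIO = 0.345
BATCH_SIZE = {"M4 Max": 32, "M3 Ultra": 4}


def split_examples(examples: list, m4_ratio: float = M4_RATIO):
    """Give the first m4_ratio share to M4 Max and the remainder to M3 Ultra."""
    cut = round(len(examples) * m4_ratio)
    return examples[:cut], examples[cut:]
```

With 1,000 examples this gives 655 to M4 Max and 345 to M3 Ultra. Each machine then processes its share with its own batch size from `BATCH_SIZE`.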
RDMA vs Separate Batches Decision
USE RDMA WHEN:
| Scenario | Why |
|---|---|
| Model > 128GB (e.g., 70B+ fp16) | Model doesn't fit on M4 Max alone |
| Model > 512GB (e.g., 405B) | Must shard across both machines |
| Training with gradient sync | Need synchronized weight updates |
| Pipeline parallelism | Layers split across machines |
USE SEPARATE BATCHES WHEN:
| Scenario | Why |
|---|---|
| Model fits on one machine | No coordination overhead |
| Independent scoring/eval | Each machine works at own pace |
| Batch inference | Combined throughput = M4 + M3 |
The Math (30B model example):
- Separate batches: M4 (1.95 ex/s) + M3 (1.03 ex/s) = 2.98 ex/s total
- RDMA sharding: ~2.5 ex/s (20-30% coordination overhead)
- Winner: Separate batches (no overhead)
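The arithmetic above can be checked directly. The throughput figures are the ones quoted for the 30B example. The `hours` helper and the 100,000-example job size are illustrative assumptions.

```python
M4_EX_S = 1.95    # M4 Max, 30B model, examples per second
M3_EX_S = 1.03    # M3 Ultra, 30B model
RDMA_EX_S = 2.5   # approximate RDMA-sharded throughput quoted above

# Separate batches: each machine runs independently, so throughputs add.
separate_ex_s = M4_EX_S + M3_EX_S

# Share of work that keeps both machines busy for the same wall time.
m4_share = M4_EX_S / separate_ex_s


def hours(n_examples: int, ex_per_s: float) -> float:
    """Wall-clock hours to process n_examples at a given throughput."""
    return n_examples / ex_per_s / 3600
```

`separate_ex_s` comes to 2.98 ex/s and `m4_share` to about 0.654, close to the 65/35 work split. For 100,000 examples, separate batches take roughly 9.3 hours and RDMA sharding about 11.1 hours.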
Environment Variables + Priority
export MLX_METAL_PREALLOCATE=1
export MLX_METAL_FAST_SYNCH=1
export TOKENIZERS_PARALLELISM=false
sudo nice -n -10 realign curate ... # Elevated priority
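If the curation command is launched from Python rather than a shell, the same variables can be passed to the child process. This is a sketch only: `build_env` is a hypothetical helper, and the `realign` command line is not reproduced because its arguments are not given here.

```python
import os

# Same three settings as the export lines above.
MLX_ENV = {
    "MLX_METAL_PREALLOCATE": "1",
    "MLX_METAL_FAST_SYNCH": "1",
    "TOKENIZERS_PARALLELISM": "false",
}


def build_env(base=None) -> dict:
    """Copy the base environment (default: os.environ) and apply MLX_ENV."""
    env = dict(os.environ if base is None else base)
    env.update(MLX_ENV)
    return env
```

The result can be passed as `env=build_env()` to `subprocess.run`. The elevated priority still has to come from the `sudo nice -n -10` wrapper, since lowering niceness below zero requires root.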
Anti-Patterns (DON'T DO THIS!)
❌ Split work 50/50 based on GPU cores (WRONG - wastes days)
❌ "M3 Ultra is faster (more cores)" (FALSE for MLX inference)
❌ Use batch_size=32 on M3 Ultra (peaks at 4, don't go higher)
❌ Use RDMA for models that fit on one machine (adds overhead)
Cost Tracking
→ Log: machine, batch_size, examples, time, cost_per_example
→ M4 Max is cheaper per example (up to 5.1x faster, so less compute time per example)
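A possible shape for the log record, using the field names listed above. How cost is computed is not specified, so the `hourly_rate` parameter (cost of one machine-hour) is an assumption, and `RunLog`, `make_log` and `to_json_line` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunLog:
    machine: str
    batch_size: int
    examples: int
    time_s: float
    cost_per_example: float


def make_log(machine: str, batch_size: int, examples: int,
             time_s: float, hourly_rate: float) -> RunLog:
    """Derive cost_per_example from wall time and an assumed hourly rate."""
    if examples <= 0:
        raise ValueError("examples must be positive")
    cost = hourly_rate * (time_s / 3600) / examples
    return RunLog(machine, batch_size, examples, time_s, cost)


def to_json_line(log: RunLog) -> str:
    """Serialize one record as a JSON line for an append-only log file."""
    return json.dumps(asdict(log))
```

At a fixed hourly rate, cost per example scales inversely with throughput, which is why the faster machine comes out cheaper.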
Source: MLX benchmarking 2026-01-28, validated by https://medium.com/@billynewport/apples-m3-ultra-mac-studio-misses-the-mark-for-llm-inference-f57f1f10a56f