feat(agents): Create realign-curator agent #302

## Summary

Orchestrates ReAlign curation workflows based on data type, enforces best practices, and validates quality gates.

## Agent Workflow

### STEP 1: Data Type Detection
→ User request: "Curate DPO data from book.pdf"
→ Detection: Keywords (DPO, preference, SRF, SFT, RLVR, anti-hallucination)
→ Action: Activate corresponding workflow skill
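A minimal sketch of the keyword detection, assuming plain case-insensitive whole-word matching. The function name `detect_data_type` and the grouping of keywords into data types are illustrative assumptions, not an existing ReAlign API.

```python
import re
from typing import Optional

# Keywords are taken from the list above; grouping them under these
# data-type labels is an assumption made for this sketch.
KEYWORDS = {
    "dpo": ("dpo", "preference"),
    "srf": ("srf", "sft"),
    "rlvr": ("rlvr",),
    "anti-hallucination": ("anti-hallucination",),
}


def detect_data_type(request: str) -> Optional[str]:
    """Return the first data type whose keyword appears as a whole word."""
    text = request.lower()
    for data_type, words in KEYWORDS.items():
        for word in words:
            # Reject matches glued to other word characters or hyphens.
            pattern = rf"(?<![\w-]){re.escape(word)}(?![\w-])"
            if re.search(pattern, text):
                return data_type
    return None
```

With this table, `detect_data_type("Curate DPO data from book.pdf")` returns `"dpo"`, and a request with no keyword returns `None`, so the agent can ask the user instead of guessing.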

### STEP 2: Workflow Selection
→ DPO → realign-dpo-workflow (#296)
→ SRF/SFT → realign-srf-workflow (#297)
→ RLVR → realign-rlvr-workflow (#298)
→ Anti-hallucination → realign-antihallucination-workflow (#299)
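The mapping above could be kept as a single lookup table. This is a sketch only; `WORKFLOWS` and `select_workflow` are hypothetical names, while the skill names are the ones listed above.

```python
# Data type -> workflow skill, as listed in STEP 2.
WORKFLOWS = {
    "dpo": "realign-dpo-workflow",
    "srf": "realign-srf-workflow",
    "sft": "realign-srf-workflow",
    "rlvr": "realign-rlvr-workflow",
    "anti-hallucination": "realign-antihallucination-workflow",
}


def select_workflow(data_type: str) -> str:
    """Look up the workflow skill; fail loudly on unknown data types."""
    key = data_type.strip().lower()
    if key not in WORKFLOWS:
        raise ValueError(f"no workflow for data type {data_type!r}")
    return WORKFLOWS[key]
```

Raising on an unknown type (rather than falling back to a default) keeps a mis-detected request from silently running the wrong workflow.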

### STEP 3: Performance Configuration (AUTOMATIC)
→ Detect model size from user request
→ Select machine: M4 Max (≤30B, and preferred for 30-70B), M3 Ultra (70-200B), or EXO (200B+)
→ Set batch size: 32 (M4 Max) or 4 (M3 Ultra)
→ Configure work split: 65/35 (M4 Max/M3 Ultra) if distributed
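One possible way to detect the model size from the request is a regex for a number followed by `B`. This is a sketch under that assumption; `parse_model_size_b` is a hypothetical helper, and requests that name a model without a size would need a different lookup.

```python
import re
from typing import Optional

# A number (optionally decimal) followed by "B"/"b", e.g. "7B", "1.5b", "70 B".
# The lookbehind avoids picking up the tail of a version such as "3.1".
_SIZE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*[bB]\b")


def parse_model_size_b(request: str) -> Optional[float]:
    """Return the model size in billions of parameters, or None if absent."""
    match = _SIZE.search(request)
    return float(match.group(1)) if match else None
```

For the request in Example Usage ("... for Qwen 7B") this yields `7.0`.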

### STEP 4: Workflow Execution
→ Execute selected workflow step-by-step
→ Validate gates after each step
→ Block if validation fails
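The execute/validate/block loop can be sketched as follows. The step and gate callables are placeholders; the real workflow steps are defined by the workflow skills, not here.

```python
from typing import Callable, Iterable, Tuple


class GateFailure(RuntimeError):
    """Raised when a step's validation gate rejects its output."""


# (step name, step function, gate function) -- hypothetical shape.
Step = Tuple[str, Callable[[object], object], Callable[[object], bool]]


def run_workflow(steps: Iterable[Step], data: object) -> object:
    """Run steps in order, validating after each and stopping on failure."""
    for name, run, gate in steps:
        data = run(data)
        if not gate(data):
            # Block: later steps never see output that failed its gate.
            raise GateFailure(f"gate failed after step {name!r}")
    return data
```

Raising an exception (instead of returning a flag) makes it hard for a caller to ignore a failed gate and continue.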

### STEP 5: Quality Gate Validation
→ Check against realign-quality-gates (#302)
→ Tier: HIGH/MEDIUM/REJECT
→ Action: Accept HIGH, warn MEDIUM, block REJECT
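A sketch of the accept/warn/block policy. How a tier is computed from metrics is defined by realign-quality-gates and is not reproduced here; `Tier`, `QualityGateError`, and `apply_quality_gate` are hypothetical names.

```python
import warnings
from enum import Enum


class Tier(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    REJECT = "REJECT"


class QualityGateError(RuntimeError):
    """Raised when the dataset lands in the REJECT tier."""


def apply_quality_gate(tier: Tier) -> bool:
    """Return True if the dataset may be written; raise on REJECT."""
    if tier is Tier.REJECT:
        raise QualityGateError("quality tier REJECT: dataset blocked")
    if tier is Tier.MEDIUM:
        warnings.warn("quality tier MEDIUM: review before training")
    return True
```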

### STEP 6: Output Summary
→ Report: data type, workflow used, quality tier, examples processed
→ Log: machine, batch size, time, cost
→ Output: Curated dataset path
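The summary fields above could be carried in one record and serialized for logging. This is only a sketch: the field names and the JSON format are assumptions, not a defined ReAlign schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunSummary:
    # Report fields
    data_type: str
    workflow: str
    quality_tier: str
    examples_processed: int
    # Log fields
    machine: str
    batch_size: int
    elapsed_seconds: float
    cost: float
    # Output
    output_path: str

    def to_json(self) -> str:
        """Serialize with sorted keys so log lines are stable to diff."""
        return json.dumps(asdict(self), sort_keys=True)
```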

## Performance Optimization (ENFORCED BY AGENT)

The agent AUTOMATICALLY applies:
- Machine selection (M4 Max vs M3 Ultra)
- Batch size (32 vs 4)
- Work distribution (65/35)
- Environment variables

**Claude does NOT need to remember these** - the agent enforces them.

## Example Usage

User: "Curate DPO preference pairs from /data/books/ml_textbook.pdf for Qwen 7B"

Agent executes:
1. Detects: Data type = DPO, Model = Qwen 7B (≤30B)
2. Activates: realign-dpo-workflow
3. Configures: M4 Max, batch_size=32
4. Executes: 6-step DPO workflow with gates
5. Validates: Quality tier = HIGH (preference_gap=0.22, agreement=92%)
6. Outputs: /data/curated/dpo_ml_textbook_20260128/train.jsonl

## Acceptance Criteria
- [ ] Detects data type from user request
- [ ] Activates correct workflow skill
- [ ] Automatically configures performance settings
- [ ] Enforces quality gates
- [ ] Provides execution summary
- [ ] Keywords: curator, orchestrator, workflow, automation



## Performance Optimization (COMPLETE - Claude Always Forgets)

### Machine Selection (by model size)
| Model Size | Machine  | Reason |
|------------|----------|--------|
| ≤30B       | M4 Max   | 1.9-5.1x faster |
| 30-70B     | M4 Max preferred | Faster, unless the model needs >128GB memory |
| 70-200B    | M3 Ultra | Needs 512GB memory |
| 200B+      | EXO      | Distributed across both machines |
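The table can be sketched as a selection function. The table's ranges overlap at exactly 70B and 200B; this sketch assumes those boundaries go to the larger tier (70B to M3 Ultra, 200B to EXO). `select_machine` and its parameters are hypothetical names.

```python
from typing import Optional


def select_machine(params_b: float, memory_gb: Optional[float] = None) -> str:
    """Pick a machine from model size (billions of parameters).

    memory_gb is the estimated memory the model needs, if known.
    """
    if params_b <= 30:
        return "M4 Max"
    if params_b < 70:
        # M4 Max preferred unless the model needs more than 128GB.
        if memory_gb is not None and memory_gb > 128:
            return "M3 Ultra"
        return "M4 Max"
    if params_b < 200:
        return "M3 Ultra"
    return "EXO"
```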

### Validated Benchmarks
- M4 Max: 12,956 GFLOPS (324 per core)
- M3 Ultra: 4,599 GFLOPS (57 per core)
- Real scoring: M4 Max 3.86 ex/s vs M3 Ultra 0.76 ex/s (5.1x faster)
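As a quick consistency check on these figures, the arithmetic below recomputes the implied core counts and the scoring speedup from the numbers quoted above. Nothing here is a new measurement.

```python
# Figures quoted in the benchmark list above.
m4_gflops, m4_per_core = 12_956, 324
m3_gflops, m3_per_core = 4_599, 57
m4_scoring, m3_scoring = 3.86, 0.76  # examples per second

# Total GFLOPS divided by per-core GFLOPS gives the implied core count.
m4_cores = m4_gflops / m4_per_core
m3_cores = m3_gflops / m3_per_core

# Ratio of real scoring throughput.
speedup = m4_scoring / m3_scoring

print(f"implied cores: M4 Max {m4_cores:.1f}, M3 Ultra {m3_cores:.1f}")
print(f"scoring speedup: {speedup:.2f}x")
```

The point of the comparison: the M3 Ultra has roughly twice the implied cores but far lower per-core throughput, which is why core count is a poor basis for splitting work.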

### Batch Size Configuration
→ M4 Max: batch_size=32 (peaks at 776 ex/s)
→ M3 Ultra: batch_size=4 (peaks at 278 ex/s - PEAKS EARLIER, don't increase!)

### Work Distribution (65/35 NOT 50/50!)
→ M4 Max: 65.5% of work (M4_RATIO = 0.655)
→ M3 Ultra: 34.5% of work (M3_RATIO = 0.345)
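A minimal sketch of applying the split to a dataset of `total` examples. `split_examples` is a hypothetical helper; the constants are the ones given above.

```python
from typing import Tuple

M4_RATIO = 0.655
M3_RATIO = 0.345


def split_examples(total: int) -> Tuple[int, int]:
    """Return (m4_count, m3_count) for a dataset of `total` examples."""
    if total < 0:
        raise ValueError("total must be non-negative")
    m4_count = round(total * M4_RATIO)
    # Give the remainder to the M3 Ultra so the two counts always sum to total.
    return m4_count, total - m4_count
```

Computing the M3 Ultra share as the remainder avoids losing or duplicating an example to rounding.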

### RDMA vs Separate Batches Decision

**USE RDMA WHEN:**
| Scenario | Why |
|----------|-----|
| Model > 128GB (e.g., 70B+ fp16) | Model doesn't fit on M4 Max alone |
| Model > 512GB (e.g., 405B) | Must shard across both machines |
| Training with gradient sync | Need synchronized weight updates |
| Pipeline parallelism | Layers split across machines |

**USE SEPARATE BATCHES WHEN:**
| Scenario | Why |
|----------|-----|
| Model fits on one machine | No coordination overhead |
| Independent scoring/eval | Each machine works at own pace |
| Batch inference | Combined throughput = M4 + M3 |

**The Math (30B model example):**
- Separate batches: M4 (1.95 ex/s) + M3 (1.03 ex/s) = 2.98 ex/s total
- RDMA sharding: ~2.5 ex/s (20-30% coordination overhead)
- **Winner**: Separate batches (no overhead)
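The same arithmetic as a short sketch, using only the 30B figures quoted above. It also shows where the 65/35 split comes from: giving each machine work in proportion to its throughput lets both finish at about the same time.

```python
# Throughput figures (examples per second) from the 30B example above.
M4_EPS = 1.95
M3_EPS = 1.03
RDMA_EPS = 2.5  # approximate, as quoted above


def combined_throughput(m4_eps: float, m3_eps: float) -> float:
    """Separate batches: the machines run independently, so rates add."""
    return m4_eps + m3_eps


def proportional_share(m4_eps: float, m3_eps: float) -> float:
    """Fraction of work for the M4 Max so both machines finish together."""
    return m4_eps / (m4_eps + m3_eps)


separate = combined_throughput(M4_EPS, M3_EPS)
winner = "separate batches" if separate > RDMA_EPS else "RDMA sharding"
print(f"separate: {separate:.2f} ex/s, RDMA: ~{RDMA_EPS} ex/s -> {winner}")
print(f"M4 Max share: {proportional_share(M4_EPS, M3_EPS):.3f}")
```

The proportional share works out to about 0.654, in line with the 0.655 ratio used for work distribution.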

### Environment Variables + Priority
```bash
export MLX_METAL_PREALLOCATE=1
export MLX_METAL_FAST_SYNCH=1
export TOKENIZERS_PARALLELISM=false
# -E preserves the exported variables; sudo resets the environment by default
sudo -E nice -n -10 realign curate ...  # Elevated priority
```

### Anti-Patterns (DON'T DO THIS!)
❌ Split work 50/50 based on GPU cores (WRONG - wastes days)
❌ "M3 Ultra is faster (more cores)" (FALSE for MLX inference)
❌ Use batch_size=32 on M3 Ultra (peaks at 4, don't go higher)
❌ Use RDMA for models that fit on one machine (adds overhead)

### Cost Tracking
→ Log: machine, batch_size, examples, time, cost_per_example
→ M4 Max cheaper per example (5.1x faster = less compute time)
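A sketch of how `cost_per_example` could be derived for the log. The issue does not define a cost model, so the `hourly_rate` parameter (cost of one machine-hour) is an assumption.

```python
def cost_per_example(examples: int, elapsed_seconds: float, hourly_rate: float) -> float:
    """Cost of one example, given run time and an assumed machine-hour rate."""
    if examples <= 0:
        raise ValueError("examples must be positive")
    hours = elapsed_seconds / 3600
    return hourly_rate * hours / examples
```

At the same hourly rate, a machine that scores 5.1x faster spends 1/5.1 of the time on each example, so its cost per example is lower by the same factor.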

**Source**: MLX benchmarking 2026-01-28, validated by https://medium.com/@billynewport/apples-m3-ultra-mac-studio-misses-the-mark-for-llm-inference-f57f1f10a56f

