feat(agents): Create realign-curator agent #302

@akaszubski

Summary

Orchestrates ReAlign curation workflows based on data type, enforces best practices, and validates quality gates.

Agent Workflow

STEP 1: Data Type Detection

→ User request: "Curate DPO data from book.pdf"
→ Detection: Keywords (DPO, preference, SRF, SFT, RLVR, anti-hallucination)
→ Action: Activate corresponding workflow skill
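The keyword detection in Step 1 could be sketched as below. The issue lists the keywords but not the matching rule, so the keyword table, the whole-word matching, and the name `detect_data_type` are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical keyword table: the issue names the keywords, not how they are matched.
KEYWORDS = {
    "dpo": ("dpo", "preference"),
    "srf": ("srf", "sft"),
    "rlvr": ("rlvr",),
    "antihallucination": ("anti-hallucination", "antihallucination"),
}


def detect_data_type(request: str) -> Optional[str]:
    """Return the first data type whose keyword appears as a whole word, else None."""
    text = request.lower()
    for data_type, words in KEYWORDS.items():
        for word in words:
            # \b keeps "dpo" from matching inside a longer token.
            if re.search(rf"\b{re.escape(word)}\b", text):
                return data_type
    return None
```

Returning `None` for an unrecognized request lets the caller ask the user to clarify instead of guessing a workflow.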

STEP 2: Workflow Selection

→ DPO → realign-dpo-workflow (#296)
→ SRF/SFT → realign-srf-workflow (#297)
→ RLVR → realign-rlvr-workflow (#298)
→ Anti-hallucination → realign-antihallucination-workflow (#299)
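The selection in Step 2 is a plain lookup. A minimal sketch follows, assuming the detected data type arrives as a lowercase string; `WORKFLOWS` and `select_workflow` are hypothetical names, while the skill names come from the list above.

```python
# Data type -> workflow skill, taken from the mapping in Step 2.
WORKFLOWS = {
    "dpo": "realign-dpo-workflow",
    "srf": "realign-srf-workflow",
    "sft": "realign-srf-workflow",
    "rlvr": "realign-rlvr-workflow",
    "antihallucination": "realign-antihallucination-workflow",
}


def select_workflow(data_type: str) -> str:
    """Return the workflow skill for a data type, or raise if none is defined."""
    try:
        return WORKFLOWS[data_type.lower()]
    except KeyError:
        raise ValueError(f"no workflow for data type {data_type!r}") from None
```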

STEP 3: Performance Configuration (AUTOMATIC)

→ Detect model size from user request
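→ Select machine: M4 Max (≤30B; preferred for 30-70B unless more than 128GB of memory is needed), M3 Ultra (70-200B), or EXO across both machines (200B+)
→ Set batch size: 32 (M4 Max) or 4 (M3 Ultra)
→ Configure work split: 65/35 (M4 Max/M3 Ultra) if distributed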

STEP 4: Workflow Execution

→ Execute selected workflow step-by-step
→ Validate gates after each step
→ Block if validation fails
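The execute, validate, block loop in Step 4 might look like the sketch below. The step representation (a name, an action, and a gate predicate) and the `GateFailure` exception are assumptions; the issue does not define how workflow steps are modeled.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical step shape: (name, action, gate).
Step = Tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


class GateFailure(RuntimeError):
    """Raised when a validation gate rejects the state produced by a step."""


def run_workflow(steps: Iterable[Step], state: dict) -> dict:
    """Run steps in order, checking each gate before moving to the next step."""
    for name, action, gate in steps:
        state = action(state)
        if not gate(state):
            # Block: stop the workflow instead of passing bad data downstream.
            raise GateFailure(f"gate failed after step {name!r}")
    return state
```

Raising on the first failed gate means no later step ever sees data that did not pass validation.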

STEP 5: Quality Gate Validation

→ Check against realign-quality-gates (#302)
→ Tier: HIGH/MEDIUM/REJECT
→ Action: Accept HIGH, warn MEDIUM, block REJECT
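The tier-to-action rule in Step 5 can be expressed as a small function. This is a sketch only: the thresholds that assign a tier live in realign-quality-gates and are not reproduced here, and `Tier` and `apply_gate` are hypothetical names.

```python
import warnings
from enum import Enum


class Tier(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    REJECT = "REJECT"


def apply_gate(tier: Tier) -> bool:
    """Accept HIGH, warn on MEDIUM, block REJECT. Returns True if the run may continue."""
    if tier is Tier.REJECT:
        raise ValueError("quality tier REJECT: dataset blocked")
    if tier is Tier.MEDIUM:
        warnings.warn("quality tier MEDIUM: review the dataset before use")
    return True
```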

STEP 6: Output Summary

→ Report: data type, workflow used, quality tier, examples processed
→ Log: machine, batch size, time, cost
→ Output: Curated dataset path
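One way to carry the Step 6 report and log fields is a single record serialized to JSON. The field names below mirror the lists above; the `RunSummary` class itself and the choice of JSON are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunSummary:
    # Report fields
    data_type: str
    workflow: str
    quality_tier: str
    examples_processed: int
    # Log fields
    machine: str
    batch_size: int
    seconds: float
    cost: float
    # Output
    output_path: str

    def to_json(self) -> str:
        """Serialize with sorted keys so log lines are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)
```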

Performance Optimization (ENFORCED BY AGENT)

The agent AUTOMATICALLY applies:

  • Machine selection (M4 Max vs M3 Ultra)
  • Batch size (32 vs 4)
  • Work distribution (65/35)
  • Environment variables

Claude does NOT need to remember these - the agent enforces them.

Example Usage

User: "Curate DPO preference pairs from /data/books/ml_textbook.pdf for Qwen 7B"

Agent executes:

  1. Detects: Data type = DPO, Model = Qwen 7B (≤30B)
  2. Activates: realign-dpo-workflow
  3. Configures: M4 Max, batch_size=32
  4. Executes: 6-step DPO workflow with gates
  5. Validates: Quality tier = HIGH (preference_gap=0.22, agreement=92%)
  6. Outputs: /data/curated/dpo_ml_textbook_20260128/train.jsonl

Acceptance Criteria

  • Detects data type from user request
  • Activates correct workflow skill
  • Automatically configures performance settings
  • Enforces quality gates
  • Provides execution summary
  • Keywords: curator, orchestrator, workflow, automation

Performance Optimization (COMPLETE - Claude Always Forgets)

Machine Selection (by model size)

| Model Size | Machine | Reason |
| --- | --- | --- |
| ≤30B | M4 Max | 1.9-5.1x faster (speed) |
| 30-70B | M4 Max preferred | Faster, unless need >128GB memory |
| 70-200B | M3 Ultra | Needs 512GB memory |
| 200B+ | EXO | Distributed across both machines |
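The selection table above can be sketched as a function of parameter count. The table's ranges share their endpoints, so this sketch assumes 70B goes to the M3 Ultra and 200B goes to EXO; that tie-breaking, the `needs_over_128gb` flag, and the function names are assumptions.

```python
from typing import Optional

# Batch sizes from the issue; none is given for EXO, so it is left undefined.
BATCH_SIZE = {"M4 Max": 32, "M3 Ultra": 4}


def select_machine(params_b: float, needs_over_128gb: bool = False) -> str:
    """Pick a machine from model size in billions of parameters."""
    if params_b >= 200:
        return "EXO"
    if params_b >= 70:
        return "M3 Ultra"
    if params_b > 30 and needs_over_128gb:
        # 30-70B: M4 Max is preferred unless the model needs more than 128GB.
        return "M3 Ultra"
    return "M4 Max"


def select_batch_size(machine: str) -> Optional[int]:
    """Return the configured batch size, or None when the issue does not give one."""
    return BATCH_SIZE.get(machine)
```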

Validated Benchmarks

  • M4 Max: 12,956 GFLOPS (324 per core)
  • M3 Ultra: 4,599 GFLOPS (57 per core)
  • Real scoring: M4 Max 3.86 ex/s vs M3 Ultra 0.76 ex/s (5.1x faster)

Batch Size Configuration

→ M4 Max: batch_size=32 (peaks at 776 ex/s)
→ M3 Ultra: batch_size=4 (peaks at 278 ex/s - PEAKS EARLIER, don't increase!)

Work Distribution (65/35 NOT 50/50!)

→ M4 Max: 65.5% of work (M4_RATIO = 0.655)
→ M3 Ultra: 34.5% of work (M3_RATIO = 0.345)
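Applying the ratios to a concrete workload could look like this sketch. `M4_RATIO` and `M3_RATIO` come from the issue; `split_examples` and its rounding rule are assumptions.

```python
M4_RATIO = 0.655
M3_RATIO = 0.345


def split_examples(examples: list) -> tuple:
    """Split a list into (m4_share, m3_share) using the 65.5/34.5 ratio."""
    cut = round(len(examples) * M4_RATIO)
    # Everything after the cut goes to the M3 Ultra, so no example is dropped.
    return examples[:cut], examples[cut:]
```

Slicing at a single cut point guarantees the two shares always add up to the full workload, whatever the rounding does.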

RDMA vs Separate Batches Decision

USE RDMA WHEN:

| Scenario | Why |
| --- | --- |
| Model > 128GB (e.g., 70B+ fp16) | Model doesn't fit on M4 Max alone |
| Model > 512GB (e.g., 405B) | Must shard across both machines |
| Training with gradient sync | Need synchronized weight updates |
| Pipeline parallelism | Layers split across machines |

USE SEPARATE BATCHES WHEN:

| Scenario | Why |
| --- | --- |
| Model fits on one machine | No coordination overhead |
| Independent scoring/eval | Each machine works at own pace |
| Batch inference | Combined throughput = M4 + M3 |

The Math (30B model example):

  • Separate batches: M4 (1.95 ex/s) + M3 (1.03 ex/s) = 2.98 ex/s total
  • RDMA sharding: ~2.5 ex/s (20-30% coordination overhead)
  • Winner: Separate batches (no overhead)
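The arithmetic above, and the reason the split follows throughput rather than 50/50, can be checked with a short sketch. The three rates are the 30B figures quoted above; `makespan_seconds` is a hypothetical helper that assumes both machines start together and the job ends when the slower one finishes.

```python
M4_RATE = 1.95    # ex/s, M4 Max on the 30B example
M3_RATE = 1.03    # ex/s, M3 Ultra on the 30B example
RDMA_RATE = 2.5   # ex/s, approximate RDMA sharding figure from the issue


def makespan_seconds(n_examples: int, m4_fraction: float) -> float:
    """Wall-clock time when the M4 Max takes m4_fraction of the work and the M3 Ultra the rest."""
    m4_time = n_examples * m4_fraction / M4_RATE
    m3_time = n_examples * (1.0 - m4_fraction) / M3_RATE
    return max(m4_time, m3_time)


combined = M4_RATE + M3_RATE            # separate-batch throughput
balanced_fraction = M4_RATE / combined  # share that lets both machines finish together
```

With these rates the balanced share comes out near 0.654, close to the 0.655 used for `M4_RATIO`. For 10,000 examples a 50/50 split is bounded by the M3 Ultra at roughly 4,854 s, against roughly 3,359 s for the 65.5/34.5 split.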

Environment Variables + Priority

export MLX_METAL_PREALLOCATE=1
export MLX_METAL_FAST_SYNCH=1
export TOKENIZERS_PARALLELISM=false
# -E preserves the exported variables; plain sudo resets the environment by default
sudo -E nice -n -10 realign curate ...  # Elevated priority
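If the agent launches the curation command from Python rather than a shell, the same variables can be passed explicitly instead of relying on the parent shell. This is a minimal sketch: the variable names and values are those listed above, and `curation_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Variables and values as listed in the issue.
MLX_ENV = {
    "MLX_METAL_PREALLOCATE": "1",
    "MLX_METAL_FAST_SYNCH": "1",
    "TOKENIZERS_PARALLELISM": "false",
}


def curation_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the base environment with the curation variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(MLX_ENV)
    return env
```

The returned dict can be handed to `subprocess.run(..., env=curation_env())`, which avoids mutating the agent's own process environment.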

Anti-Patterns (DON'T DO THIS!)

❌ Split work 50/50 based on GPU cores (WRONG - wastes days)
❌ "M3 Ultra is faster (more cores)" (FALSE for MLX inference)
❌ Use batch_size=32 on M3 Ultra (peaks at 4, don't go higher)
❌ Use RDMA for models that fit on one machine (adds overhead)

Cost Tracking

→ Log: machine, batch_size, examples, time, cost_per_example
→ M4 Max cheaper per example (5.1x faster = less compute time)
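The cost claim follows from throughput if both machines are charged at the same hourly rate, which is an assumption here because the issue does not state how cost is priced. A sketch of the calculation, with `cost_per_example` as a hypothetical helper:

```python
def cost_per_example(hourly_rate: float, examples_per_second: float) -> float:
    """Cost of one example given a machine's hourly rate and measured throughput."""
    if examples_per_second <= 0:
        raise ValueError("examples_per_second must be positive")
    return hourly_rate / (examples_per_second * 3600.0)


# Scoring throughputs quoted in the benchmarks: 3.86 ex/s (M4 Max), 0.76 ex/s (M3 Ultra).
# The hourly rate of 1.0 is a placeholder; it cancels out of the ratio.
ratio = cost_per_example(1.0, 0.76) / cost_per_example(1.0, 3.86)
```

At equal rates `ratio` equals the throughput ratio 3.86/0.76, about 5.1, so the M3 Ultra costs about five times as much per example on that workload.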

Source: MLX benchmarking 2026-01-28, validated by https://medium.com/@billynewport/apples-m3-ultra-mac-studio-misses-the-mark-for-llm-inference-f57f1f10a56f
