EV = p(successful accepted record) × magnitude of improvement ÷ time-to-evidence
Not beauty. Not novelty. Not "the model likes it."
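A minimal sketch of the EV formula as a ranking function. The function name, the argument names, and the units (BPB for magnitude, hours for time-to-evidence) are assumptions for illustration, not part of the rule.

```python
def expected_value(p_success: float, improvement_bpb: float, hours_to_evidence: float) -> float:
    """EV = p(successful accepted record) x magnitude of improvement / time-to-evidence.

    Units are an assumption: improvement in BPB, time in hours.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability in [0, 1]")
    if hours_to_evidence <= 0:
        raise ValueError("hours_to_evidence must be positive")
    return p_success * improvement_bpb / hours_to_evidence


def rank_candidates(candidates):
    """Sort (name, p_success, improvement_bpb, hours_to_evidence) tuples by EV, best first."""
    return sorted(candidates, key=lambda c: expected_value(c[1], c[2], c[3]), reverse=True)
```

Only the ordering matters. A small, likely, fast result can outrank a large, unlikely, slow one.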
A reproduction must land within a tight band of the claimed score (±0.001 BPB) or be discarded as "not understood."
If training step-time increases materially (>5%), the candidate must pay for it with a proportional BPB gain immediately. At 86 ms/step, 1 ms of overhead ≈ 0.006 BPB penalty.
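The step-time rule can be sketched as a check. This assumes the 0.006 BPB per 1 ms figure scales linearly with overhead, which the rule does not state. It also assumes "gain" means parent BPB minus candidate BPB. All names are hypothetical.

```python
BASELINE_STEP_MS = 86.0      # step time quoted in the rule
MATERIAL_SLOWDOWN = 0.05     # ">5%" threshold from the rule
BPB_PENALTY_PER_MS = 0.006   # quoted for 1 ms; linear scaling is an assumption


def required_bpb_gain(step_ms: float, baseline_ms: float = BASELINE_STEP_MS) -> float:
    """BPB gain needed to offset the step-time overhead (0 if there is none)."""
    overhead_ms = max(0.0, step_ms - baseline_ms)
    return overhead_ms * BPB_PENALTY_PER_MS


def pays_for_slowdown(step_ms: float, bpb_gain: float,
                      baseline_ms: float = BASELINE_STEP_MS) -> bool:
    """True if the slowdown is immaterial (<=5%) or the BPB gain covers the penalty."""
    slowdown = (step_ms - baseline_ms) / baseline_ms
    if slowdown <= MATERIAL_SLOWDOWN:
        return True
    return bpb_gain >= required_bpb_gain(step_ms, baseline_ms)
```

With the 86 ms baseline, the 5% threshold sits at 90.3 ms/step.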
If bytes increase, the agent must explicitly state where they are recovered. No handwaving. Budget ceiling: 15.9 MB with margin.
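One way to make the byte accounting explicit is a per-component ledger. This is a sketch, assuming "MB" means 10^6 bytes. The ledger format and all names are hypothetical.

```python
BUDGET_CEILING_BYTES = 15_900_000  # 15.9 MB; decimal megabytes is an assumption


def account_bytes(parent_bytes: int, candidate_bytes: int, ledger: dict) -> dict:
    """Compare the observed artifact delta against a component -> delta_bytes ledger.

    Positive ledger entries are bytes added, negative entries are bytes recovered.
    """
    observed = candidate_bytes - parent_bytes
    explained = sum(ledger.values())
    added = sum(d for d in ledger.values() if d > 0)
    recovered = -sum(d for d in ledger.values() if d < 0)
    return {
        "observed_delta": observed,
        "unexplained": observed - explained,   # nonzero means handwaving
        "added": added,
        "recovered": recovered,
        "headroom": BUDGET_CEILING_BYTES - candidate_bytes,
        "ok": observed == explained and candidate_bytes <= BUDGET_CEILING_BYTES,
    }
```

A branch whose ledger does not sum to the observed delta fails the check even if it is under the ceiling.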
A single-seed run must beat its parent by ≥0.001 BPB before 3-seed compute is spent. Additionally:
- Artifact under budget with margin
- No legality issue
- No significant eval-time blowup
- No measurable training instability
- No multi-variable experiments unless reproductions are done
- One hypothesis per branch
- No branch survives without exact runtime and artifact accounting
- No TTT branch survives unless the log proves score-first legality
- No 3-seed run unless single-seed evidence crosses promotion threshold
- No "clever rewrite" of the whole stack without explicit CEO approval
- Every branch must be PR-packagable at all times
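The single-seed promotion criteria (≥0.001 BPB over parent, artifact under budget, no legality issue, no eval-time blowup, no training instability) can be sketched as a gate that lists blockers. The record fields and function name are hypothetical. The boolean flags stand in for judgments the rules leave to the reviewer.

```python
from dataclasses import dataclass
from typing import List

PROMOTION_MARGIN_BPB = 0.001  # single-seed must beat parent by at least this much


@dataclass
class SingleSeedResult:
    parent_bpb: float
    candidate_bpb: float
    artifact_under_budget: bool
    legality_issue: bool
    eval_time_blowup: bool
    training_unstable: bool


def promotion_blockers(r: SingleSeedResult) -> List[str]:
    """Return the reasons a 3-seed run is not yet justified (empty list = promote)."""
    blockers = []
    gain = r.parent_bpb - r.candidate_bpb  # lower BPB is better
    if gain < PROMOTION_MARGIN_BPB:
        blockers.append(f"gain {gain:.4f} BPB is below {PROMOTION_MARGIN_BPB} BPB")
    if not r.artifact_under_budget:
        blockers.append("artifact not under budget with margin")
    if r.legality_issue:
        blockers.append("legality issue")
    if r.eval_time_blowup:
        blockers.append("significant eval-time blowup")
    if r.training_unstable:
        blockers.append("measurable training instability")
    return blockers
```

Returning the full list rather than a bare boolean keeps the kill-or-revise decision auditable.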
trackA/gepa-legal-ttt
trackA/gepa-legal-ttt-adamw
trackB/std414-repro
trackB/std414-legal-ttt
trackC/microdelta-bigramhash-16k
repro/pr414
repro/pr505
repro/pr508
Every run must produce a structured report in 09_RESULTS/:
hypothesis: <one line>
parent_branch: <branch name>
exact_diff: <file:lines changed>
train_step_time_ms: <number>
eval_time_s: <number>
artifact_bytes: <number>
pre_quant_bpb: <number>
post_quant_bpb: <number>
legality_risk: none | low | high
recommendation: promote | kill | revise
notes: <free text>
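A minimal sketch of a validator that renders a report in the template above. The function name and validation rules are assumptions. Writing the result into 09_RESULTS/ is left to the caller.

```python
FIELDS = [
    "hypothesis", "parent_branch", "exact_diff", "train_step_time_ms",
    "eval_time_s", "artifact_bytes", "pre_quant_bpb", "post_quant_bpb",
    "legality_risk", "recommendation", "notes",
]
NUMERIC_FIELDS = {
    "train_step_time_ms", "eval_time_s", "artifact_bytes",
    "pre_quant_bpb", "post_quant_bpb",
}
ENUM_FIELDS = {
    "legality_risk": {"none", "low", "high"},
    "recommendation": {"promote", "kill", "revise"},
}


def render_report(report: dict) -> str:
    """Validate a report dict and render it as 'key: value' lines in template order."""
    missing = [f for f in FIELDS if f not in report]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for f in NUMERIC_FIELDS:
        v = report[f]
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"{f} must be a number, got {v!r}")
    for f, allowed in ENUM_FIELDS.items():
        if report[f] not in allowed:
            raise ValueError(f"{f} must be one of {sorted(allowed)}")
    if "\n" in str(report["hypothesis"]):
        raise ValueError("hypothesis must be one line")
    return "\n".join(f"{f}: {report[f]}" for f in FIELDS) + "\n"
```

Rejecting a report with a missing number enforces the rule that no branch survives without exact runtime and artifact accounting.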