Status: PAUSED (2026-06-12), resumable
The internal SWE-bench Pro 119-instance evaluation run is paused partway with a fully reboot-proof resume bundle. This issue is the public mirror-of-record so the work can be resumed cold later.
Where everything lives (gitignored, internal-only)
- Resume bundle: internal/swebench-119-resume/ (self-contained: harness scripts + ledger + patches + dataset/helper_code + official eval script).
- Full resume instructions: internal/swebench-119-resume/RESUME.md - the single authoritative doc (exact resume command, prerequisites, hardcoded paths, pacing rule, grading procedure, cost decision).
- Survival backup: internal/swebench-119-ledger-backup/ - auto-refreshed each window close.
Why paused
Running the remaining instances is projected to push total spend over the host cap, and the run follows a deliberately paced cadence (spend is recorded per window) rather than a single burst. Pausing and documenting is the correct call rather than overshooting the cap.
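As a rough illustration of the pause decision, the check can be sketched as a linear extrapolation of spend so far against the cap. This is a minimal sketch, not the harness's actual logic: the function names, the average-cost assumption, and the dollar figures are all hypothetical placeholders (real spend numbers are internal). Only the instance count of 119 comes from this issue.

```python
def projected_total(spent_so_far: float, done: int, total: int) -> float:
    """Extrapolate total spend, assuming each remaining instance costs
    the average of the instances completed so far (an assumption)."""
    if done <= 0:
        raise ValueError("need at least one completed instance to extrapolate")
    average_cost = spent_so_far / done
    return spent_so_far + average_cost * (total - done)


def should_pause(spent_so_far: float, done: int, total: int, cap: float) -> bool:
    """Return True when the projected total would exceed the cap."""
    return projected_total(spent_so_far, done, total) > cap


if __name__ == "__main__":
    # Placeholder figures for illustration only; not real run data.
    print(should_pause(spent_so_far=50.0, done=50, total=119, cap=100.0))
```

A per-window cadence would run a check like this at each window close, before starting the next batch.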
To resume
Follow internal/swebench-119-resume/RESUME.md -> "TL;DR RESUME COMMAND". The driver is idempotent (it skips already-recorded instances), so resuming never double-charges. A cost decision (raise cap / partial-N / keep paced) is required before resuming to completion; see the doc.
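For readers without access to the internal bundle, the idempotency property can be sketched as follows. This is an illustrative sketch only, not the actual driver: the JSON-lines ledger format, the `instance_id` field, and the names `load_recorded`, `run_pending`, and `run_one` are assumptions made for the example.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable


def load_recorded(ledger_path: Path) -> set[str]:
    """Return the instance IDs already present in the ledger (hypothetical JSONL format)."""
    if not ledger_path.exists():
        return set()
    recorded = set()
    with ledger_path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                recorded.add(json.loads(line)["instance_id"])
    return recorded


def run_pending(
    instance_ids: Iterable[str],
    ledger_path: Path,
    run_one: Callable[[str], str],
) -> list[str]:
    """Run only instances missing from the ledger; record each one as it finishes."""
    recorded = load_recorded(ledger_path)
    ran = []
    with ledger_path.open("a") as f:
        for instance_id in instance_ids:
            if instance_id in recorded:
                continue  # already recorded (and paid for): skip
            result = run_one(instance_id)
            f.write(json.dumps({"instance_id": instance_id, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to persist the record before moving on
            recorded.add(instance_id)
            ran.append(instance_id)
    return ran
```

Because each record is appended and synced right after its instance finishes, an interruption in this sketch leaves at most the in-flight instance unrecorded; rerunning with the same ID list picks up from there.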
Reboot-proofing
Two non-tmp anchors under internal/ survive a /tmp wipe. The large per-instance work dirs in /tmp are disposable (re-extracted per instance).
Benchmark numbers and provenance are INTERNAL pending founder publication approval and are intentionally not in this public issue.