Failure
TestPreventConcurrentRuns failed on main in the MySQL 9.7 /w docker-compose workflow:
https://github.com/block/spirit/actions/runs/27313098337/job/80687564402
--- FAIL: TestPreventConcurrentRuns (30.38s)
e2e_test.go:330: test database: t_testpreventconcurrentruns_4850_30
e2e_test.go:355: timeout waiting for status >= waitingOnSentinelTable, current status: checksum
Analysis
The test starts a migration with WithDeferCutOver() + WithRespectSentinel() on a 1,010-row table, then uses waitForStatus(t, m, status.WaitingOnSentinelTable) which has a hard 30s timeout (helpers_test.go).
Timeline from the job log:
23:32:42 — migration starts, metadata lock acquired
23:32:43 — row copy complete (watermark optimization toggled off for prevent_concurrent_runs)
23:32:43 → 23:33:12 — migration sits in checksum state for ~29s and never reaches postChecksum
23:33:12 — waitForStatus times out → t.Fatalf → deferred close cancels the context
The ERROR checksum failed, retrying attempt=2 / checksum encountered an error error="context canceled" lines at 23:33:12 are teardown noise from the deferred close after the test already failed, not the cause.
The runner appears to have been heavily overloaded during this window, which is the likely reason the checksum phase (binlog catch-up → table lock → checksum chunks) couldn't finish a 1,010-row table within the budget:
- Concurrent tests in the same run show 11-row chunks taking 2.3s+ against a 500ms threshold (high chunk processing time time=2.326189693s threshold=500ms)
- The shared MySQL instance rotated through ~600 binlog files (mysql-bin.000853 → mysql-bin.001463) in ~100s, all of which each test's binlog syncer must chew through during "waiting to catch up to source position"
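For reference, the "~600 files in ~100s" figure can be reproduced from the two file names in the log. This is only a rough sketch: binlogSeq and rotationRate are hypothetical helpers written for this note, not anything in spirit, and the 100s window is the approximate value quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// binlogSeq extracts the numeric suffix from a binlog file name such as
// "mysql-bin.000853". Hypothetical helper, for illustration only.
func binlogSeq(name string) (int, error) {
	i := strings.LastIndex(name, ".")
	if i < 0 {
		return 0, fmt.Errorf("no sequence suffix in %q", name)
	}
	return strconv.Atoi(name[i+1:])
}

// rotationRate returns how many files were rotated between first and last,
// and the average number of files per second over windowSeconds.
func rotationRate(first, last string, windowSeconds float64) (int, float64, error) {
	a, err := binlogSeq(first)
	if err != nil {
		return 0, 0, err
	}
	b, err := binlogSeq(last)
	if err != nil {
		return 0, 0, err
	}
	n := b - a
	return n, float64(n) / windowSeconds, nil
}

func main() {
	// File names from the job log; the 100s window is approximate.
	n, rate, err := rotationRate("mysql-bin.000853", "mysql-bin.001463", 100)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d files, %.1f files/s\n", n, rate)
}
```

That works out to roughly six binlog rotations per second on the shared instance, which every concurrent syncer has to read through.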
Note the in-checksum table-lock acquisition timeout (30s) is the same size as the entire waitForStatus budget, so any slow binlog catch-up leaves no headroom.
Prior history
#612 was a flake in the same test but a different failure mode (failed to get binlog position, check binary is enabled), closed 2026-02-25.
Possible fixes
- Increase the waitForStatus timeout (the current 30s has to cover copy + analyze + binlog catch-up + checksum on a shared, loaded server)
- Or reduce susceptibility to cross-test binlog volume (per-test catch-up budget, less parallel binlog churn)