flaky test: TestPreventConcurrentRuns (timeout waiting for waitingOnSentinelTable, stuck in checksum)

## Failure

`TestPreventConcurrentRuns` failed on main in the **MySQL 9.7 /w docker-compose** workflow:
https://github.com/block/spirit/actions/runs/27313098337/job/80687564402

```
--- FAIL: TestPreventConcurrentRuns (30.38s)
    e2e_test.go:330: test database: t_testpreventconcurrentruns_4850_30
    e2e_test.go:355: timeout waiting for status >= waitingOnSentinelTable, current status: checksum
```

## Analysis

The test starts a migration with `WithDeferCutOver()` + `WithRespectSentinel()` on a 1,010-row table, then calls `waitForStatus(t, m, status.WaitingOnSentinelTable)`, which has a hard 30s timeout (`helpers_test.go`).
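
For readers unfamiliar with this kind of helper, here is a minimal sketch of a poll-with-deadline wait. It is not the code in `helpers_test.go`: the `Status` type, its ordering, and the names `waitForAtLeast` and `stuckInChecksum` are assumptions made up for illustration. It only shows why a migration that stays in `checksum` makes the wait report a timeout with the last status seen.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a stand-in for the migration status enum. The ordering here is
// an assumption for illustration, not spirit's actual definition.
type Status int

const (
	CopyRows Status = iota
	Checksum
	PostChecksum
	WaitingOnSentinelTable
)

// waitForAtLeast polls get() until it returns a status >= want or the
// timeout elapses. It returns the last status seen and whether want was reached.
func waitForAtLeast(get func() Status, want Status, timeout, interval time.Duration) (Status, bool) {
	deadline := time.Now().Add(timeout)
	for {
		cur := get()
		if cur >= want {
			return cur, true
		}
		if time.Now().After(deadline) {
			return cur, false
		}
		time.Sleep(interval)
	}
}

// stuckInChecksum simulates a migration that never leaves the checksum
// state, using a short timeout so the example finishes quickly.
func stuckInChecksum() (Status, bool) {
	get := func() Status { return Checksum }
	return waitForAtLeast(get, WaitingOnSentinelTable, 50*time.Millisecond, 5*time.Millisecond)
}

func main() {
	cur, ok := stuckInChecksum()
	fmt.Printf("reached=%v lastStatus=%d\n", ok, cur)
}
```

With a fixed deadline like this, the wait cannot tell a hung migration from a slow one; it only knows the target status was not reached in time.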

Timeline from the job log:

- `23:32:42` — migration starts, metadata lock acquired
- `23:32:43` — row copy complete (watermark optimization toggled off for `prevent_concurrent_runs`)
- `23:32:43 → 23:33:12` — migration sits in `checksum` state for ~29s and never reaches `postChecksum`
- `23:33:12` — `waitForStatus` times out → `t.Fatalf` → deferred close cancels the context

The `ERROR checksum failed, retrying attempt=2` / `checksum encountered an error error="context canceled"` lines at 23:33:12 are teardown noise from the deferred close after the test already failed, not the cause.
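
The teardown ordering can be illustrated with a small, self-contained sketch. The function names (`runChecksum`, `teardownError`, `isTeardownNoise`) are hypothetical and do not come from spirit; the point is only that a phase still running when the context is cancelled will report `context canceled`, which says nothing about why the phase was slow.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runChecksum stands in for a long-running phase: it returns nil if the
// work finishes, or a wrapped context error if it is cancelled first.
func runChecksum(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("checksum encountered an error: %w", ctx.Err())
	}
}

// teardownError simulates the test giving up while the phase is still
// running: cancel() plays the role of the deferred close.
func teardownError() error {
	ctx, cancel := context.WithCancel(context.Background())
	errc := make(chan error, 1)
	go func() { errc <- runChecksum(ctx, time.Hour) }()
	cancel()
	return <-errc
}

// isTeardownNoise reports whether err is just the result of cancellation.
func isTeardownNoise(err error) bool {
	return errors.Is(err, context.Canceled)
}

func main() {
	err := teardownError()
	fmt.Println(err, isTeardownNoise(err))
}
```

When triaging logs like these, checking `errors.Is(err, context.Canceled)` and comparing timestamps against the test failure is usually enough to separate cause from cleanup.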

The runner appears to have been heavily overloaded during this window, which is the likely reason the checksum phase (binlog catch-up → table lock → checksum chunks) couldn't finish a 1,010-row table within the budget:

- Concurrent tests in the same run show 11-row chunks taking **2.3s+** against a 500ms threshold (`high chunk processing time time=2.326189693s threshold=500ms`)
- The shared MySQL instance rotated through ~600 binlog files (`mysql-bin.000853` → `mysql-bin.001463`) in ~100s, all of which each test's binlog syncer must chew through during "waiting to catch up to source position"

Note that the in-checksum table-lock acquisition timeout (30s) equals the entire `waitForStatus` budget, so any slow binlog catch-up leaves no headroom.
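
The numbers above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only figures from this issue (binlog sequence 853 to 1463 in roughly 100s, a 30s wait budget, about 1s of row copy from the timeline, a 30s lock-acquisition timeout). The helper names are made up for illustration, and the worst case assumes the lock wait runs to its full timeout.

```go
package main

import "fmt"

// binlogFilesPerSecond returns how many binlog files were rotated per
// second, given the first and last file sequence numbers in the window.
func binlogFilesPerSecond(firstSeq, lastSeq int, windowSeconds float64) float64 {
	return float64(lastSeq-firstSeq) / windowSeconds
}

// headroomSeconds subtracts each phase's duration from the budget.
// A negative result means the wait can expire before the phases finish.
func headroomSeconds(budget int, phases ...int) int {
	for _, p := range phases {
		budget -= p
	}
	return budget
}

func main() {
	// mysql-bin.000853 -> mysql-bin.001463 in roughly 100s.
	fmt.Printf("binlog churn: %.1f files/s\n", binlogFilesPerSecond(853, 1463, 100))
	// 30s budget, ~1s row copy, lock acquisition allowed up to 30s.
	fmt.Printf("worst-case headroom: %ds\n", headroomSeconds(30, 1, 30))
}
```

This works out to roughly 6 binlog files per second and a worst-case headroom of -1s, before counting binlog catch-up or the checksum chunks themselves.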

## Prior history

#612 was a flake in the same test with a different failure mode (`failed to get binlog position, check binary is enabled`), closed 2026-02-25.

## Possible fixes

- Increase the `waitForStatus` timeout (30s currently has to cover copy + analyze + binlog catch-up + checksum on a shared, loaded server)
- Or reduce susceptibility to cross-test binlog volume (per-test catch-up budget, less parallel binlog churn)
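
As one possible shape for the first option, the wait budget could be made overridable so CI can raise it on loaded runners without changing local defaults. This is a sketch under assumptions: the environment variable name `STATUS_WAIT_TIMEOUT` and the function `resolveStatusWait` are hypothetical and do not exist in spirit today.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultStatusWait mirrors the current 30s budget described in this issue.
const defaultStatusWait = 30 * time.Second

// statusWaitEnv is a hypothetical environment variable name.
const statusWaitEnv = "STATUS_WAIT_TIMEOUT"

// resolveStatusWait parses raw as a Go duration (e.g. "90s", "2m").
// It falls back to def when raw is empty, malformed, or not positive.
func resolveStatusWait(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

func main() {
	timeout := resolveStatusWait(os.Getenv(statusWaitEnv), defaultStatusWait)
	fmt.Println("waitForStatus budget:", timeout)
}
```

A larger budget only hides the load problem, so it would probably pair best with the second option rather than replace it.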

