flaky test: TestPreventConcurrentRuns (timeout waiting for waitingOnSentinelTable, stuck in checksum) #946

Description

@morgo

Failure

TestPreventConcurrentRuns failed on main in the MySQL 9.7 /w docker-compose workflow:
https://github.com/block/spirit/actions/runs/27313098337/job/80687564402

--- FAIL: TestPreventConcurrentRuns (30.38s)
    e2e_test.go:330: test database: t_testpreventconcurrentruns_4850_30
    e2e_test.go:355: timeout waiting for status >= waitingOnSentinelTable, current status: checksum

Analysis

The test starts a migration with WithDeferCutOver() + WithRespectSentinel() on a 1,010-row table, then uses waitForStatus(t, m, status.WaitingOnSentinelTable) which has a hard 30s timeout (helpers_test.go).

Timeline from the job log:

  • 23:32:42 - migration starts, metadata lock acquired
  • 23:32:43 - row copy complete (watermark optimization toggled off for prevent_concurrent_runs)
  • 23:32:43 → 23:33:12 - migration sits in checksum state for ~29s and never reaches postChecksum
  • 23:33:12 - waitForStatus times out → t.Fatalf → deferred close cancels the context

The ERROR checksum failed, retrying attempt=2 / checksum encountered an error error="context canceled" lines at 23:33:12 are teardown noise from the deferred close after the test already failed, not the cause.

The runner appears to have been heavily overloaded during this window, which is the likely reason the checksum phase (binlog catch-up → table lock → checksum chunks) couldn't finish a 1,010-row table within the budget:

  • Concurrent tests in the same run show 11-row chunks taking 2.3s+ against a 500ms threshold (high chunk processing time time=2.326189693s threshold=500ms)
  • The shared MySQL instance rotated through ~600 binlog files (mysql-bin.000853 → mysql-bin.001463) in ~100s, all of which each test's binlog syncer must chew through during "waiting to catch up to source position"

Note that the in-checksum table-lock acquisition timeout (30s) is the same size as the entire waitForStatus budget, so any slow binlog catch-up leaves no headroom.

Prior history

#612 was a flake in the same test but a different failure mode (failed to get binlog position, check binary is enabled), closed 2026-02-25.

Possible fixes

  • Increase the waitForStatus timeout (the current 30s has to cover copy + analyze + binlog catch-up + checksum on a shared, loaded server)
  • Or reduce susceptibility to cross-test binlog volume (per-test catch-up budget, less parallel binlog churn)
