Skip to content

flaky test: TestForceExec — 1205 when the force-kill lands inside the ~100ms budget left by LockWaitTimeout=1 (no post-kill retry) #964

Description

@morgo

What happened

TestForceExec (pkg/dbconn/dbconn_test.go) failed on PR #960 in the MySQL 8.0.28 (Aurora version) workflow:

INFO found locking transaction pid=1077 lockType=SHARED_READ ... objectName=requires_mdl runningTime=2.09s
INFO found locking transactions count=1 pids=[1077]
WARN killing locking transaction pid=1077
dbconn_test.go:291: Received unexpected error:
    Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
[mysql] packets.go:58 unexpected EOF
--- FAIL: TestForceExec (2.14s)

Failing run: https://github.com/block/spirit/actions/runs/27389248718/job/80942995467

The kill machinery worked — it found and killed the blocking transaction (the unexpected EOF is the killed connection) — but ForceExec's ALTER hit its lock-wait timeout anyway, so the final require.NoError failed.

Why

The test sets config.LockWaitTimeout = 1 ("as short as possible"). ForceExec schedules KillLockingTransactions via time.AfterFunc(forceKillGracePeriod(...)), and forceKillGracePeriod(1) = max(1 × 0.9, 0.9) = 0.9s — while the server expires the ALTER's MDL wait at 1.0s.

That leaves a ~100ms budget for: the timer goroutine to be scheduled, the performance_schema locking-transaction scan to run, the KILL to execute, and the server to release the MDL and grant the waiting ALTER — all on a CI host that is concurrently running the heavy copy tests (this failure landed in the same loaded window as the #946 recurrence and #963). When any of those steps takes >100ms, the ALTER gets error 1205 just before/as the kill lands, and ForceExec returns the error — it makes a single attempt with no retry after the kill fires.

At the production default LockWaitTimeout (e.g. 30s → kill at 27s, ~3s budget) the race is academic; it's the test's minimum setting that exposes it.

Why it's not the PR under test

PR #960 touches pkg/change/pkg/datasync/pkg/move/pkg/table — no pkg/dbconn code.

Suggested fixes (either suffices)

  • Test-level: use LockWaitTimeout = 2 (kill at 1.8s, ~200ms budget) or 3, trading a slightly slower test for a budget that survives loaded runners.
  • Code-level (more robust for production too): if the protected statement fails with error 1205 and the kill timer already fired, retry the statement once — the blocker is now dead, so the retry acquires the MDL immediately. This also covers the real-world case where the kill lands a hair too late at any timeout setting.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Fields

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions