What happened
TestForceExec (pkg/dbconn/dbconn_test.go) failed on PR #960 in the MySQL 8.0.28 (Aurora version) workflow:
INFO found locking transaction pid=1077 lockType=SHARED_READ ... objectName=requires_mdl runningTime=2.09s
INFO found locking transactions count=1 pids=[1077]
WARN killing locking transaction pid=1077
dbconn_test.go:291: Received unexpected error:
Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
[mysql] packets.go:58 unexpected EOF
--- FAIL: TestForceExec (2.14s)
Failing run: https://github.com/block/spirit/actions/runs/27389248718/job/80942995467
The kill machinery worked — it found and killed the blocking transaction (the unexpected EOF is the killed connection) — but ForceExec's ALTER hit its lock-wait timeout anyway, so the final require.NoError failed.
Why
The test sets config.LockWaitTimeout = 1 ("as short as possible"). ForceExec schedules KillLockingTransactions via time.AfterFunc(forceKillGracePeriod(...)), and forceKillGracePeriod(1) = max(1 × 0.9, 0.9) = 0.9s — while the server expires the ALTER's MDL wait at 1.0s.
That leaves a ~100ms budget for: the timer goroutine to be scheduled, the performance_schema locking-transaction scan to run, the KILL to execute, and the server to release the MDL and grant the waiting ALTER — all on a CI host that is concurrently running the heavy copy tests (this failure landed in the same loaded window as the #946 recurrence and #963). When any of those steps takes >100ms, the ALTER gets error 1205 just before/as the kill lands, and ForceExec returns the error — it makes a single attempt with no retry after the kill fires.
At the production default LockWaitTimeout (e.g. 30s → kill at 27s, ~3s budget) the race is academic; it's the test's minimum setting that exposes it.
Why it's not the PR under test
PR #960 touches pkg/change/pkg/datasync/pkg/move/pkg/table — no pkg/dbconn code.
Suggested fixes (either suffices)
- Test-level: use
LockWaitTimeout = 2 (kill at 1.8s, ~200ms budget) or 3, trading a slightly slower test for a budget that survives loaded runners.
- Code-level (more robust for production too): if the protected statement fails with error 1205 and the kill timer already fired, retry the statement once — the blocker is now dead, so the retry acquires the MDL immediately. This also covers the real-world case where the kill lands a hair too late at any timeout setting.
What happened
TestForceExec(pkg/dbconn/dbconn_test.go) failed on PR #960 in the MySQL 8.0.28 (Aurora version) workflow:Failing run: https://github.com/block/spirit/actions/runs/27389248718/job/80942995467
The kill machinery worked — it found and killed the blocking transaction (the
unexpected EOFis the killed connection) — butForceExec's ALTER hit its lock-wait timeout anyway, so the finalrequire.NoErrorfailed.Why
The test sets
config.LockWaitTimeout = 1("as short as possible").ForceExecschedulesKillLockingTransactionsviatime.AfterFunc(forceKillGracePeriod(...)), andforceKillGracePeriod(1)= max(1 × 0.9, 0.9) = 0.9s — while the server expires the ALTER's MDL wait at 1.0s.That leaves a ~100ms budget for: the timer goroutine to be scheduled, the performance_schema locking-transaction scan to run, the KILL to execute, and the server to release the MDL and grant the waiting ALTER — all on a CI host that is concurrently running the heavy copy tests (this failure landed in the same loaded window as the #946 recurrence and #963). When any of those steps takes >100ms, the ALTER gets error 1205 just before/as the kill lands, and
ForceExecreturns the error — it makes a single attempt with no retry after the kill fires.At the production default
LockWaitTimeout(e.g. 30s → kill at 27s, ~3s budget) the race is academic; it's the test's minimum setting that exposes it.Why it's not the PR under test
PR #960 touches
pkg/change/pkg/datasync/pkg/move/pkg/table— nopkg/dbconncode.Suggested fixes (either suffices)
LockWaitTimeout = 2(kill at 1.8s, ~200ms budget) or 3, trading a slightly slower test for a budget that survives loaded runners.