flaky test: TestForceExec — 1205 when the force-kill lands inside the ~100ms budget left by LockWaitTimeout=1 (no post-kill retry)

## What happened

`TestForceExec` (`pkg/dbconn/dbconn_test.go`) failed on PR #960 in the MySQL 8.0.28 (Aurora version) workflow:

```
INFO found locking transaction pid=1077 lockType=SHARED_READ ... objectName=requires_mdl runningTime=2.09s
INFO found locking transactions count=1 pids=[1077]
WARN killing locking transaction pid=1077
dbconn_test.go:291: Received unexpected error:
    Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
[mysql] packets.go:58 unexpected EOF
--- FAIL: TestForceExec (2.14s)
```

Failing run: https://github.com/block/spirit/actions/runs/27389248718/job/80942995467

The kill machinery worked — it found and killed the blocking transaction (the `unexpected EOF` is the killed connection) — but `ForceExec`'s ALTER hit its lock-wait timeout anyway, so the final `require.NoError` failed.

## Why

The test sets `config.LockWaitTimeout = 1` ("as short as possible"). `ForceExec` schedules `KillLockingTransactions` via `time.AfterFunc(forceKillGracePeriod(...))`, and `forceKillGracePeriod(1)` = max(1 × 0.9, 0.9) = **0.9s** — while the server expires the ALTER's MDL wait at **1.0s**.

That leaves a ~100ms budget for: the timer goroutine to be scheduled, the performance_schema locking-transaction scan to run, the KILL to execute, and the server to release the MDL and grant the waiting ALTER — all on a CI host that is concurrently running the heavy copy tests (this failure landed in the same loaded window as the #946 recurrence and #963). When any of those steps takes >100ms, the ALTER gets error 1205 just before/as the kill lands, and `ForceExec` returns the error — it makes a single attempt with **no retry after the kill fires**.

At the production default `LockWaitTimeout` (e.g. 30s → kill at 27s, ~3s budget) the race is academic; it's the test's minimum setting that exposes it.

## Why it's not the PR under test

PR #960 touches `pkg/change`/`pkg/datasync`/`pkg/move`/`pkg/table` — no `pkg/dbconn` code.

## Suggested fixes (either suffices)

- **Test-level:** use `LockWaitTimeout = 2` (kill at 1.8s, ~200ms budget) or 3, trading a slightly slower test for a budget that survives loaded runners.
- **Code-level (more robust for production too):** if the protected statement fails with error 1205 *and* the kill timer already fired, retry the statement once — the blocker is now dead, so the retry acquires the MDL immediately. This also covers the real-world case where the kill lands a hair too late at any timeout setting.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

flaky test: TestForceExec — 1205 when the force-kill lands inside the ~100ms budget left by LockWaitTimeout=1 (no post-kill retry) #964

What happened

Why

Why it's not the PR under test

Suggested fixes (either suffices)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

flaky test: TestForceExec — 1205 when the force-kill lands inside the ~100ms budget left by LockWaitTimeout=1 (no post-kill retry) #964

Description

What happened

Why

Why it's not the PR under test

Suggested fixes (either suffices)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions