Skip to content

[BUG] statement_timeout leaks across pooled connections via SET (not SET LOCAL) #3898

@akeating

Description

@akeating

Full disclosure: I observed these issues in the logs, did an initial review, prepared this issue with assistance of opus 4.7 and hand-edited the following. As usual with complex codebases such as this, there could be many factors at play. However, this particular issue seems worthwhile addressing because it will at least reduce the scope of observed instability that might be obscuring other more substantial problems.

Describe the issue

Several places in the engine lower statement_timeout for a single transaction by issuing a plain SET statement_timeout=… (not SET LOCAL). Per the PostgreSQL SET docs, a plain SET is session-scoped: if the surrounding transaction commits, the new value persists for the lifetime of that backend connection. When pgxpool returns the connection to the pool, the next caller inherits the smaller timeout, and any normally-fast query that runs on that checked-out connection can fail with SQLSTATE 57014 ("canceling statement due to statement timeout") at the leaked threshold.

The most visible symptom is the dispatcher's Heartbeat RPC. Its only DB write is a single-row UPDATE "Worker" … WHERE id = $1 against the bare pool (pkg/repository/worker.go::UpdateWorkerHeartbeat) with no SET of its own, so it inherits whatever statement_timeout happens to be on the connection it draws. When that connection has been tainted with the leaked 5 s timeout and DB latency spikes (or Worker row-lock contention occurs), the update aborts at the 5 s mark:

ERR error="ERROR: canceling statement due to statement timeout (SQLSTATE 57014)" service=grpc
ERR finished call grpc.code=Internal grpc.method=Heartbeat grpc.time_ms=5167.774 grpc.service=Dispatcher

The 5167 ms gRPC duration matching the 5000 ms timeout almost exactly is the fingerprint: the pool default of 30 s never applies on this connection. Worker SDKs in turn log Failed to send heartbeat: … INTERNAL and Heartbeat interval delay (XXXXms >> 4000ms), and a few consecutive misses cause the engine to mark the worker inactive.

Environment

  • SDK: TypeScript 0.15.2
  • Engine: hatchet-lite 0.79.12

Expected behavior

Lowered statement_timeout set inside a transaction should not leak to other callers via the connection pool. Heartbeat and other unrelated short-lived queries should not fail with 57014 at the 5 s mark.

Root cause

The leak comes from the partition-heartbeat helpers in pkg/repository/tenant.go. All three follow the same pattern — open a tx on the bare pool, run SET statement_timeout=5000 (plain SET), then commit:

// pkg/repository/tenant.go
// UpdateControllerPartitionHeartbeat — same pattern in
// UpdateWorkerPartitionHeartbeat and UpdateSchedulerPartitionHeartbeat
tx, err := r.pool.Begin(ctx)
...
defer sqlchelpers.DeferRollback(ctx, r.l, tx.Rollback)
// set tx timeout to 5 seconds to avoid deadlocks
_, err = tx.Exec(ctx, "SET statement_timeout=5000")
...
if err := tx.Commit(ctx); err != nil { return "", err }

Per PG semantics, the SET survives COMMIT, and nothing in these functions resets it. These helpers run on a fast cadence on every controller / worker / scheduler partition on every engine instance, so over time a large fraction of pool connections are mutated. Once tainted, a connection stays tainted until it is recycled by MaxConnLifetime / MaxConnIdleTime — a long window in which any caller (including UpdateWorkerHeartbeat) can pick it up.

The pool's AfterConnect hook does set the default to 30 s when the connection is first established (pkg/config/loader/loader.go), but there is no AfterRelease defense to re-normalize the GUC on check-in, so a one-time mutation persists indefinitely.

Other related (but lower-severity) issues

pkg/repository/sqlchelpers/tx.go::PrepareTxWithStatementTimeout is structurally similar but does not actually leak today, because ROLLBACK reverts an in-tx plain SET, and the commit path performs an explicit SET statement_timeout=30000 before tx.Commit:

commit := func(ctx context.Context) error {
    _, err = tx.Exec(ctx, "SET statement_timeout=30000")
    ...
    return tx.Commit(ctx)
}
rollback := func() {
    DeferRollback(ctx, l, tx.Rollback)
}
_, err = tx.Exec(ctx, fmt.Sprintf("SET statement_timeout=%d", timeoutMs))
if err != nil {
    return nil, nil, nil, err
}

Two things are still worth cleaning up here:

  1. The 30000 reset value is a magic constant duplicated from loader.go::AfterConnect. Switching to SET LOCAL removes the assumption entirely.
  2. If the initial SET fails after pool.Begin succeeded, the function returns a nil commit/rollback pair without rolling back — the underlying connection is leaked until GC. Should call tx.Rollback (or DeferRollback) before returning.
    sqlchelpers.AcquireConnectionWithStatementTimeout works on a non-transactional connection, so SET LOCAL doesn't apply. Its release already resets to 30 s, but doing so via a hard-coded SET statement_timeout=30000 makes the value drift-prone (it must match AfterConnect).

Suggested fixes

In priority order:

  1. Fix the actual leak (the one causing the reported 57014s). In pkg/repository/tenant.go, change the three call sites — UpdateControllerPartitionHeartbeat, UpdateWorkerPartitionHeartbeat, UpdateSchedulerPartitionHeartbeat — from SET statement_timeout=5000 to SET LOCAL statement_timeout = 5000. PG reverts on COMMIT / ROLLBACK automatically; no manual cleanup needed.

  2. Tidy PrepareTxWithStatementTimeout in pkg/repository/sqlchelpers/tx.go:

    • Switch to SET LOCAL statement_timeout = $timeoutMs.
    • Drop the SET statement_timeout=30000 reset in the commit closure (no longer needed).
    • On the early SET error after pool.Begin, roll the tx back before returning so the connection isn't leaked.
  3. Add a defensive pgxpool.Config.AfterRelease in pkg/config/loader/loader.go (and on readReplicaConfig) that re-normalizes statement_timeout on every check-in. Either:

    • SET statement_timeout = <default> mirroring AfterConnect (extract the constant so the two cannot drift), or
    • RESET statement_timeout, if the server / role default is configured to the desired value (otherwise RESET reverts to the postgresql.conf default, not the engine's intended 30 s).
      Return false from the hook on failure so the connection is discarded rather than reused. This is belt-and-suspenders that retires the entire bug class and protects against future regressions.
  4. Optional cleanup in sqlchelpers.AcquireConnectionWithStatementTimeout: extract the 30 s default into a shared package-level constant so loader.go and tx.go reference the same value, and run the reset inside a defer within release so a caller panic cannot skip it. If you take fix (3) above, this becomes safety-net.

Item (1) on its own should resolve the reported Heartbeat 57014 errors. Items (2)–(4) are hardening.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions