Full disclosure: I observed these issues in the logs, did an initial review, prepared this issue with assistance of opus 4.7 and hand-edited the following. As usual with complex codebases such as this, there could be many factors at play. However, this particular issue seems worthwhile addressing because it will at least reduce the scope of observed instability that might be obscuring other more substantial problems.
Describe the issue
Several places in the engine lower statement_timeout for a single transaction by issuing a plain SET statement_timeout=… (not SET LOCAL). Per the PostgreSQL SET docs, a plain SET is session-scoped: if the surrounding transaction commits, the new value persists for the lifetime of that backend connection. When pgxpool returns the connection to the pool, the next caller inherits the smaller timeout, and any normally-fast query that runs on that checked-out connection can fail with SQLSTATE 57014 ("canceling statement due to statement timeout") at the leaked threshold.
The most visible symptom is the dispatcher's Heartbeat RPC. Its only DB write is a single-row UPDATE "Worker" … WHERE id = $1 against the bare pool (pkg/repository/worker.go::UpdateWorkerHeartbeat) with no SET of its own, so it inherits whatever statement_timeout happens to be on the connection it draws. When that connection has been tainted with the leaked 5 s timeout and DB latency spikes (or Worker row-lock contention occurs), the update aborts at the 5 s mark:
ERR error="ERROR: canceling statement due to statement timeout (SQLSTATE 57014)" service=grpc
ERR finished call grpc.code=Internal grpc.method=Heartbeat grpc.time_ms=5167.774 grpc.service=Dispatcher
The 5167 ms gRPC duration matching the 5000 ms timeout almost exactly is the fingerprint: the pool default of 30 s never applies on this connection. Worker SDKs in turn log Failed to send heartbeat: … INTERNAL and Heartbeat interval delay (XXXXms >> 4000ms), and a few consecutive misses cause the engine to mark the worker inactive.
Environment
- SDK: TypeScript
0.15.2
- Engine:
hatchet-lite 0.79.12
Expected behavior
Lowered statement_timeout set inside a transaction should not leak to other callers via the connection pool. Heartbeat and other unrelated short-lived queries should not fail with 57014 at the 5 s mark.
Root cause
The leak comes from the partition-heartbeat helpers in pkg/repository/tenant.go. All three follow the same pattern — open a tx on the bare pool, run SET statement_timeout=5000 (plain SET), then commit:
// pkg/repository/tenant.go
// UpdateControllerPartitionHeartbeat — same pattern in
// UpdateWorkerPartitionHeartbeat and UpdateSchedulerPartitionHeartbeat
tx, err := r.pool.Begin(ctx)
...
defer sqlchelpers.DeferRollback(ctx, r.l, tx.Rollback)
// set tx timeout to 5 seconds to avoid deadlocks
_, err = tx.Exec(ctx, "SET statement_timeout=5000")
...
if err := tx.Commit(ctx); err != nil { return "", err }
Per PG semantics, the SET survives COMMIT, and nothing in these functions resets it. These helpers run on a fast cadence on every controller / worker / scheduler partition on every engine instance, so over time a large fraction of pool connections are mutated. Once tainted, a connection stays tainted until it is recycled by MaxConnLifetime / MaxConnIdleTime — a long window in which any caller (including UpdateWorkerHeartbeat) can pick it up.
The pool's AfterConnect hook does set the default to 30 s when the connection is first established (pkg/config/loader/loader.go), but there is no AfterRelease defense to re-normalize the GUC on check-in, so a one-time mutation persists indefinitely.
Other related (but lower-severity) issues
pkg/repository/sqlchelpers/tx.go::PrepareTxWithStatementTimeout is structurally similar but does not actually leak today, because ROLLBACK reverts an in-tx plain SET, and the commit path performs an explicit SET statement_timeout=30000 before tx.Commit:
commit := func(ctx context.Context) error {
_, err = tx.Exec(ctx, "SET statement_timeout=30000")
...
return tx.Commit(ctx)
}
rollback := func() {
DeferRollback(ctx, l, tx.Rollback)
}
_, err = tx.Exec(ctx, fmt.Sprintf("SET statement_timeout=%d", timeoutMs))
if err != nil {
return nil, nil, nil, err
}
Two things are still worth cleaning up here:
- The
30000 reset value is a magic constant duplicated from loader.go::AfterConnect. Switching to SET LOCAL removes the assumption entirely.
- If the initial
SET fails after pool.Begin succeeded, the function returns a nil commit/rollback pair without rolling back — the underlying connection is leaked until GC. Should call tx.Rollback (or DeferRollback) before returning.
sqlchelpers.AcquireConnectionWithStatementTimeout works on a non-transactional connection, so SET LOCAL doesn't apply. Its release already resets to 30 s, but doing so via a hard-coded SET statement_timeout=30000 makes the value drift-prone (it must match AfterConnect).
Suggested fixes
In priority order:
-
Fix the actual leak (the one causing the reported 57014s). In pkg/repository/tenant.go, change the three call sites — UpdateControllerPartitionHeartbeat, UpdateWorkerPartitionHeartbeat, UpdateSchedulerPartitionHeartbeat — from SET statement_timeout=5000 to SET LOCAL statement_timeout = 5000. PG reverts on COMMIT / ROLLBACK automatically; no manual cleanup needed.
-
Tidy PrepareTxWithStatementTimeout in pkg/repository/sqlchelpers/tx.go:
- Switch to
SET LOCAL statement_timeout = $timeoutMs.
- Drop the
SET statement_timeout=30000 reset in the commit closure (no longer needed).
- On the early
SET error after pool.Begin, roll the tx back before returning so the connection isn't leaked.
-
Add a defensive pgxpool.Config.AfterRelease in pkg/config/loader/loader.go (and on readReplicaConfig) that re-normalizes statement_timeout on every check-in. Either:
SET statement_timeout = <default> mirroring AfterConnect (extract the constant so the two cannot drift), or
RESET statement_timeout, if the server / role default is configured to the desired value (otherwise RESET reverts to the postgresql.conf default, not the engine's intended 30 s).
Return false from the hook on failure so the connection is discarded rather than reused. This is belt-and-suspenders that retires the entire bug class and protects against future regressions.
-
Optional cleanup in sqlchelpers.AcquireConnectionWithStatementTimeout: extract the 30 s default into a shared package-level constant so loader.go and tx.go reference the same value, and run the reset inside a defer within release so a caller panic cannot skip it. If you take fix (3) above, this becomes safety-net.
Item (1) on its own should resolve the reported Heartbeat 57014 errors. Items (2)–(4) are hardening.
Full disclosure: I observed these issues in the logs, did an initial review, prepared this issue with assistance of opus 4.7 and hand-edited the following. As usual with complex codebases such as this, there could be many factors at play. However, this particular issue seems worthwhile addressing because it will at least reduce the scope of observed instability that might be obscuring other more substantial problems.
Describe the issue
Several places in the engine lower
statement_timeoutfor a single transaction by issuing a plainSET statement_timeout=…(notSET LOCAL). Per the PostgreSQLSETdocs, a plainSETis session-scoped: if the surrounding transaction commits, the new value persists for the lifetime of that backend connection. Whenpgxpoolreturns the connection to the pool, the next caller inherits the smaller timeout, and any normally-fast query that runs on that checked-out connection can fail with SQLSTATE57014("canceling statement due to statement timeout") at the leaked threshold.The most visible symptom is the dispatcher's
HeartbeatRPC. Its only DB write is a single-rowUPDATE "Worker" … WHERE id = $1against the bare pool (pkg/repository/worker.go::UpdateWorkerHeartbeat) with noSETof its own, so it inherits whateverstatement_timeouthappens to be on the connection it draws. When that connection has been tainted with the leaked 5 s timeout and DB latency spikes (orWorkerrow-lock contention occurs), the update aborts at the 5 s mark:The 5167 ms gRPC duration matching the 5000 ms timeout almost exactly is the fingerprint: the pool default of 30 s never applies on this connection. Worker SDKs in turn log
Failed to send heartbeat: … INTERNALandHeartbeat interval delay (XXXXms >> 4000ms), and a few consecutive misses cause the engine to mark the worker inactive.Environment
0.15.2hatchet-lite0.79.12Expected behavior
Lowered
statement_timeoutset inside a transaction should not leak to other callers via the connection pool.Heartbeatand other unrelated short-lived queries should not fail with57014at the 5 s mark.Root cause
The leak comes from the partition-heartbeat helpers in
pkg/repository/tenant.go. All three follow the same pattern — open a tx on the bare pool, runSET statement_timeout=5000(plainSET), then commit:Per PG semantics, the
SETsurvivesCOMMIT, and nothing in these functions resets it. These helpers run on a fast cadence on every controller / worker / scheduler partition on every engine instance, so over time a large fraction of pool connections are mutated. Once tainted, a connection stays tainted until it is recycled byMaxConnLifetime/MaxConnIdleTime— a long window in which any caller (includingUpdateWorkerHeartbeat) can pick it up.The pool's
AfterConnecthook does set the default to 30 s when the connection is first established (pkg/config/loader/loader.go), but there is noAfterReleasedefense to re-normalize the GUC on check-in, so a one-time mutation persists indefinitely.Other related (but lower-severity) issues
pkg/repository/sqlchelpers/tx.go::PrepareTxWithStatementTimeoutis structurally similar but does not actually leak today, becauseROLLBACKreverts an in-tx plainSET, and the commit path performs an explicitSET statement_timeout=30000beforetx.Commit:Two things are still worth cleaning up here:
30000reset value is a magic constant duplicated fromloader.go::AfterConnect. Switching toSET LOCALremoves the assumption entirely.SETfails afterpool.Beginsucceeded, the function returns anilcommit/rollback pair without rolling back — the underlying connection is leaked until GC. Should calltx.Rollback(orDeferRollback) before returning.sqlchelpers.AcquireConnectionWithStatementTimeoutworks on a non-transactional connection, soSET LOCALdoesn't apply. Itsreleasealready resets to 30 s, but doing so via a hard-codedSET statement_timeout=30000makes the value drift-prone (it must matchAfterConnect).Suggested fixes
In priority order:
Fix the actual leak (the one causing the reported
57014s). Inpkg/repository/tenant.go, change the three call sites —UpdateControllerPartitionHeartbeat,UpdateWorkerPartitionHeartbeat,UpdateSchedulerPartitionHeartbeat— fromSET statement_timeout=5000toSET LOCAL statement_timeout = 5000. PG reverts onCOMMIT/ROLLBACKautomatically; no manual cleanup needed.Tidy
PrepareTxWithStatementTimeoutinpkg/repository/sqlchelpers/tx.go:SET LOCAL statement_timeout = $timeoutMs.SET statement_timeout=30000reset in thecommitclosure (no longer needed).SETerror afterpool.Begin, roll the tx back before returning so the connection isn't leaked.Add a defensive
pgxpool.Config.AfterReleaseinpkg/config/loader/loader.go(and onreadReplicaConfig) that re-normalizesstatement_timeouton every check-in. Either:SET statement_timeout = <default>mirroringAfterConnect(extract the constant so the two cannot drift), orRESET statement_timeout, if the server / role default is configured to the desired value (otherwiseRESETreverts to thepostgresql.confdefault, not the engine's intended 30 s).Return
falsefrom the hook on failure so the connection is discarded rather than reused. This is belt-and-suspenders that retires the entire bug class and protects against future regressions.Optional cleanup in
sqlchelpers.AcquireConnectionWithStatementTimeout: extract the 30 s default into a shared package-level constant soloader.goandtx.goreference the same value, and run the reset inside adeferwithinreleaseso a caller panic cannot skip it. If you take fix (3) above, this becomes safety-net.