[BUG] statement_timeout leaks across pooled connections via SET (not SET LOCAL)

Full disclosure: I observed these issues in the logs, did an initial review, prepared this issue with assistance of opus 4.7 and hand-edited the following. As usual with complex codebases such as this, there could be many factors at play. However, this particular issue seems worthwhile addressing because it will at least reduce the scope of observed instability that might be obscuring other more substantial problems.

## Describe the issue
Several places in the engine lower `statement_timeout` for a single transaction by issuing a plain `SET statement_timeout=…` (not `SET LOCAL`). Per the [PostgreSQL `SET` docs](https://www.postgresql.org/docs/current/sql-set.html), a plain `SET` is session-scoped: if the surrounding transaction commits, the new value persists for the lifetime of that backend connection. When `pgxpool` returns the connection to the pool, the next caller inherits the smaller timeout, and any normally-fast query that runs on that checked-out connection can fail with SQLSTATE `57014` (*"canceling statement due to statement timeout"*) at the leaked threshold.

The most visible symptom is the dispatcher's `Heartbeat` RPC. Its only DB write is a single-row `UPDATE "Worker" … WHERE id = $1` against the bare pool (`pkg/repository/worker.go::UpdateWorkerHeartbeat`) with no `SET` of its own, so it inherits whatever `statement_timeout` happens to be on the connection it draws. When that connection has been tainted with the leaked 5 s timeout and DB latency spikes (or `Worker` row-lock contention occurs), the update aborts at the 5 s mark:
```text
ERR error="ERROR: canceling statement due to statement timeout (SQLSTATE 57014)" service=grpc
ERR finished call grpc.code=Internal grpc.method=Heartbeat grpc.time_ms=5167.774 grpc.service=Dispatcher
```
The 5167 ms gRPC duration matching the 5000 ms timeout almost exactly is the fingerprint: the pool default of 30 s never applies on this connection. Worker SDKs in turn log `Failed to send heartbeat: … INTERNAL` and `Heartbeat interval delay (XXXXms >> 4000ms)`, and a few consecutive misses cause the engine to mark the worker inactive.

## Environment
- **SDK:** TypeScript `0.15.2`
- **Engine:** `hatchet-lite` `0.79.12`

## Expected behavior
Lowered `statement_timeout` set inside a transaction should not leak to other callers via the connection pool. `Heartbeat` and other unrelated short-lived queries should not fail with `57014` at the 5 s mark.

## Root cause
The leak comes from the partition-heartbeat helpers in `pkg/repository/tenant.go`. All three follow the same pattern — open a tx on the bare pool, run `SET statement_timeout=5000` (plain `SET`), then commit:
```go
// pkg/repository/tenant.go
// UpdateControllerPartitionHeartbeat — same pattern in
// UpdateWorkerPartitionHeartbeat and UpdateSchedulerPartitionHeartbeat
tx, err := r.pool.Begin(ctx)
...
defer sqlchelpers.DeferRollback(ctx, r.l, tx.Rollback)
// set tx timeout to 5 seconds to avoid deadlocks
_, err = tx.Exec(ctx, "SET statement_timeout=5000")
...
if err := tx.Commit(ctx); err != nil { return "", err }
```
Per PG semantics, the `SET` survives `COMMIT`, and nothing in these functions resets it. These helpers run on a fast cadence on every controller / worker / scheduler partition on every engine instance, so over time a large fraction of pool connections are mutated. Once tainted, a connection stays tainted until it is recycled by `MaxConnLifetime` / `MaxConnIdleTime` — a long window in which any caller (including `UpdateWorkerHeartbeat`) can pick it up.

The pool's `AfterConnect` hook does set the default to 30 s when the connection is first established (`pkg/config/loader/loader.go`), but there is no `AfterRelease` defense to re-normalize the GUC on check-in, so a one-time mutation persists indefinitely.

## Other related (but lower-severity) issues
`pkg/repository/sqlchelpers/tx.go::PrepareTxWithStatementTimeout` is structurally similar but does not actually leak today, because `ROLLBACK` reverts an in-tx plain `SET`, and the commit path performs an explicit `SET statement_timeout=30000` before `tx.Commit`:
```go
commit := func(ctx context.Context) error {
    _, err = tx.Exec(ctx, "SET statement_timeout=30000")
    ...
    return tx.Commit(ctx)
}
rollback := func() {
    DeferRollback(ctx, l, tx.Rollback)
}
_, err = tx.Exec(ctx, fmt.Sprintf("SET statement_timeout=%d", timeoutMs))
if err != nil {
    return nil, nil, nil, err
}
```
Two things are still worth cleaning up here:
1. The `30000` reset value is a magic constant duplicated from `loader.go::AfterConnect`. Switching to `SET LOCAL` removes the assumption entirely.
2. If the initial `SET` fails after `pool.Begin` succeeded, the function returns a `nil` commit/rollback pair without rolling back — the underlying connection is leaked until GC. Should call `tx.Rollback` (or `DeferRollback`) before returning.
`sqlchelpers.AcquireConnectionWithStatementTimeout` works on a non-transactional connection, so `SET LOCAL` doesn't apply. Its `release` already resets to 30 s, but doing so via a hard-coded `SET statement_timeout=30000` makes the value drift-prone (it must match `AfterConnect`).
## Suggested fixes
In priority order:
1. **Fix the actual leak (the one causing the reported `57014`s).** In `pkg/repository/tenant.go`, change the three call sites — `UpdateControllerPartitionHeartbeat`, `UpdateWorkerPartitionHeartbeat`, `UpdateSchedulerPartitionHeartbeat` — from `SET statement_timeout=5000` to `SET LOCAL statement_timeout = 5000`. PG reverts on `COMMIT` / `ROLLBACK` automatically; no manual cleanup needed.

2. **Tidy `PrepareTxWithStatementTimeout`** in `pkg/repository/sqlchelpers/tx.go`:
   - Switch to `SET LOCAL statement_timeout = $timeoutMs`.
   - Drop the `SET statement_timeout=30000` reset in the `commit` closure (no longer needed).
   - On the early `SET` error after `pool.Begin`, roll the tx back before returning so the connection isn't leaked.

3. **Add a defensive `pgxpool.Config.AfterRelease`** in `pkg/config/loader/loader.go` (and on `readReplicaConfig`) that re-normalizes `statement_timeout` on every check-in. Either:
   - `SET statement_timeout = <default>` mirroring `AfterConnect` (extract the constant so the two cannot drift), or
   - `RESET statement_timeout`, **if** the server / role default is configured to the desired value (otherwise `RESET` reverts to the `postgresql.conf` default, not the engine's intended 30 s).
   Return `false` from the hook on failure so the connection is discarded rather than reused. This is belt-and-suspenders that retires the entire bug class and protects against future regressions.

4. **Optional cleanup in `sqlchelpers.AcquireConnectionWithStatementTimeout`:** extract the 30 s default into a shared package-level constant so `loader.go` and `tx.go` reference the same value, and run the reset inside a `defer` within `release` so a caller panic cannot skip it. If you take fix (3) above, this becomes safety-net.

> Item (1) on its own should resolve the reported `Heartbeat` `57014` errors. Items (2)–(4) are hardening.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] statement_timeout leaks across pooled connections via SET (not SET LOCAL) #3898

Describe the issue

Environment

Expected behavior

Root cause

Other related (but lower-severity) issues

Suggested fixes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG] statement_timeout leaks across pooled connections via SET (not SET LOCAL) #3898

Description

Describe the issue

Environment

Expected behavior

Root cause

Other related (but lower-severity) issues

Suggested fixes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions