
fix(repository/multiplexer): hijack listener conn to avoid pool slot leak on reconnect#3701

Open
voidborne-d wants to merge 1 commit into hatchet-dev:main from voidborne-d:fix/multiplexer-hijack-listener-conn

Conversation

@voidborne-d

Closes #3694 ([BUG] Multiplexed LISTEN connection leaks pool slots on reconnect under server-side idle connection termination).

Root cause

pkg/repository/multiplexer.go's Connect callback handed the raw *pgx.Conn from an acquired pool slot to pgxlisten.Listener without retaining the *pgxpool.Conn wrapper:

// before
Connect: func(ctx context.Context) (*pgx.Conn, error) {
    poolConn, err := m.pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Conn(), nil        // wrapper falls out of scope
},

The *pgxpool.Conn wrapper went out of scope with no Release(), so pgxpool permanently counted the slot as acquired. When the server side terminated the listener conn (idle_session_timeout, pgbouncer server_idle_timeout, an L7 proxy idle kill), pgxlisten's listen() returned with an error and its defer conn.Close(ctx) closed the raw conn, but the pool wrapper was already orphaned.

On reconnect pgxlisten called Connect again and acquired a fresh slot. Every reconnect cycle leaked one pool slot.
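The leak is reproducible outside pgxlisten entirely. A minimal standalone sketch (a hypothetical snippet, not code from this repo; the connection string is a placeholder):

// hypothetical repro of the orphaned-wrapper leak
package main

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
    ctx := context.Background()
    pool, err := pgxpool.New(ctx, "postgres://...") // any reachable database
    if err != nil {
        panic(err)
    }
    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        panic(err)
    }
    raw := poolConn.Conn() // poolConn is now the only handle on the acquired slot
    _ = raw.Close(ctx)     // mimics pgxlisten's defer: the backend conn is closed...
    fmt.Println(pool.Stat().AcquiredConns()) // ...but this still prints 1; the slot is never released
}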

This is a separate bug from #2771 (fixed by #2772): #2772 ensured that reconnect happens at all, while this issue concerns the slot-leak consequence of how reconnect is wired, which #2772 did not address.

The issue reporter's environment, AWS RDS PostgreSQL 15.10 with idle_session_timeout = 1h, triggered exactly one reconnect per hour, with one AcquiredConns increment per cycle.

Fix

Extract the Connect body into a small acquireListenerConn helper that uses poolConn.Hijack() instead of poolConn.Conn(). Hijack transfers ownership of the raw conn out of the pool immediately, so the slot is released right after Acquire. pgxlisten's existing defer conn.Close(ctx) still closes the raw conn cleanly on listener exit — it just no longer leaks bookkeeping.

// after
func acquireListenerConn(ctx context.Context, pool *pgxpool.Pool) (*pgx.Conn, error) {
    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Hijack(), nil
}
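For context, the Listener wiring keeps the same shape and just delegates to the helper. A sketch; the actual call site in multiplexer.go may differ slightly:

// sketch: how the helper plugs into pgxlisten's Connect hook
listener := &pgxlisten.Listener{
    Connect: func(ctx context.Context) (*pgx.Conn, error) {
        return acquireListenerConn(ctx, m.pool)
    },
}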

Behavioral differences vs. the old code:

  • The pool's AcquiredConns count correctly drops back to 0 after Connect returns; no per-reconnect leak.
  • Hijack reduces TotalConns by one (the listener conn is no longer pool-tracked); the next Acquire anywhere in the process opens a fresh backend if needed, up to MaxConns. This is the same net effect as the old code in the steady state, just without the orphaned bookkeeping.
  • Legitimate pgxlisten reconnect path is unchanged — the raw *pgx.Conn it gets and the defer conn.Close(ctx) it runs on exit work identically.
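These claims are directly observable through pool.Stat(). A hedged sketch, assuming a pool variable in scope inside an enclosing function that returns error:

// sketch: observing the fixed bookkeeping
conn, err := acquireListenerConn(ctx, pool)
if err != nil {
    return err
}
fmt.Println(pool.Stat().AcquiredConns()) // 0: Hijack released the slot immediately
// The hijacked conn no longer appears in TotalConns; the pool opens a fresh
// backend (up to MaxConns) for the next Acquire elsewhere if it needs one.
_ = conn.Close(ctx) // on the real listener, pgxlisten's own defer does this on exit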

Tests

Added pkg/repository/multiplexer_listen_test.go with two testcontainer-backed regression guards:

  1. TestAcquireListenerConn_ReleasesPoolSlotImmediately — spins up postgres:15.6, creates a pool (MaxConns=5), calls acquireListenerConn, and asserts pool.Stat().AcquiredConns() is 0 immediately after. Under the old code this would be 1 (and stay at 1 for the life of the process unless the raw conn was somehow returned to the pool — it isn't).

  2. TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit — simulates pgxlisten's reconnect loop by running MaxConns*4 acquire → close cycles on a MaxConns=3 pool. With the slot-leak bug the pool starves by iteration 4; with the fix all 12 iterations succeed and AcquiredConns stays at 0 throughout.

Both tests follow the existing testcontainers pattern in task_partition_test.go (no migrations needed, just a raw pgxpool) and run under the default !e2e && !load && !rampup && !integration build tag alongside the other multiplexer unit tests.
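The second test's loop is essentially the following shape (a sketch, not the verbatim test code; assumes testify's require is in use):

// sketch: simulating pgxlisten's reconnect loop well past MaxConns
for i := 0; i < int(pool.Config().MaxConns)*4; i++ {
    conn, err := acquireListenerConn(ctx, pool)
    require.NoError(t, err) // with the old code, acquisition starts failing once the pool starves
    require.NoError(t, conn.Close(ctx)) // mimic pgxlisten's defer on listener exit
    require.Equal(t, int32(0), pool.Stat().AcquiredConns())
}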

Local gates

  • go build ./... — clean
  • go vet ./pkg/repository/ — clean
  • go test ./pkg/repository/ (non-docker multiplexer tests) — all existing tests pass
  • golangci-lint run --config=.golangci.yml ./pkg/repository/... — the two pre-existing warnings on the multiplexedListener struct (fieldalignment) and newMultiplexedListener (gosec G118 on the never-called cancel) are unchanged; my new code adds zero warnings

Docker was not available on my local machine to run the testcontainer tests end-to-end, but they compile cleanly and will run in CI under the unit job (same infra as task_partition_test.go).

Commit message

fix(repository/multiplexer): hijack listener conn to avoid pool slot leak on reconnect

Closes hatchet-dev#3694.

The previous multiplexer Connect callback did:

    poolConn, err := m.pool.Acquire(ctx)
    if err != nil { return nil, err }
    return poolConn.Conn(), nil

The *pgxpool.Conn wrapper fell out of scope without a matching Release(),
so pgxpool's internal bookkeeping counted the slot as permanently
acquired. When the server side terminated the listener conn (e.g.
idle_session_timeout, pgbouncer server_idle_timeout, L7 idle kill),
pgxlisten's listen() returned with an error and its defer closed the
raw conn — but the orphaned pool wrapper was never released. On the
next reconnect pgxlisten called Connect again and took a fresh slot.
Every reconnect cycle thus leaked one slot until the pool was
exhausted.

This is distinct from the reconnect bug fixed by hatchet-dev#2772 (which ensured
reconnect happens at all); this is the slot-leak *consequence* of how
reconnect is wired.

Fix: extract the Connect body into a small acquireListenerConn helper
that returns poolConn.Hijack() instead of poolConn.Conn(). Hijack
transfers ownership of the raw conn out of the pool immediately, so
the slot is released right after Acquire. pgxlisten's existing
"defer conn.Close(ctx)" still closes the raw conn cleanly on listener
exit — it just no longer leaks bookkeeping.

Includes two testcontainer-backed regression tests:

  - TestAcquireListenerConn_ReleasesPoolSlotImmediately asserts
    Stat().AcquiredConns() is 0 immediately after acquireListenerConn.
  - TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit runs
    more acquire+close cycles than MaxConns; with the old code the
    pool starves within MaxConns iterations.
