
fix(repository/multiplexer): hijack listener conn to avoid pool slot leak on reconnect#3701

Open
voidborne-d wants to merge 1 commit into hatchet-dev:main from voidborne-d:fix/multiplexer-hijack-listener-conn

Conversation

@voidborne-d

Closes #3694 ([BUG] Multiplexed LISTEN connection leaks pool slots on reconnect under server-side idle connection termination).

Root cause

pkg/repository/multiplexer.go's Connect callback handed the raw *pgx.Conn from an acquired pool slot to pgxlisten.Listener without retaining the *pgxpool.Conn wrapper:

// before
Connect: func(ctx context.Context) (*pgx.Conn, error) {
    poolConn, err := m.pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Conn(), nil        // wrapper falls out of scope
},

The *pgxpool.Conn wrapper went out of scope with no Release(), so pgxpool permanently counted the slot as acquired. When the server side terminated the listener conn (idle_session_timeout, pgbouncer server_idle_timeout, an L7 proxy idle kill), pgxlisten's listen() returned with an error and its defer conn.Close(ctx) closed the raw conn, but the pool wrapper was already orphaned.

On reconnect pgxlisten called Connect again and acquired a fresh slot. Every reconnect cycle leaked one pool slot.
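The leak is reproducible outside pgxlisten entirely. A minimal standalone sketch (a hypothetical snippet, not code from this repo; the connection string is a placeholder):

// hypothetical repro of the orphaned-wrapper leak
package main

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
    ctx := context.Background()
    pool, err := pgxpool.New(ctx, "postgres://...") // any reachable database
    if err != nil {
        panic(err)
    }
    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        panic(err)
    }
    raw := poolConn.Conn() // poolConn is now the only handle on the acquired slot
    _ = raw.Close(ctx)     // mimics pgxlisten's defer: the backend conn is closed...
    fmt.Println(pool.Stat().AcquiredConns()) // ...but this still prints 1; the slot is never released
}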

This is a separate bug from #2771 (fixed by #2772): #2772 ensured that reconnect happens at all, while this issue concerns the slot-leak consequence of how reconnect is wired, which #2772 did not address.

The issue reporter's environment, AWS RDS PostgreSQL 15.10 with idle_session_timeout = 1h, triggered exactly one reconnect per hour, with one AcquiredConns increment per cycle.

Fix

Extract the Connect body into a small acquireListenerConn helper that uses poolConn.Hijack() instead of poolConn.Conn(). Hijack transfers ownership of the raw conn out of the pool immediately, so the slot is released right after Acquire. pgxlisten's existing defer conn.Close(ctx) still closes the raw conn cleanly on listener exit — it just no longer leaks bookkeeping.

// after
func acquireListenerConn(ctx context.Context, pool *pgxpool.Pool) (*pgx.Conn, error) {
    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Hijack(), nil
}
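For context, the Listener wiring keeps the same shape and just delegates to the helper. A sketch; the actual call site in multiplexer.go may differ slightly:

// sketch: how the helper plugs into pgxlisten's Connect hook
listener := &pgxlisten.Listener{
    Connect: func(ctx context.Context) (*pgx.Conn, error) {
        return acquireListenerConn(ctx, m.pool)
    },
}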

Behavioral differences vs. the old code:

  • The pool's AcquiredConns count correctly drops back to 0 after Connect returns; no per-reconnect leak.
  • Hijack reduces TotalConns by one (the listener conn is no longer pool-tracked); the next Acquire anywhere in the process opens a fresh backend if needed, up to MaxConns. This is the same net effect as the old code in the steady state, just without the orphaned bookkeeping.
  • Legitimate pgxlisten reconnect path is unchanged — the raw *pgx.Conn it gets and the defer conn.Close(ctx) it runs on exit work identically.
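These claims are directly observable through pool.Stat(). A hedged sketch, assuming a pool variable in scope inside an enclosing function that returns error:

// sketch: observing the fixed bookkeeping
conn, err := acquireListenerConn(ctx, pool)
if err != nil {
    return err
}
fmt.Println(pool.Stat().AcquiredConns()) // 0: Hijack released the slot immediately
// The hijacked conn no longer appears in TotalConns; the pool opens a fresh
// backend (up to MaxConns) for the next Acquire elsewhere if it needs one.
_ = conn.Close(ctx) // on the real listener, pgxlisten's own defer does this on exit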

Tests

Added pkg/repository/multiplexer_listen_test.go with two testcontainer-backed regression guards:

  1. TestAcquireListenerConn_ReleasesPoolSlotImmediately — spins up postgres:15.6, creates a pool (MaxConns=5), calls acquireListenerConn, and asserts pool.Stat().AcquiredConns() is 0 immediately after. Under the old code this would be 1 (and stay at 1 for the life of the process unless the raw conn was somehow returned to the pool — it isn't).

  2. TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit — simulates pgxlisten's reconnect loop by running MaxConns*4 acquire → close cycles on a MaxConns=3 pool. With the slot-leak bug the pool starves by iteration 4; with the fix all 12 iterations succeed and AcquiredConns stays at 0 throughout.

Both tests follow the existing testcontainers pattern in task_partition_test.go (no migrations needed, just a raw pgxpool) and run under the default !e2e && !load && !rampup && !integration build tag alongside the other multiplexer unit tests.
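The second test's loop is essentially the following shape (a sketch, not the verbatim test code; assumes testify's require is in use):

// sketch: simulating pgxlisten's reconnect loop well past MaxConns
for i := 0; i < int(pool.Config().MaxConns)*4; i++ {
    conn, err := acquireListenerConn(ctx, pool)
    require.NoError(t, err) // with the old code, acquisition starts failing once the pool starves
    require.NoError(t, conn.Close(ctx)) // mimic pgxlisten's defer on listener exit
    require.Equal(t, int32(0), pool.Stat().AcquiredConns())
}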

Local gates

  • go build ./... — clean
  • go vet ./pkg/repository/ — clean
  • go test ./pkg/repository/ (non-docker multiplexer tests) — all existing tests pass
  • golangci-lint run --config=.golangci.yml ./pkg/repository/... — the two pre-existing warnings on the multiplexedListener struct (fieldalignment) and newMultiplexedListener (gosec G118 on the never-called cancel) are unchanged; my new code adds zero warnings

Docker was not available on my local machine to run the testcontainer tests end-to-end, but they compile cleanly and will run in CI under the unit job (same infra as task_partition_test.go).

Commit message

fix(repository/multiplexer): hijack listener conn to avoid pool slot leak on reconnect

Closes hatchet-dev#3694.

The previous multiplexer Connect callback did:

    poolConn, err := m.pool.Acquire(ctx)
    if err != nil { return nil, err }
    return poolConn.Conn(), nil

The *pgxpool.Conn wrapper fell out of scope without a matching Release(),
so pgxpool's internal bookkeeping counted the slot as permanently
acquired. When the server side terminated the listener conn (e.g.
idle_session_timeout, pgbouncer server_idle_timeout, L7 idle kill),
pgxlisten's listen() returned with an error and its defer closed the
raw conn — but the orphaned pool wrapper was never released. On the
next reconnect pgxlisten called Connect again and took a fresh slot.
Every reconnect cycle thus leaked one slot until the pool was
exhausted.

This is distinct from the reconnect bug fixed by hatchet-dev#2772 (which ensured
reconnect happens at all); this is the slot-leak *consequence* of how
reconnect is wired.

Fix: extract the Connect body into a small acquireListenerConn helper
that returns poolConn.Hijack() instead of poolConn.Conn(). Hijack
transfers ownership of the raw conn out of the pool immediately, so
the slot is released right after Acquire. pgxlisten's existing
"defer conn.Close(ctx)" still closes the raw conn cleanly on listener
exit — it just no longer leaks bookkeeping.

Includes two testcontainer-backed regression tests:

  - TestAcquireListenerConn_ReleasesPoolSlotImmediately asserts
    Stat().AcquiredConns() is 0 immediately after acquireListenerConn.
  - TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit runs
    more acquire+close cycles than MaxConns; with the old code the
    pool starves within MaxConns iterations.
