fix(repository/multiplexer): hijack listener conn to avoid pool slot leak on reconnect#3701
Open
voidborne-d wants to merge 1 commit intohatchet-dev:mainfrom
Open
Conversation
…leak on reconnect Closes hatchet-dev#3694. The previous multiplexer Connect callback did: poolConn, err := m.pool.Acquire(ctx) if err != nil { return nil, err } return poolConn.Conn(), nil The *pgxpool.Conn wrapper fell out of scope without a matching Release(), so pgxpool's internal bookkeeping counted the slot as permanently acquired. When the server-side terminated the listener conn (e.g. idle_session_timeout, pgbouncer server_idle_timeout, L7 idle kill), pgxlisten's listen() returned with an error and its defer closed the raw conn — but the orphaned pool wrapper was never released. On the next reconnect pgxlisten called Connect again and took a fresh slot. Every reconnect cycle thus leaked one slot until the pool was exhausted. This is distinct from the reconnect bug fixed by hatchet-dev#2772 (which ensured reconnect happens at all); this is the slot-leak *consequence* of how reconnect is wired. Fix: extract the Connect body into a small acquireListenerConn helper that returns poolConn.Hijack() instead of poolConn.Conn(). Hijack transfers ownership of the raw conn out of the pool immediately, so the slot is released right after Acquire. pgxlisten's existing "defer conn.Close(ctx)" still closes the raw conn cleanly on listener exit — it just no longer leaks bookkeeping. Includes two testcontainer-backed regression tests: - TestAcquireListenerConn_ReleasesPoolSlotImmediately asserts Stat().AcquiredConns() is 0 immediately after acquireListenerConn. - TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit runs more acquire+close cycles than MaxConns; with the old code the pool starves within MaxConns iterations.
|
Someone is attempting to deploy a commit to the Hatchet Team on Vercel. A member of the Team first needs to authorize it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3694.
Root cause
pkg/repository/multiplexer.go'sConnectcallback handed the raw*pgx.Connfrom an acquired pool slot topgxlisten.Listenerwithout retaining the*pgxpool.Connwrapper:The
*pgxpool.Connwrapper went out of scope with noRelease(), so pgxpool permanently counted the slot as acquired. When the server-side terminated the listener conn (idle_session_timeout, pgbouncerserver_idle_timeout, an L7 proxy idle kill), pgxlisten'slisten()returned with an error and itsdefer conn.Close(ctx)closed the raw conn — but the pool wrapper was already orphaned.On reconnect pgxlisten called
Connectagain and acquired a fresh slot. Every reconnect cycle leaked one pool slot.This is a separate bug from #2771 (fixed by #2772). #2772 ensured reconnect happens at all. This issue concerns the slot-leak consequence of how reconnect is wired — #2772 didn't address it.
Issue reporter's environment: AWS RDS PG 15.10 with
idle_session_timeout = 1htriggered exactly one reconnect-per-hour, with oneAcquiredConnsincrement per cycle.Fix
Extract the Connect body into a small
acquireListenerConnhelper that usespoolConn.Hijack()instead ofpoolConn.Conn().Hijacktransfers ownership of the raw conn out of the pool immediately, so the slot is released right afterAcquire. pgxlisten's existingdefer conn.Close(ctx)still closes the raw conn cleanly on listener exit — it just no longer leaks bookkeeping.Behavioral differences vs. the old code:
AcquiredConnscount correctly drops back to 0 afterConnectreturns. No per-reconnect leak.TotalConnsby one (the listener conn is no longer pool-tracked); the nextAcquireanywhere in the process opens a fresh backend if needed, up toMaxConns. This is the same net effect as the old code in the steady state, just without the orphaned bookkeeping.*pgx.Connit gets and thedefer conn.Close(ctx)it runs on exit work identically.Tests
Added
pkg/repository/multiplexer_listen_test.gowith two testcontainer-backed regression guards:TestAcquireListenerConn_ReleasesPoolSlotImmediately— spins uppostgres:15.6, creates a pool (MaxConns=5), callsacquireListenerConn, and assertspool.Stat().AcquiredConns()is0immediately after. Under the old code this would be1(and stay at 1 for the life of the process unless the raw conn was somehow returned to the pool — it isn't).TestAcquireListenerConn_SurvivesReconnectCyclePastPoolLimit— simulates pgxlisten's reconnect loop by runningMaxConns*4acquire → close cycles on aMaxConns=3pool. With the slot-leak bug the pool starves by iteration 4; with the fix all 12 iterations succeed andAcquiredConnsstays at 0 throughout.Both tests follow the existing testcontainers pattern in
task_partition_test.go(no migrations needed, just a raw pgxpool) and run under the default!e2e && !load && !rampup && !integrationbuild tag alongside the other multiplexer unit tests.Local gates
go build ./...— cleango vet ./pkg/repository/— cleango test ./pkg/repository/(non-docker multiplexer tests) — all existing tests passgolangci-lint run --config=.golangci.yml ./pkg/repository/...— the two pre-existing warnings on themultiplexedListenerstruct (fieldalignment) andnewMultiplexedListener(gosec G118 on the never-calledcancel) are unchanged; my new code adds zero warningsDocker was not available on my local machine to run the testcontainer tests end-to-end, but they compile clean and will run in CI under the
unitjob (same infra astask_partition_test.go).