fix(env-client): reconnect when connect() is called from a different loop#860
Closed
acharyaanusha wants to merge 1 commit into
Closed
fix(env-client): reconnect when connect() is called from a different loop#860acharyaanusha wants to merge 1 commit into
acharyaanusha wants to merge 1 commit into
Conversation
…loop
EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:
client = await SomeClient.from_env(...) # connects inside this loop
with client.sync() as sync_client: # drives calls on a NEW,
sync_client.reset() # separate background loop
`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.
Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.
Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
EnvClient.connect()has a guard --if self._ws is not None: return self-- that no-ops regardless of which event loop established that websocket. This breaks the officially documented pattern (docs/source/guides/async-sync.md):from_env()ends withawait client.connect(), binding_wsto whichever loop ran that await (e.g. anasyncio.run()call, which is closed by the timefrom_env()returns)..sync()then drives every later call throughSyncEnvClient's own dedicated background-thread loop viarun_coroutine_threadsafe.SyncEnvClient.connect()does callself._async.connect(), but the no-op guard returns immediately since_wsis already set -- so the websocket never gets rebound to the loop that's actually being used, and everyreset()/step()call schedules work on a live loop while operating on a connection object tied to a dead one. The result is a hang or an asyncio "Future attached to a different loop" error.Found this while building a training example (#853) that does exactly this combination. It's not specific to that example -- any
from_env()+.sync()caller hits it, including the pattern in this repo's own async-sync guide.Fix
EnvClientnow tracks which loop created_ws(self._ws_loop).connect()only treats an existing_wsas reusable if the current running loop matches_ws_loop; otherwise it drops the stale reference (unusable anyway, since its loop is typically already closed) and connects fresh on the current loop.disconnect()clears_ws_loopalongside_wsfor consistency.Test plan
TestForeignLoopReconnectintests/test_core/test_generic_client.py(3 cases): same-loop reconnect stays a no-op (no extraws_connectcall), a foreign-loop reconnect actually re-establishes the connection on the current loop, anddisconnect()clears the loop-tracking state.pytest tests/test_core/ -q-- 190 passed, 11 pre-existing skips.ruff format --check/ruff checkclean.🤖 Generated with Claude Code