fix(env-client): reconnect when connect() is called from a different loop#860

Closed
acharyaanusha wants to merge 1 commit into
huggingface:mainfrom
acharyaanusha:fix/sync-client-foreign-loop

@acharyaanusha

Contributor

Summary

EnvClient.connect() has a guard -- if self._ws is not None: return self -- that no-ops regardless of which event loop established that websocket. This breaks the officially documented pattern (docs/source/guides/async-sync.md):

client = await SomeClient.from_env(...)   # connects inside this loop
with client.sync() as sync_client:        # drives calls on a NEW, separate
    sync_client.reset()                   # background loop

from_env() ends with await client.connect(), binding _ws to whichever loop ran that await (e.g. the loop created by an asyncio.run() call, which is closed as soon as that call returns). .sync() then drives every later call through SyncEnvClient's own dedicated background-thread loop via run_coroutine_threadsafe. SyncEnvClient.connect() does call self._async.connect(), but the no-op guard returns immediately because _ws is already set. The websocket is therefore never rebound to the loop that is actually in use, and every reset()/step() call schedules work on a live loop while operating on a connection object tied to a dead one. The result is a hang or an asyncio "Future attached to a different loop" error.
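For reviewers who want to see the failure mode in isolation, here is a minimal sketch that uses only the standard library and no EnvClient code. It assumes a bare asyncio.Future as a stand-in for the loop-bound websocket; the helper names (make_loop_bound_future, await_on_current_loop, demo) are hypothetical and not part of this repo.

```python
import asyncio
import threading


async def make_loop_bound_future() -> asyncio.Future:
    # Stand-in for `_ws`: an object bound to the loop that created it.
    return asyncio.get_running_loop().create_future()


async def await_on_current_loop(fut: asyncio.Future) -> None:
    # Awaiting a future that belongs to another loop raises RuntimeError.
    await fut


def demo() -> str:
    # Step 1: create the object inside a short-lived loop (like from_env()).
    stale = asyncio.run(make_loop_bound_future())

    # Step 2: drive later work on a dedicated background-thread loop
    # (like the sync wrapper does via run_coroutine_threadsafe).
    bg_loop = asyncio.new_event_loop()
    thread = threading.Thread(target=bg_loop.run_forever, daemon=True)
    thread.start()
    try:
        handle = asyncio.run_coroutine_threadsafe(
            await_on_current_loop(stale), bg_loop
        )
        handle.result(timeout=5)
    except RuntimeError as exc:
        return str(exc)
    finally:
        bg_loop.call_soon_threadsafe(bg_loop.stop)
        thread.join()
        bg_loop.close()
    return "no error"


if __name__ == "__main__":
    print(demo())
```

With a real websocket the symptom may instead be a hang, depending on which internal primitive is touched first.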

Found this while building a training example (#853) that does exactly this combination. It's not specific to that example -- any from_env() + .sync() caller hits it, including the pattern in this repo's own async-sync guide.

Fix

  • EnvClient now tracks which loop created _ws (self._ws_loop).
  • connect() only treats an existing _ws as reusable if the current running loop matches _ws_loop; otherwise it drops the stale reference (unusable anyway, since its loop is typically already closed) and connects fresh on the current loop.
  • disconnect() clears _ws_loop alongside _ws for consistency.
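The three bullets above can be sketched roughly as follows. This is an illustration of the loop-tracking guard under stated assumptions, not the actual diff: LoopAwareClient, FakeWebSocket, _open and connect_count are hypothetical stand-ins, and only _ws and _ws_loop are names taken from this PR.

```python
import asyncio
from typing import Optional


class FakeWebSocket:
    """Hypothetical stand-in for a real websocket connection."""


class LoopAwareClient:
    """Hypothetical client showing the loop-tracking guard."""

    def __init__(self) -> None:
        self._ws: Optional[FakeWebSocket] = None
        self._ws_loop: Optional[asyncio.AbstractEventLoop] = None
        self.connect_count = 0  # illustration only: counts real connects

    async def _open(self) -> FakeWebSocket:
        # A real client would open the websocket here.
        self.connect_count += 1
        return FakeWebSocket()

    async def connect(self) -> "LoopAwareClient":
        loop = asyncio.get_running_loop()
        # Reuse the connection only if it was created on this same loop.
        if self._ws is not None and self._ws_loop is loop:
            return self
        # Otherwise drop the stale reference and connect on the current loop.
        self._ws = None
        self._ws = await self._open()
        self._ws_loop = loop
        return self

    async def disconnect(self) -> None:
        # Clear the loop-tracking state together with the connection.
        self._ws = None
        self._ws_loop = None
```

Calling connect() twice inside one asyncio.run() opens a single connection, while calling it again from a second asyncio.run() opens a new one, because each asyncio.run() creates its own loop.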

Test plan

  • Added TestForeignLoopReconnect in tests/test_core/test_generic_client.py (3 cases): same-loop reconnect stays a no-op (no extra ws_connect call), a foreign-loop reconnect actually re-establishes the connection on the current loop, and disconnect() clears the loop-tracking state.
  • pytest tests/test_core/ -q -- 190 passed, 11 pre-existing skips.
  • ruff format --check / ruff check clean.

🤖 Generated with Claude Code

…loop

EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:

    client = await SomeClient.from_env(...)   # connects inside this loop
    with client.sync() as sync_client:        # drives calls on a NEW,
        sync_client.reset()                   # separate background loop

`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.

Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.

Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.