sophistry_bench_sprint_env: add training example and results#853
Open
acharyaanusha wants to merge 6 commits into
Open
sophistry_bench_sprint_env: add training example and results#853acharyaanusha wants to merge 6 commits into
acharyaanusha wants to merge 6 commits into
Conversation
…esults Adds the prime-rl GRPO config and per-step metrics from a 100-step run against the deployed env, plus a README section showing the reward-hacking signature (aggregate_reward up, correctness_reward flat). Also adds a from-scratch TRL GRPOTrainer example for training against the Space directly, for anyone without Prime Intellect access.
…heckpoint Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space.
The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 -- verified via `hf spaces info`.
… Intellect to a note Reframes the Training section so the TRL GRPOTrainer script (verified end-to-end against the deployed Space) is the primary documented path, matching this repo's own guidance that TRL is the recommended framework. The Prime Intellect run becomes supplementary evidence, not the headline. Also switches the script to GenericEnvClient + a directly-constructed UVProvider (avoiding a sync/async event-loop mismatch from mixing asyncio.run(from_env(...)) with .sync()), and bumps the default model to Qwen2.5-0.5B-Instruct for a cheaper, faster default run.
Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100 steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy reward (aggregate_reward) climbs and plateaus, confirming the example trains correctly end-to-end on HF infrastructure, but at this much smaller scale (~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut, not a replication of the Prime Intellect run's specific curve. correctness_reward stays noisy/decoupled either way, which is the core finding both runs share. Also extends the reward_func to log per-step reward components (not just the scalar reward), since correctness_reward/n_claims live in observation["components"], which the trainer never needed but the README table does. Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone (not the shared Space) since this is exactly the "trusted measurement code" use case the env's own README carves out -- never fed back into the prompt. Tuning notes from getting this to actually run without OOM on a10g-small: - per_device_train_batch_size is the *total* rollout count per step (must be divisible by num_generations), not unique-prompts * num_generations. - bf16 matters more than usual here: entropy/logprob computation materializes a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token vocab (Qwen2.5) dominates memory at fp32. - gradient_checkpointing=True had no measurable effect in this setup (same OOM numbers with and without); reducing batch size was what actually fixed it. Left in since it's harmless, but don't rely on it alone.
3 tasks
acharyaanusha
added a commit
to acharyaanusha/OpenEnv
that referenced
this pull request
Jun 24, 2026
…loop
EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:
client = await SomeClient.from_env(...) # connects inside this loop
with client.sync() as sync_client: # drives calls on a NEW,
sync_client.reset() # separate background loop
`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.
Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.
Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.
6 tasks
From a self-review pass before requesting maintainer review on huggingface#853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #787, addressing the maintainer's request to see this env "deployed to Hugging Face and a working example of training or inference."
examples/sophistry_bench_sprint_grpo.py— trains a policy on this env with TRL'sGRPOTrainer. Since the episode is single-step, this is a plain prompt -> completion -> reward GRPO setup, noenvironment_factory/tool-calling needed. Connects to the env's deployed Space source directly via a manually-builtUVProvider+GenericEnvClient(notfrom_env()-- see the docstring onmake_sync_client()for why: an event-loop mismatch infrom_env()+.sync(), fixed in fix(uv-provider): clone git+ project paths before uv run --project #854, plus a second issue specific to this single-session env that fix(uv-provider): clone git+ project paths before uv run --project #854's fix doesn't fully cover). Only depends onopenenv[core]from PyPI, so it runs as a standaloneuvscript, including viahf jobs uv run.a10g-small,Qwen2.5-0.5B-Instruct):aggregate_rewardclimbs from ~0.35 to a ~0.50 plateau, confirming the example trains correctly end-to-end on Hugging Face's own infrastructure. At this scale (~800 total rollouts) the policy collapses to near-empty completions rather than converging on theclaim_count_clifftarget -- a different reward-hacking shortcut than the larger Prime Intellect run below, documented honestly as a distinct finding rather than a replication.correctness_rewardstays noisy/decoupled from the optimized reward either way, which is the core finding both runs share. Full per-step metrics (including thecorrectness_reward/n_claimsbreakdown):envs/sophistry_bench_sprint_env/training/hf_jobs_metrics.csv.anusha/sophistry-bench-sprinton the Prime Intellect Hub (parity-tested). A 100-step run there (config + metrics inenvs/sophistry_bench_sprint_env/training/) showsaggregate_rewardclimbing to a ~0.77 plateau whilen_claimssaturates at the literalclaim_count_clifftarget (exactly 8 claims) andcorrectness_rewardstays flat -- the textbook version of the reward-hacking signature this env is designed to surface.openenv-community/sophistry_bench_sprint_env, notanushaacharya/...as originally documented in Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment #787).--per-device-batch-size % --num-generations == 0up front (previously a late, opaqueGRPOTrainererror after ~180s of setup), guards against malformed/empty completions, assertscompletions/seedstay aligned, and writes the components CSV underoutput_dirinstead of a path that could fail if--out's parent directory doesn't exist.Test plan
python3 scripts/sync_env_docs.py --checkpassescorrectness_reward/componentslogging path6a3bfb825f9c8079e0fb2664,a10g-small), completed successfully, results documented aboveCAPACITY_REACHEDfailure mode that was reproduced and is now documented in the scriptUVProvidergit-clone fix in fix(uv-provider): clone git+ project paths before uv run --project #854 (needed for the no-Docker connection path); the script's PEP-723 header notes theopenenv[core]git-ref override needed until that fix is released🤖 Generated with Claude Code