sophistry_bench_sprint_env: add training example and results by acharyaanusha · Pull Request #853 · huggingface/OpenEnv

acharyaanusha · 2026-06-24T00:32:25Z

Summary

Follow-up to #787, addressing the maintainer's request to see this env "deployed to Hugging Face and a working example of training or inference."

examples/sophistry_bench_sprint_grpo.py — trains a policy on this env with TRL's GRPOTrainer. Since the episode is single-step, this is a plain prompt -> completion -> reward GRPO setup, no environment_factory/tool-calling needed. Connects to the env's deployed Space source directly via a manually-built UVProvider + GenericEnvClient (not from_env() -- see the docstring on make_sync_client() for why: an event-loop mismatch in from_env()+.sync(), fixed in fix(uv-provider): clone git+ project paths before uv run --project #854, plus a second issue specific to this single-session env that fix(uv-provider): clone git+ project paths before uv run --project #854's fix doesn't fully cover). Only depends on openenv[core] from PyPI, so it runs as a standalone uv script, including via hf jobs uv run.
Real 100-step run on Hugging Face Jobs (a10g-small, Qwen2.5-0.5B-Instruct): aggregate_reward climbs from ~0.35 to a ~0.50 plateau, confirming the example trains correctly end-to-end on Hugging Face's own infrastructure. At this scale (~800 total rollouts) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut than the larger Prime Intellect run below, documented honestly as a distinct finding rather than a replication. correctness_reward stays noisy/decoupled from the optimized reward either way, which is the core finding both runs share. Full per-step metrics (including the correctness_reward/n_claims breakdown): envs/sophistry_bench_sprint_env/training/hf_jobs_metrics.csv.
Also validated on Prime Intellect: the same scoring is registered as anusha/sophistry-bench-sprint on the Prime Intellect Hub (parity-tested). A 100-step run there (config + metrics in envs/sophistry_bench_sprint_env/training/) shows aggregate_reward climbing to a ~0.77 plateau while n_claims saturates at the literal claim_count_cliff target (exactly 8 claims) and correctness_reward stays flat -- the textbook version of the reward-hacking signature this env is designed to surface.
Fixed a stale repo id throughout (env was actually deployed at openenv-community/sophistry_bench_sprint_env, not anushaacharya/... as originally documented in Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment #787).
Hardened the script per self-review: validates --per-device-batch-size % --num-generations == 0 up front (previously a late, opaque GRPOTrainer error after ~180s of setup), guards against malformed/empty completions, asserts completions/seed stay aligned, and writes the components CSV under output_dir instead of a path that could fail if --out's parent directory doesn't exist.

Test plan

python3 scripts/sync_env_docs.py --check passes
Local smoke tests verified end-to-end against the live deployed Space, including the correctness_reward/components logging path
Real 100-step GRPO run executed on Hugging Face Jobs (job 6a3bfb825f9c8079e0fb2664, a10g-small), completed successfully, results documented above
Self-reviewed (8-angle pass) before requesting review; see commit messages for what was found and fixed, including a real CAPACITY_REACHED failure mode that was reproduced and is now documented in the script
Depends on the UVProvider git-clone fix in fix(uv-provider): clone git+ project paths before uv run --project #854 (needed for the no-Docker connection path); the script's PEP-723 header notes the openenv[core] git-ref override needed until that fix is released

🤖 Generated with Claude Code

…esults Adds the prime-rl GRPO config and per-step metrics from a 100-step run against the deployed env, plus a README section showing the reward-hacking signature (aggregate_reward up, correctness_reward flat). Also adds a from-scratch TRL GRPOTrainer example for training against the Space directly, for anyone without Prime Intellect access.

…heckpoint Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space.

The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 -- verified via `hf spaces info`.

… Intellect to a note Reframes the Training section so the TRL GRPOTrainer script (verified end-to-end against the deployed Space) is the primary documented path, matching this repo's own guidance that TRL is the recommended framework. The Prime Intellect run becomes supplementary evidence, not the headline. Also switches the script to GenericEnvClient + a directly-constructed UVProvider (avoiding a sync/async event-loop mismatch from mixing asyncio.run(from_env(...)) with .sync()), and bumps the default model to Qwen2.5-0.5B-Instruct for a cheaper, faster default run.

Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100 steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy reward (aggregate_reward) climbs and plateaus, confirming the example trains correctly end-to-end on HF infrastructure, but at this much smaller scale (~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut, not a replication of the Prime Intellect run's specific curve. correctness_reward stays noisy/decoupled either way, which is the core finding both runs share. Also extends the reward_func to log per-step reward components (not just the scalar reward), since correctness_reward/n_claims live in observation["components"], which the trainer never needed but the README table does. Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone (not the shared Space) since this is exactly the "trusted measurement code" use case the env's own README carves out -- never fed back into the prompt. Tuning notes from getting this to actually run without OOM on a10g-small: - per_device_train_batch_size is the *total* rollout count per step (must be divisible by num_generations), not unique-prompts * num_generations. - bf16 matters more than usual here: entropy/logprob computation materializes a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token vocab (Qwen2.5) dominates memory at fp32. - gradient_checkpointing=True had no measurable effect in this setup (same OOM numbers with and without); reducing batch size was what actually fixed it. Left in since it's harmless, but don't rely on it alone.

…loop EnvClient.connect() had a guard `if self._ws is not None: return self` that no-ops regardless of which event loop established that websocket. This breaks the officially documented pattern: client = await SomeClient.from_env(...) # connects inside this loop with client.sync() as sync_client: # drives calls on a NEW, sync_client.reset() # separate background loop `from_env()` ends with `await client.connect()`, binding `_ws` to whichever loop ran that await (e.g. an `asyncio.run()` call, which is closed by the time `from_env()` returns). `.sync()` then drives every later call through `SyncEnvClient`'s own dedicated background-thread loop via `run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call `self._async.connect()`, but the no-op guard returns immediately since `_ws` is already set -- so the websocket never gets rebound to the loop that's actually being used, and every `reset()`/`step()` call schedules work on a live loop while operating on a connection object tied to a dead one. Found while building a training example that does exactly this (huggingface#853) -- not specific to that example; any `from_env()` + `.sync()` caller hits it. Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if the *current* running loop matches; otherwise it drops the stale reference (unusable anyway, since its loop is typically already closed) and connects fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`. Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a no-op, a foreign-loop reconnect actually re-establishes the connection, and disconnect() clears the loop-tracking state.

From a self-review pass before requesting maintainer review on huggingface#853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.

acharyaanusha added 5 commits June 23, 2026 17:30

examples(sophistry_bench_sprint_grpo): add --push-to-hub to publish c…

00d48a1

…heckpoint Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space.

fix(sophistry_bench_sprint_env): correct deployed Space repo id

8642072

The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 -- verified via `hf spaces info`.

acharyaanusha mentioned this pull request Jun 24, 2026

fix(env-client): reconnect when connect() is called from a different loop #860

Closed

3 tasks

acharyaanusha mentioned this pull request Jun 24, 2026

fix(uv-provider): clone git+ project paths before uv run --project #854

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

sophistry_bench_sprint_env: add training example and results#853

sophistry_bench_sprint_env: add training example and results#853
acharyaanusha wants to merge 6 commits into
huggingface:mainfrom
acharyaanusha:feature/sophistry-bench-sprint-grpo-training

acharyaanusha commented Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

acharyaanusha commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

acharyaanusha commented Jun 24, 2026 •

edited

Loading