
sophistry_bench_sprint_env: add training example and results#853

Open
acharyaanusha wants to merge 6 commits into huggingface:main from acharyaanusha:feature/sophistry-bench-sprint-grpo-training


acharyaanusha commented Jun 24, 2026

Contributor

Summary

Follow-up to #787, addressing the maintainer's request to see this env "deployed to Hugging Face and a working example of training or inference."

  • examples/sophistry_bench_sprint_grpo.py: trains a policy on this env with TRL's GRPOTrainer. Since the episode is single-step, this is a plain prompt -> completion -> reward GRPO setup; no environment_factory/tool-calling is needed. The script connects to the env's deployed Space source directly via a manually built UVProvider + GenericEnvClient rather than from_env(). The docstring on make_sync_client() explains why: an event-loop mismatch in from_env() + .sync(), fixed in #854 (fix(uv-provider): clone git+ project paths before uv run --project), plus a second issue specific to this single-session env that the #854 fix doesn't fully cover. The script depends only on openenv[core] from PyPI, so it runs as a standalone uv script, including via hf jobs uv run.
  • Real 100-step run on Hugging Face Jobs (a10g-small, Qwen2.5-0.5B-Instruct): aggregate_reward climbs from ~0.35 to a ~0.50 plateau, confirming the example trains correctly end-to-end on Hugging Face's own infrastructure. At this scale (~800 total rollouts) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut than the larger Prime Intellect run below, documented honestly as a distinct finding rather than a replication. correctness_reward stays noisy/decoupled from the optimized reward either way, which is the core finding both runs share. Full per-step metrics (including the correctness_reward/n_claims breakdown): envs/sophistry_bench_sprint_env/training/hf_jobs_metrics.csv.
  • Also validated on Prime Intellect: the same scoring is registered as anusha/sophistry-bench-sprint on the Prime Intellect Hub (parity-tested). A 100-step run there (config + metrics in envs/sophistry_bench_sprint_env/training/) shows aggregate_reward climbing to a ~0.77 plateau while n_claims saturates at the literal claim_count_cliff target (exactly 8 claims) and correctness_reward stays flat -- the textbook version of the reward-hacking signature this env is designed to surface.
  • Fixed a stale repo id throughout: the env is actually deployed at openenv-community/sophistry_bench_sprint_env, not anushaacharya/... as originally documented in #787 (Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment).
  • Hardened the script per self-review: validates --per-device-batch-size % --num-generations == 0 up front (previously a late, opaque GRPOTrainer error after ~180s of setup), guards against malformed/empty completions, asserts completions/seed stay aligned, and writes the components CSV under output_dir instead of a path that could fail if --out's parent directory doesn't exist.
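
The reward-hacking signature described in the summary (proxy reward rising while correctness_reward stays flat or noisy) can be checked mechanically once the two per-step series are loaded. The following is only a sketch under assumptions: the function names, the window size, and both thresholds are hypothetical choices for illustration and are not taken from the script or the metrics CSV.

```python
from statistics import mean
from typing import Sequence


def window_delta(values: Sequence[float], k: int = 10) -> float:
    """Mean of the last k values minus the mean of the first k values."""
    if len(values) < 2 * k:
        raise ValueError(f"need at least {2 * k} values, got {len(values)}")
    return mean(values[-k:]) - mean(values[:k])


def shows_hacking_signature(
    proxy: Sequence[float],
    truth: Sequence[float],
    k: int = 10,
    min_proxy_gain: float = 0.1,   # arbitrary illustrative threshold
    max_truth_gain: float = 0.02,  # arbitrary illustrative threshold
) -> bool:
    """True when the proxy reward rose while the ground-truth signal did not."""
    return (
        window_delta(proxy, k) >= min_proxy_gain
        and window_delta(truth, k) <= max_truth_gain
    )
```

Here `proxy` would be the per-step aggregate_reward series and `truth` the per-step correctness_reward series. Comparing window means rather than single steps keeps the check from being dominated by step-to-step noise.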

Test plan

  • python3 scripts/sync_env_docs.py --check passes
  • Local smoke tests verified end-to-end against the live deployed Space, including the correctness_reward/components logging path
  • Real 100-step GRPO run executed on Hugging Face Jobs (job 6a3bfb825f9c8079e0fb2664, a10g-small), completed successfully, results documented above
  • Self-reviewed (8-angle pass) before requesting review; see commit messages for what was found and fixed, including a real CAPACITY_REACHED failure mode that was reproduced and is now documented in the script
  • Depends on the UVProvider git-clone fix in #854 (fix(uv-provider): clone git+ project paths before uv run --project), which is needed for the no-Docker connection path; the script's PEP-723 header notes the openenv[core] git-ref override needed until that fix is released

🤖 Generated with Claude Code

…esults

Adds the prime-rl GRPO config and per-step metrics from a 100-step run
against the deployed env, plus a README section showing the reward-hacking
signature (aggregate_reward up, correctness_reward flat). Also adds a
from-scratch TRL GRPOTrainer example for training against the Space
directly, for anyone without Prime Intellect access.
…heckpoint

Closes the gap between local training output and an actual Hub artifact,
matching the maintainer's "deployed to Hugging Face" ask for the training
side, not just the Space.
The env is actually hosted at openenv-community/sophistry_bench_sprint_env,
not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 --
verified via `hf spaces info`.
… Intellect to a note

Reframes the Training section so the TRL GRPOTrainer script (verified
end-to-end against the deployed Space) is the primary documented path,
matching this repo's own guidance that TRL is the recommended framework.
The Prime Intellect run becomes supplementary evidence, not the headline.

Also switches the script to GenericEnvClient + a directly-constructed
UVProvider (avoiding a sync/async event-loop mismatch from mixing
asyncio.run(from_env(...)) with .sync()), and bumps the default model to
Qwen2.5-0.5B-Instruct for a cheaper, faster default run.
Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100
steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy
reward (aggregate_reward) climbs and plateaus, confirming the example trains
correctly end-to-end on HF infrastructure, but at this much smaller scale
(~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy
collapses to near-empty completions rather than converging on the
claim_count_cliff target -- a different reward-hacking shortcut, not a
replication of the Prime Intellect run's specific curve. correctness_reward
stays noisy/decoupled either way, which is the core finding both runs share.

Also extends the reward_func to log per-step reward components (not just the
scalar reward), since correctness_reward/n_claims live in
observation["components"], which the trainer never needed but the README
table does. Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone
(not the shared Space) since this is exactly the "trusted measurement code"
use case the env's own README carves out -- never fed back into the prompt.

Tuning notes from getting this to actually run without OOM on a10g-small:
- per_device_train_batch_size is the *total* rollout count per step (must be
  divisible by num_generations), not unique-prompts * num_generations.
- bf16 matters more than usual here: entropy/logprob computation materializes
  a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token
  vocab (Qwen2.5) dominates memory at fp32.
- gradient_checkpointing=True had no measurable effect in this setup (same
  OOM numbers with and without); reducing batch size was what actually fixed
  it. Left in since it's harmless, but don't rely on it alone.
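
The bf16 note above can be made concrete with back-of-the-envelope arithmetic. This sketch rounds the vocabulary to 150,000 entries (from the "~150K-token vocab" figure above); the batch size and completion length are hypothetical values chosen for illustration, not the settings used in the run.

```python
def logits_bytes(batch: int, completion_len: int, vocab_size: int,
                 bytes_per_elem: int) -> int:
    """Size in bytes of a [batch, completion_len, vocab_size] logits tensor."""
    return batch * completion_len * vocab_size * bytes_per_elem


VOCAB = 150_000        # rounded from the "~150K" figure in the note above
BATCH = 8              # hypothetical
COMPLETION_LEN = 512   # hypothetical

fp32 = logits_bytes(BATCH, COMPLETION_LEN, VOCAB, 4)  # 4 bytes per fp32 value
bf16 = logits_bytes(BATCH, COMPLETION_LEN, VOCAB, 2)  # 2 bytes per bf16 value

print(f"fp32: {fp32 / 2**30:.2f} GiB, bf16: {bf16 / 2**30:.2f} GiB")
```

Under these assumed sizes a single logits tensor is roughly 2.3 GiB in fp32 and half that in bf16, before counting any intermediate copies made during the entropy/logprob computation.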
acharyaanusha added a commit to acharyaanusha/OpenEnv that referenced this pull request Jun 24, 2026
…loop

EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:

    client = await SomeClient.from_env(...)   # connects inside this loop
    with client.sync() as sync_client:        # drives calls on a NEW,
        sync_client.reset()                   # separate background loop

`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.

Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.

Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.
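
The loop-tracking guard described in the fix can be illustrated with a toy class. This is a minimal sketch, not the actual EnvClient code: only the `_ws` and `_ws_loop` attribute names come from the commit message, while the class name, the `connect_count` counter, and the placeholder connection object are assumptions for illustration.

```python
import asyncio


class LoopAwareClient:
    """Toy stand-in for a websocket client; not the real EnvClient."""

    def __init__(self):
        self._ws = None
        self._ws_loop = None
        self.connect_count = 0  # hypothetical counter, for demonstration only

    async def connect(self):
        loop = asyncio.get_running_loop()
        if self._ws is not None and self._ws_loop is loop:
            return self  # same loop: the existing connection is still usable
        # No connection yet, or it belongs to another loop: drop the stale
        # reference and connect fresh on the loop that is running now.
        self._ws = object()  # placeholder for a real websocket handle
        self._ws_loop = loop
        self.connect_count += 1
        return self

    async def disconnect(self):
        self._ws = None
        self._ws_loop = None
```

Calling `connect()` twice inside one `asyncio.run()` leaves `connect_count` at 1, while a later `asyncio.run(client.connect())` runs on a new loop and reconnects. As the commit message notes, the stale handle is only dropped, not closed, because its loop is typically gone.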
From a self-review pass before requesting maintainer review on huggingface#853:

- Validate --per-device-batch-size % --num-generations == 0 up front, before
  the ~180s env clone/start and dataset build -- previously this only
  surfaced as an opaque ValueError deep inside GRPOTrainer construction.
- Extract completion-text parsing into _completion_text(), which now raises
  a clear error on an empty/malformed completion list instead of a bare
  IndexError/TypeError.
- Assert completions and seed are the same length in the reward function,
  instead of letting zip() silently truncate and misalign reward<->task.
- Write the components CSV under output_dir (which save_model() already
  guarantees exists) instead of a sibling path derived from --out's
  basename, which could fail if --out's parent directory doesn't exist.
- Extract the CSV-writing block into write_metrics_csv().

Also tried switching make_sync_client() to the simpler from_env() + .sync()
pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building
it manually in the first place -- and reverted. The fixed connect() does
correctly reconnect on the new loop instead of hanging, but it can't cleanly
close the *old* connection first (its event loop is already gone), so the
old one is simply abandoned. That's harmless for envs that allow concurrent
sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the
abandoned connection occupies the only session slot, and the real one fails
with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client()
avoids the problem by never creating that doomed first connection at all.
Updated its docstring to explain both reasons.
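
The single-session failure mode described above can be modeled in a few lines. This is a toy model only: the class and its methods are hypothetical and do not reflect the env server's implementation; only the CAPACITY_REACHED error name and the one-session limit come from the commit message.

```python
class SingleSessionServer:
    """Toy model of an env that allows only one session at a time."""

    def __init__(self):
        self._active = set()

    def open(self, session_id: str) -> None:
        if self._active:
            # The only slot is taken, even if its owner was abandoned.
            raise RuntimeError("CAPACITY_REACHED")
        self._active.add(session_id)

    def close(self, session_id: str) -> None:
        self._active.discard(session_id)


server = SingleSessionServer()
server.open("first")       # connection made on a loop that is later closed
try:
    server.open("second")  # the reconnect on the live loop
    outcome = "connected"
except RuntimeError as exc:
    outcome = str(exc)
```

In this model `outcome` ends up as "CAPACITY_REACHED" because "first" is never closed. Never opening that first session, as make_sync_client() is described as doing, sidesteps the problem.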