
RFC: 008 Environment auto validation #778

@burtenshaw

RFC008 Environment auto validation

Contributors: @zkwentz, @jspisak, @Darktex

Context & Framing

OpenEnv launched in October 2025 as a joint project between Hugging Face and Meta. It has since grown to dozens of supporters and over 4k environments on the hub.

This unprecedented growth makes it increasingly hard to discern which environments are high quality and worth adding to a model training collection. Now is the time to implement auto-validation for environments submitted to the OpenEnv hub, to:

  1. Define a clear bar for what a high-quality environment is
  2. Drive up quality broadly
  3. Give the community a scalable way to evaluate their environments — think hackathons!

Summary

Taking a frontier lab view, an environment is only useful for improving models and generalizing capabilities if the following are true:

  1. The environment scales on infra, and how well it scales can be measured.
  2. The environment is learnable, e.g. a model can hill-climb on it within O(100s) of steps.
  3. The environment is secure.
  4. The environment isn't prone to reward hacking.
  5. The environment is observable and self-describing — its observations, state, tools, and tasks can be inspected and match what it declares.
  6. The environment is reproducible — the same inputs yield the same outcomes across runs, hosts, and time.

Non-goals

  • Auto-validation does not measure an environment's research value or whether training on it transfers to general capability. It certifies that an environment is well-formed and runnable.
  • It assesses the environment, not the policy that happens to run in it.
  • It does not replace human review for high-stakes inclusion or content decisions.

Criteria

The acceptance tests below are grouped by the properties above. Each test pairs a concrete pass condition with the requirement it traces back to. A run is conformant when every test meets its pass condition.

Scales on infra — build & delivery

  • Reproducible build — Two builds from the same source produce an identical digest. (Build determinism)
  • Layer-change isolation — An app-file change moves only the top 1–2 layers. (Layer ordering)
  • Multi-stage hygiene — No build tools, .git, or fixtures appear in the final image. (What goes in / stays out)
  • Archive-free layout — No archive blobs above the threshold, and no single oversize binary. (What goes in / stays out + Hot-path locality)
  • Conversion clean — Both nydusify and ctr-remote estargz conversions succeed. (Format and compatibility)
  • Time-to-first-useful-work — The container starts before the full image pull completes. (Hot-path locality)
  • Composition inspection — dive / nydus-image inspect shows no duplicated large files and reasonable chunk boundaries. (Layer structure)
  • Signature + SBOM — cosign verification passes; an SBOM is attached and parseable. (Delivery)
  • OCI labels — All required labels are populated. (Delivery)
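
As an illustration of the layer-change isolation check above, the comparison can be sketched over two ordered lists of layer digests. This is a minimal sketch, not the validator's implementation: the function names, the placeholder digest strings, and the default threshold of two layers (read from the "top 1–2 layers" wording) are assumptions.

```python
from typing import Sequence


def changed_top_layers(before: Sequence[str], after: Sequence[str]) -> int:
    """Count layers in `after` that sit at or above the first differing layer.

    Index 0 is the base layer. Once one layer differs, every layer above it
    is treated as changed too, because it has to be rebuilt and re-pulled.
    """
    shared = 0
    for old, new in zip(before, after):
        if old != new:
            break
        shared += 1
    return len(after) - shared


def passes_layer_isolation(before: Sequence[str], after: Sequence[str],
                           max_changed: int = 2) -> bool:
    """Hypothetical pass condition: an app-file change moves at most `max_changed` layers."""
    return changed_top_layers(before, after) <= max_changed
```

A change that lands in the base layer invalidates the whole stack, which is the failure this check is meant to surface.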

Resources

Declared budgets are verified at ingest by measuring actual usage. Where the environment provides a reference (golden) solution, the validator runs it over-provisioned and records peak usage; environments without one supply a representative workload through a built-in measurement hook. The same reference-solution run feeds the verifier-sanity check under reward integrity.

  • Resource declaration — The manifest declares a budget for memory, CPU, per-episode timeout, and scratch-disk space. (Resource declaration)
  • Measured envelope — Under an over-provisioned run of the reference solution (or a declared representative workload), peak memory, CPU, and scratch-disk usage fall within the declared budget, and wall-clock completes within the declared timeout, each with margin. Under-declaration fails. (Resource accuracy)
  • Timeout ceiling — The declared per-episode timeout is bounded by a hard maximum so a hung episode terminates. (Liveness)
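
The measured-envelope check can be sketched as a comparison between a declared budget and the peaks recorded during the reference run. The `Budget` fields, their units, and the 10% default margin are assumptions made for illustration; the RFC does not fix a manifest schema or a margin value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budget:
    # Hypothetical field names and units; the manifest schema is not specified here.
    memory_mb: float
    cpu_cores: float
    scratch_disk_mb: float
    timeout_s: float


def envelope_violations(declared: Budget, measured: Budget,
                        margin: float = 0.1) -> List[str]:
    """Return the resources whose measured peak exceeds declared * (1 - margin)."""
    violations = []
    for name in ("memory_mb", "cpu_cores", "scratch_disk_mb", "timeout_s"):
        limit = getattr(declared, name) * (1.0 - margin)
        if getattr(measured, name) > limit:
            violations.append(name)
    return violations
```

An empty list means the declaration holds with margin; any entry indicates under-declaration for that resource.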

Learnable

  • Reward reachability — On a sample of tasks, a reference model attains nonzero reward within a bounded number of attempts (pass@k > 0) without saturating (pass@1 < 1). (Difficulty band)
  • Difficulty separation — Reference models of differing capability produce measurably different scores. (Curriculum fit)
  • Headroom — The best reference score leaves measurable distance to the maximum achievable reward. (Saturation)
  • Reward signal-to-noise — Repeated rollouts of a fixed policy produce reward variance within a declared bound. (Signal quality)
  • Improvement signal (advisory) — Under a short training run with a budget normalized to episode horizon, mean reward increases. (Trainability)
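
The reward-reachability condition (pass@k > 0 and pass@1 < 1) can be computed per task from n sampled attempts of which c succeeded. The sketch below uses the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k); the choice of that estimator and the function names are assumptions, since the RFC does not say how pass@k is estimated.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def in_difficulty_band(n: int, c: int, k: int) -> bool:
    """Reachable (pass@k > 0) without saturating (pass@1 < 1)."""
    return pass_at_k(n, c, k) > 0.0 and pass_at_k(n, c, 1) < 1.0
```

With this estimator, a task that is never solved fails the first condition and a task that is always solved fails the second.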

Secure

Supply-chain integrity is covered by the Signature + SBOM check under Build & delivery. The checks below cover the runtime case, where the environment executes model-generated actions inside the container.

  • Network egress — With outbound network denied by default, an episode completes, or the environment declares the endpoints it requires. (Runtime isolation)
  • Filesystem / host containment — Adversarial actions cannot read or write outside the container's allotted paths or reach the host. (Runtime isolation)
  • Resource bounds — A single episode cannot exceed its declared CPU / memory / storage / process limits. (Runtime isolation)
  • Cross-episode isolation — Sequential episodes in a reused container share no leaked filesystem, process, or cache state. (Runtime isolation)
  • Reward / ground-truth containment — In environments that compute reward in-container, policy actions cannot read ground-truth artifacts or write to the reward computation. (Runtime isolation + reward integrity)
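
One way to probe the filesystem part of cross-episode isolation is to snapshot the container's writable paths before an episode and again after the following reset, then diff the two snapshots. The sketch below is an assumption about how such a probe could look; it covers file contents only, not process or cache state, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def leaked_paths(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

An empty result from `leaked_paths` after a reset is consistent with no filesystem state leaking from the previous episode.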

Not prone to reward hacking

Reward is produced inside the environment; the checks below validate its integrity. Detailed procedures for the adversarial panel and its parameters are specified in the reward-integrity companion spec.

  • Well-formed reward — Every step and episode returns a finite reward within a declared range. (Reward integrity)
  • Rubric introspectability — Reward is produced by an inspectable rubric whose components, thresholds, and weights can be enumerated. (Reward integrity)
  • Verifier sanity — Applying the reference (golden) solution yields the maximum reward; an empty or degenerate solution yields no more than the floor. (Reward integrity)
  • Adversarial floor — A panel of non-solving policies (null, degenerate, grader-injection, reward-search) scores at or below the empty-trajectory floor. (Reward hacking)
  • Gameability gap — Where ground truth is available, reward correlates with true task success above a declared bound. (Reward hacking)
  • Canary suite — Seeded known reward-hacking trajectories receive no more than the floor reward. (Reward hacking)
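
Three of the checks above (well-formed reward, verifier sanity, adversarial floor) reduce to simple predicates over a scoring function. The following is a sketch under stated assumptions: trajectories are modelled as lists of action strings, `toy_score` is a hypothetical scorer, and the actual panel procedures are left to the companion spec.

```python
import math
from typing import Callable, Iterable, Sequence

Trajectory = Sequence[str]  # assumption: a trajectory is a list of action strings
Scorer = Callable[[Trajectory], float]


def is_well_formed(reward: float, low: float, high: float) -> bool:
    """Reward is finite and inside the declared range."""
    return math.isfinite(reward) and low <= reward <= high


def verifier_sanity(score: Scorer, golden: Trajectory, empty: Trajectory,
                    max_reward: float, floor: float) -> bool:
    """Golden solution earns the maximum; the empty solution earns at most the floor."""
    return score(golden) == max_reward and score(empty) <= floor


def adversarial_floor_holds(score: Scorer, panel: Iterable[Trajectory],
                            floor: float) -> bool:
    """Every non-solving trajectory in the panel scores at or below the floor."""
    return all(score(t) <= floor for t in panel)


def toy_score(trajectory: Trajectory) -> float:
    """Hypothetical scorer: full reward only for the exact solving action."""
    return 1.0 if list(trajectory) == ["submit:42"] else 0.0
```

A scorer that pays out for any trajectory in the panel, for example one that trusts a reward value injected into the action text, would fail `adversarial_floor_holds`.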

Observable & self-describing

Training and evaluation require visibility into what the environment presents and does. These checks validate that observations, state, declared tools, and declared tasks are inspectable and accurate.

  • Observation conformance — Observations conform to a declared, serializable schema and are non-empty. (Observation schema)
  • No solution leakage — Observations and state do not contain the reward value or the ground-truth solution. (Observation integrity + reward hacking)
  • State endpoint — state() returns a parseable state including episode_id and step_count. (State observability)
  • Trajectory record — A run emits a structured trajectory (per-step observation, action, reward, terminal) to outputs/. (Instrumentation)
  • Reward attribution — For rubric-based reward, per-component contributions are exposed alongside the scalar. (Reward observability)
  • Tool declaration accuracy — Declared tools match those discoverable and callable at runtime; each declares a typed input schema; no undeclared or missing tools. (Tool declaration)
  • Task declaration accuracy — The declared task set is enumerable and addressable (selectable by id or seed), the declared count matches, and each task resets and runs. (Task declaration)
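
Tool declaration accuracy and the state endpoint check can both be expressed as small comparisons. In the sketch below, the function names are hypothetical and the state is assumed to arrive as a JSON object; only the `episode_id` and `step_count` keys come from the criteria above.

```python
import json
from typing import Dict, Iterable, List


def tool_declaration_diff(declared: Iterable[str],
                          discovered: Iterable[str]) -> Dict[str, List[str]]:
    """Compare declared tool names with those found at runtime."""
    declared_set, discovered_set = set(declared), set(discovered)
    return {
        "missing": sorted(declared_set - discovered_set),      # declared, not callable
        "undeclared": sorted(discovered_set - declared_set),   # callable, not declared
    }


def state_is_parseable(raw: str) -> bool:
    """Assumes state() is serialized as a JSON object with the two required keys."""
    try:
        state = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(state, dict) and {"episode_id", "step_count"} <= state.keys()
```

The tool check passes when both lists in the diff are empty.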

Reproducible

Build reproducibility is covered by the Reproducible build check under Build & delivery. The checks below extend reproducibility to runtime behavior, reward, and task content, so that a result obtained once can be obtained again across runs, hosts, and time.

  • Seed control — A given seed yields the same task instance and initial state. (Seeding)
  • Episode determinism — For a fixed seed and action sequence, repeated episodes produce identical observations and rewards. (Determinism)
  • Cross-host reproducibility — The same seed and action sequence produce identical observations and rewards on a different host, within a declared tolerance for any stochastic component. (Environment portability)
  • Verifier determinism — The reward computation returns the same score for the same trajectory across runs and hosts; LLM-graded rubrics declare a tolerance and a pinned judge configuration. (Reward reproducibility)
  • Verifier portability — The reward computation produces a valid score on a clean host using only the trajectory and the environment's pinned contents, with no dependence on host state, wall-clock, or network access. (Verifier hermeticity)
  • Dependency pinning — All software dependencies resolve to exact, pinned versions. (Pinned dependencies)
  • Task-distribution pinning — Generated or externally sourced tasks pin their generator seed, model version, or dataset version, so the same task distribution is reconstructable. (Pinned task distribution)
  • Immutable versioning — A published environment version resolves to the same image, task distribution, and reward logic. (Versioned identity)
  • Replayability — A recorded trajectory replays to the same observations, rewards, and terminal state. (Replay)
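
Episode determinism can be checked by hashing the trajectory produced by repeated rollouts with the same seed and action sequence. The sketch below uses a toy rollout as a stand-in for a real environment; `toy_rollout`, the `(observation, reward)` step shape, and the default of three repeats are assumptions for illustration only.

```python
import hashlib
import json
import random
from typing import Callable, List, Sequence, Tuple

Step = Tuple[float, float]  # (observation, reward) in this toy example
Rollout = Callable[[int, Sequence[int]], List[Step]]


def toy_rollout(seed: int, actions: Sequence[int]) -> List[Step]:
    """Hypothetical environment stand-in; all randomness is drawn from `seed`."""
    rng = random.Random(seed)
    steps = []
    for action in actions:
        observation = rng.random()
        reward = 1.0 if (observation > 0.5) == bool(action) else 0.0
        steps.append((observation, reward))
    return steps


def trajectory_digest(steps: List[Step]) -> str:
    """Stable SHA-256 digest of a trajectory via its JSON serialization."""
    payload = json.dumps(steps).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_deterministic(rollout: Rollout, seed: int, actions: Sequence[int],
                     repeats: int = 3) -> bool:
    """True if every repeat yields a byte-identical trajectory."""
    digests = {trajectory_digest(rollout(seed, actions)) for _ in range(repeats)}
    return len(digests) == 1
```

The same digest comparison could serve the cross-host and replay checks by comparing digests recorded on different hosts or against a stored trajectory, although a declared tolerance for stochastic components would need a numeric comparison rather than an exact hash.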
