RFC: 008 Environment auto validation

# RFC008 Environment auto validation

Contributors: @zkwentz, @jspisak, @Darktex 

## Context & Framing

OpenEnv [launched](https://huggingface.co/blog/openenv) in October 2025 as a joint project between Hugging Face and Meta. It has since grown to dozens of [supporters](https://github.com/meta-pytorch/OpenEnv?tab=readme-ov-file#community-support--acknowledgments) and over 4k environments on the [hub](https://huggingface.co/openenv/spaces).

This unprecedented growth makes it increasingly hard to discern which environments are high quality and worth adding to a model training collection. Now is the time to implement auto-validation for environments submitted to the OpenEnv hub, to:

1. Define a clear bar for what a high-quality environment is  
2. Drive up quality broadly  
3. Give the community a scalable way to evaluate their environments (think hackathons!)

## Summary

Taking a frontier lab view, an environment is only useful for improving models and generalizing capabilities if the following are true:

1. The environment scales on infra, and how well it scales is measurable.
2. The environment is learnable, e.g., a model can hill-climb on it after O(100s) of steps.
3. The environment is secure.
4. The environment isn't prone to reward hacking.
5. The environment is observable and self-describing — its observations, state, tools, and tasks can be inspected and match what it declares.
6. The environment is reproducible — the same inputs yield the same outcomes across runs, hosts, and time.

## Non-goals

- Auto-validation does not measure an environment's research value or whether training on it transfers to general capability. It certifies that an environment is well-formed and runnable.
- It assesses the environment, not the policy that happens to run in it.
- It does not replace human review for high-stakes inclusion or content decisions.

## Criteria

The acceptance tests below are grouped by the properties above. Each test pairs a concrete pass condition with the requirement it traces back to. A run is conformant when every test not marked *(advisory)* meets its pass condition.

### Scales on infra — build & delivery

- Reproducible build — Two builds from the same source produce an identical digest. *(Build determinism)*
- Layer-change isolation — An app-file change moves only the top 1–2 layers. *(Layer ordering)*
- Multi-stage hygiene — No build tools, .git, or fixtures appear in the final image. *(What goes in / stays out)*
- Archive-free layout — No archive blobs above the threshold, and no single oversize binary. *(What goes in / stays out + Hot-path locality)*
- Conversion clean — Both nydusify and ctr-remote estargz conversions succeed. *(Format and compatibility)*
- Time-to-first-useful-work — The container starts before the full image pull completes. *(Hot-path locality)*
- Composition inspection — dive / nydus-image inspect shows no duplicated large files and reasonable chunk boundaries. *(Layer structure)*
- Signature + SBOM — cosign verification passes; an SBOM is attached and parseable. *(Delivery)*
- OCI labels — All required labels are populated. *(Delivery)*
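
As an illustration of the layer-change isolation check above, the following minimal sketch compares the ordered layer digests of two builds and counts how many layers sit above their shared prefix. The function names and the `max_changed=2` default are assumptions made for this example, not part of the RFC; how the digests are obtained (for example, from an image manifest) is left out.

```python
from typing import Sequence


def changed_layer_count(before: Sequence[str], after: Sequence[str]) -> int:
    """Count layers in `after` that lie above the prefix shared with `before`."""
    shared = 0
    for old, new in zip(before, after):
        if old != new:
            break
        shared += 1
    return len(after) - shared


def layer_change_isolated(
    before: Sequence[str], after: Sequence[str], max_changed: int = 2
) -> bool:
    """Hypothetical pass condition: an app-file change moves at most `max_changed` layers."""
    return changed_layer_count(before, after) <= max_changed


# Placeholder digests, ordered bottom layer first.
base_build = ["sha256:base", "sha256:deps", "sha256:app-v1"]
app_change = ["sha256:base", "sha256:deps", "sha256:app-v2"]
base_change = ["sha256:base-v2", "sha256:deps-v2", "sha256:app-v2"]
```

With these placeholder inputs, `app_change` moves only the top layer, while `base_change` invalidates every layer and would fail the check.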

### Resources

Declared budgets are verified at ingest by measuring actual usage. Where the environment provides a reference (golden) solution, the validator runs it over-provisioned and records peak usage; environments without one supply a representative workload through a built-in measurement hook. The same reference-solution run feeds the verifier-sanity check under reward integrity.

- Resource declaration — The manifest declares a budget for memory, CPU, per-episode timeout, and scratch-disk space. *(Resource declaration)*
- Measured envelope — Under an over-provisioned run of the reference solution (or a declared representative workload), peak memory, CPU, and scratch-disk usage fall within the declared budget, and wall-clock time stays within the declared timeout, each with margin. Under-declaration fails. *(Resource accuracy)*
- Timeout ceiling — The declared per-episode timeout is bounded by a hard maximum so a hung episode terminates. *(Liveness)*
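
The measured-envelope and timeout-ceiling checks above could be sketched roughly as follows. The `Budget` fields, the 10% margin, and the `HARD_MAX_TIMEOUT_S` value are all assumptions chosen for illustration; the RFC does not fix units, margins, or the ceiling.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical hard ceiling on the per-episode timeout; the RFC does not give a value.
HARD_MAX_TIMEOUT_S = 3600.0


@dataclass(frozen=True)
class Budget:
    memory_mb: float
    cpu_cores: float
    scratch_mb: float
    timeout_s: float


def envelope_violations(
    declared: Budget, measured: Budget, margin: float = 0.1
) -> List[str]:
    """List every resource whose measured peak exceeds the declared budget minus the margin."""
    violations = []
    for name in ("memory_mb", "cpu_cores", "scratch_mb", "timeout_s"):
        limit = getattr(declared, name) * (1.0 - margin)
        peak = getattr(measured, name)
        if peak > limit:
            violations.append(f"{name}: measured {peak} exceeds {limit:.1f}")
    return violations


def timeout_within_ceiling(declared: Budget) -> bool:
    """The declared timeout must not exceed the hard maximum."""
    return declared.timeout_s <= HARD_MAX_TIMEOUT_S


declared = Budget(memory_mb=1000, cpu_cores=2, scratch_mb=500, timeout_s=60)
comfortable = Budget(memory_mb=800, cpu_cores=1.5, scratch_mb=100, timeout_s=30)
under_declared = Budget(memory_mb=950, cpu_cores=1.5, scratch_mb=100, timeout_s=30)
```

Here `measured` reuses the `Budget` shape to hold peak usage and wall-clock time from the reference-solution run; a real validator would likely keep the two types separate.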

### Learnable

- Reward reachability — On a sample of tasks, a reference model attains nonzero reward within a bounded number of attempts (pass@k > 0) without saturating (pass@1 < 1). *(Difficulty band)*
- Difficulty separation — Reference models of differing capability produce measurably different scores. *(Curriculum fit)*
- Headroom — The best reference score leaves measurable distance to the maximum achievable reward. *(Saturation)*
- Reward signal-to-noise — Repeated rollouts of a fixed policy produce reward variance within a declared bound. *(Signal quality)*
- Improvement signal *(advisory)* — Under a short training run with a budget normalized to episode horizon, mean reward increases. *(Trainability)*
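
The reward-reachability condition above (pass@k > 0 and pass@1 < 1) can be made concrete with a pass@k estimate. The sketch below uses one common combinatorial estimator, 1 - C(n-c, k) / C(n, k), for n attempts with c successes; the RFC does not prescribe an estimator, and treating any reward greater than zero as a success is an assumption of this example.

```python
from math import comb
from typing import Dict, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds, given c successes in n."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def in_difficulty_band(rewards_by_task: Dict[str, Sequence[float]], k: int) -> bool:
    """Pass when mean pass@k is above 0 and mean pass@1 is below 1 over the task sample."""
    pass_k, pass_1 = [], []
    for rewards in rewards_by_task.values():
        n = len(rewards)
        c = sum(r > 0 for r in rewards)  # assumption: reward > 0 counts as a success
        pass_k.append(pass_at_k(n, c, k))
        pass_1.append(pass_at_k(n, c, 1))
    mean_k = sum(pass_k) / len(pass_k)
    mean_1 = sum(pass_1) / len(pass_1)
    return mean_k > 0.0 and mean_1 < 1.0
```

A task sample where the reference model never scores fails on reachability, and one where it always scores fails on saturation; only samples in between pass.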

### Secure

Supply-chain integrity is covered by the Signature + SBOM check under Build & delivery. The checks below cover the runtime case, where the environment executes model-generated actions inside the container.

- Network egress — With outbound network denied by default, an episode completes, or the environment declares the endpoints it requires. *(Runtime isolation)*
- Filesystem / host containment — Adversarial actions cannot read or write outside the container's allotted paths or reach the host. *(Runtime isolation)*
- Resource bounds — A single episode cannot exceed its declared CPU / memory / storage / process limits. *(Runtime isolation)*
- Cross-episode isolation — Sequential episodes in a reused container share no leaked filesystem, process, or cache state. *(Runtime isolation)*
- Reward / ground-truth containment — In environments that compute reward in-container, policy actions cannot read ground-truth artifacts or write to the reward computation. *(Runtime isolation + reward integrity)*
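
One way to approximate the filesystem part of the cross-episode isolation check above is to snapshot the environment's writable directory before and after an episode plus reset, then diff the two snapshots. This is a sketch under stated assumptions: `snapshot` and `leaked_paths` are hypothetical helpers, the check covers file contents only (not processes or caches), and it would have to run against the container's filesystem rather than the validator's.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file under `root` (by relative path) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def leaked_paths(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return files that were added, removed, or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

An empty result from `leaked_paths` after a reset suggests that no file state carried over; any listed path is a candidate leak to inspect.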

### Not prone to reward hacking

Reward is produced inside the environment; the checks below validate its integrity. Detailed procedures for the adversarial panel and its parameters are specified in the reward-integrity companion spec.

- Well-formed reward — Every step and episode returns a finite reward within a declared range. *(Reward integrity)*
- Rubric introspectability — Reward is produced by an inspectable rubric whose components, thresholds, and weights can be enumerated. *(Reward integrity)*
- Verifier sanity — Applying the reference (golden) solution yields the maximum reward; an empty or degenerate solution yields no more than the floor. *(Reward integrity)*
- Adversarial floor — A panel of non-solving policies (null, degenerate, grader-injection, reward-search) scores at or below the empty-trajectory floor. *(Reward hacking)*
- Gameability gap — Where ground truth is available, reward correlates with true task success above a declared bound. *(Reward hacking)*
- Canary suite — Seeded known reward-hacking trajectories receive no more than the floor reward. *(Reward hacking)*
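
Two of the checks above, well-formed reward and adversarial floor, reduce to simple predicates once the rewards have been collected. The sketch below assumes the validator already holds a list of rewards and a mapping from panel policy name to score; the function names and the example policy labels are illustrative, and the actual panel procedure lives in the companion spec.

```python
import math
from typing import Dict, List, Sequence


def reward_is_well_formed(rewards: Sequence[float], low: float, high: float) -> bool:
    """Every reward must be finite (no NaN or infinity) and inside the declared range."""
    return all(math.isfinite(r) and low <= r <= high for r in rewards)


def adversarial_floor_failures(panel_scores: Dict[str, float], floor: float) -> List[str]:
    """Return the non-solving policies that scored above the empty-trajectory floor."""
    return sorted(name for name, score in panel_scores.items() if score > floor)


# Illustrative scores for a panel of non-solving policies.
panel = {"null": 0.0, "degenerate": 0.0, "grader_injection": 0.7, "reward_search": 0.0}
```

In this made-up panel, `grader_injection` scores above a floor of 0.0, so the environment would fail the adversarial-floor check.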

### Observable & self-describing

Training and evaluation require visibility into what the environment presents and does. These checks validate that observations, state, declared tools, and declared tasks are inspectable and accurate.

- Observation conformance — Observations conform to a declared, serializable schema and are non-empty. *(Observation schema)*
- No solution leakage — Observations and state do not contain the reward value or the ground-truth solution. *(Observation integrity + reward hacking)*
- State endpoint — `state()` returns a parseable state including `episode_id` and `step_count`. *(State observability)*
- Trajectory record — A run emits a structured trajectory (per-step observation, action, reward, terminal) to `outputs/`. *(Instrumentation)*
- Reward attribution — For rubric-based reward, per-component contributions are exposed alongside the scalar. *(Reward observability)*
- Tool declaration accuracy — Declared tools match those discoverable and callable at runtime; each declares a typed input schema; no undeclared or missing tools. *(Tool declaration)*
- Task declaration accuracy — The declared task set is enumerable and addressable (selectable by id or seed), the declared count matches, and each task resets and runs. *(Task declaration)*
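
The tool-declaration and state-endpoint checks above amount to comparing what the manifest declares with what the running environment exposes. A minimal sketch, assuming the validator has already collected the declared tool names, the tool names discovered at runtime, and the parsed `state()` payload as a dictionary (the function names and tool names are hypothetical):

```python
from typing import Dict, Iterable, List, Mapping


def tool_declaration_diff(
    declared: Iterable[str], discovered: Iterable[str]
) -> Dict[str, List[str]]:
    """Report declared-but-missing tools and discovered-but-undeclared tools."""
    declared_set, discovered_set = set(declared), set(discovered)
    return {
        "missing": sorted(declared_set - discovered_set),
        "undeclared": sorted(discovered_set - declared_set),
    }


def state_is_valid(state: Mapping[str, object]) -> bool:
    """The parsed state must include the two fields the check names."""
    return "episode_id" in state and "step_count" in state
```

The check passes only when both lists in the diff are empty; typed input schemas and callability would need separate checks not shown here.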

### Reproducible

Build reproducibility is covered by the Reproducible build check under Build & delivery. The checks below extend reproducibility to runtime behavior, reward, and task content, so that a result obtained once can be obtained again across runs, hosts, and time.

- Seed control — A given seed yields the same task instance and initial state. *(Seeding)*
- Episode determinism — For a fixed seed and action sequence, repeated episodes produce identical observations and rewards. *(Determinism)*
- Cross-host reproducibility — The same seed and action sequence produce identical observations and rewards on a different host, within a declared tolerance for any stochastic component. *(Environment portability)*
- Verifier determinism — The reward computation returns the same score for the same trajectory across runs and hosts; LLM-graded rubrics declare a tolerance and a pinned judge configuration. *(Reward reproducibility)*
- Verifier portability — The reward computation produces a valid score on a clean host using only the trajectory and the environment's pinned contents, with no dependence on host state, wall-clock, or network access. *(Verifier hermeticity)*
- Dependency pinning — All software dependencies resolve to exact, pinned versions. *(Pinned dependencies)*
- Task-distribution pinning — Generated or externally sourced tasks pin their generator seed, model version, or dataset version, so the same task distribution is reconstructable. *(Pinned task distribution)*
- Immutable versioning — A published environment version resolves to the same image, task distribution, and reward logic. *(Versioned identity)*
- Replayability — A recorded trajectory replays to the same observations, rewards, and terminal state. *(Replay)*
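
Episode determinism and replayability, as described above, can be tested by hashing a serialized trajectory and comparing digests across runs. In the sketch below, `ToyEnv` is a hypothetical stand-in with a `reset(seed)` / `step(action)` shape chosen for the example; it is not the OpenEnv API, and a real check would drive the environment under test instead.

```python
import hashlib
import json
import random
from typing import Dict, List, Sequence, Tuple


class ToyEnv:
    """Hypothetical seeded environment used only to illustrate the check."""

    def reset(self, seed: int) -> Dict[str, int]:
        self._rng = random.Random(seed)  # all randomness flows from the seed
        self._target = self._rng.randint(0, 9)
        return {"noise": self._rng.randint(0, 99)}

    def step(self, action: int) -> Tuple[Dict[str, int], float]:
        reward = 1.0 if action == self._target else 0.0
        return {"noise": self._rng.randint(0, 99)}, reward


def trajectory_digest(seed: int, actions: Sequence[int]) -> str:
    """Run one episode and return a SHA-256 digest of the recorded trajectory."""
    env = ToyEnv()
    record: List[dict] = [{"observation": env.reset(seed)}]
    for action in actions:
        observation, reward = env.step(action)
        record.append({"action": action, "observation": observation, "reward": reward})
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two runs with the same seed and action sequence should produce the same digest. The same comparison could be repeated on a second host for the cross-host check, with exact equality relaxed wherever the environment declares a tolerance.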

