Improve sandbox readiness for real dataset experiments #284

Open
DermotOBrien-EC wants to merge 2 commits into aiming-lab:main from DermotOBrien-EC:sandbox-readiness-metrics-datasets

@DermotOBrien-EC (Contributor)

Problem

Sandbox-mode runs had two distinct silent failure paths that combined to produce misleading downstream papers:

  1. Metric capture: when Stage 12 ran in sandbox mode, the executor wrote results.json into a path the discovery logic did not glob, fell back to stdout parsing, and dropped the structured per-condition / per-seed metrics on the floor — so run-1.json:metrics came out empty or shape-degraded even when the experiment had succeeded.
  2. Dataset access: the network_disabled_guidance prompt block hardcoded /opt/datasets as the only legal dataset path and did not require any provenance signal. When a deployment's pre-cached data lived elsewhere (or was only partially present), generated code silently substituted synthetic / random-projection tensors and produced plausible-looking numbers that did not come from the dataset the paper claimed.

Fix

Two commits, each independently useful, bundled because together they satisfy the end-to-end contract that the focused replay below validates.

457dfd3 — Capture structured sandbox experiment metrics

  • _execution.py: clear the sandbox project dir before each run, anchor result-file discovery to an mtime taken just before launch, and glob both the project root and a nested results/ subdir so the auto-suffixed _project_N/ (or _project_N/results/) path the harness picks is always found.
  • _helpers.py: parse the structured stdout convention (PER_SEED:, CONDITION_SUMMARY:, GAP_TO_BN: lines) and add a _flatten_structured_metrics helper that promotes a per-condition / per-seed dict into the namespaced <condition>/<seed>/<metric> keys the rest of the pipeline expects.
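
For reviewers who want the discovery rule in concrete terms, here is a minimal sketch of the mtime-anchored lookup described for _execution.py. It is not the code in this PR: the function names `reset_project_dir` and `discover_results` are hypothetical, and it assumes the harness writes a file literally named results.json either in the project root or in a results/ subdirectory.

```python
import json
import shutil
import time
from pathlib import Path
from typing import Optional


def reset_project_dir(project_dir: Path) -> float:
    """Empty the project dir and return a timestamp taken just before launch."""
    if project_dir.exists():
        shutil.rmtree(project_dir)
    project_dir.mkdir(parents=True)
    return time.time()


def discover_results(project_dir: Path, launched_at: float) -> Optional[dict]:
    """Return the newest results.json written at or after launch, else None."""
    candidates = [
        path
        # Look in both the project root and the nested results/ subdir.
        for pattern in ("results.json", "results/results.json")
        for path in project_dir.glob(pattern)
        # Ignore files older than the launch timestamp (stale results).
        if path.stat().st_mtime >= launched_at
    ]
    if not candidates:
        return None  # the caller would fall back to stdout parsing
    newest = max(candidates, key=lambda path: path.stat().st_mtime)
    return json.loads(newest.read_text())
```

Clearing the directory and filtering on mtime are redundant on purpose in this sketch: either one alone would reject a stale file, and together they also cover the case where the directory could not be fully cleared. On filesystems with coarse mtime resolution, a small tolerance on the comparison may be needed.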

ded68e8 — Parameterise sandbox dataset cache root

  • config.py: add experiment.sandbox.dataset_cache_root (default /opt/datasets, so existing configs are byte-equivalent).
  • prompts/shared.py: render the network_disabled_guidance block with the configured cache root in all six pre-cached dataset examples, forbid silent synthetic-data fallback (raise FileNotFoundError, exit non-zero), and require a single-line DATASET_USED: <name> stdout stamp emitted exactly once after a successful dataset load.
  • _code_generation.py: pass dataset_cache_root into the block call.
  • config.researchclaw.example.yaml: document the new field with a comment that names the behavior tightening.
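
A rough sketch of how the new field and the prompt block fit together, for illustration only. `SandboxConfig`, `dataset_cache_root`, and the `/opt/datasets` default come from this PR; the `from_dict` constructor, the `render_guidance_sketch` function, and the prompt wording are assumptions and do not reproduce the real text in prompts/shared.py.

```python
from dataclasses import dataclass

DEFAULT_DATASET_CACHE_ROOT = "/opt/datasets"


@dataclass(frozen=True)
class SandboxConfig:
    # Default matches the previously hardcoded path.
    dataset_cache_root: str = DEFAULT_DATASET_CACHE_ROOT

    @classmethod
    def from_dict(cls, raw: dict) -> "SandboxConfig":
        # Hypothetical loader: a missing or empty field falls back to the default.
        root = raw.get("dataset_cache_root") or DEFAULT_DATASET_CACHE_ROOT
        return cls(dataset_cache_root=root)


def render_guidance_sketch(dataset_cache_root: str = DEFAULT_DATASET_CACHE_ROOT) -> str:
    # Illustrative wording only; the real block also lists six dataset examples.
    return (
        "Network access is disabled; never pass download=True.\n"
        f"Load pre-cached datasets from {dataset_cache_root} with download=False.\n"
        "If a dataset file is missing, raise FileNotFoundError and exit non-zero; "
        "never substitute synthetic data.\n"
        "After a successful load, print exactly one line: DATASET_USED: <name>\n"
    )
```

With a config that omits the field, `render_guidance_sketch(SandboxConfig.from_dict({}).dataset_cache_root)` yields the same path as before the change.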

Validation

Tests

  • 2830 passed, 56 skipped (full suite excluding live-LLM and Docker e2e).
  • 5 new focused tests cover: SandboxConfig.dataset_cache_root default, custom override, fallback when the YAML omits the field, prompt rendering with default + custom roots (six dataset examples each), fail-loud-on-missing-data instruction present, and DATASET_USED: stamp instruction present.

Focused replay

CODE_GENERATION..EXPERIMENT_RUN against MNIST raw IDX files pre-staged at a non-default path (/tmp/arc_sandbox_trial/datasets). Three hard criteria:

  1. stdout contains DATASET_USED: MNIST — ✅ exact single line emitted by generated code.
  2. run-1.json:metrics has per-condition structured keys — ✅ 105 namespaced keys (e.g. baseline_batchnorm_mlp/0/accuracy, baseline_rmsnorm_mlp/test_accuracy_mean, proposed_curriculum_batchnorm_mlp/2/test_accuracy).
  3. runs/results.json is the structured harness output, not a stdout_parsed fallback — ✅ canonical file carries the full harness_metrics + conditions structure; run-1.json:structured_results is populated.
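
To make the key shape in criterion 2 concrete, the following sketch shows one way a nested per-condition / per-seed dict could be flattened into `<condition>/<seed>/<metric>` keys. It is an illustration, not the `_flatten_structured_metrics` helper from this PR: the function name, the input layout, and the numeric values are all assumptions.

```python
def flatten_structured_metrics_sketch(nested: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into slash-joined keys, keeping numeric leaves only."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into condition -> seed -> metric levels.
            flat.update(flatten_structured_metrics_sketch(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = float(value)
    return flat


# Placeholder values, not results from the replay.
example = {
    "baseline_batchnorm_mlp": {
        0: {"accuracy": 0.97},
        "test_accuracy_mean": 0.975,
    }
}
flat = flatten_structured_metrics_sketch(example)
```

Here `flat` contains the keys `baseline_batchnorm_mlp/0/accuracy` and `baseline_batchnorm_mlp/test_accuracy_mean`, mirroring the two key depths quoted above.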

Stage 12 ran 117.7s of real MNIST training on MPS (3 conditions × 3 seeds, test accuracy 97–98%). Pre-patch, the same stage degraded to 1.30s of synthetic substitution.

Behavior change (intentional)

Sandbox network_policy="none" continues to forbid network access and download=True — that is not relaxed. The change is in missing-data semantics:

  • Before: if a pre-cached dataset file was missing, generated code could silently substitute synthetic tensors and report metrics.
  • After: the prompt explicitly requires raise FileNotFoundError and a non-zero exit, plus a DATASET_USED: <name> provenance stamp on success.
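
As an illustration of the "After" behavior, here is a stdlib-only sketch of what compliant generated code might look like for MNIST. It is an assumption about the shape of such code, not output captured from the pipeline: the helper names are hypothetical, and the `<cache_root>/MNIST/raw/` layout with uncompressed IDX files follows the torchvision convention, which may differ from a given deployment.

```python
import sys
from pathlib import Path

# Standard MNIST raw IDX file names.
MNIST_RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def require_mnist_raw(cache_root: str) -> Path:
    """Return the raw MNIST dir, or raise FileNotFoundError if any file is absent."""
    raw_dir = Path(cache_root) / "MNIST" / "raw"  # assumed layout
    missing = [name for name in MNIST_RAW_FILES if not (raw_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing MNIST files under {raw_dir}: {', '.join(missing)}"
        )
    return raw_dir


def main(cache_root: str) -> int:
    try:
        require_mnist_raw(cache_root)
    except FileNotFoundError as exc:
        # Fail loudly: no synthetic fallback, non-zero exit code.
        print(f"ERROR: {exc}", file=sys.stderr)
        return 1
    # Real dataset loading and training would go here (elided in this sketch).
    print("DATASET_USED: MNIST")  # provenance stamp, emitted exactly once
    return 0
```

A script would end with `sys.exit(main(cache_root))` so that the return value becomes the process exit code.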

Configs that omit dataset_cache_root still get /opt/datasets, so the rendered prompt path is byte-identical to before for existing deployments — only the missing-data behavior is tightened.

Out of scope

  • Stage 22 figure rendering / BeastMode wiring.
  • Broader network_policy redesign (e.g., a per-stage allowlist).
  • Promoting dataset_cache_root to other execution backends (Docker / SSH-remote) — only sandbox mode uses it today.

Add experiment.sandbox.dataset_cache_root (default /opt/datasets) and
thread it through the network_disabled_guidance prompt block so
generated experiment code is instructed to load torchvision datasets
from the configured path with download=False. The default value
matches the prior hardcoded constant, so existing configs that omit
the field render identical prompts.

Tighten missing-data semantics: the prompt now forbids silent
synthetic-data fallback and requires FileNotFoundError + non-zero
exit if a pre-cached dataset file is missing. This is an
intentional behaviour change for every sandbox network_policy="none"
codegen call, motivated by a focused-replay defect where missing
MNIST raw files were papered over with synthetic tensors.

Add an explicit DATASET_USED: <name> stdout-stamp requirement so
downstream metric capture has a dataset-provenance signal
independent of whatever JSON result schema CodeAgent invents.
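
A downstream check for the stamp could look like the sketch below. This is a hypothetical consumer, not code from this change: the function name is invented, and treating zero or multiple stamps as "no provenance" is an assumed policy.

```python
import re
from typing import Optional

_STAMP = re.compile(r"DATASET_USED: (\S+)")


def dataset_from_stdout(stdout: str) -> Optional[str]:
    """Return the stamped dataset name if exactly one stamp line exists."""
    names = []
    for line in stdout.splitlines():
        match = _STAMP.fullmatch(line.strip())
        if match:
            names.append(match.group(1))
    # Zero stamps or conflicting repeats both count as missing provenance.
    return names[0] if len(names) == 1 else None
```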

Focused replay (CODE_GENERATION..EXPERIMENT_RUN against MNIST raw
files pre-staged at /tmp/arc_sandbox_trial/datasets) confirms all
three checks: the stamp appears in stdout, run-1.json:metrics
carries 105 per-condition namespaced keys, and the canonical
runs/results.json contains the structured harness output rather
than a stdout-parsed fallback.