Improve sandbox readiness for real dataset experiments by DermotOBrien-EC · Pull Request #284 · aiming-lab/AutoResearchClaw

DermotOBrien-EC · 2026-05-29T13:20:26Z

Problem

Sandbox-mode runs had two distinct silent failure paths that combined to produce misleading downstream papers:

Metric capture: when Stage 12 ran in sandbox mode, the executor wrote results.json to a path that the discovery logic did not glob. Discovery then fell back to stdout parsing, which discarded the structured per-condition / per-seed metrics, so run-1.json:metrics came out empty or shape-degraded even when the experiment had succeeded.
Dataset access: the network_disabled_guidance prompt block hardcoded /opt/datasets as the only legal dataset path and did not require any provenance signal. When a deployment's pre-cached data lived elsewhere, or was only partially present, generated code silently substituted synthetic / random-projection tensors and produced plausible-looking numbers that did not come from the dataset the paper claimed.

Fix

Two commits, each independently useful, bundled because together they close the end-to-end contract that the focused replay was designed to validate.

`457dfd3` — Capture structured sandbox experiment metrics

_execution.py: clear the sandbox project dir before each run, anchor result-file discovery to an mtime taken just before launch, and glob both the project root and a nested results/ subdir so the auto-suffixed _project_N/ (or _project_N/results/) path the harness picks is always found.
_helpers.py: parse the structured stdout convention (PER_SEED:, CONDITION_SUMMARY:, GAP_TO_BN: lines) and add a _flatten_structured_metrics helper that promotes a per-condition / per-seed dict into the namespaced <condition>/<seed>/<metric> keys the rest of the pipeline expects.
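The discovery changes in _execution.py can be illustrated with a minimal sketch. It is not the PR's code: the function names `reset_project_dir` and `find_fresh_results`, and the exact glob patterns, are assumptions made for illustration. It only shows the idea of clearing the project directory, recording a timestamp before launch, and accepting result files from either the project root or a nested results/ subdirectory when their mtime is not older than that timestamp.

```python
import shutil
from pathlib import Path


def reset_project_dir(project_dir: Path) -> None:
    """Remove leftovers from earlier runs so stale results cannot be picked up."""
    if project_dir.exists():
        shutil.rmtree(project_dir)
    project_dir.mkdir(parents=True)


def find_fresh_results(project_dir: Path, launched_at: float) -> list[Path]:
    """Return results.json files modified at or after `launched_at`.

    Both the project root and a nested results/ subdirectory are searched.
    The two patterns below are assumed for this sketch.
    """
    patterns = ("results.json", "results/results.json")
    found = []
    for pattern in patterns:
        for path in project_dir.glob(pattern):
            # Files older than the launch timestamp belong to a previous run.
            if path.is_file() and path.stat().st_mtime >= launched_at:
                found.append(path)
    return sorted(found)
```

A caller would take `launched_at = time.time()` immediately before starting the experiment and call `find_fresh_results` once it exits. Filesystems with coarse mtime resolution may need a small tolerance on that comparison.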

`ded68e8` — Parameterise sandbox dataset cache root

config.py: add experiment.sandbox.dataset_cache_root (default /opt/datasets, so existing configs are byte-equivalent).
prompts/shared.py: render the network_disabled_guidance block with the configured cache root in all six pre-cached dataset examples, forbid silent synthetic-data fallback (raise FileNotFoundError, exit non-zero), and require a single-line DATASET_USED: <name> stdout stamp emitted exactly once after a successful dataset load.
_code_generation.py: pass dataset_cache_root into the block call.
config.researchclaw.example.yaml: document the new field with a comment that names the behavior tightening.
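As a rough sketch of how the new field and the prompt block fit together (not the actual config.py or prompts/shared.py): `SandboxConfig` and `dataset_cache_root` are names from this PR, while `from_mapping`, `render_guidance`, and the shortened prompt wording are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Matches the previously hardcoded path, so omitting the field changes nothing.
DEFAULT_DATASET_CACHE_ROOT = "/opt/datasets"


@dataclass(frozen=True)
class SandboxConfig:
    dataset_cache_root: str = DEFAULT_DATASET_CACHE_ROOT

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "SandboxConfig":
        # Hypothetical loader: fall back to the default when the YAML omits the field.
        return cls(
            dataset_cache_root=raw.get("dataset_cache_root", DEFAULT_DATASET_CACHE_ROOT)
        )


def render_guidance(dataset_cache_root: str) -> str:
    """Hypothetical, much-shortened stand-in for the network_disabled_guidance block."""
    return (
        f"Network access is disabled. Load pre-cached datasets from "
        f"{dataset_cache_root} with download=False.\n"
        "If a dataset file is missing, raise FileNotFoundError and exit non-zero; "
        "never substitute synthetic data.\n"
        "After a successful load, print exactly one line: DATASET_USED: <name>\n"
    )
```

With this shape, `render_guidance(SandboxConfig.from_mapping({}).dataset_cache_root)` renders the default /opt/datasets path, which is the backward-compatibility property the commit relies on.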

Validation

Tests

2830 passed, 56 skipped (full suite excluding live-LLM and Docker e2e).
5 new focused tests cover: SandboxConfig.dataset_cache_root default, custom override, fallback when the YAML omits the field, prompt rendering with default + custom roots (six dataset examples each), fail-loud-on-missing-data instruction present, and DATASET_USED: stamp instruction present.

Focused replay

CODE_GENERATION → EXPERIMENT_RUN against MNIST raw IDX files pre-staged at a non-default path (/tmp/arc_sandbox_trial/datasets). Three hard criteria:

stdout contains DATASET_USED: MNIST — ✅ exact single line emitted by generated code.
run-1.json:metrics has per-condition structured keys — ✅ 105 namespaced keys (e.g. baseline_batchnorm_mlp/0/accuracy, baseline_rmsnorm_mlp/test_accuracy_mean, proposed_curriculum_batchnorm_mlp/2/test_accuracy).
runs/results.json is the structured harness output, not a stdout_parsed fallback — ✅ canonical file carries the full harness_metrics + conditions structure; run-1.json:structured_results is populated.
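The namespaced keys in the second criterion follow a `<condition>/<seed>/<metric>` shape, with per-condition aggregates one level shallower. A minimal sketch of that kind of flattening is shown here; it is an illustration only, not the PR's `_flatten_structured_metrics`, and the function name `flatten_metrics` and the input values are made up.

```python
from typing import Any, Mapping


def flatten_metrics(nested: Mapping[str, Any], prefix: str = "") -> dict[str, float]:
    """Flatten a nested metrics mapping into slash-joined keys.

    Nested mappings extend the key path; numeric leaves are kept as floats;
    anything else (strings, lists, booleans) is skipped.
    """
    flat: dict[str, float] = {}
    for key, value in nested.items():
        name = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            flat.update(flatten_metrics(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


# Illustrative input only; these numbers are not results from the replay.
example = {
    "baseline_batchnorm_mlp": {
        "0": {"accuracy": 0.5},
        "test_accuracy_mean": 0.75,
    }
}
flat = flatten_metrics(example)
```

Here `flat` contains the keys `baseline_batchnorm_mlp/0/accuracy` and `baseline_batchnorm_mlp/test_accuracy_mean`.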

Stage 12 ran 117.7s of real MNIST training on MPS (3 conditions × 3 seeds, test accuracy 97–98%). Pre-patch, the same stage degraded to 1.30s of synthetic substitution.

Behavior change (intentional)

Sandbox network_policy="none" continues to forbid network access and download=True — that is not relaxed. The change is in missing-data semantics:

Before: if a pre-cached dataset file was missing, generated code could silently substitute synthetic tensors and report metrics.
After: the prompt explicitly requires raise FileNotFoundError and a non-zero exit, plus a DATASET_USED: <name> provenance stamp on success.

Configs that omit dataset_cache_root still get /opt/datasets, so the rendered prompt path is byte-identical to before for existing deployments — only the missing-data behavior is tightened.
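For illustration, generated experiment code that follows the tightened instructions might look roughly like the sketch below. It is an assumption about what compliant code could do, not output from the pipeline: the helper names are hypothetical, the `<root>/MNIST/raw/` layout and file names are the ones torchvision's MNIST loader uses for extracted raw files, and the actual dataset load is elided.

```python
import sys
from pathlib import Path

# Extracted raw IDX files, as torchvision's MNIST dataset expects them.
MNIST_RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def require_mnist_raw(cache_root: str) -> Path:
    """Return the raw directory, or raise if any pre-cached file is missing."""
    raw_dir = Path(cache_root) / "MNIST" / "raw"
    missing = [name for name in MNIST_RAW_FILES if not (raw_dir / name).is_file()]
    if missing:
        # Fail loudly instead of substituting synthetic tensors.
        raise FileNotFoundError(
            f"Missing MNIST files under {raw_dir}: {', '.join(missing)}"
        )
    return raw_dir


def main(cache_root: str) -> int:
    try:
        require_mnist_raw(cache_root)
    except FileNotFoundError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 1  # non-zero exit code for the harness
    # The real load (e.g. torchvision MNIST with download=False) would go here.
    print("DATASET_USED: MNIST")  # provenance stamp, emitted once after success
    return 0
```

A script would end with `sys.exit(main(cache_root))`, where `cache_root` is the configured dataset_cache_root, so that a missing file produces a non-zero exit and no stamp.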

Out of scope

Stage 22 figure rendering / BeastMode wiring.
Broader network_policy redesign (e.g., a per-stage allowlist).
Promoting dataset_cache_root to other execution backends (Docker / SSH-remote) — only sandbox mode uses it today.

Add experiment.sandbox.dataset_cache_root (default /opt/datasets) and thread it through the network_disabled_guidance prompt block so generated experiment code is instructed to load torchvision datasets from the configured path with download=False. The default value matches the prior hardcoded constant, so existing configs that omit the field render identical prompts.

Tighten missing-data semantics: the prompt now forbids silent synthetic-data fallback and requires FileNotFoundError + non-zero exit if a pre-cached dataset file is missing. This is an intentional behaviour change for every sandbox network_policy="none" codegen call, motivated by a focused-replay defect where missing MNIST raw files were papered over with synthetic tensors.

Add an explicit DATASET_USED: <name> stdout-stamp requirement so downstream metric capture has a dataset-provenance signal independent of whatever JSON result schema CodeAgent invents.

Focused replay (CODE_GENERATION..EXPERIMENT_RUN against MNIST raw files pre-staged at /tmp/arc_sandbox_trial/datasets) confirms all three checks: the stamp appears in stdout, run-1.json:metrics carries 105 per-condition namespaced keys, and the canonical runs/results.json contains the structured harness output rather than a stdout-parsed fallback.

DermotOBrien-EC added 2 commits May 29, 2026 01:45

Capture structured sandbox experiment metrics

457dfd3

DermotOBrien-EC wants to merge 2 commits into aiming-lab:main from DermotOBrien-EC:sandbox-readiness-metrics-datasets
