Improve sandbox readiness for real dataset experiments #284
Open
DermotOBrien-EC wants to merge 2 commits into
Add experiment.sandbox.dataset_cache_root (default /opt/datasets) and thread it through the network_disabled_guidance prompt block so generated experiment code is instructed to load torchvision datasets from the configured path with download=False. The default value matches the prior hardcoded constant, so existing configs that omit the field render identical prompts.
Tighten missing-data semantics: the prompt now forbids silent synthetic-data fallback and requires FileNotFoundError + non-zero exit if a pre-cached dataset file is missing. This is an intentional behaviour change for every sandbox network_policy="none" codegen call, motivated by a focused-replay defect where missing MNIST raw files were papered over with synthetic tensors.
Add an explicit DATASET_USED: <name> stdout-stamp requirement so downstream metric capture has a dataset-provenance signal independent of whatever JSON result schema CodeAgent invents.
Focused replay (CODE_GENERATION..EXPERIMENT_RUN against MNIST raw files pre-staged at /tmp/arc_sandbox_trial/datasets) confirms all three checks: the stamp appears in stdout, run-1.json:metrics carries 105 per-condition namespaced keys, and the canonical runs/results.json contains the structured harness output rather than a stdout-parsed fallback.
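As a rough illustration of the contract this commit message asks generated experiment code to follow, the sketch below checks for pre-cached files under a cache root, fails loudly when any are missing, and prints the provenance stamp once on success. The names `require_cached_dataset` and `main`, the `<root>/MNIST/raw` layout, and the file list are assumptions for illustration (they follow torchvision's usual MNIST layout); none of this is code from the PR.

```python
import sys
from pathlib import Path
from typing import Sequence

# Assumed default, mirroring experiment.sandbox.dataset_cache_root.
DATASET_CACHE_ROOT = Path("/opt/datasets")

# Assumed layout: torchvision keeps MNIST IDX files under <root>/MNIST/raw.
MNIST_RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def require_cached_dataset(cache_root: Path, name: str, files: Sequence[str]) -> Path:
    """Return the raw-data directory, or raise FileNotFoundError naming what is missing."""
    raw_dir = Path(cache_root) / name / "raw"
    missing = [f for f in files if not (raw_dir / f).is_file()]
    if missing:
        # No synthetic fallback: surface the missing data instead of hiding it.
        raise FileNotFoundError(
            f"{name}: missing pre-cached files in {raw_dir}: {missing}"
        )
    return raw_dir


def main(cache_root: Path = DATASET_CACHE_ROOT) -> int:
    """Return a process exit code: 0 on success, 1 when cached data is missing."""
    try:
        require_cached_dataset(cache_root, "MNIST", MNIST_RAW_FILES)
    except FileNotFoundError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 1
    # A real experiment would load the data here with download=False.
    print("DATASET_USED: MNIST")  # emitted exactly once, after a successful check
    return 0
```

A script built on this sketch would end with `sys.exit(main())`, so a missing file produces a non-zero exit rather than a run on substituted data.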
Problem
Sandbox-mode runs had two distinct silent failure paths that combined to produce misleading downstream papers:
- `results.json` was written into a path the discovery logic did not glob, so metric capture fell back to stdout parsing and dropped the structured per-condition / per-seed metrics. As a result, `run-1.json:metrics` came out empty or shape-degraded even when the experiment had succeeded.
- The `network_disabled_guidance` prompt block hardcoded `/opt/datasets` as the only legal dataset path and did not require any provenance signal. When a deployment's pre-cached data lived elsewhere (or was only partially present), generated code silently substituted synthetic / random-projection tensors and produced plausible-looking numbers that did not come from the dataset the paper claimed.
Fix
Two commits, each independently useful, bundled because they close the same end-to-end contract a focused replay was designed to validate.
- `457dfd3`: Capture structured sandbox experiment metrics
  - `_execution.py`: clear the sandbox project dir before each run, anchor result-file discovery to an mtime taken just before launch, and glob both the project root and a nested `results/` subdir so the auto-suffixed `_project_N/` (or `_project_N/results/`) path the harness picks is always found.
  - `_helpers.py`: parse the structured stdout convention (`PER_SEED:`, `CONDITION_SUMMARY:`, `GAP_TO_BN:` lines) and add a `_flatten_structured_metrics` helper that promotes a per-condition / per-seed dict into the namespaced `<condition>/<seed>/<metric>` keys the rest of the pipeline expects.
- `ded68e8`: Parameterise sandbox dataset cache root
  - `config.py`: add `experiment.sandbox.dataset_cache_root` (default `/opt/datasets`, so existing configs are byte-equivalent).
  - `prompts/shared.py`: render the `network_disabled_guidance` block with the configured cache root in all six pre-cached dataset examples, forbid silent synthetic-data fallback (raise `FileNotFoundError`, exit non-zero), and require a single-line `DATASET_USED: <name>` stdout stamp emitted exactly once after a successful dataset load.
  - `_code_generation.py`: pass `dataset_cache_root` into the block call.
  - `config.researchclaw.example.yaml`: document the new field with a comment that names the behavior tightening.
Validation
Tests
Tests cover the `SandboxConfig.dataset_cache_root` default, a custom override, the fallback when the YAML omits the field, prompt rendering with default and custom roots (six dataset examples each), the presence of the fail-loud-on-missing-data instruction, and the presence of the `DATASET_USED:` stamp instruction.
Focused replay
`CODE_GENERATION` → `EXPERIMENT_RUN` against MNIST raw IDX files pre-staged at a non-default path (`/tmp/arc_sandbox_trial/datasets`). Three hard criteria:
- `DATASET_USED: MNIST`: ✅ exact single line emitted by generated code.
- `run-1.json:metrics` has per-condition structured keys: ✅ 105 namespaced keys (e.g. `baseline_batchnorm_mlp/0/accuracy`, `baseline_rmsnorm_mlp/test_accuracy_mean`, `proposed_curriculum_batchnorm_mlp/2/test_accuracy`).
- `runs/results.json` is the structured harness output, not a `stdout_parsed` fallback: ✅ canonical file carries the full `harness_metrics` + `conditions` structure; `run-1.json:structured_results` is populated.
Stage 12 ran 117.7s of real MNIST training on MPS (3 conditions × 3 seeds, test accuracy 97–98%). Pre-patch, the same stage degraded to 1.30s of synthetic substitution.
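To make the namespaced key shape in the replay results concrete, here is a minimal sketch of how a nested per-condition / per-seed dict could be flattened into `<condition>/<seed>/<metric>` keys. It is not the PR's `_flatten_structured_metrics` helper: the function name `flatten_structured_metrics_sketch`, the assumed input shape, and the metric values in the example are all hypothetical placeholders.

```python
from typing import Any, Dict


def flatten_structured_metrics_sketch(conditions: Dict[Any, Any]) -> Dict[str, float]:
    """Flatten nested dicts into slash-joined keys, keeping only numeric leaves."""
    flat: Dict[str, float] = {}

    def walk(prefix: str, node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(f"{prefix}/{key}" if prefix else str(key), value)
        elif isinstance(node, (int, float)) and not isinstance(node, bool):
            flat[prefix] = float(node)
        # Non-numeric leaves (strings, None, lists) are skipped in this sketch.

    walk("", conditions)
    return flat


# Placeholder values, chosen only to show the key shape.
example = {
    "baseline_batchnorm_mlp": {
        0: {"accuracy": 0.5},
        "test_accuracy_mean": 0.25,
    }
}
print(flatten_structured_metrics_sketch(example))
# {'baseline_batchnorm_mlp/0/accuracy': 0.5, 'baseline_batchnorm_mlp/test_accuracy_mean': 0.25}
```

Per-seed entries gain a seed segment in the key, while condition-level summaries such as a mean sit directly under the condition name, matching the two key shapes quoted in the replay.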
Behavior change (intentional)
Sandbox `network_policy="none"` continues to forbid network access and `download=True`; that is not relaxed. The change is in missing-data semantics: generated code must now `raise FileNotFoundError` and exit non-zero instead of silently falling back to synthetic data, and must emit a `DATASET_USED: <name>` provenance stamp on success.
Configs that omit `dataset_cache_root` still get `/opt/datasets`, so the rendered prompt path is byte-identical to before for existing deployments; only the missing-data behavior is tightened.
Out of scope
- `network_policy` redesign (e.g., a per-stage allowlist).
- Extending `dataset_cache_root` to other execution backends (Docker / SSH-remote); only sandbox mode uses it today.