Fix notebook restart flow: clear stale artifacts and guard against empty/corrupt checkpoint CSVs #134

Open
charlesmartin14 wants to merge 1 commit into main from codex/fix-notebook-to-delete-old-results

@charlesmartin14
Member

Motivation

  • The notebook could attempt to resume from existing but empty or corrupted checkpoint CSVs, causing pandas.errors.EmptyDataError and aborting runs.
  • When not actually resuming, prior run artifacts should be removed so stale files do not clobber a fresh run.

Description

  • Added a remove_stale_run_artifacts() helper and start_fresh logic that clears registry/, per_dataset/, aggregate/, and logs/ when FORCE_RESTART_ALL is True or when restart/resume is disabled (RESTART=False or RESUME_FROM_CHECKPOINT=False).
  • Hardened registry cache loading for full_registry.csv: the read is wrapped in a try/except that catches pd.errors.EmptyDataError and pd.errors.ParserError, an invalid cache is unlinked, and the registry is rebuilt via the scan when needed.
  • Hardened sampled-registry reuse for random10_registry.csv with the same guards and fallback resampling when the cached CSV is invalid or empty.
  • Made per-dataset checkpoint loading (load_dataset_checkpoint) robust to empty/corrupt metrics CSVs by catching CSV parse errors, removing bad files, and starting the dataset from round 1.
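
The reset and the CSV guards described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: only remove_stale_run_artifacts, the four directory names, the three flags, and the two pandas exception types come from the description. The names should_start_fresh and read_csv_or_discard, the run_root parameter, and the choice to recreate each directory after deleting it are assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Optional

import pandas as pd

# Directory names are taken from the PR description.
ARTIFACT_DIRS = ("registry", "per_dataset", "aggregate", "logs")


def should_start_fresh(force_restart_all: bool, restart: bool,
                       resume_from_checkpoint: bool) -> bool:
    """Hypothetical helper: wipe when forced, or when restart/resume is disabled."""
    return force_restart_all or not restart or not resume_from_checkpoint


def remove_stale_run_artifacts(run_root: Path) -> None:
    """Delete each artifact directory under run_root, then recreate it empty.

    Recreating the directories is an assumption; the PR only says they are cleared.
    """
    for name in ARTIFACT_DIRS:
        target = run_root / name
        shutil.rmtree(target, ignore_errors=True)
        target.mkdir(parents=True, exist_ok=True)


def read_csv_or_discard(path: Path) -> Optional[pd.DataFrame]:
    """Hypothetical helper: return the cached frame, or None after unlinking a bad file."""
    if not path.exists():
        return None
    try:
        frame = pd.read_csv(path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError):
        # Zero-byte or malformed file: remove it so the caller can rebuild.
        path.unlink()
        return None
    if frame.empty:
        # Header-only file: parses without error but carries no rows.
        path.unlink()
        return None
    return frame
```

A caller would treat a None result as "no usable cache": rebuild the registry, resample, or start the dataset from round 1, depending on which file was being read.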

Testing

  • Parsed the updated notebook JSON with python -c "import json; json.load(open('notebooks/XGBWW_Random10_LongRun_Alpha_Tracking.ipynb')); print('ok')", which succeeded.
  • Verified presence of the new guards/reset messages with grep searches for Starting fresh run, remove_stale_run_artifacts, Invalid cached full registry, Invalid cached sampled registry, and Invalid dataset checkpoint in notebooks/XGBWW_Random10_LongRun_Alpha_Tracking.ipynb, which returned matches.
  • Opened and inspected relevant notebook cells to confirm the remove_stale_run_artifacts, safe CSV reads for full_registry.csv/random10_registry.csv, and updated load_dataset_checkpoint behavior were inserted correctly.
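
The first two checks above (JSON parse plus grep) could be combined into one small script along these lines. This is a sketch under assumptions: find_markers is a hypothetical name, and the code assumes the standard notebook layout in which each cell's source is either a string or a list of strings. The marker strings are the ones listed in the bullets.

```python
import json
from pathlib import Path
from typing import Dict, Iterable

# Marker strings listed in the Testing section of this PR.
MARKERS = (
    "Starting fresh run",
    "remove_stale_run_artifacts",
    "Invalid cached full registry",
    "Invalid cached sampled registry",
    "Invalid dataset checkpoint",
)


def find_markers(notebook_path: Path,
                 markers: Iterable[str] = MARKERS) -> Dict[str, bool]:
    """Parse the notebook as JSON and report which markers occur in cell sources."""
    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    sources = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # A cell's source may be stored as one string or as a list of lines.
        sources.append(source if isinstance(source, str) else "".join(source))
    text = "\n".join(sources)
    return {marker: marker in text for marker in markers}
```

Calling find_markers(Path("notebooks/XGBWW_Random10_LongRun_Alpha_Tracking.ipynb")) and checking that every value is True would reproduce the grep step. Like the original checks, this only inspects the notebook text and does not execute any cell.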

Codex Task
