Notebook: add checkpoint resume, CPU threading, and finer checkpoints to AdultIncome long-run tracking #115

Open
charlesmartin14 wants to merge 1 commit into main from codex/find-checkpoint-behavior-in-notebook-m8t5al

@charlesmartin14
Member

Motivation

  • Improve robustness for long-running training by enabling resume-from-checkpoint behavior so interrupted runs can continue without restarting from zero.
  • Improve CPU utilization and platform awareness by auto-detecting available threads and Apple Silicon to tune XGBoost threading.
  • Checkpoint more often and train in finer-grained chunks to reduce the work lost when a run is interrupted between checkpoints.

Description

  • Added import os and import platform, computed CPU_THREADS and IS_APPLE_SILICON, and set nthread in params, with an explanatory comment for macOS ARM.
  • Reduced CHUNK_SIZE from 50 to 25, changed CHECKPOINT_EVERY_STEPS from 2 to 1, and introduced RESUME_FROM_CHECKPOINT = True.
  • Implemented resume logic that reads results_path_csv and latest_model_path, validates that the last saved round is divisible by CHUNK_SIZE, loads the xgb.Booster to continue training, and sets start_step accordingly.
  • Ensured that checkpointing persists metrics to CSV and Feather (with try/except) and saves both a per-round model and latest_model_path; added diagnostic prints and a final results_df fallback so that results are available at the end of the run.
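
For reviewers who want to see the configuration changes in one place, here is a minimal sketch of the thread detection and checkpoint settings described above. It is not copied from the notebook: the detection expressions are assumptions about how CPU_THREADS and IS_APPLE_SILICON might be computed, and only the nthread entry of params is shown.

```python
import os
import platform

# Values stated in the description above.
CHUNK_SIZE = 25
CHECKPOINT_EVERY_STEPS = 1
RESUME_FROM_CHECKPOINT = True

# Assumption: thread count comes from os.cpu_count(), which can return
# None on some platforms, so fall back to a single thread.
CPU_THREADS = os.cpu_count() or 1

# Assumption: Apple Silicon is detected via the platform module. A native
# arm64 Python on macOS reports system "Darwin" and machine "arm64";
# a Python running under Rosetta reports "x86_64" instead.
IS_APPLE_SILICON = platform.system() == "Darwin" and platform.machine() == "arm64"

# Only nthread is shown here; the notebook's other XGBoost parameters
# are omitted because the description does not list them.
params = {"nthread": CPU_THREADS}

print(f"CHUNK_SIZE={CHUNK_SIZE}")
print(f"CHECKPOINT_EVERY_STEPS={CHECKPOINT_EVERY_STEPS}")
print(f"RESUME_FROM_CHECKPOINT={RESUME_FROM_CHECKPOINT}")
print(f"CPU_THREADS={CPU_THREADS} IS_APPLE_SILICON={IS_APPLE_SILICON}")
```

The description does not say whether nthread is adjusted differently on Apple Silicon, so the sketch passes the detected thread count through unchanged.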

Testing

  • Executed the notebook initialization cells in Colab and confirmed that the configuration values for TOTAL_ROUNDS, CHUNK_SIZE, CHECKPOINT_EVERY_STEPS, RESUME_FROM_CHECKPOINT, and CPU_THREADS printed without error.
  • Ran the training loop cells for several steps to verify checkpoint saving to results_path_csv and model files and to confirm results_df.tail() returns without error.
  • Verified resume path by simulating existing results_path_csv and latest_model_path, and confirmed the code loads the booster and resumes from the expected round without exceptions.
  • All automated notebook cell executions used during testing completed successfully with no uncaught exceptions.
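
The simulated resume check can be approximated outside the notebook. The sketch below creates a placeholder results CSV and model file in a temporary directory, then applies the divisibility rule from the Description to derive the step to resume from. The helper name find_resume_step, the "round" column name, the file names, and the zero-based step convention are all hypothetical; the notebook's actual schema may differ.

```python
import csv
import os
import tempfile

CHUNK_SIZE = 25  # value stated in the Description


def find_resume_step(results_path_csv, latest_model_path, chunk_size=CHUNK_SIZE):
    """Return (start_step, last_round); (0, 0) means start from scratch.

    Hypothetical helper: assumes the CSV has a "round" column and that
    each step trains exactly chunk_size rounds.
    """
    if not (os.path.exists(results_path_csv) and os.path.exists(latest_model_path)):
        return 0, 0
    with open(results_path_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0, 0
    last_round = int(rows[-1]["round"])
    # A checkpoint that does not fall on a chunk boundary cannot be mapped
    # to a whole step, so refuse to resume from it.
    if last_round % chunk_size != 0:
        raise ValueError(
            f"last saved round {last_round} is not divisible by {chunk_size}"
        )
    return last_round // chunk_size, last_round


with tempfile.TemporaryDirectory() as tmp:
    results_path_csv = os.path.join(tmp, "results.csv")
    latest_model_path = os.path.join(tmp, "latest_model.json")

    # Placeholder rows standing in for three completed chunks; the metric
    # values are dummies, not real results.
    with open(results_path_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["round", "metric"])
        writer.writerows([[25, 0.0], [50, 0.0], [75, 0.0]])

    # Placeholder model file: this sketch only checks that it exists.
    with open(latest_model_path, "w") as f:
        f.write("{}")

    start_step, last_round = find_resume_step(results_path_csv, latest_model_path)
    print(start_step, last_round)  # prints "3 75"
```

In a real run the booster would then be loaded from latest_model_path, for example with xgb.Booster(model_file=latest_model_path), and passed to xgb.train through its xgb_model argument so that training continues from the saved trees.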

Codex Task
