feat: Added save_optimizer flag to control saving optimizer or not in checkpointing #1843
base: main
Conversation
📝 Walkthrough: Introduces a new `save_optimizer` configuration flag that controls whether optimizer state is saved with checkpoints, along with a `CheckpointManager.get_resume_paths()` helper for checkpoint path resolution.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/utils/checkpoint.py (1)
52-63: Add `save_optimizer` to exemplar YAML files and move the default from code to YAML.
The `save_optimizer` key is not reflected in any exemplar YAML files under `examples/configs/`, violating the guideline that config defaults live in YAML. Remove the inline default comment `# Default: True` from line 62 and add `save_optimizer: true` to the checkpointing sections in all exemplar YAMLs (e.g., `examples/configs/dpo.yaml`, `examples/configs/grpo_math_1B.yaml`). Additionally, update the docstring to explicitly document the recommended default value and that the field accepts boolean values.
🤖 Fix all issues with AI agents
In `@nemo_rl/utils/checkpoint.py`:
- Line 110: Replace the in-code default for `save_optimizer` with a strict read from the config so the YAML remains the source of truth. In the constructor where `self.save_optimizer` is set (currently `self.save_optimizer = config.get("save_optimizer", True)`), read the key without a fallback default (e.g., `config["save_optimizer"]`, or `config.get("save_optimizer")` followed by a presence check) and raise a clear error if the key is missing so callers know to set it in YAML; see the sketch below.
🧹 Nitpick comments (1)
nemo_rl/utils/checkpoint.py (1)
125-132: Add `stacklevel` to the warning for better caller attribution. This warning points at the helper instead of the caller.
🔧 Suggested tweak
-     warnings.warn(
+     warnings.warn(
          f"Optimizer state not found at {optimizer_path}. "
          "Optimizer will be freshly initialized."
-     )
+         , stacklevel=2)
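As a standalone illustration (not code from this PR) of why `stacklevel` matters, the warning below is attributed to the caller's line rather than the helper's:

```python
import warnings

def load_optimizer_state(optimizer_path: str):
    # With stacklevel=2 the warning is reported at the caller's call site,
    # which is usually the more useful location when debugging resume logic.
    warnings.warn(
        f"Optimizer state not found at {optimizer_path}. "
        "Optimizer will be freshly initialized.",
        stacklevel=2,
    )
    return None

def resume_training():
    # The UserWarning emitted above is attributed to this line.
    load_optimizer_state("/tmp/missing_optimizer")

if __name__ == "__main__":
    resume_training()
```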
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
Force-pushed the branch from b44f4c8 to 140822d.
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
@odedov-dream is there any unit or functional test we can add to guard the functionality of the `save_optimizer` flag?
Signed-off-by: Oded Ovadia <odedov@dreamgroup.com>
Sure, added 2 unit tests.
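For readers skimming the thread, here is a rough sketch of the kind of check such a test might make (the construction details are assumptions and may not match the tests actually added):

```python
# Hypothetical sketch, not the PR's actual tests: the manager should honor the
# flag exactly as read from the checkpointing config.
from nemo_rl.utils.checkpoint import CheckpointManager

def test_save_optimizer_flag_is_read_from_config(tmp_path):
    # `checkpoint_dir` and the bare-dict construction are assumptions about the schema.
    config = {"checkpoint_dir": str(tmp_path), "save_optimizer": False}
    manager = CheckpointManager(config)
    assert manager.save_optimizer is False
```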
What does this PR do?
Add a `save_optimizer` configuration flag to control whether optimizer state is saved with checkpoints, greatly reducing checkpoint size when optimizer state is not needed for resumption.
- Add a `save_optimizer` field to `CheckpointingConfig` (default: `True` for backward compatibility).
- Add a `CheckpointManager.get_resume_paths()` helper to centralize checkpoint path resolution logic.

Issues
Closes #1828
Usage
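A minimal sketch of toggling the flag (only `save_optimizer` itself comes from this PR; the sibling key and exact config schema are assumptions):

```python
from nemo_rl.utils.checkpoint import CheckpointManager

# Equivalent YAML would set `save_optimizer: false` under the checkpointing section.
checkpointing_config = {
    "checkpoint_dir": "results/my_run/checkpoints",  # illustrative key
    "save_optimizer": False,  # skip optimizer state to shrink checkpoint size
}

manager = CheckpointManager(checkpointing_config)
```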
Before your PR is "Ready for review"
Pre checks:
Summary by CodeRabbit
Release Notes
New Features
Refactor