feat: Update mlflow to work better with env vars, manual run id, fix tests #1874
nathan-az wants to merge 8 commits into NVIDIA-NeMo:main
📝 Walkthrough
The changes refactor MLflow logging in NeMo RL to support explicit run_id management, improve run lifecycle handling, and modernize logging APIs. All MLflowConfig fields become optional via NotRequired annotations. Logger initialization is reworked to derive tracking_uri from the config or the environment, manage run state conditionally, and store run_id. Logging methods are updated to use MLflow's batch APIs and pass run_id parameters. Plot logging switches from an artifact-based to a figure-based approach.
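A minimal sketch of the optional-config and tracking URI behaviour described in the walkthrough, not the code from this PR. The field names other than `tracking_uri` and `run_id`, the helper name `resolve_tracking_uri`, and the precedence (config first, then the `MLFLOW_TRACKING_URI` environment variable) are assumptions for illustration. `total=False` is used here so the sketch runs on older Python versions; it makes every key optional, which is the same effect as marking each field `NotRequired`.

```python
import os
from typing import Mapping, Optional, TypedDict


class MLflowConfig(TypedDict, total=False):
    # Every key may be omitted (hypothetical field set for illustration).
    tracking_uri: str
    experiment_name: str
    run_name: str
    run_id: str


def resolve_tracking_uri(
    config: MLflowConfig, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the tracking URI to use, or None to keep MLflow's default.

    Assumed precedence: an explicit config value wins over the environment.
    """
    env = os.environ if environ is None else environ
    return config.get("tracking_uri") or env.get("MLFLOW_TRACKING_URI")
```

A caller would pass the result to `mlflow.set_tracking_uri(...)` only when it is not `None`, so an unset config and environment leave MLflow's own default untouched.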
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@nemo_rl/utils/logger.py`:
- Line 776: Remove the dead call to mlflow.get_tracking_uri() in
nemo_rl/utils/logger.py — locate the unused standalone expression
mlflow.get_tracking_uri() and delete it (no replacement needed since its return
value is discarded and there are no side-effects relied upon); ensure no other
code expects a side-effect from this call and run tests/lint to confirm.
- Around line 786-803: The current condition "if run is None or run_id" causes
start_run() to be called even when an active run exists; update the logic around
mlflow.active_run(), run_id, mlflow.start_run and mlflow.end_run so you only
start a new run when needed: if no active run (run is None) start_run as before;
if there is an active run and run_id is provided, check if run.info.run_id ==
run_id and reuse the active run, otherwise either call mlflow.end_run() before
mlflow.start_run(run_id=run_id) or call mlflow.start_run(..., nested=True) to
avoid the exception; ensure mlflow.set_experiment(experiment_name) still runs
when creating/setting the experiment and that self.run and self.run_id are
assigned from the resolved run.
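The decision logic suggested in the second comment can be sketched as a small pure function. This is an illustration under assumptions, not the fix that was merged: the name `plan_run_start` and the returned action labels are hypothetical, and the comments only indicate which `mlflow` calls a caller would make for each action.

```python
from typing import Optional


def plan_run_start(
    active_run_id: Optional[str], requested_run_id: Optional[str]
) -> str:
    """Decide how to obtain a run without starting one over an active run.

    active_run_id: run.info.run_id of mlflow.active_run(), or None if no run
    is active. requested_run_id: the run_id from config, or None.
    """
    if active_run_id is None:
        # No active run: call mlflow.start_run(...), passing run_id if given.
        return "start"
    if requested_run_id is None or requested_run_id == active_run_id:
        # The active run is already the one we want: keep using it.
        return "reuse"
    # A different run is active: call mlflow.end_run() first, then
    # mlflow.start_run(run_id=requested_run_id).
    return "end_then_start"
```

The alternative mentioned in the comment, `mlflow.start_run(..., nested=True)`, would replace the `"end_then_start"` branch when the active run should be kept open as a parent.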
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
aeb1549 to 30a40e2
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
terrykong left a comment
thanks @nathan-az for the fixes!
some questions
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Head branch was pushed to by a user without write access
5ce2492 to bb48abe
This PR improves the MLflow integration by:
- [...] `log_metrics` for a single API call for multiple metrics.
- [...] `/mlruns` directory
This is both a bug fix and a series of feature improvements. The key benefit of the new structure is hierarchical specification (i.e. run ID -> experiment -> run name) and support for specifying these via environment variables.
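One possible reading of the hierarchical specification (run ID -> experiment -> run name) is sketched below. It is an interpretation, not the PR's implementation: the helper name `resolve_run_target`, the config keys, and the rule that config values win over environment variables are all assumptions. `MLFLOW_RUN_ID` and `MLFLOW_EXPERIMENT_NAME` are environment variables that MLflow itself recognises.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_run_target(
    config: Mapping[str, str], environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Pick what identifies the run, most specific setting first."""
    env = os.environ if environ is None else environ

    # 1. An explicit run ID identifies an existing run on its own.
    run_id = config.get("run_id") or env.get("MLFLOW_RUN_ID")
    if run_id:
        return {"run_id": run_id}

    # 2. Otherwise fall back to the experiment, then 3. the run name.
    target: Dict[str, str] = {}
    experiment = config.get("experiment_name") or env.get("MLFLOW_EXPERIMENT_NAME")
    if experiment:
        target["experiment_name"] = experiment
    if config.get("run_name"):
        target["run_name"] = config["run_name"]
    return target
```

With this shape, setting only `MLFLOW_RUN_ID` in the environment is enough to attach to an existing run, while an empty result leaves every choice to MLflow's defaults.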
Summary by CodeRabbit
New Features
Bug Fixes