Skip to content

feat: Update mlflow to work better with env vars, manual run id, fix tests#1874

Open
nathan-az wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
nathan-az:nazrak/mlflow-env
Open

feat: Update mlflow to work better with env vars, manual run id, fix tests#1874
nathan-az wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
nathan-az:nazrak/mlflow-env

Conversation

@nathan-az
Copy link
Contributor

@nathan-az nathan-az commented Feb 4, 2026

This PR improves the MLFlow integration by:

  • FIXING metric logging by unnesting the metric dicts (without this MLFlow logging appears to be breaking currently)
  • supporting env var specifications according to the standard MLFlow environment variables
  • uses log_metrics for single API call for multiple metrics
  • corrects tests
  • mocks the MLFlow logger object so tests can be run locally without creating ./mlruns directory

This is both a bug fix and series of feature improvements. Key benefit to the new structure is hierarchical specification (i.e. run ID -> experiment -> run name), and support for specifying via environment variables.

Summary by CodeRabbit

  • New Features

    • MLflow configuration fields are now more flexible with optional settings for experiment names, run IDs, and tracking URIs
    • Improved run management allows reusing or creating runs based on configuration
  • Bug Fixes

    • Enhanced resource cleanup ensures proper logger shutdown and avoids resource leaks
    • Added defensive checks to prevent errors in concurrent multi-logger scenarios

@nathan-az nathan-az requested review from a team as code owners February 4, 2026 04:46
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The changes refactor MLflow logging in Nemo RL to support explicit run_id management, improve run lifecycle handling, and modernize logging APIs. MLflowConfig fields become all optional with NotRequired annotations. Logger initialization is reworked to derive tracking_uri from config or environment, manage run state conditionally, and store run_id. Logging methods are updated to use MLflow's batch APIs and pass run_id parameters. Plot logging switches from artifact-based to figure-based approaches.

Changes

Cohort / File(s) Summary
MLflow Configuration
nemo_rl/utils/logger.py
MLflowConfig TypedDict refactored: all fields now NotRequired[str | None], including new run_id field. Allows more flexible configuration with explicit None handling.
MLflow Logger Implementation
nemo_rl/utils/logger.py
__init__ reworked to derive tracking_uri from config or environment variable, obtain active run, and conditionally start or reuse runs based on run_id and experiment_name. Introduces self.run_id storage. log_metrics now batches metrics into a dict and calls mlflow.log_metrics with run_id. log_params and log_plot updated to accept and use run_id parameter. log_plot switches from temporary PNG + artifact logging to mlflow.log_figure. __del__ now safely calls end_run() with try/except guard. Defensive hasattr checks added for GPU monitor cleanup.
MLflow Logger Tests
tests/unit/utils/test_logger.py
Test signatures updated to reflect new method parameters (run_id in start_run, log_params, log_metrics). Tests patched for active_run() returning None. Batch metric logging tests replaced per-metric calls with single log_metrics call. test_log_plot updated to assert mlflow.log_figure calls instead of artifact logging with temporary files. Multi-logger integration tests extended to include and verify MLflowLogger initialization alongside other loggers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR contains major changes (API modifications, new parameters, feature support) but lacks documented test execution results. Add explicit test execution results to PR description and address identified review comments before merging.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main changes: MLflow environment variable handling, manual run ID support, and test fixes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@nemo_rl/utils/logger.py`:
- Line 776: Remove the dead call to mlflow.get_tracking_uri() in
nemo_rl/utils/logger.py — locate the unused standalone expression
mlflow.get_tracking_uri() and delete it (no replacement needed since its return
value is discarded and there are no side-effects relied upon); ensure no other
code expects a side-effect from this call and run tests/lint to confirm.
- Around line 786-803: The current condition "if run is None or run_id" causes
start_run() to be called even when an active run exists; update the logic around
mlflow.active_run(), run_id, mlflow.start_run and mlflow.end_run so you only
start a new run when needed: if no active run (run is None) start_run as before;
if there is an active run and run_id is provided, check if run.info.run_id ==
run_id and reuse the active run, otherwise either call mlflow.end_run() before
mlflow.start_run(run_id=run_id) or call mlflow.start_run(..., nested=True) to
avoid the exception; ensure mlflow.set_experiment(experiment_name) still runs
when creating/setting the experiment and that self.run and self.run_id are
assigned from the resolved run.

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
Copy link
Contributor

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @nathan-az for the fixes!

some questions

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
terrykong
terrykong previously approved these changes Feb 4, 2026
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Feb 4, 2026
@terrykong terrykong enabled auto-merge (squash) February 4, 2026 22:21
Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>
auto-merge was automatically disabled February 4, 2026 22:31

Head branch was pushed to by a user without write access

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 4, 2026
terrykong
terrykong previously approved these changes Feb 4, 2026
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 5, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Feb 6, 2026
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 8, 2026
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests community-request needs-follow-up Issue needs follow-up

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants