feat: Update mlflow to work better with env vars, manual run id, fix tests by nathan-az · Pull Request #1874 · NVIDIA-NeMo/RL

nathan-az · 2026-02-04T04:46:35Z

This PR improves the MLFlow integration by:

FIXING metric logging by unnesting the metric dicts (without this MLFlow logging appears to be breaking currently)
supporting env var specifications according to the standard MLFlow environment variables
uses log_metrics for single API call for multiple metrics
corrects tests
mocks the MLFlow logger object so tests can be run locally without creating ./mlruns directory

This is both a bug fix and series of feature improvements. Key benefit to the new structure is hierarchical specification (i.e. run ID -> experiment -> run name), and support for specifying via environment variables.

Summary by CodeRabbit

New Features
- MLflow configuration fields are now more flexible with optional settings for experiment names, run IDs, and tracking URIs
- Improved run management allows reusing or creating runs based on configuration
Bug Fixes
- Enhanced resource cleanup ensures proper logger shutdown and avoids resource leaks
- Added defensive checks to prevent errors in concurrent multi-logger scenarios

coderabbitai · 2026-02-04T04:51:51Z

📝 Walkthrough

Walkthrough

The changes refactor MLflow logging in Nemo RL to support explicit run_id management, improve run lifecycle handling, and modernize logging APIs. MLflowConfig fields become all optional with NotRequired annotations. Logger initialization is reworked to derive tracking_uri from config or environment, manage run state conditionally, and store run_id. Logging methods are updated to use MLflow's batch APIs and pass run_id parameters. Plot logging switches from artifact-based to figure-based approaches.

Changes

Cohort / File(s)	Summary
MLflow Configuration `nemo_rl/utils/logger.py`	MLflowConfig TypedDict refactored: all fields now `NotRequired[str \| None]`, including new `run_id` field. Allows more flexible configuration with explicit None handling.
MLflow Logger Implementation `nemo_rl/utils/logger.py`	`__init__` reworked to derive `tracking_uri` from config or environment variable, obtain active run, and conditionally start or reuse runs based on `run_id` and `experiment_name`. Introduces `self.run_id` storage. `log_metrics` now batches metrics into a dict and calls `mlflow.log_metrics` with `run_id`. `log_params` and `log_plot` updated to accept and use `run_id` parameter. `log_plot` switches from temporary PNG + artifact logging to `mlflow.log_figure`. `__del__` now safely calls `end_run()` with try/except guard. Defensive `hasattr` checks added for GPU monitor cleanup.
MLflow Logger Tests `tests/unit/utils/test_logger.py`	Test signatures updated to reflect new method parameters (`run_id` in `start_run`, `log_params`, `log_metrics`). Tests patched for `active_run()` returning None. Batch metric logging tests replaced per-metric calls with single `log_metrics` call. `test_log_plot` updated to assert `mlflow.log_figure` calls instead of artifact logging with temporary files. Multi-logger integration tests extended to include and verify MLflowLogger initialization alongside other loggers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR contains major changes (API modifications, new parameters, feature support) but lacks documented test execution results.	Add explicit test execution results to PR description and address identified review comments before merging.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: MLflow environment variable handling, manual run ID support, and test fixes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

✨ 136.113.208.247/32 (new)
34.170.211.100/32
35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@nemo_rl/utils/logger.py`:
- Line 776: Remove the dead call to mlflow.get_tracking_uri() in
nemo_rl/utils/logger.py — locate the unused standalone expression
mlflow.get_tracking_uri() and delete it (no replacement needed since its return
value is discarded and there are no side-effects relied upon); ensure no other
code expects a side-effect from this call and run tests/lint to confirm.
- Around line 786-803: The current condition "if run is None or run_id" causes
start_run() to be called even when an active run exists; update the logic around
mlflow.active_run(), run_id, mlflow.start_run and mlflow.end_run so you only
start a new run when needed: if no active run (run is None) start_run as before;
if there is an active run and run_id is provided, check if run.info.run_id ==
run_id and reuse the active run, otherwise either call mlflow.end_run() before
mlflow.start_run(run_id=run_id) or call mlflow.start_run(..., nested=True) to
avoid the exception; ensure mlflow.set_experiment(experiment_name) still runs
when creating/setting the experiment and that self.run and self.run_id are
assigned from the resolved run.

nemo_rl/utils/logger.py

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

terrykong

thanks @nathan-az for the fixes!

some questions

nemo_rl/utils/logger.py

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

nathan-az requested review from a team as code owners February 4, 2026 04:46

coderabbitai bot reviewed Feb 4, 2026

View reviewed changes

nemo_rl/utils/logger.py Outdated Show resolved Hide resolved

nemo_rl/utils/logger.py Show resolved Hide resolved

github-actions bot added the community-request label Feb 4, 2026

nathan-az added 3 commits February 4, 2026 13:24

Update mlflow to work better with env vars, manual run id, fix tests

2268d2f

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

Add safety to end active run if exists and doesn't match desired run id

60d0139

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

Use a clearer hierarchy of run id -> experiment name -> run name

30a40e2

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

nathan-az force-pushed the nazrak/mlflow-env branch from aeb1549 to 30a40e2 Compare February 4, 2026 13:24

Flatten metrics before logging in mlflow

26af78c

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

terrykong reviewed Feb 4, 2026

View reviewed changes

nemo_rl/utils/logger.py Show resolved Hide resolved

nemo_rl/utils/logger.py Outdated Show resolved Hide resolved

remove redundant check

b1cf929

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

terrykong previously approved these changes Feb 4, 2026

View reviewed changes

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Feb 4, 2026

terrykong enabled auto-merge (squash) February 4, 2026 22:21

Run pre-commit

bb48abe

Signed-off-by: Nathan Azrak <nathan.azrak@gmail.com>

auto-merge was automatically disabled February 4, 2026 22:31
Head branch was pushed to by a user without write access

nathan-az dismissed terrykong’s stale review via bb48abe February 4, 2026 22:31

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 4, 2026

terrykong had a problem deploying to nemo-ci February 4, 2026 22:36 — with GitHub Actions Error

terrykong previously approved these changes Feb 4, 2026

View reviewed changes

nathan-az dismissed terrykong’s stale review via 5ce2492 February 5, 2026 01:17

nathan-az force-pushed the nazrak/mlflow-env branch from 5ce2492 to bb48abe Compare February 5, 2026 01:57

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 5, 2026

terrykong approved these changes Feb 5, 2026

View reviewed changes

terrykong enabled auto-merge (squash) February 5, 2026 01:58

terrykong temporarily deployed to nemo-ci February 5, 2026 01:58 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci February 5, 2026 08:26 — with GitHub Actions Inactive

terrykong had a problem deploying to nemo-ci February 5, 2026 13:47 — with GitHub Actions Failure

chtruong814 added the needs-follow-up Issue needs follow-up label Feb 6, 2026

Merge branch 'main' into nazrak/mlflow-env

06db067

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 8, 2026

terrykong had a problem deploying to nemo-ci February 8, 2026 22:50 — with GitHub Actions Failure

Merge branch 'main' into nazrak/mlflow-env

0fd01c7

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 9, 2026

terrykong had a problem deploying to nemo-ci February 9, 2026 17:13 — with GitHub Actions Error

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Update mlflow to work better with env vars, manual run id, fix tests#1874

feat: Update mlflow to work better with env vars, manual run id, fix tests#1874
nathan-az wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
nathan-az:nazrak/mlflow-env

nathan-az commented Feb 4, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Feb 4, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

terrykong left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nathan-az commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nathan-az commented Feb 4, 2026 •

edited

Loading

coderabbitai bot commented Feb 4, 2026 •

edited

Loading