
feat: Add native Comet ML experiment tracking #1411

Open
LoganVegnaSHOP wants to merge 38 commits into NVIDIA-NeMo:main from LoganVegnaSHOP:feat/comet-logger

Conversation

@LoganVegnaSHOP

Summary

  • Add CometLogger class in nemo_automodel/components/loggers/comet_utils.py — mirrors the existing MLflowLogger pattern
  • Wire Comet logging into TrainFinetuneRecipeForNextTokenPrediction:
    • Training metrics (loss, TPS, grad_norm, lr, mem) via log_train_metrics()
    • Validation metrics (val_loss) via log_val_metrics()
    • MoE load balance metrics via _log_moe_metrics()
  • Enable by adding a comet: block to the training YAML config:
comet:
  project_name: "my-project"
  experiment_name: "my-run"      # optional, auto-generated from model name
  workspace: "my-workspace"      # optional, uses COMET_WORKSPACE env var
  api_key: null                  # optional, uses COMET_API_KEY env var
  tags: ["finetune", "llama"]    # optional
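The wiring described above can be sketched as follows. This is a minimal, hypothetical sketch, not the actual contents of comet_utils.py: the class and method names (CometLogger, log_train_metrics) and the config fields come from the PR summary, while the comet_ml calls shown (Experiment, set_name, add_tag, log_metrics) are standard comet_ml SDK APIs. The real implementation may differ.

```python
import os


class CometLogger:
    """Hypothetical sketch of a Comet logger mirroring the MLflowLogger pattern."""

    def __init__(self, project_name, experiment_name=None, workspace=None,
                 api_key=None, tags=None):
        # Lazy import so that omitting the `comet:` block never requires comet_ml.
        try:
            from comet_ml import Experiment
        except ImportError as e:
            raise ImportError(
                "comet_ml is required for the `comet:` config block; "
                "install it with `pip install comet-ml`."
            ) from e
        # Fall back to COMET_API_KEY / COMET_WORKSPACE env vars, as in the config above.
        self._exp = Experiment(
            api_key=api_key or os.environ.get("COMET_API_KEY"),
            project_name=project_name,
            workspace=workspace or os.environ.get("COMET_WORKSPACE"),
        )
        if experiment_name:
            self._exp.set_name(experiment_name)
        for tag in tags or []:
            self._exp.add_tag(tag)

    def log_train_metrics(self, metrics, step):
        # metrics is a flat dict, e.g. {"loss": ..., "tps": ..., "grad_norm": ..., "lr": ..., "mem": ...}
        self._exp.log_metrics(metrics, step=step)
```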

Motivation

Currently, Comet ML users must rely on Comet's wandb auto-patcher, which does not
reliably intercept wandb.log() calls because NeMo gates all wandb logging behind
if wandb.run is not None — requiring a valid wandb API key and active run even when
wandb is only used as a bridge to Comet. Native Comet support (matching the existing
MLflow integration pattern) eliminates this fragile dependency.

Test plan

  • Run a fine-tuning job with comet: config and verify metrics appear in Comet dashboard
  • Verify training metrics (loss, TPS, grad_norm, lr, mem) are logged at log_remote_every_steps frequency
  • Verify validation metrics (val_loss) are logged on each validation step
  • Verify MoE load balance metrics are logged (on MoE models)
  • Verify that omitting comet: config has no effect (backward compatible)
  • Verify that running with a comet: block but without comet_ml installed raises a clear ImportError

@copy-pr-bot

copy-pr-bot bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as draft February 27, 2026 23:05
@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as ready for review February 27, 2026 23:38
@hemildesai
Contributor

/ok to test b24a082

@akoumpa
Contributor

akoumpa commented Mar 3, 2026

/ok to test 1f47ddc

hemildesai and others added 2 commits March 4, 2026 12:23
With seq_length=4 and truncation=True, the sequence is truncated to
just the start of the system message — no assistant tokens survive.
Skip the "must have supervised tokens" assertion when truncation is
enabled since it may legitimately remove all assistant content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
When a chat template is used, the template's own turn-ending token
(e.g. <|im_end|>) terminates sequences — not eos_token_id
(e.g. <|endoftext|>). These are different tokens for Qwen2.5/3.
Remove assertions that eos_token_id must appear in supervised labels
since it was only true when EOS was manually appended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
LoganVegnaSHOP and others added 9 commits March 5, 2026 11:35
Signed-off-by: root <root@pool0-00155.cm.cluster>
…elism

Strip the 4D attention_mask from the batch and register forward pre-hooks
on self_attn modules to set is_causal=True, so that SDPA handles causal
masking internally when using dense context parallelism without TE.

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
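The hook mechanism this commit describes can be illustrated with a toy module. This is a sketch, not the NeMo code: ToyAttn is a stand-in for a self_attn module, and the example only shows how a forward pre-hook registered with with_kwargs=True (available in PyTorch ≥ 2.0) can strip the mask and force is_causal=True before the forward runs.

```python
import torch
import torch.nn as nn


class ToyAttn(nn.Module):
    # Stand-in for a self_attn module; returns its inputs so the hook's effect is visible.
    def forward(self, x, attention_mask=None, is_causal=False):
        return attention_mask, is_causal


def force_causal(module, args, kwargs):
    # Drop the 4D attention_mask and let SDPA apply causal masking internally.
    kwargs["attention_mask"] = None
    kwargs["is_causal"] = True
    return args, kwargs


attn = ToyAttn()
attn.register_forward_pre_hook(force_causal, with_kwargs=True)

mask, causal = attn(torch.zeros(2, 4), attention_mask=torch.ones(2, 1, 4, 4))
print(mask, causal)  # None True
```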
Replace functools.partial(F.scaled_dot_product_attention, ...) with a
closure that resolves F.scaled_dot_product_attention at call time. This
ensures CP's runtime monkey-patch of the function is picked up by all
custom models instead of being bypassed by the early-bound reference.

Also make _attach_context_parallel_hooks public (renamed to
attach_context_parallel_hooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
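The early- vs late-binding distinction in this commit can be demonstrated in a few lines (illustrative only, not the NeMo code): functools.partial captures the function object at creation time, so a later monkey-patch of the module attribute is bypassed, while a closure that looks the attribute up at call time sees the patch.

```python
import functools
import types

# Stand-in for a module whose attention function CP monkey-patches at runtime.
mod = types.SimpleNamespace(attn=lambda q: "dense")

early_bound = functools.partial(mod.attn)  # binds the current function object

def late_bound(q):
    return mod.attn(q)  # attribute resolved on every call

mod.attn = lambda q: "cp-patched"  # runtime monkey-patch, as CP does

print(early_bound(None))  # dense      -- patch bypassed
print(late_bound(None))   # cp-patched -- patch picked up
```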
…ends

Extract SDPA backend selection into a resolve_sdpa_method() helper that
accepts string names from YAML config (e.g. ["flash_attention",
"efficient_attention"]) and converts them to SDPBackend enum members.
When no explicit config is provided, auto-selects based on CP and
activation checkpointing constraints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
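The string-to-enum mapping this commit describes might look like the sketch below. The helper name resolve_sdpa_method and the YAML strings come from the commit message; SDPBackend here is a stand-in enum (the real one is torch.nn.attention.SDPBackend), and the auto-selection fallback is omitted.

```python
from enum import Enum, auto


class SDPBackend(Enum):
    # Stand-in for torch.nn.attention.SDPBackend.
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()
    CUDNN_ATTENTION = auto()


def resolve_sdpa_method(names):
    """Map YAML strings like "flash_attention" to SDPBackend members."""
    resolved = []
    for name in names:
        key = name.upper()
        if key not in SDPBackend.__members__:
            raise ValueError(
                f"Unknown SDPA backend: {name!r}; "
                f"choose from {sorted(SDPBackend.__members__)}"
            )
        resolved.append(SDPBackend[key])
    return resolved


print(resolve_sdpa_method(["flash_attention", "efficient_attention"]))
```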
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Replace the assert that required all attention modules to be TE
DotProductAttention with a continue, so dense (SDPA) attention
modules are gracefully skipped. This allows MoE models to use
context parallelism with non-TE attention backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…t/comet-logger

Made-with: Cursor

# Conflicts:
#	tests/unit_tests/distributed/test_cp_utils.py
@akoumpa
Contributor

akoumpa commented Mar 6, 2026

/ok to test 88f0443

