
feat: Add native Comet ML experiment tracking #1411

Open
LoganVegnaSHOP wants to merge 38 commits into NVIDIA-NeMo:main from LoganVegnaSHOP:feat/comet-logger

Conversation

@LoganVegnaSHOP

Summary

  • Add CometLogger class in nemo_automodel/components/loggers/comet_utils.py — mirrors the existing MLflowLogger pattern
  • Wire Comet logging into TrainFinetuneRecipeForNextTokenPrediction:
    • Training metrics (loss, TPS, grad_norm, lr, mem) via log_train_metrics()
    • Validation metrics (val_loss) via log_val_metrics()
    • MoE load balance metrics via _log_moe_metrics()
  • Enable by adding a comet: block to the training YAML config:
comet:
  project_name: "my-project"
  experiment_name: "my-run"      # optional, auto-generated from model name
  workspace: "my-workspace"      # optional, uses COMET_WORKSPACE env var
  api_key: null                  # optional, uses COMET_API_KEY env var
  tags: ["finetune", "llama"]    # optional
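The wiring described above can be sketched as follows. This is a minimal, hypothetical sketch, not the actual contents of comet_utils.py: the class and method names (CometLogger, log_train_metrics) and the config fields come from the PR summary, while the comet_ml calls shown (Experiment, set_name, add_tag, log_metrics) are standard comet_ml SDK APIs. The real implementation may differ.

```python
import os


class CometLogger:
    """Hypothetical sketch of a Comet logger mirroring the MLflowLogger pattern."""

    def __init__(self, project_name, experiment_name=None, workspace=None,
                 api_key=None, tags=None):
        # Lazy import so that omitting the `comet:` block never requires comet_ml.
        try:
            from comet_ml import Experiment
        except ImportError as e:
            raise ImportError(
                "comet_ml is required for the `comet:` config block; "
                "install it with `pip install comet-ml`."
            ) from e
        # Fall back to COMET_API_KEY / COMET_WORKSPACE env vars, as in the config above.
        self._exp = Experiment(
            api_key=api_key or os.environ.get("COMET_API_KEY"),
            project_name=project_name,
            workspace=workspace or os.environ.get("COMET_WORKSPACE"),
        )
        if experiment_name:
            self._exp.set_name(experiment_name)
        for tag in tags or []:
            self._exp.add_tag(tag)

    def log_train_metrics(self, metrics, step):
        # metrics is a flat dict, e.g. {"loss": ..., "tps": ..., "grad_norm": ..., "lr": ..., "mem": ...}
        self._exp.log_metrics(metrics, step=step)
```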

Motivation

Currently, Comet ML users must rely on Comet's wandb auto-patcher, which does not
reliably intercept wandb.log() calls because NeMo gates all wandb logging behind
if wandb.run is not None — requiring a valid wandb API key and active run even when
wandb is only used as a bridge to Comet. Native Comet support (matching the existing
MLflow integration pattern) eliminates this fragile dependency.

Test plan

  • Run a fine-tuning job with comet: config and verify metrics appear in Comet dashboard
  • Verify training metrics (loss, TPS, grad_norm, lr, mem) are logged at log_remote_every_steps frequency
  • Verify validation metrics (val_loss) are logged on each validation step
  • Verify MoE load balance metrics are logged (on MoE models)
  • Verify that omitting comet: config has no effect (backward compatible)
  • Verify that running with a comet: block but without comet_ml installed raises a clear ImportError

@copy-pr-bot

copy-pr-bot bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as draft February 27, 2026 23:05
@LoganVegnaSHOP LoganVegnaSHOP marked this pull request as ready for review February 27, 2026 23:38
@hemildesai
Contributor

/ok to test b24a082

@akoumpa
Contributor

akoumpa commented Mar 3, 2026

/ok to test 1f47ddc

hemildesai and others added 2 commits March 4, 2026 12:23
With seq_length=4 and truncation=True, the sequence is truncated to
just the start of the system message — no assistant tokens survive.
Skip the "must have supervised tokens" assertion when truncation is
enabled since it may legitimately remove all assistant content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
When a chat template is used, the template's own turn-ending token
(e.g. <|im_end|>) terminates sequences — not eos_token_id
(e.g. <|endoftext|>). These are different tokens for Qwen2.5/3.
Remove assertions that eos_token_id must appear in supervised labels
since it was only true when EOS was manually appended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
LoganVegnaSHOP and others added 9 commits March 5, 2026 11:35
Signed-off-by: root <root@pool0-00155.cm.cluster>
…elism

Strip the 4D attention_mask from the batch and register forward pre-hooks
on self_attn modules to set is_causal=True, so that SDPA handles causal
masking internally when using dense context parallelism without TE.

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
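The hook mechanism this commit describes can be illustrated with a toy module. This is a sketch, not the NeMo code: ToyAttn is a stand-in for a self_attn module, and the example only shows how a forward pre-hook registered with with_kwargs=True (available in PyTorch ≥ 2.0) can strip the mask and force is_causal=True before the forward runs.

```python
import torch
import torch.nn as nn


class ToyAttn(nn.Module):
    # Stand-in for a self_attn module; returns its inputs so the hook's effect is visible.
    def forward(self, x, attention_mask=None, is_causal=False):
        return attention_mask, is_causal


def force_causal(module, args, kwargs):
    # Drop the 4D attention_mask and let SDPA apply causal masking internally.
    kwargs["attention_mask"] = None
    kwargs["is_causal"] = True
    return args, kwargs


attn = ToyAttn()
attn.register_forward_pre_hook(force_causal, with_kwargs=True)

mask, causal = attn(torch.zeros(2, 4), attention_mask=torch.ones(2, 1, 4, 4))
print(mask, causal)  # None True
```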
Replace functools.partial(F.scaled_dot_product_attention, ...) with a
closure that resolves F.scaled_dot_product_attention at call time. This
ensures CP's runtime monkey-patch of the function is picked up by all
custom models instead of being bypassed by the early-bound reference.

Also make _attach_context_parallel_hooks public (renamed to
attach_context_parallel_hooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
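The early- vs late-binding distinction in this commit can be demonstrated in a few lines (illustrative only, not the NeMo code): functools.partial captures the function object at creation time, so a later monkey-patch of the module attribute is bypassed, while a closure that looks the attribute up at call time sees the patch.

```python
import functools
import types

# Stand-in for a module whose attention function CP monkey-patches at runtime.
mod = types.SimpleNamespace(attn=lambda q: "dense")

early_bound = functools.partial(mod.attn)  # binds the current function object

def late_bound(q):
    return mod.attn(q)  # attribute resolved on every call

mod.attn = lambda q: "cp-patched"  # runtime monkey-patch, as CP does

print(early_bound(None))  # dense      -- patch bypassed
print(late_bound(None))   # cp-patched -- patch picked up
```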
…ends

Extract SDPA backend selection into a resolve_sdpa_method() helper that
accepts string names from YAML config (e.g. ["flash_attention",
"efficient_attention"]) and converts them to SDPBackend enum members.
When no explicit config is provided, auto-selects based on CP and
activation checkpointing constraints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
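The string-to-enum mapping this commit describes might look like the sketch below. The helper name resolve_sdpa_method and the YAML strings come from the commit message; SDPBackend here is a stand-in enum (the real one is torch.nn.attention.SDPBackend), and the auto-selection fallback is omitted.

```python
from enum import Enum, auto


class SDPBackend(Enum):
    # Stand-in for torch.nn.attention.SDPBackend.
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()
    CUDNN_ATTENTION = auto()


def resolve_sdpa_method(names):
    """Map YAML strings like "flash_attention" to SDPBackend members."""
    resolved = []
    for name in names:
        key = name.upper()
        if key not in SDPBackend.__members__:
            raise ValueError(
                f"Unknown SDPA backend: {name!r}; "
                f"choose from {sorted(SDPBackend.__members__)}"
            )
        resolved.append(SDPBackend[key])
    return resolved


print(resolve_sdpa_method(["flash_attention", "efficient_attention"]))
```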
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Replace the assert that required all attention modules to be TE
DotProductAttention with a continue, so dense (SDPA) attention
modules are gracefully skipped. This allows MoE models to use
context parallelism with non-TE attention backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…t/comet-logger

Made-with: Cursor

# Conflicts:
#	tests/unit_tests/distributed/test_cp_utils.py
@akoumpa
Contributor

akoumpa commented Mar 6, 2026

/ok to test 88f0443

