feat: Add native Comet ML experiment tracking #1411
Open
LoganVegnaSHOP wants to merge 38 commits into NVIDIA-NeMo:main from
Conversation
hemildesai reviewed Feb 27, 2026
Contributor
/ok to test b24a082
Force-pushed b24a082 to b69710c
Contributor
/ok to test 1f47ddc
With seq_length=4 and truncation=True, the sequence is truncated to just the start of the system message — no assistant tokens survive. Skip the "must have supervised tokens" assertion when truncation is enabled since it may legitimately remove all assistant content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
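The guard described in this commit can be sketched as follows. This is a stdlib-only illustration, not the actual NeMo code: the helper name, the `IGNORE_INDEX` constant, and the function signature are all assumptions.

```python
# Hypothetical sketch of the fix above: when truncation is enabled, a
# heavily-truncated sample may contain no supervised (assistant) tokens,
# so the "must have supervised tokens" assertion has to be skipped
# rather than fail.
IGNORE_INDEX = -100  # conventional "not supervised" label value

def check_supervised_tokens(labels, truncation):
    """Return True if the sample contains any supervised tokens."""
    has_supervised = any(tok != IGNORE_INDEX for tok in labels)
    if not has_supervised and not truncation:
        # Without truncation, a sample with no assistant tokens is a bug.
        raise AssertionError("sample has no supervised tokens")
    return has_supervised

# seq_length=4 keeps only the start of the system message: all labels masked.
truncated_labels = [IGNORE_INDEX] * 4
print(check_supervised_tokens(truncated_labels, truncation=True))  # False, no error
```

With `truncation=False`, the same all-masked sample would raise, preserving the original safety check for the non-truncating path.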
When a chat template is used, the template's own turn-ending token (e.g. <|im_end|>) terminates sequences — not eos_token_id (e.g. <|endoftext|>). These are different tokens for Qwen2.5/3. Remove assertions that eos_token_id must appear in supervised labels since it was only true when EOS was manually appended. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
This was referenced Mar 4, 2026
Signed-off-by: root <root@pool0-00155.cm.cluster>
…to feat/comet-logger
…elism Strip the 4D attention_mask from the batch and register forward pre-hooks on self_attn modules to set is_causal=True, so that SDPA handles causal masking internally when using dense context parallelism without TE. Signed-off-by: hemildesai <hemild@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
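The forward pre-hook pattern from this commit can be illustrated with a stdlib-only sketch. The real code calls torch's `Module.register_forward_pre_hook` on `self_attn` modules; the class and hook below are simplified stand-ins, not NeMo code.

```python
# Stdlib-only illustration of the forward pre-hook pattern described above.
class MiniModule:
    """Stand-in for a torch.nn.Module with pre-hook support."""
    def __init__(self):
        self._pre_hooks = []
        self.is_causal = False

    def register_forward_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def forward(self, x):
        for hook in self._pre_hooks:
            hook(self, x)  # hooks run before the real forward body
        # SDPA-style attention would run here; is_causal now controls masking.
        return x, self.is_causal

def set_causal_hook(module, inputs):
    # Mirrors the commit: force causal masking inside the attention module
    # so the 4D attention_mask can be stripped from the batch.
    module.is_causal = True

attn = MiniModule()
attn.register_forward_pre_hook(set_causal_hook)
out, causal = attn.forward([1, 2, 3])
print(causal)  # True
```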
Replace functools.partial(F.scaled_dot_product_attention, ...) with a closure that resolves F.scaled_dot_product_attention at call time. This ensures CP's runtime monkey-patch of the function is picked up by all custom models instead of being bypassed by the early-bound reference. Also make _attach_context_parallel_hooks public (renamed to attach_context_parallel_hooks). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
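The early- vs late-binding distinction this commit fixes can be shown in a few lines of plain Python. `sdpa` below stands in for `F.scaled_dot_product_attention`; the monkey-patch stands in for the one context parallelism applies at runtime.

```python
import functools

def sdpa(q):
    # Stand-in for F.scaled_dot_product_attention.
    return "original"

# partial captures the function OBJECT at definition time...
early_bound = functools.partial(sdpa)

# ...while a closure resolves the global name at CALL time.
def late_bound(q):
    return sdpa(q)

def patched(q):
    return "patched"

sdpa = patched  # stand-in for CP's runtime monkey-patch

print(early_bound(None))  # original -- the patch is bypassed
print(late_bound(None))   # patched  -- the patch is picked up
```

This is why the closure, unlike the `functools.partial` it replaced, sees the patched function in all custom models.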
…ends Extract SDPA backend selection into a resolve_sdpa_method() helper that accepts string names from YAML config (e.g. ["flash_attention", "efficient_attention"]) and converts them to SDPBackend enum members. When no explicit config is provided, auto-selects based on CP and activation checkpointing constraints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
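The string-to-enum resolution described above can be sketched with a stdlib `Enum`. The real helper targets `torch.nn.attention.SDPBackend`; the local enum, its members, and the no-config fallback shown here are illustrative assumptions, not the actual selection logic.

```python
from enum import Enum, auto

class SDPBackend(Enum):
    # Stdlib stand-in for torch.nn.attention.SDPBackend.
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()

def resolve_sdpa_method(names=None):
    """Map YAML string names (e.g. ["flash_attention"]) to backend enums."""
    if names is None:
        # No explicit config: fall back to a safe default. (The real helper
        # auto-selects based on CP and activation-checkpointing constraints.)
        return [SDPBackend.MATH]
    return [SDPBackend[name.upper()] for name in names]

print(resolve_sdpa_method(["flash_attention", "efficient_attention"]))
```

An unknown name raises `KeyError` from the enum lookup, which surfaces config typos early.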
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
Replace the assert that required all attention modules to be TE DotProductAttention with a continue, so dense (SDPA) attention modules are gracefully skipped. This allows MoE models to use context parallelism with non-TE attention backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
…t/comet-logger Made-with: Cursor # Conflicts: # tests/unit_tests/distributed/test_cp_utils.py
Contributor
/ok to test 88f0443
Summary
- Adds a CometLogger class in nemo_automodel/components/loggers/comet_utils.py that mirrors the existing MLflowLogger pattern
- Wires it into TrainFinetuneRecipeForNextTokenPrediction via log_train_metrics(), log_val_metrics(), and _log_moe_metrics()
- Adds a comet: block to the training YAML config
Motivation
Currently, Comet ML users must rely on Comet's wandb auto-patcher, which does not reliably intercept wandb.log() calls because NeMo gates all wandb logging behind if wandb.run is not None — requiring a valid wandb API key and an active run even when wandb is only used as a bridge to Comet. Native Comet support (matching the existing MLflow integration pattern) eliminates this fragile dependency.
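For context, the comet: block the PR adds to the training YAML might look roughly like the sketch below. Only the comet: key and log_remote_every_steps are mentioned in this PR; every other key name and value here is an assumption, not confirmed by the diff.

```yaml
# Illustrative only: key names other than "comet:" and
# "log_remote_every_steps" are assumptions.
comet:
  project: my-project
  workspace: my-workspace
  experiment_name: finetune-run
  log_remote_every_steps: 10  # logging frequency, per the test plan
```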
Test plan
- Enable the comet: config and verify metrics appear in the Comet dashboard
- Verify the log_remote_every_steps frequency is respected
- Verify that omitting the comet: config has no effect (backward compatible)
- Verify that running with comet_ml not installed raises a clear ImportError