Skip to content

Ko3n1g/ci/disable nvrx#2694

Closed
ko3n1g wants to merge 18 commits intomainfrom
ko3n1g/ci/disable-nvrx
Closed

Ko3n1g/ci/disable nvrx#2694
ko3n1g wants to merge 18 commits intomainfrom
ko3n1g/ci/disable-nvrx

Conversation

@ko3n1g
Copy link
Contributor

@ko3n1g ko3n1g commented Mar 7, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Refactor

    • Streamlined executor configuration structure and formatting for improved code clarity
  • Chores

    • Optimized network configuration settings to enhance performance in distributed computing workloads
    • Enhanced DNS configuration capabilities for improved domain resolution in cloud deployments
    • Updated default configuration values in executor specifications
    • Removed experimental fault tolerance functionality from certain evaluation scenarios

ko3n1g added 16 commits March 7, 2026 10:33
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g requested review from a team, erhoo82 and malay-nagda as code owners March 7, 2026 23:42
ko3n1g added 2 commits March 7, 2026 23:42
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 7, 2026

📝 Walkthrough

Walkthrough

Changes update NCCL network environment variables and add DNS configuration in Kuberay executor spec. Function signature modified to use None as default for custom_env_vars. FaultTolerancePlugin configuration is commented out in the performance setup script.

Changes

Cohort / File(s) Summary
Kuberay Executor Configuration
examples/evaluation/utils/executors.py
Updates NCCL network-related environment variables (tcpxo, socket interfaces, fastrak device); adds DNS configuration with ndots and single-request-reopen settings; refactors volume_mounts formatting; changes custom_env_vars parameter default from {} to None.
Performance Setup Script
scripts/performance/setup_experiment.py
Comments out FaultTolerancePlugin configuration block that sets fault tolerance parameters when using recipes and dgxc cluster.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • chore: Add evaluation pipeline #1876: Directly modified examples/evaluation/utils/executors.py, introducing the kuberay_executor function and related environment/DNS/NCCL settings that are being adjusted in this PR.

Suggested reviewers

  • erhoo82
  • malay-nagda
  • chtruong814
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR makes significant infrastructure changes to distributed training networking (NCCL protocol, Kubernetes DNS, fault tolerance) without documented test results or performance validation. Add test results demonstrating no convergence/performance regression from network configuration changes and FaultTolerancePlugin removal before merging.
Title check ❓ Inconclusive The title 'Ko3n1g/ci/disable nvrx' is vague and uses branch-naming conventions rather than a clear changeset description, making it unclear what the actual changes accomplish. Revise the title to clearly describe the main change, such as 'Disable nvrx in CI configuration' or 'Replace NCCL network settings and disable fault tolerance plugin in CI setup'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ko3n1g/ci/disable-nvrx

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/evaluation/utils/executors.py (1)

79-79: Use the repo's Python 3.10 nullable type syntax here.

Dict[str, str] = None doesn't match the default value and doesn't follow the repo typing convention. Please change this to dict[str, str] | None = None.

As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/utils/executors.py` at line 79, Update the type
annotation for the parameter custom_env_vars in the executor function/signature
to use Python 3.10 nullable syntax: replace the typing.Dict usage and None
default (currently `Dict[str, str] = None`) with `dict[str, str] | None = None`;
ensure no unnecessary typing.Dict import remains and keep the default value as
None.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/setup_experiment.py`:
- Around line 389-399: The commented-out FaultTolerancePlugin block (referencing
use_recipes, dgxc_cluster, and plugins) should be removed or re-enabled behind a
real feature flag; delete the dead block and also remove the unused
FaultTolerancePlugin import from the module imports (the import that references
FaultTolerancePlugin) so Ruff/CI no longer complains, or alternatively restore
the block behind a proper conditional (e.g., a configurable
enable_fault_tolerance flag) keeping the plugin instantiation inside the active
branch.

---

Nitpick comments:
In `@examples/evaluation/utils/executors.py`:
- Line 79: Update the type annotation for the parameter custom_env_vars in the
executor function/signature to use Python 3.10 nullable syntax: replace the
typing.Dict usage and None default (currently `Dict[str, str] = None`) with
`dict[str, str] | None = None`; ensure no unnecessary typing.Dict import remains
and keep the default value as None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 053363e7-0a86-4a50-9904-ea94ef05e550

📥 Commits

Reviewing files that changed from the base of the PR and between a2e6b62 and b956789.

📒 Files selected for processing (2)
  • examples/evaluation/utils/executors.py
  • scripts/performance/setup_experiment.py

Comment on lines +389 to +399
# if use_recipes and dgxc_cluster is not None:
# plugins.append(
# FaultTolerancePlugin(
# enable_ft_package=True,
# calc_ft_timeouts=True,
# num_in_job_restarts=10,
# num_job_retries_on_failure=10,
# initial_rank_heartbeat_timeout=1800,
# rank_heartbeat_timeout=300,
# )
# )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove the commented FT block or restore it behind a real flag.

As written, this disables FaultTolerancePlugin but leaves the import unused, which is already failing Ruff in CI. If the intent is to turn FT off for this path, delete the dead block and drop the import from Lines 49-53 instead of parking it here.

✂️ Suggested cleanup
-    # if use_recipes and dgxc_cluster is not None:
-    #     plugins.append(
-    #         FaultTolerancePlugin(
-    #             enable_ft_package=True,
-    #             calc_ft_timeouts=True,
-    #             num_in_job_restarts=10,
-    #             num_job_retries_on_failure=10,
-    #             initial_rank_heartbeat_timeout=1800,
-    #             rank_heartbeat_timeout=300,
-    #         )
-    #     )

Also remove FaultTolerancePlugin from the imports at Lines 49-53.

As per coding guidelines, "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# if use_recipes and dgxc_cluster is not None:
# plugins.append(
# FaultTolerancePlugin(
# enable_ft_package=True,
# calc_ft_timeouts=True,
# num_in_job_restarts=10,
# num_job_retries_on_failure=10,
# initial_rank_heartbeat_timeout=1800,
# rank_heartbeat_timeout=300,
# )
# )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/setup_experiment.py` around lines 389 - 399, The
commented-out FaultTolerancePlugin block (referencing use_recipes, dgxc_cluster,
and plugins) should be removed or re-enabled behind a real feature flag; delete
the dead block and also remove the unused FaultTolerancePlugin import from the
module imports (the import that references FaultTolerancePlugin) so Ruff/CI no
longer complains, or alternatively restore the block behind a proper conditional
(e.g., a configurable enable_fault_tolerance flag) keeping the plugin
instantiation inside the active branch.

@ko3n1g ko3n1g closed this Mar 10, 2026
@yaoyu-33 yaoyu-33 added area:misc Cross-cutting utilities, logging, helpers, and other changes needs-review PR is ready for code review and waiting on a reviewer and removed area:misc Cross-cutting utilities, logging, helpers, and other changes needs-review PR is ready for code review and waiting on a reviewer labels Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:misc Cross-cutting utilities, logging, helpers, and other changes needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants