Signed-off-by: oliver könig <okoenig@nvidia.com>
📝 Walkthrough
Changes update NCCL network environment variables and add DNS configuration in the Kuberay executor spec. Function signature modified to use
Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/evaluation/utils/executors.py (1)
79-79: Use the repo's Python 3.10 nullable type syntax here.
`Dict[str, str] = None` doesn't match the default value and doesn't follow the repo typing convention. Please change this to `dict[str, str] | None = None`. As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluation/utils/executors.py` at line 79, Update the type annotation for the parameter custom_env_vars in the executor function/signature to use Python 3.10 nullable syntax: replace the typing.Dict usage and None default (currently `Dict[str, str] = None`) with `dict[str, str] | None = None`; ensure no unnecessary typing.Dict import remains and keep the default value as None.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/setup_experiment.py`:
- Around line 389-399: The commented-out FaultTolerancePlugin block (referencing
use_recipes, dgxc_cluster, and plugins) should be removed or re-enabled behind a
real feature flag; delete the dead block and also remove the unused
FaultTolerancePlugin import from the module imports (the import that references
FaultTolerancePlugin) so Ruff/CI no longer complains, or alternatively restore
the block behind a proper conditional (e.g., a configurable
enable_fault_tolerance flag) keeping the plugin instantiation inside the active
branch.
---
Nitpick comments:
In `@examples/evaluation/utils/executors.py`:
- Line 79: Update the type annotation for the parameter custom_env_vars in the
executor function/signature to use Python 3.10 nullable syntax: replace the
typing.Dict usage and None default (currently `Dict[str, str] = None`) with
`dict[str, str] | None = None`; ensure no unnecessary typing.Dict import remains
and keep the default value as None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 053363e7-0a86-4a50-9904-ea94ef05e550
📒 Files selected for processing (2)
examples/evaluation/utils/executors.py
scripts/performance/setup_experiment.py
    # if use_recipes and dgxc_cluster is not None:
    #     plugins.append(
    #         FaultTolerancePlugin(
    #             enable_ft_package=True,
    #             calc_ft_timeouts=True,
    #             num_in_job_restarts=10,
    #             num_job_retries_on_failure=10,
    #             initial_rank_heartbeat_timeout=1800,
    #             rank_heartbeat_timeout=300,
    #         )
    #     )
Remove the commented FT block or restore it behind a real flag.
As written, this disables FaultTolerancePlugin but leaves the import unused, which is already failing Ruff in CI. If the intent is to turn FT off for this path, delete the dead block and drop the import from Lines 49-53 instead of parking it here.
✂️ Suggested cleanup
-    # if use_recipes and dgxc_cluster is not None:
-    #     plugins.append(
-    #         FaultTolerancePlugin(
-    #             enable_ft_package=True,
-    #             calc_ft_timeouts=True,
-    #             num_in_job_restarts=10,
-    #             num_job_retries_on_failure=10,
-    #             initial_rank_heartbeat_timeout=1800,
-    #             rank_heartbeat_timeout=300,
-    #         )
-    #     )
Also remove `FaultTolerancePlugin` from the imports at Lines 49-53.
As per coding guidelines, "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging."
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Refactor
Chores