
Conversation

@hhzhang16 (Contributor) commented Oct 28, 2025

Overview:

Remove ClusterRoles from the operator and the profiling docs, and add a flag that lets cluster-wide operators perform GPU discovery.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added optional automatic GPU discovery for cluster-wide deployments.
    • Introduced configurable hardware settings with updated default values.
  • Documentation

    • Updated profiling guides with hardware configuration and GPU discovery details.
    • Renamed configuration keys for consistency.
  • Chores

    • Simplified operator RBAC definitions and deployment templates.

@hhzhang16 requested review from a team as code owners on October 28, 2025 15:17
@copy-pr-bot bot commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Oct 28, 2025

Walkthrough

This PR introduces GPU discovery feature management by adding an --enable-gpu-discovery CLI flag (surfaced as EnableGpuDiscovery in the DGDR spec), updating default GPU resource values, revising the namespace-restriction logic in the Helm templates, simplifying RBAC definitions, and propagating the discovery setting through the Kubernetes controller and documentation.

Changes

  • GPU Discovery CLI Configuration (benchmarks/profiler/utils/profiler_argparse.py): Added an --enable-gpu-discovery CLI flag with store_true semantics. Updated default GPU settings: min-num-gpus-per-engine (0→1), max-num-gpus-per-engine (0→8), num-gpus-per-node (0→8). Made auto_generate_search_space() conditional on the enable_gpu_discovery flag. Added validation requiring non-zero GPU values when discovery is disabled. (A sketch of the flag follows this list.)
  • Kubernetes Deployment and RBAC Configuration (deploy/cloud/helm/platform/components/operator/templates/deployment.yaml, deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml): Inverted the namespaceRestriction logic in the deployment template so profiling flags are only added when the operator is NOT namespace-restricted. Removed all namespace-scoped RBAC definitions (ServiceAccount, Role, RoleBinding for dgdr-profiling-job) and the node-specific cluster-level RBAC (ClusterRole and ClusterRoleBinding for dgdr-profiling-nodes).
  • API Type and Controller Logic (deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go, deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go): Added an EnableGpuDiscovery field to DynamoGraphDeploymentRequestSpec. Added controller validation preventing GPU discovery for namespace-restricted operators. Updated profiling-job creation to propagate the enable-gpu-discovery flag to the profiler container arguments.
  • Documentation Updates (docs/benchmarks/sla_driven_profiling.md, docs/planner/sla_planner_quickstart.md): Renamed AI Configurator config keys (aic.system → aic_system, etc.). Changed the section header from "GPU Discovery" to "Hardware Configuration". Added new Hardware Configuration and Automatic GPU Discovery sections documenting constraints for cluster-scoped vs. namespace-restricted operators.
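
As noted in the first entry above, a minimal sketch of the new flag and updated defaults; only the option names and default values come from the PR, the surrounding scaffolding is illustrative:

```python
import argparse

# Illustrative sketch only: the real parser in
# benchmarks/profiler/utils/profiler_argparse.py defines many more options and
# pulls its defaults from the profiler configuration.
parser = argparse.ArgumentParser(description="SLA profiler (sketch)")
parser.add_argument(
    "--enable-gpu-discovery",
    action="store_true",
    default=False,
    help="Query the Kubernetes cluster for GPU information; overrides manual "
    "hardware settings and requires cluster-wide permissions.",
)
# Defaults raised from 0 to working values in this PR.
parser.add_argument("--min-num-gpus-per-engine", type=int, default=1)
parser.add_argument("--max-num-gpus-per-engine", type=int, default=8)
parser.add_argument("--num-gpus-per-node", type=int, default=8)

args = parser.parse_args([])  # empty argv just to demonstrate the defaults
assert (args.min_num_gpus_per_engine, args.max_num_gpus_per_engine) == (1, 8)
```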

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • RBAC removal justification: Verify the complete removal of namespace-scoped profiling-job RBAC and nodes-specific ClusterRole is intentional and won't break existing deployments
  • Namespace restriction validation: Review the validation logic preventing GPU discovery in namespace-restricted environments and ensure error messaging guides users correctly
  • GPU discovery flag propagation: Confirm the enable-gpu-discovery flag is correctly propagated through CLI → controller → profiler pipeline
  • Default value changes: Validate that changing GPU defaults from 0 to 1/8 is safe for existing configurations

Poem

🐰 A bunny hops through GPUs bright,
Discovery flags set left and right,
RBAC roles now slimmed down tight,
Namespaces dance in new delight,
Hardware config shines just right! ✨

Pre-merge checks

❌ Failed checks (2 warnings)
  • Title Check ⚠️ Warning: The pull request title "feat: remove cluster wide logic from namespace restricted operator" accurately describes one major component of the changeset: the removal of namespace-scoped RBAC components (ClusterRoles, RoleBindings, ServiceAccounts) and the inversion of conditional logic in deployment configurations. However, the PR objectives and description explicitly identify two equally major components: removing ClusterRoles AND adding a GPU discovery flag for cluster-wide operators. The title captures only the removal aspect and completely omits the addition of the new EnableGpuDiscovery flag across multiple files (profiler argparse, CRD types, controller logic, and documentation), which is a significant feature addition. A developer scanning commit history would have an incomplete understanding of the PR's full scope based on the title alone.
  • Description Check ⚠️ Warning: The pull request description is significantly incomplete and largely relies on template placeholders. While the Overview section is adequately filled with a clear statement of purpose ("Remove ClusterRoles from operator and profiling docs and add flag for cluster-wide operators to do gpu discovery"), the remaining three required sections are missing substantive content: the Details section contains only the template comment, the "Where should the reviewer start?" section is empty except for the placeholder, and the Related Issues section shows a placeholder issue number (#xxx) rather than an actual GitHub issue reference. These are critical sections for helping reviewers understand the changes and their impact. Complete the missing sections: expand the Details section to describe the specific changes (RBAC removals in profiling-job-rbac.yaml and deployment.yaml, GPU discovery flag additions, validation logic), populate "Where should the reviewer start?" with key files to review (at minimum: profiling-job-rbac.yaml, deployment.yaml, dynamographdeploymentrequest_types.go, and dynamographdeploymentrequest_controller.go), and replace the placeholder with the actual GitHub issue number being closed by this PR. These additions will provide essential context for reviewers.
✅ Passed checks (1 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
benchmarks/profiler/utils/profiler_argparse.py (1)

314-328: Review validation logic interaction with new defaults.

The validation logic checks for zero values in hardware parameters when enable_gpu_discovery is false. However, with the new defaults (min: 1, max: 8, node: 8), these validation errors would only trigger if a user explicitly sets these values to 0 in their configuration.

Consider whether the validation messages should be updated to reflect that users are overriding sensible defaults rather than failing to provide required values. The current error messages imply these are required parameters, but they now have working defaults.

Example updated message:

parser.error(
    "Hardware parameters are set to 0. When --enable-gpu-discovery is false, you must provide non-zero values for "
    "--min-num-gpus-per-engine and --max-num-gpus-per-engine, or use the default values."
)
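
For concreteness, a hedged sketch of how that guard could look with the revised wording; the helper name validate_hardware_args and the parser/args objects are illustrative, while the attribute names mirror the CLI flags:

```python
import argparse


def validate_hardware_args(
    parser: argparse.ArgumentParser, args: argparse.Namespace
) -> None:
    """Fail fast when GPU discovery is off but a hardware value is zeroed out."""
    if not args.enable_gpu_discovery and 0 in (
        args.min_num_gpus_per_engine,
        args.max_num_gpus_per_engine,
        args.num_gpus_per_node,
    ):
        parser.error(
            "Hardware parameters are set to 0. When --enable-gpu-discovery is false, "
            "you must provide non-zero values for --min-num-gpus-per-engine and "
            "--max-num-gpus-per-engine, or use the default values."
        )
```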
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79122c and a4ebf6e.

📒 Files selected for processing (7)
  • benchmarks/profiler/utils/profiler_argparse.py (3 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (0 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2 hunks)
  • docs/benchmarks/sla_driven_profiling.md (1 hunks)
  • docs/planner/sla_planner_quickstart.md (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (55-71)
benchmarks/profiler/utils/profiler_argparse.py (1)
benchmarks/profiler/utils/search_space_autogen.py (1)
  • auto_generate_search_space (29-88)
🪛 GitHub Actions: Generate Documentation
docs/benchmarks/sla_driven_profiling.md

[warning] 86-86: image file not readable: docs/images/h100_prefill_performance.png


[warning] 88-88: image file not readable: docs/images/h100_decode_performance.png


[warning] 91-91: image file not readable: docs/images/pd_interpolation.png

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
docs/planner/sla_planner_quickstart.md (2)

232-235: LGTM! Configuration keys updated to use underscores.

The configuration keys have been changed from dotted notation (aic.system, aic.model_name, aic.backend_version) to underscored notation (aic_system, aic_model_name, aic_backend_version). This provides a cleaner YAML structure and aligns with the profiler argument parser changes.

Note: This is a breaking change. Users with existing DGDRs using the old dotted notation will need to update their configurations.
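
For illustration, the rename changes only the key spelling; the values below are placeholders and the surrounding profiling-config structure is omitted:

```yaml
# Before (dotted keys, no longer accepted):
#   aic.system: <system>
#   aic.model_name: <model>
#   aic.backend_version: <version>
# After (underscored keys):
aic_system: <system>
aic_model_name: <model>
aic_backend_version: <version>
```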


240-242: LGTM! Section updated to reflect new hardware configuration paradigm.

The section header and content have been appropriately updated from "GPU Discovery" to "Hardware Configuration" with a reference to the detailed documentation. This aligns with the PR's objective to position GPU discovery as an optional, cluster-scoped feature rather than a general requirement.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

127-130: LGTM! Profiling flags now correctly gated behind cluster-scoped operator mode.

The logic has been appropriately inverted: profiling-related flags (--dgdr-profiling-cluster-role-name and --planner-cluster-role-name) are now only added when namespaceRestriction.enabled is false. This ensures that cluster-wide profiling features requiring node access are only enabled for cluster-scoped operators, directly supporting the PR's objective to remove cluster-wide logic from namespace-restricted operators.
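
A minimal sketch of the inverted conditional; the flag names and the namespaceRestriction.enabled values path come from this PR, while the role-name values paths are placeholders:

```yaml
# deployment.yaml (operator container args, excerpt; illustrative only)
{{- if not .Values.namespaceRestriction.enabled }}
- --dgdr-profiling-cluster-role-name={{ .Values.dgdrProfilingClusterRoleName }}
- --planner-cluster-role-name={{ .Values.plannerClusterRoleName }}
{{- end }}
```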

deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)

117-124: LGTM! New EnableGpuDiscovery field is well-defined.

The new EnableGpuDiscovery field is properly implemented with:

  • Appropriate type (bool) and JSON tag (enableGpuDiscovery,omitempty)
  • Safe default value (false) preventing breaking changes
  • Clear kubebuilder validation markers (default and optional)
  • Comprehensive documentation explaining behavior, overrides, and cluster-wide requirement

The field integrates cleanly with the existing spec structure and properly communicates constraints to users.
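
Based on the points above, the field plausibly reads roughly like the following excerpt-style sketch (doc comment paraphrased, sibling fields omitted):

```go
package v1alpha1

// Sketch of the new field only; the real DynamoGraphDeploymentRequestSpec
// contains many other fields that are omitted here.
type DynamoGraphDeploymentRequestSpec struct {
	// EnableGpuDiscovery turns on automatic GPU discovery during profiling.
	// Discovered values override any manually supplied hardware configuration.
	// Requires a cluster-scoped operator, since node access is needed.
	// +kubebuilder:default=false
	// +optional
	EnableGpuDiscovery bool `json:"enableGpuDiscovery,omitempty"`
}
```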

deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2)

706-709: LGTM! Validation correctly enforces cluster-scoped requirement.

The validation properly prevents enabling EnableGpuDiscovery for namespace-restricted operators by checking r.Config.RestrictedNamespace. The error message is clear, actionable, and guides users to provide manual hardware configuration instead.
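
A minimal sketch of that check, assuming the restricted namespace is exposed as a string on the controller config; the helper and error wording are illustrative:

```go
package controller

import "fmt"

// validateGpuDiscovery rejects GPU discovery when the operator is restricted
// to a single namespace; sketch only, the real check lives in the DGDR controller.
func validateGpuDiscovery(enableGpuDiscovery bool, restrictedNamespace string) error {
	if enableGpuDiscovery && restrictedNamespace != "" {
		return fmt.Errorf(
			"spec.enableGpuDiscovery requires a cluster-scoped operator, but this operator "+
				"is restricted to namespace %q; disable GPU discovery and provide hardware "+
				"configuration manually", restrictedNamespace)
	}
	return nil
}
```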


928-932: LGTM! Profiler flag correctly propagated.

The controller correctly propagates the EnableGpuDiscovery flag to the profiler container arguments when enabled. The comment helpfully explains the cluster-wide requirement, and the implementation is straightforward.
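
A sketch of the propagation step; the helper name and base argument list are illustrative:

```go
package controller

// buildProfilerArgs appends the discovery flag to the profiler container's
// arguments when the DGDR spec enables it; sketch only.
func buildProfilerArgs(baseArgs []string, enableGpuDiscovery bool) []string {
	args := append([]string{}, baseArgs...)
	if enableGpuDiscovery {
		// Only reached for cluster-scoped operators; namespace-restricted
		// operators are rejected earlier by validation.
		args = append(args, "--enable-gpu-discovery")
	}
	return args
}
```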

docs/benchmarks/sla_driven_profiling.md (4)

45-45: LGTM! GPU discovery scope correctly clarified.

The documentation now correctly notes that GPU resource discovery is only applicable to cluster-scoped operators, aligning with the validation logic in the controller.


52-78: LGTM! Excellent documentation of hardware configuration.

The new Hardware Configuration section clearly explains that hardware parameters have sensible defaults and are optional. The example YAML is well-structured and demonstrates both hardware overrides and the AIC system configuration.


83-83: LGTM! Profiling method step appropriately reframed.

The first step in the profiling method has been correctly updated from "GPU Discovery" to "Hardware Setup," accurately reflecting that hardware configuration can use defaults, user-specified values, or optional automatic GPU discovery (for cluster-scoped operators).


86-91: Note: Pre-existing pipeline warnings for missing images.

The pipeline shows warnings about missing image files (h100_prefill_performance.png, h100_decode_performance.png, pd_interpolation.png). These are pre-existing issues not introduced by this PR, as these lines are not marked as changed. The images should be added separately to resolve the pipeline warnings.

benchmarks/profiler/utils/profiler_argparse.py (2)

251-256: LGTM! New enable-gpu-discovery flag is well-implemented.

The new --enable-gpu-discovery flag is properly defined with:

  • Correct action type (store_true) for boolean flag
  • Safe default value (False from config)
  • Clear help text explaining override behavior and cluster-wide permission requirement

315-316: LGTM! GPU discovery integration is correct.

The conditional call to auto_generate_search_space(args) correctly implements the GPU discovery toggle. When enabled, it queries the Kubernetes cluster for GPU information and overrides any manually specified hardware configuration, as documented in the API.
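
In effect the toggle reduces to something like the sketch below; the wrapper function and import path are assumptions, while auto_generate_search_space is the function referenced above:

```python
# Import path assumed from the file listing in the code-graph section above.
from benchmarks.profiler.utils.search_space_autogen import auto_generate_search_space


def maybe_discover_gpus(args) -> None:
    """Run GPU discovery only when the flag is set (cluster-scoped operators)."""
    if args.enable_gpu_discovery:
        # Query the Kubernetes cluster for GPU counts/types and overwrite the
        # manually supplied hardware settings before building the search space.
        auto_generate_search_space(args)
```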

@tedzhouhk (Contributor) left a comment

LGTM other than CodeRabbit's and Julien's comments

dep-554-remove-cluster-wide-logic-from-namespace-restricted-operator

Signed-off-by: Hannah Zhang <[email protected]>
@hhzhang16 (Contributor, Author) commented:
/ok to test cad1742

@hhzhang16 enabled auto-merge (squash) on October 30, 2025 00:28
@hhzhang16 (Contributor, Author) commented:
/ok to test bd26931
