
Conversation

@hhzhang16 (Contributor) commented Oct 28, 2025

Overview:

Remove ClusterRoles from the operator and the profiling docs, and add a flag that lets cluster-wide operators perform GPU discovery.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added optional automatic GPU discovery for cluster-wide deployments.
    • Introduced configurable hardware settings with updated default values.
  • Documentation

    • Updated profiling guides with hardware configuration and GPU discovery details.
    • Renamed configuration keys for consistency.
  • Chores

    • Simplified operator RBAC definitions and deployment templates.

@hhzhang16 requested review from a team as code owners on October 28, 2025 15:17
@copy-pr-bot bot commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Oct 28, 2025

Walkthrough

This PR introduces GPU discovery feature management by adding an --enable-gpu-discovery CLI flag (surfaced as EnableGpuDiscovery in the DGDR spec), updating default GPU resource values, revising the namespace-restriction logic in the Helm templates, simplifying RBAC definitions, and propagating the discovery setting through the Kubernetes controller and documentation.

Changes

  • GPU Discovery CLI Configuration (benchmarks/profiler/utils/profiler_argparse.py): Added an --enable-gpu-discovery CLI flag with store_true semantics. Updated default GPU settings: min-num-gpus-per-engine (0→1), max-num-gpus-per-engine (0→8), num-gpus-per-node (0→8). Made auto_generate_search_space() conditional on the enable_gpu_discovery flag. Added validation requiring non-zero GPU values when discovery is disabled. (A sketch of the flag follows this list.)
  • Kubernetes Deployment and RBAC Configuration (deploy/cloud/helm/platform/components/operator/templates/deployment.yaml, deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml): Inverted the namespaceRestriction logic in the deployment template so profiling flags are only added when the operator is NOT namespace-restricted. Removed all namespace-scoped RBAC definitions (ServiceAccount, Role, RoleBinding for dgdr-profiling-job) and the node-specific cluster-level RBAC (ClusterRole and ClusterRoleBinding for dgdr-profiling-nodes).
  • API Type and Controller Logic (deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go, deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go): Added an EnableGpuDiscovery field to DynamoGraphDeploymentRequestSpec. Added controller validation preventing GPU discovery for namespace-restricted operators. Updated profiling-job creation to propagate the enable-gpu-discovery flag to the profiler container arguments.
  • Documentation Updates (docs/benchmarks/sla_driven_profiling.md, docs/planner/sla_planner_quickstart.md): Renamed AI Configurator config keys (aic.system → aic_system, etc.). Changed the section header from "GPU Discovery" to "Hardware Configuration". Added new Hardware Configuration and Automatic GPU Discovery sections documenting constraints for cluster-scoped vs. namespace-restricted operators.
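
As noted in the first entry above, a minimal sketch of the new flag and updated defaults; only the option names and default values come from the PR, the surrounding scaffolding is illustrative:

```python
import argparse

# Illustrative sketch only: the real parser in
# benchmarks/profiler/utils/profiler_argparse.py defines many more options and
# pulls its defaults from the profiler configuration.
parser = argparse.ArgumentParser(description="SLA profiler (sketch)")
parser.add_argument(
    "--enable-gpu-discovery",
    action="store_true",
    default=False,
    help="Query the Kubernetes cluster for GPU information; overrides manual "
    "hardware settings and requires cluster-wide permissions.",
)
# Defaults raised from 0 to working values in this PR.
parser.add_argument("--min-num-gpus-per-engine", type=int, default=1)
parser.add_argument("--max-num-gpus-per-engine", type=int, default=8)
parser.add_argument("--num-gpus-per-node", type=int, default=8)

args = parser.parse_args([])  # empty argv just to demonstrate the defaults
assert (args.min_num_gpus_per_engine, args.max_num_gpus_per_engine) == (1, 8)
```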

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • RBAC removal justification: Verify the complete removal of namespace-scoped profiling-job RBAC and nodes-specific ClusterRole is intentional and won't break existing deployments
  • Namespace restriction validation: Review the validation logic preventing GPU discovery in namespace-restricted environments and ensure error messaging guides users correctly
  • GPU discovery flag propagation: Confirm the enable-gpu-discovery flag is correctly propagated through CLI → controller → profiler pipeline
  • Default value changes: Validate that changing GPU defaults from 0 to 1/8 is safe for existing configurations

Poem

🐰 A bunny hops through GPUs bright,
Discovery flags set left and right,
RBAC roles now slimmed down tight,
Namespaces dance in new delight,
Hardware config shines just right! ✨

Pre-merge checks

❌ Failed checks (2 warnings)
  • Title Check ⚠️ Warning: The pull request title "feat: remove cluster wide logic from namespace restricted operator" accurately describes one major component of the changeset: the removal of namespace-scoped RBAC components (ClusterRoles, RoleBindings, ServiceAccounts) and the inversion of conditional logic in deployment configurations. However, the PR objectives and description explicitly identify two equally major components: removing ClusterRoles AND adding a GPU discovery flag for cluster-wide operators. The title captures only the removal aspect and completely omits the addition of the new EnableGpuDiscovery flag across multiple files (profiler argparse, CRD types, controller logic, and documentation), which is a significant feature addition. A developer scanning commit history would have an incomplete understanding of the PR's full scope based on the title alone.
  • Description Check ⚠️ Warning: The pull request description is significantly incomplete and largely relies on template placeholders. While the Overview section is adequately filled with a clear statement of purpose ("Remove ClusterRoles from operator and profiling docs and add flag for cluster-wide operators to do gpu discovery"), the remaining three required sections are missing substantive content: the Details section contains only the template comment, the "Where should the reviewer start?" section is empty except for the placeholder, and the Related Issues section shows a placeholder issue number (#xxx) rather than an actual GitHub issue reference. These are critical sections for helping reviewers understand the changes and their impact. Complete the missing sections: expand the Details section to describe the specific changes (RBAC removals in profiling-job-rbac.yaml and deployment.yaml, GPU discovery flag additions, validation logic), populate "Where should the reviewer start?" with key files to review (at minimum: profiling-job-rbac.yaml, deployment.yaml, dynamographdeploymentrequest_types.go, and dynamographdeploymentrequest_controller.go), and replace the placeholder with the actual GitHub issue number being closed by this PR. These additions will provide essential context for reviewers.
✅ Passed checks (1 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
benchmarks/profiler/utils/profiler_argparse.py (1)

314-328: Review validation logic interaction with new defaults.

The validation logic checks for zero values in hardware parameters when enable_gpu_discovery is false. However, with the new defaults (min: 1, max: 8, node: 8), these validation errors would only trigger if a user explicitly sets these values to 0 in their configuration.

Consider whether the validation messages should be updated to reflect that users are overriding sensible defaults rather than failing to provide required values. The current error messages imply these are required parameters, but they now have working defaults.

Example updated message:

parser.error(
    "Hardware parameters are set to 0. When --enable-gpu-discovery is false, you must provide non-zero values for "
    "--min-num-gpus-per-engine and --max-num-gpus-per-engine, or use the default values."
)
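
For concreteness, a hedged sketch of how that guard could look with the revised wording; the helper name validate_hardware_args and the parser/args objects are illustrative, while the attribute names mirror the CLI flags:

```python
import argparse


def validate_hardware_args(
    parser: argparse.ArgumentParser, args: argparse.Namespace
) -> None:
    """Fail fast when GPU discovery is off but a hardware value is zeroed out."""
    if not args.enable_gpu_discovery and 0 in (
        args.min_num_gpus_per_engine,
        args.max_num_gpus_per_engine,
        args.num_gpus_per_node,
    ):
        parser.error(
            "Hardware parameters are set to 0. When --enable-gpu-discovery is false, "
            "you must provide non-zero values for --min-num-gpus-per-engine and "
            "--max-num-gpus-per-engine, or use the default values."
        )
```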
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79122c and a4ebf6e.

📒 Files selected for processing (7)
  • benchmarks/profiler/utils/profiler_argparse.py (3 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (0 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2 hunks)
  • docs/benchmarks/sla_driven_profiling.md (1 hunks)
  • docs/planner/sla_planner_quickstart.md (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (55-71)
benchmarks/profiler/utils/profiler_argparse.py (1)
benchmarks/profiler/utils/search_space_autogen.py (1)
  • auto_generate_search_space (29-88)
🪛 GitHub Actions: Generate Documentation
docs/benchmarks/sla_driven_profiling.md

[warning] 86-86: image file not readable: docs/images/h100_prefill_performance.png


[warning] 88-88: image file not readable: docs/images/h100_decode_performance.png


[warning] 91-91: image file not readable: docs/images/pd_interpolation.png

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
docs/planner/sla_planner_quickstart.md (2)

232-235: LGTM! Configuration keys updated to use underscores.

The configuration keys have been changed from dotted notation (aic.system, aic.model_name, aic.backend_version) to underscored notation (aic_system, aic_model_name, aic_backend_version). This provides a cleaner YAML structure and aligns with the profiler argument parser changes.

Note: This is a breaking change. Users with existing DGDRs using the old dotted notation will need to update their configurations.
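
For illustration, the rename changes only the key spelling; the values below are placeholders and the surrounding profiling-config structure is omitted:

```yaml
# Before (dotted keys, no longer accepted):
#   aic.system: <system>
#   aic.model_name: <model>
#   aic.backend_version: <version>
# After (underscored keys):
aic_system: <system>
aic_model_name: <model>
aic_backend_version: <version>
```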


240-242: LGTM! Section updated to reflect new hardware configuration paradigm.

The section header and content have been appropriately updated from "GPU Discovery" to "Hardware Configuration" with a reference to the detailed documentation. This aligns with the PR's objective to position GPU discovery as an optional, cluster-scoped feature rather than a general requirement.

deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)

127-130: LGTM! Profiling flags now correctly gated behind cluster-scoped operator mode.

The logic has been appropriately inverted: profiling-related flags (--dgdr-profiling-cluster-role-name and --planner-cluster-role-name) are now only added when namespaceRestriction.enabled is false. This ensures that cluster-wide profiling features requiring node access are only enabled for cluster-scoped operators, directly supporting the PR's objective to remove cluster-wide logic from namespace-restricted operators.
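
A minimal sketch of the inverted conditional; the flag names and the namespaceRestriction.enabled values path come from this PR, while the role-name values paths are placeholders:

```yaml
# deployment.yaml (operator container args, excerpt; illustrative only)
{{- if not .Values.namespaceRestriction.enabled }}
- --dgdr-profiling-cluster-role-name={{ .Values.dgdrProfilingClusterRoleName }}
- --planner-cluster-role-name={{ .Values.plannerClusterRoleName }}
{{- end }}
```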

deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)

117-124: LGTM! New EnableGpuDiscovery field is well-defined.

The new EnableGpuDiscovery field is properly implemented with:

  • Appropriate type (bool) and JSON tag (enableGpuDiscovery,omitempty)
  • Safe default value (false) preventing breaking changes
  • Clear kubebuilder validation markers (default and optional)
  • Comprehensive documentation explaining behavior, overrides, and cluster-wide requirement

The field integrates cleanly with the existing spec structure and properly communicates constraints to users.
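
Based on the points above, the field plausibly reads roughly like the following excerpt-style sketch (doc comment paraphrased, sibling fields omitted):

```go
package v1alpha1

// Sketch of the new field only; the real DynamoGraphDeploymentRequestSpec
// contains many other fields that are omitted here.
type DynamoGraphDeploymentRequestSpec struct {
	// EnableGpuDiscovery turns on automatic GPU discovery during profiling.
	// Discovered values override any manually supplied hardware configuration.
	// Requires a cluster-scoped operator, since node access is needed.
	// +kubebuilder:default=false
	// +optional
	EnableGpuDiscovery bool `json:"enableGpuDiscovery,omitempty"`
}
```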

deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2)

706-709: LGTM! Validation correctly enforces cluster-scoped requirement.

The validation properly prevents enabling EnableGpuDiscovery for namespace-restricted operators by checking r.Config.RestrictedNamespace. The error message is clear, actionable, and guides users to provide manual hardware configuration instead.
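
A minimal sketch of that check, assuming the restricted namespace is exposed as a string on the controller config; the helper and error wording are illustrative:

```go
package controller

import "fmt"

// validateGpuDiscovery rejects GPU discovery when the operator is restricted
// to a single namespace; sketch only, the real check lives in the DGDR controller.
func validateGpuDiscovery(enableGpuDiscovery bool, restrictedNamespace string) error {
	if enableGpuDiscovery && restrictedNamespace != "" {
		return fmt.Errorf(
			"spec.enableGpuDiscovery requires a cluster-scoped operator, but this operator "+
				"is restricted to namespace %q; disable GPU discovery and provide hardware "+
				"configuration manually", restrictedNamespace)
	}
	return nil
}
```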


928-932: LGTM! Profiler flag correctly propagated.

The controller correctly propagates the EnableGpuDiscovery flag to the profiler container arguments when enabled. The comment helpfully explains the cluster-wide requirement, and the implementation is straightforward.
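
A sketch of the propagation step; the helper name and base argument list are illustrative:

```go
package controller

// buildProfilerArgs appends the discovery flag to the profiler container's
// arguments when the DGDR spec enables it; sketch only.
func buildProfilerArgs(baseArgs []string, enableGpuDiscovery bool) []string {
	args := append([]string{}, baseArgs...)
	if enableGpuDiscovery {
		// Only reached for cluster-scoped operators; namespace-restricted
		// operators are rejected earlier by validation.
		args = append(args, "--enable-gpu-discovery")
	}
	return args
}
```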

docs/benchmarks/sla_driven_profiling.md (4)

45-45: LGTM! GPU discovery scope correctly clarified.

The documentation now correctly notes that GPU resource discovery is only applicable to cluster-scoped operators, aligning with the validation logic in the controller.


52-78: LGTM! Excellent documentation of hardware configuration.

The new Hardware Configuration section clearly explains that hardware parameters have sensible defaults and are optional. The example YAML is well-structured and demonstrates both hardware overrides and the AIC system configuration.


83-83: LGTM! Profiling method step appropriately reframed.

The first step in the profiling method has been correctly updated from "GPU Discovery" to "Hardware Setup," accurately reflecting that hardware configuration can use defaults, user-specified values, or optional automatic GPU discovery (for cluster-scoped operators).


86-91: Note: Pre-existing pipeline warnings for missing images.

The pipeline shows warnings about missing image files (h100_prefill_performance.png, h100_decode_performance.png, pd_interpolation.png). These are pre-existing issues not introduced by this PR, as these lines are not marked as changed. The images should be added separately to resolve the pipeline warnings.

benchmarks/profiler/utils/profiler_argparse.py (2)

251-256: LGTM! New enable-gpu-discovery flag is well-implemented.

The new --enable-gpu-discovery flag is properly defined with:

  • Correct action type (store_true) for boolean flag
  • Safe default value (False from config)
  • Clear help text explaining override behavior and cluster-wide permission requirement

315-316: LGTM! GPU discovery integration is correct.

The conditional call to auto_generate_search_space(args) correctly implements the GPU discovery toggle. When enabled, it queries the Kubernetes cluster for GPU information and overrides any manually specified hardware configuration, as documented in the API.
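
In effect the toggle reduces to something like the sketch below; the wrapper function and import path are assumptions, while auto_generate_search_space is the function referenced above:

```python
# Import path assumed from the file listing in the code-graph section above.
from benchmarks.profiler.utils.search_space_autogen import auto_generate_search_space


def maybe_discover_gpus(args) -> None:
    """Run GPU discovery only when the flag is set (cluster-scoped operators)."""
    if args.enable_gpu_discovery:
        # Query the Kubernetes cluster for GPU counts/types and overwrite the
        # manually supplied hardware settings before building the search space.
        auto_generate_search_space(args)
```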

@tedzhouhk (Contributor) left a comment

LGTM other than CodeRabbit's and Julien's comments

dep-554-remove-cluster-wide-logic-from-namespace-restricted-operator

Signed-off-by: Hannah Zhang <[email protected]>
@hhzhang16 (Contributor, Author) commented:
/ok to test cad1742

@hhzhang16 enabled auto-merge (squash) on October 30, 2025 00:28
@hhzhang16 (Contributor, Author) commented:
/ok to test bd26931
