feat: remove cluster wide logic from namespace restricted operator #3934
base: main
Conversation
Signed-off-by: Hannah Zhang <[email protected]>
Walkthrough
This PR introduces GPU discovery feature management by adding an `enableGpuDiscovery` option that gates cluster-wide GPU discovery behind cluster-scoped operators.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
benchmarks/profiler/utils/profiler_argparse.py (1)
314-328: Review validation logic interaction with new defaults. The validation logic checks for zero values in hardware parameters when `enable_gpu_discovery` is false. However, with the new defaults (min: 1, max: 8, node: 8), these validation errors would only trigger if a user explicitly sets these values to 0 in their configuration. Consider whether the validation messages should be updated to reflect that users are overriding sensible defaults rather than failing to provide required values. The current error messages imply these are required parameters, but they now have working defaults.
Example updated message:

```python
parser.error(
    "Hardware parameters are set to 0. When --enable-gpu-discovery is false, "
    "you must provide non-zero values for --min-num-gpus-per-engine and "
    "--max-num-gpus-per-engine, or use the default values."
)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- benchmarks/profiler/utils/profiler_argparse.py (3 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunk)
- deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1 hunk)
- deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2 hunks)
- docs/benchmarks/sla_driven_profiling.md (1 hunk)
- docs/planner/sla_planner_quickstart.md (1 hunk)
💤 Files with no reviewable changes (1)
- deploy/cloud/helm/platform/components/operator/templates/profiling-job-rbac.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
Config (55-71)
benchmarks/profiler/utils/profiler_argparse.py (1)
benchmarks/profiler/utils/search_space_autogen.py (1)
auto_generate_search_space (29-88)
🪛 GitHub Actions: Generate Documentation
docs/benchmarks/sla_driven_profiling.md
[warning] 86-86: image file not readable: docs/images/h100_prefill_performance.png
[warning] 88-88: image file not readable: docs/images/h100_decode_performance.png
[warning] 91-91: image file not readable: docs/images/pd_interpolation.png
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
docs/planner/sla_planner_quickstart.md (2)
232-235: LGTM! Configuration keys updated to use underscores. The configuration keys have been changed from dotted notation (`aic.system`, `aic.model_name`, `aic.backend_version`) to underscored notation (`aic_system`, `aic_model_name`, `aic_backend_version`). This provides a cleaner YAML structure and aligns with the profiler argument parser changes.

Note: This is a breaking change. Users with existing DGDRs using the old dotted notation will need to update their configurations.
240-242: LGTM! Section updated to reflect new hardware configuration paradigm. The section header and content have been appropriately updated from "GPU Discovery" to "Hardware Configuration" with a reference to the detailed documentation. This aligns with the PR's objective to position GPU discovery as an optional, cluster-scoped feature rather than a general requirement.
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1)
127-130: LGTM! Profiling flags now correctly gated behind cluster-scoped operator mode. The logic has been appropriately inverted: the profiling-related flags (`--dgdr-profiling-cluster-role-name` and `--planner-cluster-role-name`) are now only added when `namespaceRestriction.enabled` is `false`. This ensures that cluster-wide profiling features requiring node access are only enabled for cluster-scoped operators, directly supporting the PR's objective to remove cluster-wide logic from namespace-restricted operators. (See the sketch below.)
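A minimal sketch of that gating pattern, assuming the chart's `namespaceRestriction.enabled` value from the review; the values paths for the role names are hypothetical:

```yaml
{{- if not .Values.namespaceRestriction.enabled }}
# Cluster-scoped only: these roles grant the node access profiling needs.
- --dgdr-profiling-cluster-role-name={{ .Values.dgdrProfilingClusterRoleName }}
- --planner-cluster-role-name={{ .Values.plannerClusterRoleName }}
{{- end }}
```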
deploy/cloud/operator/api/v1alpha1/dynamographdeploymentrequest_types.go (1)
117-124: LGTM! New EnableGpuDiscovery field is well-defined. The new `EnableGpuDiscovery` field is properly implemented with:
- Appropriate type (bool) and JSON tag (`enableGpuDiscovery,omitempty`)
- Safe default value (`false`), preventing breaking changes
- Clear kubebuilder validation markers (default and optional)
- Comprehensive documentation explaining behavior, overrides, and the cluster-wide requirement

The field integrates cleanly with the existing spec structure and properly communicates constraints to users. A sketch of the field follows.
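To make the pattern concrete, here is a minimal sketch of such a field declaration; the comment text and surrounding spec are illustrative, not copied from the actual source:

```go
package v1alpha1

// DynamoGraphDeploymentRequestSpec is shown only in part; the field name,
// JSON tag, and kubebuilder markers follow the review's description.
type DynamoGraphDeploymentRequestSpec struct {
	// EnableGpuDiscovery enables automatic GPU discovery for profiling.
	// When true, discovered values override any manually specified hardware
	// configuration. Requires a cluster-scoped operator, since listing
	// nodes is a cluster-wide permission.
	// +kubebuilder:default=false
	// +optional
	EnableGpuDiscovery bool `json:"enableGpuDiscovery,omitempty"`
}
```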
deploy/cloud/operator/internal/controller/dynamographdeploymentrequest_controller.go (2)
706-709: LGTM! Validation correctly enforces cluster-scoped requirement. The validation properly prevents enabling `EnableGpuDiscovery` for namespace-restricted operators by checking `r.Config.RestrictedNamespace`. The error message is clear, actionable, and guides users to provide manual hardware configuration instead. (Sketch below.)
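A rough sketch of that check, assuming `RestrictedNamespace` is a string that is non-empty when the operator is namespace-restricted; the stub types and helper name are hypothetical stand-ins for the real operator packages:

```go
package controller

import "fmt"

// Stub types so the sketch stands alone; the real definitions live in the
// operator's api/v1alpha1 and controller packages.
type reconciler struct {
	Config struct{ RestrictedNamespace string }
}

type dgdrSpec struct{ EnableGpuDiscovery bool }

// validateGpuDiscovery rejects EnableGpuDiscovery for namespace-restricted
// operators, since GPU discovery needs cluster-wide node access.
func (r *reconciler) validateGpuDiscovery(spec dgdrSpec) error {
	if spec.EnableGpuDiscovery && r.Config.RestrictedNamespace != "" {
		return fmt.Errorf(
			"enableGpuDiscovery requires a cluster-scoped operator; "+
				"operator is restricted to namespace %q; "+
				"provide hardware configuration manually instead",
			r.Config.RestrictedNamespace)
	}
	return nil
}
```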
928-932: LGTM! Profiler flag correctly propagated. The controller correctly propagates the `EnableGpuDiscovery` flag to the profiler container arguments when enabled. The comment helpfully explains the cluster-wide requirement, and the implementation is straightforward. (Sketch below.)
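The propagation amounts to a conditional append along these lines; the function and type names here are hypothetical:

```go
package controller

type profilerSpec struct{ EnableGpuDiscovery bool }

// appendProfilerArgs adds the discovery flag only when the spec requests it;
// validation has already ensured the operator is cluster-scoped in that case.
func appendProfilerArgs(spec profilerSpec, args []string) []string {
	if spec.EnableGpuDiscovery {
		// GPU discovery requires listing nodes, a cluster-wide permission.
		args = append(args, "--enable-gpu-discovery")
	}
	return args
}
```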
docs/benchmarks/sla_driven_profiling.md (4)
45-45: LGTM! GPU discovery scope correctly clarified. The documentation now correctly notes that GPU resource discovery is only applicable to cluster-scoped operators, aligning with the validation logic in the controller.
52-78: LGTM! Excellent documentation of hardware configuration. The new Hardware Configuration section clearly explains that hardware parameters have sensible defaults and are optional. The example YAML is well-structured and demonstrates both hardware overrides and the AIC system configuration.
83-83: LGTM! Profiling method step appropriately reframed. The first step in the profiling method has been correctly updated from "GPU Discovery" to "Hardware Setup," accurately reflecting that hardware configuration can use defaults, user-specified values, or optional automatic GPU discovery (for cluster-scoped operators).
86-91: Note: Pre-existing pipeline warnings for missing images. The pipeline shows warnings about missing image files (`h100_prefill_performance.png`, `h100_decode_performance.png`, `pd_interpolation.png`). These are pre-existing issues not introduced by this PR, as these lines are not marked as changed. The images should be added separately to resolve the pipeline warnings.
benchmarks/profiler/utils/profiler_argparse.py (2)
251-256: LGTM! New enable-gpu-discovery flag is well-implemented. The new `--enable-gpu-discovery` flag is properly defined with:
- Correct action type (`store_true`) for a boolean flag
- Safe default value (`False`, from config)
- Clear help text explaining override behavior and the cluster-wide permission requirement
315-316: LGTM! GPU discovery integration is correct. The conditional call to `auto_generate_search_space(args)` correctly implements the GPU discovery toggle. When enabled, it queries the Kubernetes cluster for GPU information and overrides any manually specified hardware configuration, as documented in the API.
LGTM other than CodeRabbit's and Julien's comments.
dep-554-remove-cluster-wide-logic-from-namespace-restricted-operator
Signed-off-by: Hannah Zhang <[email protected]>
dep-554-remove-cluster-wide-logic-from-namespace-restricted-operator
…om-namespace-restricted-operator
/ok to test cad1742
…om-namespace-restricted-operator
/ok to test bd26931
Overview:
Remove ClusterRoles from the operator and profiling docs, and add a flag that lets cluster-wide operators perform GPU discovery.
Summary by CodeRabbit
- New Features: optional `enableGpuDiscovery` setting on DynamoGraphDeploymentRequest and a matching `--enable-gpu-discovery` profiler flag, available only with a cluster-scoped operator.
- Documentation: hardware configuration guidance updated in the SLA-driven profiling and SLA planner quickstart docs.
- Chores: profiling-job ClusterRole template removed from the operator Helm chart.