ROCm NVIDIA agents #107
📝 Walkthrough

Restructures agent container builds from a single image to three GPU-variant targets (no-GPU, NVIDIA, AMD) using a multi-stage Containerfile. Renames the operator image to quay.io/gkm/gkm-operator.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
agent images are small
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files:
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:

- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting a common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…yment

Replace the generic agent with GPU-vendor-specific agents that deploy based on hardware detection. This enables hybrid clusters with both NVIDIA and AMD GPUs to run optimized agents with the appropriate runtime libraries.

Changes:

- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  * gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  * gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  * build-image-agent-nvidia, build-image-agent-amd
  * build-image-agents-gpu (builds both)
  * push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types. NFD automatically labels nodes with GPU vendor information, enabling declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into the main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on the KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack, including GPU detection (NFD) and optional image verification (Kyverno), on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push the no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
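The gate described above could be sketched in the Makefile roughly like this (the aggregate target name build-images-agents is illustrative; the per-variant target names come from the earlier commit, and recipe lines must be indented with a tab):

```makefile
# NO_GPU_BUILD=true builds only the no-GPU agent;
# the default builds all three variants.
NO_GPU_BUILD ?= false

.PHONY: build-images-agents
build-images-agents: build-image-agent-nogpu
ifneq ($(NO_GPU_BUILD),true)
	$(MAKE) build-image-agent-nvidia build-image-agent-amd
endif
```

For a Kind/test cluster one would then run `make build-images-agents NO_GPU_BUILD=true` and skip the GPU builds entirely.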
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.
Example usage:
make deploy \
OPERATOR_IMG=quay.io/user/gkm:operator \
EXTRACT_IMG=quay.io/user/gkm:extract \
AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers), but the agents were using simple nodeSelectors looking for the vendor ID only (pci-10de).

Changes:

- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix the deprecated node-role.kubernetes.io/master label
- Replace wget with curl in the Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
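For the NVIDIA agent, the resulting affinity can be sketched as follows (a minimal sketch, not the exact manifest). nodeSelectorTerms are ORed, so a node carrying either class-qualified label is eligible:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # NFD labels follow pci-<class>_<vendor>.present;
        # 0300 = VGA controller, 0302 = 3D controller, 10de = NVIDIA.
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: Exists
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: Exists
```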
In multi-node clusters, the nogpu agent should not run on control-plane nodes. Also updated to match the PCI class code label format used by the GPU agents.

Changes:

- Add nodeAffinity to exclude nodes with the node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of the vendor ID only
- Ensures the nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
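The exclusion logic above could look roughly like this (an assumed sketch, not the exact manifest). matchExpressions inside a single term are ANDed, so the agent avoids control-plane nodes and any node carrying a GPU class label:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            # The remaining class/vendor combinations (0302, 0380;
            # vendors 10de and 1002) would be excluded the same way.
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0300_1002.present
              operator: DoesNotExist
```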
… requests

The GPU agents were unable to access GPUs because they lacked the necessary GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:

- /usr/lib64 -> contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:

- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> system libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use the NVML/ROCm APIs to detect and monitor ALL GPUs on the node without requesting GPU resources (nvidia.com/gpu or amd.com/gpu), which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
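For the NVIDIA agent, the mount pattern described above might look like this (volume and container names here are illustrative):

```yaml
containers:
  - name: gkm-agent-nvidia
    env:
      # Point the loader at the host's NVIDIA libraries (libnvidia-ml.so).
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
```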
Add a comprehensive installation script and make target to automate dependency setup on RHEL 10 systems. The script handles installation of build dependencies from the CentOS Stream and Fedora repositories, and installs/upgrades go, podman, and kubectl to the required versions.

Changes:

- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to the Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Move inline comment out of the matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:

- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from a flat structure to an organized hierarchy:

- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:

- Updates README paths to reflect the new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images
The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixes Kyverno and NFD component scheduling issues in Kind clusters with GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:

- Use Kind-specific Kyverno values when NO_GPU=true in the deploy target
- Remove the duplicate Kyverno deployment from the run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for the NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
NFD is unnecessary in Kind simulated GPU environments. Instead, patch the agent daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present, nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:

- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use the rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use the nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Introduces a SKIP_NFD flag to control NFD deployment separately from NO_GPU mode. For Kind clusters, we skip NFD deployment but simulate it by manually adding the PCI device labels that NFD would normally create.

Changes:

- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deploy/undeploy
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
For Kind GPU simulation with NO_GPU=true, remove the NFD PCI label requirements by removing node affinity from the nogpu agent. This allows the agent to schedule on all worker nodes without needing NFD labels.

Changes:

- Remove NFD PCI label addition from the deploy-on-kind target
- Add a Kind-specific patch to remove node affinity from the nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across the ROCM and CUDA examples:

- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*
Now matches CUDA pattern: gkm-test-{vendor}-rwo-*
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest as the builder stage, eliminating code duplication while keeping GPU-specific runtime stages intact. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced the separate Containerfiles for each agent variant with a single Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia). This eliminates cross-file dependencies and enables parallel CI builds.

Changes:

- Created Containerfile.gkm-agents with a shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with the agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use the single Containerfile
- Removed the obsolete individual Containerfiles
- Updated documentation references

Benefits:

- No build dependencies between separate files
- Builder stage always available in the same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
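The layout described can be sketched as one multi-stage Containerfile along these lines (base images, paths, and package names are illustrative assumptions, not the actual file):

```dockerfile
# Shared builder stage compiles the agent once for every variant.
FROM golang:1.24 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/gkm-agent ./cmd/gkm-agent

# nogpu: complete agent with common runtime deps.
FROM registry.fedoraproject.org/fedora-minimal:latest AS nogpu
COPY --from=builder /out/gkm-agent /usr/local/bin/gkm-agent
ENTRYPOINT ["/usr/local/bin/gkm-agent"]

# amd: extends nogpu, adding only the ROCm pieces,
# so it reuses every nogpu layer.
FROM nogpu AS amd
RUN microdnf install -y rocm-smi-libs && microdnf clean all

# nvidia: CUDA runtime base with the same agent binary.
FROM nvidia/cuda:12.6.3-base-ubi9 AS nvidia
COPY --from=builder /out/gkm-agent /usr/local/bin/gkm-agent
ENTRYPOINT ["/usr/local/bin/gkm-agent"]
```

A variant is then selected at build time, e.g. `podman build -f Containerfile.gkm-agents --target amd -t gkm-agent-amd .`, which is what lets CI build the three targets in parallel from one file.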
Changed gkm.agent.image from non-existent gkm-agent:latest to gkm-agent-nogpu:latest. This value is legacy/unused (operator only logs it), but needs to reference a real image for backwards compatibility. Each agent daemonset uses its GPU-specific image directly, so this configmap value is not actually used at runtime. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add supplementalGroups to ROCm agent pod spec to grant video group membership (GIDs 44 and 986), enabling proper access to /dev/kfd and /dev/dri devices required for amd-smi GPU detection. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
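At the pod level, that addition might look like the following (the GIDs come from the commit message; they should be verified against the video group on the target hosts):

```yaml
spec:
  securityContext:
    # 44 and 986: video group GIDs granting access to /dev/kfd and /dev/dri.
    supplementalGroups: [44, 986]
```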
AMD integrated GPUs (e.g., Strix Halo) return "N/A" strings for PCIe
and ECC fields that discrete GPUs populate with typed values. Changed
AMDBus, AMDRAS, and AMDCardInfo structs to use interface{} for fields
that can be either typed values or "N/A" strings, allowing JSON parsing
to succeed for both integrated and discrete AMD GPUs.
Also added render group (GID 105) to AMD agent pod security context to
ensure device access on systems where /dev/kfd and /dev/dri/renderD*
are owned by the render group.
Includes test with real Strix Halo GPU output to verify parsing.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/namespace/RWO/ROCM/14-ds.yaml (1)
29-47: ⚠️ Potential issue | 🟠 Major

Harden container security contexts in this example DaemonSet.

Line 32 runs the init container as root, and neither container sets allowPrivilegeEscalation: false. This keeps a permissive baseline and triggers the medium-risk checks noted in static analysis.

Suggested hardening patch:

```diff
   initContainers:
     - name: fix-permissions
       image: quay.io/fedora/fedora-minimal
       securityContext:
         runAsUser: 0
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
@@
   containers:
     - name: test
       image: quay.io/fedora/fedora-minimal
       imagePullPolicy: IfNotPresent
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
       command: [sleep, 365d]
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/namespace/RWO/ROCM/14-ds.yaml` around lines 29 - 47, The init container "fix-permissions" is running as root (securityContext.runAsUser: 0) and neither "fix-permissions" nor the main container "test" set allowPrivilegeEscalation, which is insecure; change the init container to run as a non-root UID (e.g., set securityContext.runAsUser: 1000 and/or securityContext.runAsNonRoot: true for "fix-permissions") and add securityContext.allowPrivilegeEscalation: false to both the "fix-permissions" initContainer and the "test" container (also consider adding securityContext.readOnlyRootFilesystem: true and dropping capabilities if appropriate) so both containers no longer run with elevated privileges.
🧹 Nitpick comments (2)
config/operator/kustomization.yaml (1)
6-14: Trim the legacy operator image aliases.
config/operator/operator.yaml already uses quay.io/gkm/gkm-operator:latest, so the controller and quay.io/gkm/operator mappings are just extra indirection now. Keeping only the canonical quay.io/gkm/gkm-operator entry will make future overrides easier to reason about.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@config/operator/kustomization.yaml` around lines 6 - 14, remove the legacy image name mappings in the kustomization.yaml and keep only the canonical mapping for quay.io/gkm/gkm-operator:latest; specifically delete the entries referencing the alias "controller" and the old "quay.io/gkm/operator" and retain only the image block whose name is "quay.io/gkm/gkm-operator" with newName quay.io/gkm/gkm-operator and newTag latest so overrides target a single, unambiguous image reference.

config/kind-gpu/agent-remove-affinity-patch.yaml (1)
9-10: Keep the control-plane exclusion when dropping GPU affinity.
affinity: null removes the node-role.kubernetes.io/control-plane guard from config/agent/gkm-agent-nogpu.yaml:20-42 together with the GPU checks. That makes the "worker nodes only" behavior depend on other patches instead of this overlay. Re-add the control-plane exclusion here and only drop the GPU-specific predicates.

Suggested patch:

```diff
-  affinity: null
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: node-role.kubernetes.io/control-plane
+                operator: DoesNotExist
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/kind-gpu/agent-remove-affinity-patch.yaml` around lines 9 - 10, The current patch sets affinity: null which removes the control-plane exclusion; instead restore a nodeAffinity block that keeps the "worker nodes only" guard while removing only the GPU-specific predicates. Replace affinity: null with an affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms that include a matchExpression to exclude control-plane nodes (e.g. key: node-role.kubernetes.io/control-plane, operator: DoesNotExist) and remove any matchExpressions or nodeSelector entries that reference GPU-specific keys (e.g. nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the control-plane exclusion remains; mirror the structure used in config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/agent/gkm-agent-nvidia.yaml`:
- Around line 24-33: The current Kubernetes pod affinity under
affinity.nodeAffinity.nodeSelectorTerms only matches PCI classes 0300 and 0302;
add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.
- Around line 60-61: The LD_LIBRARY_PATH environment entries currently overwrite
the container's libraries by hardcoding /usr/lib64; change each LD_LIBRARY_PATH
env var to preserve the container's CUDA/NVML library paths and only add the
host library directory instead of replacing the whole variable (i.e.,
prepend/append the host path to the existing LD_LIBRARY_PATH or explicitly
include the container's standard CUDA/NVML paths such as the Ubuntu GPU library
directories plus the host path); apply this fix to all LD_LIBRARY_PATH env
entries in the NVIDIA agent spec so the container retains its distro-native
CUDA/NVML paths while allowing host libs to be visible.
In `@config/agent/README.md`:
- Around line 123-135: The "Node Selectors" docs incorrectly show nodeSelector
strings; update the section to document using affinity.nodeAffinity with
requiredDuringSchedulingIgnoredDuringExecution and matchExpressions using the
Exists operator instead of feature.node...: "true" strings; replace the
nodeSelector examples with affinity examples that include nodeSelectorTerms ->
matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.
- Around line 37-39: The README's label examples use vendor-only format
(`pci-10de.present`) but the DaemonSets emit class-qualified labels; update the
examples in config/agent/README.md to the actual format used by the DaemonSet
(e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true` for NVIDIA and
`feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD), replace all
vendor-only examples with the class_vendor format, and add a short note
describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.
In `@config/nfd/README.md`:
- Around line 98-110: The Integration section incorrectly shows a nodeSelector
using string values; replace those examples by documenting affinity.nodeAffinity
usage: remove the nodeSelector blocks and show affinity.nodeAffinity ->
requiredDuringSchedulingIgnoredDuringExecution -> nodeSelectorTerms with
matchExpressions using operator: Exists, and use the correct feature keys (e.g.
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present for NVIDIA agents) so the
DaemonSets use Exists-based nodeAffinity rather than string-valued nodeSelector.
- Around line 11-12: The README currently shows vendor-only labels
(`pci-10de.present`, `pci-1002.present`) but the agent DaemonSets use
class-qualified labels (e.g., `pci-0300_10de.present`, `pci-0302_10de.present`);
update the README so all examples and verification commands (the checks around
lines referenced in the comment) use the exact PCI class + vendor labels used by
the DaemonSets (replace vendor-only labels with `pci-0300_10de.present` and
`pci-0302_10de.present` for NVIDIA and `pci-0300_1002.present` /
`pci-0302_1002.present` for AMD), and adjust the kubectl verification commands
to query those class-qualified labels so they return the expected results.
- Around line 38-39: Update the verification commands to use the class-qualified
NFD label keys instead of the bare vendor tokens: replace occurrences of
"feature.node.kubernetes.io/pci-10de.present" and
"feature.node.kubernetes.io/pci-1002.present" with their class-qualified forms
"feature.node.kubernetes.io/pci.vendor-10de.present" and
"feature.node.kubernetes.io/pci.vendor-1002.present" in the README so the
kubectl get nodes -L command lists the correct NFD labels.
In `@docs/GettingStartedGuide.md`:
- Around line 13-28: The documentation's Automated Installation section (under
"Automated Installation (RHEL 10 / CentOS Stream 10)") incorrectly lists kubectl
minimum as v1.11.3+; update the text in that section and the kubectl bullet so
the minimum required kubectl version is v1.24+ (to match NFD v0.17.2 and
Kubernetes v1.24+). Ensure the change is applied to the make install-deps /
kubectl install description so users installing via "make install-deps" see the
corrected kubectl minimum.
In `@examples/namespace/RWO/CUDA/11-gkmcache.yaml`:
- Around line 13-19: The podTemplate currently only adds a toleration
(podTemplate.spec.tolerations) which is permissive; update the Pod spec to
require NVIDIA nodes by adding either a nodeSelector or a required nodeAffinity:
under podTemplate.spec add nodeSelector: {"nvidia.com/gpu": "present"} or,
preferably, add
affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution with a
nodeSelectorTerm that uses matchExpressions key "nvidia.com/gpu" operator
"Exists" (or matching the label used by your NVIDIA nodes), so the CUDA
extraction job is scheduled only onto NVIDIA GPU nodes.
In `@examples/namespace/RWO/CUDA/12-ds.yaml`:
- Around line 7-17: The DaemonSet label keys are inconsistent: the DaemonSet
metadata uses gkm.io/pvcMutation (camelCase) while the pod template uses
gkm.io/pvc-mutation (kebab-case); update the DaemonSet-level label (under
labels: and metadata:) to use the same key as the pod template
(gkm.io/pvc-mutation) so selectors/mutation webhooks match, ensuring you change
every occurrence of gkm.io/pvcMutation to gkm.io/pvc-mutation in this resource
(including labels: and any metadata sections).
In `@examples/namespace/RWO/CUDA/README.md`:
- Around line 50-52: Update the README namespace to match the manifests: replace
all occurrences of "gkm-test-ns-nvidia-rwo-1" in
examples/namespace/RWO/CUDA/README.md with "gkm-test-ns-cuda-rwo-1" so kubectl
commands align with the actual manifests (see manifests 12-ds.yaml and
13-pod.yaml); apply the same replacement for the additional occurrences called
out (lines around 63-64, 70, 77, 84-89, 95-96, 105, 120-121, 127) to ensure
every kubectl command and example uses the correct namespace string.
- Around line 84-89: The README uses pod/daemonset names that don't match the
manifests; replace the incorrect names referenced (gkm-test-nvidia-pod-1 and
gkm-test-nvidia-rwo-ds-1) with the manifest names (gkm-test-cuda-pod-1 and
gkm-test-cuda-rwo-ds-1) in the kubectl commands so the examples align with the
resources created by the manifests (update the lines showing kubectl get pod,
kubectl logs, and the DaemonSet/pod label commands).
In `@hack/install_deps.sh`:
- Line 11: The MIN_GO_VERSION variable is set to a non-existent Go release
("1.25.0"), causing downloads from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz
to fail; update MIN_GO_VERSION to a valid Go release (e.g., "1.24.7" or
whichever official version you want to pin) in hack/install_deps.sh (the
MIN_GO_VERSION variable) and make the same matching change in the Containerfile
so the download URL and any version checks use a real release.
- Around line 44-53: The script hardcodes Fedora rawhide RPM URLs (see
"${FEDORA_BASE}/l/libbtrfs-...", "${FEDORA_BASE}/b/btrfs-progs-..." and
btrfs-progs-devel) which are fragile; change install_deps.sh to prefer
installing by package name via dnf (e.g., dnf install -y btrfs-progs
btrfs-progs-devel libbtrfs libbtrfsutil) or configure a stable Fedora repository
and fall back to downloading a versioned RPM only if dnf is unavailable, and
remove the fixed fc45 URLs or make them a configurable variable with a clear
comment so the dependency source is stable and maintainable.
In `@Makefile`:
- Around line 341-347: The deploy-nfd Makefile target currently masks wait
failures with "|| true" causing false success; update the deploy-nfd recipe
(target name: deploy-nfd) to stop swallowing the kubectl wait failure — remove
"|| true" from the line that runs $(KUBECTL) wait --for=condition=Available
--timeout=120s -n node-feature-discovery deployment/nfd-master and instead let
the command return a non-zero exit code (or add an explicit conditional that
prints a clear error and exits non-zero) so the Make target fails fast when NFD
never becomes ready.
- Around line 717-724: Remove the duplicate recipe that defines the
$(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and uses
curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier correct
rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only the
.PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.
In `@mcv/pkg/accelerator/devices/amd_test.go`:
- Around line 285-289: TranslateGPUToArch misclassifies "RDNA 2" because the
function checks a generic "RDNA" branch before the more specific RDNA2 pattern;
update TranslateGPUToArch to match versioned RDNA strings first (e.g., check for
"RDNA 2" / "RDNA 3" or parse the numeric RDNA version) before the generic "RDNA"
case so "RDNA 2" resolves to gfx1030; adjust the matching order or use a regex
that extracts the RDNA major version and maps 2→gfx1030 (while preserving
existing mappings for other checks like "Instinct MI210" / "gfx90a").
---
Outside diff comments:
In `@examples/namespace/RWO/ROCM/14-ds.yaml`:
- Around line 29-47: The init container "fix-permissions" is running as root
(securityContext.runAsUser: 0) and neither "fix-permissions" nor the main
container "test" set allowPrivilegeEscalation, which is insecure; change the
init container to run as a non-root UID (e.g., set securityContext.runAsUser:
1000 and/or securityContext.runAsNonRoot: true for "fix-permissions") and add
securityContext.allowPrivilegeEscalation: false to both the "fix-permissions"
initContainer and the "test" container (also consider adding
securityContext.readOnlyRootFilesystem: true and dropping capabilities if
appropriate) so both containers no longer run with elevated privileges.
---
Nitpick comments:
In `@config/kind-gpu/agent-remove-affinity-patch.yaml`:
- Around line 9-10: The current patch sets affinity: null which removes the
control-plane exclusion; instead restore a nodeAffinity block that keeps the
"worker nodes only" guard while removing only the GPU-specific predicates.
Replace affinity: null with an affinity: nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms that include a
matchExpression to exclude control-plane nodes (e.g. key:
node-role.kubernetes.io/control-plane, operator: DoesNotExist) and remove any
matchExpressions or nodeSelector entries that reference GPU-specific keys (e.g.
nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the control-plane
exclusion remains; mirror the structure used in
config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent
semantics.
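A minimal sketch of what the suggested patch shape could look like; the control-plane label key is an assumption borrowed from the standard Kubernetes node-role convention:

```yaml
# Hypothetical replacement for `affinity: null`: keep the worker-only
# guard while dropping only the GPU-specific predicates.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
```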
In `@config/operator/kustomization.yaml`:
- Around line 6-14: Remove the legacy image name mappings in the
kustomization.yaml and keep only the canonical mapping for
quay.io/gkm/gkm-operator:latest; specifically delete the entries referencing the
alias "controller" and the old "quay.io/gkm/operator" and retain only the image
block whose name is "quay.io/gkm/gkm-operator" with newName
quay.io/gkm/gkm-operator and newTag latest so overrides target a single,
unambiguous image reference.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8138b4b8-1144-4cd2-b83e-9ea48c888e55
⛔ Files ignored due to path filters (1)
`mcv/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (39)
`.github/workflows/image-build.yml`, `Containerfile.gkm-agent`, `Containerfile.gkm-agents`, `Makefile`, `config/agent/README.md`, `config/agent/gkm-agent-amd.yaml`, `config/agent/gkm-agent-nogpu.yaml`, `config/agent/gkm-agent-nvidia.yaml`, `config/agent/kustomization.yaml`, `config/configMap/configMap.yaml`, `config/configMap/kustomization.yaml`, `config/kind-gpu/agent-remove-affinity-patch.yaml`, `config/kind-gpu/kustomization.yaml`, `config/kyverno/values-no-gpu.yaml`, `config/nfd/README.md`, `config/nfd/kustomization.yaml`, `config/nfd/nfd-worker-conf.yaml`, `config/nfd/patch-nfd-gc.yaml`, `config/nfd/patch-nfd-workers.yaml`, `config/operator/kustomization.yaml`, `config/operator/operator.yaml`, `docs/GettingStartedGuide.md`, `examples/namespace/RWO/CUDA/10-namespace.yaml`, `examples/namespace/RWO/CUDA/11-gkmcache.yaml`, `examples/namespace/RWO/CUDA/12-ds.yaml`, `examples/namespace/RWO/CUDA/13-pod.yaml`, `examples/namespace/RWO/CUDA/README.md`, `examples/namespace/RWO/ROCM/10-namespace.yaml`, `examples/namespace/RWO/ROCM/11-gkmcache.yaml`, `examples/namespace/RWO/ROCM/12-ds.yaml`, `examples/namespace/RWO/ROCM/13-ds.yaml`, `examples/namespace/RWO/ROCM/14-ds.yaml`, `examples/namespace/RWO/ROCM/21-gkmcache-cosign-v3.yaml`, `examples/namespace/RWO/ROCM/22-ds.yaml`, `gkm-codespell.precommit-toml`, `hack/install_deps.sh`, `mcv/go.mod`, `mcv/pkg/accelerator/devices/amd.go`, `mcv/pkg/accelerator/devices/amd_test.go`
💤 Files with no reviewable changes (1)
- Containerfile.gkm-agent
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: Exists
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: Exists
```
Include PCI class 0380 in the NVIDIA affinity.
This selector only matches 0300 and 0302, so nodes labeled only with feature.node.kubernetes.io/pci-0380_10de.present will never get the NVIDIA DaemonSet. That silently breaks auto-scheduling for one of the GPU classes this PR is trying to support.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/agent/gkm-agent-nvidia.yaml` around lines 24 - 33, The current
Kubernetes pod affinity under affinity.nodeAffinity.nodeSelectorTerms only
matches PCI classes 0300 and 0302; add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.
```yaml
- name: LD_LIBRARY_PATH
  value: /usr/lib64
```
Don't hard-code /usr/lib64 as the only NVIDIA library path.
The NVIDIA image here is Ubuntu-based, but this mounts the host's /usr/lib64 over the container and replaces LD_LIBRARY_PATH with that single directory. That works only on RHEL-style hosts and can also hide the image's own default CUDA/NVML paths, so NVML loading becomes host-distro dependent.
Also applies to: 79-81, 104-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/agent/gkm-agent-nvidia.yaml` around lines 60 - 61, The LD_LIBRARY_PATH
environment entries currently overwrite the container's libraries by hardcoding
/usr/lib64; change each LD_LIBRARY_PATH env var to preserve the container's
CUDA/NVML library paths and only add the host library directory instead of
replacing the whole variable (i.e., prepend/append the host path to the existing
LD_LIBRARY_PATH or explicitly include the container's standard CUDA/NVML paths
such as the Ubuntu GPU library directories plus the host path); apply this fix
to all LD_LIBRARY_PATH env entries in the NVIDIA agent spec so the container
retains its distro-native CUDA/NVML paths while allowing host libs to be
visible.
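One hedged way to express this in the DaemonSet spec; the container-side CUDA/NVML directories below are assumptions for an Ubuntu-based image and should be verified against the actual image contents:

```yaml
- name: LD_LIBRARY_PATH
  # Keep the image's own library locations visible and only append the
  # host-mounted directory; the first two paths are assumed defaults.
  value: /usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib64
```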
```markdown
NFD will automatically add labels like:

- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)
```
Label format documented doesn't match actual DaemonSet configuration.
Similar to the NFD README, this documents vendor-only labels (pci-10de.present), but the actual DaemonSets use class-qualified labels like pci-0300_10de.present. Update to match the actual implementation.
Proposed fix
```diff
 NFD will automatically add labels like:

-- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
-- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)
+- `feature.node.kubernetes.io/pci-0300_10de.present=true` (NVIDIA VGA controller)
+- `feature.node.kubernetes.io/pci-0302_10de.present=true` (NVIDIA 3D controller)
+- `feature.node.kubernetes.io/pci-0300_1002.present=true` (AMD VGA controller)
+- `feature.node.kubernetes.io/pci-0302_1002.present=true` (AMD 3D controller)
+- `feature.node.kubernetes.io/pci-0380_1002.present=true` (AMD Display controller)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/agent/README.md` around lines 37 - 39, The README's label examples use
vendor-only format (`pci-10de.present`) but the DaemonSets emit class-qualified
labels; update the examples in config/agent/README.md to the actual format used
by the DaemonSet (e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true`
for NVIDIA and `feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD),
replace all vendor-only examples with the class_vendor format, and add a short
note describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.
````markdown
## Node Selectors

The DaemonSets use PCI vendor ID-based node selectors:

```yaml
# NVIDIA nodes
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"

# AMD nodes
nodeSelector:
  feature.node.kubernetes.io/pci-1002.present: "true"
```
````
Node selectors section shows incorrect format.
The DaemonSets use affinity.nodeAffinity with Exists operator, not nodeSelector with string values. This documentation is misleading.
Proposed fix
````diff
-The DaemonSets use PCI vendor ID-based node selectors:
+The DaemonSets use PCI class+vendor ID-based node affinity:

 ```yaml
-# NVIDIA nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-10de.present: "true"
-
-# AMD nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-1002.present: "true"
+# NVIDIA nodes (VGA or 3D controller)
+affinity:
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+        - matchExpressions:
+            - key: feature.node.kubernetes.io/pci-0300_10de.present
+              operator: Exists
+        - matchExpressions:
+            - key: feature.node.kubernetes.io/pci-0302_10de.present
+              operator: Exists
 ```
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @config/agent/README.md around lines 123 - 135, The "Node Selectors" docs
incorrectly show nodeSelector strings; update the section to document using
affinity.nodeAffinity with requiredDuringSchedulingIgnoredDuringExecution and
matchExpressions using the Exists operator instead of feature.node...: "true"
strings; replace the nodeSelector examples with affinity examples that include
nodeSelectorTerms -> matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.
```markdown
- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)
```
Documentation mismatch: Labels documented don't match actual usage.
The README documents vendor-only labels (pci-10de.present, pci-1002.present), but the actual agent DaemonSets in config/agent/gkm-agent-nvidia.yaml and config/agent/gkm-agent-amd.yaml use PCI class-qualified labels like pci-0300_10de.present (VGA controller) and pci-0302_10de.present (3D controller).
This will confuse users and cause the verification commands (lines 39, 92-95) to return no results.
Proposed fix
```diff
-For GKM, NFD automatically labels nodes with PCI device vendor IDs, enabling GPU-specific agent deployment:
+For GKM, NFD automatically labels nodes with PCI device class and vendor IDs, enabling GPU-specific agent deployment:

-- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
-- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)
+- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-0300_10de.present=true` or `pci-0302_10de.present=true` (vendor ID: 0x10de, class: VGA/3D)
+- **AMD GPUs**: `feature.node.kubernetes.io/pci-0300_1002.present=true`, `pci-0302_1002.present=true`, or `pci-0380_1002.present=true` (vendor ID: 0x1002)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/nfd/README.md` around lines 11 - 12, The README currently shows
vendor-only labels (`pci-10de.present`, `pci-1002.present`) but the agent
DaemonSets use class-qualified labels (e.g., `pci-0300_10de.present`,
`pci-0302_10de.present`); update the README so all examples and verification
commands (the checks around lines referenced in the comment) use the exact PCI
class + vendor labels used by the DaemonSets (replace vendor-only labels with
`pci-0300_10de.present` and `pci-0302_10de.present` for NVIDIA and
`pci-0300_1002.present` / `pci-0302_1002.present` for AMD), and adjust the
kubectl verification commands to query those class-qualified labels so they
return the expected results.
| echo "" | ||
|
|
||
| # Minimum required versions | ||
| MIN_GO_VERSION="1.25.0" |
Go 1.25.0 does not exist.
Same issue as in the Containerfile - Go version 1.25.0 is not a valid release. The script will fail when attempting to download from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hack/install_deps.sh` at line 11, The MIN_GO_VERSION variable is set to a
non-existent Go release ("1.25.0"), causing downloads from
https://go.dev/dl/go1.25.0.linux-amd64.tar.gz to fail; update MIN_GO_VERSION to
a valid Go release (e.g., "1.24.7" or whichever official version you want to
pin) in hack/install_deps.sh (the MIN_GO_VERSION variable) and make the same
matching change in the Containerfile so the download URL and any version checks
use a real release.
```shell
sudo rpm -ivh --nodeps \
  "${FEDORA_BASE}/l/libbtrfs-6.19-1.fc45.x86_64.rpm" \
  "${FEDORA_BASE}/l/libbtrfsutil-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "Libraries may already be installed"

# Now install devel package with --nodeps
sudo rpm -ivh --nodeps \
  "${FEDORA_BASE}/b/btrfs-progs-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "btrfs-progs may already be installed"

sudo rpm -ivh --nodeps \
  "${FEDORA_BASE}/b/btrfs-progs-devel-6.19-1.fc45.x86_64.rpm"
```
Hardcoded Fedora rawhide RPM URLs are fragile.
These URLs point to specific package versions in Fedora rawhide (fc45), which is a rolling development branch. Packages are frequently rebuilt and old versions removed, causing these URLs to break.
Consider using a more stable package source, or document that these URLs need periodic updates. Alternatively, use dnf with a Fedora repository if available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hack/install_deps.sh` around lines 44 - 53, The script hardcodes Fedora
rawhide RPM URLs (see "${FEDORA_BASE}/l/libbtrfs-...",
"${FEDORA_BASE}/b/btrfs-progs-..." and btrfs-progs-devel) which are fragile;
change install_deps.sh to prefer installing by package name via dnf (e.g., dnf
install -y btrfs-progs btrfs-progs-devel libbtrfs libbtrfsutil) or configure a
stable Fedora repository and fall back to downloading a versioned RPM only if
dnf is unavailable, and remove the fixed fc45 URLs or make them a configurable
variable with a clear comment so the dependency source is stable and
maintainable.
```makefile
.PHONY: deploy-nfd
deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
	@echo "Deploying Node Feature Discovery (NFD)..."
	$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
	@echo "Waiting for NFD to be ready..."
	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
	@echo "NFD deployed successfully."
```
Fail fast if NFD never becomes ready.
deploy-nfd reports success even when the readiness check times out because the failure is masked with || true. Since the vendor-specific DaemonSets depend on NFD labels, this turns a required dependency failure into a much harder-to-debug "agents never schedule" install.
Suggested fix
```diff
 deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
 	@echo "Deploying Node Feature Discovery (NFD)..."
 	$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
 	@echo "Waiting for NFD to be ready..."
-	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
+	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master
 	@echo "NFD deployed successfully."
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` around lines 341 - 347, The deploy-nfd Makefile target currently
masks wait failures with "|| true" causing false success; update the deploy-nfd
recipe (target name: deploy-nfd) to stop swallowing the kubectl wait failure —
remove "|| true" from the line that runs $(KUBECTL) wait
--for=condition=Available --timeout=120s -n node-feature-discovery
deployment/nfd-master and instead let the command return a non-zero exit code
(or add an explicit conditional that prints a clear error and exits non-zero) so
the Make target fails fast when NFD never becomes ready.
```makefile
.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT) ## Download kind-gpu-sim-script locally if necessary.
$(KIND_GPU_SIM_SCRIPT): $(LOCALBIN)
	if [ ! -f $(KIND_GPU_SIM_SCRIPT) ]; then \
		echo "Downloading $(KIND_GPU_SIM_SCRIPT)"; \
		curl -L -o $(KIND_GPU_SIM_SCRIPT) $(KIND_GPU_SIM_SCRIPT_URL); \
		chmod +x $(KIND_GPU_SIM_SCRIPT); \
	fi
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "Looking for duplicate rules and URL definition:"
rg -n 'kind-gpu-sim-script:|\$\(KIND_GPU_SIM_SCRIPT\):|(^KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=)|KIND_GPU_SIM_SCRIPT_URL' Makefile
```

Repository: redhat-et/GKM

Length of output: 387

🏁 Script executed:

```shell
# Search for any definition of KIND_GPU_SIM_SCRIPT_URL in the entire Makefile
rg -n 'KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=' Makefile
# Also check context around the first rule at line 455
sed -n '450,465p' Makefile
```

Repository: redhat-et/GKM

Length of output: 771
Remove the second $(KIND_GPU_SIM_SCRIPT) recipe at lines 719–724.
This rule overrides the earlier, correct rule at line 455 that installs from the local hack/kind-gpu-sim.sh. The second rule attempts to download via $(KIND_GPU_SIM_SCRIPT_URL), which is not defined anywhere in the Makefile. On a clean checkout, this causes the curl command to execute with an empty URL.
Keep only the .PHONY declaration and the phony target dependency:
```makefile
.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)
```
The original rule at line 455 already handles building $(KIND_GPU_SIM_SCRIPT) correctly by copying from the local script.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` around lines 717 - 724, Remove the duplicate recipe that defines
the $(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and
uses curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier
correct rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only
the .PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.
Looks like the rebase kept the old kind-gpu-sim script download. This target should be removed.
@cmagina, agreed. The duplicate $(KIND_GPU_SIM_SCRIPT): $(LOCALBIN) recipe block needs to be removed — it references the undefined $(KIND_GPU_SIM_SCRIPT_URL) and overrides the correct original rule that installs from hack/kind-gpu-sim.sh. Only the .PHONY declaration and the phony alias should remain.
- Remove the duplicate `$(KIND_GPU_SIM_SCRIPT): $(LOCALBIN)` recipe (the curl-based download block) from the new `kind-gpu-sim-script` section
- Keep only the `.PHONY` declaration and the `kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)` dependency line
Move more specific RDNA 3 and RDNA 2 cases before the generic RDNA case to prevent incorrect architecture detection. The generic pattern must be checked last to avoid matching substrings of more specific GPU names. Fixes TestGFXArchitectureTranslation test failure. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Billy99
left a comment
I'm still playing with your PR, but want to go ahead and add this.
| echo "=== Step 3: Installing btrfs development headers ===" | ||
| echo "=====================================================" | ||
| # First install the base libraries with --nodeps to skip filesystem checks | ||
| sudo rpm -ivh --nodeps \ |
Should this just run `rpm -qa | grep libbtrfs` to see if it is already installed, or just try to install the package without a version? I'm on Fedora 42 and running `make install-deps` failed. Everything is already installed (btrfs 6.16.1-1); I was just running it to see what would happen.
I read the changes to the Getting Started Guide and realized that this is only intended for RHEL 10 or CentOS Stream 10 systems, so take or leave the comment.
```yaml
- name: quay.io/gkm/agent-nvidia
  newName: quay.io/gkm/gkm-agent-nvidia
  newTag: latest
- name: quay.io/gkm/gkm-agent-amd
```
Looks like you have `quay.io/gkm/gkm-agent-amd`, `quay.io/gkm/gkm-agent-nogpu` and `quay.io/gkm/gkm-agent-nvidia` duplicated
```yaml
  newTag: latest
- name: quay.io/gkm/operator
  newName: quay.io/gkm/operator
  newName: quay.io/gkm/gkm-operator
```
Looks like you have `quay.io/gkm/gkm-operator` duplicated
I think this file just needs to be:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - operator.yaml
images:
  - name: quay.io/gkm/gkm-operator
    newName: quay.io/gkm/gkm-operator
    newTag: latest
```
```makefile
.PHONY: build-installer
build-installer: manifests generate kustomize ## Generate a consolidated YAML with CRDs and deployment.
	mkdir -p dist
	cd config/operator && $(KUSTOMIZE) edit set image controller=${OPERATOR_IMG}
```
This change along with the changes to config/operator/kustomization.yaml:
```makefile
	cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}

.PHONY: prepare-deploy
prepare-deploy:
	cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
```
```diff
-	cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
+	cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}
```
```makefile
## Location to install dependencies to
$(LOCALBIN):
	mkdir -p $@
```
This was removed in Craig's PR, but add it since you are in this file:

```makefile
KIND_GPU_SIM_SCRIPT_URL := https://raw.githubusercontent.com/maryamtahhan/kind-gpu-sim/refs/heads/main/kind-gpu-sim.sh
```
Found an interesting problem. On my desktop, I get both the nvidia and amd agents. This is because my desktop has an nvidia dGPU and an amd iGPU. This wouldn't be an issue for AMD EPYC CPUs as they don't have an iGPU, but if anyone put both nvidia and amd gpus in a server, we'd get both agents. Again, probably a corner case, but wanted to mention it. |
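One possible guard for that corner case, sketched purely as an assumption (the label keys mirror the class-qualified NFD labels used by the agent DaemonSets): add a `DoesNotExist` predicate so the AMD agent skips nodes that also carry an NVIDIA label, letting the NVIDIA agent win on mixed nodes:

```yaml
# Hypothetical addition to the AMD DaemonSet's matchExpressions
# (expressions within one term are ANDed):
- key: feature.node.kubernetes.io/pci-0300_1002.present
  operator: Exists
- key: feature.node.kubernetes.io/pci-0300_10de.present
  operator: DoesNotExist
```

Whether NVIDIA should win on such nodes is a policy choice; the sketch only shows the mechanism.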
Add Multi-Agent-GPU Support with NFD-Based Deployment
Summary
This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.
Key Changes:
Architecture
The deployment now consists of three specialized agents:
- `gkm-agent-nvidia` — NVIDIA GPU nodes; base image `nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04`; node label `feature.node.kubernetes.io/pci-10de.present=true`
- `gkm-agent-amd` — AMD ROCm GPU nodes; base image `ubuntu:24.04`; node label `feature.node.kubernetes.io/pci-1002.present=true`
- `gkm-agent-nogpu` — non-GPU nodes (e.g., control-plane); base image `ubuntu:24.04`

Node Feature Discovery Integration
NFD automatically labels nodes based on PCI vendor IDs:
- `10de` → `feature.node.kubernetes.io/pci-10de.present=true`
- `1002` → `feature.node.kubernetes.io/pci-1002.present=true`

This eliminates manual node labeling and enables automatic agent scheduling.
What's New
Features
- `build-image-agent-nvidia`, `build-image-agent-amd`, `build-image-agents-gpu`

Fixes
- `NO_GPU_BUILD` flag

Migration Path
For existing deployments using the generic agent:
Testing
Breaking Changes
- `Containerfile.gkm-agent` renamed to `Containerfile.gkm-agent-amd`
- `config/agent/gkm-agent.yaml` replaced with GPU-specific manifests

Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores