
Rocm nvidia agents #107

Open
maryamtahhan wants to merge 28 commits into redhat-et:main from maryamtahhan:rocm-nvidia-agents

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Mar 12, 2026

Add Multi-Agent-GPU Support with NFD-Based Deployment

Summary

This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.

Key Changes:

  • Split monolithic agent into GPU-vendor-specific agents (NVIDIA, AMD, and no-GPU variants)
  • Integrated Node Feature Discovery (NFD) for automatic GPU vendor detection via PCI IDs
  • Added automated dependency installation for RHEL 10
  • Enhanced Makefile with flexible image build targets for individual agents
  • Improved GPU library mounting and scheduling

Architecture

The deployment now consists of three specialized agents:

  1. gkm-agent-nvidia - NVIDIA GPU nodes

    • Base: nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04
    • Includes: CUDA runtime with NVML libraries
    • Node selector: feature.node.kubernetes.io/pci-10de.present=true
  2. gkm-agent-amd - AMD ROCm GPU nodes

    • Base: ubuntu:24.04
    • Includes: ROCm 6.3.1 (amd-smi-lib, rocm-smi-lib)
    • Node selector: feature.node.kubernetes.io/pci-1002.present=true
  3. gkm-agent-nogpu - Non-GPU nodes (e.g., control-plane)

    • Base: ubuntu:24.04
    • Minimal footprint without GPU libraries
    • Anti-affinity: Excludes control-plane nodes

Node Feature Discovery Integration

NFD automatically labels nodes based on PCI vendor IDs:

  • NVIDIA: PCI vendor 10de → feature.node.kubernetes.io/pci-10de.present=true
  • AMD: PCI vendor 1002 → feature.node.kubernetes.io/pci-1002.present=true

This eliminates manual node labeling and enables automatic agent scheduling.
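
For illustration, a minimal sketch of how a DaemonSet can key its scheduling off one of these NFD labels (hypothetical excerpt; the manifests in this PR ultimately use class-qualified keys such as pci-0302_10de.present with an Exists-style node affinity, as discussed in the review below):

```yaml
# Illustrative DaemonSet pod-spec fragment: schedule only onto nodes that
# NFD has labeled as carrying an NVIDIA PCI device.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: Exists
```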

What's New

Features

Fixes

  • Mount GPU libraries to enable device access without requiring GPU resource requests
  • Exclude control-plane nodes from nogpu agent deployment
  • Correct GPU agent scheduling using NFD PCI class code labels instead of custom labels
  • Conditionally build agents based on NO_GPU_BUILD flag

Migration Path

For existing deployments using the generic agent:

# 1. Deploy NFD
kubectl apply -k config/nfd

# 2. Build GPU-specific agents
make build-image-agents-gpu

# 3. Deploy new agents
kubectl apply -k config/agent

# 4. Remove legacy agent
kubectl delete ds agent -n gkm-system

Testing

  • Verify NFD labels nodes correctly with PCI vendor IDs
  • Confirm NVIDIA agent deploys only to NVIDIA GPU nodes
  • Confirm AMD agent deploys only to AMD GPU nodes
  • Confirm nogpu agent deploys to non-GPU nodes (excluding control-plane)
  • Verify GPU library mounting without resource requests
  • Test RHEL 10 dependency installation script
  • Test NO GPU image in kind

Breaking Changes

  • Legacy Containerfile.gkm-agent renamed to Containerfile.gkm-agent-amd
  • Generic config/agent/gkm-agent.yaml replaced with GPU-specific manifests
  • Requires NFD deployment for automatic node labeling

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for GPU-specific agent variants: NVIDIA CUDA, AMD ROCm, and no-GPU configurations
    • Integrated automatic GPU detection via Node Feature Discovery for heterogeneous clusters
    • Multi-target container builds enabling optimized images for each hardware configuration
  • Documentation

    • New guides for multi-GPU agent deployment, Node Feature Discovery setup, and example CUDA/ROCm workloads
  • Chores

    • Updated build pipeline for multi-variant agent image generation and deployment

@coderabbitai

coderabbitai Bot commented Mar 12, 2026

Warning

Rate limit exceeded

@maryamtahhan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 27 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 33 minutes and 27 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ca74a70c-d6fa-4ca5-8de4-69ad62f73539

📥 Commits

Reviewing files that changed from the base of the PR and between c53c04e and 58f3b0b.

📒 Files selected for processing (1)
  • mcv/pkg/accelerator/devices/amd.go
📝 Walkthrough

Walkthrough

Restructures agent container builds from a single image to three GPU-variant targets (no-GPU, NVIDIA, AMD) using a multi-stage Containerfile. Renames operator image to gkm-operator. Adds Node Feature Discovery (NFD) and Kyverno deployments. Introduces GPU-specific DaemonSet manifests with corresponding node affinity selectors. Expands Makefile with build targets for GPU variants and deployment management.

Changes

Cohort / File(s) / Summary
CI/Build Matrix
.github/workflows/image-build.yml, Makefile
Updated build matrix from a single agent image to three GPU variants (gkm-agent-nogpu, gkm-agent-nvidia, gkm-agent-amd), each with distinct Buildx targets. Renamed operator → gkm-operator. Updated Makefile with new agent build targets, image variable consolidation, and conditional build logic for NO_GPU_BUILD.
Dependency Installation
hack/install_deps.sh, docs/GettingStartedGuide.md
Added install_deps.sh script for automated RHEL 10 dependency installation (Go, Podman, kubectl, development packages). Updated getting-started guide with automated/manual installation workflows.
Container Build Definitions
Containerfile.gkm-agent, Containerfile.gkm-agents
Removed single-target Containerfile.gkm-agent. Introduced Containerfile.gkm-agents with shared builder stage and three runtime targets (nogpu, nvidia, amd) handling GPU libraries and configurations conditionally.
Agent Deployment Manifests
config/agent/gkm-agent-nvidia.yaml, config/agent/gkm-agent-amd.yaml, config/agent/gkm-agent-nogpu.yaml, config/agent/kustomization.yaml, config/agent/README.md
Replaced generic agent manifest with three GPU-specific DaemonSets using node affinity selectors for PCI vendor labels. Updated Kustomize image mappings. Added comprehensive deployment documentation.
Operator Deployment
config/operator/operator.yaml, config/operator/kustomization.yaml
Updated operator image reference from quay.io/gkm/operator to quay.io/gkm/gkm-operator. Modified Kustomize image overrides accordingly.
Node Feature Discovery (NFD)
config/nfd/kustomization.yaml, config/nfd/nfd-worker-conf.yaml, config/nfd/patch-nfd-gc.yaml, config/nfd/patch-nfd-workers.yaml, config/nfd/README.md
Added complete NFD deployment configuration with upstream integration at v0.17.2, worker PCI discovery settings, GPU taint tolerations for NFD components, and deployment/troubleshooting documentation.
Kyverno & Kind Configuration
config/kyverno/values-no-gpu.yaml, config/kind-gpu/agent-remove-affinity-patch.yaml, config/kind-gpu/kustomization.yaml
Extended Kyverno Helm values with hooks configuration for GPU nodes. Added affinity-removal patch for no-GPU agent on Kind clusters. Updated Kind kustomization to target all three agent variants with appropriate patches.
ConfigMap Updates
config/configMap/configMap.yaml, config/configMap/kustomization.yaml
Updated legacy gkm.agent.image reference from quay.io/gkm/agent:latest to quay.io/gkm/gkm-agent-nogpu:latest with deprecation notice.
CUDA GPU Examples
examples/namespace/RWO/CUDA/10-namespace.yaml, examples/namespace/RWO/CUDA/11-gkmcache.yaml, examples/namespace/RWO/CUDA/12-ds.yaml, examples/namespace/RWO/CUDA/13-pod.yaml, examples/namespace/RWO/CUDA/README.md
Added new namespace and GPU-specific example manifests (GKMCache, test DaemonSets and Pods) for NVIDIA CUDA workloads with comprehensive deployment and troubleshooting guide.
ROCm GPU Examples
examples/namespace/RWO/ROCM/10-namespace.yaml, examples/namespace/RWO/ROCM/11-gkmcache.yaml, examples/namespace/RWO/ROCM/12-ds.yaml, examples/namespace/RWO/ROCM/13-ds.yaml, examples/namespace/RWO/ROCM/14-ds.yaml, examples/namespace/RWO/ROCM/21-gkmcache-cosign-v3.yaml, examples/namespace/RWO/ROCM/22-ds.yaml
Updated ROCm example namespace identifiers, GKMCache names, and DaemonSet references for consistency across ROCm-specific test scenarios.
AMD Device Handling
mcv/go.mod, mcv/pkg/accelerator/devices/amd.go, mcv/pkg/accelerator/devices/amd_test.go
Added NVIDIA and AMD SMI indirect dependencies. Loosened JSON field types in amd.go from strict types to interface{} for flexibility with varying AMD SMI output formats. Added comprehensive unit tests for AMD GPU JSON parsing and architecture translation.
Tool Configuration
gkm-codespell.precommit-toml
Extended spelling exception list with additional term for pre-commit checks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name | Status | Explanation | Resolution
Title check | ❓ Inconclusive | The title 'Rocm nvidia agents' is partially related to the changeset but lacks specificity and clarity about the primary change. | Revise the title to be more specific and descriptive, such as 'Add GPU-vendor-specific agents with Node Feature Discovery integration' or 'Support heterogeneous Kubernetes clusters with NVIDIA and AMD GPU agents'.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@maryamtahhan
Collaborator Author

maryamtahhan commented Mar 12, 2026

The agent images are small:

 podman images
REPOSITORY                                TAG                      IMAGE ID      CREATED        SIZE
quay.io/gkm/agent-nogpu               latest                   b56b30880c0a  2 hours ago    345 MB
quay.io/gkm/agent-amd                 latest                   06834b1def0c  2 hours ago    811 MB
quay.io/gkm/agent-nvidia              latest                   3e5bd82e4b0d  2 hours ago    534 MB
quay.io/gkm/operator                  latest                   d63413d319ef  2 hours ago    235 MB

Comment thread Containerfile.gkm-agent-nogpu Outdated
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Comment thread config/agent/gkm-agent-nvidia.yaml
Comment thread config/configMap/kustomization.yaml Outdated
Comment thread config/agent/gkm-agent-nogpu.yaml
Comment thread config/agent/kustomization.yaml
maryamtahhan and others added 15 commits April 15, 2026 14:41
…yment

Replace generic agent with GPU-vendor-specific agents that deploy based on
hardware detection. This enables hybrid clusters with both NVIDIA and AMD
GPUs to run optimized agents with appropriate runtime libraries.

Changes:
- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  * gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  * gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  * build-image-agent-nvidia, build-image-agent-amd
  * build-image-agents-gpu (builds both)
  * push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types.
NFD automatically labels nodes with GPU vendor information, enabling
declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack including
GPU detection (NFD) and optional image verification (Kyverno) on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.

Example usage:
  make deploy \
    OPERATOR_IMG=quay.io/user/gkm:operator \
    EXTRACT_IMG=quay.io/user/gkm:extract \
    AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
    AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
    AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates
labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers),
but agents were using simple nodeSelectors looking for vendor ID only (pci-10de).

Changes:
- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix deprecated node-role.kubernetes.io/master label
- Replace wget with curl in Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
In multi-node clusters, the nogpu agent should not run on control-plane nodes.
Also updated to match PCI class code label format consistent with GPU agents.

Changes:
- Add nodeAffinity to exclude nodes with node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of vendor ID only
- Ensures nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
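
A minimal sketch of the affinity described in the commit above, assuming the label keys named there; the actual config/agent/gkm-agent-nogpu.yaml may differ in structure and in which class codes it lists:

```yaml
# Illustrative nogpu-agent affinity: avoid control-plane nodes and nodes that
# NFD has labeled with a GPU PCI class (only the 0302 class is shown here;
# a full exclusion would list every class/vendor pair the GPU agents match).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_1002.present
              operator: DoesNotExist
```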
… requests

The GPU agents were unable to access GPUs because they lacked the necessary
GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:
- /usr/lib64 -> Contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:
- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> System libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use NVML/ROCm APIs to detect and monitor ALL GPUs
on the node without requesting gpu resources (nvidia.com/gpu or amd.com/gpu),
which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
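
A minimal sketch of the mount and environment variable described above for the NVIDIA agent (container and volume names are illustrative; a review comment further down flags the hard-coded /usr/lib64 path as host-distro dependent):

```yaml
# Illustrative container fragment: expose host NVIDIA libraries (NVML) to the
# agent without requesting nvidia.com/gpu resources.
containers:
  - name: gkm-agent
    image: quay.io/gkm/gkm-agent-nvidia:latest
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
      type: Directory
```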
Add comprehensive installation script and make target to automate dependency
setup on RHEL 10 systems. The script handles installation of build dependencies
from CentOS Stream and Fedora repositories, and installs/upgrades go, podman,
and kubectl to required versions.

Changes:
- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Move inline comment out of matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:
- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from flat structure to organized hierarchy:
- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:
- Updates README paths to reflect new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images

The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan and others added 12 commits April 15, 2026 14:41
Fixes Kyverno and NFD component scheduling issues in Kind clusters with
GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:
- Use Kind-specific Kyverno values when NO_GPU=true in deploy target
- Remove duplicate Kyverno deployment from run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
NFD is unnecessary in Kind simulated GPU environments. Instead, patch agent
daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present,
nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:
- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
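
A minimal sketch of the kind of Kind-only patch described above, re-keying the NVIDIA agent onto the device plugin's label (hypothetical patch; note the next commit reverts this approach in favor of NFD PCI labels):

```yaml
# Illustrative strategic-merge patch for Kind: match on the device plugin's
# nvidia.com/gpu.present label instead of NFD's PCI labels.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gkm-agent-nvidia
  namespace: gkm-system
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: Exists
```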
Introduces SKIP_NFD flag to control NFD deployment separately from NO_GPU
mode. For Kind clusters, we skip NFD deployment but simulate it by manually
adding PCI device labels that NFD would normally create.

Changes:
- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deployment/undeploy
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
For Kind GPU simulation with NO_GPU=true, remove NFD PCI label requirements
by removing node affinity from the nogpu agent. This allows the agent to
schedule on all worker nodes without needing NFD labels.

Changes:
- Remove NFD PCI label addition from deploy-on-kind target
- Add Kind-specific patch to remove node affinity from nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across ROCM and CUDA examples:
- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*

Now matches CUDA pattern: gkm-test-{vendor}-rwo-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent
Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest
as the builder stage, eliminating code duplication while keeping
GPU-specific runtime stages intact.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced separate Containerfiles for each agent variant with a single
Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia).
This eliminates cross-file dependencies and enables parallel CI builds.

Changes:
- Created Containerfile.gkm-agents with shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use single Containerfile
- Removed obsolete individual Containerfiles
- Updated documentation references

Benefits:
- No build dependencies between separate files
- Builder stage always available in same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed gkm.agent.image from non-existent gkm-agent:latest to
gkm-agent-nogpu:latest. This value is legacy/unused (operator only
logs it), but needs to reference a real image for backwards compatibility.

Each agent daemonset uses its GPU-specific image directly, so this
configmap value is not actually used at runtime.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add supplementalGroups to ROCm agent pod spec to grant video group membership (GIDs 44 and 986), enabling proper access to /dev/kfd and /dev/dri devices required for amd-smi GPU detection.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
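
A minimal sketch of the pod-level security context change described above; the GIDs are the ones named in the commit, everything else is illustrative (a later commit also adds the render group, GID 105):

```yaml
# Illustrative pod securityContext: add video group GIDs so amd-smi in the
# container can open /dev/kfd and /dev/dri devices.
spec:
  template:
    spec:
      securityContext:
        supplementalGroups: [44, 986]
```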
AMD integrated GPUs (e.g., Strix Halo) return "N/A" strings for PCIe
and ECC fields that discrete GPUs populate with typed values. Changed
AMDBus, AMDRAS, and AMDCardInfo structs to use interface{} for fields
that can be either typed values or "N/A" strings, allowing JSON parsing
to succeed for both integrated and discrete AMD GPUs.

Also added render group (GID 105) to AMD agent pod security context to
ensure device access on systems where /dev/kfd and /dev/dri/renderD*
are owned by the render group.

Includes test with real Strix Halo GPU output to verify parsing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan marked this pull request as ready for review April 15, 2026 13:42

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/namespace/RWO/ROCM/14-ds.yaml (1)

29-47: ⚠️ Potential issue | 🟠 Major

Harden container security contexts in this example DaemonSet.

Line 32 runs the init container as root, and neither container sets allowPrivilegeEscalation: false. This keeps a permissive baseline and triggers the medium-risk checks noted in static analysis.

Suggested hardening patch
       initContainers:
         - name: fix-permissions
           image: quay.io/fedora/fedora-minimal
           securityContext:
             runAsUser: 0
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
@@
       containers:
         - name: test
           image: quay.io/fedora/fedora-minimal
           imagePullPolicy: IfNotPresent
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1000
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
           command: [sleep, 365d]

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/namespace/RWO/ROCM/14-ds.yaml` around lines 29 - 47, The init
container "fix-permissions" is running as root (securityContext.runAsUser: 0)
and neither "fix-permissions" nor the main container "test" set
allowPrivilegeEscalation, which is insecure; change the init container to run as
a non-root UID (e.g., set securityContext.runAsUser: 1000 and/or
securityContext.runAsNonRoot: true for "fix-permissions") and add
securityContext.allowPrivilegeEscalation: false to both the "fix-permissions"
initContainer and the "test" container (also consider adding
securityContext.readOnlyRootFilesystem: true and dropping capabilities if
appropriate) so both containers no longer run with elevated privileges.
🧹 Nitpick comments (2)
config/operator/kustomization.yaml (1)

6-14: Trim the legacy operator image aliases.

config/operator/operator.yaml already uses quay.io/gkm/gkm-operator:latest, so the controller and quay.io/gkm/operator mappings are just extra indirection now. Keeping only the canonical quay.io/gkm/gkm-operator entry will make future overrides easier to reason about.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/operator/kustomization.yaml` around lines 6 - 14, Remove the legacy
image name mappings in the kustomization.yaml and keep only the canonical
mapping for quay.io/gkm/gkm-operator:latest; specifically delete the entries
referencing the alias "controller" and the old "quay.io/gkm/operator" and retain
only the image block whose name is "quay.io/gkm/gkm-operator" with newName
quay.io/gkm/gkm-operator and newTag latest so overrides target a single,
unambiguous image reference.
config/kind-gpu/agent-remove-affinity-patch.yaml (1)

9-10: Keep the control-plane exclusion when dropping GPU affinity.

affinity: null removes the node-role.kubernetes.io/control-plane guard from config/agent/gkm-agent-nogpu.yaml:20-42 together with the GPU checks. That makes the “worker nodes only” behavior depend on other patches instead of this overlay. Re-add the control-plane exclusion here and only drop the GPU-specific predicates.

Suggested patch
-      affinity: null
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+            - matchExpressions:
+              - key: node-role.kubernetes.io/control-plane
+                operator: DoesNotExist
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/kind-gpu/agent-remove-affinity-patch.yaml` around lines 9 - 10, The
current patch sets affinity: null which removes the control-plane exclusion;
instead restore a nodeAffinity block that keeps the "worker nodes only" guard
while removing only the GPU-specific predicates. Replace affinity: null with an
affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms that include a matchExpression to exclude control-plane nodes
(e.g. key: node-role.kubernetes.io/control-plane, operator: DoesNotExist) and
remove any matchExpressions or nodeSelector entries that reference GPU-specific
keys (e.g. nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the
control-plane exclusion remains; mirror the structure used in
config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent
semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/agent/gkm-agent-nvidia.yaml`:
- Around line 24-33: The current Kubernetes pod affinity under
affinity.nodeAffinity.nodeSelectorTerms only matches PCI classes 0300 and 0302;
add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.
- Around line 60-61: The LD_LIBRARY_PATH environment entries currently overwrite
the container's libraries by hardcoding /usr/lib64; change each LD_LIBRARY_PATH
env var to preserve the container's CUDA/NVML library paths and only add the
host library directory instead of replacing the whole variable (i.e.,
prepend/append the host path to the existing LD_LIBRARY_PATH or explicitly
include the container's standard CUDA/NVML paths such as the Ubuntu GPU library
directories plus the host path); apply this fix to all LD_LIBRARY_PATH env
entries in the NVIDIA agent spec so the container retains its distro-native
CUDA/NVML paths while allowing host libs to be visible.

In `@config/agent/README.md`:
- Around line 123-135: The "Node Selectors" docs incorrectly show nodeSelector
strings; update the section to document using affinity.nodeAffinity with
requiredDuringSchedulingIgnoredDuringExecution and matchExpressions using the
Exists operator instead of feature.node...: "true" strings; replace the
nodeSelector examples with affinity examples that include nodeSelectorTerms ->
matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.
- Around line 37-39: The README's label examples use vendor-only format
(`pci-10de.present`) but the DaemonSets emit class-qualified labels; update the
examples in config/agent/README.md to the actual format used by the DaemonSet
(e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true` for NVIDIA and
`feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD), replace all
vendor-only examples with the class_vendor format, and add a short note
describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.

In `@config/nfd/README.md`:
- Around line 98-110: The Integration section incorrectly shows a nodeSelector
using string values; replace those examples by documenting affinity.nodeAffinity
usage: remove the nodeSelector blocks and show affinity.nodeAffinity ->
requiredDuringSchedulingIgnoredDuringExecution -> nodeSelectorTerms with
matchExpressions using operator: Exists, and use the correct feature keys (e.g.
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present for NVIDIA agents) so the
DaemonSets use Exists-based nodeAffinity rather than string-valued nodeSelector.
- Around line 11-12: The README currently shows vendor-only labels
(`pci-10de.present`, `pci-1002.present`) but the agent DaemonSets use
class-qualified labels (e.g., `pci-0300_10de.present`, `pci-0302_10de.present`);
update the README so all examples and verification commands (the checks around
lines referenced in the comment) use the exact PCI class + vendor labels used by
the DaemonSets (replace vendor-only labels with `pci-0300_10de.present` and
`pci-0302_10de.present` for NVIDIA and `pci-0300_1002.present` /
`pci-0302_1002.present` for AMD), and adjust the kubectl verification commands
to query those class-qualified labels so they return the expected results.
- Around line 38-39: Update the verification commands to use the class-qualified
NFD label keys instead of the bare vendor tokens: replace occurrences of
"feature.node.kubernetes.io/pci-10de.present" and
"feature.node.kubernetes.io/pci-1002.present" with their class-qualified forms
"feature.node.kubernetes.io/pci.vendor-10de.present" and
"feature.node.kubernetes.io/pci.vendor-1002.present" in the README so the
kubectl get nodes -L command lists the correct NFD labels.

In `@docs/GettingStartedGuide.md`:
- Around line 13-28: The documentation's Automated Installation section (under
"Automated Installation (RHEL 10 / CentOS Stream 10)") incorrectly lists kubectl
minimum as v1.11.3+; update the text in that section and the kubectl bullet so
the minimum required kubectl version is v1.24+ (to match NFD v0.17.2 and
Kubernetes v1.24+). Ensure the change is applied to the make install-deps /
kubectl install description so users installing via "make install-deps" see the
corrected kubectl minimum.

In `@examples/namespace/RWO/CUDA/11-gkmcache.yaml`:
- Around line 13-19: The podTemplate currently only adds a toleration
(podTemplate.spec.tolerations) which is permissive; update the Pod spec to
require NVIDIA nodes by adding either a nodeSelector or a required nodeAffinity:
under podTemplate.spec add nodeSelector: {"nvidia.com/gpu": "present"} or,
preferably, add
affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution with a
nodeSelectorTerm that uses matchExpressions key "nvidia.com/gpu" operator
"Exists" (or matching the label used by your NVIDIA nodes), so the CUDA
extraction job is scheduled only onto NVIDIA GPU nodes.

In `@examples/namespace/RWO/CUDA/12-ds.yaml`:
- Around line 7-17: The DaemonSet label keys are inconsistent: the DaemonSet
metadata uses gkm.io/pvcMutation (camelCase) while the pod template uses
gkm.io/pvc-mutation (kebab-case); update the DaemonSet-level label (under
labels: and metadata:) to use the same key as the pod template
(gkm.io/pvc-mutation) so selectors/mutation webhooks match, ensuring you change
every occurrence of gkm.io/pvcMutation to gkm.io/pvc-mutation in this resource
(including labels: and any metadata sections).

In `@examples/namespace/RWO/CUDA/README.md`:
- Around line 50-52: Update the README namespace to match the manifests: replace
all occurrences of "gkm-test-ns-nvidia-rwo-1" in
examples/namespace/RWO/CUDA/README.md with "gkm-test-ns-cuda-rwo-1" so kubectl
commands align with the actual manifests (see manifests 12-ds.yaml and
13-pod.yaml); apply the same replacement for the additional occurrences called
out (lines around 63-64, 70, 77, 84-89, 95-96, 105, 120-121, 127) to ensure
every kubectl command and example uses the correct namespace string.
- Around line 84-89: The README uses pod/daemonset names that don't match the
manifests; replace the incorrect names referenced (gkm-test-nvidia-pod-1 and
gkm-test-nvidia-rwo-ds-1) with the manifest names (gkm-test-cuda-pod-1 and
gkm-test-cuda-rwo-ds-1) in the kubectl commands so the examples align with the
resources created by the manifests (update the lines showing kubectl get pod,
kubectl logs, and the DaemonSet/pod label commands).

In `@hack/install_deps.sh`:
- Line 11: The MIN_GO_VERSION variable is set to a non-existent Go release
("1.25.0"), causing downloads from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz
to fail; update MIN_GO_VERSION to a valid Go release (e.g., "1.24.7" or
whichever official version you want to pin) in hack/install_deps.sh (the
MIN_GO_VERSION variable) and make the same matching change in the Containerfile
so the download URL and any version checks use a real release.
- Around line 44-53: The script hardcodes Fedora rawhide RPM URLs (see
"${FEDORA_BASE}/l/libbtrfs-...", "${FEDORA_BASE}/b/btrfs-progs-..." and
btrfs-progs-devel) which are fragile; change install_deps.sh to prefer
installing by package name via dnf (e.g., dnf install -y btrfs-progs
btrfs-progs-devel libbtrfs libbtrfsutil) or configure a stable Fedora repository
and fall back to downloading a versioned RPM only if dnf is unavailable, and
remove the fixed fc45 URLs or make them a configurable variable with a clear
comment so the dependency source is stable and maintainable.

In `@Makefile`:
- Around line 341-347: The deploy-nfd Makefile target currently masks wait
failures with "|| true" causing false success; update the deploy-nfd recipe
(target name: deploy-nfd) to stop swallowing the kubectl wait failure — remove
"|| true" from the line that runs $(KUBECTL) wait --for=condition=Available
--timeout=120s -n node-feature-discovery deployment/nfd-master and instead let
the command return a non-zero exit code (or add an explicit conditional that
prints a clear error and exits non-zero) so the Make target fails fast when NFD
never becomes ready.
- Around line 717-724: Remove the duplicate recipe that defines the
$(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and uses
curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier correct
rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only the
.PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.

In `@mcv/pkg/accelerator/devices/amd_test.go`:
- Around line 285-289: TranslateGPUToArch misclassifies "RDNA 2" because the
function checks a generic "RDNA" branch before the more specific RDNA2 pattern;
update TranslateGPUToArch to match versioned RDNA strings first (e.g., check for
"RDNA 2" / "RDNA 3" or parse the numeric RDNA version) before the generic "RDNA"
case so "RDNA 2" resolves to gfx1030; adjust the matching order or use a regex
that extracts the RDNA major version and maps 2→gfx1030 (while preserving
existing mappings for other checks like "Instinct MI210" / "gfx90a").

---

Outside diff comments:
In `@examples/namespace/RWO/ROCM/14-ds.yaml`:
- Around line 29-47: The init container "fix-permissions" is running as root
(securityContext.runAsUser: 0) and neither "fix-permissions" nor the main
container "test" set allowPrivilegeEscalation, which is insecure; change the
init container to run as a non-root UID (e.g., set securityContext.runAsUser:
1000 and/or securityContext.runAsNonRoot: true for "fix-permissions") and add
securityContext.allowPrivilegeEscalation: false to both the "fix-permissions"
initContainer and the "test" container (also consider adding
securityContext.readOnlyRootFilesystem: true and dropping capabilities if
appropriate) so both containers no longer run with elevated privileges.

---

Nitpick comments:
In `@config/kind-gpu/agent-remove-affinity-patch.yaml`:
- Around line 9-10: The current patch sets affinity: null which removes the
control-plane exclusion; instead restore a nodeAffinity block that keeps the
"worker nodes only" guard while removing only the GPU-specific predicates.
Replace affinity: null with an affinity: nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms that include a
matchExpression to exclude control-plane nodes (e.g. key:
node-role.kubernetes.io/control-plane, operator: DoesNotExist) and remove any
matchExpressions or nodeSelector entries that reference GPU-specific keys (e.g.
nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the control-plane
exclusion remains; mirror the structure used in
config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent
semantics.

In `@config/operator/kustomization.yaml`:
- Around line 6-14: Remove the legacy image name mappings in the
kustomization.yaml and keep only the canonical mapping for
quay.io/gkm/gkm-operator:latest; specifically delete the entries referencing the
alias "controller" and the old "quay.io/gkm/operator" and retain only the image
block whose name is "quay.io/gkm/gkm-operator" with newName
quay.io/gkm/gkm-operator and newTag latest so overrides target a single,
unambiguous image reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8138b4b8-1144-4cd2-b83e-9ea48c888e55

📥 Commits

Reviewing files that changed from the base of the PR and between acd5abc and c53c04e.

⛔ Files ignored due to path filters (1)
  • mcv/go.sum is excluded by !**/*.sum
📒 Files selected for processing (39)
  • .github/workflows/image-build.yml
  • Containerfile.gkm-agent
  • Containerfile.gkm-agents
  • Makefile
  • config/agent/README.md
  • config/agent/gkm-agent-amd.yaml
  • config/agent/gkm-agent-nogpu.yaml
  • config/agent/gkm-agent-nvidia.yaml
  • config/agent/kustomization.yaml
  • config/configMap/configMap.yaml
  • config/configMap/kustomization.yaml
  • config/kind-gpu/agent-remove-affinity-patch.yaml
  • config/kind-gpu/kustomization.yaml
  • config/kyverno/values-no-gpu.yaml
  • config/nfd/README.md
  • config/nfd/kustomization.yaml
  • config/nfd/nfd-worker-conf.yaml
  • config/nfd/patch-nfd-gc.yaml
  • config/nfd/patch-nfd-workers.yaml
  • config/operator/kustomization.yaml
  • config/operator/operator.yaml
  • docs/GettingStartedGuide.md
  • examples/namespace/RWO/CUDA/10-namespace.yaml
  • examples/namespace/RWO/CUDA/11-gkmcache.yaml
  • examples/namespace/RWO/CUDA/12-ds.yaml
  • examples/namespace/RWO/CUDA/13-pod.yaml
  • examples/namespace/RWO/CUDA/README.md
  • examples/namespace/RWO/ROCM/10-namespace.yaml
  • examples/namespace/RWO/ROCM/11-gkmcache.yaml
  • examples/namespace/RWO/ROCM/12-ds.yaml
  • examples/namespace/RWO/ROCM/13-ds.yaml
  • examples/namespace/RWO/ROCM/14-ds.yaml
  • examples/namespace/RWO/ROCM/21-gkmcache-cosign-v3.yaml
  • examples/namespace/RWO/ROCM/22-ds.yaml
  • gkm-codespell.precommit-toml
  • hack/install_deps.sh
  • mcv/go.mod
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/accelerator/devices/amd_test.go
💤 Files with no reviewable changes (1)
  • Containerfile.gkm-agent

Comment on lines +24 to +33
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: feature.node.kubernetes.io/pci-0300_10de.present
          operator: Exists
      - matchExpressions:
        - key: feature.node.kubernetes.io/pci-0302_10de.present
          operator: Exists


⚠️ Potential issue | 🟠 Major

Include PCI class 0380 in the NVIDIA affinity.

This selector only matches 0300 and 0302, so nodes labeled only with feature.node.kubernetes.io/pci-0380_10de.present will never get the NVIDIA DaemonSet. That silently breaks auto-scheduling for one of the GPU classes this PR is trying to support.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/gkm-agent-nvidia.yaml` around lines 24 - 33, The current
Kubernetes pod affinity under affinity.nodeAffinity.nodeSelectorTerms only
matches PCI classes 0300 and 0302; add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.

Comment on lines +60 to +61
- name: LD_LIBRARY_PATH
  value: /usr/lib64


⚠️ Potential issue | 🟠 Major

Don't hard-code /usr/lib64 as the only NVIDIA library path.

The NVIDIA image here is Ubuntu-based, but this mounts the host's /usr/lib64 over the container and replaces LD_LIBRARY_PATH with that single directory. That works only on RHEL-style hosts and can also hide the image's own default CUDA/NVML paths, so NVML loading becomes host-distro dependent.

Also applies to: 79-81, 104-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/gkm-agent-nvidia.yaml` around lines 60 - 61, The LD_LIBRARY_PATH
environment entries currently overwrite the container's libraries by hardcoding
/usr/lib64; change each LD_LIBRARY_PATH env var to preserve the container's
CUDA/NVML library paths and only add the host library directory instead of
replacing the whole variable (i.e., prepend/append the host path to the existing
LD_LIBRARY_PATH or explicitly include the container's standard CUDA/NVML paths
such as the Ubuntu GPU library directories plus the host path); apply this fix
to all LD_LIBRARY_PATH env entries in the NVIDIA agent spec so the container
retains its distro-native CUDA/NVML paths while allowing host libs to be
visible.

Comment thread config/agent/README.md
Comment on lines +37 to +39
NFD will automatically add labels like:
- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)


⚠️ Potential issue | 🟠 Major

Label format documented doesn't match actual DaemonSet configuration.

Similar to the NFD README, this documents vendor-only labels (pci-10de.present), but the actual DaemonSets use class-qualified labels like pci-0300_10de.present. Update to match the actual implementation.

Proposed fix
 NFD will automatically add labels like:
-- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
-- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)
+- `feature.node.kubernetes.io/pci-0300_10de.present=true` (NVIDIA VGA controller)
+- `feature.node.kubernetes.io/pci-0302_10de.present=true` (NVIDIA 3D controller)
+- `feature.node.kubernetes.io/pci-0300_1002.present=true` (AMD VGA controller)
+- `feature.node.kubernetes.io/pci-0302_1002.present=true` (AMD 3D controller)
+- `feature.node.kubernetes.io/pci-0380_1002.present=true` (AMD Display controller)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/README.md` around lines 37 - 39, The README's label examples use
vendor-only format (`pci-10de.present`) but the DaemonSets emit class-qualified
labels; update the examples in config/agent/README.md to the actual format used
by the DaemonSet (e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true`
for NVIDIA and `feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD),
replace all vendor-only examples with the class_vendor format, and add a short
note describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.

Comment thread config/agent/README.md
Comment on lines +123 to +135
## Node Selectors

The DaemonSets use PCI vendor ID-based node selectors:

```yaml
# NVIDIA nodes
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"

# AMD nodes
nodeSelector:
  feature.node.kubernetes.io/pci-1002.present: "true"
```


⚠️ Potential issue | 🟠 Major

Node selectors section shows incorrect format.

The DaemonSets use affinity.nodeAffinity with Exists operator, not nodeSelector with string values. This documentation is misleading.

Proposed fix
-The DaemonSets use PCI vendor ID-based node selectors:
+The DaemonSets use PCI class+vendor ID-based node affinity:

 ```yaml
-# NVIDIA nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-10de.present: "true"
-
-# AMD nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-1002.present: "true"
+# NVIDIA nodes (VGA or 3D controller)
+affinity:
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: feature.node.kubernetes.io/pci-0300_10de.present
+          operator: Exists
+      - matchExpressions:
+        - key: feature.node.kubernetes.io/pci-0302_10de.present
+          operator: Exists
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @config/agent/README.md around lines 123 - 135, The "Node Selectors" docs
incorrectly show nodeSelector strings; update the section to document using
affinity.nodeAffinity with requiredDuringSchedulingIgnoredDuringExecution and
matchExpressions using the Exists operator instead of feature.node...: "true"
strings; replace the nodeSelector examples with affinity examples that include
nodeSelectorTerms -> matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.



Comment thread config/nfd/README.md
Comment on lines +11 to +12
- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)


⚠️ Potential issue | 🟠 Major

Documentation mismatch: Labels documented don't match actual usage.

The README documents vendor-only labels (pci-10de.present, pci-1002.present), but the actual agent DaemonSets in config/agent/gkm-agent-nvidia.yaml and config/agent/gkm-agent-amd.yaml use PCI class-qualified labels like pci-0300_10de.present (VGA controller) and pci-0302_10de.present (3D controller).

This will confuse users and cause the verification commands (lines 39, 92-95) to return no results.

Proposed fix
-For GKM, NFD automatically labels nodes with PCI device vendor IDs, enabling GPU-specific agent deployment:
+For GKM, NFD automatically labels nodes with PCI device class and vendor IDs, enabling GPU-specific agent deployment:

-- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
-- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)
+- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-0300_10de.present=true` or `pci-0302_10de.present=true` (vendor ID: 0x10de, class: VGA/3D)
+- **AMD GPUs**: `feature.node.kubernetes.io/pci-0300_1002.present=true`, `pci-0302_1002.present=true`, or `pci-0380_1002.present=true` (vendor ID: 0x1002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/nfd/README.md` around lines 11 - 12, The README currently shows
vendor-only labels (`pci-10de.present`, `pci-1002.present`) but the agent
DaemonSets use class-qualified labels (e.g., `pci-0300_10de.present`,
`pci-0302_10de.present`); update the README so all examples and verification
commands (the checks around lines referenced in the comment) use the exact PCI
class + vendor labels used by the DaemonSets (replace vendor-only labels with
`pci-0300_10de.present` and `pci-0302_10de.present` for NVIDIA and
`pci-0300_1002.present` / `pci-0302_1002.present` for AMD), and adjust the
kubectl verification commands to query those class-qualified labels so they
return the expected results.

Comment thread hack/install_deps.sh
echo ""

# Minimum required versions
MIN_GO_VERSION="1.25.0"


⚠️ Potential issue | 🔴 Critical

Go 1.25.0 does not exist.

Same issue as in the Containerfile - Go version 1.25.0 is not a valid release. The script will fail when attempting to download from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/install_deps.sh` at line 11, The MIN_GO_VERSION variable is set to a
non-existent Go release ("1.25.0"), causing downloads from
https://go.dev/dl/go1.25.0.linux-amd64.tar.gz to fail; update MIN_GO_VERSION to
a valid Go release (e.g., "1.24.7" or whichever official version you want to
pin) in hack/install_deps.sh (the MIN_GO_VERSION variable) and make the same
matching change in the Containerfile so the download URL and any version checks
use a real release.
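
One way to avoid pinning a release that does not exist is to probe the download URL before fetching it. A minimal sketch, assuming the script keeps the same MIN_GO_VERSION variable and go.dev tarball URL scheme (the version shown is only an example, not a recommendation):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative value only; substitute whichever official release you intend to pin.
MIN_GO_VERSION="1.24.7"
GO_TARBALL_URL="https://go.dev/dl/go${MIN_GO_VERSION}.linux-amd64.tar.gz"

# HEAD request: a non-existent release fails fast instead of fetching an error page.
if ! curl -fsIL -o /dev/null "${GO_TARBALL_URL}"; then
  echo "ERROR: ${GO_TARBALL_URL} not found; check MIN_GO_VERSION." >&2
  exit 1
fi
echo "Go ${MIN_GO_VERSION} tarball exists; proceeding with download."
```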

Comment thread hack/install_deps.sh
Comment on lines +44 to +53
sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/l/libbtrfs-6.19-1.fc45.x86_64.rpm" \
"${FEDORA_BASE}/l/libbtrfsutil-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "Libraries may already be installed"

# Now install devel package with --nodeps
sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/b/btrfs-progs-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "btrfs-progs may already be installed"

sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/b/btrfs-progs-devel-6.19-1.fc45.x86_64.rpm"

⚠️ Potential issue | 🟠 Major

Hardcoded Fedora rawhide RPM URLs are fragile.

These URLs point to specific package versions in Fedora rawhide (fc45), which is a rolling development branch. Packages are frequently rebuilt and old versions removed, causing these URLs to break.

Consider using a more stable package source, or document that these URLs need periodic updates. Alternatively, use dnf with a Fedora repository if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/install_deps.sh` around lines 44 - 53, The script hardcodes Fedora
rawhide RPM URLs (see "${FEDORA_BASE}/l/libbtrfs-...",
"${FEDORA_BASE}/b/btrfs-progs-..." and btrfs-progs-devel) which are fragile;
change install_deps.sh to prefer installing by package name via dnf (e.g., dnf
install -y btrfs-progs btrfs-progs-devel libbtrfs libbtrfsutil) or configure a
stable Fedora repository and fall back to downloading a versioned RPM only if
dnf is unavailable, and remove the fixed fc45 URLs or make them a configurable
variable with a clear comment so the dependency source is stable and
maintainable.
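
A minimal sketch of that dnf-first approach, falling back to the pinned URLs only when the repository install fails (package names follow the suggestion above; FEDORA_BASE is the same variable the existing script defines):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prefer distribution repositories so package versions stay current.
if command -v dnf >/dev/null 2>&1 && \
   sudo dnf install -y btrfs-progs btrfs-progs-devel libbtrfs libbtrfsutil; then
  echo "btrfs packages installed via dnf."
else
  # Fallback: explicit RPM URLs, kept in one place so they are easy to update.
  FEDORA_BASE="${FEDORA_BASE:?set FEDORA_BASE to a mirror hosting these RPMs}"
  sudo rpm -ivh --nodeps \
    "${FEDORA_BASE}/l/libbtrfs-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/l/libbtrfsutil-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/b/btrfs-progs-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/b/btrfs-progs-devel-6.19-1.fc45.x86_64.rpm"
fi
```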

Comment thread Makefile
Comment on lines +341 to +347
.PHONY: deploy-nfd
deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
@echo "Deploying Node Feature Discovery (NFD)..."
$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
@echo "Waiting for NFD to be ready..."
@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
@echo "NFD deployed successfully."

⚠️ Potential issue | 🟠 Major

Fail fast if NFD never becomes ready.

deploy-nfd reports success even when the readiness check times out because the failure is masked with || true. Since the vendor-specific DaemonSets depend on NFD labels, this turns a required dependency failure into a much harder-to-debug "agents never schedule" install.

Suggested fix
 deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
 	@echo "Deploying Node Feature Discovery (NFD)..."
 	$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
 	@echo "Waiting for NFD to be ready..."
-	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
+	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master
 	@echo "NFD deployed successfully."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 341 - 347, The deploy-nfd Makefile target currently
masks wait failures with "|| true" causing false success; update the deploy-nfd
recipe (target name: deploy-nfd) to stop swallowing the kubectl wait failure —
remove "|| true" from the line that runs $(KUBECTL) wait
--for=condition=Available --timeout=120s -n node-feature-discovery
deployment/nfd-master and instead let the command return a non-zero exit code
(or add an explicit conditional that prints a clear error and exits non-zero) so
the Make target fails fast when NFD never becomes ready.
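
Expressed as the shell the recipe ultimately runs, the fail-fast variant might look like the following sketch (behavior only, not the exact Makefile text):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Without "|| true", a timeout propagates a non-zero exit and fails the make target
# with an explicit message instead of silently reporting success.
if ! kubectl wait --for=condition=Available --timeout=120s \
     -n node-feature-discovery deployment/nfd-master; then
  echo "ERROR: NFD never became ready; GPU agents cannot schedule without its labels." >&2
  exit 1
fi
echo "NFD deployed successfully."
```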

Comment thread Makefile
Comment on lines +717 to +724
.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT) ## Download kind-gpu-sim-script locally if necessary.
$(KIND_GPU_SIM_SCRIPT): $(LOCALBIN)
if [ ! -f $(KIND_GPU_SIM_SCRIPT) ]; then \
echo "Downloading $(KIND_GPU_SIM_SCRIPT)"; \
curl -L -o $(KIND_GPU_SIM_SCRIPT) $(KIND_GPU_SIM_SCRIPT_URL); \
chmod +x $(KIND_GPU_SIM_SCRIPT); \
fi

@coderabbitai coderabbitai Bot Apr 15, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for duplicate rules and URL definition:"
rg -n 'kind-gpu-sim-script:|\$\(KIND_GPU_SIM_SCRIPT\):|(^KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=)|KIND_GPU_SIM_SCRIPT_URL' Makefile

Repository: redhat-et/GKM

Length of output: 387


🏁 Script executed:

# Search for any definition of KIND_GPU_SIM_SCRIPT_URL in the entire Makefile
rg -n 'KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=' Makefile

# Also check context around the first rule at line 455
sed -n '450,465p' Makefile

Repository: redhat-et/GKM

Length of output: 771


Remove the second $(KIND_GPU_SIM_SCRIPT) recipe at lines 719–724.

This rule overrides the earlier, correct rule at line 455 that installs from the local hack/kind-gpu-sim.sh. The second rule attempts to download via $(KIND_GPU_SIM_SCRIPT_URL), which is not defined anywhere in the Makefile. On a clean checkout, this causes the curl command to execute with an empty URL.

Keep only the .PHONY declaration and the phony target dependency:

.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)

The original rule at line 455 already handles building $(KIND_GPU_SIM_SCRIPT) correctly by copying from the local script.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 717 - 724, Remove the duplicate recipe that defines
the $(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and
uses curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier
correct rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only
the .PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.

Collaborator


Looks like the rebase kept the old kind-gpu-sim script download. This target should be removed.



@cmagina, agreed. The duplicate $(KIND_GPU_SIM_SCRIPT): $(LOCALBIN) recipe block needs to be removed — it references the undefined $(KIND_GPU_SIM_SCRIPT_URL) and overrides the correct original rule that installs from hack/kind-gpu-sim.sh. Only the .PHONY declaration and the phony alias should remain.

  • Remove the duplicate $(KIND_GPU_SIM_SCRIPT): $(LOCALBIN) recipe (the curl-based download block) from the new kind-gpu-sim-script section
  • Keep only the .PHONY declaration and the kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT) dependency line


Comment thread mcv/pkg/accelerator/devices/amd_test.go
Move more specific RDNA 3 and RDNA 2 cases before the generic RDNA case
to prevent incorrect architecture detection. The generic pattern must be
checked last to avoid matching substrings of more specific GPU names.

Fixes TestGFXArchitectureTranslation test failure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Collaborator

@Billy99 Billy99 left a comment


I'm still playing with your PR, but want to go ahead and add this.

Comment thread hack/install_deps.sh
echo "=== Step 3: Installing btrfs development headers ==="
echo "====================================================="
# First install the base libraries with --nodeps to skip filesystem checks
sudo rpm -ivh --nodeps \
Collaborator


Should this just run rpm -qa | grep libbtrfs to see if it is already installed, or just try to install the package without a version? I'm on Fedora 42 and running make install-deps failed. Everything is already installed (btrfs 6.16.1-1); I was just running it to see what would happen.

Collaborator


I read the changes to the Getting Started Guide and realized that this is only intended for RHEL 10 or CentOS Stream 10 systems, so take or leave the comment.
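
A minimal sketch of the check-before-install idea raised above, querying the packages by name before attempting any pinned download (package names match the ones the script installs):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Skip the pinned RPM download entirely when the libraries are already present,
# e.g. on Fedora hosts where btrfs-progs ships in the distro repositories.
if rpm -q libbtrfs libbtrfsutil btrfs-progs btrfs-progs-devel >/dev/null 2>&1; then
  echo "btrfs packages already installed; nothing to do."
  exit 0
fi

echo "btrfs packages missing; continuing with explicit installation."
# ... existing rpm -ivh --nodeps steps would follow here ...
```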

- name: quay.io/gkm/agent-nvidia
newName: quay.io/gkm/gkm-agent-nvidia
newTag: latest
- name: quay.io/gkm/gkm-agent-amd
Collaborator


Looks like you have quay.io/gkm/gkm-agent-amd, quay.io/gkm/gkm-agent-nogpu and quay.io/gkm/gkm-agent-nvidia duplicated

newTag: latest
- name: quay.io/gkm/operator
newName: quay.io/gkm/operator
newName: quay.io/gkm/gkm-operator
Collaborator


Looks like you have quay.io/gkm/gkm-operator duplicated

Collaborator


I think this file just needs to be:

resources:
- operator.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: quay.io/gkm/gkm-operator
  newName: quay.io/gkm/gkm-operator
  newTag: latest

Comment thread Makefile
.PHONY: build-installer
build-installer: manifests generate kustomize ## Generate a consolidated YAML with CRDs and deployment.
mkdir -p dist
cd config/operator && $(KUSTOMIZE) edit set image controller=${OPERATOR_IMG}
Collaborator


This change goes along with the changes to config/operator/kustomization.yaml:

Suggested change
cd config/operator && $(KUSTOMIZE) edit set image controller=${OPERATOR_IMG}
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}

Comment thread Makefile

.PHONY: prepare-deploy
prepare-deploy:
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
Collaborator


Suggested change
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}

Comment thread Makefile
## Location to install dependencies to
$(LOCALBIN):
mkdir -p $@

Collaborator


This was removed in Craig's PR, but add it since you are in this file:

KIND_GPU_SIM_SCRIPT_URL := https://raw.githubusercontent.com/maryamtahhan/kind-gpu-sim/refs/heads/main/kind-gpu-sim.sh

@cmagina
Collaborator

cmagina commented May 4, 2026

Found an interesting problem. On my desktop, I get both the nvidia and amd agents. This is because my desktop has an nvidia dGPU and an amd iGPU. This wouldn't be an issue for AMD EPYC CPUs as they don't have an iGPU, but if anyone put both nvidia and amd gpus in a server, we'd get both agents. Again, probably a corner case, but wanted to mention it.
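
For anyone who wants to spot that situation ahead of time, a small sketch that lists nodes carrying both vendors' labels (label keys assumed to be the class-qualified form used by the DaemonSets):

```bash
#!/usr/bin/env bash
# A node matching both selectors would receive both the NVIDIA and AMD agents.
kubectl get nodes \
  -l 'feature.node.kubernetes.io/pci-0300_10de.present=true,feature.node.kubernetes.io/pci-0300_1002.present=true' \
  -o name
```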
