
Rocm nvidia agents #107

Open
maryamtahhan wants to merge 28 commits into redhat-et:main from maryamtahhan:rocm-nvidia-agents

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Mar 12, 2026

Add Multi-Agent-GPU Support with NFD-Based Deployment

Summary

This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.

Key Changes:

  • Split monolithic agent into GPU-vendor-specific agents (NVIDIA, AMD, and no-GPU variants)
  • Integrated Node Feature Discovery (NFD) for automatic GPU vendor detection via PCI IDs
  • Added automated dependency installation for RHEL 10
  • Enhanced Makefile with flexible image build targets for individual agents
  • Improved GPU library mounting and scheduling

Architecture

The deployment now consists of three specialized agents:

  1. gkm-agent-nvidia - NVIDIA GPU nodes

    • Base: nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04
    • Includes: CUDA runtime with NVML libraries
    • Node selector: feature.node.kubernetes.io/pci-10de.present=true
  2. gkm-agent-amd - AMD ROCm GPU nodes

    • Base: ubuntu:24.04
    • Includes: ROCm 6.3.1 (amd-smi-lib, rocm-smi-lib)
    • Node selector: feature.node.kubernetes.io/pci-1002.present=true
  3. gkm-agent-nogpu - Non-GPU nodes (e.g., control-plane)

    • Base: ubuntu:24.04
    • Minimal footprint without GPU libraries
    • Anti-affinity: Excludes control-plane nodes

Node Feature Discovery Integration

NFD automatically labels nodes based on PCI vendor IDs:

  • NVIDIA: PCI vendor 10de → feature.node.kubernetes.io/pci-10de.present=true
  • AMD: PCI vendor 1002 → feature.node.kubernetes.io/pci-1002.present=true

This eliminates manual node labeling and enables automatic agent scheduling.
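
For illustration, a minimal sketch of how a DaemonSet can key its scheduling off one of these NFD labels (hypothetical excerpt; the manifests in this PR ultimately use class-qualified keys such as pci-0302_10de.present with an Exists-style node affinity, as discussed in the review below):

```yaml
# Illustrative DaemonSet pod-spec fragment: schedule only onto nodes that
# NFD has labeled as carrying an NVIDIA PCI device.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: Exists
```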

What's New

Features

Fixes

  • Mount GPU libraries to enable device access without requiring GPU resource requests
  • Exclude control-plane nodes from nogpu agent deployment
  • Correct GPU agent scheduling using NFD PCI class code labels instead of custom labels
  • Conditionally build agents based on NO_GPU_BUILD flag

Migration Path

For existing deployments using the generic agent:

# 1. Deploy NFD
kubectl apply -k config/nfd

# 2. Build GPU-specific agents
make build-image-agents-gpu

# 3. Deploy new agents
kubectl apply -k config/agent

# 4. Remove legacy agent
kubectl delete ds agent -n gkm-system

Testing

  • Verify NFD labels nodes correctly with PCI vendor IDs
  • Confirm NVIDIA agent deploys only to NVIDIA GPU nodes
  • Confirm AMD agent deploys only to AMD GPU nodes
  • Confirm nogpu agent deploys to non-GPU nodes (excluding control-plane)
  • Verify GPU library mounting without resource requests
  • Test RHEL 10 dependency installation script
  • Test NO GPU image in kind

Breaking Changes

  • Legacy Containerfile.gkm-agent renamed to Containerfile.gkm-agent-amd
  • Generic config/agent/gkm-agent.yaml replaced with GPU-specific manifests
  • Requires NFD deployment for automatic node labeling

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for GPU-specific agent variants: NVIDIA CUDA, AMD ROCm, and no-GPU configurations
    • Integrated automatic GPU detection via Node Feature Discovery for heterogeneous clusters
    • Multi-target container builds enabling optimized images for each hardware configuration
  • Documentation

    • New guides for multi-GPU agent deployment, Node Feature Discovery setup, and example CUDA/ROCm workloads
  • Chores

    • Updated build pipeline for multi-variant agent image generation and deployment

@coderabbitai

coderabbitai Bot commented Mar 12, 2026

Warning

Rate limit exceeded

@maryamtahhan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 27 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 33 minutes and 27 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ca74a70c-d6fa-4ca5-8de4-69ad62f73539

📥 Commits

Reviewing files that changed from the base of the PR and between c53c04e and 58f3b0b.

📒 Files selected for processing (1)
  • mcv/pkg/accelerator/devices/amd.go
📝 Walkthrough

Walkthrough

Restructures agent container builds from a single image to three GPU-variant targets (no-GPU, NVIDIA, AMD) using a multi-stage Containerfile. Renames operator image to gkm-operator. Adds Node Feature Discovery (NFD) and Kyverno deployments. Introduces GPU-specific DaemonSet manifests with corresponding node affinity selectors. Expands Makefile with build targets for GPU variants and deployment management.

Changes

Cohort / File(s) / Summary
CI/Build Matrix
.github/workflows/image-build.yml, Makefile
Updated build matrix from a single agent image to three GPU variants (gkm-agent-nogpu, gkm-agent-nvidia, gkm-agent-amd), each with distinct Buildx targets. Renamed operator → gkm-operator. Updated Makefile with new agent build targets, image variable consolidation, and conditional build logic for NO_GPU_BUILD.
Dependency Installation
hack/install_deps.sh, docs/GettingStartedGuide.md
Added install_deps.sh script for automated RHEL 10 dependency installation (Go, Podman, kubectl, development packages). Updated getting-started guide with automated/manual installation workflows.
Container Build Definitions
Containerfile.gkm-agent, Containerfile.gkm-agents
Removed single-target Containerfile.gkm-agent. Introduced Containerfile.gkm-agents with shared builder stage and three runtime targets (nogpu, nvidia, amd) handling GPU libraries and configurations conditionally.
Agent Deployment Manifests
config/agent/gkm-agent-nvidia.yaml, config/agent/gkm-agent-amd.yaml, config/agent/gkm-agent-nogpu.yaml, config/agent/kustomization.yaml, config/agent/README.md
Replaced generic agent manifest with three GPU-specific DaemonSets using node affinity selectors for PCI vendor labels. Updated Kustomize image mappings. Added comprehensive deployment documentation.
Operator Deployment
config/operator/operator.yaml, config/operator/kustomization.yaml
Updated operator image reference from quay.io/gkm/operator to quay.io/gkm/gkm-operator. Modified Kustomize image overrides accordingly.
Node Feature Discovery (NFD)
config/nfd/kustomization.yaml, config/nfd/nfd-worker-conf.yaml, config/nfd/patch-nfd-gc.yaml, config/nfd/patch-nfd-workers.yaml, config/nfd/README.md
Added complete NFD deployment configuration with upstream integration at v0.17.2, worker PCI discovery settings, GPU taint tolerations for NFD components, and deployment/troubleshooting documentation.
Kyverno & Kind Configuration
config/kyverno/values-no-gpu.yaml, config/kind-gpu/agent-remove-affinity-patch.yaml, config/kind-gpu/kustomization.yaml
Extended Kyverno Helm values with hooks configuration for GPU nodes. Added affinity-removal patch for no-GPU agent on Kind clusters. Updated Kind kustomization to target all three agent variants with appropriate patches.
ConfigMap Updates
config/configMap/configMap.yaml, config/configMap/kustomization.yaml
Updated legacy gkm.agent.image reference from quay.io/gkm/agent:latest to quay.io/gkm/gkm-agent-nogpu:latest with deprecation notice.
CUDA GPU Examples
examples/namespace/RWO/CUDA/10-namespace.yaml, examples/namespace/RWO/CUDA/11-gkmcache.yaml, examples/namespace/RWO/CUDA/12-ds.yaml, examples/namespace/RWO/CUDA/13-pod.yaml, examples/namespace/RWO/CUDA/README.md
Added new namespace and GPU-specific example manifests (GKMCache, test DaemonSets and Pods) for NVIDIA CUDA workloads with comprehensive deployment and troubleshooting guide.
ROCm GPU Examples
examples/namespace/RWO/ROCM/10-namespace.yaml, examples/namespace/RWO/ROCM/11-gkmcache.yaml, examples/namespace/RWO/ROCM/12-ds.yaml, examples/namespace/RWO/ROCM/13-ds.yaml, examples/namespace/RWO/ROCM/14-ds.yaml, examples/namespace/RWO/ROCM/21-gkmcache-cosign-v3.yaml, examples/namespace/RWO/ROCM/22-ds.yaml
Updated ROCm example namespace identifiers, GKMCache names, and DaemonSet references for consistency across ROCm-specific test scenarios.
AMD Device Handling
mcv/go.mod, mcv/pkg/accelerator/devices/amd.go, mcv/pkg/accelerator/devices/amd_test.go
Added NVIDIA and AMD SMI indirect dependencies. Loosened JSON field types in amd.go from strict types to interface{} for flexibility with varying AMD SMI output formats. Added comprehensive unit tests for AMD GPU JSON parsing and architecture translation.
Tool Configuration
gkm-codespell.precommit-toml
Extended spelling exception list with additional term for pre-commit checks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name | Status | Explanation | Resolution
Title check | ❓ Inconclusive | The title 'Rocm nvidia agents' is partially related to the changeset but lacks specificity and clarity about the primary change. | Revise the title to be more specific and descriptive, such as 'Add GPU-vendor-specific agents with Node Feature Discovery integration' or 'Support heterogeneous Kubernetes clusters with NVIDIA and AMD GPU agents'.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@maryamtahhan
Collaborator Author

maryamtahhan commented Mar 12, 2026

The agent images are small:

 podman images
REPOSITORY                                TAG                      IMAGE ID      CREATED        SIZE
quay.io/gkm/agent-nogpu               latest                   b56b30880c0a  2 hours ago    345 MB
quay.io/gkm/agent-amd                 latest                   06834b1def0c  2 hours ago    811 MB
quay.io/gkm/agent-nvidia              latest                   3e5bd82e4b0d  2 hours ago    534 MB
quay.io/gkm/operator                  latest                   d63413d319ef  2 hours ago    235 MB

Comment thread Containerfile.gkm-agent-nogpu Outdated
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Comment thread config/agent/gkm-agent-nvidia.yaml
Comment thread config/configMap/kustomization.yaml Outdated
Comment thread config/agent/gkm-agent-nogpu.yaml
Comment thread config/agent/kustomization.yaml
maryamtahhan and others added 15 commits April 15, 2026 14:41
…yment

Replace generic agent with GPU-vendor-specific agents that deploy based on
hardware detection. This enables hybrid clusters with both NVIDIA and AMD
GPUs to run optimized agents with appropriate runtime libraries.

Changes:
- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  * gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  * gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  * build-image-agent-nvidia, build-image-agent-amd
  * build-image-agents-gpu (builds both)
  * push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types.
NFD automatically labels nodes with GPU vendor information, enabling
declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack including
GPU detection (NFD) and optional image verification (Kyverno) on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.

Example usage:
  make deploy \
    OPERATOR_IMG=quay.io/user/gkm:operator \
    EXTRACT_IMG=quay.io/user/gkm:extract \
    AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
    AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
    AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates
labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers),
but agents were using simple nodeSelectors looking for vendor ID only (pci-10de).

Changes:
- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix deprecated node-role.kubernetes.io/master label
- Replace wget with curl in Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
In multi-node clusters, the nogpu agent should not run on control-plane nodes.
Also updated to match PCI class code label format consistent with GPU agents.

Changes:
- Add nodeAffinity to exclude nodes with node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of vendor ID only
- Ensures nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
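
A minimal sketch of the affinity described in the commit above, assuming the label keys named there; the actual config/agent/gkm-agent-nogpu.yaml may differ in structure and in which class codes it lists:

```yaml
# Illustrative nogpu-agent affinity: avoid control-plane nodes and nodes that
# NFD has labeled with a GPU PCI class (only the 0302 class is shown here;
# a full exclusion would list every class/vendor pair the GPU agents match).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_1002.present
              operator: DoesNotExist
```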
… requests

The GPU agents were unable to access GPUs because they lacked the necessary
GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:
- /usr/lib64 -> Contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:
- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> System libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use NVML/ROCm APIs to detect and monitor ALL GPUs
on the node without requesting gpu resources (nvidia.com/gpu or amd.com/gpu),
which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
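
A minimal sketch of the mount and environment variable described above for the NVIDIA agent (container and volume names are illustrative; a review comment further down flags the hard-coded /usr/lib64 path as host-distro dependent):

```yaml
# Illustrative container fragment: expose host NVIDIA libraries (NVML) to the
# agent without requesting nvidia.com/gpu resources.
containers:
  - name: gkm-agent
    image: quay.io/gkm/gkm-agent-nvidia:latest
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
      type: Directory
```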
Add comprehensive installation script and make target to automate dependency
setup on RHEL 10 systems. The script handles installation of build dependencies
from CentOS Stream and Fedora repositories, and installs/upgrades go, podman,
and kubectl to required versions.

Changes:
- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Move inline comment out of matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:
- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from flat structure to organized hierarchy:
- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:
- Updates README paths to reflect new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images

The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan and others added 12 commits April 15, 2026 14:41
Fixes Kyverno and NFD component scheduling issues in Kind clusters with
GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:
- Use Kind-specific Kyverno values when NO_GPU=true in deploy target
- Remove duplicate Kyverno deployment from run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
NFD is unnecessary in Kind simulated GPU environments. Instead, patch agent
daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present,
nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:
- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
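
A minimal sketch of the kind of Kind-only patch described above, re-keying the NVIDIA agent onto the device plugin's label (hypothetical patch; note the next commit reverts this approach in favor of NFD PCI labels):

```yaml
# Illustrative strategic-merge patch for Kind: match on the device plugin's
# nvidia.com/gpu.present label instead of NFD's PCI labels.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gkm-agent-nvidia
  namespace: gkm-system
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: Exists
```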
Introduces SKIP_NFD flag to control NFD deployment separately from NO_GPU
mode. For Kind clusters, we skip NFD deployment but simulate it by manually
adding PCI device labels that NFD would normally create.

Changes:
- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deployment/undeploy
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
For Kind GPU simulation with NO_GPU=true, remove NFD PCI label requirements
by removing node affinity from the nogpu agent. This allows the agent to
schedule on all worker nodes without needing NFD labels.

Changes:
- Remove NFD PCI label addition from deploy-on-kind target
- Add Kind-specific patch to remove node affinity from nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across ROCM and CUDA examples:
- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*

Now matches CUDA pattern: gkm-test-{vendor}-rwo-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent
Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest
as the builder stage, eliminating code duplication while keeping
GPU-specific runtime stages intact.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced separate Containerfiles for each agent variant with a single
Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia).
This eliminates cross-file dependencies and enables parallel CI builds.

Changes:
- Created Containerfile.gkm-agents with shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use single Containerfile
- Removed obsolete individual Containerfiles
- Updated documentation references

Benefits:
- No build dependencies between separate files
- Builder stage always available in same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed gkm.agent.image from non-existent gkm-agent:latest to
gkm-agent-nogpu:latest. This value is legacy/unused (operator only
logs it), but needs to reference a real image for backwards compatibility.

Each agent daemonset uses its GPU-specific image directly, so this
configmap value is not actually used at runtime.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add supplementalGroups to ROCm agent pod spec to grant video group membership (GIDs 44 and 986), enabling proper access to /dev/kfd and /dev/dri devices required for amd-smi GPU detection.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
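
A minimal sketch of the pod-level security context change described above; the GIDs are the ones named in the commit, everything else is illustrative (a later commit also adds the render group, GID 105):

```yaml
# Illustrative pod securityContext: add video group GIDs so amd-smi in the
# container can open /dev/kfd and /dev/dri devices.
spec:
  template:
    spec:
      securityContext:
        supplementalGroups: [44, 986]
```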
AMD integrated GPUs (e.g., Strix Halo) return "N/A" strings for PCIe
and ECC fields that discrete GPUs populate with typed values. Changed
AMDBus, AMDRAS, and AMDCardInfo structs to use interface{} for fields
that can be either typed values or "N/A" strings, allowing JSON parsing
to succeed for both integrated and discrete AMD GPUs.

Also added render group (GID 105) to AMD agent pod security context to
ensure device access on systems where /dev/kfd and /dev/dri/renderD*
are owned by the render group.

Includes test with real Strix Halo GPU output to verify parsing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan marked this pull request as ready for review April 15, 2026 13:42

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/namespace/RWO/ROCM/14-ds.yaml (1)

29-47: ⚠️ Potential issue | 🟠 Major

Harden container security contexts in this example DaemonSet.

Line 32 runs the init container as root, and neither container sets allowPrivilegeEscalation: false. This keeps a permissive baseline and triggers the medium-risk checks noted in static analysis.

Suggested hardening patch
       initContainers:
         - name: fix-permissions
           image: quay.io/fedora/fedora-minimal
           securityContext:
             runAsUser: 0
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
@@
       containers:
         - name: test
           image: quay.io/fedora/fedora-minimal
           imagePullPolicy: IfNotPresent
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1000
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: ["ALL"]
           command: [sleep, 365d]

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/namespace/RWO/ROCM/14-ds.yaml` around lines 29 - 47, The init
container "fix-permissions" is running as root (securityContext.runAsUser: 0)
and neither "fix-permissions" nor the main container "test" set
allowPrivilegeEscalation, which is insecure; change the init container to run as
a non-root UID (e.g., set securityContext.runAsUser: 1000 and/or
securityContext.runAsNonRoot: true for "fix-permissions") and add
securityContext.allowPrivilegeEscalation: false to both the "fix-permissions"
initContainer and the "test" container (also consider adding
securityContext.readOnlyRootFilesystem: true and dropping capabilities if
appropriate) so both containers no longer run with elevated privileges.
🧹 Nitpick comments (2)
config/operator/kustomization.yaml (1)

6-14: Trim the legacy operator image aliases.

config/operator/operator.yaml already uses quay.io/gkm/gkm-operator:latest, so the controller and quay.io/gkm/operator mappings are just extra indirection now. Keeping only the canonical quay.io/gkm/gkm-operator entry will make future overrides easier to reason about.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/operator/kustomization.yaml` around lines 6 - 14, Remove the legacy
image name mappings in the kustomization.yaml and keep only the canonical
mapping for quay.io/gkm/gkm-operator:latest; specifically delete the entries
referencing the alias "controller" and the old "quay.io/gkm/operator" and retain
only the image block whose name is "quay.io/gkm/gkm-operator" with newName
quay.io/gkm/gkm-operator and newTag latest so overrides target a single,
unambiguous image reference.
config/kind-gpu/agent-remove-affinity-patch.yaml (1)

9-10: Keep the control-plane exclusion when dropping GPU affinity.

affinity: null removes the node-role.kubernetes.io/control-plane guard from config/agent/gkm-agent-nogpu.yaml:20-42 together with the GPU checks. That makes the “worker nodes only” behavior depend on other patches instead of this overlay. Re-add the control-plane exclusion here and only drop the GPU-specific predicates.

Suggested patch
-      affinity: null
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+            - matchExpressions:
+              - key: node-role.kubernetes.io/control-plane
+                operator: DoesNotExist
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/kind-gpu/agent-remove-affinity-patch.yaml` around lines 9 - 10, The
current patch sets affinity: null which removes the control-plane exclusion;
instead restore a nodeAffinity block that keeps the "worker nodes only" guard
while removing only the GPU-specific predicates. Replace affinity: null with an
affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms that include a matchExpression to exclude control-plane nodes
(e.g. key: node-role.kubernetes.io/control-plane, operator: DoesNotExist) and
remove any matchExpressions or nodeSelector entries that reference GPU-specific
keys (e.g. nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the
control-plane exclusion remains; mirror the structure used in
config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent
semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/agent/gkm-agent-nvidia.yaml`:
- Around line 24-33: The current Kubernetes pod affinity under
affinity.nodeAffinity.nodeSelectorTerms only matches PCI classes 0300 and 0302;
add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.
- Around line 60-61: The LD_LIBRARY_PATH environment entries currently overwrite
the container's libraries by hardcoding /usr/lib64; change each LD_LIBRARY_PATH
env var to preserve the container's CUDA/NVML library paths and only add the
host library directory instead of replacing the whole variable (i.e.,
prepend/append the host path to the existing LD_LIBRARY_PATH or explicitly
include the container's standard CUDA/NVML paths such as the Ubuntu GPU library
directories plus the host path); apply this fix to all LD_LIBRARY_PATH env
entries in the NVIDIA agent spec so the container retains its distro-native
CUDA/NVML paths while allowing host libs to be visible.

In `@config/agent/README.md`:
- Around line 123-135: The "Node Selectors" docs incorrectly show nodeSelector
strings; update the section to document using affinity.nodeAffinity with
requiredDuringSchedulingIgnoredDuringExecution and matchExpressions using the
Exists operator instead of feature.node...: "true" strings; replace the
nodeSelector examples with affinity examples that include nodeSelectorTerms ->
matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.
- Around line 37-39: The README's label examples use vendor-only format
(`pci-10de.present`) but the DaemonSets emit class-qualified labels; update the
examples in config/agent/README.md to the actual format used by the DaemonSet
(e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true` for NVIDIA and
`feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD), replace all
vendor-only examples with the class_vendor format, and add a short note
describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.

In `@config/nfd/README.md`:
- Around line 98-110: The Integration section incorrectly shows a nodeSelector
using string values; replace those examples by documenting affinity.nodeAffinity
usage: remove the nodeSelector blocks and show affinity.nodeAffinity ->
requiredDuringSchedulingIgnoredDuringExecution -> nodeSelectorTerms with
matchExpressions using operator: Exists, and use the correct feature keys (e.g.
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present for NVIDIA agents) so the
DaemonSets use Exists-based nodeAffinity rather than string-valued nodeSelector.
- Around line 11-12: The README currently shows vendor-only labels
(`pci-10de.present`, `pci-1002.present`) but the agent DaemonSets use
class-qualified labels (e.g., `pci-0300_10de.present`, `pci-0302_10de.present`);
update the README so all examples and verification commands (the checks around
lines referenced in the comment) use the exact PCI class + vendor labels used by
the DaemonSets (replace vendor-only labels with `pci-0300_10de.present` and
`pci-0302_10de.present` for NVIDIA and `pci-0300_1002.present` /
`pci-0302_1002.present` for AMD), and adjust the kubectl verification commands
to query those class-qualified labels so they return the expected results.
- Around line 38-39: Update the verification commands to use the class-qualified
NFD label keys instead of the bare vendor tokens: replace occurrences of
"feature.node.kubernetes.io/pci-10de.present" and
"feature.node.kubernetes.io/pci-1002.present" with their class-qualified forms
"feature.node.kubernetes.io/pci.vendor-10de.present" and
"feature.node.kubernetes.io/pci.vendor-1002.present" in the README so the
kubectl get nodes -L command lists the correct NFD labels.

In `@docs/GettingStartedGuide.md`:
- Around line 13-28: The documentation's Automated Installation section (under
"Automated Installation (RHEL 10 / CentOS Stream 10)") incorrectly lists kubectl
minimum as v1.11.3+; update the text in that section and the kubectl bullet so
the minimum required kubectl version is v1.24+ (to match NFD v0.17.2 and
Kubernetes v1.24+). Ensure the change is applied to the make install-deps /
kubectl install description so users installing via "make install-deps" see the
corrected kubectl minimum.

In `@examples/namespace/RWO/CUDA/11-gkmcache.yaml`:
- Around line 13-19: The podTemplate currently only adds a toleration
(podTemplate.spec.tolerations) which is permissive; update the Pod spec to
require NVIDIA nodes by adding either a nodeSelector or a required nodeAffinity:
under podTemplate.spec add nodeSelector: {"nvidia.com/gpu": "present"} or,
preferably, add
affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution with a
nodeSelectorTerm that uses matchExpressions key "nvidia.com/gpu" operator
"Exists" (or matching the label used by your NVIDIA nodes), so the CUDA
extraction job is scheduled only onto NVIDIA GPU nodes.

In `@examples/namespace/RWO/CUDA/12-ds.yaml`:
- Around line 7-17: The DaemonSet label keys are inconsistent: the DaemonSet
metadata uses gkm.io/pvcMutation (camelCase) while the pod template uses
gkm.io/pvc-mutation (kebab-case); update the DaemonSet-level label (under
labels: and metadata:) to use the same key as the pod template
(gkm.io/pvc-mutation) so selectors/mutation webhooks match, ensuring you change
every occurrence of gkm.io/pvcMutation to gkm.io/pvc-mutation in this resource
(including labels: and any metadata sections).

In `@examples/namespace/RWO/CUDA/README.md`:
- Around line 50-52: Update the README namespace to match the manifests: replace
all occurrences of "gkm-test-ns-nvidia-rwo-1" in
examples/namespace/RWO/CUDA/README.md with "gkm-test-ns-cuda-rwo-1" so kubectl
commands align with the actual manifests (see manifests 12-ds.yaml and
13-pod.yaml); apply the same replacement for the additional occurrences called
out (lines around 63-64, 70, 77, 84-89, 95-96, 105, 120-121, 127) to ensure
every kubectl command and example uses the correct namespace string.
- Around line 84-89: The README uses pod/daemonset names that don't match the
manifests; replace the incorrect names referenced (gkm-test-nvidia-pod-1 and
gkm-test-nvidia-rwo-ds-1) with the manifest names (gkm-test-cuda-pod-1 and
gkm-test-cuda-rwo-ds-1) in the kubectl commands so the examples align with the
resources created by the manifests (update the lines showing kubectl get pod,
kubectl logs, and the DaemonSet/pod label commands).

In `@hack/install_deps.sh`:
- Line 11: The MIN_GO_VERSION variable is set to a non-existent Go release
("1.25.0"), causing downloads from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz
to fail; update MIN_GO_VERSION to a valid Go release (e.g., "1.24.7" or
whichever official version you want to pin) in hack/install_deps.sh (the
MIN_GO_VERSION variable) and make the same matching change in the Containerfile
so the download URL and any version checks use a real release.
- Around line 44-53: The script hardcodes Fedora rawhide RPM URLs (see
"${FEDORA_BASE}/l/libbtrfs-...", "${FEDORA_BASE}/b/btrfs-progs-..." and
btrfs-progs-devel) which are fragile; change install_deps.sh to prefer
installing by package name via dnf (e.g., dnf install -y btrfs-progs
btrfs-progs-devel libbtrfs libbtrfsutil) or configure a stable Fedora repository
and fall back to downloading a versioned RPM only if dnf is unavailable, and
remove the fixed fc45 URLs or make them a configurable variable with a clear
comment so the dependency source is stable and maintainable.

In `@Makefile`:
- Around line 341-347: The deploy-nfd Makefile target currently masks wait
failures with "|| true" causing false success; update the deploy-nfd recipe
(target name: deploy-nfd) to stop swallowing the kubectl wait failure — remove
"|| true" from the line that runs $(KUBECTL) wait --for=condition=Available
--timeout=120s -n node-feature-discovery deployment/nfd-master and instead let
the command return a non-zero exit code (or add an explicit conditional that
prints a clear error and exits non-zero) so the Make target fails fast when NFD
never becomes ready.
- Around line 717-724: Remove the duplicate recipe that defines the
$(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and uses
curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier correct
rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only the
.PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.

In `@mcv/pkg/accelerator/devices/amd_test.go`:
- Around line 285-289: TranslateGPUToArch misclassifies "RDNA 2" because the
function checks a generic "RDNA" branch before the more specific RDNA2 pattern;
update TranslateGPUToArch to match versioned RDNA strings first (e.g., check for
"RDNA 2" / "RDNA 3" or parse the numeric RDNA version) before the generic "RDNA"
case so "RDNA 2" resolves to gfx1030; adjust the matching order or use a regex
that extracts the RDNA major version and maps 2→gfx1030 (while preserving
existing mappings for other checks like "Instinct MI210" / "gfx90a").

---

Outside diff comments:
In `@examples/namespace/RWO/ROCM/14-ds.yaml`:
- Around line 29-47: The init container "fix-permissions" is running as root
(securityContext.runAsUser: 0) and neither "fix-permissions" nor the main
container "test" set allowPrivilegeEscalation, which is insecure; change the
init container to run as a non-root UID (e.g., set securityContext.runAsUser:
1000 and/or securityContext.runAsNonRoot: true for "fix-permissions") and add
securityContext.allowPrivilegeEscalation: false to both the "fix-permissions"
initContainer and the "test" container (also consider adding
securityContext.readOnlyRootFilesystem: true and dropping capabilities if
appropriate) so both containers no longer run with elevated privileges.

---

Nitpick comments:
In `@config/kind-gpu/agent-remove-affinity-patch.yaml`:
- Around line 9-10: The current patch sets affinity: null which removes the
control-plane exclusion; instead restore a nodeAffinity block that keeps the
"worker nodes only" guard while removing only the GPU-specific predicates.
Replace affinity: null with an affinity: nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms that include a
matchExpression to exclude control-plane nodes (e.g. key:
node-role.kubernetes.io/control-plane, operator: DoesNotExist) and remove any
matchExpressions or nodeSelector entries that reference GPU-specific keys (e.g.
nvidia.com/gpu, feature.node.kubernetes.io/gpu) so only the control-plane
exclusion remains; mirror the structure used in
config/agent/gkm-agent-nogpu.yaml around lines 20-42 to ensure consistent
semantics.

In `@config/operator/kustomization.yaml`:
- Around line 6-14: Remove the legacy image name mappings in the
kustomization.yaml and keep only the canonical mapping for
quay.io/gkm/gkm-operator:latest; specifically delete the entries referencing the
alias "controller" and the old "quay.io/gkm/operator" and retain only the image
block whose name is "quay.io/gkm/gkm-operator" with newName
quay.io/gkm/gkm-operator and newTag latest so overrides target a single,
unambiguous image reference.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8138b4b8-1144-4cd2-b83e-9ea48c888e55

📥 Commits

Reviewing files that changed from the base of the PR and between acd5abc and c53c04e.

⛔ Files ignored due to path filters (1)
  • mcv/go.sum is excluded by !**/*.sum
📒 Files selected for processing (39)
  • .github/workflows/image-build.yml
  • Containerfile.gkm-agent
  • Containerfile.gkm-agents
  • Makefile
  • config/agent/README.md
  • config/agent/gkm-agent-amd.yaml
  • config/agent/gkm-agent-nogpu.yaml
  • config/agent/gkm-agent-nvidia.yaml
  • config/agent/kustomization.yaml
  • config/configMap/configMap.yaml
  • config/configMap/kustomization.yaml
  • config/kind-gpu/agent-remove-affinity-patch.yaml
  • config/kind-gpu/kustomization.yaml
  • config/kyverno/values-no-gpu.yaml
  • config/nfd/README.md
  • config/nfd/kustomization.yaml
  • config/nfd/nfd-worker-conf.yaml
  • config/nfd/patch-nfd-gc.yaml
  • config/nfd/patch-nfd-workers.yaml
  • config/operator/kustomization.yaml
  • config/operator/operator.yaml
  • docs/GettingStartedGuide.md
  • examples/namespace/RWO/CUDA/10-namespace.yaml
  • examples/namespace/RWO/CUDA/11-gkmcache.yaml
  • examples/namespace/RWO/CUDA/12-ds.yaml
  • examples/namespace/RWO/CUDA/13-pod.yaml
  • examples/namespace/RWO/CUDA/README.md
  • examples/namespace/RWO/ROCM/10-namespace.yaml
  • examples/namespace/RWO/ROCM/11-gkmcache.yaml
  • examples/namespace/RWO/ROCM/12-ds.yaml
  • examples/namespace/RWO/ROCM/13-ds.yaml
  • examples/namespace/RWO/ROCM/14-ds.yaml
  • examples/namespace/RWO/ROCM/21-gkmcache-cosign-v3.yaml
  • examples/namespace/RWO/ROCM/22-ds.yaml
  • gkm-codespell.precommit-toml
  • hack/install_deps.sh
  • mcv/go.mod
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/accelerator/devices/amd_test.go
💤 Files with no reviewable changes (1)
  • Containerfile.gkm-agent

Comment on lines +24 to +33
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: feature.node.kubernetes.io/pci-0300_10de.present
          operator: Exists
      - matchExpressions:
        - key: feature.node.kubernetes.io/pci-0302_10de.present
          operator: Exists


⚠️ Potential issue | 🟠 Major

Include PCI class 0380 in the NVIDIA affinity.

This selector only matches 0300 and 0302, so nodes labeled only with feature.node.kubernetes.io/pci-0380_10de.present will never get the NVIDIA DaemonSet. That silently breaks auto-scheduling for one of the GPU classes this PR is trying to support.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/gkm-agent-nvidia.yaml` around lines 24 - 33, The current
Kubernetes pod affinity under affinity.nodeAffinity.nodeSelectorTerms only
matches PCI classes 0300 and 0302; add a third matchExpression for key
feature.node.kubernetes.io/pci-0380_10de.present with operator: Exists (or
consolidate the matchExpressions list to include 0300, 0302 and 0380) so the
NVIDIA DaemonSet will also schedule on nodes labeled with PCI class 0380.

Comment on lines +60 to +61
- name: LD_LIBRARY_PATH
  value: /usr/lib64


⚠️ Potential issue | 🟠 Major

Don't hard-code /usr/lib64 as the only NVIDIA library path.

The NVIDIA image here is Ubuntu-based, but this mounts the host's /usr/lib64 over the container and replaces LD_LIBRARY_PATH with that single directory. That works only on RHEL-style hosts and can also hide the image's own default CUDA/NVML paths, so NVML loading becomes host-distro dependent.

Also applies to: 79-81, 104-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/gkm-agent-nvidia.yaml` around lines 60 - 61, The LD_LIBRARY_PATH
environment entries currently overwrite the container's libraries by hardcoding
/usr/lib64; change each LD_LIBRARY_PATH env var to preserve the container's
CUDA/NVML library paths and only add the host library directory instead of
replacing the whole variable (i.e., prepend/append the host path to the existing
LD_LIBRARY_PATH or explicitly include the container's standard CUDA/NVML paths
such as the Ubuntu GPU library directories plus the host path); apply this fix
to all LD_LIBRARY_PATH env entries in the NVIDIA agent spec so the container
retains its distro-native CUDA/NVML paths while allowing host libs to be
visible.

Comment thread config/agent/README.md
Comment on lines +37 to +39
NFD will automatically add labels like:
- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)


⚠️ Potential issue | 🟠 Major

Label format documented doesn't match actual DaemonSet configuration.

Similar to the NFD README, this documents vendor-only labels (pci-10de.present), but the actual DaemonSets use class-qualified labels like pci-0300_10de.present. Update to match the actual implementation.

Proposed fix
 NFD will automatically add labels like:
-- `feature.node.kubernetes.io/pci-10de.present=true` (NVIDIA, vendor ID: 0x10de)
-- `feature.node.kubernetes.io/pci-1002.present=true` (AMD, vendor ID: 0x1002)
+- `feature.node.kubernetes.io/pci-0300_10de.present=true` (NVIDIA VGA controller)
+- `feature.node.kubernetes.io/pci-0302_10de.present=true` (NVIDIA 3D controller)
+- `feature.node.kubernetes.io/pci-0300_1002.present=true` (AMD VGA controller)
+- `feature.node.kubernetes.io/pci-0302_1002.present=true` (AMD 3D controller)
+- `feature.node.kubernetes.io/pci-0380_1002.present=true` (AMD Display controller)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/agent/README.md` around lines 37 - 39, The README's label examples use
vendor-only format (`pci-10de.present`) but the DaemonSets emit class-qualified
labels; update the examples in config/agent/README.md to the actual format used
by the DaemonSet (e.g., `feature.node.kubernetes.io/pci-0300_10de.present=true`
for NVIDIA and `feature.node.kubernetes.io/pci-0302_1002.present=true` for AMD),
replace all vendor-only examples with the class_vendor format, and add a short
note describing the pattern
`feature.node.kubernetes.io/pci-<class>_<vendor>.present=true` so it matches the
implementation.

Comment thread config/agent/README.md
Comment on lines +123 to +135
## Node Selectors

The DaemonSets use PCI vendor ID-based node selectors:

```yaml
# NVIDIA nodes
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"

# AMD nodes
nodeSelector:
  feature.node.kubernetes.io/pci-1002.present: "true"
```


⚠️ Potential issue | 🟠 Major

Node selectors section shows incorrect format.

The DaemonSets use affinity.nodeAffinity with Exists operator, not nodeSelector with string values. This documentation is misleading.

Proposed fix
-The DaemonSets use PCI vendor ID-based node selectors:
+The DaemonSets use PCI class+vendor ID-based node affinity:

 ```yaml
-# NVIDIA nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-10de.present: "true"
-
-# AMD nodes
-nodeSelector:
-  feature.node.kubernetes.io/pci-1002.present: "true"
+# NVIDIA nodes (VGA or 3D controller)
+affinity:
+  nodeAffinity:
+    requiredDuringSchedulingIgnoredDuringExecution:
+      nodeSelectorTerms:
+      - matchExpressions:
+        - key: feature.node.kubernetes.io/pci-0300_10de.present
+          operator: Exists
+      - matchExpressions:
+        - key: feature.node.kubernetes.io/pci-0302_10de.present
+          operator: Exists
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @config/agent/README.md around lines 123 - 135, The "Node Selectors" docs
incorrectly show nodeSelector strings; update the section to document using
affinity.nodeAffinity with requiredDuringSchedulingIgnoredDuringExecution and
matchExpressions using the Exists operator instead of feature.node...: "true"
strings; replace the nodeSelector examples with affinity examples that include
nodeSelectorTerms -> matchExpressions entries for the NVIDIA keys (e.g.,
feature.node.kubernetes.io/pci-0300_10de.present and
feature.node.kubernetes.io/pci-0302_10de.present) and use operator: Exists (do
the same pattern for AMD keys), ensuring the README's examples reference
affinity.nodeAffinity and the requiredDuringSchedulingIgnoredDuringExecution
structure and not nodeSelector.



Comment thread config/nfd/README.md
Comment on lines +11 to +12
- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)


⚠️ Potential issue | 🟠 Major

Documentation mismatch: Labels documented don't match actual usage.

The README documents vendor-only labels (pci-10de.present, pci-1002.present), but the actual agent DaemonSets in config/agent/gkm-agent-nvidia.yaml and config/agent/gkm-agent-amd.yaml use PCI class-qualified labels like pci-0300_10de.present (VGA controller) and pci-0302_10de.present (3D controller).

This will confuse users and cause the verification commands (lines 39, 92-95) to return no results.

Proposed fix
-For GKM, NFD automatically labels nodes with PCI device vendor IDs, enabling GPU-specific agent deployment:
+For GKM, NFD automatically labels nodes with PCI device class and vendor IDs, enabling GPU-specific agent deployment:

-- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-10de.present=true` (vendor ID: 0x10de)
-- **AMD GPUs**: `feature.node.kubernetes.io/pci-1002.present=true` (vendor ID: 0x1002)
+- **NVIDIA GPUs**: `feature.node.kubernetes.io/pci-0300_10de.present=true` or `pci-0302_10de.present=true` (vendor ID: 0x10de, class: VGA/3D)
+- **AMD GPUs**: `feature.node.kubernetes.io/pci-0300_1002.present=true`, `pci-0302_1002.present=true`, or `pci-0380_1002.present=true` (vendor ID: 0x1002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/nfd/README.md` around lines 11 - 12, The README currently shows
vendor-only labels (`pci-10de.present`, `pci-1002.present`) but the agent
DaemonSets use class-qualified labels (e.g., `pci-0300_10de.present`,
`pci-0302_10de.present`); update the README so all examples and verification
commands (the checks around lines referenced in the comment) use the exact PCI
class + vendor labels used by the DaemonSets (replace vendor-only labels with
`pci-0300_10de.present` and `pci-0302_10de.present` for NVIDIA and
`pci-0300_1002.present` / `pci-0302_1002.present` for AMD), and adjust the
kubectl verification commands to query those class-qualified labels so they
return the expected results.

Comment thread hack/install_deps.sh
echo ""

# Minimum required versions
MIN_GO_VERSION="1.25.0"


⚠️ Potential issue | 🔴 Critical

Go 1.25.0 does not exist.

Same issue as in the Containerfile - Go version 1.25.0 is not a valid release. The script will fail when attempting to download from https://go.dev/dl/go1.25.0.linux-amd64.tar.gz.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/install_deps.sh` at line 11, The MIN_GO_VERSION variable is set to a
non-existent Go release ("1.25.0"), causing downloads from
https://go.dev/dl/go1.25.0.linux-amd64.tar.gz to fail; update MIN_GO_VERSION to
a valid Go release (e.g., "1.24.7" or whichever official version you want to
pin) in hack/install_deps.sh (the MIN_GO_VERSION variable) and make the same
matching change in the Containerfile so the download URL and any version checks
use a real release.
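
One way to avoid pinning a release that does not exist is to probe the download URL before fetching it. A minimal sketch, assuming the script keeps the same MIN_GO_VERSION variable and go.dev tarball URL scheme (the version shown is only an example, not a recommendation):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative value only; substitute whichever official release you intend to pin.
MIN_GO_VERSION="1.24.7"
GO_TARBALL_URL="https://go.dev/dl/go${MIN_GO_VERSION}.linux-amd64.tar.gz"

# HEAD request: a non-existent release fails fast instead of fetching an error page.
if ! curl -fsIL -o /dev/null "${GO_TARBALL_URL}"; then
  echo "ERROR: ${GO_TARBALL_URL} not found; check MIN_GO_VERSION." >&2
  exit 1
fi
echo "Go ${MIN_GO_VERSION} tarball exists; proceeding with download."
```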

Comment thread hack/install_deps.sh
Comment on lines +44 to +53
sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/l/libbtrfs-6.19-1.fc45.x86_64.rpm" \
"${FEDORA_BASE}/l/libbtrfsutil-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "Libraries may already be installed"

# Now install devel package with --nodeps
sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/b/btrfs-progs-6.19-1.fc45.x86_64.rpm" 2>/dev/null || echo "btrfs-progs may already be installed"

sudo rpm -ivh --nodeps \
"${FEDORA_BASE}/b/btrfs-progs-devel-6.19-1.fc45.x86_64.rpm"

⚠️ Potential issue | 🟠 Major

Hardcoded Fedora rawhide RPM URLs are fragile.

These URLs point to specific package versions in Fedora rawhide (fc45), which is a rolling development branch. Packages are frequently rebuilt and old versions removed, causing these URLs to break.

Consider using a more stable package source, or document that these URLs need periodic updates. Alternatively, use dnf with a Fedora repository if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/install_deps.sh` around lines 44 - 53, The script hardcodes Fedora
rawhide RPM URLs (see "${FEDORA_BASE}/l/libbtrfs-...",
"${FEDORA_BASE}/b/btrfs-progs-..." and btrfs-progs-devel) which are fragile;
change install_deps.sh to prefer installing by package name via dnf (e.g., dnf
install -y btrfs-progs btrfs-progs-devel libbtrfs libbtrfsutil) or configure a
stable Fedora repository and fall back to downloading a versioned RPM only if
dnf is unavailable, and remove the fixed fc45 URLs or make them a configurable
variable with a clear comment so the dependency source is stable and
maintainable.
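
A minimal sketch of that dnf-first approach, falling back to the pinned URLs only when the repository install fails (package names follow the suggestion above; FEDORA_BASE is the same variable the existing script defines):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prefer distribution repositories so package versions stay current.
if command -v dnf >/dev/null 2>&1 && \
   sudo dnf install -y btrfs-progs btrfs-progs-devel libbtrfs libbtrfsutil; then
  echo "btrfs packages installed via dnf."
else
  # Fallback: explicit RPM URLs, kept in one place so they are easy to update.
  FEDORA_BASE="${FEDORA_BASE:?set FEDORA_BASE to a mirror hosting these RPMs}"
  sudo rpm -ivh --nodeps \
    "${FEDORA_BASE}/l/libbtrfs-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/l/libbtrfsutil-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/b/btrfs-progs-6.19-1.fc45.x86_64.rpm" \
    "${FEDORA_BASE}/b/btrfs-progs-devel-6.19-1.fc45.x86_64.rpm"
fi
```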

Comment thread Makefile
Comment on lines +341 to +347
.PHONY: deploy-nfd
deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
@echo "Deploying Node Feature Discovery (NFD)..."
$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
@echo "Waiting for NFD to be ready..."
@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
@echo "NFD deployed successfully."

⚠️ Potential issue | 🟠 Major

Fail fast if NFD never becomes ready.

deploy-nfd reports success even when the readiness check times out because the failure is masked with || true. Since the vendor-specific DaemonSets depend on NFD labels, this turns a required dependency failure into a much harder-to-debug "agents never schedule" install.

Suggested fix
 deploy-nfd: kustomize ## Deploy Node Feature Discovery for GPU detection
 	@echo "Deploying Node Feature Discovery (NFD)..."
 	$(KUSTOMIZE) build config/nfd | $(KUBECTL) apply -f -
 	@echo "Waiting for NFD to be ready..."
-	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master || true
+	@$(KUBECTL) wait --for=condition=Available --timeout=120s -n node-feature-discovery deployment/nfd-master
 	@echo "NFD deployed successfully."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 341 - 347, The deploy-nfd Makefile target currently
masks wait failures with "|| true" causing false success; update the deploy-nfd
recipe (target name: deploy-nfd) to stop swallowing the kubectl wait failure —
remove "|| true" from the line that runs $(KUBECTL) wait
--for=condition=Available --timeout=120s -n node-feature-discovery
deployment/nfd-master and instead let the command return a non-zero exit code
(or add an explicit conditional that prints a clear error and exits non-zero) so
the Make target fails fast when NFD never becomes ready.
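
Expressed as the shell the recipe ultimately runs, the fail-fast variant might look like the following sketch (behavior only, not the exact Makefile text):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Without "|| true", a timeout propagates a non-zero exit and fails the make target
# with an explicit message instead of silently reporting success.
if ! kubectl wait --for=condition=Available --timeout=120s \
     -n node-feature-discovery deployment/nfd-master; then
  echo "ERROR: NFD never became ready; GPU agents cannot schedule without its labels." >&2
  exit 1
fi
echo "NFD deployed successfully."
```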

Comment thread Makefile
Comment on lines +717 to +724
.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT) ## Download kind-gpu-sim-script locally if necessary.
$(KIND_GPU_SIM_SCRIPT): $(LOCALBIN)
if [ ! -f $(KIND_GPU_SIM_SCRIPT) ]; then \
echo "Downloading $(KIND_GPU_SIM_SCRIPT)"; \
curl -L -o $(KIND_GPU_SIM_SCRIPT) $(KIND_GPU_SIM_SCRIPT_URL); \
chmod +x $(KIND_GPU_SIM_SCRIPT); \
fi

@coderabbitai coderabbitai Bot Apr 15, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Looking for duplicate rules and URL definition:"
rg -n 'kind-gpu-sim-script:|\$\(KIND_GPU_SIM_SCRIPT\):|(^KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=)|KIND_GPU_SIM_SCRIPT_URL' Makefile

Repository: redhat-et/GKM

Length of output: 387


🏁 Script executed:

# Search for any definition of KIND_GPU_SIM_SCRIPT_URL in the entire Makefile
rg -n 'KIND_GPU_SIM_SCRIPT_URL\s*[:?+]?=' Makefile

# Also check context around the first rule at line 455
sed -n '450,465p' Makefile

Repository: redhat-et/GKM

Length of output: 771


Remove the second $(KIND_GPU_SIM_SCRIPT) recipe at lines 719–724.

This rule overrides the earlier, correct rule at line 455 that installs from the local hack/kind-gpu-sim.sh. The second rule attempts to download via $(KIND_GPU_SIM_SCRIPT_URL), which is not defined anywhere in the Makefile. On a clean checkout, this causes the curl command to execute with an empty URL.

Keep only the .PHONY declaration and the phony target dependency:

.PHONY: kind-gpu-sim-script
kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)

The original rule at line 455 already handles building $(KIND_GPU_SIM_SCRIPT) correctly by copying from the local script.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 717 - 724, Remove the duplicate recipe that defines
the $(KIND_GPU_SIM_SCRIPT) target (the block that depends on $(LOCALBIN) and
uses curl to download $(KIND_GPU_SIM_SCRIPT_URL)); it overrides the earlier
correct rule and references an undefined $(KIND_GPU_SIM_SCRIPT_URL). Keep only
the .PHONY declaration and the phony target dependency: ".PHONY:
kind-gpu-sim-script" and "kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT)", relying
on the original $(KIND_GPU_SIM_SCRIPT) rule that builds the script locally.

Collaborator


Looks like the rebase kept the old kind-gpu-sim script download. This target should be removed.



@cmagina, agreed. The duplicate $(KIND_GPU_SIM_SCRIPT): $(LOCALBIN) recipe block needs to be removed — it references the undefined $(KIND_GPU_SIM_SCRIPT_URL) and overrides the correct original rule that installs from hack/kind-gpu-sim.sh. Only the .PHONY declaration and the phony alias should remain.

  • Remove the duplicate $(KIND_GPU_SIM_SCRIPT): $(LOCALBIN) recipe (the curl-based download block) from the new kind-gpu-sim-script section
  • Keep only the .PHONY declaration and the kind-gpu-sim-script: $(KIND_GPU_SIM_SCRIPT) dependency line


Comment thread mcv/pkg/accelerator/devices/amd_test.go
Move more specific RDNA 3 and RDNA 2 cases before the generic RDNA case
to prevent incorrect architecture detection. The generic pattern must be
checked last to avoid matching substrings of more specific GPU names.

Fixes TestGFXArchitectureTranslation test failure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Collaborator

@Billy99 Billy99 left a comment


I'm still playing with your PR, but want to go ahead and add this.

Comment thread hack/install_deps.sh
echo "=== Step 3: Installing btrfs development headers ==="
echo "====================================================="
# First install the base libraries with --nodeps to skip filesystem checks
sudo rpm -ivh --nodeps \
Collaborator


Should this just run rpm -qa | grep libbtrfs to see if it is already installed, or just try to install the package without a version? I'm on Fedora 42 and running make install-deps failed. Everything is already installed (btrfs 6.16.1-1); I was just running it to see what would happen.

Collaborator


I read the changes to the Getting Started Guide and realized that this is only intended for RHEL 10 or CentOS Stream 10 systems, so take or leave the comment.
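
A minimal sketch of the check-before-install idea raised above, querying the packages by name before attempting any pinned download (package names match the ones the script installs):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Skip the pinned RPM download entirely when the libraries are already present,
# e.g. on Fedora hosts where btrfs-progs ships in the distro repositories.
if rpm -q libbtrfs libbtrfsutil btrfs-progs btrfs-progs-devel >/dev/null 2>&1; then
  echo "btrfs packages already installed; nothing to do."
  exit 0
fi

echo "btrfs packages missing; continuing with explicit installation."
# ... existing rpm -ivh --nodeps steps would follow here ...
```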

- name: quay.io/gkm/agent-nvidia
newName: quay.io/gkm/gkm-agent-nvidia
newTag: latest
- name: quay.io/gkm/gkm-agent-amd
Collaborator


Looks like you have quay.io/gkm/gkm-agent-amd, quay.io/gkm/gkm-agent-nogpu and quay.io/gkm/gkm-agent-nvidia duplicated

newTag: latest
- name: quay.io/gkm/operator
newName: quay.io/gkm/operator
newName: quay.io/gkm/gkm-operator
Collaborator


Looks like you have quay.io/gkm/gkm-operator duplicated

Collaborator


I think this file just needs to be:

resources:
- operator.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: quay.io/gkm/gkm-operator
  newName: quay.io/gkm/gkm-operator
  newTag: latest

Comment thread Makefile
.PHONY: build-installer
build-installer: manifests generate kustomize ## Generate a consolidated YAML with CRDs and deployment.
mkdir -p dist
cd config/operator && $(KUSTOMIZE) edit set image controller=${OPERATOR_IMG}
Collaborator


This change goes along with the changes to config/operator/kustomization.yaml:

Suggested change
cd config/operator && $(KUSTOMIZE) edit set image controller=${OPERATOR_IMG}
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}

Comment thread Makefile

.PHONY: prepare-deploy
prepare-deploy:
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
Collaborator


Suggested change
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/operator=${OPERATOR_IMG}
cd config/operator && $(KUSTOMIZE) edit set image quay.io/gkm/gkm-operator=${OPERATOR_IMG}

Comment thread Makefile
## Location to install dependencies to
$(LOCALBIN):
mkdir -p $@

Collaborator


This was removed in Craig's PR, but add it since you are in this file:

KIND_GPU_SIM_SCRIPT_URL := https://raw.githubusercontent.com/maryamtahhan/kind-gpu-sim/refs/heads/main/kind-gpu-sim.sh

@cmagina
Collaborator

cmagina commented May 4, 2026

Found an interesting problem. On my desktop, I get both the nvidia and amd agents. This is because my desktop has an nvidia dGPU and an amd iGPU. This wouldn't be an issue for AMD EPYC CPUs as they don't have an iGPU, but if anyone put both nvidia and amd gpus in a server, we'd get both agents. Again, probably a corner case, but wanted to mention it.
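
For anyone who wants to spot that situation ahead of time, a small sketch that lists nodes carrying both vendors' labels (label keys assumed to be the class-qualified form used by the DaemonSets):

```bash
#!/usr/bin/env bash
# A node matching both selectors would receive both the NVIDIA and AMD agents.
kubectl get nodes \
  -l 'feature.node.kubernetes.io/pci-0300_10de.present=true,feature.node.kubernetes.io/pci-0300_1002.present=true' \
  -o name
```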
