
fix minor issues found during KServe demo #127

Open
Billy99 wants to merge 1 commit into redhat-et:main from Billy99:billy99-storage-optional

Conversation

@Billy99
Collaborator

@Billy99 Billy99 commented May 1, 2026

Filippe found a few issues while trying to run GKM and KServe in the same cluster.

  • Requested that agents print the node they are running on in the logs.
  • The initContainer added to the Job that extracts the cache is only intended for KIND clusters, but it is triggered by the noGpu flag. Add a KIND Cluster flag that distinguishes NoGPU clusters from KIND clusters.
  • The StorageClassName in GKMCache and ClusterGKMCache was required when it should have been optional. Make it optional.
  • To help with the demo, the PVC associated with ReadWriteOnce is easier to manage without the hash appended to its name (the hash was added so each node gets a unique PVC name). Add a "HACK" so that if the AccessMode is ReadWriteOncePod, the Operator treats the PVC as if ReadOnlyMany had been entered and creates only one PVC for the cluster, but with an AccessMode of ReadWriteOnce. This will not work where pods across multiple nodes want to share a PVC, but it will "probably" work on single-node systems.
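The access-mode workaround in the last bullet can be sketched roughly as follows. This is an illustrative sketch only: `pvcPlan` and the string constants are hypothetical stand-ins for the real `corev1` access-mode types, not actual GKM code.

```go
package main

import "fmt"

// Hypothetical access-mode names; the real code uses corev1 constants.
const (
	ReadWriteOnce    = "ReadWriteOnce"
	ReadWriteOncePod = "ReadWriteOncePod"
	ReadOnlyMany     = "ReadOnlyMany"
)

// pvcPlan returns how many PVCs to create for the cluster and the access
// mode to stamp on them, per the workaround: ReadWriteOncePod in the CR is
// treated like ReadOnlyMany (one shared PVC, no per-node hash suffix) but
// written to the PVC as ReadWriteOnce.
func pvcPlan(requested string, nodeCount int) (count int, mode string) {
	if requested == ReadWriteOncePod {
		return 1, ReadWriteOnce // single PVC for the whole cluster
	}
	if requested == ReadOnlyMany {
		return 1, ReadOnlyMany
	}
	return nodeCount, requested // one uniquely named PVC per node
}

func main() {
	count, mode := pvcPlan(ReadWriteOncePod, 3)
	fmt.Println(count, mode)
}
```

As the description notes, this mapping only holds up when all pods land on one node; the single ReadWriteOnce PVC cannot be shared across nodes.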

Summary by CodeRabbit

  • New Features

    • Added Kind cluster detection and configuration support
    • Added configurable extraction logging level control
    • PV access modes now dynamically computed based on cluster requirements
  • Bug Fixes

    • Storage class field is now optional; cluster default is used when not specified
    • Improved PVC owner assignment for ReadWriteOncePod mode

@coderabbitai

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

Walkthrough

The pull request introduces Kind cluster support and extract job log-level configuration across the GKM stack. Changes span Makefile build targets, Kubernetes manifests, API type definitions, and controller initialization logic. The CRD makes StorageClassName optional with fallback to cluster defaults. Controllers now propagate KIND_CLUSTER and EXTRACT_GO_LOG environment variables, and logic switches from NO_GPU to KIND_CLUSTER for PV provisioning in Kind environments.

Changes

  • Build Configuration (Makefile): prepare-deploy now branches on KIND_CLUSTER to conditionally append gkm.kindcluster=true or false to ConfigMap templating. deploy-on-kind and redeploy-on-kind targets invoke base targets with KIND_CLUSTER=true alongside NO_GPU=true.
  • Kubernetes Manifests & ConfigMaps (config/agent/gkm-agent.yaml, config/operator/operator.yaml, config/configMap/configMap.yaml, config/configMap/kustomization.yaml): Added the KIND_CLUSTER environment variable to agent and operator containers, sourced from the ConfigMap. Updated the ConfigMap with gkm.kindcluster and gkm.extract.log.level settings; adjusted log levels to debug.
  • CRD Definitions (config/crd/bases/gkm.io_gkmcaches.yaml, config/crd/bases/gkm.io_clustergkmcaches.yaml): Made the storageClassName field optional by removing it from the spec.required list. Updated descriptions to document fallback to the cluster default Storage Class when omitted.
  • API Type Definitions (api/v1alpha1/shared_types.go): Removed the required validation marker and changed the JSON tag to omitempty for GKMCacheSpec.StorageClassName; updated documentation for the default behavior.
  • Agent Initialization (agent/main.go): Reads the EXTRACT_GO_LOG (defaults to "info") and KIND_CLUSTER environment variables at startup. Injects KindCluster and ExtractLogLevel into ReconcilerCommonAgent alongside existing config fields.
  • Operator Initialization (cmd/main.go): Derives extractLogLevel from EXTRACT_GO_LOG and kindCluster from KIND_CLUSTER == "true". Injects both values into ReconcilerCommonOperator for cache reconcilers.
  • Agent Controller Logic (internal/controller/gkm-agent/common.go): Added KindCluster and ExtractLogLevel fields to ReconcilerCommonAgent. Changed the PV provisioning trigger from NoGpu to KindCluster. Passes both new fields to LaunchJob.
  • Operator Controller Logic (internal/controller/gkm-operator/common.go): Added KindCluster and ExtractLogLevel fields to ReconcilerCommonOperator. Refactored PV/PVC access modes: computed dynamically based on ReadWriteOncePod presence instead of hardcoded ReadWriteMany. Passes KindCluster and ExtractLogLevel to LaunchJob.
  • Job Launch & Utilities (pkg/common/k8s.go, pkg/utils/contants.go): Updated the LaunchJob signature to accept kindCluster and jobLogLevel parameters. Sets the GO_LOG environment variable on the extraction container. Triggers the volume-permission init container based on kindCluster instead of noGpu. Added the JobExtractEnvGoLog constant.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check: ❓ Inconclusive. The title 'fix minor issues found during KServe demo' is generic and does not convey what actually changed. Resolution: consider a more specific title, such as 'Add KIND cluster flag and make storage class optional'.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/controller/gkm-agent/common.go (1)

830-933: ⚠️ Potential issue | 🟠 Major

The KIND/manual-PV path does not implement the documented default StorageClass fallback.

The CRD for storageClassName is optional and documents that "If not provided, then default Storage Class will be used." However, GetStorageClassName() returns the raw field value without any defaulting logic, and both CreatePv() and CreatePvc() assign this value directly—including empty strings—to the PV/PVC spec. In Kubernetes, explicitly setting storageClassName: "" differs from omitting the field; the former disables default selection, while the latter enables it. The result is that omitted storageClassName in the CR breaks the documented contract by causing explicit empty-string assignments to the PV and PVC, preventing cluster default storage class usage.

Implement either:

  • Logic in GetStorageClassName() or the KIND-path callers to skip setting StorageClassName if empty, or
  • A default-resolution step before calling CreatePv() and CreatePvc() that looks up and assigns the cluster default when the CR field is omitted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/gkm-agent/common.go` around lines 830 - 933, The
KIND/manual-PV path is passing an explicit empty string for storageClassName
(via (*gkmCache).GetStorageClassName()) into CreatePv and CreatePvc, which
prevents Kubernetes from using the cluster default; change the call/site or
helper so empty means "omit the field": update the callers here to capture sc :=
(*gkmCache).GetStorageClassName() and if sc == "" pass a nil/omitted value (or a
pointer set to nil) into common.CreatePv and common.CreatePvc, and/or update
common.CreatePv/common.CreatePvc to treat an empty storageClassName string by
not setting Spec.StorageClassName (leave it nil) instead of writing "" to the
PV/PVC; ensure GetStorageClassName need not change if you implement the
omit-in-caller or omit-in-Create* behavior.
cmd/main.go (1)

239-250: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire ExtractLogLevel into the cluster-scoped reconciler too.

commonCl never receives extractLogLevel, so ClusterGKMCache extraction jobs will still run with the zero-value log level even though the env/default was resolved above. That leaves half of the new log-level plumbing inactive.

Suggested fix
 	commonCl := gkmOperator.ReconcilerCommonOperator[
 		gkmv1alpha1.ClusterGKMCache,
 		gkmv1alpha1.ClusterGKMCacheList,
 		gkmv1alpha1.ClusterGKMCacheNode,
 		gkmv1alpha1.ClusterGKMCacheNodeList,
 	]{
 		Client:          mgr.GetClient(),
 		Scheme:          mgr.GetScheme(),
 		NoGpu:           noGpu,
 		KindCluster:     kindCluster,
+		ExtractLogLevel: extractLogLevel,
 		ExtractImage:    extractImage,
 		CrdCacheStr:     "ClusterGKMCache",
 		CrdCacheNodeStr: "ClusterGKMCacheNode",
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 239 - 250, The Cluster-scoped reconciler's commonCl
initializer is missing the ExtractLogLevel field, so ClusterGKMCache extraction
uses the zero-value; update the commonCl composite literal in cmd/main.go to
include ExtractLogLevel: extractLogLevel (the same variable resolved earlier) so
that the gkmOperator.ReconcilerCommonOperator for ClusterGKMCache receives the
log level; ensure the field name matches the struct (ExtractLogLevel) and
rebuild to propagate the log level into the ClusterGKMCache reconciler.
internal/controller/gkm-operator/common.go (1)

767-784: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

KIND_CLUSTER is only wired into half of the operator path.

This LaunchJob call now uses r.KindCluster, but managePVandPVC() still decides whether to manually create the hostPath PV from r.NoGpu on Line 614. Non-KIND NO_GPU=true deployments will still take the KIND-only PV path, so the new flag does not actually separate the two behaviors yet.

Suggested fix
-			if r.NoGpu {
+			if r.KindCluster {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/gkm-operator/common.go` around lines 767 - 784,
managePVandPVC is still using r.NoGpu to decide creating the hostPath PV so
NON-KIND deployments with NO_GPU=true incorrectly follow the KIND path; update
managePVandPVC to check the cluster type (r.KindCluster or equivalent) when
deciding to create hostPath PVs and/or when calling the KIND-specific helper so
that only when r.KindCluster is true will the hostPath PV be created; ensure any
logic branches or callers (e.g., where managePVandPVC is invoked and where
LaunchJob is called) are consistent and reference r.KindCluster alongside/
instead of r.NoGpu to separate KIND behavior from the NO_GPU flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/shared_types.go`:
- Around line 80-84: The spec documents that omitting StorageClassName should
allow the cluster default, but CreatePvc currently always sets StorageClassName
by taking a pointer to the (possibly empty) string; change CreatePvc (the
function that builds the PVC spec) to only set the PVC's StorageClassName
pointer when the input StorageClassName is non-empty (e.g., check if
spec.StorageClassName != "" after trimming whitespace) and leave the field nil
otherwise so Kubernetes will select the default StorageClass; update any helper
that constructs &storageClass to follow the same conditional logic.

In `@config/configMap/configMap.yaml`:
- Line 15: The constant ConfigMapIndexKindCluster is defined with the wrong key
string ("gkm.kind.cluster") causing lookups to miss the YAML key; update the
value of ConfigMapIndexKindCluster to "gkm.kindcluster" so it matches the keys
used in the YAML files (e.g., configMap entries) and ensure any reference to
ConfigMapIndexKindCluster continues to compile and read the corrected key.
- Around line 8-10: The operator Deployment is not injecting EXTRACT_GO_LOG from
the ConfigMap so gkm.extract.log.level is ignored; update the operator manifest
(config/operator/operator.yaml) to add an environment variable named
EXTRACT_GO_LOG in the operator container that sources its valueFrom the
ConfigMap key gkm.extract.log.level (use configMapKeyRef, similar to
NO_GPU/KIND_CLUSTER entries), and remove the dead code in
internal/controller/gkm-operator/configmap.go that reads the extractLogLevel
variable only to log it (delete the unused extractLogLevel lookup and its log
statement) so the value is injected via the manifest instead of being a no-op in
code.

In `@config/configMap/kustomization.yaml`:
- Around line 11-12: Remove the hard-coded gkm.kindcluster=true from the shared
base kustomization (config/configMap/kustomization.yaml) and instead add that
patch/setting to the KIND overlay kustomization
(config/kind-gpu/kustomization.yaml); locate the base configMap entry that
currently sets gkm.kindcluster and delete it so the base uses the default in
configMap.yaml (false), then add a kustomizeConfigMapGenerator/patch or overlay
entry in config/kind-gpu/kustomization.yaml to set gkm.kindcluster=true so only
the KIND/demo overlay enables kind-specific behavior.

In `@Makefile`:
- Around line 299-314: The Makefile uses ifdef for KIND_CLUSTER and NO_GPU which
treats any non-empty value as true; replace those conditionals with explicit
value checks (ifeq ($(KIND_CLUSTER),true) and ifeq ($(NO_GPU),true)) and adjust
the sed invocations so you always generate kustomization.yaml but insert the
correct literal values based on the true/false check (set gkm.kindcluster=true
when KIND_CLUSTER == true else gkm.kindcluster=false; set gkm.nogpu=true when
NO_GPU == true else gkm.nogpu=false), keeping the existing AGENT_IMG and
EXTRACT_IMG substitutions (AGENT_IMG, EXTRACT_IMG, kustomization.yaml.env,
kustomization.yaml) intact.

---

Outside diff comments:
In `@cmd/main.go`:
- Around line 239-250: The Cluster-scoped reconciler's commonCl initializer is
missing the ExtractLogLevel field, so ClusterGKMCache extraction uses the
zero-value; update the commonCl composite literal in cmd/main.go to include
ExtractLogLevel: extractLogLevel (the same variable resolved earlier) so that
the gkmOperator.ReconcilerCommonOperator for ClusterGKMCache receives the log
level; ensure the field name matches the struct (ExtractLogLevel) and rebuild to
propagate the log level into the ClusterGKMCache reconciler.

In `@internal/controller/gkm-agent/common.go`:
- Around line 830-933: The KIND/manual-PV path is passing an explicit empty
string for storageClassName (via (*gkmCache).GetStorageClassName()) into
CreatePv and CreatePvc, which prevents Kubernetes from using the cluster
default; change the call/site or helper so empty means "omit the field": update
the callers here to capture sc := (*gkmCache).GetStorageClassName() and if sc ==
"" pass a nil/omitted value (or a pointer set to nil) into common.CreatePv and
common.CreatePvc, and/or update common.CreatePv/common.CreatePvc to treat an
empty storageClassName string by not setting Spec.StorageClassName (leave it
nil) instead of writing "" to the PV/PVC; ensure GetStorageClassName need not
change if you implement the omit-in-caller or omit-in-Create* behavior.

In `@internal/controller/gkm-operator/common.go`:
- Around line 767-784: managePVandPVC is still using r.NoGpu to decide creating
the hostPath PV so NON-KIND deployments with NO_GPU=true incorrectly follow the
KIND path; update managePVandPVC to check the cluster type (r.KindCluster or
equivalent) when deciding to create hostPath PVs and/or when calling the
KIND-specific helper so that only when r.KindCluster is true will the hostPath
PV be created; ensure any logic branches or callers (e.g., where managePVandPVC
is invoked and where LaunchJob is called) are consistent and reference
r.KindCluster alongside/ instead of r.NoGpu to separate KIND behavior from the
NO_GPU flag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7213e779-877f-4c9f-bdbd-ea53d8893f64

📥 Commits

Reviewing files that changed from the base of the PR and between 749cecf and ef815fa.

📒 Files selected for processing (14)
  • Makefile
  • agent/main.go
  • api/v1alpha1/shared_types.go
  • cmd/main.go
  • config/agent/gkm-agent.yaml
  • config/configMap/configMap.yaml
  • config/configMap/kustomization.yaml
  • config/crd/bases/gkm.io_clustergkmcaches.yaml
  • config/crd/bases/gkm.io_gkmcaches.yaml
  • config/operator/operator.yaml
  • internal/controller/gkm-agent/common.go
  • internal/controller/gkm-operator/common.go
  • pkg/common/k8s.go
  • pkg/utils/contants.go

Comment on lines 80 to +84
 // storageClassName contains the name of the Kubernetes Storage Class, which
 // will be used for the PersistentVolume and PersistentVolumeClaim the GKM will
-// create in order to store the extract GPU Kernel Cache.
-// +required
-// +kubebuilder:validation:Required
-StorageClassName string `json:"storageClassName"`
+// create in order to store the extract GPU Kernel Cache. If not provided, then
+// default Storage Class will be used.
+StorageClassName string `json:"storageClassName,omitempty"`

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Kubernetes PersistentVolumeClaimSpec, what is the behavior difference between omitting storageClassName (or leaving it null) and setting storageClassName: ""? Which form uses the cluster's default StorageClass?

💡 Result:

Omitting storageClassName (or leaving it null) in a Kubernetes PersistentVolumeClaimSpec allows the PVC to use the cluster's default StorageClass if one is configured. The control plane will set storageClassName to the default for new PVCs without it, and retroactively update existing unbound PVCs lacking the field or with empty value to the default when a default becomes available. Setting storageClassName: "" explicitly requests a PV with no class (storageClassName unset or ""), preventing use of any default StorageClass. PVCs with storageClassName: "" will not be updated to the default even if one is later configured, and they only bind to classless PVs. The form that uses the cluster's default StorageClass is omitting storageClassName (or null).

🏁 Script executed:

# First, check the StorageClassName field definition and its doc comment
echo "=== File: api/v1alpha1/shared_types.go (lines 75-90) ==="
sed -n '75,90p' api/v1alpha1/shared_types.go

echo ""
echo "=== File: pkg/common/k8s.go (lines 471-539) ==="
sed -n '471,539p' pkg/common/k8s.go

Repository: redhat-et/GKM

Length of output: 2866


🏁 Script executed:

# Search for calls to CreatePvc to see how storageClass argument is passed
echo "=== Searching for CreatePvc calls ==="
rg -n "CreatePvc" --type go -A 5 -B 2 | head -100

Repository: redhat-et/GKM

Length of output: 1546


🏁 Script executed:

# Get full context for the CreatePvc calls
echo "=== Call 1: internal/controller/gkm-operator/common.go ==="
sed -n '698,730p' internal/controller/gkm-operator/common.go

echo ""
echo "=== Call 2: internal/controller/gkm-agent/common.go ==="
sed -n '920,952p' internal/controller/gkm-agent/common.go

Repository: redhat-et/GKM

Length of output: 1480


🏁 Script executed:

# Find GetStorageClassName method definition
echo "=== Searching for GetStorageClassName definition ==="
rg -n "GetStorageClassName" --type go -A 3 -B 1 | head -50

Repository: redhat-et/GKM

Length of output: 3299


🏁 Script executed:

# Get the implementation of GetStorageClassName
echo "=== GetStorageClassName implementation ==="
sed -n '93,110p' api/v1alpha1/gkmcache_types.go

Repository: redhat-et/GKM

Length of output: 561


The optional storageClassName fallback is not implemented end-to-end.

The schema now documents that "if not provided, default Storage Class will be used," but when StorageClassName is omitted from the spec, it unmarshals to an empty string "" in Go. This empty string is then passed to CreatePvc (at pkg/common/k8s.go:514), which unconditionally sets StorageClassName: &storageClass. In Kubernetes, a pointer to an empty string is semantically equivalent to explicitly setting storageClassName: "", which prevents the default StorageClass from being selected. To honor the documented behavior, CreatePvc must treat empty/zero-value StorageClassName specially by omitting the field from the PVC spec entirely (or using a nil pointer).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/shared_types.go` around lines 80 - 84, The spec documents that
omitting StorageClassName should allow the cluster default, but CreatePvc
currently always sets StorageClassName by taking a pointer to the (possibly
empty) string; change CreatePvc (the function that builds the PVC spec) to only
set the PVC's StorageClassName pointer when the input StorageClassName is
non-empty (e.g., check if spec.StorageClassName != "" after trimming whitespace)
and leave the field nil otherwise so Kubernetes will select the default
StorageClass; update any helper that constructs &storageClass to follow the same
conditional logic.

Comment on lines +8 to +10
gkm.operator.log.level: debug
gkm.agent.log.level: debug
gkm.extract.log.level: debug

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C2 'ExtractLogLevel|EXTRACT_GO_LOG|gkm\.extract\.log\.level' --glob '*.go' --glob '*.yaml' --glob '*.yml'

Repository: redhat-et/GKM

Length of output: 3919


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Checking configmap.go around line 86 to see what happens with extractLogLevel ==="
sed -n '84,120p' internal/controller/gkm-operator/configmap.go

echo ""
echo "=== Searching for where extractLogLevel is used within configmap.go ==="
rg -n 'extractLogLevel' internal/controller/gkm-operator/configmap.go -A5 -B2

echo ""
echo "=== Checking if pod templates set EXTRACT_GO_LOG env var from config ==="
rg -n 'EXTRACT_GO_LOG|extractLogLevel' internal/controller/gkm-operator/common.go | head -20

echo ""
echo "=== Checking GetPodTemplate implementation to see env var injection ==="
rg -n 'GetPodTemplate' internal/controller/gkm-operator/common.go -A 100 | grep -A 100 'func.*GetPodTemplate' | head -80

Repository: redhat-et/GKM

Length of output: 2253


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full ReconcileGKMConfigMap function ==="
rg -n 'func.*ReconcileGKMConfigMap' internal/controller/gkm-operator/configmap.go -A 200 | head -150

echo ""
echo "=== Check if extractLogLevel or any ConfigMap values are stored/passed to a struct ==="
rg -n 'extractLogLevel|agentLogLevel' internal/controller/gkm-operator/ -B2 -A2

echo ""
echo "=== Look for where ExtractLogLevel struct field is populated in operator's common.go ==="
rg -n 'ExtractLogLevel.*=' internal/controller/gkm-operator/common.go -B3 -A3

Repository: redhat-et/GKM

Length of output: 3306


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for where ExtractLogLevel is assigned/set in reconcilers ==="
rg -n 'ExtractLogLevel\s*[:=]' internal/controller/ -B5 -A3

echo ""
echo "=== Look for ReconcilerCommonAgent or ReconcilerCommonOperator initialization ==="
rg -n 'ReconcilerCommon.*{' internal/controller/gkm-operator/common.go -A 50 | head -100

echo ""
echo "=== Check how operator pod specs inject env vars ==="
rg -n 'Env.*=' internal/controller/gkm-operator/common.go | grep -i 'log\|extract'

Repository: redhat-et/GKM

Length of output: 129


🏁 Script executed:

#!/bin/bash
set -eulo pipefail

echo "=== Search broadly for ReconcilerCommon initialization in operator ==="
rg -n 'ReconcilerCommon' internal/controller/gkm-operator/common.go -B5 -A15 | head -150

echo ""
echo "=== Look for operator main setup/initialization ==="
rg -n 'func main' cmd/main.go -A 200 | head -200

echo ""
echo "=== Check if operator reads ConfigMap and populates ExtractLogLevel ==="
rg -n 'configMapKey\|ConfigMapIndexExtractLogLevel\|EXTRACT_GO_LOG' cmd/ internal/controller/gkm-operator/ -B3 -A3

Repository: redhat-et/GKM

Length of output: 14452


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Look for operator Deployment or manifest that sets EXTRACT_GO_LOG env var ==="
find . -type f \( -name "*.yaml" -o -name "*.yml" \) | xargs grep -l 'EXTRACT_GO_LOG\|operator' | head -20

echo ""
echo "=== Check operator Deployment manifest specifically ==="
fd -t f 'operator.*deployment\|deployment.*operator' config/ --iglob '*.yaml' --iglob '*.yml'

echo ""
echo "=== Look for references to EXTRACT_GO_LOG in manifests ==="
rg -n 'EXTRACT_GO_LOG' config/ --glob '*.yaml' --glob '*.yml'

echo ""
echo "=== Check if operator pod spec generation injects env vars from ConfigMap ==="
rg -n 'ValueFrom.*ConfigMap\|configMapRef\|EXTRACT' internal/controller/gkm-operator/common.go -B5 -A5 | head -80

Repository: redhat-et/GKM

Length of output: 1249


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check operator.yaml for Deployment and env vars ==="
cat ./config/operator/operator.yaml

echo ""
echo "=== Search for any EXTRACT_GO_LOG in all yaml files ==="
rg -n 'EXTRACT_GO_LOG' config/ -t yaml

Repository: redhat-et/GKM

Length of output: 4875


Add EXTRACT_GO_LOG environment variable to operator Deployment manifest.

The operator Deployment currently does not inject EXTRACT_GO_LOG from the ConfigMap, so gkm.extract.log.level in the ConfigMap has no effect. Extract jobs always use the hardcoded default log level ("info"). Inject EXTRACT_GO_LOG from gkm.extract.log.level to the operator container environment in config/operator/operator.yaml, similar to how NO_GPU, KIND_CLUSTER, and other values are currently injected. Additionally, remove the dead code in internal/controller/gkm-operator/configmap.go that reads extractLogLevel but only logs it without using the value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/configMap/configMap.yaml` around lines 8 - 10, The operator Deployment
is not injecting EXTRACT_GO_LOG from the ConfigMap so gkm.extract.log.level is
ignored; update the operator manifest (config/operator/operator.yaml) to add an
environment variable named EXTRACT_GO_LOG in the operator container that sources
its valueFrom the ConfigMap key gkm.extract.log.level (use configMapKeyRef,
similar to NO_GPU/KIND_CLUSTER entries), and remove the dead code in
internal/controller/gkm-operator/configmap.go that reads the extractLogLevel
variable only to log it (delete the unused extractLogLevel lookup and its log
statement) so the value is injected via the manifest instead of being a no-op in
code.

gkm.agent.image: quay.io/gkm/agent:latest
gkm.extract.image: quay.io/gkm/gkm-extract:latest
gkm.nogpu: false
gkm.kindcluster: false

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C2 'ConfigMapIndexKindCluster|gkm\.kindcluster|gkm\.kind\.cluster' --glob '*.go' --glob '*.yaml' --glob '*.yml'

Repository: redhat-et/GKM

Length of output: 2240


Fix the key mismatch: update the constant to match the YAML keys in use.

The Go code uses ConfigMapIndexKindCluster from pkg/utils/contants.go (defined as "gkm.kind.cluster") to read this setting, but all YAML configuration files use gkm.kindcluster. This mismatch will cause the runtime lookup to fail silently, and the setting will not be applied. Update the constant in pkg/utils/contants.go line 107 from "gkm.kind.cluster" to "gkm.kindcluster" to match the keys used across config/configMap/configMap.yaml, config/configMap/kustomization.yaml, config/operator/operator.yaml, and config/agent/gkm-agent.yaml.
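The failure mode is a plain map lookup miss, which Go reports via the second return value rather than an error, so the wrong key fails silently. The snippet below demonstrates it with the two constant values from the finding; the map literal stands in for the ConfigMap's Data field.

```go
package main

import "fmt"

const (
	// Value currently in pkg/utils/contants.go: misses the YAML key.
	ConfigMapIndexKindClusterWrong = "gkm.kind.cluster"
	// Corrected value: matches configMap.yaml and the kustomize literals.
	ConfigMapIndexKindCluster = "gkm.kindcluster"
)

func main() {
	// Stand-in for the ConfigMap Data map as shipped in the YAML files.
	data := map[string]string{"gkm.kindcluster": "true"}

	_, missed := data[ConfigMapIndexKindClusterWrong]
	v, hit := data[ConfigMapIndexKindCluster]
	fmt.Println(missed, hit, v) // false true true
}
```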

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/configMap/configMap.yaml` at line 15, The constant
ConfigMapIndexKindCluster is defined with the wrong key string
("gkm.kind.cluster") causing lookups to miss the YAML key; update the value of
ConfigMapIndexKindCluster to "gkm.kindcluster" so it matches the keys used in
the YAML files (e.g., configMap entries) and ensure any reference to
ConfigMapIndexKindCluster continues to compile and read the corrected key.

Comment on lines 11 to +12
- gkm.nogpu=true
- gkm.kindcluster=true

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n -C2 'config/configMap|gkm\.kindcluster|gkm\.nogpu' --iglob 'kustomization.yaml'

Repository: redhat-et/GKM

Length of output: 462


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all kustomization.yaml files
echo "=== All kustomization.yaml files ==="
fd 'kustomization.yaml' --type f | sort

echo -e "\n=== References to config/configMap in kustomization files ==="
rg -l 'config/configMap' --type yaml | xargs -I {} sh -c 'echo "File: {}"; grep -n "config/configMap" "{}"'

echo -e "\n=== Search for gkm.kindcluster overrides in all yaml files ==="
rg 'gkm\.kindcluster' --type yaml -n

echo -e "\n=== Directory structure of overlays (if present) ==="
if [ -d "config" ]; then
  find config -type d | sort
fi

Repository: redhat-et/GKM

Length of output: 762


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== config/default/kustomization.yaml ==="
cat config/default/kustomization.yaml

echo -e "\n=== config/kind-gpu/kustomization.yaml ==="
cat config/kind-gpu/kustomization.yaml

echo -e "\n=== config/configMap/kustomization.yaml structure ==="
cat config/configMap/kustomization.yaml

echo -e "\n=== Search for configMap references (without full path) ==="
rg 'configMap' config/default/kustomization.yaml config/kind-gpu/kustomization.yaml 2>/dev/null || true

Repository: redhat-et/GKM

Length of output: 6047


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check kind-gpu overlay for any configMap overrides ==="
cat config/kind-gpu/agent-patch.yaml 2>/dev/null || echo "No agent-patch.yaml"

echo -e "\n=== Look for all overlay directories (directories with kustomization.yaml) ==="
find config -name kustomization.yaml -type f | xargs dirname | sort

echo -e "\n=== Check configMap.yaml to see default values ==="
cat config/configMap/configMap.yaml

echo -e "\n=== Search for any patches or replacements that might override gkm.kindcluster ==="
rg 'gkm\.kindcluster|kindcluster' --type yaml -A 2 -B 2

Repository: redhat-et/GKM

Length of output: 2690


Move gkm.kindcluster=true to the KIND/demo overlay, not the shared base configMap.

The hard-coded value in config/configMap/kustomization.yaml lines 11-12 applies to all deployments via config/default, including production runs. Only config/kind-gpu/kustomization.yaml should set gkm.kindcluster=true; the base config/configMap/configMap.yaml correctly defaults to false. Without an override mechanism in production overlays, this forces KIND-specific behavior (the manual PV and init container) everywhere and defeats the new separation between KIND_CLUSTER and NO_GPU.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/configMap/kustomization.yaml` around lines 11 - 12, Remove the
hard-coded gkm.kindcluster=true from the shared base kustomization
(config/configMap/kustomization.yaml) and instead add that patch/setting to the
KIND overlay kustomization (config/kind-gpu/kustomization.yaml); locate the base
configMap entry that currently sets gkm.kindcluster and delete it so the base
uses the default in configMap.yaml (false), then add a
kustomizeConfigMapGenerator/patch or overlay entry in
config/kind-gpu/kustomization.yaml to set gkm.kindcluster=true so only the
KIND/demo overlay enables kind-specific behavior.
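
One idiomatic way to scope the flag is a merge-behavior generator in the KIND overlay. This is a sketch under assumptions: the generator name "gkm-config" and the base path are illustrative, not verified repo contents.

```yaml
# config/kind-gpu/kustomization.yaml (hypothetical excerpt)
resources:
- ../default

# behavior: merge overlays the listed literals onto the base-generated
# ConfigMap, so only the KIND overlay flips gkm.kindcluster.
configMapGenerator:
- name: gkm-config
  behavior: merge
  literals:
  - gkm.kindcluster=true
```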

Comment thread Makefile
Comment on lines +299 to 314
ifdef KIND_CLUSTER
	cd config/configMap && \
	  $(SED) \
	    -e '/literals:/a\  - gkm.nogpu=true' \
	    -e '/literals:/a\  - gkm.kindcluster=true' \
	    -e 's@gkm\.agent\.image=.*@gkm.agent.image=$(AGENT_IMG)@' \
	    -e 's@gkm\.extract\.image=.*@gkm.extract.image=$(EXTRACT_IMG)@' \
	    kustomization.yaml.env > kustomization.yaml
else ifdef NO_GPU
	cd config/configMap && \
	  $(SED) \
	    -e '/literals:/a\  - gkm.nogpu=true' \
	    -e '/literals:/a\  - gkm.kindcluster=false' \
	    -e 's@gkm\.agent\.image=.*@gkm.agent.image=$(AGENT_IMG)@' \
	    -e 's@gkm\.extract\.image=.*@gkm.extract.image=$(EXTRACT_IMG)@' \
	    kustomization.yaml.env > kustomization.yaml

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

out="$(make -n prepare-deploy KIND_CLUSTER=false AGENT_IMG=test/agent EXTRACT_IMG=test/extract 2>/dev/null || true)"
printf '%s\n' "$out" | rg -n 'gkm\.(kindcluster|nogpu)'

Repository: redhat-et/GKM

Length of output: 39


🏁 Script executed:

# Check if the file exists and its size
wc -l Makefile

# View lines 299-314 to confirm the code snippet
sed -n '299,314p' Makefile

# Search for prepare-deploy target definition
rg -n 'prepare-deploy' Makefile

Repository: redhat-et/GKM

Length of output: 963


🏁 Script executed:

# Get full context around prepare-deploy target
sed -n '290,320p' Makefile | cat -n

# Try running prepare-deploy without dry-run to see what happens
make prepare-deploy KIND_CLUSTER=false AGENT_IMG=test/agent EXTRACT_IMG=test/extract 2>&1 | head -20 || true

# Check if SED variable is defined
rg -n '^\s*SED\s*=' Makefile

Repository: redhat-et/GKM

Length of output: 1736


🏁 Script executed:

# Verify GNU Make ifdef behavior with a test Makefile
cat > /tmp/test_ifdef.mk << 'EOF'
# Test ifdef behavior
ifdef MY_VAR
$(info ifdef MY_VAR: true)
else
$(info ifdef MY_VAR: false)
endif

ifeq ($(MY_VAR),true)
$(info ifeq true: true)
else
$(info ifeq true: false)
endif
EOF

# Document what ifdef does
cat > /tmp/check_ifdef.txt << 'EOF'
GNU Make ifdef syntax:
- ifdef VAR: checks if VAR is defined (non-empty)
- endif

Therefore:
- make -f /tmp/test_ifdef.mk MY_VAR=false → ifdef branch WILL execute (treats "false" as non-empty)
- make -f /tmp/test_ifdef.mk MY_VAR=true → ifdef branch WILL execute
- make -f /tmp/test_ifdef.mk → ifdef branch will NOT execute (undefined)

This is a well-known GNU Make behavior that can cause bugs.
EOF

cat /tmp/check_ifdef.txt

Repository: redhat-et/GKM

Length of output: 453


Use value checks instead of ifdef for boolean flags.

ifdef KIND_CLUSTER and ifdef NO_GPU treat any non-empty value as true, so make deploy KIND_CLUSTER=false will still generate gkm.kindcluster=true and gkm.nogpu=true in the ConfigMap. Use ifeq ($(KIND_CLUSTER),true) instead to correctly distinguish between explicit values.
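
The pitfall is easy to reproduce with a throwaway Makefile (the /tmp path and variable names here are arbitrary, chosen just for the demo):

```shell
#!/bin/bash
set -euo pipefail

# ifdef fires on any non-empty value, so FLAG=false still takes the branch;
# ifeq compares the actual value.
cat > /tmp/ifdef_vs_ifeq.mk <<'EOF'
ifdef FLAG
DEFINED := yes
else
DEFINED := no
endif
ifeq ($(FLAG),true)
IS_TRUE := yes
else
IS_TRUE := no
endif
demo:
	@echo "ifdef=$(DEFINED) ifeq=$(IS_TRUE)"
EOF

make -s -f /tmp/ifdef_vs_ifeq.mk demo FLAG=false   # → ifdef=yes ifeq=no
make -s -f /tmp/ifdef_vs_ifeq.mk demo FLAG=true    # → ifdef=yes ifeq=yes
make -s -f /tmp/ifdef_vs_ifeq.mk demo              # → ifdef=no ifeq=no
```

The first invocation is exactly the reported bug: FLAG=false is a defined (non-empty) variable, so the ifdef branch still runs.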

Suggested fix
-ifdef KIND_CLUSTER
+ifeq ($(KIND_CLUSTER),true)
 	cd config/configMap && \
 	  $(SED) \
 	    -e '/literals:/a\  - gkm.nogpu=true' \
 	    -e '/literals:/a\  - gkm.kindcluster=true' \
 	    -e 's@gkm\.agent\.image=.*@gkm.agent.image=$(AGENT_IMG)@' \
 	    -e 's@gkm\.extract\.image=.*@gkm.extract.image=$(EXTRACT_IMG)@' \
 	    kustomization.yaml.env > kustomization.yaml
-else ifdef NO_GPU
+else ifeq ($(NO_GPU),true)
 	cd config/configMap && \
 	  $(SED) \
 	    -e '/literals:/a\  - gkm.nogpu=true' \
 	    -e '/literals:/a\  - gkm.kindcluster=false' \
 	    -e 's@gkm\.agent\.image=.*@gkm.agent.image=$(AGENT_IMG)@' \
 	    -e 's@gkm\.extract\.image=.*@gkm.extract.image=$(EXTRACT_IMG)@' \
 	    kustomization.yaml.env > kustomization.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 299 - 314, The Makefile uses ifdef for KIND_CLUSTER
and NO_GPU which treats any non-empty value as true; replace those conditionals
with explicit value checks (ifeq ($(KIND_CLUSTER),true) and ifeq
($(NO_GPU),true)) and adjust the sed invocations so you always generate
kustomization.yaml but insert the correct literal values based on the true/false
check (set gkm.kindcluster=true when KIND_CLUSTER == true else
gkm.kindcluster=false; set gkm.nogpu=true when NO_GPU == true else
gkm.nogpu=false), keeping the existing AGENT_IMG and EXTRACT_IMG substitutions
(AGENT_IMG, EXTRACT_IMG, kustomization.yaml.env, kustomization.yaml) intact.
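
A self-contained miniature of the sed step for the NO_GPU case, assuming GNU sed. The template content below is invented; the real kustomization.yaml.env has more fields:

```shell
#!/bin/bash
set -euo pipefail

# Invented stand-in for config/configMap/kustomization.yaml.env
cat > /tmp/kustomization.yaml.env <<'EOF'
configMapGenerator:
- name: gkm-config
  literals:
  - gkm.agent.image=AGENT_PLACEHOLDER
  - gkm.extract.image=EXTRACT_PLACEHOLDER
EOF

# Same sed shape as the Makefile recipe: append flag literals after
# the 'literals:' line and substitute the image references.
sed \
  -e '/literals:/a\  - gkm.nogpu=true' \
  -e '/literals:/a\  - gkm.kindcluster=false' \
  -e 's@gkm\.agent\.image=.*@gkm.agent.image=quay.io/example/agent:latest@' \
  -e 's@gkm\.extract\.image=.*@gkm.extract.image=quay.io/example/extract:latest@' \
  /tmp/kustomization.yaml.env > /tmp/kustomization.yaml

# Both appended flags and the substituted images should now be present.
grep 'gkm\.' /tmp/kustomization.yaml
```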

Filippe found a few issues while trying to run GKM and KServe in the
same cluster.

- Requested that agents print the node they are running on in the logs.
- The initContainer that gets added to the Job to extract the cache is
  only supposed to be there for KIND clusters but is triggering off the
  noGpu flag. Add a KIND Cluster flag that distinguishes between NoGPU
  and KIND Clusters.
- The StorageClassName in the GKMCache and ClusterGKMCache is required
  when it should have been optional. Make it optional.
- To help with the Demo, it would be easier to manage if the PVC
  associated with ReadWriteOnce did not have the hash appended, which
  was done so each node has a unique PVC name. Add a "HACK" so that if
  the AccessMode is ReadWriteOncePod, the Operator treats the PVC as if
  ReadOnlyMany were entered and only creates one PVC for the Cluster,
  but creates it with an AccessMode of ReadWriteOnce. This will not
  work where pods across multiple nodes want to share a PVC, but it
  will "probably" work on single-node systems.

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
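
The ReadWriteOncePod HACK described above can be sketched as a small decision function. This is illustrative only, not the Operator's actual code; the function name, hash handling, and return format are assumptions:

```shell
#!/bin/bash
set -euo pipefail

# Returns "<pvc-name> <access-mode-to-create>" for a requested access mode.
pvc_name_and_mode() {
  local cache="$1" node_hash="$2" requested="$3"
  case "$requested" in
    ReadWriteOncePod)
      # HACK: one cluster-wide PVC (no per-node hash), created as
      # ReadWriteOnce. Only safe when all consumers run on one node.
      echo "$cache ReadWriteOnce" ;;
    ReadOnlyMany)
      # One shared read-only PVC for the whole cluster.
      echo "$cache ReadOnlyMany" ;;
    *)
      # Default: unique per-node PVC name via the node hash.
      echo "${cache}-${node_hash} $requested" ;;
  esac
}

pvc_name_and_mode gkm-cache ab12 ReadWriteOncePod   # → gkm-cache ReadWriteOnce
pvc_name_and_mode gkm-cache ab12 ReadWriteOnce      # → gkm-cache-ab12 ReadWriteOnce
```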