feat: Introduce v1alpha2 version of LlamaStackDistribution CRD#253

Open
VaishnaviHire wants to merge 11 commits into llamastack:main from VaishnaviHire:implement_run_config_schema

Conversation


@VaishnaviHire VaishnaviHire commented Feb 23, 2026

This PR introduces the v1alpha2 API version for the LlamaStackDistribution CRD, enabling declarative, Kubernetes-native configuration of LlamaStack servers. Instead of requiring users to manually craft and supply a config.yaml via ConfigMap (as in v1alpha1), the operator now generates the server configuration automatically from structured CR fields (providers, resources, storage, networking). Both API versions are served concurrently with full conversion webhook support.

v1alpha2 Example

The v1alpha2 API replaces environment-variable-driven configuration with structured, declarative fields. All provider fields use typed []ProviderConfig slices with CEL validation.

Basic: a single Ollama provider:

apiVersion: llamastack.io/v1alpha2
kind: LlamaStackDistribution
metadata:
  name: llamastackdistribution-v1alpha2-sample
spec:
  distribution:
    name: starter
  providers:
    inference:
      - provider: ollama
        endpoint: http://ollama-server-service.ollama-dist.svc.cluster.local:11434/v1
  resources:
    models:
      - name: "llama3.2:1b"
  networking:
    port: 8321
  workload:
    replicas: 1

Advanced: vLLM with secret refs, PostgreSQL storage, and pgvector:

apiVersion: llamastack.io/v1alpha2
kind: LlamaStackDistribution
metadata:
  name: llamastack-vllm-pg
spec:
  distribution:
    name: starter
  providers:
    inference:
      - provider: vllm
        endpoint: http://vllm-service.vllm.svc.cluster.local:8000/v1
        secretRefs:
          api_key:
            name: vllm-creds
            key: token
    vectorIo:
      - provider: pgvector
        secretRefs:
          host:
            name: pg-credentials
            key: host
        settings:
          port: 5432
          db: llamastack
  resources:
    models:
      - name: llama3.2-8b
  storage:
    kv:
      type: redis
      endpoint: redis://redis-service.redis.svc.cluster.local:6379
    sql:
      type: postgres
      connectionString:
        name: pg-credentials
        key: dsn
  disabled:
    - safety
  workload:
    replicas: 2

Review Guide

This is a large PR (86 files, ~18k lines). The sections below group changes by area matching the commit structure. Each section is self-contained and can be reviewed independently.

1. v1alpha2 CRD Schema & Conversion (commit: e7cc5c2)

File What to review
api/v1alpha2/llamastackdistribution_types.go New spec/status types. Key design: typed []ProviderConfig slices with CEL validation for provider ID uniqueness. OverrideConfig is mutually exclusive with providers/resources/storage/disabled.
api/v1alpha2/zz_generated.deepcopy.go Auto-generated
api/v1alpha1/llamastackdistribution_conversion.go Bidirectional v1alpha1 ↔ v1alpha2 conversion. Uses JSON blob annotations (annV1Alpha1Extras, annV1Alpha2Extras) for lossless round-trips in both directions.
api/v1alpha1/llamastackdistribution_conversion_test.go Round-trip tests: providers, resources, storage, disabled, TLS, expose hostname, status fields
config/crd/bases/llamastack.io_llamastackdistributions.yaml Generated CRD YAML with OpenAPI and CEL rules

2. Validating Webhook (commit: a45c9f6)

File What to review
api/v1alpha2/llamastackdistribution_webhook.go Validating webhook: provider ID uniqueness across all API types, distribution name validation, model provider reference checks
api/v1alpha2/llamastackdistribution_webhook_test.go Unit tests: cross-slice collision detection, deriveProviderID behavior, unknown distribution rejection, edge cases
config/webhook/* Webhook service, manifests, kustomize config
config/certmanager/* Certificate and issuer for vanilla Kubernetes
config/crd/kustomization.yaml Enabled webhook/cert-manager patches
config/default/kustomization.yaml Enabled webhook and cert-manager components
config/default/manager_webhook_patch.yaml Webhook port and TLS cert volume mount
main.go Webhook registration

3. Config Generation Pipeline (commit: 3feb572)

File What to review
pkg/config/config.go Pipeline: resolve base config → expand providers → expand resources → apply storage → apply disabled APIs → clean registered_resources → override port → render YAML. Key: deep-copy safety, deterministic output
pkg/config/provider.go Provider expansion: remote:: prefix, endpoint → base_url, sorted secret ref iteration, settings merge with override protection
pkg/config/resource.go Model/tool/shield expansion with default provider resolution and provider existence validation
pkg/config/storage.go KV (sqlite/redis) and SQL (sqlite/postgres) with secret env var mapping
pkg/config/secret_resolver.go Resolves secretRefs maps to env vars (LLSD_<PROVIDER_ID>_<KEY>) and ${env.VAR_NAME} substitutions
pkg/config/resolver.go Base config resolver: resolves embedded configs by distribution name
pkg/config/version.go Config version detection (supports versions 1-2)
pkg/config/types.go Shared types: BaseConfig, ProviderEntry, GeneratedConfig
pkg/config/config_test.go Unit tests: determinism, provider/resource expansion, storage, secret resolution, disabled API cleanup, deep-copy safety
pkg/config/configs/*/config.yaml Embedded base configs for starter, starter-gpu, postgres-demo distributions
distributions.json Distribution metadata
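The secret-ref env var naming described above (LLSD_<PROVIDER_ID>_<KEY>, referenced in the rendered config as ${env.VAR_NAME}) can be sketched as follows. This is a hypothetical reconstruction for illustration — the actual GenerateEnvVarName in pkg/config/secret_resolver.go may sanitize differently:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonEnvChar matches characters not valid in an env var name (after uppercasing).
var nonEnvChar = regexp.MustCompile(`[^A-Z0-9_]`)

// generateEnvVarName builds the LLSD_<PROVIDER_ID>_<KEY> env var name,
// uppercasing the parts and replacing invalid characters with underscores.
func generateEnvVarName(providerID, key string) string {
	name := strings.ToUpper("LLSD_" + providerID + "_" + key)
	return nonEnvChar.ReplaceAllString(name, "_")
}

func main() {
	// A pgvector provider's "host" secret ref becomes:
	name := generateEnvVarName("pgvector", "host")
	fmt.Println(name)                // LLSD_PGVECTOR_HOST
	fmt.Println("${env." + name + "}") // how it appears in config.yaml
}
```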

4. Controller Integration (commit: bf90527)

File What to review
controllers/v1alpha2_config.go v1alpha2 config handling: determines config source (override / generated / default), creates immutable ConfigMaps with content-hash naming, validates secret/ConfigMap refs, injects secret-backed env vars into pod spec, cleans up old ConfigMaps
controllers/llamastackdistribution_controller.go Integration: calls handleV1Alpha2NativeConfig before standard reconcile. Dual status update path for v1alpha2 CRs
controllers/kubebuilder_rbac.go RBAC markers: added secrets (get/list/watch) and configmaps (delete)
controllers/resource_helper.go Deprecated startupScript. Sets RUN_CONFIG_PATH env var; image's built-in entrypoint.sh handles startup
controllers/resource_helper_test.go Updated assertions: RUN_CONFIG_PATH instead of command/args overrides
controllers/suite_test.go Envtest setup with webhook server
controllers/testing_support_test.go Test constants and helpers
controllers/llamastackdistribution_controller_test.go Envtest integration tests: config generation, ConfigMap creation, secret env var injection, status updates
config/rbac/role.yaml Generated ClusterRole with secrets and configmaps-delete permissions

5. OpenShift Webhook Overlay (commit: e3f405b)

File What to review
config/openshift/kustomization.yaml OpenShift overlay: replaces cert-manager with service-serving certificates
config/openshift/crd_ca_patch.yaml CRD CA injection annotation
config/openshift/manager_webhook_patch.yaml Manager cert volume mount
config/openshift/webhook_ca_patch.yaml Webhook CA injection annotation

6. E2E Tests (commit: c6d24e8)

File What to review
tests/e2e/creation_v1alpha2_test.go v1alpha2 CR creation, ConfigMap generation, Ready phase, secret env var injection into Deployment
tests/e2e/conversion_test.go Cross-version read (v1alpha1 as v1alpha2 and vice versa)
tests/e2e/webhook_validation_test.go Webhook rejects: missing distribution, duplicate provider IDs, invalid provider references
tests/e2e/validation_test.go CRD structure, webhook service/TLS, operator readiness
tests/e2e/creation_test.go Updated v1alpha1 creation tests
tests/e2e/e2e_test.go Test suite registration
tests/e2e/test_utils.go Test helpers
.github/workflows/run-e2e-test.yml CI workflow updates for v1alpha2 targets

7. Documentation & Samples (commit: 3b6972b)

File What to review
docs/migration-v1alpha1-to-v1alpha2.md Migration guide: field mapping tables, before/after examples, step-by-step migration
docs/api-overview.md Full API reference covering both versions
README.md Updated quick start with v1alpha2 examples using secretRefs and list syntax
config/samples/v1alpha1/* Existing samples moved into versioned subdirectory
config/samples/v1alpha2/* New v1alpha2 samples: basic, HA, vLLM+Postgres, networking
specs/002-operator-generated-config/* Updated spec contracts and data model

8. Build Tooling & Release (commit: b5a2df7)

File What to review
Makefile Build target updates for v1alpha2
.gitignore Ignore patterns for generated artifacts
go.mod / go.sum Dependency updates
release/operator.yaml Regenerated release manifest with all v1alpha2 resources

@VaishnaviHire VaishnaviHire marked this pull request as draft February 23, 2026 14:51
@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch 4 times, most recently from 28c7f4a to afec277 Compare February 27, 2026 08:03
@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from b760e03 to 1843bd3 Compare March 3, 2026 14:33
@VaishnaviHire VaishnaviHire changed the title [DRAFT] Implement run config schema feat: Introduce v1alpha2 version of LlamaStackDistribution CRD Mar 3, 2026
@VaishnaviHire VaishnaviHire marked this pull request as ready for review March 3, 2026 14:42
@VaishnaviHire
Collaborator Author

@Mergifyio rebase

@mergify

mergify bot commented Mar 5, 2026

rebase

✅ Branch has been successfully rebased

@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from 1843bd3 to 31e0e3c Compare March 5, 2026 15:08
@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch 2 times, most recently from 3fde13b to 673d4e7 Compare March 6, 2026 17:21
@mfleader mfleader self-requested a review March 6, 2026 19:26
logger := log.FromContext(ctx)

// Handle v1alpha2 native config generation before standard reconciliation.
v1a2Result, v1a2Err := r.handleV1Alpha2NativeConfig(ctx, key, instance)
Collaborator


Missing test coverage for FR-097 (preserve running Deployment on config generation failure).

Collaborator


Where is this covered? I don't see a test that creates a Deployment first, then fails config generation, then checks the Deployment is unchanged.

@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from 673d4e7 to ac0683d Compare March 9, 2026 09:35
@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from a38bd8f to fc77468 Compare March 20, 2026 06:08
Copy link
Copy Markdown
Collaborator

@eoinfennessy eoinfennessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final few comments on the API.

Comment on lines +90 to +97
// SecretRefs is a map of named secret references for provider-specific
// connection fields (e.g., host, password). Each key becomes the env var
// field suffix and maps to config.<key> with env var substitution:
// ${env.LLSD_<PROVIDER_ID>_<KEY>}. Use this instead of embedding
// secretKeyRef inside settings.
// +optional
// +kubebuilder:validation:MinProperties=1
SecretRefs map[string]SecretKeyRef `json:"secretRefs,omitempty"`
Collaborator


Just a thought. Should we remove the ApiKey field if it is possible for users to supply API_KEY here?

Collaborator Author


This one was shorthand for user convenience, since the API key is one of the most common secrets. I can remove it.

Comment on lines +110 to +112
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:XValidation:rule="self.size() <= 1 || self.all(p, has(p.id))",message="each provider must have an explicit id when multiple providers are specified"
Inference []ProviderConfig `json:"inference,omitempty"`
Collaborator


Is it also possible to add a CEL check to ensure each ID is unique when multiple providers are specified?

Collaborator


Never mind. I see this is handled in webhook validation.
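The cross-slice uniqueness check discussed here could look roughly like the sketch below. The names (providerConfig, deriveProviderID, findDuplicateIDs) are hypothetical stand-ins for the webhook's actual helpers:

```go
package main

import (
	"fmt"
	"sort"
)

// providerConfig loosely mirrors the ProviderConfig shape (hypothetical stub).
type providerConfig struct {
	ID       string
	Provider string
}

// deriveProviderID falls back to the provider type when no explicit ID is
// set, matching the single-provider shorthand.
func deriveProviderID(pc providerConfig) string {
	if pc.ID != "" {
		return pc.ID
	}
	return pc.Provider
}

// findDuplicateIDs returns provider IDs appearing more than once across all
// API slices (inference, vectorIo, ...), sorted for deterministic messages.
func findDuplicateIDs(slices map[string][]providerConfig) []string {
	seen := map[string]int{}
	for _, providers := range slices {
		for _, pc := range providers {
			seen[deriveProviderID(pc)]++
		}
	}
	var dups []string
	for id, n := range seen {
		if n > 1 {
			dups = append(dups, id)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	dups := findDuplicateIDs(map[string][]providerConfig{
		"inference": {{Provider: "vllm"}}, // derived ID: "vllm"
		"vectorIo":  {{ID: "vllm"}},       // collides across slices
	})
	fmt.Println(dups) // [vllm]
}
```

CEL's uniqueness rules only see one slice at a time, which is why a cross-slice check like this has to live in the webhook.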

Comment on lines +248 to +251
// Enabled activates external access via Ingress/Route.
// nil = not specified (no Ingress), false = explicitly disabled, true = create Ingress.
// +optional
Enabled *bool `json:"enabled,omitempty"`
Collaborator


I'm unsure why this is a pointer. Are we differentiating the behaviour of false and nil? Should this just be a bool?

The specs seem to suggest that the presence of a non-nil expose object enables ingress. Maybe the Enabled field is actually unnecessary?

  • expose omitted → Expose is nil → no Ingress
  • expose: {} → Expose is non-nil → create Ingress (with defaults)
  • expose: {hostname: "foo.example.com"} → create Ingress with that hostname
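The tri-state semantics under discussion (nil pointer vs. explicit false vs. presence-implies-enabled) can be sketched in a few lines. Type and function names here are illustrative, not the PR's actual API:

```go
package main

import "fmt"

// exposeConfig is a hypothetical stand-in for the Expose struct.
type exposeConfig struct {
	Enabled *bool // nil = unset, which is distinct from an explicit false
}

// shouldCreateIngress implements "presence of a non-nil expose object
// enables ingress" with an optional explicit override.
func shouldCreateIngress(e *exposeConfig) bool {
	if e == nil {
		return false // expose omitted → no Ingress
	}
	if e.Enabled == nil {
		return true // expose: {} → presence implies intent
	}
	return *e.Enabled
}

func main() {
	off := false
	fmt.Println(shouldCreateIngress(nil))                          // false
	fmt.Println(shouldCreateIngress(&exposeConfig{}))              // true
	fmt.Println(shouldCreateIngress(&exposeConfig{Enabled: &off})) // false
}
```

If presence of the struct alone carries the intent, the Enabled pointer (and its nil/false distinction) becomes redundant, which is the reviewer's point.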

Comment on lines +229 to +243
// TLSSpec configures TLS for the LlamaStack server.
// +kubebuilder:validation:XValidation:rule="!self.enabled || has(self.secretName)",message="secretName is required when TLS is enabled"
// +kubebuilder:validation:XValidation:rule="!has(self.secretName) || self.enabled",message="secretName is only valid when TLS is enabled"
// +kubebuilder:validation:XValidation:rule="!has(self.caBundle) || self.enabled",message="caBundle is only valid when TLS is enabled"
type TLSSpec struct {
// Enabled enables TLS on the server.
// +optional
Enabled bool `json:"enabled,omitempty"`
// SecretName references a Kubernetes TLS Secret. Required when enabled is true.
// +optional
SecretName string `json:"secretName,omitempty"`
// CABundle configures custom CA certificates via ConfigMap reference.
// +optional
CABundle *CABundleConfig `json:"caBundle,omitempty"`
}
Collaborator


Should we follow a similar pattern to what I described above for ExposeConfig? I.e. remove Enabled from this struct and make SecretName strictly required?

The presence or absence of tls will indicate whether or not it is enabled.

Collaborator Author


We don't need SecretName here. I will remove it from the spec. The only required field is the ca-bundle ConfigMap.

Collaborator

@eoinfennessy eoinfennessy Mar 20, 2026


I think it's no harm keeping SecretName here if it is our intent to serve LLS with TLS. It seems from the specs that this is the case.

Thinking more about this, TLSSpec is effectively configuring both incoming (SecretName) and outgoing (CABundle) TLS config. For that reason the Enabled boolean is overloaded here. I think we should rework this. How about the following? (@rhuss, hi, would appreciate your thoughts too if you have time)

# Before (conflates server and client; `enabled` is overloaded)
networking:
  tls:
    enabled: true
    secretName: llama-tls
    caBundle:
      configMapName: custom-ca

# After (separates server and client. Removes redundant `enabled`)
# If a CA bundle is provided, client-side TLS is enabled
# If TLS config is provided, server-side TLS is enabled
networking:
  tls:
    secretName: llama-tls
  caBundle:
    configMapName: custom-ca

Collaborator Author


Regarding SecretName, I can open a follow-up PR, since it will need additional verification downstream. I'll keep the ConfigMap to continue supporting v1alpha1 features.

Collaborator


I had a good conversation with Claude about this and came up with the suggestion below:

Suggestion: Separate server TLS from CA trust configuration

The current TLSSpec conflates two distinct concerns:

  1. Server TLS (serving) — the cert/key the LlamaStack server presents to incoming clients
  2. CA trust (outbound) — custom CA certificates the server trusts when connecting to external services (provider endpoints, etc.)

These have different lifecycles, different audiences, and different security implications. As-is, adding a serving certificate secret to this struct would mix both under a single tls field in NetworkingSpec, making the API harder to reason about as it grows.

Proposed change:

  1. Move caBundle to the top level of LlamaStackDistributionSpec. CA trust is a cross-cutting runtime concern (not a networking topology one), and caBundle already follows the dominant Kubernetes naming convention used by core webhooks, APIService, CRD conversion webhooks, and cert-manager.
  2. Replace TLSSpec with a server TLS struct inside NetworkingSpec that holds the serving certificate secret reference. This is where server-side TLS naturally belongs.
# Before
spec:
  networking:
    tls:
      caBundle:
        configMapName: my-ca-bundle
# After
spec:
  caBundle:
    configMapName: my-ca-bundle
  networking:
    tls:
      secretName: my-serving-cert

This gives us clear semantics (each field does one thing), independent lifecycle (CA trust without server TLS and vice versa), and aligns with how the broader Kubernetes ecosystem models these concepts (e.g., OpenShift separates trustedCA from route TLS termination; Istio separates caCertificates from server gateway TLS).

@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from fc77468 to 958b6f3 Compare March 20, 2026 11:56
Collaborator

@eoinfennessy eoinfennessy left a comment


Review of config gen pipeline. There are some critical issues that need to be addressed.

Comment on lines +136 to +138
for k, v := range settingsMap {
cfg[k] = v
}
Collaborator


It's possible that values from the settings map can override the endpoint, secret refs, and the API key.

Should we skip adding items that are already in cfg? And log a warning?

Collaborator


Or maybe add settings to the cfg map first and then add fields like base_url secret_refs and api_key?

We need to be careful that secret_refs can't override api_key too (which it can currently).
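The merge order the reviewer suggests — settings first, then secret-ref substitutions, then reserved fields last so nothing can clobber them — might look like this sketch (function name and signature are hypothetical, not the PR's actual provider.go code):

```go
package main

import "fmt"

// buildProviderConfig merges in increasing order of precedence: user
// settings first, then secret-ref substitutions, then reserved fields
// (base_url, api_key) last, so settings and secret refs can never
// override them.
func buildProviderConfig(settings map[string]any, secretRefs map[string]string, endpoint, apiKeyEnv string) map[string]any {
	cfg := map[string]any{}
	// 1. Settings go in first...
	for k, v := range settings {
		cfg[k] = v
	}
	// 2. ...then secret-ref env var substitutions...
	for key, envVar := range secretRefs {
		cfg[key] = "${env." + envVar + "}"
	}
	// 3. ...and finally the reserved fields, which always win.
	if endpoint != "" {
		cfg["base_url"] = endpoint
	}
	if apiKeyEnv != "" {
		cfg["api_key"] = "${env." + apiKeyEnv + "}"
	}
	return cfg
}

func main() {
	cfg := buildProviderConfig(
		map[string]any{"base_url": "http://evil.example"}, // attempted override via settings
		map[string]string{"api_key": "LLSD_VLLM_API_KEY"},
		"http://vllm:8000/v1",
		"LLSD_VLLM_API_KEY",
	)
	fmt.Println(cfg["base_url"]) // http://vllm:8000/v1 — the endpoint wins
}
```

Writing reserved fields last also addresses the second concern: secret_refs can no longer override api_key, regardless of key names.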

Comment on lines +122 to +130
for key := range pc.SecretRefs {
ident := providerID + ":" + key
if sub, ok := substitutions[ident]; ok {
cfg[key] = sub
} else {
envName := GenerateEnvVarName(providerID, key)
cfg[key] = "${env." + envName + "}"
}
}
Collaborator


The order of iteration is non-deterministic. This could potentially cause unnecessary Deployment updates.

We should sort the keys before iterating.
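The standard fix is to collect and sort the keys before iterating, since Go randomizes map iteration order. A minimal sketch (helper name is illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns a map's keys in stable sorted order, so every reconcile
// renders byte-identical YAML and avoids spurious ConfigMap-hash changes
// (and hence unnecessary Deployment rollouts).
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	secretRefs := map[string]string{"password": "pg-credentials", "host": "pg-credentials", "port": "pg-credentials"}
	for _, k := range sortedKeys(secretRefs) {
		fmt.Println(k) // host, password, port — always in this order
	}
}
```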

Comment on lines +101 to +107
for key, ref := range pc.SecretRefs {
addSecretToResolution(resolution, secretRefEntry{
ProviderID: providerID,
Field: key,
SecretName: ref.Name,
SecretKey: ref.Key,
})
Collaborator


The order of iteration is non-deterministic. This could potentially cause unnecessary Deployment updates.

We should sort the keys before iterating.

Comment on lines +236 to +251
// apiNameToConfigKey maps CRD-style camelCase API names to config.yaml snake_case keys.
var apiNameToConfigKey = map[string]string{
"vectorIo": "vector_io",
"toolRuntime": "tool_runtime",
"postTraining": "post_training",
"datasetIo": "datasetio",
}

// normalizeAPIName converts a CRD-style camelCase API name to the config.yaml
// snake_case key. Names already in snake_case pass through unchanged.
func normalizeAPIName(api string) string {
if mapped, ok := apiNameToConfigKey[api]; ok {
return mapped
}
return api
}
Collaborator


This is all effectively dead code because the disabled enum already specifies snake-case values.

Comment on lines +371 to +382
// RenderConfigYAML serializes the config to deterministic YAML.
func RenderConfigYAML(config *BaseConfig) (string, error) {
// Build an ordered map for deterministic output
out := buildOrderedConfig(config)

data, err := yaml.Marshal(out)
if err != nil {
return "", fmt.Errorf("failed to marshal config YAML: %w", err)
}

return string(data), nil
}
Collaborator


This function mutates the provided config, which is unconventional and unexpected for a render function.

Consider having buildOrderedConfig write to out["registered_resources"] instead of config.Extra["registered_resources"].
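The non-mutating variant could be sketched like this — derived keys such as registered_resources go into the output map, never back into the input struct. Types here are simplified stand-ins for BaseConfig:

```go
package main

import "fmt"

// baseConfig is a simplified stand-in for the real BaseConfig type.
type baseConfig struct {
	Version string
	Extra   map[string]any
}

// buildOrderedConfig copies fields into a fresh output map and writes derived
// keys (like registered_resources) only to out, leaving the caller's config
// untouched — so rendering has no side effects.
func buildOrderedConfig(config *baseConfig, registered map[string]any) map[string]any {
	out := map[string]any{"version": config.Version}
	for k, v := range config.Extra {
		out[k] = v
	}
	if len(registered) > 0 {
		out["registered_resources"] = registered // written to out, not config.Extra
	}
	return out
}

func main() {
	cfg := &baseConfig{Version: "2", Extra: map[string]any{}}
	out := buildOrderedConfig(cfg, map[string]any{"models": []string{"llama3.2:1b"}})
	fmt.Println(len(cfg.Extra))  // 0 — the input config was not mutated
	fmt.Println(out["version"])  // 2
}
```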

EvalStore map[string]interface{} `json:"eval_store,omitempty" yaml:"eval_store,omitempty"`
DatasetIOStore map[string]interface{} `json:"datasetio_store,omitempty" yaml:"datasetio_store,omitempty"`
Server map[string]interface{} `json:"server,omitempty" yaml:"server,omitempty"`
ExternalProviders map[string]interface{} `json:"external_providers,omitempty" yaml:"external_providers,omitempty"`
Collaborator


This field is effectively unused because we never use it in buildOrderedConfig.

We should delete it to avoid confusion.

}

if mc.ContextLength != nil && *mc.ContextLength > 0 {
if entry["provider_model_id"] == nil {
Collaborator


This check is not needed; it is always true.

Comment on lines +77 to +80
provider := mc.Provider
if provider == "" {
provider = defaultProvider
}
Collaborator


We have no validation that the model's provider ID actually exists.

The model can be registered with a non-existent provider and no error is returned. The llama-stack server would fail at startup with a confusing error about an unknown provider.

We should consider adding validation for this at the admission layer if not too complex. Otherwise, we can validate here and return an error so the CR's status can reflect the issue to users.

Collaborator

@eoinfennessy eoinfennessy left a comment


Review of webhooks:

We need to comprehensively test all validation logic (webhook and CEL). Currently we don't do any testing of validation logic.

There are some problems with data loss and stale data in the conversion logic. I added a suggestion to fix this.

@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from 958b6f3 to 4021d2f Compare March 23, 2026 07:00
@VaishnaviHire
Collaborator Author

@eoinfennessy I have addressed the comments and updated the commits. Please take a look.

@eoinfennessy
Collaborator

eoinfennessy commented Mar 23, 2026

@VaishnaviHire, thanks for addressing the comments. In future review cycles on this PR, please avoid squashing and force-pushing. Instead, please add new commits for each change. This makes it easier for me to review the changes that have been made between PR reviews, which is especially tricky in such a large PR.

@mergify

mergify bot commented Mar 27, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @VaishnaviHire please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2026
Collaborator

@eoinfennessy eoinfennessy left a comment


I re-reviewed after the previous suggestions. Thanks for addressing these. Couple of things left from these reviews:

  1. CEL validation tests: Let's add envtest tests to ensure all of our complex CEL validation is actually working.
  2. Split TLSSpec for server and client: see my latest comment in the thread above discussing this
  3. Small bug remaining in resource.go (see below)

if provider == "" {
return nil, fmt.Errorf("failed to expand model %q: no provider specified and no default inference provider found", mc.Name)
}
if mc.Provider != "" && !providerExists(provider, userProviders, base) {
Collaborator


The check mc.Provider != "" means: "only validate if the user explicitly set a provider." But this creates a blind spot — if the user omits the provider and the default is used, no existence check happens at all. The default provider could be stale or wrong, and the error would only surface at LlamaStack server startup with a confusing message about an unknown provider.

The fix is simply:

if !providerExists(provider, userProviders, base) {

This validates the provider regardless of whether it came from the user or from the default, which is what you'd want — if we resolved a provider name, we should verify it exists.

Collaborator

@eoinfennessy eoinfennessy left a comment


Review focusing on /controller. Mostly minor suggestions, but a couple of major things related to surfacing errors and status.

Comment on lines +195 to +199
// Handle v1alpha2 native config generation before standard reconciliation.
v1a2Result, v1a2Err := r.handleV1Alpha2NativeConfig(ctx, key, instance)
if v1a2Err != nil {
logger.Error(v1a2Err, "failed to handle v1alpha2 native config")
}
Collaborator


FR-097 states:

If config generation or validation fails during a CR update, the operator MUST preserve the current running Deployment (image, ConfigMap, env vars) unchanged and set status condition ConfigGenerated=False with the failure reason. The running instance MUST NOT be disrupted.

Two gaps:

  1. ConfigGenerated=False is never set. When handleV1Alpha2NativeConfig fails, v1a2Result is nil, so finalizeReconciliation takes the else branch and calls updateStatus — which only writes v1alpha1 status fields. SetV1Alpha2Condition is only called inside persistV1Alpha2Status, which is only reached on success. The constant ReasonConfigGenFailed is declared but unused. The failure is logged to operator stdout but never surfaces in .status.conditions.

  2. No structural guarantee the Deployment is preserved. handleV1Alpha2NativeConfig mutates v1Instance in-place (setting UserConfig and appending env vars) after all fallible operations, then reconcileResources uses the same pointer to reconcile the Deployment. Today the mutation ordering is safe — mutations happen last, after all fallible steps. But this is an implicit invariant: if a fallible step is later added after the UserConfig assignment, reconcileResources would reconcile against a half-modified spec, potentially pointing the Deployment at a ConfigMap that doesn't exist.

Suggested approach: On handleV1Alpha2NativeConfig error, skip reconcileResources, persist ConfigGenerated=False with the failure reason, and return the error to requeue. This satisfies both halves of FR-097: the Deployment is untouched and the failure is visible in status.

Collaborator


There's also no test asserting that .status.conditions contains ConfigGenerated=False after a failed config generation.

// v1alpha2 Condition reasons.
const (
ReasonConfigGenSucceeded = "ConfigGenerationSucceeded"
ReasonConfigGenFailed = "ConfigGenerationFailed"
Collaborator


This is unused

Comment on lines +232 to +234
if err := r.persistV1Alpha2Status(ctx, key, instance, v1a2Result); err != nil {
logger.Error(err, "failed to update v1alpha2 status")
}
Collaborator


We should return the status update error to match v1alpha1 behaviour in the else block, and ensure the status is eventually consistent.


// updateStatus refreshes the LlamaStack status.
func (r *LlamaStackDistributionReconciler) updateStatus(ctx context.Context, instance *llamav1alpha1.LlamaStackDistribution, reconcileErr error) error {
// computeStatus computes all status fields on the in-memory v1alpha1 instance
Collaborator


We should remove the stale updateStatus comment above this line

Comment on lines +314 to +336
for _, envVar := range resolution.EnvVars {
if envVar.ValueFrom == nil || envVar.ValueFrom.SecretKeyRef == nil {
continue
}

secretName := envVar.ValueFrom.SecretKeyRef.Name
secretKey := envVar.ValueFrom.SecretKeyRef.Key

secret := &corev1.Secret{}
if err := r.Get(ctx, types.NamespacedName{
Name: secretName,
Namespace: namespace,
}, secret); err != nil {
if k8serrors.IsNotFound(err) {
return fmt.Errorf("failed to find Secret %q in namespace %q (referenced by env var %s)", secretName, namespace, envVar.Name)
}
return fmt.Errorf("failed to get Secret %q: %w", secretName, err)
}

if _, ok := secret.Data[secretKey]; !ok {
return fmt.Errorf("failed to find key %q in Secret %q in namespace %q", secretKey, secretName, namespace)
}
}
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider aggregating the errors here to provide a better UX.
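A minimal pure-Go sketch of that aggregation pattern (hypothetical names and standalone types, not the operator's actual API; the real loop would keep the `r.Get` calls): collect every failure and combine them with `errors.Join` from Go 1.20+, so a user sees all missing secrets in one reconcile instead of one per attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSecretRefs is a hypothetical stand-in for the secret-ref check:
// instead of returning on the first missing Secret, it records each failure
// and reports them together. errors.Join returns nil when no errors were
// collected, so the happy path is unchanged.
func validateSecretRefs(refs map[string]string, existing map[string]bool) error {
	var errs []error
	for envVar, secretName := range refs {
		if !existing[secretName] {
			errs = append(errs, fmt.Errorf(
				"failed to find Secret %q (referenced by env var %s)", secretName, envVar))
		}
	}
	return errors.Join(errs...)
}

func main() {
	refs := map[string]string{"API_KEY": "vllm-creds", "DB_PASS": "pg-creds"}
	existing := map[string]bool{"pg-creds": true}
	if err := validateSecretRefs(refs, existing); err != nil {
		fmt.Println(err)
	}
}
```

The joined error's `Error()` output lists each wrapped failure on its own line, which maps cleanly onto a single condition message.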

Comment on lines +386 to +395
```go
for i, mc := range spec.Resources.Models {
	if mc.Provider != "" {
		if _, ok := providerIDs[mc.Provider]; !ok {
			return fmt.Errorf(
				"resources.models[%d].provider: provider ID %q not found; available providers: %s",
				i, mc.Provider, strings.Join(sortedKeys(providerIDs), ", "),
			)
		}
	}
}
```
Collaborator

We could aggregate errors here too.

Comment on lines +495 to +498
```go
} else {
	status.Conditions[i].Reason = reason
	status.Conditions[i].Message = message
}
```
Collaborator

There's no `ObservedGeneration` set on the condition. Without it, a client can't distinguish whether a `ConfigGenerated=True` condition was set for the current spec generation or a previous one. Consider setting `condition.ObservedGeneration = instance.Generation` to match the convention used by most Kubernetes controllers.
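As a sketch of that convention (using a simplified local stand-in for `metav1.Condition`; in the real controller, `meta.SetStatusCondition` from `k8s.io/apimachinery` handles this when the condition's `ObservedGeneration` is populated from `instance.Generation`):

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition, used only to
// illustrate the convention: stamp the spec generation the condition was
// computed against so clients can tell fresh conditions from stale ones.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	ObservedGeneration int64
}

// setCondition updates an existing condition of the same Type or appends a
// new one, recording the generation it was observed at (a hypothetical
// helper mirroring what meta.SetStatusCondition does).
func setCondition(conds []Condition, c Condition, generation int64) []Condition {
	c.ObservedGeneration = generation
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{
		Type: "ConfigGenerated", Status: "True", Reason: "ConfigGenerationSucceeded",
	}, 3)
	fmt.Println(conds[0].ObservedGeneration) // 3
}
```

A client watching the CR can then ignore any `ConfigGenerated` condition whose `ObservedGeneration` is behind `metadata.generation`.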

Comment on lines +991 to +1003
```go
namespace := createTestNamespace(t, "test-v1alpha2-secret-ref")
operatorNamespace := createTestNamespace(t, "test-v1alpha2-secret-op")
t.Setenv("OPERATOR_NAMESPACE", operatorNamespace.Name)

// Create operator config ConfigMap (required by NewLlamaStackDistributionReconciler)
opConfig := &corev1.ConfigMap{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "llama-stack-operator-config",
		Namespace: operatorNamespace.Name,
	},
	Data: map[string]string{},
}
require.NoError(t, k8sClient.Create(t.Context(), opConfig))
```
Collaborator

This is repeated 6 times. Consider writing a helper:

```go
func setupV1Alpha2Env(t *testing.T, prefix string) (ns *corev1.Namespace, opNs *corev1.Namespace)
```

Comment on lines +1030 to +1037
```go
clusterInfo := &cluster.ClusterInfo{
	OperatorNamespace:  operatorNamespace.Name,
	DistributionImages: map[string]string{"starter": testImage},
}
reconciler, err := controllers.NewLlamaStackDistributionReconciler(
	t.Context(), k8sClient, scheme.Scheme, clusterInfo,
)
require.NoError(t, err)
```
Collaborator

This is repeated 5 times. Consider a helper:

```go
func newV1Alpha2Reconciler(t *testing.T, opNamespace string) *controllers.LlamaStackDistributionReconciler
```

Comment on lines +185 to +187
```yaml
agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
```
Collaborator

The `agents` API has been renamed to `responses`: llamastack/llama-stack#5195

We probably need to update this in all embedded configs.

@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from 4021d2f to 43e2ce6 Compare March 30, 2026 18:06
@VaishnaviHire
Collaborator Author

@eoinfennessy I have addressed the comments. Additionally, I added a deploy-time feature flag for v1alpha2: a v1alpha1-only overlay that deploys only the v1alpha1 CRD.

Add typed v1alpha2 API (ProvidersSpec, ModelConfig, ExposeConfig,
StorageSpec) with kubebuilder validation markers and CEL rules.
Implement lossless v1alpha1<->v1alpha2 conversion via JSON-blob
annotations for fields that have no v1alpha1 equivalent.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Implement admission webhook that validates distribution names against
the embedded registry, enforces unique provider IDs per category,
and checks model provider references. Wire up cert-manager and
webhook kustomize overlays.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Build the config generation pipeline that renders a complete
config.yaml from v1alpha2 spec fields (providers, resources,
storage). Includes distribution registry, provider expansion,
model/tool/shield resource resolution, storage configuration,
secret-ref placeholder injection, and disabled-API pruning.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Wire the config generation pipeline into the reconciliation loop.
Adds v1alpha2 config source detection, ConfigMap creation with
generated config.yaml, secret env-var injection into pod spec,
RBAC permissions for secrets and configmap deletion, and
controller-level integration tests.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Add kustomize overlay for OpenShift deployments that patches the
webhook configuration to use the service-serving-cert-signer CA
instead of cert-manager, along with SCC-compatible manager patches.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Add end-to-end tests covering v1alpha2 CR creation, conversion
round-trips, webhook validation rejection, secret env-var injection,
and TLS configuration. Refactor existing e2e tests into focused
test files with shared utilities.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Reorganize sample CRs into v1alpha1/ and v1alpha2/ subdirectories.
Add v1alpha2 sample CRs (vLLM+Postgres, HA, networking), API
overview, and v1alpha1-to-v1alpha2 migration guide. Update README
with v1alpha2 quick-start examples.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Update Makefile with webhook cert-manager targets, add go module
dependencies for the config pipeline and webhook infrastructure,
and regenerate the release operator manifest.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Add v1alpha1-only overlay for deployment. This allows the v1alpha2 api to
be incrementally enabled for GA releases.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
Remove support for eval, safety, and related APIs.

Signed-off-by: Vaishnavi Hire <vhire@redhat.com>
Assisted-by: claude-4.6-opus
@VaishnaviHire VaishnaviHire force-pushed the implement_run_config_schema branch from 43e2ce6 to 2de6d1e Compare March 30, 2026 18:44
@mergify mergify bot removed the needs-rebase label Mar 30, 2026