feat(api): v1alpha2 CRD types and config generation pipeline#266
feat(api): v1alpha2 CRD types and config generation pipeline#266rhuss wants to merge 11 commits intollamastack:feature/sdd-v1alpha2from
Conversation
|
Addressed review feedback from @eoinfennessy on PR #264 The latest commit ( Empty collection guards (
Empty string guards:
Tighter storage validation:
Tighter TLS validation:
Range and constraint validation:
|
8fdf89e to
4a77086
Compare
Replace the `replace_by_api_type` provider merge strategy with `overlay_by_provider_id` across all spec artifacts. The previous strategy would silently drop base providers that users didn't explicitly re-declare, breaking distributions that ship multiple default providers per API type. The overlay strategy uses provider-ID-based matching: matching IDs are replaced, unmatched user IDs are appended, and unmatched base IDs are preserved. This keeps distributions intact when users only need to override a single provider. Addresses review feedback from @VaishnaviHire on PR llamastack#263 and aligns with the merge semantics already implemented in PR llamastack#253. Changes: - contracts/config-generation.yaml: strategy + examples - spec.md: add FR-035a for overlay merge semantics - research.md: update R3 decision and rationale - REVIEW.md: update Decision 3 - tasks.md: update T018 and T029 descriptions Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
…ovider_id Aligns plan.md with the overlay_by_provider_id change made in the config-generation contract and other spec artifacts. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
…validation Implements Phase 1 (T001-T013) of the operator-generated config spec: - v1alpha2 package scaffolding (doc.go, groupversion_info.go) - Complete type hierarchy: DistributionSpec, ProvidersSpec, ProviderConfig, ResourcesSpec, ModelConfig, StorageSpec (kv/sql), NetworkingSpec, TLSSpec, ExposeConfig, WorkloadSpec, WorkloadOverrides, OverrideConfigSpec - Status types: ResolvedDistributionStatus, ConfigGenerationStatus, condition type/reason constants, phase enum - CEL validation rules: - Mutual exclusivity (overrideConfig vs providers/resources/storage/disabled) - Distribution name/image mutual exclusivity - Provider ID required when multiple providers per API type - TLS secretName required when enabled - Redis endpoint required, Postgres connectionString required - Disabled + provider conflict detection - Generated deepcopy and CRD manifests Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
…e 2) Implements Phase 2 (T014-T030) of the operator-generated config spec: Config generation engine (pkg/config/): - ConfigResolver interface with EmbeddedConfigResolver (go:embed) - Embedded base configs for all 4 distributions (starter, remote-vllm, meta-reference-gpu, postgres-demo) - Config version detection and validation - Provider expansion: auto-ID generation, remote:: prefix, endpoint mapping, settings merge, secretRef env var placeholders - Resource expansion: model registration with default/explicit provider assignment, tool group and shield registration with provider fallback - Storage config: KV (sqlite/redis) and SQL (sqlite/postgres) backends with secretKeyRef env var substitution - Merge logic with overlay_by_provider_id strategy: matching IDs replaced, unmatched user IDs appended, unmatched base IDs preserved - Disabled API subtractive removal - GenerateConfig orchestrator producing deterministic YAML + SHA256 hash - Secret reference collection from apiKey, secretRefs, and storage fields Tests (53 tests, all passing): - EmbeddedConfigResolver: all distributions + unknown error - Provider expansion: single/multi, auto-ID, settings, apiKey, secretRefs - Resource expansion: models, tools, shields, provider fallback, errors - Secret collection: apiKey, secretRefs, storage, env var normalization - Storage: sqlite/redis/postgres, independent subsection merge - Merge: overlay match/append/preserve, disabled, base immutability - Generator: end-to-end with real embedded configs, determinism, overlay preservation, error cases Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
…alidation Incorporates all review feedback from @eoinfennessy on PR llamastack#264: - Add MinItems=1 on all optional slices to reject explicitly empty lists (provider slices, models, tools, shields, env, command, args, volumes, volumeMounts, topologySpreadConstraints, configMapKeys, namespaces, labels) - Add MinProperties=1 on secretRefs map - Add MinLength=1 on required string fields (provider, name, configMapName, providerId, image, secretKeyRef name/key) - Add CEL non-empty guards on optional strings (id, provider, modelType, quantization, hostname, serviceAccountName, mountPath, configMapNamespace) - Add reverse validation on storage types: endpoint/password only valid for redis, connectionString only valid for postgres - Add TLS reverse validation: secretName and caBundle only valid when enabled - Add port range validation (1-65535) - Add replicas Minimum=0 - Add autoscaling: minReplicas/maxReplicas Minimum=1, percentage range 1-100, CEL rule maxReplicas >= minReplicas - Add PDB mutual exclusivity: minAvailable and maxUnavailable cannot both be set, at least one required - Document *bool semantics on ExposeConfig.Enabled (nil vs false vs true) Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
4a77086 to
62acaf5
Compare
Fixes all pre-commit lint failures: - cyclop: split findProviderForAPI into smaller functions - errchkjson: check json.Marshal error in test helper - forcetypeassert: use checked type assertions via test helpers - goconst: extract storageTypeSQLite constant - godot: add periods to comment blocks - govet: fix shadowed err variable in generator.go - lll: add nolint:lll for long kubebuilder CEL markers - perfsprint: use errors.New and hex.EncodeToString - testifylint: use assert.Positive and assert.InDelta Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
- Prefix all error messages with "failed to" per project convention - Regenerate API docs (docs/api-overview.md) for v1alpha2 types - Regenerate installer manifest (release/operator.yaml) Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
eoinfennessy
left a comment
There was a problem hiding this comment.
Added a few comments on the config gen pipeline, and a few more validation suggestions that became apparent while reviewing it.
pkg/config/merge.go
Outdated
| spec *v1alpha2.LlamaStackDistributionSpec, | ||
| ) (map[string]interface{}, error) { | ||
| // Start with a copy of the base config | ||
| merged := shallowCopyMap(baseConfig) |
There was a problem hiding this comment.
shallowCopyMap only copies top-level keys. Nested maps (like providers, apis) are still shared references. overlayProviders mutates the providersMap in-place. This means the baseConfig's providers map is mutated.
The test TestMergeConfig_DoesNotMutateBase passes only because applyDisabled replaces config["apis"] with a new slice rather than mutating it in-place.
We should consider a recursive deep copy.
There was a problem hiding this comment.
Good catch. Replaced shallowCopyMap with deepCopyMap that uses a JSON round-trip to ensure nested maps and slices are fully independent copies. Also strengthened TestMergeConfig_DoesNotMutateBase to verify that base providers aren't mutated either (not just apis).
pkg/config/secret.go
Outdated
| func resolveProviderID(p v1alpha2.ProviderConfig, isSingle bool) string { | ||
| if p.ID != "" { | ||
| return p.ID | ||
| } | ||
| if isSingle { | ||
| return p.Provider | ||
| } | ||
| return p.Provider | ||
| } |
There was a problem hiding this comment.
isSingle is effectively unused because p.Provider is returned either way.
There was a problem hiding this comment.
Right, dead logic. Removed the isSingle parameter entirely. Also extracted the ID resolution into a shared ResolveProviderID helper in provider.go (addresses comment #11 about duplication too).
| func countMergedProviders(config map[string]interface{}) int { | ||
| providersMap, ok := config["providers"].(map[string]interface{}) | ||
| if !ok { | ||
| return 0 | ||
| } | ||
|
|
||
| count := 0 | ||
| // Sort keys for determinism | ||
| keys := make([]string, 0, len(providersMap)) | ||
| for k := range providersMap { | ||
| keys = append(keys, k) | ||
| } | ||
| sort.Strings(keys) | ||
|
|
||
| for _, k := range keys { | ||
| if providers, ok := providersMap[k].([]interface{}); ok { | ||
| count += len(providers) | ||
| } | ||
| } | ||
| return count | ||
| } |
There was a problem hiding this comment.
Sorting is not needed, and neither is creating a keys slice
There was a problem hiding this comment.
Fixed. Removed the keys slice and sorting. The function just sums counts, so iteration order is irrelevant.
pkg/config/resource.go
Outdated
| if model.Provider != "" { | ||
| // Explicit provider assignment (FR-041) | ||
| providerID = model.Provider | ||
| providerType = "remote::" + providerID |
There was a problem hiding this comment.
If the user sets provider: "my-vllm-provider" (a custom ID, not a provider name), this produces provider_type: "remote::my-vllm-provider". Is this the expected behavior? Should the provider type come from the actual provider's Provider field instead of its ID?
There was a problem hiding this comment.
Good catch, this was a bug. When model.Provider is a custom ID like vllm-primary, the code was producing provider_type: remote::vllm-primary instead of remote::vllm. Added lookupProviderType() that searches user providers and base config to resolve the actual provider name. Test updated to verify provider_type is correct.
| id = p.Provider | ||
| } | ||
|
|
||
| providerType := "remote::" + p.Provider |
There was a problem hiding this comment.
All providers will have the remote:: prefix. Are inline:: providers not possible to configure with the CRD? If this is intentional we should ensure it is documented.
There was a problem hiding this comment.
This is intentional. The CRD only supports configuring remote providers. Inline providers from the base config are preserved during merge but can't be added or configured through the CRD. Added documentation on ExpandProviders and an inline comment at the remote:: prefix line to make this explicit.
pkg/config/provider.go
Outdated
| // CountProviders counts total providers across all API types in the spec. | ||
| func CountProviders(providers *v1alpha2.ProvidersSpec) int { | ||
| if providers == nil { | ||
| return 0 | ||
| } | ||
| return len(providers.Inference) + | ||
| len(providers.Safety) + | ||
| len(providers.VectorIo) + | ||
| len(providers.ToolRuntime) + | ||
| len(providers.Telemetry) | ||
| } |
There was a problem hiding this comment.
Yes, CountProviders was only used in its own tests, not in production code. Removed it along with the test.
| // AllAPIProviders returns all configured API types and their providers in deterministic order. | ||
| func AllAPIProviders(providers *v1alpha2.ProvidersSpec) []apiProviders { | ||
| if providers == nil { | ||
| return nil | ||
| } | ||
|
|
||
| var result []apiProviders | ||
| if len(providers.Inference) > 0 { | ||
| result = append(result, apiProviders{APIType: "inference", Providers: providers.Inference}) | ||
| } | ||
| if len(providers.Safety) > 0 { | ||
| result = append(result, apiProviders{APIType: "safety", Providers: providers.Safety}) | ||
| } | ||
| if len(providers.VectorIo) > 0 { | ||
| result = append(result, apiProviders{APIType: "vector_io", Providers: providers.VectorIo}) | ||
| } | ||
| if len(providers.ToolRuntime) > 0 { | ||
| result = append(result, apiProviders{APIType: "tool_runtime", Providers: providers.ToolRuntime}) | ||
| } | ||
| if len(providers.Telemetry) > 0 { | ||
| result = append(result, apiProviders{APIType: "telemetry", Providers: providers.Telemetry}) | ||
| } | ||
| return result | ||
| } |
There was a problem hiding this comment.
Should this include agents, datasetio, eval, scoring?
There was a problem hiding this comment.
These API types (agents, datasetio, eval, scoring) don't have corresponding fields in ProvidersSpec, so they can't be configured through the CRD. This is intentional since these are typically handled by inline providers from the base config. The AllAPIProviders function only returns API types that have CRD fields. This ties into the remote:: only design discussed above.
| disabledSet := make(map[string]bool, len(disabled)) | ||
| for _, d := range disabled { | ||
| disabledSet[d] = true | ||
| } | ||
|
|
||
| // Remove from apis list | ||
| if apis, ok := config["apis"].([]interface{}); ok { | ||
| filtered := make([]interface{}, 0, len(apis)) | ||
| for _, api := range apis { | ||
| if apiStr, ok := api.(string); ok && !disabledSet[apiStr] { | ||
| filtered = append(filtered, api) | ||
| } | ||
| } | ||
| config["apis"] = filtered | ||
| } | ||
|
|
||
| // Remove from providers map | ||
| if providersMap, ok := config["providers"].(map[string]interface{}); ok { | ||
| for _, d := range disabled { | ||
| delete(providersMap, d) | ||
| } | ||
| } |
There was a problem hiding this comment.
Disabling an API removes it from apis and providers, but doesn't clean up any registered_resources entries that reference providers for that API type. For example, disabling safety leaves any shield registrations pointing to a now-removed provider. This could cause Llama Stack to fail at startup with a confusing error about a missing provider rather than a clear "disabled" message.
There was a problem hiding this comment.
Good catch, this was a real bug. applyDisabled now collects the provider IDs being removed before deleting them from the map, then filters registered_resources to remove entries whose provider_id matches. Added TestApplyDisabled_CleansUpRegisteredResources to verify.
| // Check user providers first | ||
| if id, pt, ok := findUserProvider(apiType, userProviders); ok { | ||
| return id, pt, nil | ||
| } | ||
|
|
||
| // Fall back to base config | ||
| if id, pt, ok := findBaseConfigProvider(apiType, baseConfig); ok { | ||
| return id, pt, nil | ||
| } | ||
|
|
||
| return "", "", fmt.Errorf("failed to find %s provider", apiType) | ||
| } |
There was a problem hiding this comment.
There's duplication of provider ID resolution logic between here and the ExpandProviders function in provider.go.
We should consider creating a shared helper function to ensure logic doesn't diverge.
There was a problem hiding this comment.
Agreed. Extracted a shared ResolveProviderID(p ProviderConfig) string helper in provider.go. Both findUserProvider and CollectSecretRefs now use it, eliminating the duplication risk.
| var existing []interface{} | ||
| if rr, ok := config["registered_resources"].([]interface{}); ok { | ||
| existing = rr | ||
| } | ||
|
|
||
| for _, r := range resources { | ||
| existing = append(existing, map[string]interface{}{ | ||
| "resource_type": r.ResourceType, | ||
| "provider": map[string]interface{}{ | ||
| "provider_id": r.Provider.ProviderID, | ||
| "provider_type": r.Provider.ProviderType, | ||
| }, | ||
| "params": r.Params, | ||
| }) | ||
| } |
There was a problem hiding this comment.
If the base config already registers a model (say llama3.2-8b) and the user's CRD also specifies llama3.2-8b in resources.models, both entries end up in the final registered_resources list. Llama Stack would see two registrations for the same model ID.
There was a problem hiding this comment.
Fixed. appendResources now builds a set of existing resource keys (resource_type/id) and skips duplicates. Added TestGenerateConfig_DeduplicatesResources to verify that a model already in the base config isn't added again.
- Replace shallowCopyMap with deepCopyMap (JSON round-trip) to prevent base config mutation through shared nested references - Fix wrong provider_type when model.Provider is a custom ID by adding lookupProviderType() to resolve actual provider names - Clean up registered_resources when disabling APIs to prevent orphaned entries that could cause startup failures - Deduplicate registered_resources in appendResources to avoid duplicate model/shield/tool registrations from base config overlap - Extract shared ResolveProviderID helper to eliminate duplicated provider ID resolution logic between resource.go and secret.go - Remove dead isSingle parameter from resolveProviderID - Remove unnecessary sorting in countMergedProviders - Remove unused CountProviders function and its tests - Add CEL validation for disabled API names (enum of 9 valid values) - Add CEL uniqueness rule for provider IDs in all provider slices - Document that CRD only supports remote:: providers (inline:: preserved from base config but not configurable via CRD) Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Regenerated by pre-commit hooks after adding CEL validation rules for disabled API enum and provider ID uniqueness. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
Remove spaces in CEL expression and shorten message to stay under the 180-character line length limit enforced by golangci-lint. Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
|
Hey @eoinfennessy , thanks a ton for review. All your comments are very legit and highly appreciated. I also used those to harden the quality gates of the plugin to hopefully reduce these kind of (confessedly partly really stupid) errors. btw, the responses of your comments were auto generated (you guessed that anyway, I guess :) Sorry for the delay, but I'm in the middle of two very distracting weeks (PTO and KubeCon :-), but I'm still dedicated to get this PR through and learn more how to improve the workflow. Especially with the Thanks again for your support! |
Assisted-By: 🤖 Claude Code Signed-off-by: Roland Huß <rhuss@redhat.com>
6856c54 to
e80ebf0
Compare
|
@rhuss thanks for the review! Having acted as HITL for this PR and #253 , there's two major things I'd like to see introduced in future SDD workflows:
|
|
P.S. I will probably focus review efforts on PR #253 in the coming days, but I definitely intend to come back to this soon. Have fun at KubeCon :) |
|
Thanks @eoinfennessy , that feedback is very helpful and helps to improve the methodology: My main take aways:
No worries about the reviews, I will merge that one and work on the next phase, with some better code review process. |
Summary
Adds complete v1alpha2 CRD types and the config generation pipeline (Phase 1 + Phase 2 of the operator-generated config spec). Also updates the provider merge strategy from
replace_by_api_typetooverlay_by_provider_idacross spec artifacts (per review feedback from @VaishnaviHire on PR #263).Spec changes
overlay_by_provider_idin contracts, spec.md (FR-035a), research.md, REVIEW.md, plan.md, tasks.mdPhase 1: v1alpha2 CRD Types (
api/v1alpha2/)Complete type hierarchy with CEL validation:
DistributionSpec,ProvidersSpec,ProviderConfig(typed[]ProviderConfigslices)ResourcesSpec,ModelConfig,StorageSpec(kv/sql),NetworkingSpec,WorkloadSpecResolvedDistributionStatus,ConfigGenerationStatus, condition constantsMinItems=1on all optional slices,MinLength=1on required strings, CEL non-empty guards on optional strings, reverse storage/TLS validation, port range, autoscaling constraintsPhase 2: Config Generation Engine (
pkg/config/)Pure Go pipeline with no Kubernetes API dependencies:
ConfigResolverinterface withEmbeddedConfigResolver(go:embed for 4 distributions: starter, remote-vllm, meta-reference-gpu, postgres-demo)remote::prefix, endpoint mapping, settings merge, secretRef env var placeholdersoverlay_by_provider_idmerge: matching IDs replaced, new IDs appended, unmatched base IDs preservedGenerateConfigorchestrator producing deterministic YAML + SHA256 content hashGenerated artifacts
docs/api-overview.md: API reference for v1alpha2 typesrelease/operator.yaml: Updated installer manifest with v1alpha2 CRDconfig/crd/bases/: Updated CRD with v1alpha2 schema + CEL rulesTests
53 unit tests covering all pipeline components, all passing:
Supersedes
Test plan
pkg/config/unit tests passgo build ./...)make manifests)make generate)Assisted-By: 🤖 Claude Code