Commit 2e4fb3f

proposal for dynamo model CR
1 parent 37e2ec5 commit 2e4fb3f

1 file changed

+243
-0
lines changed

model-crd/dynamo-model-crd.md

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# DynamoModel: Kubernetes Custom Resource to simplify Model Lifecycle Management UX

**Status**: In-review

**Authors**: [biswapanda](https://github.com/biswapanda)

**Category**: Architecture

**Required Reviewers**: [Maksim, Itay, Anish, Ganesh, Neelay, Kavin]

**Review Date**: [targeted: Oct 9, 2025]


**Slack thread**: [link](https://nvidia.slack.com/archives/C06850J381Y/p1758647954211439?thread_ts=1758613631.539669&cid=C06850J381Y)

# Summary

This proposal introduces `DynamoModel`, a dedicated Kubernetes Custom Resource (CR) for managing model lifecycle in the Dynamo ecosystem. `DynamoModel` decouples model downloading, versioning, and caching from DynamoGraphDeployment (DGD). This enables consistent model references across deployments, benchmarks, and services, eliminates boilerplate code, and prevents model version drift.

# Motivation

Currently, Dynamo users face three critical challenges:

1. **Model Version Drift**: Behavior becomes inconsistent when aiperf benchmarks use a different model version than the deployment they measure. This was observed during 70B model benchmarking, where the deployment used stale weights while the benchmark job pulled the latest commit from HuggingFace.

2. **No Model Reuse Across Deployments and Perf Jobs**: Multiple DGDs or aiperf jobs cannot easily share the same model weights, which duplicates the operational overhead of managing PVCs, secrets, and Jobs.

3. **Boilerplate Code**: Each deployment requires *manual* setup of PVCs, secrets, and Jobs to download models before starting a DGD, adding complexity and maintenance burden.

These issues stem from tightly coupling model management with the deployment lifecycle, which makes it difficult to:
- Pin specific model versions across the ecosystem
- Share models between multiple deployments and benchmarks
- Verify that model weights are ready before starting workers (currently done manually by users)

## Goals

- Decouple model lifecycle from DynamoGraphDeployment lifecycle
- Enable model version pinning and eliminate version drift
- Provide model sharing across multiple DynamoGraphDeployments and aiperf jobs
- Simplify model download operations through operator-managed automation
- Ensure services or aiperf workers only start after model weights are fully downloaded and verified

### Non-Goals

- Providing model registry functionality (models are still sourced from HF/S3/NGC)

# Requirements

### Model Source Flexibility
DynamoModel MUST support multiple model sources including HuggingFace Hub, S3-compatible storage, NVIDIA NGC, and local file systems. The CR MUST use URI schemes (e.g., `hf://`, `s3://`, `ngc://`, `file://`) to specify sources.
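
As an illustration of the URI-scheme requirement, the following is a minimal sketch of how a controller might split `spec.source.uri` into a scheme and a location. The names `ModelSource`, `SUPPORTED_SCHEMES`, and `parse_source_uri` are hypothetical and not part of the proposal; only the four schemes come from the text above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Schemes named in the requirement above.
SUPPORTED_SCHEMES = {"hf", "s3", "ngc", "file"}


@dataclass(frozen=True)
class ModelSource:
    scheme: str    # e.g. "hf"
    location: str  # e.g. "meta-llama/Llama-3.3-70B-Instruct"


def parse_source_uri(uri: str) -> ModelSource:
    """Split a source URI into its scheme and location, rejecting unknown schemes."""
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported model source scheme in {uri!r}")
    # For hf://org/repo the org lands in netloc and the repo in path;
    # for file:///abs/path the netloc is empty and the path is absolute.
    location = parsed.netloc + parsed.path
    if not location:
        raise ValueError(f"model source {uri!r} has no location")
    return ModelSource(parsed.scheme, location)
```

Rejecting unknown schemes at admission time, rather than in the download Job, would surface typos before any storage is provisioned.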

### Version Immutability
Once a DynamoModel CR references a specific model version (e.g., HuggingFace commit SHA), that version MUST NOT change unless the CR is explicitly updated. This ensures deployment consistency.
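
One way to enforce this for HuggingFace sources is to accept only a full commit SHA in `spec.version`, because branch names and tags such as `main` can move. The sketch below assumes that policy; the proposal itself does not say how pinning is validated, and `is_pinned_hf_revision` is a hypothetical helper.

```python
import re

# A full Git commit SHA-1 is exactly 40 lowercase hexadecimal characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_hf_revision(version: str) -> bool:
    """Return True only for a full commit SHA, never for a branch or tag name."""
    return _COMMIT_SHA.fullmatch(version) is not None
```

With this check, the `version` value used in the example CR in this proposal is accepted, while a mutable reference such as `main` is rejected.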

### Status-Based Readiness
DynamoModel MUST expose a status field indicating its readiness state (`Pending`, `Downloading`, `Ready`, or `Failed`). Dependent resources (DGD, aiperf Job) SHOULD be able to wait for the `Ready` state before proceeding.

### Storage Persistence
Downloaded model weights MUST be stored on volumes backed by PersistentVolumeClaims (PVCs) that persist beyond the lifecycle of individual DGDs, enabling reuse across multiple deployments.

### Credential Management
DynamoModel MUST support Kubernetes Secret references for authenticated model sources (private HuggingFace repos, S3 buckets with credentials).

# Proposal

## DynamoModel Custom Resource Definition

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Model identification
  modelName: meta-llama/Llama-3.3-70B-Instruct
  version: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33  # HuggingFace commit SHA
  # Source configuration
  source:
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
    secretRef:
      name: huggingface-token
      key: token
  # Storage configuration
  storage:
    pvc:
      create: true  # Auto-create PVC
      name: llama-3-70b-instruct-v1-pvc  # Optional explicit name; defaults to <cr-name>-pvc
      storageClassName: fast-nvme  # Simple field for convenience
      size: 150Gi  # Simple field for convenience
      accessModes:
        - ReadWriteMany
      extraPvcSpec: {}
    # OR reference existing PVC
    # pvc:
    #   name: existing-model-cache
    #   subPath: llama-3-70b

  # Optional: Download configuration (defaults to HF Downloader or Base Dynamo image with HF)
  downloader:
    image: my-registry/hf-downloader:my-tag  # HF Downloader
    resources: {}
    retryLimit: 5
    timeout: 3600s
```
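
To make the operator-managed download more concrete, here is a hedged sketch of how a controller could translate such a spec into a `batch/v1` Job manifest. Everything beyond the CR fields is an assumption for illustration: the function names, the `-download` Job name suffix, the `model-store` volume name, the `/models/<cr-name>` target directory, the mapping of `retryLimit` to `backoffLimit` and `timeout` to `activeDeadlineSeconds`, and the presence of `huggingface-cli` in the downloader image. The sketch handles only `hf://` sources and expects `spec.downloader` to be set.

```python
def parse_timeout_seconds(timeout: str) -> int:
    """Convert a value such as '3600s' to whole seconds (only the 's' suffix is handled)."""
    if not timeout.endswith("s") or not timeout[:-1].isdigit():
        raise ValueError(f"unsupported timeout format: {timeout!r}")
    return int(timeout[:-1])


def render_download_job(model: dict) -> dict:
    """Build a batch/v1 Job manifest (as a plain dict) for an hf:// DynamoModel."""
    meta, spec = model["metadata"], model["spec"]
    uri = spec["source"]["uri"]
    if not uri.startswith("hf://"):
        raise ValueError("this sketch only handles hf:// sources")
    repo_id = uri[len("hf://"):]

    # Follow the documented default of <cr-name>-pvc when no name is given.
    pvc_name = spec["storage"]["pvc"].get("name") or f"{meta['name']}-pvc"
    target_dir = f"/models/{meta['name']}"  # assumed on-volume layout
    downloader = spec["downloader"]

    env = []
    secret_ref = spec["source"].get("secretRef")
    if secret_ref:
        # HF_TOKEN is the environment variable read by huggingface_hub.
        env.append({
            "name": "HF_TOKEN",
            "valueFrom": {"secretKeyRef": {"name": secret_ref["name"],
                                           "key": secret_ref["key"]}},
        })

    container = {
        "name": "downloader",
        "image": downloader["image"],
        "command": ["huggingface-cli", "download", repo_id,
                    "--revision", spec["version"], "--local-dir", target_dir],
        "env": env,
        "volumeMounts": [{"name": "model-store", "mountPath": "/models"}],
    }
    job_spec = {
        "template": {"spec": {
            "restartPolicy": "Never",
            "containers": [container],
            "volumes": [{"name": "model-store",
                         "persistentVolumeClaim": {"claimName": pvc_name}}],
        }},
    }
    if "retryLimit" in downloader:
        job_spec["backoffLimit"] = downloader["retryLimit"]
    if "timeout" in downloader:
        job_spec["activeDeadlineSeconds"] = parse_timeout_seconds(downloader["timeout"])

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{meta['name']}-download",
                     "namespace": meta["namespace"]},
        "spec": job_spec,
    }
```

Passing `--revision` with the pinned commit SHA is what ties the downloaded files to `spec.version`; without it the download would follow the default branch.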


After the model is downloaded, the status is updated as follows:
```yaml
status:
  phase: Ready  # Pending | Downloading | Ready | Failed
  conditions:
    - type: Downloaded
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: DownloadComplete
      message: "Model downloaded successfully"
  # Storage details
  storageRef:
    pvcName: llama-3-70b-instruct-v1-pvc
    path: /models/llama-3-70b-instruct-v1
  # Metadata
  modelSize: 140Gi
  downloadStartTime: "2025-10-07T10:00:00Z"
  downloadCompleteTime: "2025-10-07T10:30:00Z"
  lastAccessTime: "2025-10-07T12:15:00Z"
  # Usage tracking
  referencedBy:
    - kind: DynamoGraphDeployment
      name: vllm-disagg
      namespace: dynamo-system
```

## DynamoGraphDeployment Integration

DGDs reference models using `modelRef`:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
  namespace: dynamo-system
spec:
  services:
    VllmPrefillWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models  # Where to mount in container
      replicas: 2
      image: my-registry/vllm:my-tag

    VllmDecodeWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models
      replicas: 4
```

# Lifecycle

## DynamoModel Lifecycle

```
┌─────────┐
│ Created │
└────┬────┘
     │
     v
┌─────────┐     ┌──────────────┐
│ Pending │────>│ Downloading  │
└─────────┘     └──────┬───────┘
                       │
                ┌──────┴──────┐
                │             │
                v             v
           ┌────────┐    ┌────────┐
           │ Ready  │    │ Failed │
           └────┬───┘    └────┬───┘
                │             │
                │             │ (retry)
                │             v
                │      ┌─────────────┐
                │      │ Downloading │
                │      └─────────────┘
                │
                v
           ┌─────────┐
           │ Deleted │
           └─────────┘
```
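
The diagram can be read as a small transition table over the four `status.phase` values. The sketch below encodes that reading; it treats `Created` and `Deleted` as object events rather than phases, which is an interpretation, and the names `Phase`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical.

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    DOWNLOADING = "Downloading"
    READY = "Ready"
    FAILED = "Failed"


# Edges taken from the lifecycle diagram; Failed -> Downloading is the retry path.
ALLOWED_TRANSITIONS = {
    Phase.PENDING: {Phase.DOWNLOADING},
    Phase.DOWNLOADING: {Phase.READY, Phase.FAILED},
    Phase.FAILED: {Phase.DOWNLOADING},
    Phase.READY: set(),  # the diagram shows only deletion after Ready
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return the new phase, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the edges in one table makes it easy to reject an accidental `Ready -> Downloading` update, which would otherwise silently re-download a model that deployments already depend on.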


### DGD Controller Changes

The existing DynamoGraphDeployment controller needs the following modifications:

1. **Model Reference Resolution**: When a service spec contains `modelRef`, resolve it to the referenced DynamoModel CR.
2. **Readiness Gating**: Before creating worker Deployments, check that the referenced model's `status.phase` is `Ready`.
3. **PVC Mounting**: Automatically mount the model's PVC into worker pods.
4. **Environment Variables**: Set the `MODEL_PATH` environment variable to the model's mount path.
5. **Reference Tracking**: Add the DGD to, or remove it from, the model's `status.referencedBy` list when DGDs are created or deleted.
6. **Watch Events**: Watch for DynamoModel status changes to trigger DGD reconciliation.
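
The first five steps above can be sketched over plain dictionaries as follows. This is an illustration under stated assumptions, not the controller's implementation: readiness is gated on `status.phase`, the PVC name is read from `status.storageRef.pvcName`, `MODEL_PATH` is set to `modelRef.mountPath`, usage is tracked in `status.referencedBy`, and the function names and the `model-weights` volume name are hypothetical.

```python
from typing import Optional


def plan_worker_pod(service: dict, models: dict) -> Optional[dict]:
    """Return pod-spec fragments for a service, or None while its model is not Ready."""
    ref = service["modelRef"]
    model = models[ref["name"]]  # step 1: resolve the reference (KeyError if missing)
    status = model.get("status", {})
    if status.get("phase") != "Ready":  # step 2: readiness gating
        return None
    volume_name = "model-weights"
    return {
        # step 3: mount the model's PVC into the worker pod
        "volumes": [{"name": volume_name,
                     "persistentVolumeClaim": {
                         "claimName": status["storageRef"]["pvcName"]}}],
        "volumeMounts": [{"name": volume_name, "mountPath": ref["mountPath"]}],
        # step 4: expose the mount path to the worker process
        "env": [{"name": "MODEL_PATH", "value": ref["mountPath"]}],
    }


def add_reference(model: dict, kind: str, name: str, namespace: str) -> None:
    """Step 5: record a referencing resource once, even if called repeatedly."""
    refs = model.setdefault("status", {}).setdefault("referencedBy", [])
    entry = {"kind": kind, "name": name, "namespace": namespace}
    if entry not in refs:
        refs.append(entry)
```

Returning `None` instead of raising while the model is still downloading lets the reconcile loop simply requeue and try again when a DynamoModel status change arrives (step 6).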


# Benefits

- Eliminates boilerplate (PVC/Job init) by centralizing model operations in the operator
- Prevents model version drift with immutable version pinning
- Enables sharing across DGDs and aiperf jobs (single PVC, multiple mounts)
- Improves observability via status conditions
- Extensible to multiple sources (HF/S3/NGC/File) and future features (LoRA, air-gapped deployments from private model registries)


## Additional Considerations

- Model verification:
  - We can add verification of the entire folder (sorted by file path).
  - Problem: HF doesn't provide folder checksums; these need to be pre-computed.
  - `verification`:
    - `enabled: true`
    - `checksum: sha256:abc123`

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Additional verification
  verification:
    enabled: true
    checksum: sha256:abc123
status:
  conditions:
    - type: Verified
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: ChecksumValid
      message: "Model verification passed"

```
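
Since HuggingFace does not publish a folder-level checksum, the value in `verification.checksum` would have to be pre-computed with the same procedure the operator uses. The following is one possible procedure, offered as an assumption rather than the proposed format: walk the folder, sort files by relative path, and feed each path, size, and content into a single SHA-256. The function name `folder_checksum` is hypothetical.

```python
import hashlib
from pathlib import Path


def folder_checksum(root: Path) -> str:
    """Hash every regular file under root, ordered by relative POSIX path."""
    digest = hashlib.sha256()
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        size = str(path.stat().st_size).encode("ascii")
        # Path and size are framed with NUL bytes so that renaming a file or
        # moving bytes between files changes the digest.
        digest.update(rel + b"\0" + size + b"\0")
        with path.open("rb") as fh:
            # Read in 1 MiB chunks to keep memory flat for multi-GB weight files.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return "sha256:" + digest.hexdigest()
```

Sorting by relative path makes the result independent of filesystem enumeration order, so the same folder yields the same checksum on the machine that pre-computes it and inside the cluster.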
