|  | 
# Dynamo Model Serving Recipes

This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup.

## Contents
- [Available Models](#available-models)
- [Quick Start](#quick-start)
- [Prerequisites](#prerequisites)
- Deployment Methods
  - [Option 1: Automated Deployment](#option-1-automated-deployment)
  - [Option 2: Manual Deployment](#option-2-manual-deployment)

## Available Models

| Model Family | Framework | Deployment Mode              | GPU Requirements | Status | Benchmark |
|--------------|-----------|------------------------------|------------------|--------|-----------|
| llama-3-70b  | vllm      | agg                          | 4x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (1 node)              | 8x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (multi-node)          | 16x H100/H200    | ✅     | ✅        |
| deepseek-r1  | sglang    | disagg (1 node, wide-ep)     | 8x H200          | ✅     | 🚧        |
| deepseek-r1  | sglang    | disagg (multi-node, wide-ep) | 16x H200         | ✅     | 🚧        |
| gpt-oss-120b | trtllm    | agg                          | 4x GB200         | ✅     | ✅        |

**Legend:**
- ✅ Functional
- 🚧 Under development

**Recipe Directory Structure:**

Recipes are organized in a directory structure that follows this pattern:

```text
<model-name>/
├── model-cache/
│   ├── model-cache.yaml         # PVC for the model cache
│   └── model-download.yaml      # Job that downloads the model
├── <framework>/
│   └── <deployment-mode>/
│       ├── deploy.yaml          # DynamoGraphDeployment CRD (plus optional ConfigMap for custom configuration)
│       └── perf.yaml (optional) # Performance benchmark job
└── README.md (optional)         # Model documentation
```

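For example, the aggregated vLLM recipe for llama-3-70b would resolve to paths like the following (an illustrative instantiation of the pattern above, not a literal listing of the repository):

```text
llama-3-70b/
├── model-cache/
│   ├── model-cache.yaml
│   └── model-download.yaml
└── vllm/
    └── agg/
        ├── deploy.yaml
        └── perf.yaml
```
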
## Quick Start

Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment.

Then choose a deployment method: the automated `run.sh` script (Option 1) or the manual steps (Option 2).

## Prerequisites

### 1. Environment Setup

Create a Kubernetes namespace and set the `NAMESPACE` environment variable, which later steps use to deploy and benchmark the model:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
```

### 2. Deploy Dynamo Platform

Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).

### 3. GPU Cluster

Ensure your Kubernetes cluster has:
- GPU nodes with appropriate GPU types (see the model requirements above)
- The GPU operator installed
- Sufficient GPU memory and compute resources

### 4. Container Registry Access

Ensure access to the NVIDIA container registry for the runtime images:
- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`

### 5. HuggingFace Access and Kubernetes Secret Creation

Create a Kubernetes secret containing your HuggingFace token (referenced by deployments as `envFromSecret: hf-token-secret`) so the model can be downloaded:

```bash
# Update the token in the secret file
vim hf_hub_secret/hf_hub_secret.yaml

# Apply the secret
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
```

### 6. Configure Storage Class

Configure persistent storage for the model cache. A shared model-cache PVC avoids repeated downloads of the model weights. First check which storage classes are available:

```bash
# Check available storage classes
kubectl get storageclass
```

Then replace `your-storage-class-name` with your actual storage class in `<model>/model-cache/model-cache.yaml`:

```yaml
# In <model>/model-cache/model-cache.yaml
spec:
  storageClassName: "your-actual-storage-class"  # Replace this
```

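For orientation, the PVC in `model-cache.yaml` has roughly the following shape. This is a minimal sketch, not the shipped manifest: the `model-cache` name, access mode, and `200Gi` size are illustrative assumptions, so use the values provided in each recipe.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                 # illustrative name
spec:
  accessModes:
    - ReadWriteMany                 # assumption: cache shared across pods
  storageClassName: "your-actual-storage-class"  # replace with your storage class
  resources:
    requests:
      storage: 200Gi                # assumption: size for your model weights
```
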
## Option 1: Automated Deployment

Use the `run.sh` script for fully automated deployment. The script automatically:

- Creates the model cache PVC and downloads the model
- Deploys the model service
- Runs the performance benchmark if a `perf.yaml` file is present in the deployment directory

### Script Usage

```bash
./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>
```

**Required Options:**
- `--model <model>`: Model name matching a directory in the recipes directory (e.g., `llama-3-70b`, `gpt-oss-120b`, `deepseek-r1`)
- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, or `sglang`)
- `--deployment <deployment-type>`: Deployment mode (e.g., `agg`, `disagg`, `disagg-single-node`, `disagg-multi-node`)

**Optional Flags:**
- `--namespace <namespace>`: Kubernetes namespace (default: `dynamo`)
- `--dry-run`: Show commands without executing them
- `-h, --help`: Show help message

**Environment Variables:**
- `NAMESPACE`: Kubernetes namespace (default: `dynamo`)

### Example Usage

```bash
# Set up environment
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Configure HuggingFace token
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# Deploy Llama-3-70B with vLLM (aggregated mode)
./run.sh --model llama-3-70b --framework vllm --deployment agg

# Deploy GPT-OSS-120B with TensorRT-LLM
./run.sh --model gpt-oss-120b --framework trtllm --deployment agg

# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
./run.sh --model deepseek-r1 --framework sglang --deployment disagg

# Deploy into a custom namespace
./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg

# Dry run to see what would be executed
./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg
```

## Option 2: Manual Deployment

To deploy step by step manually, follow these steps:

```bash
# 0. Set up environment (see Prerequisites section)
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# 1. Download the model (see Step 1 below)
kubectl apply -n $NAMESPACE -f <model>/model-cache/

# 2. Deploy the model (see Step 2 below)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml

# 3. Run benchmarks (optional, if perf.yaml exists; see Step 3 below)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```

### Step 1: Download Model

```bash
# Start the download job
kubectl apply -n $NAMESPACE -f <model>/model-cache

# Verify job creation
kubectl get jobs -n $NAMESPACE | grep model-download
```

Monitor and wait for the model download to complete:

```bash
# Wait for job completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s

# Check job status
kubectl get job model-download -n $NAMESPACE

# View download logs
kubectl logs job/model-download -n $NAMESPACE
```

### Step 2: Deploy Model Service

```bash
# Navigate to the specific deployment configuration
cd <model>/<framework>/<deployment-mode>/

# Deploy the model service
kubectl apply -n $NAMESPACE -f deploy.yaml

# Verify deployment creation
kubectl get deployments -n $NAMESPACE
```

#### Wait for Deployment Ready

```bash
# Get the deployment name from the deploy.yaml file
DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')

# Wait for the deployment to be ready (timeout after 20 minutes)
kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s

# Check deployment status
kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE

# Check pod status
kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME
```

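The `grep | head | awk` extraction above can be sanity-checked locally without a cluster. The sample manifest below is hypothetical — its `apiVersion` and the `llama3-70b-agg` name are illustrative stand-ins for a real `deploy.yaml`:

```bash
# Write a sample manifest (contents are illustrative assumptions)
cat > /tmp/sample-deploy.yaml <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama3-70b-agg
EOF

# Same extraction as above: first "name:" line, second field
DEPLOYMENT_NAME=$(grep "name:" /tmp/sample-deploy.yaml | head -1 | awk '{print $2}')
echo "$DEPLOYMENT_NAME"   # prints: llama3-70b-agg
```

Note that this picks the first `name:` anywhere in the file; if a manifest lists another `name:` key before `metadata.name`, a structured query (e.g., with `yq`) is safer.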
#### Verify Model Service

```bash
# Check that the service is running
kubectl get services -n $NAMESPACE

# Test the model endpoint (port-forward to test locally)
kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE

# Test the model API (in another terminal)
curl http://localhost:8000/v1/models

# Stop the port-forward when done
pkill -f "kubectl port-forward"
```

### Step 3: Performance Benchmarking (Optional)

Run performance benchmarks to evaluate the deployment. Benchmarking is only available for recipes that include a `perf.yaml` file.

#### Launch Benchmark Job

```bash
# From the deployment directory
kubectl apply -n $NAMESPACE -f perf.yaml

# Verify benchmark job creation
kubectl get jobs -n $NAMESPACE
```

#### Monitor Benchmark Progress

```bash
# Get the benchmark job name
PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')

# Monitor benchmark logs in real time
kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE

# Wait for benchmark completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```

#### View Benchmark Results

```bash
# Check final benchmark results
kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50
```