Status: ✅ Ready to Deploy
Date: December 2, 2025
| Metric | Current (On-Demand) | Optimized (Spot + On-Demand) | Savings |
|---|---|---|---|
| Monthly Cost | $2,895 | $1,008 | $1,887 (65%) |
| Annual Cost | $34,740 | $12,096 | $22,644 |
| Hourly Cost | $4.02 | $1.40 | $2.62 |
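
The savings figures in the table follow directly from the two monthly totals. As a quick sanity check, the arithmetic can be reproduced with a few lines of shell; the variable names below are illustrative and not part of the repository scripts.

```shell
#!/usr/bin/env bash
# Monthly totals taken from the cost table, in whole dollars.
current_monthly=2895
optimized_monthly=1008

# Difference between the on-demand and the mixed spot/on-demand setup.
monthly_savings=$((current_monthly - optimized_monthly))

# Shell integer division truncates, so 65.18% is reported as 65.
savings_pct=$((monthly_savings * 100 / current_monthly))

# Twelve months of the monthly saving.
annual_savings=$((monthly_savings * 12))

printf 'Monthly savings: $%d (%d%%)\n' "$monthly_savings" "$savings_pct"
printf 'Annual savings:  $%d\n' "$annual_savings"
```

Running it prints $1887 (65%) per month and $22644 per year, matching the Savings column.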

Moved to spot:

- GPU Nodes (g5.2xlarge): 3 nodes for NVIDIA NIMs
- System Worker Nodes (m5.xlarge): 2 nodes for backend/frontend

Kept on on-demand:

- Milvus Vector Database: 1 node (data persistence)
- etcd/MinIO/Kafka: on the same node as Milvus (critical metadata)

Run the deployment script:

- `cd infrastructure`
- `./migrate-to-spot.sh`

Duration: 10-15 minutes
Downtime: None (rolling updates)

Safe to run on spot:

| Component | Why Safe | Recovery Time |
|---|---|---|
| NVIDIA NIMs | Stateless, model re-downloads | 3-5 minutes |
| Backend/Frontend | 2 replicas, PDB protection | 30 seconds |

Kept on on-demand:

| Component | Why Protected | Location |
|---|---|---|
| Milvus | Vector embeddings (97 PDFs) | On-demand node |
| etcd | Milvus metadata | On-demand node |
| MinIO | Vector object storage | On-demand node |
- Pod Disruption Budgets: At least 1 replica always running
- Karpenter Interruption Queue: 2-minute warning before termination
- Node Affinity: Critical services pinned to on-demand
- Capacity Fallback: Spot → on-demand if unavailable
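
To make the first safeguard concrete, here is a minimal sketch of how a Pod Disruption Budget that keeps at least one replica running could be generated. It is not the content of `pod-disruption-budgets.yaml`; the function name `render_pdb`, the PDB name, and the `app` label are assumptions for illustration.

```shell
#!/usr/bin/env bash
# render_pdb NAME APP_LABEL [MIN_AVAILABLE]
# Prints a PodDisruptionBudget manifest to stdout.
# NAME and APP_LABEL are hypothetical; use the labels your Deployments carry.
render_pdb() {
  local name="$1" app="$2" min_available="${3:-1}"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}
spec:
  # Voluntary evictions (such as node drains) are refused if they would
  # leave fewer than this many matching pods available.
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app}
EOF
}
```

A possible use is `render_pdb backend-pdb backend | kubectl apply -f -`. With 2 replicas and `minAvailable: 1`, a drain can evict one pod at a time but never both.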

To monitor Karpenter, node capacity types, PDBs, and daily cost:

- `kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f`
- `kubectl get nodes -L karpenter.sh/capacity-type`
- `kubectl get pdb --all-namespaces`
- `aws ce get-cost-and-usage --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) --granularity DAILY --metrics BlendedCost`

To revert Karpenter to on-demand:

- `kubectl edit nodepool nvidia-nim-gpu -n karpenter`
- Change `values: ["spot", "on-demand"]` to `values: ["on-demand"]`

To restore the previous configuration from Git:

- `cd infrastructure/terraform`
- `git checkout HEAD -- main.tf karpenter-provisioner.yaml`
- `terraform apply`
- `kubectl apply -f karpenter-provisioner.yaml`

Files changed:

- `infrastructure/terraform/karpenter-provisioner.yaml` - Spot priority for GPU
- `infrastructure/terraform/main.tf` - Added spot system node group
- `infrastructure/helm/milvus-standalone-values.yaml` - Pin to on-demand
- `infrastructure/kubernetes/pod-disruption-budgets.yaml` - New file (PDBs)
- `infrastructure/migrate-to-spot.sh` - New file (deployment script)

After deployment, verify:
- GPU nodes labeled `capacity-type: spot`
- System spot nodes running
- Milvus/etcd/MinIO on on-demand nodes
- PDBs active (`kubectl get pdb --all-namespaces`)
- No service interruptions during testing
- Cost reduction visible in AWS console (next day)
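
The node-label items in this checklist can be partly scripted. The sketch below assumes the `karpenter.sh/capacity-type` label appears as the last column of the `kubectl get nodes` output; the helper name `count_capacity_type` is hypothetical, and nodes that are not managed by Karpenter may carry a different label and would not be counted.

```shell
#!/usr/bin/env bash
# count_capacity_type TYPE
# Reads the output of
#   kubectl get nodes -L karpenter.sh/capacity-type --no-headers
# on stdin and prints how many nodes have TYPE (for example "spot")
# in the last column.
count_capacity_type() {
  awk -v want="$1" '$NF == want { n++ } END { print n + 0 }'
}
```

For example, `kubectl get nodes -L karpenter.sh/capacity-type --no-headers | count_capacity_type spot` prints the number of spot nodes, which can be compared against the expected GPU node count.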

See `infrastructure/SPOT_INSTANCE_STRATEGY.md` for complete details.

Ready to save ~$1,900/month? Run `./infrastructure/migrate-to-spot.sh`