This guide covers the enterprise-grade deployment of the NVIDIA RAG Blueprint to your AWS EKS cluster for the US Customs Tariff RAG use case.
Architecture Components:
- Milvus Vector Database - Cloud-native, scalable vector storage
- NVIDIA NeMo Retriever - Embedding generation and retrieval
- RAG Query Server - Handles search and retrieval requests
- RAG Ingest Server - PDF processing and document ingestion
- Hybrid Search - Combines vector similarity + keyword (BM25) search
✅ Production-Ready: Battle-tested, enterprise-grade architecture
✅ Scalable: Milvus handles billions of vectors
✅ Future-Proof: Industry standard, actively maintained
✅ Hybrid Search: Vector + keyword for complex tariff queries
✅ Advanced Document Processing: GPU-accelerated PDF parsing, OCR, table extraction
✅ Observability: Built-in monitoring and logging
```
┌─────────────────────────────────────────────────────────────┐
│ AWS EKS Cluster                                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ┌────────────────┐       ┌────────────────┐                 │
│ │ AI-Q Agent     │─────▶│ RAG Query      │                  │
│ │ (Backend)      │       │ Server         │                 │
│ │                │       │ (Port 8081)    │                 │
│ └────────────────┘       └────────┬───────┘                 │
│                                   │                         │
│                          ┌────────▼────────┐                │
│ ┌────────────────┐       │ Milvus          │                │
│ │ Tariff PDF     │       │ Vector Store    │                │
│ │ Ingestion      │──┐    │ (Port 19530)    │                │
│ │                │  │    └────────▲────────┘                │
│ └────────────────┘  │             │                         │
│                     │    ┌────────┴────────┐                │
│                     └──▶│ RAG Ingest      │                 │
│                          │ Server          │                │
│                          │ (Port 8082)     │                │
│                          └────────┬────────┘                │
│                                   │                         │
│                          ┌────────▼────────┐                │
│                          │ Embedding NIM   │                │
│                          │ (NeMo Retr.)    │                │
│                          │ (Port 8000)     │                │
│                          └─────────────────┘                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Before deploying the RAG Blueprint, ensure you have:

- EKS Cluster - already provisioned via Terraform:

  ```bash
  kubectl cluster-info
  ```

- NGC API Key - for pulling NVIDIA containers (get one at https://org.ngc.nvidia.com/setup/api-key):

  ```bash
  export NGC_API_KEY="your-ngc-api-key"
  ```

- Helm - for deploying the RAG Blueprint:

  ```bash
  helm version
  # If not installed:
  curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash -
  ```

- kubectl - configured for your EKS cluster:

  ```bash
  aws eks update-kubeconfig --region us-west-2 --name aiq-udf-eks
  ```

- Storage Class - for Milvus persistence (gp3 recommended):

  ```bash
  kubectl get storageclass
  ```
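If the cluster does not yet have a default gp3 StorageClass, one can be created along these lines. This is a minimal sketch: it assumes the AWS EBS CSI driver is installed in the cluster, and the class name `gp3` is a convention, not a requirement.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    # Make this the default class so Milvus PVCs bind to it automatically
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```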
Navigate to the Helm directory:

```bash
cd infrastructure/helm
```

Make the deployment scripts executable:

```bash
chmod +x deploy-rag-blueprint.sh
chmod +x verify-rag-deployment.sh
```

Deploy the RAG Blueprint:

```bash
export NGC_API_KEY="your-ngc-api-key"
./deploy-rag-blueprint.sh
```

What this does:

- Creates the `rag-blueprint` namespace
- Creates the NGC secret for image pulling
- Adds the NVIDIA Helm repository
- Deploys the Milvus vector database (standalone mode)
- Deploys the RAG query server (port 8081)
- Deploys the RAG ingest server (port 8082)
- Configures the connection to the existing embedding NIM

Expected duration: 10-15 minutes
Run the verification script:

```bash
./verify-rag-deployment.sh
```

Expected output:

```
✅ Namespace 'rag-blueprint' exists
✅ Milvus - 1/1 pods running
✅ RAG Query Server - 2/2 pods running
✅ RAG Ingest Server - 1/1 pods running
✅ Ingest Server Health - Endpoint responding
✅ Query Server Health - Endpoint responding
```

If services are still starting, wait a few minutes and re-run.
Navigate to the scripts directory and run the ingestion script:

```bash
cd ../../scripts
chmod +x setup_tariff_rag_enterprise.sh
./setup_tariff_rag_enterprise.sh
```

What this does:

- Sets up a port-forward to the RAG ingest service (if running locally)
- Creates the `us_tariffs` collection in Milvus
- Uploads and processes all 99 tariff PDF chapters
- Runs test queries to verify functionality

Expected duration: 20-30 minutes (depending on PDF size)
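For reference, a single-document upload to the ingest server might be scripted roughly as below. Treat this as a hedged sketch: the `/v1/documents` path, the multipart field names, and the metadata shape are assumptions for illustration, not the confirmed blueprint API, so check the ingest server's API docs before relying on them.

```python
import json

def build_ingest_request(pdf_path: str, collection: str = "us_tariffs") -> dict:
    """Assemble the pieces of a (hypothetical) multipart upload to the
    RAG ingest server. The URL path and field names are illustrative
    assumptions, not a documented contract."""
    return {
        # Reachable locally via: kubectl port-forward svc/rag-ingest-server 8082:8082
        "url": "http://localhost:8082/v1/documents",
        "files": {"file": pdf_path},
        "data": {
            "collection_name": collection,
            "metadata": json.dumps({"source": pdf_path}),
        },
    }

req = build_ingest_request("chapters/chapter_85.pdf")
```

A loop over the 99 chapter PDFs would then pass each request to `requests.post(req["url"], files=..., data=req["data"])`, which is essentially what the enterprise setup script automates.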
Update your agent backend to use the RAG service by editing `infrastructure/kubernetes/agent-deployment.yaml`:

```yaml
env:
  - name: RAG_SERVER_URL
    value: "http://rag-query-server.rag-blueprint.svc.cluster.local:8081/v1"
  - name: RAG_COLLECTION
    value: "us_tariffs"
```

Apply the changes:

```bash
cd ../infrastructure/kubernetes
kubectl apply -f agent-deployment.yaml
kubectl rollout restart deployment/aiq-agent-backend -n aiq-agent
```

The backend can then query the RAG server:

```python
import requests

RAG_SERVER_URL = "http://rag-query-server.rag-blueprint.svc.cluster.local:8081/v1"
COLLECTION_NAME = "us_tariffs"

def query_tariff_rag(user_query: str) -> dict:
    """Query the tariff RAG collection."""
    response = requests.post(
        f"{RAG_SERVER_URL}/generate",
        json={
            "messages": [{"role": "user", "content": user_query}],
            "use_knowledge_base": True,
            "enable_citations": True,
            "collection_name": COLLECTION_NAME,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```
Example queries to try:

- Replacement Batteries: "What is the tariff for replacement batteries for a Raritan remote management card?"
- Food Items: "What's the tariff of Reese's Pieces?"
- Used Electronics: "Tariff of a replacement Roomba vacuum motherboard, used"
- Specific HTS Codes: "What items are covered under HTS code 8507.60?"
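A small helper is handy for presenting results with their sources. The response schema used here (an `answer` string plus a `citations` list with `document_name` entries) is an assumption about the generate endpoint's output, so adjust the keys to the actual payload:

```python
def format_answer(response: dict) -> str:
    """Render a RAG response as answer text plus numbered sources.
    The 'answer'/'citations'/'document_name' keys are assumed, not confirmed."""
    lines = [response.get("answer", "")]
    for i, cite in enumerate(response.get("citations", []), start=1):
        lines.append(f"[{i}] {cite.get('document_name', 'unknown')}")
    return "\n".join(lines)

sample = {
    "answer": "Duty rate: 3.4%",
    "citations": [{"document_name": "chapter_85.pdf"}],
}
print(format_answer(sample))
```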
Helm values file for customizing the RAG Blueprint deployment:
Key configurations:
- `milvus.standalone.persistence.size: 100Gi` - vector storage size
- `embeddings.externalService.host` - points to your embedding NIM
- `queryServer.replicas: 2` - query server scaling
- `ingestServer.resources.limits."nvidia.com/gpu": 1` - GPU for PDF processing
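Put together, a minimal `nvidia-rag-values.yaml` might look like the following. The key paths mirror the list above, but the exact schema depends on the chart version, and the embedding service hostname shown is an assumption, so treat this as a sketch:

```yaml
milvus:
  standalone:
    persistence:
      size: 100Gi                # Vector storage size
embeddings:
  externalService:
    host: embedding-nim.nim.svc.cluster.local   # Assumed NIM service name
    port: 8000
queryServer:
  replicas: 2                    # Query server scaling
ingestServer:
  resources:
    limits:
      nvidia.com/gpu: 1          # GPU for PDF processing
```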
Kubernetes manifests for deploying RAG services individually (if Helm chart unavailable):
- RAG Query Server Deployment & Service
- RAG Ingest Server Deployment & Service
- Environment variables for Milvus and embedding connections
Check pod status:

```bash
kubectl get pods -n rag-blueprint
kubectl get pods -n rag-blueprint -w   # Watch mode
```

List services:

```bash
kubectl get svc -n rag-blueprint
```

Tail logs:

```bash
# RAG Query Server
kubectl logs -n rag-blueprint -l app=rag-query-server -f

# RAG Ingest Server
kubectl logs -n rag-blueprint -l app=rag-ingest-server -f

# Milvus
kubectl logs -n rag-blueprint -l app.kubernetes.io/name=milvus -f
```

Port-forward for local access:

```bash
# Ingest Server (for running ingestion locally)
kubectl port-forward -n rag-blueprint svc/rag-ingest-server 8082:8082

# Query Server (for testing queries locally)
kubectl port-forward -n rag-blueprint svc/rag-query-server 8081:8081
```

Health checks:

```bash
# Ingest server
curl http://localhost:8082/health

# Query server
curl http://localhost:8081/health
```

Test a query:

```bash
curl -X POST http://localhost:8081/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the tariff for computers?"}],
    "use_knowledge_base": true,
    "enable_citations": true,
    "collection_name": "us_tariffs"
  }'
```

Scale the query server:

```bash
kubectl scale deployment rag-query-server -n rag-blueprint --replicas=5
```

To increase Milvus resources, edit `nvidia-rag-values.yaml`:
```yaml
milvus:
  standalone:
    persistence:
      size: 500Gi        # Increase storage
    resources:
      limits:
        cpu: "8"
        memory: "16Gi"
```

Re-apply:

```bash
helm upgrade nvidia-rag nvidia/rag-blueprint \
  --namespace rag-blueprint \
  --values nvidia-rag-values.yaml
```

Monitor resource usage:

```bash
kubectl top pods -n rag-blueprint
kubectl describe node | grep -A 5 "Allocated resources"
```

Check events:

```bash
kubectl get events -n rag-blueprint --sort-by='.lastTimestamp'
```

Common issues:
- ImagePullBackOff: NGC secret not configured correctly.

  ```bash
  kubectl get secret -n rag-blueprint ngc-secret
  # Recreate if needed, then re-run the deploy script
  kubectl delete secret -n rag-blueprint ngc-secret
  ```

- Pending (no GPU): Karpenter hasn't provisioned GPU nodes yet.

  ```bash
  kubectl get nodes -l karpenter.sh/provisioner-name=nvidia-nim
  # Wait 5-10 minutes for Karpenter to provision
  ```

- OOMKilled: Increase memory limits in the values file.
Test connectivity:

```bash
kubectl port-forward -n rag-blueprint svc/rag-ingest-server 8082:8082
curl http://localhost:8082/health
```

Check logs:

```bash
kubectl logs -n rag-blueprint -l app=rag-ingest-server -f
```

Common issues:

- Connection refused: Service not ready yet; wait a few minutes.
- Embedding service unreachable: Check that the embedding NIM is running:

  ```bash
  kubectl get pods -n nim -l app=embedding-nim
  ```

- Out of disk space: Increase `persistence.size` in the values file.
Verify the collection exists:

```bash
# Connect to Milvus
kubectl port-forward -n rag-blueprint svc/milvus-standalone 19530:19530
# Use the Milvus CLI or Python SDK to list collections
```

Check the document count:

```bash
kubectl logs -n rag-blueprint -l app=rag-ingest-server | grep "ingested"
```

Re-ingest if needed:

```bash
cd scripts
./setup_tariff_rag_enterprise.sh
```

The NVIDIA RAG Blueprint uses hybrid search, combining:
- Vector Search (Semantic)
  - Finds conceptually similar documents
  - Example: "replacement batteries" → matches "rechargeable cells", "power supplies"
- Keyword Search (BM25)
  - Exact term matching
  - Example: "HTS 8507.60" → matches exact tariff codes
- Hybrid Ranking
  - Combines both scores with learned weights
  - Well suited for tariff queries (codes + descriptions)
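The ranking idea can be sketched as a weighted score fusion. This is an illustration only: the blueprint learns its weights, while the sketch below fixes the vector-score weight `alpha` and uses made-up, pre-normalized scores.

```python
def hybrid_score(vector_score: float, bm25_score: float, alpha: float = 0.7) -> float:
    """Fuse a normalized vector-similarity score with a normalized BM25
    score. alpha weights the semantic side; 1 - alpha the keyword side."""
    return alpha * vector_score + (1 - alpha) * bm25_score

# Illustrative candidate chunks for "replacement batteries" (scores invented)
candidates = {
    "chapter_85_batteries": {"vec": 0.92, "bm25": 0.40},
    "chapter_85_hts_8507":  {"vec": 0.55, "bm25": 0.95},
    "chapter_84_machinery": {"vec": 0.30, "bm25": 0.10},
}

# Rank by fused score: semantic matches win unless keyword evidence dominates
ranked = sorted(
    candidates,
    key=lambda doc: hybrid_score(candidates[doc]["vec"], candidates[doc]["bm25"]),
    reverse=True,
)
```

A query like "HTS 8507.60" would flip the ordering: its near-zero semantic spread lets the exact BM25 match on the code dominate the fused score.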
AWS Resources:

- EBS Volumes:
  - Milvus: 100Gi gp3 (~$8/month)
  - etcd: 10Gi gp3 (~$0.80/month)
  - Documents: 50Gi gp3 (~$4/month)
- GPU Nodes (for the ingest server):
  - g5.xlarge: ~$1.00/hour (on-demand)
  - Karpenter auto-scales, so you only pay while ingesting
- CPU Nodes (for the query server):
  - c6i.2xlarge: ~$0.34/hour
  - 2 replicas for HA
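The line items above fold into a quick back-of-the-envelope monthly estimate. The figures are the approximate rates listed above (real pricing varies by region and over time), and the 10 GPU hours of ingestion per month is an assumption:

```python
HOURS_PER_MONTH = 730  # average hours in a month

ebs_monthly = 8.0 + 0.80 + 4.0               # Milvus + etcd + document volumes
query_monthly = 2 * 0.34 * HOURS_PER_MONTH   # 2 c6i.2xlarge replicas, running 24/7
ingest_hours = 10                            # assumed GPU hours of ingestion/month
ingest_monthly = 1.00 * ingest_hours         # g5.xlarge on-demand

total = ebs_monthly + query_monthly + ingest_monthly
print(f"~${total:,.2f}/month")
```

The always-on query replicas dominate, which is why the optimization tips below focus on Spot instances and scaling replicas down off-peak.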
Optimization Tips:
- Use Spot instances for Karpenter provisioner (50-70% savings)
- Scale query server replicas down during low-traffic periods
- Use Fargate for non-GPU workloads
- Network Policies: Restrict traffic to RAG services.

  ```bash
  # Apply network policy (create this file)
  kubectl apply -f infrastructure/helm/rag-network-policy.yaml
  ```

- RBAC: Limit service account permissions (already configured in `nvidia-rag-values.yaml`).
- Secrets Management: Use AWS Secrets Manager for the NGC API key.

  ```bash
  # Store in Secrets Manager
  aws secretsmanager create-secret \
    --name ngc-api-key \
    --secret-string "$NGC_API_KEY"
  # Use External Secrets Operator to sync to Kubernetes
  ```

- Encrypt PVCs: Enable EBS encryption.

  ```yaml
  # In values file
  milvus:
    standalone:
      persistence:
        storageClass: gp3-encrypted
  ```
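The contents of `rag-network-policy.yaml` are not shown in this guide; one possible shape, assuming the agent backend runs in an `aiq-agent` namespace, is the following sketch, which admits only agent-to-query-server traffic on port 8081:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rag-allow-agent
  namespace: rag-blueprint
spec:
  podSelector:
    matchLabels:
      app: rag-query-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: aiq-agent
      ports:
        - protocol: TCP
          port: 8081
```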
To remove the RAG Blueprint:

```bash
cd infrastructure/helm

# Delete the Helm release (if using the Helm chart)
helm uninstall nvidia-rag -n rag-blueprint

# Or delete manifests (if deployed manually)
kubectl delete -f rag-services.yaml -n rag-blueprint
helm uninstall milvus -n rag-blueprint

# Delete the namespace (this removes all resources)
kubectl delete namespace rag-blueprint
```

Note: Persistent volumes may need manual cleanup:

```bash
kubectl get pvc -n rag-blueprint
kubectl delete pvc --all -n rag-blueprint
```

✅ RAG Blueprint Deployed: Milvus + RAG servers running
✅ Tariff PDFs Ingested: 99 chapters in us_tariffs collection
✅ Agent Configured: Backend points to RAG service
Now you can:
- Test complex tariff queries in the UI
- Integrate RAG into your UDF dynamic strategies
- Add more document collections (e.g., customs regulations, trade agreements)
- Monitor and scale based on usage patterns
- NVIDIA NIM Documentation: https://docs.nvidia.com/nim/
- Milvus Documentation: https://milvus.io/docs
- NeMo Retriever: https://docs.nvidia.com/nemo/retriever/
- This Project's README: `/README.md`
Congratulations! You now have an enterprise-grade RAG system powered by NVIDIA blueprints running on AWS EKS. 🎉