Use the following documentation to get started with the NVIDIA RAG Blueprint.
- Obtain an API Key
- Interact using native python APIs
- Deploy With Docker Compose
- Deploy With Helm Chart
- Data Ingestion
You need to generate an API key to access NIM services, to access models hosted in the NVIDIA API Catalog, and to download models on-premises. For more information, refer to NGC API Keys.
To generate an API key, use the following procedure.
- Go to https://org.ngc.nvidia.com/setup/api-keys.
- Click + Generate Personal Key.
- Enter a Key Name.
- For Services Included, select NGC Catalog and Public API Endpoints.
- Click Generate Personal Key.
After you generate your key, export your key as an environment variable by using the following code.
export NGC_API_KEY="<your-ngc-api-key>"You can interact with and deploy the NVIDIA RAG Blueprint directly from Python using the provided Jupyter notebook. This approach is ideal for users who prefer a programmatic interface for setup, ingestion, and querying, or for those who want to automate workflows and integrate with other Python tools.
- Notebook: rag_library_usage.ipynb
The notebook demonstrates environment setup, document ingestion, collection management, and querying using the NVIDIA RAG Python client. Follow the cells in the notebook for an end-to-end example of using the RAG system natively from Python.
Use these procedures to deploy with Docker Compose for a single node deployment. Alternatively, you can Deploy With Helm Chart to deploy on a Kubernetes cluster.
Developers need to deploy ingestion services and rag services using seperate dedicated docker compose files. For both retrieval and ingestion services, by default all the models are deployed on-prem. Follow relevant section below as per your requirement and hardware availability.
- Start the Microservices
-
Install Docker Engine. For more information, see Ubuntu.
-
Install Docker Compose. For more information, see install the Compose plugin.
a. Ensure the Docker Compose plugin version is 2.29.1 or later.
b. After you get the Docker Compose plugin installed, run
docker compose versionto confirm. -
To pull images required by the blueprint from NGC, you must first authenticate Docker with nvcr.io. Use the NGC API Key you created in Obtain an API Key.
export NGC_API_KEY="nvapi-..." echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
-
Some containers with are enabled with GPU acceleration, such as Milvus and NVIDIA NIMS deployed on-prem. To configure Docker for GPU-accelerated containers, install, the NVIDIA Container Toolkit
-
Ensure you meet the hardware requirements if you are deploying models on-prem.
Use the following procedure to start all containers needed for this blueprint. This launches the ingestion services followed by the rag services and all of its dependent NIMs on-prem.
-
Fulfill the prerequisites. Ensure you meet the hardware requirements.
-
Create a directory to cache the models and export the path to the cache as an environment variable.
mkdir -p ~/.cache/model-cache export MODEL_DIRECTORY=~/.cache/model-cache
-
Export all the required environment variables to use on-prem models. Ensure the section
Endpoints for using cloud NIMsis commented in this file.source deploy/compose/.env[Optional]: Check out the guidance here for some [best practices] for choosing configurations(./accuracy_perf.md). To turn on recommended configs for accuracy optimized profile set additional configs:
source deploy/compose/accuracy_profile.envTo turn on recommended configs for perf optimized profile set additional configs:
source deploy/compose/perf_profile.env -
Start all required NIMs.
Before running the command please ensure the GPU allocation is done appropriately in the deploy/compose/.env. You might need to override them for the hardware you are deploying this blueprint on. The default assumes you are deploying this on a 2XH100 environment.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d-
Wait till the
nemoretriever-ranking-ms,nemoretriever-embedding-msandnim-llm-msNIMs are in healthy state before proceeding further. -
The nemo LLM service may take upto 30 mins to start for the first time as the model is downloaded and cached. The models are downloaded and cached in the path specified by
MODEL_DIRECTORY. Subsequent deployments will take 2-5 mins to startup based on the GPU profile. -
The default configuration allocates one GPU (GPU ID 1) to
nim-llm-mswhich defaults to minimum GPUs needed for H100 or B200 profile. If you are deploying the solution on A100, please allocate 2 available GPUs by exporting below env variable before launching:export LLM_MS_GPU_ID=1,2 -
To start just the NIMs specific to rag or ingestion add the
--profile ragor--profile ingestflag to the command. -
Ensure all the below are running before proceeding further
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'NAMES STATUS nemoretriever-ranking-ms Up 14 minutes (healthy) compose-page-elements-1 Up 14 minutes compose-paddle-1 Up 14 minutes compose-graphic-elements-1 Up 14 minutes compose-table-structure-1 Up 14 minutes nemoretriever-embedding-ms Up 14 minutes (healthy) nim-llm-ms Up 14 minutes (healthy)
-
-
Start the vector db containers from the repo root.
docker compose -f deploy/compose/vectordb.yaml up -d
[!TIP] By default GPU accelerated Milvus DB is deployed. You can choose the GPU ID to be allocated using below env variable.
VECTORSTORE_GPU_DEVICE_ID=0
For B200 and A100 GPUs, use Milvus CPU indexing due to known retrieval accuracy issues with Milvus GPU indexing and search. Export following environment variables to disable Milvus GPU ingexing and search.
export APP_VECTORSTORE_ENABLEGPUSEARCH=False export APP_VECTORSTORE_ENABLEGPUINDEX=False
-
Start the ingestion containers from the repo root. This pulls the prebuilt containers from NGC and deploys it on your system.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
[!TIP] You can add a
--buildargument in case you have made some code changes or have any requirement of re-building ingestion containers from source code:docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
-
Start the rag containers from the repo root. This pulls the prebuilt containers from NGC and deploys it on your system.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
[!TIPS] You can add a
--buildargument in case you have made some code changes or have any requirement of re-building containers from source code:docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
You can check the status of the rag-server and its dependencies by issuing this curl command
curl -X 'GET' 'http://workstation_ip:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
-
Confirm all the below mentioned containers are running.
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"Example Output
NAMES STATUS compose-nv-ingest-ms-runtime-1 Up 5 minutes (healthy) ingestor-server Up 5 minutes compose-redis-1 Up 5 minutes rag-playground Up 9 minutes rag-server Up 9 minutes milvus-standalone Up 36 minutes milvus-minio Up 35 minutes (healthy) milvus-etcd Up 35 minutes (healthy) nemoretriever-ranking-ms Up 38 minutes (healthy) compose-page-elements-1 Up 38 minutes compose-paddle-1 Up 38 minutes compose-graphic-elements-1 Up 38 minutes compose-table-structure-1 Up 38 minutes nemoretriever-embedding-ms Up 38 minutes (healthy) nim-llm-ms Up 38 minutes (healthy) -
Open a web browser and access
http://localhost:8090to use the RAG Playground. You can use the upload tab to ingest files into the server or follow the notebooks to understand the API usage. -
To stop all running services, after making some customizations
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml down docker compose -f deploy/compose/nims.yaml down docker compose -f deploy/compose/docker-compose-rag-server.yaml down docker compose -f deploy/compose/vectordb.yaml down
📝 Notes:
-
A single NVIDIA A100-80GB or H100-80GB, B200 GPU can be used to start non-LLM NIMs (nemoretriever-embedding-ms, nemoretriever-ranking-ms, and ingestion services like page-elements, paddle, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in
deploy/compose/.envfile before launching:EMBEDDING_MS_GPU_ID=0 RANKING_MS_GPU_ID=0 YOLOX_MS_GPU_ID=0 YOLOX_GRAPHICS_MS_GPU_ID=0 YOLOX_TABLE_MS_GPU_ID=0 PADDLE_MS_GPU_ID=0
-
If the NIMs are deployed in a different workstation or outside the nvidia-rag docker network on the same system, replace the host address of the below URLs with workstation IPs.
APP_EMBEDDINGS_SERVERURL="workstation_ip:8000" APP_LLM_SERVERURL="workstation_ip:8000" APP_RANKING_SERVERURL="workstation_ip:8000" PADDLE_GRPC_ENDPOINT="workstation_ip:8001" YOLOX_GRPC_ENDPOINT="workstation_ip:8001" YOLOX_GRAPHIC_ELEMENTS_GRPC_ENDPOINT="workstation_ip:8001" YOLOX_TABLE_STRUCTURE_GRPC_ENDPOINT="workstation_ip:8001"
-
Due to react limitations, any changes made to below environment variables will require developers to rebuilt the rag containers. This will be fixed in a future release.
# Model name for LLM NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-70b-instruct} # Model name for embeddings NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nv-embedqa-1b-v2} # Model name for reranking NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nvidia/llama-3.2-nv-rerankqa-1b-v2} # URL for rag server container NEXT_PUBLIC_CHAT_BASE_URL: "http://rag-server:8081/v1" # URL for ingestor container NEXT_PUBLIC_VDB_BASE_URL: "http://ingestor-server:8082/v1"
-
Verify that you meet the prerequisites.
-
Open
deploy/compose/.envand uncomment the sectionEndpoints for using cloud NIMs. Then set the environment variables by executing below command.source deploy/compose/.env[Optional]: Check out the guidance here for some [best practices] for choosing configurations(./accuracy_perf.md). To turn on recommended configs for accuracy optimized profile set additional configs:
source deploy/compose/accuracy_profile.envTo turn on recommended configs for perf optimized profile set additional configs:
source deploy/compose/perf_profile.env📝 Note: When using NVIDIA hosted endpoints, you may encounter rate limiting with larger file ingestions (>10 files).
-
Start the vector db containers from the repo root.
docker compose -f deploy/compose/vectordb.yaml up -d
[!NOTE] If you don't have a GPU available, you can switch to CPU-only Milvus by following the instructions in milvus-configuration.md.
[!TIP] For B200 and A100 GPUs, use Milvus CPU indexing due to known retrieval accuracy issues with Milvus GPU indexing and search. Export following environment variables to disable Milvus GPU ingexing and search.
export APP_VECTORSTORE_ENABLEGPUSEARCH=False export APP_VECTORSTORE_ENABLEGPUINDEX=False
-
Start the ingestion containers from the repo root. This pulls the prebuilt containers from NGC and deploys it on your system.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
[!TIP] You can add a
--buildargument in case you have made some code changes or have any requirement of re-building ingestion containers from source code:docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
-
Start the rag containers from the repo root. This pulls the prebuilt containers from NGC and deploys it on your system.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
[!TIP] You can add a
--buildargument in case you have made some code changes or have any requirement of re-building containers from source code:docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
You can check the status of the rag-server and its dependencies by issuing this curl command
curl -X 'GET' 'http://workstation_ip:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
-
Confirm all the below mentioned containers are running.
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"Example Output
NAMES STATUS compose-nv-ingest-ms-runtime-1 Up 5 minutes (healthy) ingestor-server Up 5 minutes compose-redis-1 Up 5 minutes rag-playground Up 9 minutes rag-server Up 9 minutes milvus-standalone Up 36 minutes milvus-minio Up 35 minutes (healthy) milvus-etcd Up 35 minutes (healthy) -
Open a web browser and access
http://localhost:8090to use the RAG Playground. You can use the upload tab to ingest files into the server or follow the notebooks to understand the API usage. -
To stop all running services, after making some customizations
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml down docker compose -f deploy/compose/docker-compose-rag-server.yaml down docker compose -f deploy/compose/vectordb.yaml down
Use these procedures to deploy with Helm Chart to deploy on a Kubernetes cluster. Alternatively, you can Deploy With Docker Compose for a single node deployment.
-
Verify that you meet the prerequisites.
-
Verify that you meet the hardware requirements.
- The total GPU requirement for deploying this chart is as follows:
- 8xH100-80GB
- 9xA100-80GB
- 8xB200
- The total GPU requirement for deploying this chart is as follows:
-
Verify that you have the NGC CLI available on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.
-
Verify that you have Kubernetes installed and running Ubuntu 22.04. For more information, see Kubernetes documentation and NVIDIA Cloud Native Stack repository.
-
Verify that you have a default storage class available in the cluster for PVC provisioning. One option is the local path provisioner by Rancher. Refer to the installation section of the README in the GitHub repository.
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.26/deploy/local-path-storage.yaml kubectl get pods -n local-path-storage kubectl get storageclass
-
If the local path storage class is not set as default, it can be made default by using the following command.
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' -
Verify that you have installed the NVIDIA GPU Operator following steps here.
-
Optionally you can also enable time slicing for sharing GPUs between pods. Refer to this on GPU operator user guide for more details.
- RAG server
- Ingestor server
- NV-Ingest
Export NGC API Key Refer to this for obtaining the API Key.
export NGC_API_KEY="nvapi-*"Change directory to deploy/helm/
cd deploy/helm/Create a namespace for the deployment:
kubectl create namespace ragRun the following command to install the RAG server with the Ingestor Server and Frontend enabled:
helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \
--username '$oauthtoken' \
--password "${NGC_API_KEY}" \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY[!NOTE]
For B200 based deployment, you need to add an environment variable to the nvidia-nim-llama-32-nv-embedqa-1b-v2 deployment.
Due to a known issue in the chart, we currently need to manually edit the deployment and add this env variable.
- Edit the embedding deployment with the command below.
kubectl edit deployment 'rag-nvidia-nim-llama-32-nv-embedqa-1b-v2' -n rag- Add the env variable
NIM_TRT_ENGINE_HOST_CODE_ALLOWEDand set its value to1.
spec:
containers:
- env:
- name: NIM_CACHE_PATH
value: /opt/nim/.cache
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api
- name: NIM_TRT_ENGINE_HOST_CODE_ALLOWED
value: "1"- Delete the NIM Embedding pod for the changes in the deployment to reflect.
kubectl delete pod <embedqa-pod-name> -n rag[!TIP] For B200 and A100 GPUs, it is recommended to use CPU indexing and search for better response. You can set this by either:
- Using helm set command:
helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \
--username '$oauthtoken' \
--password "${NGC_API_KEY}" \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
--set ingestor-server.envVars.APP_VECTORSTORE_ENABLEGPUINDEX=False \
--set ingestor-server.envVars.APP_VECTORSTORE_ENABLEGPUSEARCH=False- Or by modifying values.yaml:
ingestor-server:
envVars:
APP_VECTORSTORE_ENABLEGPUINDEX: "False"
APP_VECTORSTORE_ENABLEGPUSEARCH: "False"Note: If you're using the source Helm chart, make these changes in
deploy/helm/rag-server/values.yaml.
Note
To deploy the Helm chart within 4xH100 using MIG slicing, see RAG Deployment with MIG Support.
Follow this section only if you are working directly with the source Helm chart and want to customize components individually.
helm repo add nvidia-nim https://helm.ngc.nvidia.com/nim/nvidia/ --username='$oauthtoken' --password=$NGC_API_KEY
helm repo add nim https://helm.ngc.nvidia.com/nim/ --username='$oauthtoken' --password=$NGC_API_KEY
helm repo add nemo-microservices https://helm.ngc.nvidia.com/nvidia/nemo-microservices --username='$oauthtoken' --password=$NGC_API_KEY
helm repo add baidu-nim https://helm.ngc.nvidia.com/nim/baidu --username='$oauthtoken' --password=$NGC_API_KEYhelm dependency update rag-server/charts/ingestor-server
helm dependency update rag-serverhelm upgrade --install rag -n rag rag-server/ \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEYIf you wish to switch the LLM for example to Llama-3.1-8b-instruct in this case, refer to the steps below.
Update the following in values.yaml and patch the deployment following the steps from here.
env:
# ... existing code ...
# LLM Model Config
APP_LLM_MODELNAME: "meta/llama-3.1-8b-instruct"
nim-llm:
# ... existing code ...
image:
repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
pullPolicy: IfNotPresent
tag: "1.3.3"
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
model:
ngcApiKey: ""
modelName: "meta/llama-3.1-8b-instruct"kubectl get pods -n ragNAME READY STATUS RESTARTS AGE
ingestor-server-7bcff75fbb-s655f 1/1 Running 0 23m
nv-ingest-paddle-0 1/1 Running 0 23m
rag-etcd-0 1/1 Running 0 23m
rag-frontend-5d6c6dc4bd-5xpcw 1/1 Running 0 23m
rag-milvus-standalone-5f5699dfb6-dzlhr 1/1 Running 3 (23m ago) 23m
rag-minio-f88fb7fd4-29fxk 1/1 Running 0 23m
rag-nemoretriever-graphic-elements-v1-b6d465575-rl66q 1/1 Running 0 23m
rag-nemoretriever-page-elements-v2-596679ff54-z2kkf 1/1 Running 0 23m
rag-nemoretriever-table-structure-v1-748df88f86-z7mwb 1/1 Running 0 23m
rag-nim-llm-0 1/1 Running 0 23m
rag-nv-ingest-75cdb75c48-kbr7r 1/1 Running 0 23m
rag-nvidia-nim-llama-32-nv-embedqa-1b-v2-5b6dc664d8-8flpd 1/1 Running 0 23m
rag-opentelemetry-collector-558b89885-c7c8j 1/1 Running 0 23m
rag-redis-master-0 1/1 Running 0 23m
rag-redis-replicas-0 1/1 Running 0 23m
rag-server-7758bbf9bd-rw2wh 1/1 Running 0 23m
rag-text-reranking-nim-74c5f499cd-clcdg 1/1 Running 0 23m
rag-zipkin-5dc8d6d977-nqvvc 1/1 Running 0 23mNote: It takes around 5 minutes for all pods to come up. Check K8s events using
kubectl get events -n ragkubectl get svc -n ragNAME TYPE EXTERNAL-IP PORT(S) AGE
ingestor-server ClusterIP <none> 8082/TCP 26m
kubernetes ClusterIP <none> 443/TCP 4d20h
nemo-embedding-ms ClusterIP <none> 8000/TCP 26m
nemo-ranking-ms ClusterIP <none> 8000/TCP 26m
nemoretriever-graphic-elements-v1 ClusterIP <none> 8000/TCP,8001/TCP 26m
nemoretriever-page-elements-v2 ClusterIP <none> 8000/TCP,8001/TCP 26m
nemoretriever-table-structure-v1 ClusterIP <none> 8000/TCP,8001/TCP 26m
nim-llm ClusterIP <none> 8000/TCP 26m
nim-llm-sts ClusterIP <none> 8000/TCP 26m
nv-ingest-paddle ClusterIP <none> 8000/TCP,8001/TCP 26m
nv-ingest-paddle-sts ClusterIP <none> 8000/TCP,8001/TCP 26m
rag-etcd ClusterIP <none> 2379/TCP,2380/TCP 26m
rag-etcd-headless ClusterIP <none> 2379/TCP,2380/TCP 26m
rag-frontend NodePort <none> 3000:31645/TCP 26m
rag-milvus ClusterIP <none> 19530/TCP,9091/TCP 26m
rag-minio ClusterIP <none> 9000/TCP 26m
rag-nv-ingest ClusterIP <none> 7670/TCP 26m
rag-opentelemetry-collector ClusterIP <none> 6831/UDP,14250/TCP,14268/TCP,4317/TCP,4318/TCP,9411/TCP 26m
rag-redis-headless ClusterIP <none> 6379/TCP 26m
rag-redis-master ClusterIP <none> 6379/TCP 26m
rag-redis-replicas ClusterIP <none> 6379/TCP 26m
rag-server ClusterIP <none> 8081/TCP 26m
rag-zipkin ClusterIP <none> 9411/TCP 26mFor patching an existing deployment, modify values.yaml with required changes and run
helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \
--username '$oauthtoken' \
--password "${NGC_API_KEY}" \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY \
-f rag-server/values.yamlTo enable tracing and view the Zipkin or Grafana UI, follow these steps:
-
Modify
values.yaml:Update the
values.yamlfile to enable the OpenTelemetry Collector and Zipkin:env: # ... existing code ... APP_TRACING_ENABLED: "True" # ... existing code ... serviceMonitor: enabled: true opentelemetry-collector: enabled: true # ... existing code ... zipkin: enabled: true kube-prometheus-stack: enabled: true
-
Deploy the Changes:
Redeploy the Helm chart to apply these changes:
helm uninstall rag -n rag helm install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \ --username '$oauthtoken' \ --password "${NGC_API_KEY}" \ --set imagePullSecret.password=$NGC_API_KEY \ --set ngcApiSecret.password=$NGC_API_KEY \ -f rag-server/values.yaml
For detailed information on tracing, refer to observability.
-
Frontend:
Run the following command to port-forward the Forntend service to your local machine:
kubectl port-forward -n rag service/rag-frontend 3000:3000 --address 0.0.0.0
Access the Frontend UI at
http://localhost:3000. -
Zipkin UI:
Run the following command to port-forward the Zipkin service to your local machine:
kubectl port-forward -n rag service/rag-zipkin 9411:9411 --address 0.0.0.0
Access the Zipkin UI at
http://localhost:9411. -
Grafana UI:
Run the following command to port-forward the Grafana service:
kubectl port-forward -n rag service/rag-grafana 3001:80 --address 0.0.0.0
Access the Grafana UI at
http://localhost:3001using the default credentials (admin/admin).-
Upload JSON to Grafana:
- Navigate to the Grafana UI at
http://localhost:3000. - Log in with the default credentials (
admin/admin). - Go to the "Dashboards" section and click on "Import".
- Upload the JSON file located in the
deploy/configdirectory.
- Navigate to the Grafana UI at
-
Configure the Dashboard:
- After uploading, select the data source that the dashboard will use. Ensure that the data source is correctly configured to pull metrics from your Prometheus instance.
-
Save and View:
- Once the dashboard is configured, save it and start viewing your metrics and traces.
-
To uninstall the deployment, run:
helm uninstall rag -n ragRun the following command to install the RAG Server:
helm upgrade --install rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz -n rag \
--username '$oauthtoken' \
--password "${NGC_API_KEY}" \
--set imagePullSecret.password=$NGC_API_KEY \
--set nvidia-nim-llama-32-nv-embedqa-1b-v2.nim.ngcAPIKey=$NGC_API_KEY \
--set text-reranking-nim.nim.ngcAPIKey=$NGC_API_KEY \
--set nim-llm.model.ngcAPIKey=$NGC_API_KEY \
--set ingestor-server.enabled=falseBefore installing the standalone Ingestor Server, update rag-server/charts/ingestor-server/values.yaml to enable the required secrets and the embedding NIM:
nv-ingest:
ngcApiSecret:
create: true
ngcImagePullSecret:
create: true
nvidia-nim-llama-32-nv-embedqa-1b-v2:
deployed: true
envVars:
EMBEDDING_NIM_ENDPOINT: "http://nv-ingest-embedqa:8000/v1"Then run the following command to install (or upgrade) the Ingestor Server:
helm upgrade --install rag rag-server/charts/ingestor-server -n rag \
--set imagePullSecret.password="$NGC_API_KEY" \
--set nv-ingest.ngcImagePullSecret.password="$NGC_API_KEY" \
--set nv-ingest.ngcApiSecret.password="$NGC_API_KEY"-
For enabling persistence for NIM LLM refer to the NIM LLM documentation. Update the required fields in
values.yamlfile fornim-llmsection. -
For enabling persistence for Nemo Retriever embedding Nemo Retriever Text Embedding documentation. Update the required fields in
values.yamlfile fornvidia-nim-llama-32-nv-embedqa-1b-v2section. -
For enabling persistence for Nemo Retriever reranking Nemo Retriever Text Reranking documentation. Update the required fields in
values.yamlfile fortext-reranking-nimsection.
To use a custom Milvus endpoint, you need to update the APP_VECTORSTORE_URL environment variable in the values.yaml file for both the RAG server and the Ingestor server. Follow these steps:
-
Edit
values.yaml:Open the
deploy/helm/rag-server/values.yamlfile and update theAPP_VECTORSTORE_URLandMINIO_ENDPOINTfor both the RAG server and the Ingestor server sections:env: # ... existing code ... APP_VECTORSTORE_URL: "http://your-custom-milvus-endpoint:19530" MINIO_ENDPOINT: "http://your-custom-minio-endpoint:9000" # ... existing code ... ingestor-server: env: # ... existing code ... APP_VECTORSTORE_URL: "http://your-custom-milvus-endpoint:19530" MINIO_ENDPOINT: "http://your-custom-minio-endpoint:9000" # ... existing code ... nv-ingest: envVars: # ... existing code ... MINIO_INTERNAL_ADDRESS: "http://your-custom-minio-endpoint:9000" # ... existing code ...
-
Disable Milvus Deployment:
Set
milvusDeployed: falsein theingestor-server.nv-ingestsection to prevent deploying the default Milvus instance:ingestor-server: nv-ingest: # ... existing code ... milvusDeployed: false # ... existing code ...
-
Deploy the Changes:
Redeploy the Helm chart to apply these changes:
helm upgrade rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz -f rag-server/values.yaml -n rag
Currently, Frontend doesn't support dynamic variables. The default variables and values are the following:
name: NEXT_PUBLIC_MODEL_NAME
value: "meta/llama-3.1-70b-instruct"
- name: NEXT_PUBLIC_EMBEDDING_MODEL
value: "nvidia/llama-3.2-nv-embedqa-1b-v2"
- name: NEXT_PUBLIC_RERANKER_MODEL
value: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
- name: NEXT_PUBLIC_CHAT_BASE_URL
value: "http://rag-server:8081/v1"
- name: NEXT_PUBLIC_VDB_BASE_URL
value: "http://ingestor-server:8082/v1"
If you have a plan to customize the RAG server deployment like LLM Model Change then please follow the steps to deploy the Frontend
-
Build the new docker image with updated model name from docker compose
cd ../deploy/composeModify the
imageandargsaccordingly indocker-compose-rag-server.yamlforrag-playgroundserviceExample:
# Sample UI container which interacts with APIs exposed by rag-server container rag-playground: container_name: rag-playground image: <image-registry-with-tag> build: # Set context to repo's root directory context: ../../frontend dockerfile: ./Dockerfile args: # Model name for LLM NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct} # Model name for embeddings NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nv-embedqa-1b-v2} # Model name for reranking NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nvidia/llama-3.2-nv-rerankqa-1b-v2} # URL for rag server container NEXT_PUBLIC_CHAT_BASE_URL: "http://rag-server:8081/v1" # URL for ingestor container NEXT_PUBLIC_VDB_BASE_URL: "http://ingestor-server:8082/v1" ports: - "8090:3000" expose: - "3000" depends_on: - rag-serverRun the below command to create a docker image
docker compose -f docker-compose-rag-server.yaml build --no-cacheOnce docker image has been build to push the image to a docker a registry
-
Run the following command to install the RAG server with the Ingestor Server and New Frontend with updated
<new-image-repository>and<new-image-tag>:helm install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \ --username '$oauthtoken' \ --password "${NGC_API_KEY}" \ --set imagePullSecret.password=$NGC_API_KEY \ --set ngcApiSecret.password=$NGC_API_KEY --set frontend.image.repository='<new-image-repository>' \ --set frontend.image.tag="<new-image-tag>" \ --set frontend.imagePullSecret.password="$NGC_API_KEY"
For troubleshooting issues with Helm deployment, checkout the troubleshooting section here.
[!IMPORTANT] Before you can use this procedure, you must deploy the blueprint by using Deploy With Docker Compose or Deploy With Helm Chart.
-
Download and install Git LFS by following the installation instructions.
-
Initialize Git LFS in your environment.
git lfs install
-
Pull the dataset into the current repo.
git-lfs pull
-
Install jupyterlab.
pip install jupyterlab
-
Use this command to run Jupyter Lab so that you can execute this IPython notebook.
jupyter lab --allow-root --ip=0.0.0.0 --NotebookApp.token='' --port=8889 -
Run the ingestion_api_usage notebook.
Follow the cells in the notebook to ingest the PDF files from the data/dataset folder into the vector store.
Tip
Important Configuration Tip
Check out the best practices guide especially the Vector Store Retriever Consistency Level section to configure the required settings before starting the ingestion/retrieval process based on your use case.
- Change the Inference or Embedding Model
- Customize Prompts
- Customize LLM Parameters at Runtime
- Support Multi-Turn Conversations
- Enable NeMo Guardrails for Content Safety
- Query Across Multiple Collections
- Troubleshoot NVIDIA RAG Blueprint
- Understand latency breakdowns and debug errors using observability services
- Enable Self-Reflection to improve accuracy
- Enable Query rewriting to Improve accuracy of Multi-Turn Conversations
- Enable Image captioning support for ingested documents
- Enable hybrid search for milvus
- Add custom metadata while uploaded documents
- Enable low latency, low compute text only pipeline
- Enable VLM based inferencing in RAG
- Enable PDF extraction with Nemoretriever Parse
- Enable standalone NV-Ingest for direct document ingestion without ingestor server
- Explore best practices for enhancing accuracy or latency
- Explore migration guide if you are migrating from rag v1.0.0 to this version.
⚠️ Important B200 Limitation Notice:B200 GPUs are not supported for the following advanced features:
- Self-Reflection to improve accuracy
- Query rewriting to Improve accuracy of Multi-Turn Conversations
- Image captioning support for ingested documents
- NeMo Guardrails for guardrails at input/output
- VLM based inferencing in RAG
- PDF extraction with Nemoretriever Parse
- Poor retrieval accuracy is observed with Milvus GPU indexing and search.
For these features, please use H100 or A100 GPUs instead.