Enable PDF extraction with Nemoretriever Parse

For enhanced PDF extraction capabilities, you can use the Nemoretriever Parse service. This service provides improved PDF parsing and structure understanding compared to the default PDF extraction method.

Using Docker Compose

Using On-Prem Models

Follow steps outlined in the quickstart guide till step 4. Deploy all the deployed NIMs for ingestion.

Deploy the Nemoretriever Parse service along with other required NIMs:

USERID=$(id -u) docker compose --profile rag --profile nemoretriever-parse -f deploy/compose/nims.yaml up -d

Configure the ingestor-server to use Nemoretriever Parse by setting the environment variable:
```
export APP_NVINGEST_PDFEXTRACTMETHOD=nemoretriever_parse
```
Deploy the ingestion-server and rag-server containers following the remaining steps in the quickstart guide.
You can now ingest PDF files using the ingestion API usage notebook.

Using NVIDIA Hosted API Endpoints

Follow steps outlined in the quickstart guide till step 2. Export the following variables to use nemoretriever parse API endpoints:

export NEMORETRIEVER_PARSE_HTTP_ENDPOINT=https://integrate.api.nvidia.com/v1/chat/completions
export NEMORETRIEVER_PARSE_MODEL_NAME=nvidia/nemoretriever-parse
export NEMORETRIEVER_PARSE_INFER_PROTOCOL=http

Configure the ingestor-server to use Nemoretriever Parse by setting the environment variable:
```
export APP_NVINGEST_PDFEXTRACTMETHOD=nemoretriever_parse
```
Deploy the ingestion-server and rag-server containers following the remaining steps in the quickstart guide.
You can now ingest PDF files using the ingestion API usage notebook.

Note

When using NVIDIA hosted endpoints, you may encounter rate limiting with larger file ingestions (>10 files).

Using Helm

To enable PDF extraction with Nemoretriever Parse using Helm, you need to enable the Nemoretriever Parse service along with other required services:

helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-rag-v2.2.0.tgz \
  --username '$oauthtoken' \
  --password "${NGC_API_KEY}" \
  --set nim-llm.enabled=true \
  --set nvidia-nim-llama-32-nv-embedqa-1b-v2.enabled=true \
  --set text-reranking-nim.enabled=true \
  --set ingestor-server.enabled=true \
  --set ingestor-server.nv-ingest.nemoretriever-page-elements-v2.deployed=true \
  --set ingestor-server.nv-ingest.nemoretriever-graphic-elements-v1.deployed=true \
  --set ingestor-server.nv-ingest.nemoretriever-table-structure-v1.deployed=true \
  --set ingestor-server.nv-ingest.paddleocr-nim.deployed=true \
  --set ingestor-server.nv-ingest.nim-vlm-text-extraction.deployed=true \
  --set ingestor-server.envVars.APP_NVINGEST_PDFEXTRACTMETHOD="nemoretriever_parse" \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY

Limitations and Requirements

When using Nemoretriever Parse for PDF extraction, please note the following:

Nemoretriever Parse only supports PDF format documents. Attempting to process non-PDF files will result in extraction errors.
The service requires GPU resources. Make sure you have sufficient GPU resources available before enabling this feature.
The extraction quality may vary depending on the PDF structure and content.
Nemoretriever Parse is currently not supported on NVIDIA B200 GPUs.

For detailed information about hardware requirements and supported GPUs for all NeMo Retriever extraction NIMs, refer to the NeMo Retriever Extraction Support Matrix.

Available PDF Extraction Methods

The APP_NVINGEST_PDFEXTRACTMETHOD environment variable supports the following values:

nemoretriever_parse: Uses the Nemoretriever Parse service for enhanced PDF extraction
pdfium: Uses the default PDFium-based extraction
None: Uses the default extraction method

Note

The Nemoretriever Parse service requires GPU resources. Make sure you have sufficient GPU resources available before enabling this feature.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable PDF extraction with Nemoretriever Parse

Using Docker Compose

Using On-Prem Models

Using NVIDIA Hosted API Endpoints

Using Helm

Limitations and Requirements

Available PDF Extraction Methods

FilesExpand file tree

nemoretriever-parse-extraction.md

Latest commit

History

nemoretriever-parse-extraction.md

File metadata and controls

Enable PDF extraction with Nemoretriever Parse

Using Docker Compose

Using On-Prem Models

Using NVIDIA Hosted API Endpoints

Using Helm

Limitations and Requirements

Available PDF Extraction Methods