7 changes: 5 additions & 2 deletions docs/docs/extraction/chunking.md
@@ -48,13 +48,16 @@ The `split` task uses a tokenizer to count the number of tokens in the document,
and splits the document based on the desired maximum chunk size and chunk overlap.
We recommend that you use the `meta-llama/Llama-3.2-1B` tokenizer,
because it's the same tokenizer as the llama-3.2 embedding model that we use for embedding.
However, you can use any tokenizer from any HuggingFace model that includes a tokenizer file.

You can use any tokenizer from a Hugging Face model that provides a tokenizer file. Tokenizers run locally and can be downloaded directly from the [Hugging Face Hub](https://huggingface.co/models).

Use the `split` method to chunk large documents as shown in the following code.

!!! note

The default tokenizer (`meta-llama/Llama-3.2-1B`) requires a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). You must set `hf_access_token": "hf_***` to authenticate.
The default tokenizer (`meta-llama/Llama-3.2-1B`) runs locally and requires a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) for authentication. Set `"hf_access_token": "hf_***"` to provide your token.



```python
ingestor = ingestor.split(
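The token-window strategy that the `split` task applies can be sketched in plain Python. This is an illustration only: a whitespace split stands in for the Hugging Face tokenizer (such as `meta-llama/Llama-3.2-1B`), and the window sizes are examples, not library defaults.

```python
def chunk_tokens(text, chunk_size=8, chunk_overlap=2):
    """Split text into overlapping windows of at most chunk_size tokens.

    A whitespace split stands in for tokenizer.encode/decode; the real
    `split` task counts tokens with a Hugging Face tokenizer instead.
    """
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    tokens = text.split()  # stand-in for tokenizer.encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))  # stand-in for tokenizer.decode
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens(
    "one two three four five six seven eight nine ten",
    chunk_size=4,
    chunk_overlap=1,
)
```

Note how each window re-uses the last `chunk_overlap` tokens of the previous one, so no sentence boundary is lost entirely between chunks.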
4 changes: 2 additions & 2 deletions docs/docs/extraction/overview.md
@@ -4,9 +4,9 @@ NeMo Retriever Library is a scalable, performance-oriented document content and
NeMo Retriever Library uses specialized NVIDIA NIM microservices
to find, contextualize, and extract text, tables, charts and infographics that you can use in downstream generative applications.

!!! note
!!! tip "Get Started Recommendation"

This library is the NeMo Retriever Library.
**[Deploy without containers (Library Mode)](quickstart-library-mode.md)** is the recommended approach for workloads with fewer than 100 PDFs. It’s best suited for local development, experimentation, and small-scale ingestion.

NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well defined JSON schema.
From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content,
2 changes: 1 addition & 1 deletion docs/docs/extraction/python-api-reference.md
@@ -459,7 +459,7 @@ ingestor = ingestor.embed()

!!! note

By default, `embed` uses the [llama-3.2-nv-embedqa-1b-v2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2) model.
By default, `embed` uses the [llama-3.2-nv-embedqa-1b-v2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2) model. Embedding supports **hosted NIM** (default), **local Hugging Face** models, or a **self-hosted** endpoint.

To use a different embedding model, such as [nv-embedqa-e5-v5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5), specify a different `model_name` and `endpoint_url`.
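The selection reduces to the two parameters named above, `model_name` and `endpoint_url`. As a hedged sketch (the helper below is illustrative only, not part of the library API, and the URL is a placeholder), the choice between the hosted default, a self-hosted NIM, and an alternative model might be assembled like this:

```python
# Default embedding model from the docs above; hosted NIM is used when
# no endpoint_url is supplied.
DEFAULT_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"

def embed_config(model_name=None, endpoint_url=None):
    """Build keyword arguments for an `ingestor.embed(...)` call.

    Illustrative helper, not a library function: omitting both arguments
    selects the hosted default; overriding them targets a different model
    or a self-hosted endpoint.
    """
    return {
        "model_name": model_name or DEFAULT_MODEL,
        "endpoint_url": endpoint_url,  # None -> hosted NIM on build.nvidia.com
    }

# Self-hosted alternative model (URL is a placeholder assumption):
cfg = embed_config("nvidia/nv-embedqa-e5-v5", "http://localhost:8012/v1")
```

The resulting dictionary could then be splatted into the call, for example `ingestor.embed(**cfg)`.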

2 changes: 1 addition & 1 deletion docs/docs/extraction/quickstart-library-mode.md
@@ -15,7 +15,7 @@ In addition, you can use library mode, which is intended for the following cases

By default, library mode depends on NIMs that are hosted on build.nvidia.com.
In library mode you launch the main pipeline service directly within a Python process,
while all other services (such as embedding and storage) are hosted remotely in the cloud.
while embedding and reranking use hosted NIMs by default; you can also use local Hugging Face models or self-hosted endpoints by configuring custom NIM endpoints (see the [FAQ](faq.md)).

To get started using library mode, you need the following:

4 changes: 2 additions & 2 deletions docs/docs/extraction/support-matrix.md
@@ -12,7 +12,7 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure that you ha
The NeMo Retriever Library core pipeline features run on a single A10G or better GPU.
The core pipeline features include the following:

- llama3.2-nv-embedqa-1b-v2 — Embedding model for converting text chunks into vectors.
- llama3.2-nv-embedqa-1b-v2 — Embedding model for converting text chunks into vectors. Embedding is available as **hosted NIM**, **local Hugging Face**, or **self-hosted**.
- nemoretriever-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
- nemoretriever-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format.
- nemoretriever-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values.
@@ -30,7 +30,7 @@ This includes the following:

While nemotron-nano-12b-v2-vl is the default VLM, you can configure and use other vision language models for image captioning based on your specific use case requirements. For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).

- Reranker — Use [llama-3.2-nv-rerankqa-1b-v2](https://build.nvidia.com/nvidia/llama-3.2-nv-rerankqa-1b-v2) for improved retrieval accuracy.
- Reranker — Use [llama-3.2-nv-rerankqa-1b-v2](https://build.nvidia.com/nvidia/llama-3.2-nv-rerankqa-1b-v2) for improved retrieval accuracy. Reranking is available as **hosted NIM**, **local Hugging Face**, or **self-hosted**.
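What the reranking stage contributes can be illustrated with a toy sketch: re-order retrieved chunks by a query-relevance score. Here token overlap stands in for the cross-encoder score that `llama-3.2-nv-rerankqa-1b-v2` would produce; this is not the NIM API.

```python
def rerank(query, chunks):
    """Order chunks by a stand-in relevance score (shared-token count).

    A real reranker NIM scores each (query, chunk) pair with a
    cross-encoder; only the sort-by-score step is the same here.
    """
    query_tokens = set(query.lower().split())

    def score(chunk):
        return len(query_tokens & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)

ranked = rerank(
    "gpu memory limits",
    ["cpu cache sizes", "gpu memory limits explained", "memory limits on gpu nodes"],
)
```

After reranking, the most query-relevant chunks sit at the top of the list, which is what improves downstream retrieval accuracy.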



2 changes: 1 addition & 1 deletion docs/docs/extraction/vlm-embed.md
@@ -1,6 +1,6 @@
# Use Multimodal Embedding with NeMo Retriever Library

This guide explains how to use the [NeMo Retriever Library](https://www.perplexity.ai/search/overview.md) with the multimodal embedding model [Llama Nemotron Embed VL 1B v2](https://build.nvidia.com/nvidia/llama-nemotron-embed-vl-1b-v2).
This guide explains how to use the [NeMo Retriever Library](overview.md) with the multimodal embedding model [Llama Nemotron Embed VL 1B v2](https://build.nvidia.com/nvidia/llama-nemotron-embed-vl-1b-v2). This page covers self-hosted deployment of the multimodal embedding NIM; text embedding and reranking also support hosted NIMs and local Hugging Face models.

The `Llama Nemotron Embed VL 1B v2` model is optimized for multimodal question-answering and retrieval tasks.
It can embed documents as text, images, or paired text-image combinations.
32 changes: 20 additions & 12 deletions docs/docs/index.md
@@ -1,13 +1,13 @@
# What is NVIDIA NeMo Retriever?
# What is NVIDIA NeMo Retriever Library?

NVIDIA NeMo Retriever is a collection of microservices
NVIDIA NeMo Retriever Library is a collection of microservices
for building and scaling multimodal data extraction, embedding, and reranking pipelines
with high accuracy and maximum data privacy – built with NVIDIA NIM.
NeMo Retriever, part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle,
NeMo Retriever Library, part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle,
ensures data privacy and seamlessly connects to proprietary data wherever it resides,
empowering secure, enterprise-grade retrieval.

NeMo Retriever provides the following:
NeMo Retriever Library provides the following:

- **Multimodal Data Extraction** — Quickly extract documents at scale that include text, tables, charts, and infographics.
- **Embedding + Indexing** — Embed all extracted text from text chunks and images, and then insert into LanceDB (default) or Milvus — accelerated with NVIDIA cuVS.
@@ -17,20 +17,26 @@ NeMo Retriever provides the following:
![Overview diagram](extraction/images/overview-retriever.png)


## Get Started

**[Deploy without containers (Library Mode)](extraction/quickstart-library-mode.md)** is the primary, recommended path for workloads under 100 PDFs. Use it for local development, experimentation, and small-scale ingestion.


## Enterprise-Ready Features

NVIDIA NeMo Retriever comes with enterprise-ready features, including the following:
NVIDIA NeMo Retriever Library comes with enterprise-ready features, including the following:

- **High Accuracy** — NeMo Retriever exhibits a high level of accuracy when retrieving across various modalities through enterprise documents.
- **High Throughput** — NeMo Retriever is capable of extracting, embedding, indexing and retrieving across hundreds of thousands of documents at scale with high throughput.
- **Decomposable/Customizable** — NeMo Retriever consists of modules that can be separately used and deployed in your own environment.
- **Enterprise-Grade Security** — NeMo Retriever NIMs come with security features such as the use of [safetensors](https://huggingface.co/docs/safetensors/index), continuous patching of CVEs, and more.
- **[World-class performance](extraction/benchmarking.md)** — See Benchmarks & Comparison for throughput and recall metrics.
- **High Accuracy** — NeMo Retriever Library exhibits a high level of accuracy when retrieving across various modalities through enterprise documents.
- **High Throughput** — NeMo Retriever Library is capable of extracting, embedding, indexing, and retrieving across hundreds of thousands of documents at scale with high throughput.
- **Decomposable/Customizable** — NeMo Retriever Library consists of modules that can be separately used and deployed in your own environment.
- **Enterprise-Grade Security** — NeMo Retriever Library NIMs come with security features such as the use of [safetensors](https://huggingface.co/docs/safetensors/index), continuous patching of CVEs, and more.



## Applications

The following are some applications that use NVIDIA NeMo Retriever:
The following are some applications that use NVIDIA NeMo Retriever Library:

- [AI Virtual Assistant for Customer Service](https://github.com/NVIDIA-AI-Blueprints/ai-virtual-assistant) (NVIDIA AI Blueprint)
- [Build an Enterprise RAG pipeline](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline/blueprintcard) (NVIDIA AI Blueprint)
@@ -43,7 +49,9 @@ The following are some applications that use NVIDIA NeMo Retriever:

## Related Topics

- [NeMo Retriever Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
- [NeMo Retriever Text Reranking NIM](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html)
Embedding and reranking support **hosted NIMs**, **local Hugging Face** models, and **self-hosted** deployment:

- [NeMo Retriever Library Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html) (hosted NIM, local HF, self-hosted)
- [NeMo Retriever Library Text Reranking NIM](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html) (hosted NIM, local HF, self-hosted)
- [NVIDIA NIM for Object Detection](https://docs.nvidia.com/nim/ingestion/object-detection/latest/overview.html)
- [NVIDIA NIM for Image OCR](https://docs.nvidia.com/nim/ingestion/table-extraction/latest/overview.html)
7 changes: 4 additions & 3 deletions docs/mkdocs.yml
@@ -1,4 +1,4 @@
site_name: NeMo Retriever Documentation
site_name: NeMo Retriever Library Documentation
site_url: https://docs.nvidia.com/nemo/retriever/

repo_name: NVIDIA/nv-ingest
@@ -55,12 +55,13 @@ extra_css:


nav:
- NeMo Retriever:
- NeMo Retriever Library:
- Overview:
- Overview: index.md
- NeMo Retriever Extraction:
- Overview: extraction/overview.md
- Release Notes: extraction/releasenotes-nv-ingest.md
# Get Started CTA points to Library Mode QuickStart; Library Mode is the primary path for workloads <100 PDFs.
- Get Started:
- Prerequisites: extraction/prerequisites.md
- Support Matrix: extraction/support-matrix.md
@@ -85,7 +86,7 @@ nav:
- NimClient Usage: extraction/nimclient.md
- Resource Scaling Modes: extraction/scaling-modes.md
- Performance:
- Benchmarking: extraction/benchmarking.md
- Benchmarks & Comparison: extraction/benchmarking.md
- Telemetry: extraction/telemetry.md
- Throughput Is Dataset-Dependent: extraction/throughput-is-dataset-dependent.md
- Reference: