Kheiss/readme quickstart#1492

Open
kheiss-uwzoo wants to merge 34 commits into NVIDIA:main from kheiss-uwzoo:kheiss/readme-quickstart

Collaborator

@kheiss-uwzoo kheiss-uwzoo commented Mar 5, 2026

Pull Request Summary: NeMo Retriever README — NVIDIA Style Guide and PRD Alignment

Overview

Updates to nemo_retriever/README.md to align with the NVIDIA Writing Style Guide and the Ingest 2.0 PRD for NeMo Retriever Library, covering voice and tone, formatting, links, acronyms, structure, and naming/positioning.


Changes

Voice and Tone (PACE)

  • Contractions: Replaced "it is" with "it's" for a more conversational tone.
  • Latinisms: Replaced "via" with "through" (per guidance to prefer "by" or "through" instead of "via").

Acronyms and First Use

  • RAG: First use now spelled out as "retrieval-augmented generation (RAG) ingestion pipeline."
  • OCR: First use now spelled out as "If optical character recognition (OCR) fails…".

Links

Replaced generic or bare link text with descriptive text that matches the destination (avoiding "here," "read more," and raw URLs):

  • [docs.nvidia](...benchmarking/) → NeMo Retriever extraction benchmarking documentation
  • [docs.nvidia](...extraction/audio/) → NeMo Retriever audio extraction documentation
  • [docs.nvidia](...25.6.3/extraction/audio/) → NeMo Retriever audio extraction documentation (25.6.3)
  • [docs.nvidia](...ray.html) → NeMo Ray run guide
  • [huggingface](...parakeet-ctc-1.1b) → Parakeet CTC 1.1B model on Hugging Face
  • [discuss.ray](...) → Connecting to a remote Ray cluster on Kubernetes
  • [cohesity](...) → How Cohesity uses NVIDIA NeMo Retriever microservices to improve RAG AI retrieval recall (Cohesity blog)
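
A link check along these lines could catch regressions in later edits. This is a rough sketch under stated assumptions: the helper `find_link_issues` is hypothetical, the regular expressions only cover inline Markdown links, and the list of generic phrases contains just the two named in this description.

```python
import re

# Link text this description calls out as non-descriptive.
GENERIC_LINK_TEXT = {"here", "read more"}

# Inline Markdown link [text](target); image links (![alt](src)) are skipped.
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")
# A URL that is not wrapped in (...), <...>, or [...].
BARE_URL = re.compile(r"(?<![(<\[])\bhttps?://[^\s)>\]]+")


def find_link_issues(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for generic link text and raw URLs."""
    issues = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for text, _target in INLINE_LINK.findall(line):
            label = text.strip().lower()
            if label in GENERIC_LINK_TEXT:
                issues.append((lineno, f"generic link text: {text!r}"))
            elif label.startswith(("http://", "https://")):
                issues.append((lineno, f"URL used as link text: {text!r}"))
        # Drop well-formed links first so their targets are not reported as raw URLs.
        for url in BARE_URL.findall(INLINE_LINK.sub("", line)):
            issues.append((lineno, f"raw URL: {url}"))
    return issues
```

The sketch ignores reference-style links and fenced code blocks, so it would over-report on a README that shows URLs inside command examples.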

Formatting and Structure

  • Wrapped the LD_LIBRARY_PATH example in a fenced code block and introduced it with a full sentence and colon.
  • Removed a duplicate "Quick end‑to‑end test" section (repeated ## 8 heading, explanatory paragraph, and extra horizontal rule).
  • Added a missing comma in the leading sentence: "For example, the following command…".
  • Fixed a double space in "In this step, you uninstall" to a single space after "you".

PRD Conformance

Conformance: Yes (with 2 fixes applied)

The README was checked against the Ingest 2.0 PRD (NeMo Retriever Library). It already matched the PRD’s naming and positioning; two small fixes were made so it fully conforms.

Fixes Applied

  • Python package name (PRD: nemo_retriever):
    • Was: “installs the nemoretriever Python package”
    • Now: “installs the nemo_retriever Python package”
    • PRD specifies the Python import as nemo_retriever (lowercase, underscore).
  • Typo in file extension:
    • Was: “used for .txt ingestion” (with a trailing space)
    • Now: “used for .txt ingestion”

What Already Matched the PRD

  • Product name: “NeMo Retriever Library” (title case, space‑separated) used consistently.
  • GitHub repo: NVIDIA/NeMo-Retriever referenced correctly.
  • PyPI: nemo-retriever (lowercase, hyphenated) in install commands.
  • Python: nemo_retriever in paths and module references (aside from the one nemoretriever fix).
  • No legacy names: no nv-ingest, nv_ingest, or abbreviations like nr / nemo-ret.
  • Hyphens vs underscores: external identifiers use hyphens (nemo-retriever); Python/internal use underscores (nemo_retriever).
  • Scope: library‑first install, Ray, NIM, HuggingFace, PDF/HTML/text/audio, LanceDB, and the benchmark harness all reflected as in the PRD.
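
The hyphen and underscore convention above can be sketched in a few lines. Both helpers are hypothetical and only illustrate the check; the mapping from a distribution name to an import name is a common convention, not something packaging tools guarantee, and the list of legacy spellings is taken from this description.

```python
import re

# Spellings this description treats as legacy or incorrect.
LEGACY_NAMES = ("nv-ingest", "nv_ingest", "nemoretriever")


def pypi_to_import_name(dist_name: str) -> str:
    """Collapse runs of '-', '_' and '.' into '_' and lowercase the result,
    which is the usual way an import name is derived from a PyPI name."""
    return re.sub(r"[-_.]+", "_", dist_name).lower()


def find_legacy_names(text: str) -> list[tuple[int, str]]:
    """Return (line number, name) for each legacy spelling found in `text`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name in LEGACY_NAMES:
            if name in line:  # plain substring match, case-sensitive
                hits.append((lineno, name))
    return hits
```

Because the match is a plain substring test, a mention of `nv-ingest-api` would also be reported under `nv-ingest`; tighten the pattern if that is not wanted.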

Optional Follow-up

  • Benchmark CLI naming: the PRD says the benchmark CLI is nemo-retriever-bench, while the README currently documents retriever harness (for example, retriever harness run).
    • If the shipped command is actually nemo-retriever-bench, add a short note or adjustment in the benchmark section.
    • If retriever harness is the correct, shipped interface, the README is fine as‑is.
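
One quick way to settle the question in a given environment is to see which executable is actually on `PATH`. The sketch below assumes the two candidate executable names are `nemo-retriever-bench` (from the PRD) and `retriever` (inferred from the `retriever harness run` usage in the README); neither is confirmed here.

```python
import shutil

# Candidate executable names; both are assumptions drawn from this description.
CANDIDATES = ("nemo-retriever-bench", "retriever")


def installed_commands(candidates=CANDIDATES) -> dict:
    """Map each candidate name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in installed_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Run it inside the environment where the library is installed; a result of `None` for both names only means neither is on `PATH` there, not that the command does not exist.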

A short conformance report was saved at docs/README_PRD_Conformance.md for reference and for inclusion in the PR if desired.


Files Changed

  • nemo_retriever/README.md

Reference

  • NVIDIA Style Guide (April 2025): Voice and Tone, Links, Abbreviations and Acronyms, Latinisms, Formatting, Technical Content.
  • Ingest 2.0 PRD (NeMo Retriever Library): naming, positioning, and scope.

@kheiss-uwzoo kheiss-uwzoo added the doc Improvements or additions to documentation label Mar 5, 2026
Collaborator Author

@kheiss-uwzoo kheiss-uwzoo left a comment

changed title and opening paragraph

@kheiss-uwzoo kheiss-uwzoo marked this pull request as ready for review March 5, 2026 20:05
@kheiss-uwzoo kheiss-uwzoo requested a review from a team as a code owner March 5, 2026 20:05
Changed opening paragraph to be more specific
Collaborator Author

@kheiss-uwzoo kheiss-uwzoo left a comment

updated opening

## Prerequisites
This quick start guide shows how to run NeMo Retriever in **library mode**, directly from your application, without Docker. In library mode, NeMo Retriever Library supports two deployment options:
- Load Hugging Face models locally on your GPU.
- Use locally deployed NeMo Retriever NIM endpoints for embedding and OCR.
Collaborator

It's not just for embedding and OCR. It includes all the NIMs: page-elements, OCR, and embedding by default, and optionally graphic-elements and table-structure.
@edknv and @ChrisJar, please confirm this statement.

This installs the retriever in editable mode and its in-repo dependencies. Core dependencies (see `nemo_retriever/pyproject.toml`) include Ray, pypdfium2, pandas, LanceDB, PyYAML, torch, transformers, and the Nemotron packages (page-elements, graphic-elements, table-structure). The retriever also depends on the sibling packages `nv-ingest`, `nv-ingest-api`, and `nv-ingest-client` in this repo.

### OCR and CUDA 13 runtime
## 2. Create and activate the NeMo Retriever Library environment
Collaborator

Numbered steps don't go in headings. Remove the number from the heading, or remove the heading.
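
If the first option is taken, the renumbering can be done in one pass. This is a minimal sketch, not something run as part of this PR; `strip_heading_numbers` is a hypothetical helper that only handles ATX headings (`#` through `######`) and skips fenced code blocks.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

# ATX heading whose text starts with a step number such as "2." or "8)".
NUMBERED_HEADING = re.compile(r"^(#{1,6})\s+\d+[.)]\s+(.*)$")


def strip_heading_numbers(markdown: str) -> str:
    """Remove leading step numbers from headings, leaving fenced code untouched."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else NUMBERED_HEADING.match(line)
        out.append(f"{match.group(1)} {match.group(2)}" if match else line)
    return "\n".join(out)
```

Applied to `## 2. Create and activate the NeMo Retriever Library environment`, it returns the same heading without the `2.` prefix.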

Collaborator Author

Updated document based on your comments

Comment on lines +215 to +217
- `run.runtime.summary.json`: run totals (input files, pages, elapsed seconds)
- `run.ray.timeline.json`: detailed Ray execution timeline
- `run.rd_dataset.stats.txt`: Ray dataset stats dump
Collaborator

Correct format: change the code formatting to bold, and change the colons to en dashes (Alt+0150).
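
One reading of this request, sketched as a one-line transform. The helper `restyle_item` is hypothetical, and the target shape (bold term, spaced en dash, description) is an interpretation of the comment rather than a quoted style-guide rule.

```python
import re

# List item that opens with an inline-code term followed by a colon,
# for example: - `name`: description
CODE_COLON_ITEM = re.compile(r"^(\s*[-*]\s+)`([^`]+)`:\s+")


def restyle_item(line: str) -> str:
    """Rewrite a leading `term`: as **term** followed by an en dash (U+2013)."""
    return CODE_COLON_ITEM.sub(
        lambda m: f"{m.group(1)}**{m.group(2)}** \u2013 ", line
    )
```

Lines that do not start with a code-formatted term and a colon pass through unchanged, so the function can be mapped over a whole file.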

Comment on lines +225 to +226
- `bo20`: ~9.0 MiB total, ~8.6 MiB LanceDB
- `jp20`: ~36.8 MiB total, ~36.2 MiB LanceDB
Collaborator

Fix formatting

kheiss-uwzoo and others added 11 commits March 6, 2026 07:11
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
updated per Nicole's review
Fixed formatting per Nicole's review
Updated formatting per Nicole's review
changed from step to procedure
Collaborator Author

@kheiss-uwzoo kheiss-uwzoo left a comment

updated per Nicole review comments

@kheiss-uwzoo kheiss-uwzoo requested a review from nkmcalli March 6, 2026 16:23
removed Library Mode
3 participants