26 changes: 13 additions & 13 deletions nemo_retriever/README.md
Original file line number Diff line number Diff line change
@@ -16,8 +16,8 @@ From the repo root:

```bash
cd /path/to/nv-ingest
uv venv .retriever
source .retriever/bin/activate
uv venv .nr
source .nr/bin/activate
uv pip install -e ./nemo_retriever
```

@@ -52,7 +52,7 @@ uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path

Pass the directory that contains your PDFs as the first argument (`input-dir`). For recall evaluation, the pipeline uses `bo767_query_gt.csv` in the current directory by default; override with `--query-csv <path>`. For document-level recall, use `--recall-match-mode pdf_only` with `query,expected_pdf` data. Recall is skipped if the query file does not exist. By default, per-query details (query, gold, hits) are printed; use `--no-recall-details` to print only the missed-gold summary and recall metrics. To use an existing Ray cluster, pass `--ray-address auto`. If OCR fails with a missing `libcudart.so.13`, install the CUDA 13 runtime and set `LD_LIBRARY_PATH` as shown above.

For **HTML** or **text** ingestion, use `--input-type html` or `--input-type txt` with the same examples (e.g. `batch_pipeline.py <dir> --input-type html`). HTML files are converted to markdown via markitdown, then chunked with the same tokenizer as .txt. Staged CLI: `retriever html run --input-dir <dir>` writes `*.html_extraction.json`; then `retriever local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"` and `retriever local stage6 run --input-dir <dir>`.
For **HTML** or **text** ingestion, use `--input-type html` or `--input-type txt` with the same examples (e.g. `batch_pipeline.py <dir> --input-type html`). HTML files are converted to markdown via markitdown, then chunked with the same tokenizer as .txt. Staged CLI: `nr html run --input-dir <dir>` writes `*.html_extraction.json`; then `nr local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"` and `nr local stage6 run --input-dir <dir>`.
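
The staged HTML flow described above runs three commands in a fixed order. A minimal Python sketch of that ordering follows; the helper name `staged_html_commands` is hypothetical (it is not part of `nemo_retriever`), and only the subcommands and flags quoted in the paragraph above are used.

```python
from __future__ import annotations

import shlex


def staged_html_commands(input_dir: str, cli: str = "nr") -> list[list[str]]:
    """Return the staged HTML commands, in execution order, as argv lists.

    Hypothetical helper for illustration only; it builds the commands
    but does not run them.
    """
    return [
        # Step 1: convert *.html to markdown, chunk, write *.html_extraction.json
        [cli, "html", "run", "--input-dir", input_dir],
        # Step 2: stage5 over the extraction sidecars written by step 1
        [cli, "local", "stage5", "run", "--input-dir", input_dir,
         "--pattern", "*.html_extraction.json"],
        # Step 3: stage6 over the same directory
        [cli, "local", "stage6", "run", "--input-dir", input_dir],
    ]


if __name__ == "__main__":
    for argv in staged_html_commands("/path/to/html"):
        print(shlex.join(argv))
```

Each argv list can be handed to `subprocess.run(argv, check=True)` so that a failure in one stage stops the sequence before the next stage starts.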

## Harness (run, sweep, nightly)

@@ -61,7 +61,7 @@ For **HTML** or **text** ingestion, use `--input-type html` or `--input-type txt
- Config files:
- `nemo_retriever/harness/test_configs.yaml`
- `nemo_retriever/harness/nightly_config.yaml`
- CLI entrypoint is nested under `retriever harness`.
- CLI entrypoint is nested under `nr harness`.
- First pass is LanceDB-only and enforces recall-required pass/fail by default.
- Single-run artifact directories default to `<dataset>_<timestamp>`.
- Dataset-specific recall adapters are supported via config:
@@ -77,37 +77,37 @@ For **HTML** or **text** ingestion, use `--input-type html` or `--input-type txt

```bash
# Dataset preset from test_configs.yaml (recall-required example)
retriever harness run --dataset jp20 --preset single_gpu
nr harness run --dataset jp20 --preset single_gpu

# Direct dataset path
retriever harness run --dataset /datasets/nv-ingest/bo767 --preset single_gpu
nr harness run --dataset /datasets/nv-ingest/bo767 --preset single_gpu

# Add repeatable run or session tags for later review
retriever harness run --dataset jp20 --preset single_gpu --tag nightly --tag candidate
nr harness run --dataset jp20 --preset single_gpu --tag nightly --tag candidate
```

### Sweep runs (explicit runs list)

```bash
retriever harness sweep --runs-config nemo_retriever/harness/nightly_config.yaml
nr harness sweep --runs-config nemo_retriever/harness/nightly_config.yaml
```

### Nightly session

```bash
retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml
retriever harness nightly --dry-run
retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml --tag nightly
nr harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml
nr harness nightly --dry-run
nr harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml --tag nightly
```

### Session inspection

```bash
# Print a compact table from a completed sweep/nightly session
retriever harness summary nemo_retriever/artifacts/nightly_20260305_010203_UTC
nr harness summary nemo_retriever/artifacts/nightly_20260305_010203_UTC

# Compare two session summaries by run name
retriever harness compare \
nr harness compare \
nemo_retriever/artifacts/nightly_20260305_010203_UTC \
nemo_retriever/artifacts/nightly_20260306_010204_UTC
```
4 changes: 2 additions & 2 deletions nemo_retriever/chart_stage_config.yaml
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# Example config for chart extraction.
#
# Intended usage (once the chart stage CLI is wired up similarly to table stage):
# - `retriever chart stage run --config <this.yaml> --input <primitives.parquet>`
# - `retriever local stage4 run --config <this.yaml> --input <primitives.parquet>`
# - `nr chart stage run --config <this.yaml> --input <primitives.parquet>`
# - `nr local stage4 run --config <this.yaml> --input <primitives.parquet>`
#
# This YAML is parsed into `nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorSchema`
# via `nemo_retriever.chart.config.load_chart_extractor_schema_from_dict`.
2 changes: 1 addition & 1 deletion nemo_retriever/embedding_stage_config.yaml
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@
api_key: "" # e.g. $NGC_API_KEY or $NVIDIA_API_KEY

# Embedding service settings
# If set to null/empty, `retriever local stage5` will fall back to local HF embeddings
# If set to null/empty, `nr local stage5` will fall back to local HF embeddings
# via `nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder`.
embedding_nim_endpoint: null
# embedding_nim_endpoint: "http://localhost:8012/v1"
22 changes: 11 additions & 11 deletions nemo_retriever/harness/HANDOFF.md
Original file line number Diff line number Diff line change
@@ -49,36 +49,36 @@ From repo root:

```bash
source ~/setup_env.sh
source .retriever/bin/activate
source .nr/bin/activate
uv pip install -e ./nemo_retriever
```

Single run:

```bash
retriever harness run --dataset jp20 --preset single_gpu
retriever harness run --dataset jp20 --preset single_gpu --tag nightly --tag candidate
nr harness run --dataset jp20 --preset single_gpu
nr harness run --dataset jp20 --preset single_gpu --tag nightly --tag candidate
```

Sweep:

```bash
retriever harness sweep --runs-config nemo_retriever/harness/nightly_config.yaml
nr harness sweep --runs-config nemo_retriever/harness/nightly_config.yaml
```

Nightly:

```bash
retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml
retriever harness nightly --dry-run
retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml --tag nightly
nr harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml
nr harness nightly --dry-run
nr harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml --tag nightly
```

Session inspection:

```bash
retriever harness summary nemo_retriever/artifacts/nightly_20260305_010203_UTC
retriever harness compare \
nr harness summary nemo_retriever/artifacts/nightly_20260305_010203_UTC
nr harness compare \
nemo_retriever/artifacts/nightly_20260305_010203_UTC \
nemo_retriever/artifacts/nightly_20260306_010204_UTC
```
@@ -148,9 +148,9 @@ Notes:
- `financebench` now defaults to `data/financebench_train.json` with recall enabled.
- Session UX improvements:
- Runs, sweeps, and nightly sessions accept repeatable `--tag` values persisted into artifacts.
- `retriever harness summary` prints a compact table from `session_summary.json`.
- `nr harness summary` prints a compact table from `session_summary.json`.
- Comparison utility:
- `retriever harness compare` prints pages/sec and recall deltas by run name for two sessions.
- `nr harness compare` prints pages/sec and recall deltas by run name for two sessions.
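
The by-run-name comparison described in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the harness implementation: the layout of `session_summary.json` is not shown here, so the per-run field names `pages_per_sec` and `recall` and the function `compare_sessions` are hypothetical.

```python
from __future__ import annotations


def compare_sessions(
    old: dict[str, dict[str, float]],
    new: dict[str, dict[str, float]],
) -> list[dict[str, object]]:
    """Return per-run deltas (new minus old) for runs present in both sessions.

    Assumes each session maps run name -> metrics with the hypothetical
    keys "pages_per_sec" and "recall".
    """
    rows: list[dict[str, object]] = []
    # Only runs that appear in both sessions can be compared by name.
    for name in sorted(old.keys() & new.keys()):
        rows.append(
            {
                "run": name,
                "pages_per_sec_delta": new[name]["pages_per_sec"] - old[name]["pages_per_sec"],
                "recall_delta": new[name]["recall"] - old[name]["recall"],
            }
        )
    return rows
```

Runs that exist in only one of the two sessions are skipped in this sketch; a real comparison might report them separately instead.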

## Current Validation Status

4 changes: 2 additions & 2 deletions nemo_retriever/infographic_stage_config.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Example config for:
# - `retriever infographic stage run --config <this.yaml> --input <primitives.parquet>`
# - `retriever local stage2 run --config <this.yaml> --input <primitives.parquet>`
# - `nr infographic stage run --config <this.yaml> --input <primitives.parquet>`
# - `nr local stage2 run --config <this.yaml> --input <primitives.parquet>`
#
# This YAML is parsed into `nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorSchema`
# via `nemo_retriever.infographic.config.load_infographic_extractor_schema_from_dict`.
4 changes: 2 additions & 2 deletions nemo_retriever/pdf_stage_config.yaml
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# Example config for: `retriever pdf stage page-elements --config <this.yaml>`
# Example config for: `nr pdf stage page-elements --config <this.yaml>`
#
# CLI override rule:
# - If you pass an option explicitly on the CLI, it wins.
# - Otherwise the value from this YAML file is used.
#
# You can run repeatedly:
# retriever pdf stage page-elements --config nemo_retriever/pdf_stage_config.yaml
# nr pdf stage page-elements --config nemo_retriever/pdf_stage_config.yaml
#

# Directory containing PDFs (scanned recursively for *.pdf)
3 changes: 2 additions & 1 deletion nemo_retriever/pyproject.toml
Original file line number Diff line number Diff line change
@@ -69,6 +69,7 @@ dependencies = [
"soundfile>=0.12.0",
"scipy>=1.11.0",
"nvidia-ml-py",
"pytest"
]

[project.optional-dependencies]
@@ -78,7 +79,7 @@ dev = [
]

[project.scripts]
retriever = "nemo_retriever.__main__:main"
nr = "nemo_retriever.__main__:main"

[tool.setuptools.dynamic]
version = {attr = "nemo_retriever.version.get_build_version"}
12 changes: 10 additions & 2 deletions nemo_retriever/src/nemo_retriever/adapters/cli/main.py
Original file line number Diff line number Diff line change
@@ -7,12 +7,14 @@
import typer

from nemo_retriever.audio import app as audio_app
from nemo_retriever.examples.batch_pipeline import app as batch_app
from nemo_retriever.utils.benchmark import app as benchmark_app
from nemo_retriever.chart import app as chart_app
from nemo_retriever.utils.compare import app as compare_app
from nemo_retriever.harness import app as harness_app
from nemo_retriever.html import __main__ as html_main
from nemo_retriever.utils.image import app as image_app
from nemo_retriever.examples.inprocess_pipeline import app as inprocess_app
from nemo_retriever.local import app as local_app
from nemo_retriever.online import __main__ as online_main
from nemo_retriever.pdf import app as pdf_app
@@ -21,7 +23,13 @@
from nemo_retriever.vector_store import app as vector_store_app
from nemo_retriever.version import get_version_info

app = typer.Typer(help="Retriever")
app = typer.Typer(help="NeMo Retriever – RAG ingestion pipeline CLI.")

ingest_app = typer.Typer(help="Run ingestion pipelines (batch or in-process).")
ingest_app.add_typer(batch_app, name="batch")
ingest_app.add_typer(inprocess_app, name="inprocess")
app.add_typer(ingest_app, name="ingest")

app.add_typer(audio_app, name="audio")
app.add_typer(image_app, name="image")
app.add_typer(pdf_app, name="pdf")
@@ -54,7 +62,7 @@ def _callback(
version: bool = typer.Option(
False,
"--version",
help="Show retriever version metadata and exit.",
help="Show nr version metadata and exit.",
callback=_version_callback,
is_eager=True,
)
2 changes: 1 addition & 1 deletion nemo_retriever/src/nemo_retriever/audio/cli.py
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@
This module intentionally contains **no configuration logic**. It simply re-exports the
`nemo_retriever.audio.stage` Typer application so any arguments provided to:

`retriever audio ...`
`nr audio ...`

are handled exactly the same as the stage commands (e.g. `extract`, `discover`).
"""
4 changes: 2 additions & 2 deletions nemo_retriever/src/nemo_retriever/audio/stage.py
Original file line number Diff line number Diff line change
@@ -5,8 +5,8 @@
"""
Audio extraction stage: chunk + ASR only, write *.audio_extraction.json sidecars.

Invoked as `retriever audio extract` / `retriever audio discover` (or
`python -m nemo_retriever.audio extract` / `discover`). Analogous to `retriever pdf stage page-elements`.
Invoked as `nr audio extract` / `nr audio discover` (or
`python -m nemo_retriever.audio extract` / `discover`). Analogous to `nr pdf stage page-elements`.
"""

from __future__ import annotations
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@

"""
Batch ingestion pipeline with optional recall evaluation.
Run with: uv run python -m nemo_retriever.examples.batch_pipeline <input-dir>
Run with: nr ingest batch <input-dir>
"""

import json
@@ -345,7 +345,7 @@ def _hit_key_and_distance(hit: dict) -> tuple[str | None, float | None]:
return key, dist


@app.command()
@app.command(name="pipeline")
def main(
ctx: typer.Context,
debug: bool = typer.Option(
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@

"""
In-process ingestion pipeline (no Ray) with optional recall evaluation.
Run with: uv run python -m nemo_retriever.examples.inprocess_pipeline <input-dir>
Run with: nr ingest inprocess <input-dir>
"""

import json
@@ -96,7 +96,7 @@ def _hit_key_and_distance(hit: dict) -> tuple[str | None, float | None]:
return key, dist


@app.command()
@app.command(name="pipeline")
def main(
input_path: Path = typer.Argument(
...,
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@

- Inprocess: runs the full pipeline locally (no server).
- Online: submits each document to the online ingest REST service (start with
`retriever online serve`). Uses the same LanceDB for recall evaluation.
`nr online serve`). Uses the same LanceDB for recall evaluation.

Run with:
uv run python -m nemo_retriever.examples.online_pipeline <input-dir>
6 changes: 3 additions & 3 deletions nemo_retriever/src/nemo_retriever/html/__main__.py
Original file line number Diff line number Diff line change
@@ -5,7 +5,7 @@
"""
CLI for .html extraction: markitdown -> markdown -> tokenizer split, write *.html_extraction.json.

Use with: retriever local stage5 run --pattern "*.html_extraction.json" then stage6.
Use with: nr local stage5 run --pattern "*.html_extraction.json" then stage6.
"""

from __future__ import annotations
@@ -63,8 +63,8 @@ def run(
Scan input_dir for *.html, convert to markdown and chunk each, write <stem>.html_extraction.json.

Output JSON has the same primitives-like shape as stage5 input (text, path, page_number, metadata).
Then run: retriever local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"
and retriever local stage6 run --input-dir <dir>.
Then run: nr local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"
and nr local stage6 run --input-dir <dir>.
"""
input_dir = Path(input_dir)
html_files = sorted(input_dir.glob("*.html"))
22 changes: 11 additions & 11 deletions nemo_retriever/src/nemo_retriever/ingest-config.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# nv-ingest retriever consolidated configuration
# nv-ingest nr consolidated configuration
#
# This single file replaces the older per-stage YAML configs:
# - pdf_stage_config.yaml
@@ -17,7 +17,7 @@
# - Sections below are consumed by their respective stages.

pdf:
# Example config for: `retriever pdf stage page-elements --config <this.yaml>`
# Example config for: `nr pdf stage page-elements --config <this.yaml>`
#
# Directory containing PDFs (scanned recursively for *.pdf)
input_dir: /home/local/jdyer/datasets/jp20
@@ -59,14 +59,14 @@ pdf:
# Optionally limit number of PDFs processed
limit: null

# Optional config for `retriever txt run` and .extract_txt() API
# Optional config for `nr txt run` and .extract_txt() API
txt:
max_tokens: 512
overlap_tokens: 0
tokenizer_model_id: nvidia/llama-3.2-nv-embedqa-1b-v2
encoding: utf-8

# Optional config for `retriever html run` and .extract_html() API
# Optional config for `nr html run` and .extract_html() API
html:
max_tokens: 512
overlap_tokens: 0
@@ -94,8 +94,8 @@ audio_asr:

table:
# Example config for:
# - `retriever table stage run --config <this.yaml> --input <primitives.parquet>`
# - `retriever local stage3 run --config <this.yaml> --input <primitives.parquet>`
# - `nr table stage run --config <this.yaml> --input <primitives.parquet>`
# - `nr local stage3 run --config <this.yaml> --input <primitives.parquet>`
#
# This YAML is parsed into `nv_ingest_api.internal.schemas.extract.extract_table_schema.TableExtractorSchema`
# via `nemo_retriever.table.config.load_table_extractor_schema_from_dict`.
@@ -133,8 +133,8 @@ table:

chart:
# Example config for:
# - `retriever chart stage run --config <this.yaml> --input <primitives.parquet>`
# - `retriever local stage4 run --config <this.yaml> --input <primitives.parquet>`
# - `nr chart stage run --config <this.yaml> --input <primitives.parquet>`
# - `nr local stage4 run --config <this.yaml> --input <primitives.parquet>`
#
# This YAML is parsed into `nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorSchema`
# via `nemo_retriever.chart.config.load_chart_extractor_schema_from_dict`.
@@ -185,8 +185,8 @@ chart:

infographic:
# Example config for:
# - `retriever infographic stage run --config <this.yaml> --input <primitives.parquet>`
# - `retriever local stage2 run --config <this.yaml> --input <primitives.parquet>`
# - `nr infographic stage run --config <this.yaml> --input <primitives.parquet>`
# - `nr local stage2 run --config <this.yaml> --input <primitives.parquet>`
#
# This YAML is parsed into `nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorSchema`
# via `nemo_retriever.infographic.config.load_infographic_extractor_schema_from_dict`.
@@ -231,7 +231,7 @@ embedding:
api_key: "" # e.g. $NGC_API_KEY or $NVIDIA_API_KEY

# Embedding service settings
# If set to null/empty, `retriever local stage5` will fall back to local HF embeddings
# If set to null/empty, `nr local stage5` will fall back to local HF embeddings
# via `nemo_retriever.model.local.llama_nemotron_embed_1b_v2_embedder`.
embedding_nim_endpoint: null
# embedding_nim_endpoint: "http://localhost:8012/v1"