(retriever) Add ingest batch and ingest inprocess commands to the new 'nr' cli command #1500

Draft
jdye64 wants to merge 6 commits into NVIDIA:main from jdye64:nr-command

jdye64 (Collaborator) commented Mar 6, 2026

Rename CLI entrypoint from retriever to nr and add ingest subcommands

Renames the pip-installed CLI command from retriever to nr and introduces a new nr ingest command group with batch and inprocess subcommands that expose the full ingestion pipelines directly from the CLI.

Changes

CLI entrypoint rename (retriever → nr)

  • Updated [project.scripts] in pyproject.toml so the installed command is nr instead of retriever.
  • Updated all CLI usage references in documentation (README.md, HANDOFF.md), YAML configs, Python docstrings/comments, and tests within nemo_retriever/ to use nr.
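
One way to confirm the rename after installing the package is to look up the console script in the installed metadata. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `find_console_script` is hypothetical and not part of this PR.

```python
from importlib.metadata import entry_points


def find_console_script(name):
    """Return the installed console-script entry point called `name`, or None."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select() for filtering by group.
        scripts = eps.select(group="console_scripts")
    else:
        # Older Pythons return a plain dict keyed by group name.
        scripts = eps.get("console_scripts", [])
    for ep in scripts:
        if ep.name == name:
            return ep
    return None


if __name__ == "__main__":
    for script in ("nr", "retriever"):
        ep = find_console_script(script)
        # ep.value is the "module:function" target declared under [project.scripts].
        print(script, "->", ep.value if ep else "not installed")
```

If the rename took effect in the active environment, `nr` should resolve to an entry point and `retriever` should not, assuming no other installed package registers a script with that name.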

New nr ingest command group

  • nr ingest batch <input-path> — Wraps the existing Ray-based batch pipeline (examples/batch_pipeline.py) with all its options (Ray cluster, actor counts, GPU allocation, recall evaluation, etc.).
  • nr ingest inprocess <input-path> — New CLI for the in-process pipeline (no Ray required). Supports all extraction, embedding, VDB upload, and execution options:
    • Extract: --extract-text, --extract-tables, --extract-charts, --extract-infographics, --use-table-structure, --table-output-format, --inference-batch-size, remote endpoint URLs
    • Embed: --embed-model-name, --embed-invoke-url, --embed-modality, --embed-granularity, modality overrides
    • VDB: --lancedb-uri, --lancedb-table, --hybrid, --overwrite
    • Execution: --parallel, --max-workers, --gpu-devices, --page-chunk-size, --show-progress
    • Output: --output-dir for JSON results
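
The nested layout described above (a top-level command, an `ingest` group, and two subcommands) can be sketched with the standard-library `argparse`. This is only an illustration of the command shape: the PR description does not say which CLI framework `adapters/cli/main.py` uses, only a handful of the listed flags are shown, and the flag types chosen here (boolean `--parallel`, integer `--max-workers`, comma-separated `--gpu-devices`) are assumptions. `build_parser` and `parse_gpu_devices` are hypothetical names.

```python
import argparse


def parse_gpu_devices(value):
    """Turn a comma-separated string such as "0,1" into [0, 1] (assumed format)."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="nr")
    top = parser.add_subparsers(dest="command", required=True)

    # `nr ingest` is a group that only dispatches to its subcommands.
    ingest = top.add_parser("ingest", help="Run an ingestion pipeline")
    modes = ingest.add_subparsers(dest="mode", required=True)

    batch = modes.add_parser("batch", help="Ray-based batch pipeline")
    batch.add_argument("input_path")
    batch.add_argument("--embed-model-name")

    inprocess = modes.add_parser("inprocess", help="In-process pipeline (no Ray)")
    inprocess.add_argument("input_path")
    inprocess.add_argument("--parallel", action="store_true")
    inprocess.add_argument("--max-workers", type=int)
    inprocess.add_argument("--gpu-devices", type=parse_gpu_devices, default=[])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["ingest", "inprocess", "./data/pdfs", "--parallel", "--gpu-devices", "0,1"]
    )
    print(args.mode, args.input_path, args.parallel, args.gpu_devices)
```

Marking both subparser levels as required means that `nr ingest` with no mode fails with a usage error instead of silently doing nothing.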

Files changed (24 files, +366 / −67)

| Category | Files |
| --- | --- |
| Core CLI | pyproject.toml, adapters/cli/main.py |
| New file | ingest_modes/inprocess_cli.py |
| Docs | README.md, harness/HANDOFF.md |
| YAML configs | ingest-config.yaml, pdf_stage_config.yaml, chart_stage_config.yaml, table_stage_config.yaml, infographic_stage_config.yaml, embedding_stage_config.yaml |
| Python refs | batch_pipeline.py, online_pipeline.py, audio/cli.py, audio/stage.py, html/__main__.py, txt/__main__.py, vector_store/stage.py, local/stages/stage1_pdf_extraction.py, stage5_text_embeddings.py, stage6_vdb_upload.py, stage7_vdb_query.py, stage999_post_mortem_analysis.py |
| Tests | test_audio_stage.py |

Usage examples

# Batch pipeline (Ray)
nr ingest batch ./data/pdfs --embed-model-name nvidia/llama-3.2-nv-embedqa-1b-v2

# In-process pipeline (no Ray)
nr ingest inprocess ./data/pdfs --parallel --gpu-devices 0,1

# Existing subcommands still work
nr harness run --dataset jp20 --preset single_gpu
nr --version

Test plan

  • nr --help shows ingest as a subcommand
  • nr ingest --help shows batch and inprocess subcommands
  • nr ingest batch --help shows all batch pipeline options
  • nr ingest inprocess --help shows all inprocess pipeline options
  • Existing subcommands (nr harness, nr pdf, nr local, etc.) still work
  • nr --version still works
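
The help-text items in this plan could be automated with a small smoke check. The sketch below is an assumption about how that might look, not something included in this PR: it shells out to the installed `nr` command and looks for the expected subcommand names in the output. The names `HELP_CHECKS`, `run_nr`, and `failed_checks` are hypothetical.

```python
import subprocess

# Each entry: (arguments passed to `nr`, text expected somewhere in the output).
HELP_CHECKS = [
    (["--help"], "ingest"),
    (["ingest", "--help"], "batch"),
    (["ingest", "--help"], "inprocess"),
]


def run_nr(args):
    """Run the installed `nr` command and return (exit code, combined output)."""
    try:
        proc = subprocess.run(["nr", *args], capture_output=True, text=True)
    except FileNotFoundError:
        # `nr` is not on PATH; report it as a failed command.
        return 127, ""
    return proc.returncode, proc.stdout + proc.stderr


def failed_checks(run=run_nr, checks=HELP_CHECKS):
    """Return the checks whose command failed or whose output lacks the expected text."""
    failures = []
    for args, expected in checks:
        code, output = run(args)
        if code != 0 or expected not in output:
            failures.append((args, expected))
    return failures


if __name__ == "__main__":
    for args, expected in failed_checks():
        print("FAILED: nr", " ".join(args), "- expected", repr(expected))
```

Passing the runner in as a parameter keeps the check testable without an installed CLI, since a stub can stand in for `run_nr`.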

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?
