Consolidate/pr3 recall helpers by jdye64 · Pull Request #1509 · NVIDIA/NeMo-Retriever

jdye64 · 2026-03-07T01:24:24Z

Summary

Consolidates duplicated recall-evaluation helpers and pipeline utility functions that were copy-pasted across batch_pipeline.py, inprocess_pipeline.py, online_pipeline.py, and fused_pipeline.py into two shared modules.

recall/core.py — Adds gold_to_doc_page() and hit_key_and_distance() alongside the existing recall functions. These were independently defined (with minor variations) in three pipeline scripts.
examples/common.py (new) — Adds estimate_processed_pages() and print_pages_per_second(), previously duplicated in batch_pipeline.py and inprocess_pipeline.py.
All four pipeline scripts are updated to import from the shared modules instead of defining their own copies.
fused_pipeline.py now imports gold_to_doc_page, hit_key_and_distance, and is_hit_at_k from recall/core.py and estimate_processed_pages/print_pages_per_second from examples/common.py, instead of reaching into batch_pipeline's private functions.
batch_pipeline._ensure_lancedb_table now uses lancedb_schema() from lancedb_utils (PR2) instead of an inline 10-field PyArrow schema definition.
Removes a duplicate HF model registry entry in hf_model_registry.py.
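For readers who have not seen the removed per-pipeline copies, here is a minimal sketch of what helpers of this shape might look like. The actual signatures in recall/core.py are not shown in this description, so the gold-key format (`<doc>_<page>`) and the hit field names `doc` and `page` are assumptions made only for illustration; `_distance` is the column LanceDB attaches to vector search results.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple


def gold_to_doc_page(gold_key: str) -> Tuple[str, str]:
    """Split an ASSUMED '<doc>_<page>' gold key at its last underscore."""
    doc, sep, page = gold_key.rpartition("_")
    if not sep:
        raise ValueError(f"gold key has no page suffix: {gold_key!r}")
    return doc, page


def hit_key_and_distance(hit: Mapping[str, Any]) -> Tuple[Optional[str], Optional[float]]:
    """Build a '<doc>_<page>' key and a float distance from one search hit.

    The 'doc' and 'page' field names are hypothetical; missing fields yield None.
    """
    doc, page = hit.get("doc"), hit.get("page")
    key = f"{doc}_{page}" if doc is not None and page is not None else None
    dist = hit.get("_distance")
    return key, (float(dist) if dist is not None else None)


def is_hit_at_k(gold_key: str, hit_keys: Sequence[Optional[str]], k: int) -> bool:
    """True when the gold key appears among the first k retrieved keys."""
    return gold_key in hit_keys[:k]


hits = [
    {"doc": "a", "page": "1", "_distance": 0.5},
    {"doc": "report_2024", "page": 7, "_distance": 0.25},
]
keys = [hit_key_and_distance(h)[0] for h in hits]
```

Keeping one definition of each helper means every pipeline script scores recall against the same key format, which is the point of the consolidation.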

Net diff: +134 / −227 lines (93 net lines removed)

Stacked on

consolidate/pr2-lancedb-utils

Files changed

| File | Change |
| --- | --- |
| `nemo_retriever/src/nemo_retriever/recall/core.py` | Added `gold_to_doc_page()` and `hit_key_and_distance()` |
| `nemo_retriever/src/nemo_retriever/examples/common.py` | New: `estimate_processed_pages()`, `print_pages_per_second()` |
| `nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py` | Removed 5 local helper defs; uses `lancedb_schema()` in `_ensure_lancedb_table` |
| `nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py` | Removed 5 local helper defs; imports from `recall/core` and `examples/common` |
| `nemo_retriever/src/nemo_retriever/examples/online_pipeline.py` | Removed 3 local helper defs; imports from `recall/core` |
| `nemo_retriever/src/nemo_retriever/examples/fused_pipeline.py` | Switches from `batch_pipeline._*` private imports to shared public APIs |
| `nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py` | Removed duplicate `nvidia/llama-nemotron-embed-1b-v2` entry |
| `nemo_retriever/tests/test_create_local_embedder.py` | Align expected alias to actual mapping |
| `nemo_retriever/tests/test_lancedb_utils.py` | CI stub adjustments from PR2 fixes |
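The throughput helpers moved into examples/common.py are not reproduced in this description. As a rough illustration of the kind of logic such helpers usually carry, the sketch below counts unique (document, page) pairs and formats a pages-per-second line. The function names, the input shape, and the output format are all hypothetical and are not the real `estimate_processed_pages()` or `print_pages_per_second()` APIs.

```python
from typing import Iterable, Optional, Tuple


def count_unique_pages(doc_pages: Iterable[Tuple[str, int]]) -> int:
    """Estimate processed pages as the number of distinct (doc, page) pairs."""
    return len(set(doc_pages))


def pages_per_second(pages: int, elapsed_s: float) -> Optional[float]:
    """Return the throughput, or None when it cannot be computed meaningfully."""
    if pages <= 0 or elapsed_s <= 0:
        return None
    return pages / elapsed_s


def format_pages_per_second(pages: int, elapsed_s: float) -> str:
    """Render one summary line; the wording here is illustrative only."""
    rate = pages_per_second(pages, elapsed_s)
    if rate is None:
        return "pages/sec: n/a"
    return f"{pages} pages in {elapsed_s:.2f}s ({rate:.2f} pages/sec)"


summary = format_pages_per_second(120, 60.0)
```

Guarding against zero pages or zero elapsed time in one shared place avoids each pipeline script handling the division edge case differently.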

Test plan

Existing unit tests pass (test_create_local_embedder, test_lancedb_utils, test_multimodal_embed)
inprocess_pipeline, batch_pipeline, online_pipeline, and fused_pipeline execute end-to-end with recall evaluation enabled
Verify gold_to_doc_page / hit_key_and_distance produce identical results to the removed per-pipeline versions
Verify estimate_processed_pages / print_pages_per_second output matches previous behavior
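One way to carry out the two "verify identical results" items is a small parity check that runs the old and new implementations over the same inputs and reports any difference, including differences in which exception is raised. The sketch below is an assumption about how that could be done, not the test used in this PR; `old_local_helper` and `new_shared_helper` are hypothetical stand-ins for a removed per-pipeline copy and its shared replacement.

```python
from typing import Any, Callable, Iterable, List, Tuple


def _outcome(fn: Callable[[Any], Any], value: Any) -> Tuple[str, Any]:
    """Capture either the return value or the exception type name."""
    try:
        return ("ok", fn(value))
    except Exception as exc:  # broad on purpose: raising is part of the behavior
        return ("raised", type(exc).__name__)


def find_mismatches(
    old_fn: Callable[[Any], Any],
    new_fn: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Any, Any]]:
    """Return (input, old_outcome, new_outcome) for every input that differs."""
    mismatches = []
    for value in inputs:
        old_out, new_out = _outcome(old_fn, value), _outcome(new_fn, value)
        if old_out != new_out:
            mismatches.append((value, old_out, new_out))
    return mismatches


def old_local_helper(key: str) -> Tuple[str, str]:
    """Hypothetical removed copy: split at the last underscore."""
    doc, page = key.rsplit("_", 1)
    return doc, page


def new_shared_helper(key: str) -> Tuple[str, str]:
    """Hypothetical shared version with an explicit error for bad keys."""
    doc, sep, page = key.rpartition("_")
    if not sep:
        raise ValueError(f"bad key: {key!r}")
    return doc, page


cases = ["a_1", "x_y_2", "nounderscore"]
same = find_mismatches(old_local_helper, new_shared_helper, cases)
different = find_mismatches(lambda k: tuple(k.split("_", 1)), new_shared_helper, cases)
```

Because the removed copies had "minor variations", comparing exception behavior as well as return values helps surface edge cases where one pipeline's copy was more lenient than another's.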

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

Centralises the resolve → branch → construct pattern for local HF embedding models (VL and non-VL) that was duplicated across batch, inprocess, fused, gpu_pool, recall, retriever, and text_embed code paths into a single `create_local_embedder` factory function. Made-with: Cursor
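The resolve → branch → construct pattern named in this commit message can be sketched as below. This is only an illustration of the pattern under stated assumptions: the alias table, the VL model set, the two embedder classes, and the function name `create_local_embedder_sketch` are hypothetical and do not reflect the real `create_local_embedder` signature. The only name taken from this PR is the model id `nvidia/llama-nemotron-embed-1b-v2`.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical alias table: short name -> full model id.
_ALIASES = {"hypothetical-alias": "nvidia/llama-nemotron-embed-1b-v2"}

# Hypothetical set of model ids that need the vision-language code path.
_VL_MODELS = {"example-org/example-vl-embedder"}


@dataclass
class TextEmbedder:
    """Stand-in for a text-only local embedder."""
    model_name: str
    device: str


@dataclass
class VLEmbedder:
    """Stand-in for a vision-language local embedder."""
    model_name: str
    device: str


def create_local_embedder_sketch(model: str, device: str = "cpu") -> Union[TextEmbedder, VLEmbedder]:
    resolved = _ALIASES.get(model, model)    # resolve: alias -> full model id
    if resolved in _VL_MODELS:               # branch: VL vs non-VL
        return VLEmbedder(resolved, device)  # construct the VL variant
    return TextEmbedder(resolved, device)    # construct the text variant


text = create_local_embedder_sketch("hypothetical-alias")
vl = create_local_embedder_sketch("example-org/example-vl-embedder", device="cuda")
```

With a single factory, callers no longer repeat the alias lookup and the VL check, so a new model family only needs to be handled in one place.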

Extracts duplicated LanceDB row-building, schema definition, and table-creation logic from batch.py and inprocess.py into a shared ingest_modes/lancedb_utils.py module. Made-with: Cursor

- Remove unused Path import and unused _extract_* aliases from inprocess.py
- Remove unused pytest import from test_lancedb_utils.py
- Apply black formatting to set literal and DataFrame constructor

Made-with: Cursor

…import

The ingest_modes `__init__.py` eagerly imports batch/fused/inprocess/online, which pull in ray, torch, etc. Pre-populate sys.modules with MagicMock stubs so lancedb_utils tests can run in lightweight CI without those deps. Made-with: Cursor
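The stubbing technique described in this commit message can be sketched as follows. This is a generic illustration, not the code in test_lancedb_utils.py: the helper name `stub_missing_modules` is hypothetical, and placeholder module names are used so the demo never shadows a real package. In a real test file the list would name the heavy dependencies, and the stubbing must run before the package under test is imported.

```python
import importlib
import sys
from typing import Iterable, List
from unittest.mock import MagicMock


def stub_missing_modules(names: Iterable[str]) -> List[str]:
    """Register a MagicMock in sys.modules for each name not already loaded.

    Note: a name that is installed but not yet imported is stubbed too,
    so call this only for dependencies the test does not exercise.
    """
    stubbed = []
    for name in names:
        if name not in sys.modules:
            sys.modules[name] = MagicMock(name=name)
            stubbed.append(name)
    return stubbed


# Placeholder names standing in for heavy dependencies.
FAKE_HEAVY = ["zz_fake_heavy_dep", "zz_fake_heavy_dep.data"]
stubbed = stub_missing_modules(FAKE_HEAVY)

# Dotted imports resolve because each dotted name was stubbed explicitly;
# a MagicMock has no __path__, so submodules are not discovered automatically.
module = importlib.import_module("zz_fake_heavy_dep.data")
```

Because the import system consults `sys.modules` before searching the filesystem, any later `import` of a stubbed name returns the mock instead of failing.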

Centralises gold_to_doc_page, hit_key_and_distance, estimate_processed_pages, and print_pages_per_second that were duplicated across batch, inprocess, online, and fused pipeline examples. Fixes broken imports in fused_pipeline.py that referenced non-existent functions in batch_pipeline.py. Made-with: Cursor

jdye64 added 6 commits March 6, 2026 18:38

Consolidate LanceDB row construction, schema, and table creation

7fe8d21

Merge branch 'main' into consolidate/pr2-lancedb-utils

5e734c4

Fix lint: remove unused imports, apply black formatting

13933f7

jdye64 requested a review from a team as a code owner March 7, 2026 01:24

jdye64 requested a review from nkmcalli March 7, 2026 01:24

jdye64 added 3 commits March 6, 2026 20:25

Merge branch 'main' into consolidate/pr3-recall-helpers

8576d8b

linter fixes

ad43c71

Merge branch 'main' into consolidate/pr3-recall-helpers

678d28d

jdye64 wants to merge 9 commits into NVIDIA:main from jdye64:consolidate/pr3-recall-helpers
