Consolidate/pr3 recall helpers #1509

Open: jdye64 wants to merge 9 commits into NVIDIA:main from jdye64:consolidate/pr3-recall-helpers

@jdye64 (Collaborator) commented Mar 7, 2026

Summary

Consolidates duplicated recall-evaluation helpers and pipeline utility functions that were copy-pasted across batch_pipeline.py, inprocess_pipeline.py, online_pipeline.py, and fused_pipeline.py into two shared modules.

  • recall/core.py — Adds gold_to_doc_page() and hit_key_and_distance() alongside the existing recall functions. These were independently defined (with minor variations) in three pipeline scripts.
  • examples/common.py (new) — Adds estimate_processed_pages() and print_pages_per_second(), previously duplicated in batch_pipeline.py and inprocess_pipeline.py.
  • All four pipeline scripts are updated to import from the shared modules instead of defining their own copies.
  • fused_pipeline.py now imports gold_to_doc_page, hit_key_and_distance, and is_hit_at_k from recall/core.py and estimate_processed_pages/print_pages_per_second from examples/common.py, instead of reaching into batch_pipeline's private functions.
  • batch_pipeline._ensure_lancedb_table now uses lancedb_schema() from lancedb_utils (PR2) instead of an inline 10-field PyArrow schema definition.
  • Removes a duplicate HF model registry entry in hf_model_registry.py.
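
To illustrate the kind of helper being consolidated, here is a minimal sketch of what shared versions of gold_to_doc_page() and hit_key_and_distance() could look like. The signatures, the `<doc>_<page>` gold-key format, and the hit field names (`source`, `page_number`, `_distance`) are assumptions made for illustration only; this description does not show the actual definitions in recall/core.py.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Mapping, Optional, Tuple


def gold_to_doc_page(gold_key: str) -> Tuple[str, str]:
    """Split a gold key such as 'report_12' into ('report', '12').

    Assumption: gold keys look like '<doc>_<page>', page after the last '_'.
    """
    doc, sep, page = gold_key.rpartition("_")
    if not sep:
        # No underscore: treat the whole key as the document, page unknown.
        return gold_key, ""
    return doc, page


def hit_key_and_distance(
    hit: Mapping[str, Any],
) -> Tuple[Optional[str], Optional[float]]:
    """Return ('<doc>_<page>', distance) for one search hit.

    Assumption: hits are dict-like with 'source', 'page_number', '_distance'.
    """
    source = hit.get("source")
    page = hit.get("page_number")
    if source is None or page is None:
        # Cannot build a comparable key without both fields.
        return None, None
    key = f"{Path(str(source)).stem}_{page}"
    distance = hit.get("_distance")
    return key, (float(distance) if distance is not None else None)
```

With one definition in a shared module, each pipeline script would only need an import such as `from nemo_retriever.recall.core import gold_to_doc_page, hit_key_and_distance` rather than its own copy.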

Net diff: +134 / −227 lines (net reduction of 93 lines)

Stacked on

  • consolidate/pr2-lancedb-utils

Files changed

| File | Change |
| --- | --- |
| nemo_retriever/src/nemo_retriever/recall/core.py | Added gold_to_doc_page() and hit_key_and_distance() |
| nemo_retriever/src/nemo_retriever/examples/common.py | New: estimate_processed_pages(), print_pages_per_second() |
| nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py | Removed 5 local helper defs; uses lancedb_schema() in _ensure_lancedb_table |
| nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py | Removed 5 local helper defs; imports from recall/core and examples/common |
| nemo_retriever/src/nemo_retriever/examples/online_pipeline.py | Removed 3 local helper defs; imports from recall/core |
| nemo_retriever/src/nemo_retriever/examples/fused_pipeline.py | Switches from batch_pipeline._* private imports to shared public APIs |
| nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py | Removed duplicate nvidia/llama-nemotron-embed-1b-v2 entry |
| nemo_retriever/tests/test_create_local_embedder.py | Align expected alias to actual mapping |
| nemo_retriever/tests/test_lancedb_utils.py | CI stub adjustments from PR2 fixes |

Test plan

  • Existing unit tests pass (test_create_local_embedder, test_lancedb_utils, test_multimodal_embed)
  • inprocess_pipeline, batch_pipeline, online_pipeline, and fused_pipeline execute end-to-end with recall evaluation enabled
  • Verify gold_to_doc_page / hit_key_and_distance produce identical results to the removed per-pipeline versions
  • Verify estimate_processed_pages / print_pages_per_second output matches previous behavior
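
One way to carry out the "output matches previous behavior" checks above is to capture stdout from the old and new helpers and compare them over a set of inputs. The sketch below is only illustrative: both print functions are hypothetical stand-ins with an assumed `(pages, seconds)` signature and an assumed message format, not the real code from examples/common.py or the removed pipeline copies.

```python
import contextlib
import io
from typing import Callable, Optional, Sequence, Tuple


def legacy_print_pages_per_second(pages: Optional[int], seconds: float) -> None:
    # Hypothetical stand-in for a removed per-pipeline copy.
    if pages is None or seconds <= 0:
        print("Pages/sec: unavailable")
        return
    print(f"Pages/sec: {pages / seconds:.2f} ({pages} pages in {seconds:.2f}s)")


def shared_print_pages_per_second(pages: Optional[int], seconds: float) -> None:
    # Hypothetical stand-in for the consolidated helper, structured differently.
    if pages is not None and seconds > 0:
        rate = pages / seconds
        print(f"Pages/sec: {rate:.2f} ({pages} pages in {seconds:.2f}s)")
    else:
        print("Pages/sec: unavailable")


def captured_stdout(fn: Callable[..., None], *args) -> str:
    """Run fn(*args) and return everything it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        fn(*args)
    return buf.getvalue()


# Inputs chosen to cover the normal path and the guard conditions.
CASES: Sequence[Tuple[Optional[int], float]] = [
    (100, 4.0),
    (0, 1.5),
    (None, 2.0),
    (7, 0.0),
]


def outputs_match(old: Callable[..., None], new: Callable[..., None]) -> bool:
    """True if both helpers print identical text for every case."""
    return all(captured_stdout(old, *c) == captured_stdout(new, *c) for c in CASES)
```

The same pattern applies to gold_to_doc_page / hit_key_and_distance by comparing return values instead of captured output.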

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

jdye64 added 6 commits March 6, 2026 18:38
Centralises the resolve → branch → construct pattern for local HF embedding
models (VL and non-VL) that was duplicated across batch, inprocess, fused,
gpu_pool, recall, retriever, and text_embed code paths into a single
`create_local_embedder` factory function.

Made-with: Cursor
Extracts duplicated LanceDB row-building, schema definition, and
table-creation logic from batch.py and inprocess.py into a shared
ingest_modes/lancedb_utils.py module.

Made-with: Cursor
- Remove unused Path import and unused _extract_* aliases from inprocess.py
- Remove unused pytest import from test_lancedb_utils.py
- Apply black formatting to set literal and DataFrame constructor

Made-with: Cursor
…import

The ingest_modes __init__.py eagerly imports batch/fused/inprocess/online
which pull in ray, torch, etc. Pre-populate sys.modules with MagicMock
stubs so lancedb_utils tests can run in lightweight CI without those deps.

Made-with: Cursor
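
The stubbing approach described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from test_lancedb_utils.py: the helper name `stub_modules` and the default module list are assumptions, and the real test may stub a different set of names.

```python
import sys
from typing import Iterable, List
from unittest.mock import MagicMock

# Assumed list of heavy dependencies; the actual test may differ.
HEAVY_MODULES = ("ray", "ray.data", "torch")


def stub_modules(names: Iterable[str] = HEAVY_MODULES) -> List[str]:
    """Register MagicMock stand-ins in sys.modules for names not yet loaded.

    Must run before the package under test is imported, so that its eager
    imports resolve to the stubs instead of failing or loading heavy libraries.
    Returns the names that were actually stubbed.
    """
    stubbed = []
    for name in names:
        if name not in sys.modules:
            sys.modules[name] = MagicMock(name=name)
            stubbed.append(name)
    return stubbed
```

Calling `stub_modules()` at the top of a test module (or in a conftest.py) before importing the package lets lightweight CI exercise pure-Python helpers. The trade-off is that any code path that truly needs the stubbed library will silently receive a mock, so the technique suits tests that never touch those paths.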
Centralises gold_to_doc_page, hit_key_and_distance, estimate_processed_pages,
and print_pages_per_second that were duplicated across batch, inprocess,
online, and fused pipeline examples. Fixes broken imports in fused_pipeline.py
that referenced non-existent functions in batch_pipeline.py.

Made-with: Cursor
@jdye64 jdye64 requested a review from a team as a code owner March 7, 2026 01:24
@jdye64 jdye64 requested a review from nkmcalli March 7, 2026 01:24