Consolidate/pr4 detection summary by jdye64 · Pull Request #1511 · NVIDIA/NeMo-Retriever

jdye64 · 2026-03-07T12:18:14Z

Summary

Extract duplicated detection-summary logic from batch_pipeline.py and inprocess.py into a single shared module at utils/detection_summary.py.

Both pipelines contained ~100 lines of near-identical code to collect per-model detection totals (PageElements v3, OCR table/chart/infographic), deduplicate by (source_id, page_number), and pretty-print the result. This PR replaces that with thin wrappers around a shared compute_detection_summary() core that accepts a generic row iterator, plus two adapters:

iter_lancedb_rows — reads from a LanceDB table (batch pipeline path)
iter_dataframe_rows — reads from an in-memory pandas DataFrame (inprocess pipeline path)

Changes

New nemo_retriever/utils/detection_summary.py — shared detection summary computation (compute_detection_summary), LanceDB/DataFrame iterators, print_detection_summary, and convenience wrappers (collect_detection_summary_from_lancedb, collect_detection_summary_from_df).
Simplified examples/batch_pipeline.py — _collect_detection_summary and _print_detection_summary now delegate to the shared module (~90 lines removed).
Simplified ingest_modes/inprocess.py — _collect_summary_from_df and _print_detection_summary now delegate to the shared module (~100 lines removed).
Cleaned up utils/hf_model_registry.py — removed stale nvidia/llama-nemotron-embed-1b-v2 revision entry.
Updated tests/test_create_local_embedder.py — assertion updated to match corrected alias target (nvidia/llama-3.2-nv-embedqa-1b-v2).
Simplified tests/test_lancedb_utils.py — trimmed heavy-module stub list (now stubs the four ingest_modes siblings directly instead of all transitive deps); removed unnecessary importorskip("pyarrow").

Stats

 6 files changed, 198 insertions(+), 237 deletions(-)

Test plan

Existing unit tests pass (test_lancedb_utils.py, test_create_local_embedder.py)
pre-commit run --all-files passes (black, flake8, trailing whitespace)
Run batch pipeline end-to-end and verify detection summary JSON output matches previous format
Run inprocess pipeline end-to-end and verify printed detection summary matches previous output

Stack

This is PR 4 in the consolidation series:

~~PR1 — embedder factory~~ (merged)
~~PR2 — LanceDB utils~~ (merged)
PR3 — recall helpers (consolidate/pr3-recall-helpers)
PR4 — detection summary (consolidate/pr4-detection-summary) ← this PR

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

Centralises the resolve → branch → construct pattern for local HF embedding models (VL and non-VL) that was duplicated across batch, inprocess, fused, gpu_pool, recall, retriever, and text_embed code paths into a single `create_local_embedder` factory function. Made-with: Cursor

Extracts duplicated LanceDB row-building, schema definition, and table-creation logic from batch.py and inprocess.py into a shared ingest_modes/lancedb_utils.py module. Made-with: Cursor

- Remove unused Path import and unused _extract_* aliases from inprocess.py - Remove unused pytest import from test_lancedb_utils.py - Apply black formatting to set literal and DataFrame constructor Made-with: Cursor

…import The ingest_modes __init__.py eagerly imports batch/fused/inprocess/online which pull in ray, torch, etc. Pre-populate sys.modules with MagicMock stubs so lancedb_utils tests can run in lightweight CI without those deps. Made-with: Cursor

Centralises gold_to_doc_page, hit_key_and_distance, estimate_processed_pages, and print_pages_per_second that were duplicated across batch, inprocess, online, and fused pipeline examples. Fixes broken imports in fused_pipeline.py that referenced non-existent functions in batch_pipeline.py. Made-with: Cursor

Extracts duplicated detection summary computation and printing into a shared utils/detection_summary.py module, replacing ~200 lines of near-identical logic in batch_pipeline.py and inprocess.py with thin wrappers around the shared implementation. Made-with: Cursor

jdye64 added 7 commits March 6, 2026 18:38

Consolidate LanceDB row construction, schema, and table creation

7fe8d21

Extracts duplicated LanceDB row-building, schema definition, and table-creation logic from batch.py and inprocess.py into a shared ingest_modes/lancedb_utils.py module. Made-with: Cursor

Merge branch 'main' into consolidate/pr2-lancedb-utils

5e734c4

Fix lint: remove unused imports, apply black formatting

13933f7

- Remove unused Path import and unused _extract_* aliases from inprocess.py - Remove unused pytest import from test_lancedb_utils.py - Apply black formatting to set literal and DataFrame constructor Made-with: Cursor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate/pr4 detection summary#1511

Consolidate/pr4 detection summary#1511
jdye64 wants to merge 7 commits intoNVIDIA:mainfrom
jdye64:consolidate/pr4-detection-summary

jdye64 commented Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jdye64 commented Mar 7, 2026

Summary

Changes

Stats

Test plan

Stack

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant