Skip to content

Consolidate/pr4 detection summary#1511

Draft
jdye64 wants to merge 7 commits intoNVIDIA:mainfrom
jdye64:consolidate/pr4-detection-summary
Draft

Consolidate/pr4 detection summary#1511
jdye64 wants to merge 7 commits intoNVIDIA:mainfrom
jdye64:consolidate/pr4-detection-summary

Conversation

@jdye64
Copy link
Collaborator

@jdye64 jdye64 commented Mar 7, 2026

Summary

Extract duplicated detection-summary logic from batch_pipeline.py and inprocess.py into a single shared module at utils/detection_summary.py.

Both pipelines contained ~100 lines of near-identical code to collect per-model detection totals (PageElements v3, OCR table/chart/infographic), deduplicate by (source_id, page_number), and pretty-print the result. This PR replaces that with thin wrappers around a shared compute_detection_summary() core that accepts a generic row iterator, plus two adapters:

  • iter_lancedb_rows — reads from a LanceDB table (batch pipeline path)
  • iter_dataframe_rows — reads from an in-memory pandas DataFrame (inprocess pipeline path)

Changes

  • New nemo_retriever/utils/detection_summary.py — shared detection summary computation (compute_detection_summary), LanceDB/DataFrame iterators, print_detection_summary, and convenience wrappers (collect_detection_summary_from_lancedb, collect_detection_summary_from_df).
  • Simplified examples/batch_pipeline.py_collect_detection_summary and _print_detection_summary now delegate to the shared module (~90 lines removed).
  • Simplified ingest_modes/inprocess.py_collect_summary_from_df and _print_detection_summary now delegate to the shared module (~100 lines removed).
  • Cleaned up utils/hf_model_registry.py — removed stale nvidia/llama-nemotron-embed-1b-v2 revision entry.
  • Updated tests/test_create_local_embedder.py — assertion updated to match corrected alias target (nvidia/llama-3.2-nv-embedqa-1b-v2).
  • Simplified tests/test_lancedb_utils.py — trimmed heavy-module stub list (now stubs the four ingest_modes siblings directly instead of all transitive deps); removed unnecessary importorskip("pyarrow").

Stats

 6 files changed, 198 insertions(+), 237 deletions(-)

Test plan

  • Existing unit tests pass (test_lancedb_utils.py, test_create_local_embedder.py)
  • pre-commit run --all-files passes (black, flake8, trailing whitespace)
  • Run batch pipeline end-to-end and verify detection summary JSON output matches previous format
  • Run inprocess pipeline end-to-end and verify printed detection summary matches previous output

Stack

This is PR 4 in the consolidation series:

  1. PR1 — embedder factory (merged)
  2. PR2 — LanceDB utils (merged)
  3. PR3 — recall helpers (consolidate/pr3-recall-helpers)
  4. PR4 — detection summary (consolidate/pr4-detection-summary) ← this PR

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

jdye64 added 7 commits March 6, 2026 18:38
Centralises the resolve → branch → construct pattern for local HF embedding
models (VL and non-VL) that was duplicated across batch, inprocess, fused,
gpu_pool, recall, retriever, and text_embed code paths into a single
`create_local_embedder` factory function.

Made-with: Cursor
Extracts duplicated LanceDB row-building, schema definition, and
table-creation logic from batch.py and inprocess.py into a shared
ingest_modes/lancedb_utils.py module.

Made-with: Cursor
- Remove unused Path import and unused _extract_* aliases from inprocess.py
- Remove unused pytest import from test_lancedb_utils.py
- Apply black formatting to set literal and DataFrame constructor

Made-with: Cursor
…import

The ingest_modes __init__.py eagerly imports batch/fused/inprocess/online
which pull in ray, torch, etc. Pre-populate sys.modules with MagicMock
stubs so lancedb_utils tests can run in lightweight CI without those deps.

Made-with: Cursor
Centralises gold_to_doc_page, hit_key_and_distance, estimate_processed_pages,
and print_pages_per_second that were duplicated across batch, inprocess,
online, and fused pipeline examples. Fixes broken imports in fused_pipeline.py
that referenced non-existent functions in batch_pipeline.py.

Made-with: Cursor
Extracts duplicated detection summary computation and printing into a
shared utils/detection_summary.py module, replacing ~200 lines of
near-identical logic in batch_pipeline.py and inprocess.py with thin
wrappers around the shared implementation.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant