Draft
Centralises the resolve → branch → construct pattern for local HF embedding models (VL and non-VL) that was duplicated across batch, inprocess, fused, gpu_pool, recall, retriever, and text_embed code paths into a single `create_local_embedder` factory function. Made-with: Cursor
Extracts duplicated LanceDB row-building, schema definition, and table-creation logic from batch.py and inprocess.py into a shared ingest_modes/lancedb_utils.py module. Made-with: Cursor
- Remove unused Path import and unused _extract_* aliases from inprocess.py
- Remove unused pytest import from test_lancedb_utils.py
- Apply black formatting to set literal and DataFrame constructor
Made-with: Cursor
…import
The ingest_modes `__init__.py` eagerly imports batch/fused/inprocess/online, which pull in ray, torch, etc. Pre-populate `sys.modules` with MagicMock stubs so lancedb_utils tests can run in lightweight CI without those deps. Made-with: Cursor
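The stubbing approach in the commit above can be sketched roughly as follows. This is an illustration only, not the code in `test_lancedb_utils.py`: the helper name `stub_heavy_modules` and the module name `made_up_heavy_dep` are hypothetical, chosen so the example runs anywhere.

```python
import sys
from unittest.mock import MagicMock


def stub_heavy_modules(names):
    """Register MagicMock stand-ins in sys.modules for modules not yet imported.

    Hypothetical helper: the real test file decides which names to stub.
    """
    stubbed = []
    for name in names:
        if name not in sys.modules:
            sys.modules[name] = MagicMock(name=name)
            stubbed.append(name)
    return stubbed


# Must run before importing any package whose __init__ pulls in the heavy deps.
stubbed = stub_heavy_modules(["made_up_heavy_dep"])

# The import statement consults sys.modules first, so this binds the stub
# instead of raising ImportError.
import made_up_heavy_dep  # noqa: E402
```

Because the stubs live in the process-wide `sys.modules` cache, they stay visible to every test that is imported afterwards, so keeping the stubbed list short limits the chance of masking a real import error.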
Centralises gold_to_doc_page, hit_key_and_distance, estimate_processed_pages, and print_pages_per_second that were duplicated across batch, inprocess, online, and fused pipeline examples. Fixes broken imports in fused_pipeline.py that referenced non-existent functions in batch_pipeline.py. Made-with: Cursor
Extracts duplicated detection summary computation and printing into a shared utils/detection_summary.py module, replacing ~200 lines of near-identical logic in batch_pipeline.py and inprocess.py with thin wrappers around the shared implementation. Made-with: Cursor
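As a rough illustration of the shape this last commit describes (one shared core fed by a generic row iterator), here is a minimal sketch. It is not the contents of `utils/detection_summary.py`: the count field names, the `pages_seen` key, the first-row-wins deduplication policy, and the `iter_records` stand-in adapter are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Mapping, Set, Tuple

# Hypothetical per-page count fields; the real column names are not shown here.
COUNT_FIELDS = ("page_elements_v3", "ocr_table", "ocr_chart", "ocr_infographic")

Row = Mapping[str, object]


def compute_detection_summary(rows: Iterable[Row]) -> Dict[str, int]:
    """Sum per-model counts, counting each (source_id, page_number) once.

    Assumption: the first row seen for a page wins; later duplicates are skipped.
    """
    seen: Set[Tuple[object, object]] = set()
    totals = {field: 0 for field in COUNT_FIELDS}
    for row in rows:
        key = (row["source_id"], row["page_number"])
        if key in seen:
            continue  # page already counted
        seen.add(key)
        for field in COUNT_FIELDS:
            totals[field] += int(row.get(field) or 0)
    totals["pages_seen"] = len(seen)
    return totals


def iter_records(records: Iterable[Row]) -> Iterator[Row]:
    """Stand-in adapter over plain dicts (the real adapters read LanceDB or pandas)."""
    for record in records:
        yield record


rows = [
    {"source_id": "a.pdf", "page_number": 1, "page_elements_v3": 3, "ocr_table": 1},
    {"source_id": "a.pdf", "page_number": 1, "page_elements_v3": 3, "ocr_table": 1},
    {"source_id": "a.pdf", "page_number": 2, "page_elements_v3": 2, "ocr_chart": 1},
]
summary = compute_detection_summary(iter_records(rows))
print(summary)
```

Keeping the core ignorant of where rows come from means each pipeline only needs to supply an iterator, which is what lets the per-pipeline functions shrink to thin wrappers.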
Summary
Extract duplicated detection-summary logic from `batch_pipeline.py` and `inprocess.py` into a single shared module at `utils/detection_summary.py`.
Both pipelines contained ~100 lines of near-identical code to collect per-model detection totals (PageElements v3, OCR table/chart/infographic), deduplicate by `(source_id, page_number)`, and pretty-print the result. This PR replaces that with thin wrappers around a shared `compute_detection_summary()` core that accepts a generic row iterator, plus two adapters:
- `iter_lancedb_rows`: reads from a LanceDB table (batch pipeline path)
- `iter_dataframe_rows`: reads from an in-memory pandas DataFrame (inprocess pipeline path)
Changes
- `nemo_retriever/utils/detection_summary.py`: shared detection summary computation (`compute_detection_summary`), LanceDB/DataFrame iterators, `print_detection_summary`, and convenience wrappers (`collect_detection_summary_from_lancedb`, `collect_detection_summary_from_df`).
- `examples/batch_pipeline.py`: `_collect_detection_summary` and `_print_detection_summary` now delegate to the shared module (~90 lines removed).
- `ingest_modes/inprocess.py`: `_collect_summary_from_df` and `_print_detection_summary` now delegate to the shared module (~100 lines removed).
- `utils/hf_model_registry.py`: removed stale `nvidia/llama-nemotron-embed-1b-v2` revision entry.
- `tests/test_create_local_embedder.py`: assertion updated to match corrected alias target (`nvidia/llama-3.2-nv-embedqa-1b-v2`).
- `tests/test_lancedb_utils.py`: trimmed heavy-module stub list (now stubs the four `ingest_modes` siblings directly instead of all transitive deps); removed unnecessary `importorskip("pyarrow")`.
Stats
Test plan
- (`test_lancedb_utils.py`, `test_create_local_embedder.py`)
- `pre-commit run --all-files` passes (black, flake8, trailing whitespace)
Stack
This is PR 4 in the consolidation series:
- PR1: embedder factory (merged)
- PR2: LanceDB utils (merged)
- (`consolidate/pr3-recall-helpers`)
- (`consolidate/pr4-detection-summary`) ← this PR
Checklist