[retriever] feat: fused split taskpool + TaskPoolStrategy switch #1451
Draft
jioffe502 wants to merge 5 commits into NVIDIA:main
Enable combined dense vector + full-text search via --hybrid flag in the batch pipeline. When enabled, ingestion creates both IVF_HNSW_SQ and FTS indices, and recall uses RRF reranking to merge results.

On jp20 (20 PDFs, 115 queries):
Dense: recall@1=0.61, recall@5=0.90, recall@10=0.96
Hybrid: recall@1=0.65, recall@5=0.94, recall@10=0.96

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
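For readers unfamiliar with RRF (Reciprocal Rank Fusion), the merge step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `rrf_merge`, the page ids, and the constant `k=60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed default, not a value
    taken from the PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a dense index and a full-text index.
dense_hits = ["p3", "p1", "p7"]
fts_hits = ["p1", "p9", "p3"]
fused = rrf_merge([dense_hits, fts_hits])
print(fused)
```

Ids that rank well in both lists rise to the top, which is consistent with hybrid search improving recall@1 and recall@5 in the numbers above while leaving recall@10 unchanged.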
- Fuse PDFSplit + PDFExtraction into single PDFSplitAndExtractActor that opens each multi-page PDF once (eliminates redundant single-page PDF serialization)
- Render pages to numpy arrays instead of PNG→base64 encoding; downstream page_elements and OCR stages accept numpy directly, only encoding to base64 for remote NIM endpoints
- Switch PDF extraction from TaskPoolStrategy to ActorPoolStrategy (eliminates per-task process creation + library import overhead)
- Increase pdf_split_batch_size default from 1 to 4
- Remove repartition barrier between extract and page_elements stages
- Add BGR→RGB channel swap via OpenCV (aligned with api/ pdfium.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
(cherry picked from commit d2dc2c7a5202fb52e7103d53a20a762f4414efa1)
Keep page-image transport as base64 to avoid object-store bloat and spill pressure while preserving fused split+extract and ActorPool improvements.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor
(cherry picked from commit e84da71b618fd34a4666b6676c2b8866e8721ded)
- Run split_and_extract_pdf with TaskPoolStrategy(size=workers)
- Keep GPU page-elements and OCR stages on ActorPoolStrategy
- Remove unused PDFSplitAndExtractActor wrapper class

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor
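The task-pool versus actor-pool distinction in these commits can be illustrated with a plain-Python analogy. This is not Ray code and not the PR's implementation: `extract_task`, `ExtractActor`, and the `setup_calls` counter are hypothetical names, and the counter merely stands in for per-worker setup such as library imports or model loading.

```python
# Counts how often the (simulated) setup cost is paid in each style.
setup_calls = {"task": 0, "actor": 0}


def extract_task(batch):
    """Task-style: a stateless function; setup is modeled as paid per call."""
    setup_calls["task"] += 1  # stand-in for imports / model load
    return [item.upper() for item in batch]


class ExtractActor:
    """Actor-style: setup runs once in __init__, then the instance is reused."""

    def __init__(self):
        setup_calls["actor"] += 1  # stand-in for imports / model load

    def __call__(self, batch):
        return [item.upper() for item in batch]


batches = [["a", "b"], ["c"], ["d", "e"]]

task_out = [extract_task(b) for b in batches]

actor = ExtractActor()
actor_out = [actor(b) for b in batches]

print(task_out == actor_out, setup_calls)
```

Both styles produce the same output. The stateful style amortizes setup, which suits the GPU stages that hold loaded models, while a stateless function is simpler to schedule, which matches the reasoning given for keeping CPU extraction on TaskPoolStrategy.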
Description
Collapse PDF split + extraction into a single stage to remove redundant PDF re-open/serialization work. Keep CPU extraction on TaskPoolStrategy for simpler, more deterministic throughput while preserving existing GPU actor behavior.
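The saving from fusing the two stages can be shown with a toy model. The sketch below is an assumption-laden illustration, not the PR's code: `CountingOpener` is a hypothetical stand-in for an expensive PDF parse, a "document" is just bytes with pages separated by form feeds, and "extraction" is simply upper-casing the page text.

```python
class CountingOpener:
    """Hypothetical stand-in for a PDF library; counts how often a document is opened."""

    def __init__(self):
        self.opens = 0

    def open(self, doc_bytes):
        self.opens += 1
        # Toy format: pages are separated by a form feed character.
        return doc_bytes.decode().split("\f")


def split_then_extract(opener, doc_bytes):
    """Two stages: split into single-page documents, then re-open each one."""
    single_pages = [page.encode() for page in opener.open(doc_bytes)]
    return [opener.open(page)[0].upper() for page in single_pages]


def split_and_extract(opener, doc_bytes):
    """Fused stage: open the multi-page document once and extract every page."""
    return [page.upper() for page in opener.open(doc_bytes)]


doc = b"page one\fpage two\fpage three"

two_stage = CountingOpener()
fused = CountingOpener()
out_two_stage = split_then_extract(two_stage, doc)
out_fused = split_and_extract(fused, doc)

print(two_stage.opens, fused.opens)
```

For an N-page document the two-stage path opens N + 1 documents and serializes N single-page copies, while the fused path opens one, and both return the same extracted pages.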
Checklist