
[retriever] feat: fused split taskpool + TaskPoolStrategy switch#1451

Draft
jioffe502 wants to merge 5 commits into NVIDIA:main from jioffe502:feat/fused-split-taskpool-minimal

@jioffe502
Collaborator

Description

Collapse PDF split + extraction into a single stage to remove redundant PDF re-open/serialization work. Keep CPU extraction on TaskPoolStrategy for simpler, more deterministic throughput while preserving existing GPU actor behavior.

  • Replace the split -> single-page serialize -> extract sequence with a fused split_and_extract_pdf that opens each PDF once per batch, removing the overhead of the old split/extract boundary.
  • Use TaskPoolStrategy(size=pdf_extract_workers) for the fused CPU stage; leave page-elements/OCR GPU stages on ActorPoolStrategy unchanged.
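
The difference between the old and fused shapes can be sketched without any PDF library. This is only an illustration: `open_pdf`, `split_then_extract`, and `fused_split_and_extract` are hypothetical names, a "PDF" is modeled as a list of page strings, and the counter stands in for the cost of opening and parsing a document. It is not the PR's `split_and_extract_pdf` implementation.

```python
# Hypothetical stand-ins: `open_pdf` models the cost of opening/parsing a PDF,
# and a "PDF" is just a list of page strings.
OPEN_CALLS = {"count": 0}


def open_pdf(pages: list[str]) -> list[str]:
    """Pretend to open a document; each call counts as one open."""
    OPEN_CALLS["count"] += 1
    return list(pages)


def split_then_extract(pages: list[str]) -> list[dict]:
    """Old shape: split into single-page documents, then re-open each one."""
    single_page_docs = [[page] for page in open_pdf(pages)]  # split: 1 open
    rows = []
    for number, doc in enumerate(single_page_docs):
        text = open_pdf(doc)[0]  # extract: 1 more open per page
        rows.append({"page": number, "text": text})
    return rows


def fused_split_and_extract(pages: list[str]) -> list[dict]:
    """Fused shape: open once, emit one row per page."""
    doc = open_pdf(pages)
    return [{"page": number, "text": text} for number, text in enumerate(doc)]
```

Under this model an N-page document costs N + 1 opens on the old path and a single open on the fused path, while both produce the same rows.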

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

jioffe502 and others added 5 commits February 26, 2026 03:04
Enable combined dense vector + full-text search via --hybrid flag in
the batch pipeline. When enabled, ingestion creates both IVF_HNSW_SQ
and FTS indices, and recall uses RRF reranking to merge results.

On jp20 (20 PDFs, 115 queries):
  Dense:  recall@1=0.61, recall@5=0.90, recall@10=0.96
  Hybrid: recall@1=0.65, recall@5=0.94, recall@10=0.96
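
For readers unfamiliar with RRF (reciprocal rank fusion), a minimal sketch of how two ranked lists can be merged is shown here. The function name `rrf_merge` and the constant `k=60` are assumptions for illustration (60 is a commonly used default); the pipeline's actual reranker and its parameters are not shown in this PR.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists by summing 1 / (k + rank) per list (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]  # hypothetical dense-vector ranking
fts_hits = ["b", "d", "a"]    # hypothetical full-text ranking
merged = rrf_merge([dense_hits, fts_hits])
```

Documents that appear in both lists accumulate two terms, so they tend to rise above documents that rank well in only one list.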

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

- Fuse PDFSplit + PDFExtraction into single PDFSplitAndExtractActor that
  opens each multi-page PDF once (eliminates redundant single-page PDF
  serialization)
- Render pages to numpy arrays instead of PNG→base64 encoding; downstream
  page_elements and OCR stages accept numpy directly, only encoding to
  base64 for remote NIM endpoints
- Switch PDF extraction from TaskPoolStrategy to ActorPoolStrategy
  (eliminates per-task process creation + library import overhead)
- Increase pdf_split_batch_size default from 1 to 4
- Remove repartition barrier between extract and page_elements stages
- Add BGR→RGB channel swap via OpenCV (aligned with api/ pdfium.py)
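
The channel swap and the "encode only at the remote boundary" idea can be sketched with the standard library alone. The commit performs the swap with OpenCV on numpy arrays; the version below works on a packed 3-bytes-per-pixel buffer so it runs without dependencies, and `bgr_to_rgb` and `encode_for_remote` are hypothetical names.

```python
import base64


def bgr_to_rgb(pixels: bytes) -> bytes:
    """Swap the first and third channel of a packed 3-bytes-per-pixel buffer."""
    if len(pixels) % 3 != 0:
        raise ValueError("expected 3 bytes per pixel")
    swapped = bytearray(pixels)
    swapped[0::3] = pixels[2::3]  # R takes the old third channel
    swapped[2::3] = pixels[0::3]  # B takes the old first channel
    return bytes(swapped)


def encode_for_remote(pixels: bytes) -> str:
    """Base64-encode raw pixels; only needed when crossing a remote boundary."""
    return base64.b64encode(pixels).decode("ascii")


two_bgr_pixels = bytes([1, 2, 3, 4, 5, 6])
rgb = bgr_to_rgb(two_bgr_pixels)
payload = encode_for_remote(rgb)
```

Applying the swap twice returns the original buffer, which is a quick sanity check when aligning channel order between two code paths.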

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
(cherry picked from commit d2dc2c7a5202fb52e7103d53a20a762f4414efa1)

Keep page-image transport as base64 to avoid object-store bloat and spill pressure while preserving fused split+extract and ActorPool improvements.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor
(cherry picked from commit e84da71b618fd34a4666b6676c2b8866e8721ded)

- Run split_and_extract_pdf with TaskPoolStrategy(size=workers)
- Keep GPU page-elements and OCR stages on ActorPoolStrategy
- Remove unused PDFSplitAndExtractActor wrapper class
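
The split between a bounded pool of stateless tasks for the CPU stage and long-lived stateful workers for the GPU stages can be illustrated with the standard library. This is an analogy, not Ray code: a thread pool stands in for the task pool, a plain class stands in for an actor, and `extract_task`, `ModelActor`, and `run_pipeline` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_task(doc: str) -> str:
    """Stateless task: nothing is kept between calls, so any worker can run it."""
    return doc.upper()


class ModelActor:
    """Stateful worker: expensive setup runs once, then serves many calls."""

    loads = 0  # counts how many times the (pretend) model was loaded

    def __init__(self) -> None:
        ModelActor.loads += 1
        self.model = str.title  # stand-in for loading model weights

    def __call__(self, text: str) -> str:
        return self.model(text)


def run_pipeline(docs: list[str], workers: int) -> list[str]:
    # CPU stage: at most `workers` stateless tasks in flight at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        extracted = list(pool.map(extract_task, docs))
    # GPU-like stage: one long-lived worker reused for every item.
    actor = ModelActor()
    return [actor(text) for text in extracted]
```

The stateless function needs no wrapper class, which matches the removal of the unused actor wrapper, while the stateful stage still benefits from paying its setup cost only once.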

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor
@jioffe502 jioffe502 requested a review from a team as a code owner February 27, 2026 18:50
@jioffe502 jioffe502 changed the title from "Feat: fused split taskpool minimal" to "Feat: fused split taskpool + TaskPoolStrategy switch" Feb 27, 2026
@jioffe502 jioffe502 marked this pull request as draft February 27, 2026 18:50
@jioffe502 jioffe502 changed the title from "Feat: fused split taskpool + TaskPoolStrategy switch" to "[retriever] feat: fused split taskpool + TaskPoolStrategy switch" Feb 27, 2026