[retriever] feat: fused split taskpool + TaskPoolStrategy switch by jioffe502 · Pull Request #1451 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-02-27T18:50:03Z

Description

Collapse PDF split + extraction into a single stage to remove redundant PDF re-open/serialization work. Keep CPU extraction on TaskPoolStrategy for simpler, more deterministic throughput while preserving existing GPU actor behavior.

- Replace split -> single-page serialize -> extract with a fused split_and_extract_pdf (open each PDF once per batch), removing the old split/extract boundary overhead.
- Use TaskPoolStrategy(size=pdf_extract_workers) for the fused CPU stage; leave the page-elements/OCR GPU stages on ActorPoolStrategy unchanged.
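As a rough illustration of the fused stage described above, the sketch below contrasts the old split -> serialize -> extract path with a fused one and counts how often a document is opened. FakePdf, split_then_extract, split_and_extract, and run_pool are hypothetical names, not the PR's code, and the thread pool is only a standard-library stand-in for a fixed-size pool of stateless tasks such as TaskPoolStrategy(size=pdf_extract_workers).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakePdf:
    """Hypothetical stand-in for a PDF handle; counts how often documents are opened."""

    open_count = 0
    _lock = threading.Lock()

    def __init__(self, pages):
        with FakePdf._lock:
            FakePdf.open_count += 1
        self.pages = list(pages)


def split_then_extract(batch):
    """Unfused path: split into single-page documents, then re-open each one to extract."""
    single_pages = []
    for pages in batch:
        doc = FakePdf(pages)  # open once to split
        single_pages.extend([p] for p in doc.pages)  # one single-page document per page
    # Each single-page document is opened again for extraction.
    return [FakePdf(p).pages[0].upper() for p in single_pages]


def split_and_extract(batch):
    """Fused path: open each document once and extract every page in place."""
    out = []
    for pages in batch:
        doc = FakePdf(pages)
        out.extend(p.upper() for p in doc.pages)
    return out


def run_pool(batches, workers):
    """Run the fused function over batches with a fixed number of stateless workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [row for rows in pool.map(split_and_extract, batches) for row in rows]
```

In this toy model the unfused path opens one document per input file plus one per page, while the fused path opens one per input file. Real extraction work is CPU-bound, so a process-based or Ray task pool would replace the thread pool used here.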

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

Enable combined dense vector + full-text search via --hybrid flag in the batch pipeline. When enabled, ingestion creates both IVF_HNSW_SQ and FTS indices, and recall uses RRF reranking to merge results.

On jp20 (20 PDFs, 115 queries):
Dense: recall@1=0.61, recall@5=0.90, recall@10=0.96
Hybrid: recall@1=0.65, recall@5=0.94, recall@10=0.96

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
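The RRF (reciprocal rank fusion) merge mentioned in this commit message can be sketched in a few lines. This is a generic version of the technique, not the pipeline's reranker: rrf_merge is a hypothetical helper, and k=60 is the commonly used constant, which the commit does not state.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Merge ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]  # hypothetical dense-vector ranking
fts_hits = ["b", "d", "a"]    # hypothetical full-text ranking
merged = rrf_merge([dense_hits, fts_hits])
```

Documents that rank well in both lists rise to the top, which is one plausible reason hybrid search can improve recall@1 over dense search alone.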

- Fuse PDFSplit + PDFExtraction into single PDFSplitAndExtractActor that opens each multi-page PDF once (eliminates redundant single-page PDF serialization)
- Render pages to numpy arrays instead of PNG→base64 encoding; downstream page_elements and OCR stages accept numpy directly, only encoding to base64 for remote NIM endpoints
- Switch PDF extraction from TaskPoolStrategy to ActorPoolStrategy (eliminates per-task process creation + library import overhead)
- Increase pdf_split_batch_size default from 1 to 4
- Remove repartition barrier between extract and page_elements stages
- Add BGR→RGB channel swap via OpenCV (aligned with api/ pdfium.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
(cherry picked from commit d2dc2c7a5202fb52e7103d53a20a762f4414efa1)

Keep page-image transport as base64 to avoid object-store bloat and spill pressure while preserving fused split+extract and ActorPool improvements.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor
(cherry picked from commit e84da71b618fd34a4666b6676c2b8866e8721ded)
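To give a feel for why an encoded payload can be lighter on the object store than a raw pixel array, here is a minimal sketch. It uses zlib as a stand-in for PNG compression (PNG also uses DEFLATE) and a blank white page as input; encode_page and decode_page are hypothetical helpers, and real pages will compress less than this best case.

```python
import base64
import zlib


def encode_page(pixels: bytes) -> str:
    """Compress raw pixel bytes and wrap them as a base64 string for transport."""
    return base64.b64encode(zlib.compress(pixels)).decode("ascii")


def decode_page(payload: str) -> bytes:
    """Reverse encode_page: base64-decode, then decompress back to raw pixels."""
    return zlib.decompress(base64.b64decode(payload))


# A hypothetical 100x100 RGB page that is entirely white.
width, height, channels = 100, 100, 3
raw_pixels = bytes([255]) * (width * height * channels)

payload = encode_page(raw_pixels)
raw_bytes = len(raw_pixels)      # size of the uncompressed array
payload_bytes = len(payload)     # size of the compressed, base64-encoded string
```

Base64 adds roughly a third to whatever it wraps, so the saving comes entirely from the compression step. Whether that holds for a given workload depends on page content and rendering resolution.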

- Run split_and_extract_pdf with TaskPoolStrategy(size=workers)
- Keep GPU page-elements and OCR stages on ActorPoolStrategy
- Remove unused PDFSplitAndExtractActor wrapper class

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Made-with: Cursor

jioffe502 and others added 5 commits February 26, 2026 03:04

Merge remote-tracking branch 'upstream/main' into feat/fused-split-taskpool-minimal (c881937)

jioffe502 requested a review from a team as a code owner February 27, 2026 18:50

jioffe502 requested a review from charlesbluca February 27, 2026 18:50

jioffe502 changed the title ~~Feat: fused split taskpool minimal~~ Feat: fused split taskpool + TaskPoolStrategy switch Feb 27, 2026

jioffe502 marked this pull request as draft February 27, 2026 18:50

jioffe502 changed the title ~~Feat: fused split taskpool + TaskPoolStrategy switch~~ [retriever] feat: fused split taskpool + TaskPoolStrategy switch Feb 27, 2026

jioffe502 wants to merge 5 commits into NVIDIA:main from jioffe502:feat/fused-split-taskpool-minimal
