
Conversation

@ayushag-nv (Contributor) commented Oct 30, 2025

Overview:

Introduces a new --multimodal-encode-prefill-worker flag to support models that require integrated image encoding (e.g., Llama 4), where the same worker handles image encoding, prefill, and decode operations.

Changes

  • New flag: --multimodal-encode-prefill-worker (see the sketch after this list)
    • Alternative to --multimodal-worker for models requiring inline image processing
    • Registers on encoder.generate endpoint
    • Handles complete inference pipeline: encode → prefill → decode
  • Consolidated implementation: Both --multimodal-worker and --multimodal-encode-prefill-worker now use the same init_multimodal_worker() function, eliminating code duplication
  • Updated launch script: agg_multimodal_llama.sh now uses the new flag
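
A minimal sketch of how the new flag could be wired in components/src/dynamo/vllm/args.py, assuming argparse-style parsing: the flag and Config field names come from this PR, but the helper functions, help text, and error message below are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass
import argparse


@dataclass
class Config:
    # Existing multimodal role flags (per the review notes further down);
    # other fields of the real Config class are omitted here.
    multimodal_processor: bool = False
    multimodal_encode_worker: bool = False
    multimodal_worker: bool = False
    multimodal_encode_prefill_worker: bool = False  # new in this PR


def add_multimodal_flags(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper; the real args.py defines this flag inline.
    parser.add_argument(
        "--multimodal-encode-prefill-worker",
        action="store_true",
        help="Run a single worker that handles encode, prefill, and decode "
        "(for models such as Llama 4 that require integrated image encoding).",
    )


def check_multimodal_exclusivity(config: Config) -> None:
    # The PR extends an existing exclusivity check; this sketches the idea.
    roles = (
        config.multimodal_processor,
        config.multimodal_encode_worker,
        config.multimodal_worker,
        config.multimodal_encode_prefill_worker,
    )
    if sum(roles) > 1:
        raise ValueError("Select at most one multimodal worker role per process")
```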

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Added support for a dedicated multimodal encode-prefill worker mode for multimodal model processing.
    • Introduced new configuration option to enable separate encode-prefill handling in multimodal pipelines.

@ayushag-nv requested review from a team as code owners October 30, 2025 08:00

copy-pr-bot bot commented Oct 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Oct 30, 2025

Walkthrough

Introduced a new multimodal-encode-prefill-worker option to the vLLM backend system. Updated the launcher script to use this worker type, added corresponding CLI flag and Config field to argument parsing, and extended main worker routing logic to support the new worker mode.

Changes

  • Launcher script (components/backends/vllm/launch/agg_multimodal_llama.sh): Replaced separate processor and encode_prefill worker launches with a unified multimodal processing flow; added the gpu-memory-utilization parameter to the encode-prefill path.
  • Argument parsing and configuration (components/src/dynamo/vllm/args.py): Added the multimodal_encode_prefill_worker CLI flag and a corresponding boolean field to the Config class; updated the multimodal flag exclusivity check to include the new option.
  • Worker routing and handler initialization (components/src/dynamo/vllm/main.py): Extended the multimodal initialization path to trigger when either multimodal_worker or multimodal_encode_prefill_worker is set; replaced the client parameter with downstream_client in the MultimodalPDWorkerHandler constructor; set downstream_client to None for aggregated mode.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Public API change: New multimodal_encode_prefill_worker boolean field added to Config class; verify no downstream dependencies are broken
  • Handler signature refactor: MultimodalPDWorkerHandler now receives downstream_client instead of client; confirm all call sites updated consistently and aggregated mode behavior is correct
  • Routing logic correctness: Verify the extended multimodal initialization path correctly handles both worker types and their respective operational modes (see the sketch below)
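
The last two bullets can be illustrated with a small sketch: init_multimodal_worker and MultimodalPDWorkerHandler are names taken from this PR, but the stub class, signatures, and wiring below are assumptions rather than the actual main.py code.

```python
class MultimodalPDWorkerHandler:
    # Stub standing in for the real handler in
    # components/src/dynamo/vllm/multimodal_handlers/worker_handler.py.
    def __init__(self, downstream_client=None):
        # None means aggregated mode: encode + prefill + decode in this worker.
        self.downstream_client = downstream_client


async def init_multimodal_worker(config):
    # Aggregated mode has no separate decode worker to forward requests to.
    handler = MultimodalPDWorkerHandler(downstream_client=None)
    # Engine setup and endpoint serving are omitted in this sketch.
    return handler


async def init(config):
    # Both flags route to the same unified initialization function.
    if config.multimodal_worker or config.multimodal_encode_prefill_worker:
        return await init_multimodal_worker(config)
    return None
```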

Poem

🐰 A new worker hops into the scene,
Encode-prefill dreams now lean and clean,
Config fields bloom, flags take flight,
Routing paths adjusted just right! ✨
One router binds them all with grace,
Multimodal magic fills the space!

Pre-merge checks

❌ Failed checks (2 warnings)

  • Description Check (⚠️ Warning): The PR description provides a solid Overview section explaining the new --multimodal-encode-prefill-worker flag and its purpose, but the rest of the template is incomplete. The "Details" section contains only the template comment placeholder, the "Where should the reviewer start?" section holds only the template prompt, and the Related Issues section references the placeholder "#xxx" rather than an actual issue number. Two of the four main template sections are unfilled, leaving the description largely incomplete. Resolution: fill in the Details section with a concise explanation of the implementation changes, add specific file recommendations under "Where should the reviewer start?" (consider highlighting components/src/dynamo/vllm/args.py, components/src/dynamo/vllm/main.py, and the launch script), and replace the "#xxx" placeholder with an actual GitHub issue number or remove it if no specific issue is being closed.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check (✅ Passed): The PR title "chore: support for agg llama4 mulimodal" directly relates to the main objective of this changeset, which is to introduce support for aggregated Llama 4 multimodal processing through a new worker flag and launch script updates. The title accurately captures the primary intent despite a minor typo ("mulimodal" instead of "multimodal"). It is concise, specific enough for scanning history, and clearly conveys that this PR adds multimodal support for Llama 4 in an aggregated setup.

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/args.py (1)

74-74: Add missing Config field declaration.

The multimodal_encode_prefill_worker field is assigned at line 248 but not declared in the Config class. While Python allows dynamic attribute assignment, this is inconsistent with the pattern used for other multimodal flags (lines 69-71) and could cause confusion or type checking issues.

Apply this diff to add the field declaration:

```diff
     multimodal_processor: bool = False
     multimodal_encode_worker: bool = False
     multimodal_worker: bool = False
+    multimodal_encode_prefill_worker: bool = False
     mm_prompt_template: str = "USER: <image>\n<prompt> ASSISTANT:"
```
🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)

577-584: Formatting change: No functional impact.

The endpoint serving calls have been reformatted to place handler.generate, metrics_labels, and handler.clear_kv_blocks on separate lines. This improves readability without changing behavior.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bec8b3 and bd265b4.

📒 Files selected for processing (3)
  • components/backends/vllm/launch/agg_multimodal_llama.sh (1 hunks)
  • components/src/dynamo/vllm/args.py (4 hunks)
  • components/src/dynamo/vllm/main.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.

Applied to files:

  • components/backends/vllm/launch/agg_multimodal_llama.sh
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (3)
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (1)
  • MultimodalPDWorkerHandler (84-260)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • component (88-92)
  • generate (1363-1402)
  • serve_endpoint (140-152)
lib/bindings/python/rust/lib.rs (3)
  • component (790-796)
  • generate (828-840)
  • serve_endpoint (705-758)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
components/backends/vllm/launch/agg_multimodal_llama.sh (1)

14-17: LGTM! Clear documentation of the architectural choice.

The comment effectively explains why Llama 4 requires integrated encoding, and the worker configuration parameters are appropriate for the aggregated encode+prefill+decode workflow.

components/src/dynamo/vllm/args.py (2)

160-164: LGTM! CLI flag properly documented.

The help text clearly distinguishes this worker type from the other multimodal options and explains its use case for models like Llama 4 that require integrated encoding.


222-224: Both worker types intentionally share the encoder.generate endpoint—this is the intended design.

The mutual exclusivity check (lines 204-213) ensures these worker types cannot run in the same process. However, the verification reveals they follow different initialization code paths:

  • --multimodal-encode-worker → init_multimodal_encode_worker() with EncodeWorkerHandler
  • --multimodal-encode-prefill-worker → init_multimodal_worker() with a different handler

Both register on encoder.generate because they serve the same semantic role (encoding), but use different handler implementations. The processor (lines 450-470) connects to encoder.generate generically, unaware of which handler type is serving it. This distributed system pattern—allowing different implementations to share an endpoint across processes—provides deployment flexibility while the intra-process mutual exclusivity check prevents conflicts. No issues found.
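
To make the shared-endpoint design concrete, here is a hypothetical selector: the "encoder" and "generate" constants are inferred from the encoder.generate name above, and the helper function itself does not exist in the codebase.

```python
# Hypothetical illustration only; dynamo's actual registration code differs.
ENCODER_COMPONENT = "encoder"   # inferred from "encoder.generate"
GENERATE_ENDPOINT = "generate"


def encoder_endpoint_for(config):
    # Encode-only and encode+prefill+decode workers serve the same semantic
    # endpoint; the processor connects to encoder.generate without knowing
    # which handler implementation sits behind it.
    if config.multimodal_encode_worker or config.multimodal_encode_prefill_worker:
        return ENCODER_COMPONENT, GENERATE_ENDPOINT
    raise ValueError("Not an encoder-facing worker role")
```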

components/src/dynamo/vllm/main.py (3)

107-109: LGTM! Routing logic correctly updated.

The condition properly routes both multimodal_worker and multimodal_encode_prefill_worker to the unified initialization function, consolidating the code path as intended.


540-548: LGTM! Docstring effectively documents the dual-mode design.

The updated documentation clearly explains the two supported modes and their operational differences, which will help maintainers understand the architectural choice.


557-559: Track disaggregated mode implementation.

The TODO indicates that disaggregated mode (prefill → decode split) is not yet implemented for the multimodal worker. The current aggregated mode (P+D in single worker) should work correctly with downstream_client=None, but ensure this limitation is tracked in the referenced GitHub issue.

According to the PR summary, this PR references closing issue #xxx. Please verify:

  1. Does the issue description clearly state that only aggregated mode is supported in this PR?
  2. Is there a follow-up issue to track disaggregated mode implementation?

Based on learnings: The MultimodalPDWorkerHandler constructor (from relevant code snippets) has decode_worker_client: Client = None as an optional parameter, so passing None is safe. However, verify that the handler's logic properly checks for None before attempting to use the client for disaggregated operations.
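
A simplified illustration of the None guard this comment asks to verify: the class and method names below are hypothetical, and the real MultimodalPDWorkerHandler (with its decode_worker_client / downstream_client parameter) may be structured differently.

```python
class HandlerSketch:
    """Hypothetical stand-in for MultimodalPDWorkerHandler's guard logic."""

    def __init__(self, downstream_client=None):
        self.downstream_client = downstream_client  # None in aggregated (P+D) mode

    async def generate(self, request):
        if self.downstream_client is None:
            # Aggregated mode: prefill and decode both run locally in this worker.
            return await self._generate_locally(request)
        # Disaggregated mode (still a TODO in this PR): hand the request off to
        # the remote decode worker through the downstream client.
        return await self.downstream_client.generate(request)

    async def _generate_locally(self, request):
        # Hypothetical placeholder for the local inference path.
        raise NotImplementedError
```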

Signed-off-by: ayushag <[email protected]>
@krishung5 (Contributor):

From @ayushag-nv: validated Llama 4 on H100 x8.

@ayushag-nv (Contributor, Author):

/ok to test b9c50d7

@ayushag-nv enabled auto-merge (squash) October 30, 2025 19:27
@krishung5 (Contributor):

/ok to test b9c50d7

@ayushag-nv (Contributor, Author):

/ok to test af4eb7d

@ayushag-nv (Contributor, Author):

/ok to test 9dcea6e

@ayushag-nv merged commit 22d910a into main Oct 30, 2025
23 of 34 checks passed
@ayushag-nv deleted the ayushag/mm-agg-llama4 branch October 30, 2025 22:08