chore: support for agg llama4 multimodal #3984
Conversation
Signed-off-by: ayushag <[email protected]>
Walkthrough: Introduced a new multimodal-encode-prefill-worker option to the vLLM backend system. Updated the launcher script to use this worker type, added a corresponding CLI flag and Config field to argument parsing, and extended the main worker routing logic to support the new worker mode.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings)
 ✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
components/src/dynamo/vllm/args.py (1)
74-74: Add missing Config field declaration. The `multimodal_encode_prefill_worker` field is assigned at line 248 but not declared in the `Config` class. While Python allows dynamic attribute assignment, this is inconsistent with the pattern used for the other multimodal flags (lines 69-71) and could cause confusion or type-checking issues. Apply this diff to add the field declaration:
```diff
 multimodal_processor: bool = False
 multimodal_encode_worker: bool = False
 multimodal_worker: bool = False
+multimodal_encode_prefill_worker: bool = False
 mm_prompt_template: str = "USER: <image>\n<prompt> ASSISTANT:"
```
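For context, a minimal sketch of the pattern the comment describes, assuming a dataclass-style `Config` and an argparse-based parser; the helper name and assignment below are hypothetical, while the real definitions live in `components/src/dynamo/vllm/args.py`:

```python
from dataclasses import dataclass
import argparse


@dataclass
class Config:
    # Existing multimodal flags, per the diff above.
    multimodal_processor: bool = False
    multimodal_encode_worker: bool = False
    multimodal_worker: bool = False
    # Declaring the new field here keeps type checkers and readers aware of
    # the attribute that the parsing code assigns later.
    multimodal_encode_prefill_worker: bool = False
    mm_prompt_template: str = "USER: <image>\n<prompt> ASSISTANT:"


def apply_flags(config: Config, args: argparse.Namespace) -> Config:
    # Hypothetical helper: without the field declaration, this assignment
    # still works at runtime (Python allows dynamic attributes), but static
    # tooling would not know the attribute exists on Config.
    config.multimodal_encode_prefill_worker = args.multimodal_encode_prefill_worker
    return config
```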
🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)
577-584: Formatting change: no functional impact. The endpoint serving calls have been reformatted to place `handler.generate`, `metrics_labels`, and `handler.clear_kv_blocks` on separate lines. This improves readability without changing behavior.
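For illustration only, a sketch of the call shape described above; the exact `serve_endpoint` signature and argument order are assumptions, not taken from the bindings:

```python
# Hypothetical shape: argument order is assumed purely to show the
# one-argument-per-line formatting; see lib/bindings for the real
# serve_endpoint signature.
async def serve(generate_endpoint, handler, metrics_labels):
    await generate_endpoint.serve_endpoint(
        handler.generate,
        metrics_labels,
        handler.clear_kv_blocks,
    )
```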
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- components/backends/vllm/launch/agg_multimodal_llama.sh (1 hunks)
- components/src/dynamo/vllm/args.py (4 hunks)
- components/src/dynamo/vllm/main.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Applied to files:
- components/backends/vllm/launch/agg_multimodal_llama.sh
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (3)
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (1)
- MultimodalPDWorkerHandler (84-260)
lib/bindings/python/src/dynamo/_core.pyi (3)
- component (88-92)
- generate (1363-1402)
- serve_endpoint (140-152)
lib/bindings/python/rust/lib.rs (3)
- component (790-796)
- generate (828-840)
- serve_endpoint (705-758)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
components/backends/vllm/launch/agg_multimodal_llama.sh (1)
14-17: LGTM! Clear documentation of the architectural choice. The comment effectively explains why Llama 4 requires integrated encoding, and the worker configuration parameters are appropriate for the aggregated encode+prefill+decode workflow.
components/src/dynamo/vllm/args.py (2)
160-164: LGTM! CLI flag properly documented. The help text clearly distinguishes this worker type from the other multimodal options and explains its use case for models like Llama 4 that require integrated encoding.
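As a rough sketch of how such a flag is typically declared with argparse (the real declaration lives in args.py; the `store_true` action, default, and help wording here are assumptions paraphrased from the PR description):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--multimodal-encode-prefill-worker",
    action="store_true",
    default=False,
    # Help text paraphrased from the PR description; the actual wording
    # in args.py may differ.
    help=(
        "Run a worker that performs image encoding, prefill, and decode in "
        "one process, for models such as Llama 4 that require integrated "
        "encoding."
    ),
)
args = parser.parse_args([])  # example invocation with no flags set
```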
222-224: Both worker types intentionally share the `encoder.generate` endpoint; this is the intended design. The mutual exclusivity check (lines 204-213) ensures these worker types cannot run in the same process. However, the verification reveals they follow different initialization code paths:
- `--multimodal-encode-worker` → `init_multimodal_encode_worker()` with `EncodeWorkerHandler`
- `--multimodal-encode-prefill-worker` → `init_multimodal_worker()` with a different handler
Both register on `encoder.generate` because they serve the same semantic role (encoding), but use different handler implementations. The processor (lines 450-470) connects to `encoder.generate` generically, unaware of which handler type is serving it. This distributed-system pattern, allowing different implementations to share an endpoint across processes, provides deployment flexibility while the intra-process mutual exclusivity check prevents conflicts. No issues found.
components/src/dynamo/vllm/main.py (3)
107-109: LGTM! Routing logic correctly updated. The condition properly routes both `multimodal_worker` and `multimodal_encode_prefill_worker` to the unified initialization function, consolidating the code path as intended.
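A minimal sketch of the consolidated routing this refers to, with function and flag names taken from the review text rather than the verified source in main.py:

```python
async def route_worker(runtime, config):
    # Both the plain multimodal worker and the new encode+prefill worker
    # share the unified initialization path; the encode-only worker keeps
    # its own path. The init_* functions are defined elsewhere in main.py.
    if config.multimodal_worker or config.multimodal_encode_prefill_worker:
        await init_multimodal_worker(runtime, config)
    elif config.multimodal_encode_worker:
        await init_multimodal_encode_worker(runtime, config)
```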
540-548: LGTM! Docstring effectively documents the dual-mode design. The updated documentation clearly explains the two supported modes and their operational differences, which will help maintainers understand the architectural choice.
557-559: Track disaggregated mode implementation. The TODO indicates that disaggregated mode (prefill → decode split) is not yet implemented for the multimodal worker. The current aggregated mode (P+D in a single worker) should work correctly with `downstream_client=None`, but ensure this limitation is tracked in the referenced GitHub issue. According to the PR summary, this PR references closing issue #xxx. Please verify:
- Does the issue description clearly state that only aggregated mode is supported in this PR?
- Is there a follow-up issue to track disaggregated mode implementation?
Based on learnings: the MultimodalPDWorkerHandler constructor (from the relevant code snippets) has `decode_worker_client: Client = None` as an optional parameter, so passing `None` is safe. However, verify that the handler's logic properly checks for `None` before attempting to use the client for disaggregated operations.
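A minimal sketch of the kind of None-guard the comment asks to verify, assuming the optional `decode_worker_client` constructor parameter quoted above; the class name and handler internals below are hypothetical, not the real worker_handler.py implementation:

```python
class MultimodalPDWorkerHandlerSketch:
    """Illustrative only; the real handler lives in worker_handler.py."""

    def __init__(self, engine, decode_worker_client=None):
        self.engine = engine
        self.decode_worker_client = decode_worker_client

    async def generate(self, request):
        if self.decode_worker_client is None:
            # Aggregated mode: prefill and decode both run in this worker.
            async for out in self.engine.generate(request):
                yield out
        else:
            # Disaggregated mode: would hand off to the decode worker;
            # not yet implemented per the TODO referenced in main.py.
            raise NotImplementedError("disaggregated multimodal mode")
```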
Signed-off-by: ayushag <[email protected]>
From @ayushag-nv: validated llama4 on h100x8.
/ok to test b9c50d7
/ok to test b9c50d7
/ok to test af4eb7d
/ok to test 9dcea6e
Overview:
Introduces a new --multimodal-encode-prefill-worker flag to support models that require integrated image encoding (e.g., Llama 4), where the same worker handles image encoding, prefill, and decode operations.
Changes
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit