chore: support for agg llama4 multimodal #3984
Conversation
Signed-off-by: ayushag <[email protected]>
Walkthrough: Introduced a new multimodal-encode-prefill-worker option to the vLLM backend system. Updated the launcher script to use this worker type, added a corresponding CLI flag and Config field to argument parsing, and extended the main worker routing logic to support the new worker mode.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings)
 ✅ Passed checks (1 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
components/src/dynamo/vllm/args.py (1)
74-74: Add missing Config field declaration. The `multimodal_encode_prefill_worker` field is assigned at line 248 but not declared in the `Config` class. While Python allows dynamic attribute assignment, this is inconsistent with the pattern used for the other multimodal flags (lines 69-71) and could cause confusion or type-checking issues. Apply this diff to add the field declaration:
```diff
 multimodal_processor: bool = False
 multimodal_encode_worker: bool = False
 multimodal_worker: bool = False
+multimodal_encode_prefill_worker: bool = False
 mm_prompt_template: str = "USER: <image>\n<prompt> ASSISTANT:"
```
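For context, a minimal sketch of the pattern the comment describes, assuming a dataclass-style `Config` and an argparse-based parser; the helper name and assignment below are hypothetical, while the real definitions live in `components/src/dynamo/vllm/args.py`:

```python
from dataclasses import dataclass
import argparse


@dataclass
class Config:
    # Existing multimodal flags, per the diff above.
    multimodal_processor: bool = False
    multimodal_encode_worker: bool = False
    multimodal_worker: bool = False
    # Declaring the new field here keeps type checkers and readers aware of
    # the attribute that the parsing code assigns later.
    multimodal_encode_prefill_worker: bool = False
    mm_prompt_template: str = "USER: <image>\n<prompt> ASSISTANT:"


def apply_flags(config: Config, args: argparse.Namespace) -> Config:
    # Hypothetical helper: without the field declaration, this assignment
    # still works at runtime (Python allows dynamic attributes), but static
    # tooling would not know the attribute exists on Config.
    config.multimodal_encode_prefill_worker = args.multimodal_encode_prefill_worker
    return config
```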
🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)
577-584: Formatting change: no functional impact. The endpoint serving calls have been reformatted to place `handler.generate`, `metrics_labels`, and `handler.clear_kv_blocks` on separate lines. This improves readability without changing behavior.
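For illustration only, a sketch of the call shape described above; the exact `serve_endpoint` signature and argument order are assumptions, not taken from the bindings:

```python
# Hypothetical shape: argument order is assumed purely to show the
# one-argument-per-line formatting; see lib/bindings for the real
# serve_endpoint signature.
async def serve(generate_endpoint, handler, metrics_labels):
    await generate_endpoint.serve_endpoint(
        handler.generate,
        metrics_labels,
        handler.clear_kv_blocks,
    )
```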
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- components/backends/vllm/launch/agg_multimodal_llama.sh (1 hunks)
- components/src/dynamo/vllm/args.py (4 hunks)
- components/src/dynamo/vllm/main.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Applied to files:
- components/backends/vllm/launch/agg_multimodal_llama.sh
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (3)
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (1)
- MultimodalPDWorkerHandler (84-260)
lib/bindings/python/src/dynamo/_core.pyi (3)
- component (88-92)
- generate (1363-1402)
- serve_endpoint (140-152)
lib/bindings/python/rust/lib.rs (3)
- component (790-796)
- generate (828-840)
- serve_endpoint (705-758)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
components/backends/vllm/launch/agg_multimodal_llama.sh (1)
14-17: LGTM! Clear documentation of the architectural choice. The comment effectively explains why Llama 4 requires integrated encoding, and the worker configuration parameters are appropriate for the aggregated encode+prefill+decode workflow.
components/src/dynamo/vllm/args.py (2)
160-164: LGTM! CLI flag properly documented. The help text clearly distinguishes this worker type from the other multimodal options and explains its use case for models like Llama 4 that require integrated encoding.
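As a rough sketch of how such a flag is typically declared with argparse (the real declaration lives in args.py; the `store_true` action, default, and help wording here are assumptions paraphrased from the PR description):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--multimodal-encode-prefill-worker",
    action="store_true",
    default=False,
    # Help text paraphrased from the PR description; the actual wording
    # in args.py may differ.
    help=(
        "Run a worker that performs image encoding, prefill, and decode in "
        "one process, for models such as Llama 4 that require integrated "
        "encoding."
    ),
)
args = parser.parse_args([])  # example invocation with no flags set
```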
222-224: Both worker types intentionally share the `encoder.generate` endpoint; this is the intended design. The mutual exclusivity check (lines 204-213) ensures these worker types cannot run in the same process. However, the verification reveals they follow different initialization code paths:
- `--multimodal-encode-worker` → `init_multimodal_encode_worker()` with `EncodeWorkerHandler`
- `--multimodal-encode-prefill-worker` → `init_multimodal_worker()` with a different handler
Both register on `encoder.generate` because they serve the same semantic role (encoding), but use different handler implementations. The processor (lines 450-470) connects to `encoder.generate` generically, unaware of which handler type is serving it. This distributed-system pattern, allowing different implementations to share an endpoint across processes, provides deployment flexibility while the intra-process mutual exclusivity check prevents conflicts. No issues found.
components/src/dynamo/vllm/main.py (3)
107-109: LGTM! Routing logic correctly updated. The condition properly routes both `multimodal_worker` and `multimodal_encode_prefill_worker` to the unified initialization function, consolidating the code path as intended.
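A minimal sketch of the consolidated routing this refers to, with function and flag names taken from the review text rather than the verified source in main.py:

```python
async def route_worker(runtime, config):
    # Both the plain multimodal worker and the new encode+prefill worker
    # share the unified initialization path; the encode-only worker keeps
    # its own path. The init_* functions are defined elsewhere in main.py.
    if config.multimodal_worker or config.multimodal_encode_prefill_worker:
        await init_multimodal_worker(runtime, config)
    elif config.multimodal_encode_worker:
        await init_multimodal_encode_worker(runtime, config)
```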
540-548: LGTM! Docstring effectively documents the dual-mode design. The updated documentation clearly explains the two supported modes and their operational differences, which will help maintainers understand the architectural choice.
557-559: Track disaggregated mode implementation. The TODO indicates that disaggregated mode (prefill → decode split) is not yet implemented for the multimodal worker. The current aggregated mode (P+D in a single worker) should work correctly with `downstream_client=None`, but ensure this limitation is tracked in the referenced GitHub issue. According to the PR summary, this PR references closing issue #xxx. Please verify:
- Does the issue description clearly state that only aggregated mode is supported in this PR?
- Is there a follow-up issue to track disaggregated mode implementation?
Based on learnings: the MultimodalPDWorkerHandler constructor (from the relevant code snippets) has `decode_worker_client: Client = None` as an optional parameter, so passing `None` is safe. However, verify that the handler's logic properly checks for `None` before attempting to use the client for disaggregated operations.
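A minimal sketch of the kind of None-guard the comment asks to verify, assuming the optional `decode_worker_client` constructor parameter quoted above; the class name and handler internals below are hypothetical, not the real worker_handler.py implementation:

```python
class MultimodalPDWorkerHandlerSketch:
    """Illustrative only; the real handler lives in worker_handler.py."""

    def __init__(self, engine, decode_worker_client=None):
        self.engine = engine
        self.decode_worker_client = decode_worker_client

    async def generate(self, request):
        if self.decode_worker_client is None:
            # Aggregated mode: prefill and decode both run in this worker.
            async for out in self.engine.generate(request):
                yield out
        else:
            # Disaggregated mode: would hand off to the decode worker;
            # not yet implemented per the TODO referenced in main.py.
            raise NotImplementedError("disaggregated multimodal mode")
```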
Signed-off-by: ayushag <[email protected]>
From @ayushag-nv: validated llama4 on h100x8.
/ok to test b9c50d7
/ok to test b9c50d7
/ok to test af4eb7d
/ok to test 9dcea6e
Overview:
Introduces a new --multimodal-encode-prefill-worker flag to support models that require integrated image encoding (e.g., Llama 4), where the same worker handles image encoding, prefill, and decode operations.
Changes
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit