fix: add prefill metrics support for TensorRT-LLM disaggregated mode #3983
Conversation
- Fix KV cache config to preserve cache_transceiver_config from YAML
- Add disagg_same_gpu.sh with metrics enabled on ports 8081/8082
- Add test validation for prefill (8081) and decode (8082) metrics
- Add port parameter to metric_payload_default for multi-port testing

Signed-off-by: Keiven Chang <[email protected]>
Walkthrough

The changes introduce a new disaggregated inference workflow script for single-GPU deployments, update event and metrics handling in the runtime configuration, rename test configurations to reflect the new script, and extend the metrics payload builder with a configurable port parameter.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks

❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/src/dynamo/trtllm/main.py (1)
244-256: Clarify the asymmetric cache_transceiver_config handling.

The comment at line 244 states "preserving cache_transceiver_config from YAML," but this only applies when `current_kv_config` is a dict (lines 252-256). When it's a `KvCacheConfig` object (lines 246-251), the code explicitly creates a new dict with only two fields, discarding any `cache_transceiver_config` that may have been configured. This asymmetry could confuse future maintainers. Consider either:

- Updating the line 244 comment to clarify this conditional preservation
- Documenting why `cache_transceiver_config` should not be preserved from `KvCacheConfig` objects

```diff
- # Add it to kv_cache_config while preserving cache_transceiver_config from YAML
+ # Add event_buffer_max_size to kv_cache_config
+ # Note: cache_transceiver_config is preserved only when loading from YAML (dict case)
  current_kv_config = arg_map["kv_cache_config"]
```
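For context, here is a minimal sketch of the conditional the comment describes. It is not the actual `main.py` code: the stub `KvCacheConfig`, the example `cache_transceiver_config` value, and the assumption that the two preserved fields are `free_gpu_memory_fraction` and `event_buffer_max_size` are all illustrative.

```python
# Sketch of the asymmetry described above; arg_map, event_buffer_max_size, and
# KvCacheConfig are stand-ins for the real objects in main.py.
from dataclasses import dataclass


@dataclass
class KvCacheConfig:
    free_gpu_memory_fraction: float = 0.9


event_buffer_max_size = 1024
arg_map = {"kv_cache_config": {"cache_transceiver_config": {"backend": "DEFAULT"}}}

current_kv_config = arg_map["kv_cache_config"]
if isinstance(current_kv_config, dict):
    # Dict (YAML) case: mutate in place, so cache_transceiver_config survives.
    current_kv_config["event_buffer_max_size"] = event_buffer_max_size
else:
    # KvCacheConfig object case: a fresh dict keeps only two fields, so any
    # cache_transceiver_config configured on the object is dropped.
    arg_map["kv_cache_config"] = {
        "free_gpu_memory_fraction": current_kv_config.free_gpu_memory_fraction,
        "event_buffer_max_size": event_buffer_max_size,
    }
```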
components/backends/trtllm/launch/disagg_same_gpu.sh (1)

29-30: Document engine config path divergence from disagg.sh.

This script uses `trtllm-small` engine configs whereas `disagg.sh` uses `trtllm` configs. This difference is intentional for single-GPU testing (smaller memory footprint), but consider adding a comment to prevent confusion.

```diff
+# Use trtllm-small configs for reduced memory footprint on single GPU
 export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/prefill.yaml"}
 export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/decode.yaml"}
```
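Because the script reads these paths with `${VAR:-default}` expansion, callers can also point at different engine configs without editing the file. For example (the custom paths below are illustrative):

```bash
# Override the default trtllm-small configs at launch time.
PREFILL_ENGINE_ARGS=/path/to/custom/prefill.yaml \
DECODE_ENGINE_ARGS=/path/to/custom/decode.yaml \
./components/backends/trtllm/launch/disagg_same_gpu.sh
```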
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `components/backends/trtllm/launch/disagg.sh` (1 hunks)
- `components/backends/trtllm/launch/disagg_same_gpu.sh` (1 hunks)
- `components/src/dynamo/trtllm/main.py` (1 hunks)
- `tests/serve/test_trtllm.py` (1 hunks)
- `tests/utils/payload_builder.py` (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-08-25T23:24:42.076Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.
Applied to files:
- tests/utils/payload_builder.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.
Applied to files:
- tests/utils/payload_builder.py
📚 Learning: 2025-07-31T11:26:48.422Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
Applied to files:
- tests/serve/test_trtllm.py
🧬 Code graph analysis (2)
components/backends/trtllm/launch/disagg_same_gpu.sh (1)
components/backends/trtllm/launch/disagg.sh (1)
cleanup(19-24)
tests/serve/test_trtllm.py (1)
tests/utils/payload_builder.py (1)
metric_payload_default(65-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/utils/payload_builder.py (1)
65-80: LGTM! Clean extension for multi-port metrics testing.

The addition of the `port` parameter with a sensible default (8081) enables the test infrastructure to validate metrics from both prefill and decode workers on separate ports, as required by the disaggregated mode enhancements.
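As a rough illustration of the shape of that extension, here is a hypothetical sketch; the `MetricPayload` stand-in and the `/metrics` URL format are assumptions, not the actual code at lines 65-80 of `tests/utils/payload_builder.py`.

```python
# Hypothetical sketch of a port-aware metrics payload helper; not the real implementation.
from dataclasses import dataclass


@dataclass
class MetricPayload:
    """Minimal stand-in for whatever payload object the test harness consumes."""
    url: str


def metric_payload_default(port: int = 8081) -> MetricPayload:
    # Default port 8081 matches the prefill worker; decode passes 8082 explicitly.
    return MetricPayload(url=f"http://localhost:{port}/metrics")


# Dual-port usage as described in the review:
prefill_metrics = metric_payload_default(port=8081)
decode_metrics = metric_payload_default(port=8082)
```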
tests/serve/test_trtllm.py (1)

51-64: LGTM! Test configuration correctly supports dual metrics validation.

The configuration changes properly reflect the new disaggregated same-GPU workflow:

- Script reference updated to `disagg_same_gpu.sh`
- GPU requirements reduced from `gpu_2` to `gpu_1` (appropriate for single-GPU testing)
- Dual metric payloads added for ports 8081 (prefill) and 8082 (decode) to validate metrics from both workers
components/backends/trtllm/launch/disagg_same_gpu.sh (2)
37-42: LGTM! Cleanup logic is correct.

The cleanup function properly handles both background processes (DYNAMO_PID and PREFILL_PID). The decode worker runs in the foreground (no `&` on line 73), so it doesn't require explicit cleanup in the trap; the script will naturally terminate when the decode worker exits.
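A minimal sketch of the trap pattern being described is below; the PID variable names come from the review, but the actual cleanup body in the script may do more.

```bash
#!/bin/bash
# Sketch: clean up background workers on exit; the decode worker runs in the foreground.
cleanup() {
    # Kill background processes if they are still running; ignore errors if already gone.
    kill "$DYNAMO_PID" 2>/dev/null || true
    kill "$PREFILL_PID" 2>/dev/null || true
}
trap cleanup EXIT INT TERM
```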
50-73: LGTM! Dual worker setup with metrics support is well-structured.

The script correctly:

- Launches both workers on the same GPU with separate system ports (8081 for prefill, 8082 for decode)
- Enables metrics via `--publish-events-and-metrics` on both workers
- Runs decode in foreground to keep the script alive
- Uses appropriate environment variables for coordination

This aligns perfectly with the PR objectives for prefill metrics support in disaggregated mode.
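A rough sketch of that launch shape is below. Aside from the `--publish-events-and-metrics` flag and the 8081/8082 ports taken from the review, the module path, the other flags, and the `DYN_SYSTEM_PORT` variable are assumptions about how the real script is wired.

```bash
# Hypothetical shape of the dual-worker launch; the real disagg_same_gpu.sh may differ.
# Prefill worker in the background, exposing metrics on port 8081.
DYN_SYSTEM_PORT=8081 python3 -m dynamo.trtllm \
    --extra-engine-args "$PREFILL_ENGINE_ARGS" \
    --disaggregation-mode prefill \
    --publish-events-and-metrics &
PREFILL_PID=$!

# Decode worker in the foreground keeps the script alive; metrics on port 8082.
DYN_SYSTEM_PORT=8082 python3 -m dynamo.trtllm \
    --extra-engine-args "$DECODE_ENGINE_ARGS" \
    --disaggregation-mode decode \
    --publish-events-and-metrics
```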
```bash
FREE_GPU_GB=$(python3 -c "import torch; print(torch.cuda.mem_get_info()[0]/1024**3)" 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "Error: Failed to check GPU memory. Is PyTorch with CUDA available?"
    exit 1
fi

REQUIRED_GB=20
if (( $(echo "$FREE_GPU_GB < $REQUIRED_GB" | bc -l) )); then
    echo "Error: Insufficient GPU memory. Required: ${REQUIRED_GB}GB, Available: ${FREE_GPU_GB}GB"
    echo "Please free up GPU memory before running disaggregated mode on single GPU."
    exit 1
fi
```
🧩 Analysis chain
Verify dependencies are available in test environment.
The GPU memory check requires:
- PyTorch with CUDA support (line 9)
- The `bc` utility for floating-point comparison (line 16)
🏁 Script executed:
```bash
#!/bin/bash
# Verify PyTorch CUDA and bc are available in the test container
echo "Checking PyTorch CUDA availability..."
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" 2>&1
echo -e "\nChecking bc utility..."
which bc
bc --version
```

Length of output: 265
Add error handling for missing bc utility.
Verification confirmed that bc is not available in the test environment. The code has error handling when PyTorch fails (lines 10–11) but lacks error handling for the bc floating-point comparison on line 16. If bc is unavailable at runtime, the script will fail with an unclear error.
Either add a check for bc availability (similar to the PyTorch check on lines 10–11) or use bash arithmetic instead:
```bash
# Alternative using bash arithmetic (avoids the bc dependency).
# Bash arithmetic is integer-only, so drop the fractional part before comparing.
FREE_GPU_GB_INT=${FREE_GPU_GB%.*}
if (( FREE_GPU_GB_INT < REQUIRED_GB )); then
```

🤖 Prompt for AI Agents
In components/backends/trtllm/launch/disagg_same_gpu.sh around lines 9 to 20,
the script uses bc for the floating-point comparison but does not check that bc
is installed, which causes an unclear failure if bc is missing; either add a
pre-check for bc (exit with a clear error if not found) before performing the
comparison, or avoid the bc dependency by converting/freezing to an integer
comparison or using bash arithmetic (e.g., scale-free integer MB comparison or
compare rounded GB values) so the script reliably determines if FREE_GPU_GB <
REQUIRED_GB and exits with a clear error message when insufficient memory.
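Putting that suggestion together, one bc-free variant of the check might look like the sketch below; the variable names mirror the script above, and the choice to compare in whole megabytes is an assumption about how the author would round.

```bash
# Sketch: compare free GPU memory in whole megabytes to avoid the bc dependency.
FREE_GPU_MB=$(python3 -c "import torch; print(int(torch.cuda.mem_get_info()[0] / 1024**2))" 2>/dev/null)
if [ -z "$FREE_GPU_MB" ]; then
    echo "Error: Failed to check GPU memory. Is PyTorch with CUDA available?"
    exit 1
fi

REQUIRED_MB=$((20 * 1024))  # 20 GB expressed in MB
if (( FREE_GPU_MB < REQUIRED_MB )); then
    echo "Error: Insufficient GPU memory. Required: ${REQUIRED_MB}MB, Available: ${FREE_GPU_MB}MB"
    exit 1
fi
```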
- Add aggregated mode config with 0.24 GPU memory fraction
- Add prefill/decode configs for disaggregated mode on shared GPU
- All configs use FP16, chunked prefill, and max_batch_size=4

Signed-off-by: Keiven Chang <[email protected]>
- Restore regular disaggregated test from main (requires 2 GPUs, no metrics)
- Keep disaggregated_same_gpu test (requires 1 GPU, with metrics validation)

Signed-off-by: Keiven Chang <[email protected]>
Overview:
Adds prefill metrics support for TensorRT-LLM disaggregated mode. Previously, only agg metrics were validated; now both prefill and decode workers expose metrics on separate ports (8081 and 8082).
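For example, once both workers are running, each worker's metrics can be spot-checked on its own port. This is a hypothetical check; it assumes a Prometheus-style `/metrics` route on localhost.

```bash
# Hypothetical spot-check of both workers' metrics endpoints.
curl -s http://localhost:8081/metrics | head   # prefill worker
curl -s http://localhost:8082/metrics | head   # decode worker
```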
Details:
- Updated `main.py` to preserve `cache_transceiver_config` from YAML when adding `event_buffer_max_size`, ensuring disaggregated mode coordination works correctly
- Added `disagg_same_gpu.sh` for single-GPU disaggregated testing; both workers launch with the `--publish-events-and-metrics` flag
- Updated `test_trtllm.py` to validate metrics from both ports (8081 for prefill, 8082 for decode)
- Extended `metric_payload_default()` to accept a port parameter for multi-port metrics testing

Where should the reviewer start?
- `components/src/dynamo/trtllm/main.py` - KV cache config preservation logic
- `components/backends/trtllm/launch/disagg_same_gpu.sh` - new single-GPU disaggregated launch script with metrics
- `tests/serve/test_trtllm.py` - disaggregated_same_gpu test configuration with dual metrics validation

Related Issues:
DIS-911
/coderabbit profile chill
Summary by CodeRabbit
New Features
Infrastructure