fix: add prefill metrics support for TensorRT-LLM disaggregated mode #3983
Conversation
- Fix KV cache config to preserve cache_transceiver_config from YAML
- Add disagg_same_gpu.sh with metrics enabled on ports 8081/8082
- Add test validation for prefill (8081) and decode (8082) metrics
- Add port parameter to metric_payload_default for multi-port testing

Signed-off-by: Keiven Chang <[email protected]>
Walkthrough

The changes introduce a new disaggregated inference workflow script for single-GPU deployments, update event and metrics handling in the runtime configuration, rename test configurations to reflect the new script, and extend the metrics payload builder with a configurable port parameter.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks

❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/src/dynamo/trtllm/main.py (1)
244-256: Clarify the asymmetric cache_transceiver_config handling.

The comment at line 244 states "preserving cache_transceiver_config from YAML," but this only applies when `current_kv_config` is a dict (lines 252-256). When it's a `KvCacheConfig` object (lines 246-251), the code explicitly creates a new dict with only two fields, discarding any `cache_transceiver_config` that may have been configured. This asymmetry could confuse future maintainers. Consider either:

- Updating the line 244 comment to clarify this conditional preservation
- Documenting why `cache_transceiver_config` should not be preserved from `KvCacheConfig` objects

```diff
- # Add it to kv_cache_config while preserving cache_transceiver_config from YAML
+ # Add event_buffer_max_size to kv_cache_config
+ # Note: cache_transceiver_config is preserved only when loading from YAML (dict case)
  current_kv_config = arg_map["kv_cache_config"]
```
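For context, here is a minimal sketch of the conditional the comment describes. It is not the actual `main.py` code: the stub `KvCacheConfig`, the example `cache_transceiver_config` value, and the assumption that the two preserved fields are `free_gpu_memory_fraction` and `event_buffer_max_size` are all illustrative.

```python
# Sketch of the asymmetry described above; arg_map, event_buffer_max_size, and
# KvCacheConfig are stand-ins for the real objects in main.py.
from dataclasses import dataclass


@dataclass
class KvCacheConfig:
    free_gpu_memory_fraction: float = 0.9


event_buffer_max_size = 1024
arg_map = {"kv_cache_config": {"cache_transceiver_config": {"backend": "DEFAULT"}}}

current_kv_config = arg_map["kv_cache_config"]
if isinstance(current_kv_config, dict):
    # Dict (YAML) case: mutate in place, so cache_transceiver_config survives.
    current_kv_config["event_buffer_max_size"] = event_buffer_max_size
else:
    # KvCacheConfig object case: a fresh dict keeps only two fields, so any
    # cache_transceiver_config configured on the object is dropped.
    arg_map["kv_cache_config"] = {
        "free_gpu_memory_fraction": current_kv_config.free_gpu_memory_fraction,
        "event_buffer_max_size": event_buffer_max_size,
    }
```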
components/backends/trtllm/launch/disagg_same_gpu.sh (1)

29-30: Document engine config path divergence from disagg.sh.

This script uses `trtllm-small` engine configs whereas `disagg.sh` uses `trtllm` configs. This difference is intentional for single-GPU testing (smaller memory footprint), but consider adding a comment to prevent confusion.

```diff
+# Use trtllm-small configs for reduced memory footprint on single GPU
 export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/prefill.yaml"}
 export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/decode.yaml"}
```
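Because the script reads these paths with `${VAR:-default}` expansion, callers can also point at different engine configs without editing the file. For example (the custom paths below are illustrative):

```bash
# Override the default trtllm-small configs at launch time.
PREFILL_ENGINE_ARGS=/path/to/custom/prefill.yaml \
DECODE_ENGINE_ARGS=/path/to/custom/decode.yaml \
./components/backends/trtllm/launch/disagg_same_gpu.sh
```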
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `components/backends/trtllm/launch/disagg.sh` (1 hunks)
- `components/backends/trtllm/launch/disagg_same_gpu.sh` (1 hunks)
- `components/src/dynamo/trtllm/main.py` (1 hunks)
- `tests/serve/test_trtllm.py` (1 hunks)
- `tests/utils/payload_builder.py` (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-08-25T23:24:42.076Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.
Applied to files:
- tests/utils/payload_builder.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.
Applied to files:
- tests/utils/payload_builder.py
📚 Learning: 2025-07-31T11:26:48.422Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
Applied to files:
- tests/serve/test_trtllm.py
🧬 Code graph analysis (2)
components/backends/trtllm/launch/disagg_same_gpu.sh (1)
components/backends/trtllm/launch/disagg.sh (1)
cleanup(19-24)
tests/serve/test_trtllm.py (1)
tests/utils/payload_builder.py (1)
metric_payload_default(65-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/utils/payload_builder.py (1)
65-80: LGTM! Clean extension for multi-port metrics testing.

The addition of the `port` parameter with a sensible default (8081) enables the test infrastructure to validate metrics from both prefill and decode workers on separate ports, as required by the disaggregated mode enhancements.
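As a rough illustration of the shape of that extension, here is a hypothetical sketch; the `MetricPayload` stand-in and the `/metrics` URL format are assumptions, not the actual code at lines 65-80 of `tests/utils/payload_builder.py`.

```python
# Hypothetical sketch of a port-aware metrics payload helper; not the real implementation.
from dataclasses import dataclass


@dataclass
class MetricPayload:
    """Minimal stand-in for whatever payload object the test harness consumes."""
    url: str


def metric_payload_default(port: int = 8081) -> MetricPayload:
    # Default port 8081 matches the prefill worker; decode passes 8082 explicitly.
    return MetricPayload(url=f"http://localhost:{port}/metrics")


# Dual-port usage as described in the review:
prefill_metrics = metric_payload_default(port=8081)
decode_metrics = metric_payload_default(port=8082)
```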
tests/serve/test_trtllm.py (1)

51-64: LGTM! Test configuration correctly supports dual metrics validation.

The configuration changes properly reflect the new disaggregated same-GPU workflow:

- Script reference updated to `disagg_same_gpu.sh`
- GPU requirements reduced from `gpu_2` to `gpu_1` (appropriate for single-GPU testing)
- Dual metric payloads added for ports 8081 (prefill) and 8082 (decode) to validate metrics from both workers
components/backends/trtllm/launch/disagg_same_gpu.sh (2)
37-42: LGTM! Cleanup logic is correct.

The cleanup function properly handles both background processes (DYNAMO_PID and PREFILL_PID). The decode worker runs in the foreground (no `&` on line 73), so it doesn't require explicit cleanup in the trap; the script will naturally terminate when the decode worker exits.
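A minimal sketch of the trap pattern being described is below; the PID variable names come from the review, but the actual cleanup body in the script may do more.

```bash
#!/bin/bash
# Sketch: clean up background workers on exit; the decode worker runs in the foreground.
cleanup() {
    # Kill background processes if they are still running; ignore errors if already gone.
    kill "$DYNAMO_PID" 2>/dev/null || true
    kill "$PREFILL_PID" 2>/dev/null || true
}
trap cleanup EXIT INT TERM
```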
50-73: LGTM! Dual worker setup with metrics support is well-structured.

The script correctly:

- Launches both workers on the same GPU with separate system ports (8081 for prefill, 8082 for decode)
- Enables metrics via `--publish-events-and-metrics` on both workers
- Runs decode in foreground to keep the script alive
- Uses appropriate environment variables for coordination

This aligns perfectly with the PR objectives for prefill metrics support in disaggregated mode.
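A rough sketch of that launch shape is below. Aside from the `--publish-events-and-metrics` flag and the 8081/8082 ports taken from the review, the module path, the other flags, and the `DYN_SYSTEM_PORT` variable are assumptions about how the real script is wired.

```bash
# Hypothetical shape of the dual-worker launch; the real disagg_same_gpu.sh may differ.
# Prefill worker in the background, exposing metrics on port 8081.
DYN_SYSTEM_PORT=8081 python3 -m dynamo.trtllm \
    --extra-engine-args "$PREFILL_ENGINE_ARGS" \
    --disaggregation-mode prefill \
    --publish-events-and-metrics &
PREFILL_PID=$!

# Decode worker in the foreground keeps the script alive; metrics on port 8082.
DYN_SYSTEM_PORT=8082 python3 -m dynamo.trtllm \
    --extra-engine-args "$DECODE_ENGINE_ARGS" \
    --disaggregation-mode decode \
    --publish-events-and-metrics
```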
```bash
FREE_GPU_GB=$(python3 -c "import torch; print(torch.cuda.mem_get_info()[0]/1024**3)" 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "Error: Failed to check GPU memory. Is PyTorch with CUDA available?"
    exit 1
fi

REQUIRED_GB=20
if (( $(echo "$FREE_GPU_GB < $REQUIRED_GB" | bc -l) )); then
    echo "Error: Insufficient GPU memory. Required: ${REQUIRED_GB}GB, Available: ${FREE_GPU_GB}GB"
    echo "Please free up GPU memory before running disaggregated mode on single GPU."
    exit 1
fi
```
🧩 Analysis chain
Verify dependencies are available in test environment.
The GPU memory check requires:
- PyTorch with CUDA support (line 9)
- The `bc` utility for floating-point comparison (line 16)
🏁 Script executed:
```bash
#!/bin/bash
# Verify PyTorch CUDA and bc are available in the test container
echo "Checking PyTorch CUDA availability..."
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" 2>&1
echo -e "\nChecking bc utility..."
which bc
bc --version
```

Length of output: 265
Add error handling for missing bc utility.
Verification confirmed that bc is not available in the test environment. The code has error handling when PyTorch fails (lines 10–11) but lacks error handling for the bc floating-point comparison on line 16. If bc is unavailable at runtime, the script will fail with an unclear error.
Either add a check for bc availability (similar to the PyTorch check on lines 10–11) or use bash arithmetic instead:
```bash
# Alternative using bash arithmetic (avoids the bc dependency).
# Bash arithmetic is integer-only, so drop the fractional part before comparing.
FREE_GPU_GB_INT=${FREE_GPU_GB%.*}
if (( FREE_GPU_GB_INT < REQUIRED_GB )); then
```

🤖 Prompt for AI Agents
In components/backends/trtllm/launch/disagg_same_gpu.sh around lines 9 to 20,
the script uses bc for the floating-point comparison but does not check that bc
is installed, which causes an unclear failure if bc is missing; either add a
pre-check for bc (exit with a clear error if not found) before performing the
comparison, or avoid the bc dependency by converting/freezing to an integer
comparison or using bash arithmetic (e.g., scale-free integer MB comparison or
compare rounded GB values) so the script reliably determines if FREE_GPU_GB <
REQUIRED_GB and exits with a clear error message when insufficient memory.
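Putting that suggestion together, one bc-free variant of the check might look like the sketch below; the variable names mirror the script above, and the choice to compare in whole megabytes is an assumption about how the author would round.

```bash
# Sketch: compare free GPU memory in whole megabytes to avoid the bc dependency.
FREE_GPU_MB=$(python3 -c "import torch; print(int(torch.cuda.mem_get_info()[0] / 1024**2))" 2>/dev/null)
if [ -z "$FREE_GPU_MB" ]; then
    echo "Error: Failed to check GPU memory. Is PyTorch with CUDA available?"
    exit 1
fi

REQUIRED_MB=$((20 * 1024))  # 20 GB expressed in MB
if (( FREE_GPU_MB < REQUIRED_MB )); then
    echo "Error: Insufficient GPU memory. Required: ${REQUIRED_MB}MB, Available: ${FREE_GPU_MB}MB"
    exit 1
fi
```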
- Add aggregated mode config with 0.24 GPU memory fraction
- Add prefill/decode configs for disaggregated mode on shared GPU
- All configs use FP16, chunked prefill, and max_batch_size=4

Signed-off-by: Keiven Chang <[email protected]>
- Restore regular disaggregated test from main (requires 2 GPUs, no metrics)
- Keep disaggregated_same_gpu test (requires 1 GPU, with metrics validation)

Signed-off-by: Keiven Chang <[email protected]>
Overview:
Adds prefill metrics support for TensorRT-LLM disaggregated mode. Previously, only agg metrics were validated; now both prefill and decode workers expose metrics on separate ports (8081 and 8082).
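For example, once both workers are running, each worker's metrics can be spot-checked on its own port. This is a hypothetical check; it assumes a Prometheus-style `/metrics` route on localhost.

```bash
# Hypothetical spot-check of both workers' metrics endpoints.
curl -s http://localhost:8081/metrics | head   # prefill worker
curl -s http://localhost:8082/metrics | head   # decode worker
```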
Details:
- Updated `main.py` to preserve `cache_transceiver_config` from YAML when adding `event_buffer_max_size`, ensuring disaggregated mode coordination works correctly
- Added `disagg_same_gpu.sh` for single-GPU disaggregated testing; both workers launch with the `--publish-events-and-metrics` flag
- Updated `test_trtllm.py` to validate metrics from both ports (8081 for prefill, 8082 for decode)
- Extended `metric_payload_default()` to accept a port parameter for multi-port metrics testing

Where should the reviewer start?
- `components/src/dynamo/trtllm/main.py` - KV cache config preservation logic
- `components/backends/trtllm/launch/disagg_same_gpu.sh` - new single-GPU disaggregated launch script with metrics
- `tests/serve/test_trtllm.py` - disaggregated_same_gpu test configuration with dual metrics validation

Related Issues:
DIS-911
/coderabbit profile chill
Summary by CodeRabbit
New Features
Infrastructure