
Conversation

@keivenchang keivenchang commented Oct 30, 2025

Overview:

Adds prefill metrics support for TensorRT-LLM disaggregated mode. Previously, only aggregated-mode metrics were validated; now both prefill and decode workers expose metrics on separate ports (8081 for prefill, 8082 for decode).

Details:

  • KV Cache Config Fix: Modified main.py to preserve cache_transceiver_config from YAML when adding event_buffer_max_size, ensuring disaggregated mode coordination works correctly (see the sketch after this list)
  • New Launch Script: Added disagg_same_gpu.sh for single-GPU disaggregated testing
  • Dual Metrics Support: Both prefill and decode workers now publish metrics with --publish-events-and-metrics flag
  • Test Enhancement: Updated test_trtllm.py to validate metrics from both ports (8081 for prefill, 8082 for decode)
  • Test Infrastructure: Extended metric_payload_default() to accept port parameter for multi-port metrics testing
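
A minimal sketch of the preservation logic described in the KV Cache Config Fix bullet above, assuming the config can arrive either as a YAML-loaded dict or as a TensorRT-LLM KvCacheConfig object. The helper name, arg_map shape, and default buffer size are illustrative, not copied from main.py:

from tensorrt_llm.llmapi import KvCacheConfig  # import path assumed

def ensure_event_buffer(arg_map: dict, buffer_size: int = 1024) -> None:
    """Populate event_buffer_max_size without dropping sibling keys."""
    current = arg_map.get("kv_cache_config")
    if isinstance(current, dict):
        # Dict form (loaded from YAML): update in place so extra keys such as
        # cache_transceiver_config survive untouched.
        current.setdefault("event_buffer_max_size", buffer_size)
    elif isinstance(current, KvCacheConfig):
        # Object form: set the attribute directly rather than rebuilding a
        # dict, which would silently discard fields not explicitly copied.
        if not getattr(current, "event_buffer_max_size", None):
            current.event_buffer_max_size = buffer_size
    else:
        arg_map["kv_cache_config"] = {"event_buffer_max_size": buffer_size}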

Where should the reviewer start?

  • components/src/dynamo/trtllm/main.py - KV cache config preservation logic
  • components/backends/trtllm/launch/disagg_same_gpu.sh - new single-GPU disaggregated launch script with metrics
  • tests/serve/test_trtllm.py - disaggregated_same_gpu test configuration with dual metrics validation

Related Issues:

DIS-911

/coderabbit profile chill

Summary by CodeRabbit

  • New Features

    • Disaggregated inference now supports single-GPU execution with integrated GPU memory validation to ensure sufficient resources are available before inference begins.
  • Infrastructure

    • Enhanced the metrics collection system to support multiple backend metric endpoints for improved visibility into system performance.

- Fix KV cache config to preserve cache_transceiver_config from YAML
- Add disagg_same_gpu.sh with metrics enabled on ports 8081/8082
- Add test validation for prefill (8081) and decode (8082) metrics
- Add port parameter to metric_payload_default for multi-port testing

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang requested review from a team as code owners October 30, 2025 02:45
@keivenchang keivenchang self-assigned this Oct 30, 2025
@github-actions github-actions bot added the fix label Oct 30, 2025

coderabbitai bot commented Oct 30, 2025

Walkthrough

The changes introduce a new disaggregated inference workflow script for single-GPU deployments, update event and metrics handling in the runtime configuration, rename test configurations to reflect the new script, and extend the metrics payload builder with a configurable port parameter.

Changes

  • Single-GPU disaggregated inference (components/backends/trtllm/launch/disagg.sh, components/backends/trtllm/launch/disagg_same_gpu.sh): New shell script implements disaggregated prefill and decode workers on a single GPU with GPU memory validation, signal handlers, and inter-process coordination via separate CUDA devices and ports. The existing disagg.sh receives a formatting adjustment for its end-of-file marker.
  • Event and metrics configuration (components/src/dynamo/trtllm/main.py): KvCacheConfig handling is modified to ensure event_buffer_max_size is populated in both object and dict forms when publishing events and metrics; cache_transceiver_config preservation depends on the config form.
  • Test configuration and payload utilities (tests/serve/test_trtllm.py, tests/utils/payload_builder.py): The test configuration is renamed from disaggregated to disaggregated_same_gpu and updated to use the new script with a single GPU and dual metric payload ports (8081, 8082). The signature of metric_payload_default is extended with a configurable port parameter (default 8081).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • components/backends/trtllm/launch/disagg_same_gpu.sh — Complex shell script with multiple coordinated background processes, GPU memory validation, and signal handling; requires verification of GPU sharing semantics and process lifecycle management.
  • components/src/dynamo/trtllm/main.py — Logic change in KvCacheConfig handling; verify correctness of conditional branches and preservation of cache_transceiver_config across both code paths.
  • tests/serve/test_trtllm.py — Validate that metric payload port assignments (8081, 8082) align with backend expectations and don't conflict with existing infrastructure.

Poem

🐰 A rabbit hops through GPU dreams,
Prefill and decode split at the seams,
One card, two workers, metrics aligned,
Disaggregated workflows, refined and refined! ✨
With ports and payloads, coordination's the key,
Single-GPU inference—efficient as can be!

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The pull request title "fix: add prefill metrics support for TensorRT-LLM disaggregated mode" directly and accurately summarizes the main objective of the changeset. It clearly identifies the specific feature being added (prefill metrics support), specifies the target system (TensorRT-LLM disaggregated mode), and uses appropriate labeling (fix). The title is concise, specific, and aligns well with the primary purpose described in the PR objectives and the changes across all affected files, including the new launch script, main.py modifications, and test enhancements.
  • Description Check: ✅ Passed. The pull request description comprehensively follows the repository template and includes all required sections: Overview (describing the prefill metrics support addition), Details (organized with specific bullet points for each change), Where should the reviewer start? (with direct file recommendations), and Related Issues (referencing DIS-911). The description provides substantive information beyond template minimums, including specific technical details about KV cache config preservation, the new launch script, dual metrics ports, and test infrastructure changes, making it clear and actionable for reviewers.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
components/src/dynamo/trtllm/main.py (1)

244-256: Clarify the asymmetric cache_transceiver_config handling.

The comment at line 244 states "preserving cache_transceiver_config from YAML," but this only applies when current_kv_config is a dict (lines 252-256). When it's a KvCacheConfig object (lines 246-251), the code explicitly creates a new dict with only two fields, discarding any cache_transceiver_config that may have been configured.

This asymmetry could confuse future maintainers. Consider either:

  1. Updating the line 244 comment to clarify this conditional preservation
  2. Documenting why cache_transceiver_config should not be preserved from KvCacheConfig objects
-        # Add it to kv_cache_config while preserving cache_transceiver_config from YAML
+        # Add event_buffer_max_size to kv_cache_config
+        # Note: cache_transceiver_config is preserved only when loading from YAML (dict case)
         current_kv_config = arg_map["kv_cache_config"]
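
Beyond the comment fix, one way to make the object branch symmetric would be to carry any configured transceiver settings forward when rebuilding the dict. This is a sketch only; the attribute and key names are assumed from the discussion above, not verified against main.py:

# Sketch: when converting the KvCacheConfig object to dict form, preserve
# cache_transceiver_config instead of dropping it (names assumed).
new_kv_config = {"event_buffer_max_size": event_buffer_max_size}
transceiver = getattr(current_kv_config, "cache_transceiver_config", None)
if transceiver is not None:
    new_kv_config["cache_transceiver_config"] = transceiver
arg_map["kv_cache_config"] = new_kv_config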
components/backends/trtllm/launch/disagg_same_gpu.sh (1)

29-30: Document engine config path divergence from disagg.sh.

This script uses trtllm-small engine configs whereas disagg.sh uses trtllm configs. This difference is intentional for single-GPU testing (smaller memory footprint), but consider adding a comment to prevent confusion.

+# Use trtllm-small configs for reduced memory footprint on single GPU
 export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/prefill.yaml"}
 export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"$DYNAMO_HOME/recipes/qwen3/trtllm-small/decode.yaml"}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6bec8b3 and c66e824.

⛔ Files ignored due to path filters (1)
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • components/backends/trtllm/launch/disagg.sh (1 hunks)
  • components/backends/trtllm/launch/disagg_same_gpu.sh (1 hunks)
  • components/src/dynamo/trtllm/main.py (1 hunks)
  • tests/serve/test_trtllm.py (1 hunks)
  • tests/utils/payload_builder.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-08-25T23:24:42.076Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.

Applied to files:

  • tests/utils/payload_builder.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.

Applied to files:

  • tests/utils/payload_builder.py
📚 Learning: 2025-07-31T11:26:48.422Z
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

Applied to files:

  • tests/serve/test_trtllm.py
🧬 Code graph analysis (2)
components/backends/trtllm/launch/disagg_same_gpu.sh (1)
components/backends/trtllm/launch/disagg.sh (1)
  • cleanup (19-24)
tests/serve/test_trtllm.py (1)
tests/utils/payload_builder.py (1)
  • metric_payload_default (65-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/utils/payload_builder.py (1)

65-80: LGTM! Clean extension for multi-port metrics testing.

The addition of the port parameter with a sensible default (8081) enables the test infrastructure to validate metrics from both prefill and decode workers on separate ports, as required by the disaggregated mode enhancements.
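
For reference, the extended helper plausibly takes a shape like the following. Only the port parameter and its 8081 default are confirmed by this review; the return structure and field names here are illustrative:

def metric_payload_default(port: int = 8081) -> dict:
    # Hypothetical payload shape: the endpoint layout and expected_status
    # field are assumptions, not copied from tests/utils/payload_builder.py.
    return {
        "endpoint": f"http://localhost:{port}/metrics",
        "expected_status": 200,
    }

Callers targeting the decode worker would then request metric_payload_default(port=8082).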

tests/serve/test_trtllm.py (1)

51-64: LGTM! Test configuration correctly supports dual metrics validation.

The configuration changes properly reflect the new disaggregated same-GPU workflow:

  • Script reference updated to disagg_same_gpu.sh
  • GPU requirements reduced from gpu_2 to gpu_1 (appropriate for single-GPU testing)
  • Dual metric payloads added for ports 8081 (prefill) and 8082 (decode) to validate metrics from both workers
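
Concretely, the dual-payload wiring might look like this in the test configuration. The structure and dict keys are assumed; only the script name, GPU count, ports, and helper signature are confirmed above:

from tests.utils.payload_builder import metric_payload_default  # path per review

TEST_CONFIGS = {
    "disaggregated_same_gpu": {
        "script": "disagg_same_gpu.sh",  # new single-GPU launch script
        "gpus": 1,                       # reduced from gpu_2 to gpu_1
        "payloads": [
            metric_payload_default(port=8081),  # prefill worker metrics
            metric_payload_default(port=8082),  # decode worker metrics
        ],
    },
}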
components/backends/trtllm/launch/disagg_same_gpu.sh (2)

37-42: LGTM! Cleanup logic is correct.

The cleanup function properly handles both background processes (DYNAMO_PID and PREFILL_PID). The decode worker runs in the foreground (no & on line 73), so it doesn't require explicit cleanup in the trap—the script will naturally terminate when the decode worker exits.


50-73: LGTM! Dual worker setup with metrics support is well-structured.

The script correctly:

  • Launches both workers on the same GPU with separate system ports (8081 for prefill, 8082 for decode)
  • Enables metrics via --publish-events-and-metrics on both workers
  • Runs decode in foreground to keep the script alive
  • Uses appropriate environment variables for coordination

This aligns perfectly with the PR objectives for prefill metrics support in disaggregated mode.

Comment on lines +9 to +20
FREE_GPU_GB=$(python3 -c "import torch; print(torch.cuda.mem_get_info()[0]/1024**3)" 2>/dev/null)
if [ $? -ne 0 ]; then
    echo "Error: Failed to check GPU memory. Is PyTorch with CUDA available?"
    exit 1
fi

REQUIRED_GB=20
if (( $(echo "$FREE_GPU_GB < $REQUIRED_GB" | bc -l) )); then
    echo "Error: Insufficient GPU memory. Required: ${REQUIRED_GB}GB, Available: ${FREE_GPU_GB}GB"
    echo "Please free up GPU memory before running disaggregated mode on single GPU."
    exit 1
fi

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify dependencies are available in test environment.

The GPU memory check requires:

  1. PyTorch with CUDA support (line 9)
  2. The bc utility for floating-point comparison (line 16)

🏁 Script executed:

#!/bin/bash
# Verify PyTorch CUDA and bc are available in the test container

echo "Checking PyTorch CUDA availability..."
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" 2>&1

echo -e "\nChecking bc utility..."
which bc
bc --version

Length of output: 265


Add error handling for missing bc utility.

Verification confirmed that bc is not available in the test environment. The code has error handling when PyTorch fails (lines 10–11) but lacks error handling for the bc floating-point comparison on line 16. If bc is unavailable at runtime, the script will fail with an unclear error.

Either add a check for bc availability (similar to the PyTorch check on lines 10–11) or use bash arithmetic instead:

# Alternative using bash arithmetic (avoids the bc dependency); bash integer
# arithmetic cannot compare floats, so truncate the fractional part first
FREE_GPU_GB_INT=${FREE_GPU_GB%.*}
if (( FREE_GPU_GB_INT < REQUIRED_GB )); then
🤖 Prompt for AI Agents
In components/backends/trtllm/launch/disagg_same_gpu.sh around lines 9 to 20,
the script uses bc for the floating-point comparison but does not check that bc
is installed, which causes an unclear failure if bc is missing; either add a
pre-check for bc (exit with a clear error if not found) before performing the
comparison, or avoid the bc dependency by converting/freezing to an integer
comparison or using bash arithmetic (e.g., scale-free integer MB comparison or
compare rounded GB values) so the script reliably determines if FREE_GPU_GB <
REQUIRED_GB and exits with a clear error message when insufficient memory.

- Add aggregated mode config with 0.24 GPU memory fraction
- Add prefill/decode configs for disaggregated mode on shared GPU
- All configs use FP16, chunked prefill, and max_batch_size=4

Signed-off-by: Keiven Chang <[email protected]>
- Restore regular disaggregated test from main (requires 2 GPUs, no metrics)
- Keep disaggregated_same_gpu test (requires 1 GPU, with metrics validation)

Signed-off-by: Keiven Chang <[email protected]>