test: Add KV transfer cancellation test on TRT-LLM #4547
Conversation
Force-pushed from db0f1e8 to 6ac1065 (compare)
Signed-off-by: Jacky <[email protected]>
Force-pushed from 6ac1065 to b34f4d9 (compare)
Walkthrough

A new end-to-end test was added to validate request cancellation behavior during the KV transfer phase in the TensorRT-LLM workflow. The test orchestrates prefill and decode workers, triggers a cancellable request, cancels during KV transfer, and verifies proper logging, cleanup messages, and continued worker functionality.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)
443-455: Good defensive verification of worker health.

Verifying that workers remain functional after KV transfer cancellation is a good practice, especially given the complexity of this cancellation scenario. The test appropriately confirms the decode worker can handle subsequent requests.
Optionally, consider also verifying the prefill worker remains functional by checking for its "Prefill Request ID" log:
```python
# Verify prefill worker is also functional
_, prefill_log_offset = poll_for_pattern(
    process=prefill_worker,
    pattern="Prefill Request ID: ",
    log_offset=prefill_log_offset,
    match_type="contains",
)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py:106-147
Timestamp: 2025-09-25T00:49:16.914Z
Learning: In TensorRT-LLM cancellation implementation, there are two distinct request IDs: the external/user-facing request_id from the incoming request, and the internal_request_id from TRT-LLM's generation_result that's needed for executor.abort_request() calls.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
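The learning above hinges on `poll_for_pattern`'s timeout semantics: a tight `max_wait_ms` default so that slow cancellation propagation fails the test rather than silently passing. A minimal self-contained sketch of such a helper (hypothetical signature over a log-reading callable, not the actual `tests/fault_tolerance/cancellation/utils.py` implementation, which polls a process) might look like:

```python
import time


def poll_for_pattern(read_log, pattern, log_offset=0, max_wait_ms=500, interval_ms=10):
    """Poll the text returned by read_log() for pattern, starting at log_offset.

    Returns (matched_line, new_offset). Raises TimeoutError if the pattern does
    not appear within max_wait_ms, mirroring the intent described above where
    slow cancellation propagation should fail the test.
    """
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while True:
        text = read_log()[log_offset:]
        idx = text.find(pattern)
        if idx != -1:
            # Expand the match to the enclosing log line.
            line_end = text.find("\n", idx)
            if line_end == -1:
                line_end = len(text)
            line_start = text.rfind("\n", 0, idx) + 1
            return text[line_start:line_end], log_offset + line_end
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pattern {pattern!r} not found within {max_wait_ms}ms")
        time.sleep(interval_ms / 1000.0)
```

Returning the new offset lets successive calls resume scanning where the previous match ended, which is why the test threads `log_offset` through each call.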
📚 Learning: 2025-09-25T00:49:16.914Z
Applied to files:
tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-10-03T01:53:15.023Z
Applied to files:
tests/fault_tolerance/cancellation/test_trtllm.py
🪛 Ruff (0.14.5)
tests/fault_tolerance/cancellation/test_trtllm.py
372-372: Unused function argument: runtime_services
(ARG001)
372-372: Unused function argument: predownload_models
(ARG001)
407-407: Unpacked variable prefill_log_offset is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
434-434: Unpacked variable frontend_log_offset is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: vllm (amd64)
🔇 Additional comments (2)
tests/fault_tolerance/cancellation/test_trtllm.py (2)
367-379: LGTM! The test signature, markers, and docstring appropriately describe the test's purpose of verifying cancellation during the KV transfer phase.
381-394: LGTM! The test setup follows the established pattern used in other disaggregated cancellation tests.
…ncel window

Signed-off-by: Jacky <[email protected]>
Force-pushed from b34f4d9 to 3ba9cff (compare)
```python
    log_offset=decode_log_offset,
)

# Verify frontend log has kill message
```
is there any log we can find from the prefill worker to indicate the transfer has stopped / broken?
No, I found the transfer always succeeded.

This makes sense because the cancellation signal propagates into the TRT-LLM engine, but we wait until the engine gracefully exits the generate loop before returning from the request, so the engine can choose to finish receiving the KV cache and then exit the request.
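The behavior described here (the in-flight KV transfer completes even though the request is cancelled, because cancellation only takes effect at the generate-loop boundary) can be sketched with a small asyncio model. This is a hypothetical illustration of the control flow, not the actual TRT-LLM engine code:

```python
import asyncio


async def receive_kv_cache():
    # Stand-in for the KV transfer; in the model described above it is
    # allowed to run to completion even after cancellation is requested.
    await asyncio.sleep(0.01)
    return "kv-received"


async def generate(cancel_event):
    transfer = await receive_kv_cache()  # transfer finishes regardless
    outputs = []
    for step in range(100):
        if cancel_event.is_set():  # cancellation honored only here,
            break                  # at the top of the generate loop
        outputs.append(step)
        await asyncio.sleep(0)
    return transfer, outputs


async def main():
    cancel = asyncio.Event()
    cancel.set()  # cancel lands during the KV transfer window
    return await generate(cancel)


transfer, outputs = asyncio.run(main())
```

Under this model the prefill side never observes a broken transfer, which is consistent with the reviewer's finding that no "transfer stopped" log exists to assert on.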
nnshah1
left a comment
LGTM - but I want to understand whether we can confirm something in the prefill worker log.
Overview:
Add KV transfer cancellation test on TRT-LLM
Details:
Expected behavior:
Where should the reviewer start?
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
resolves #4178
Summary by CodeRabbit