
Conversation


@eopXD eopXD commented Nov 19, 2025

NOTE: This PR is currently based on and depends on #9056. It will be rebased and reorganized after #9056 lands.

Summary by CodeRabbit

  • Bug Fixes

    • Improved synchronization between GPU memory transfer operations and buffer management to prevent race conditions during cache operations.
  • Refactor

    • Enhanced KV cache management with refined block reuse tracking and GPU-to-GPU copy operations for optimized memory transfer performance.
    • Strengthened resource cleanup and synchronization during model warmup initialization.

Description

KVCacheTransferManager::onboard only covers memory movement between
CPU and GPU, not GPU-to-GPU transfers.

Use mBufferManager to copy block contents correctly.
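For context, the underlying CUDA pattern here is an event-ordered, asynchronous device-to-device copy. The sketch below is illustrative only, not the TensorRT-LLM implementation; all names are placeholders:

```cpp
// Illustrative sketch only (not the TensorRT-LLM code): a device-to-device
// block copy that is ordered after a pending write on another stream via a
// CUDA event. All names here are placeholders.
#include <cuda_runtime.h>

void copyBlockDeviceToDevice(void* dstBlock, void const* srcBlock, size_t blockBytes,
    cudaStream_t copyStream, cudaEvent_t pendingWriteDone)
{
    // Do not start the copy until the pending write to the source block has finished.
    cudaStreamWaitEvent(copyStream, pendingWriteDone, 0);
    // Asynchronous device-to-device copy; the host may return before it completes.
    cudaMemcpyAsync(dstBlock, srcBlock, blockBytes, cudaMemcpyDeviceToDevice, copyStream);
}
```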

Test Coverage

This fixes the incorrect partial-block copy behavior exposed by the test case
accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

thorjohnsen and others added 29 commits July 9, 2025 16:23
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
…n/TensorRT-LLM into user/tjohnsen/fix_bug_5375590

.
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>

.
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
@eopXD eopXD requested a review from a team as a code owner November 19, 2025 06:22
@eopXD eopXD requested a review from a team as a code owner November 19, 2025 06:22
@eopXD eopXD requested a review from achartier November 19, 2025 06:22
@eopXD eopXD self-assigned this Nov 19, 2025
@eopXD eopXD added the KV-Cache Management kv-cache management for efficient LLM inference label Nov 19, 2025
@eopXD eopXD marked this pull request as draft November 19, 2025 06:23

coderabbitai bot commented Nov 19, 2025

📝 Walkthrough

This PR enhances KV cache transfer synchronization across WindowBlockManager, BlockManager, and BaseKVCacheManager by introducing explicit sync pathways with the buffer manager, adds GPU-to-GPU copy capabilities, refines pending I/O tracking with separate read/write events, integrates synchronization into resource preparation and warmup, and exposes new Python bindings for sync and refresh operations.

Changes

  • KV Cache Manager Headers (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h): Added a syncTransferManagerWithBufferManager() public method to WindowBlockManager, BlockManager, BaseKVCacheManager (pure virtual), and KVCacheManager (override). Added private counters mCopiedReusedPartialBlocks and mDirectlyReusedPartialBlocks to KVCacheManager for tracking partial block reuse semantics.
  • KV Cache Transfer Manager Headers (cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h): Added syncWithBufferManager(), copyGPUtoGPU(), syncPendingOnboardTransfers(), and recordPendingOnboardTransfers() public methods. Replaced the mPendingOffloads map with separate mPendingReads and mPendingWrites maps keyed by block index for refined I/O tracking.
  • KV Cache Manager Implementation (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Implemented BlockManager::syncTransferManagerWithBufferManager() and WindowBlockManager::syncTransferManagerWithBufferManager() delegation methods. Updated partial block reuse logic to call copyGPUtoGPU() instead of onboard() and track outcomes with the new counters. Enhanced destructor logging with partial reuse counters.
  • KV Cache Transfer Manager Implementation (cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp): Implemented copyGPUtoGPU() with direct device-to-device copies through mBufferManager. Added syncPendingOnboardTransfers() and recordPendingOnboardTransfers() helpers for onboard/offload coordination. Enhanced onboard() and offload() with synchronization and pending event recording. Implemented syncWithBufferManager() for stream alignment and pending transfer cleanup. Updated syncTransfers() to coordinate with the buffer manager.
  • KV Cache Allocation (cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp): Added a kvCacheManager.syncTransferManagerWithBufferManager() call at the start of AllocateKvCache::operator() for initialization synchronization.
  • Python Bindings (cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp): Added a PyKvCacheManager::syncTransferManagerWithBufferManager() override in both the nanobind and pybind implementations. Exposed Python bindings for sync_transfer_manager_with_buffer_manager and refresh_blocks on BaseKVCacheManager.
  • Torch Executor Warmup (tensorrt_llm/_torch/pyexecutor/model_engine.py): Added three new helper functions in warmup: get_cuda_graph_warmup_request(), get_torch_compile_warmup_request(), and get_autotune_warmup_request() for workload-aware warmup generation. Added a release_batch() context manager for per-batch resource cleanup. Updated a TODO comment referencing "star attention".
  • Resource Manager (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Added a self.impl.sync_transfer_manager_with_buffer_manager() call before KV cache allocation in prepare_resources(). Added a self.impl.refresh_blocks() call after context/generation prefilling for kernel readiness.
  • Test Infrastructure (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Removed the pytest skip decorator from test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse, enabling the previously skipped test.
  • Test Waives (tests/integration/test_lists/waives.txt): Updated the test exclusion list with multiple new SKIP entries across various test modules and paths.

Sequence Diagram(s)

sequenceDiagram
    participant AllocateKvCache
    participant KVCacheManager
    participant BlockManager
    participant TransferManager
    participant BufferManager
    
    AllocateKvCache->>KVCacheManager: syncTransferManagerWithBufferManager()
    KVCacheManager->>BlockManager: syncTransferManagerWithBufferManager()
    BlockManager->>TransferManager: syncWithBufferManager()
    TransferManager->>BufferManager: align streams & sync pending
    
    Note over TransferManager: Clear pending reads/writes
    
    rect rgba(100, 200, 100, 0.2)
        AllocateKvCache->>KVCacheManager: process context/generation requests
    end
sequenceDiagram
    participant PartialReuseLogic
    participant TransferManager
    participant GPU
    
    PartialReuseLogic->>PartialReuseLogic: detect partial block reuse
    
    alt Non-leaf/active block
        PartialReuseLogic->>TransferManager: syncPendingOnboardTransfers()
        TransferManager->>GPU: wait on pending writes/reads
        PartialReuseLogic->>TransferManager: copyGPUtoGPU()
        TransferManager->>GPU: device-to-device copy
        TransferManager->>TransferManager: recordPendingOnboardTransfers()
        PartialReuseLogic->>PartialReuseLogic: increment mCopiedReusedPartialBlocks
    else Leaf/private reuse
        PartialReuseLogic->>PartialReuseLogic: increment mDirectlyReusedPartialBlocks
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • kvCacheTransferManager.cpp — Dense synchronization coordination logic involving pending reads/writes maps, event handling, and stream alignment; requires careful verification of deadlock/race condition prevention.
  • Partial block reuse logic in kvCacheManager.cpp — Changed control flow replacing onboard() with copyGPUtoGPU(); verify correctness of counter increments and GPU copy path correctness.
  • Python binding synchronization in resource_manager.py — Verify timing and ordering of sync_transfer_manager_with_buffer_manager() and refresh_blocks() calls relative to KV cache allocation and prefilling.
  • Warmup request generation in model_engine.py — Three new helper functions with distinct logic paths; verify resource allocation/deallocation via release_batch() context manager and correctness of workload-aware request generation.
  • Cross-module interaction — Trace end-to-end flow from AllocateKvCache through TransferManager synchronization to ensure consistency across C++ and Python layers.

Suggested reviewers

  • tburt-nv
  • kxdc
  • lfr-0531

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 20.93%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title accurately describes the main fix (adding GPU-to-GPU block copy for partial reuse in the KV cache manager) and is clearly related to the changeset.
  • Description check — ✅ Passed: The PR description includes all required sections: issue explanation (KVCacheTransferManager::onboard missing GPU-to-GPU support), solution (use mBufferManager for block copy), and test coverage (the specific test case that validates the fix). Note that the dependency on PR #9056 is mentioned upfront.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1)

29-64: Also sync the cross KV manager before calling add_sequence.

Line 60 calls crossKvCacheManager->addSequence(...), but we never invoke the new syncTransferManagerWithBufferManager() on that manager. The header comment even states this sync must happen before the first addSequence in each step. Without it, the cross-cache path can still race its pending transfers, recreating the corruption we’re fixing here. Please sync both managers up front.

     kvCacheManager.syncTransferManagerWithBufferManager();
+    if (crossKvCacheManager)
+    {
+        crossKvCacheManager->syncTransferManagerWithBufferManager();
+    }
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

562-687: New warmup helper functions are currently unused and duplicate existing mechanisms

get_cuda_graph_warmup_request, get_torch_compile_warmup_request, get_autotune_warmup_request, and release_batch are defined inside warmup but never used; the actual warmup path still goes through _run_torch_compile_warmup, _run_autotuner_warmup, _run_cuda_graph_warmup and _release_batch_context.

This adds cognitive overhead and can drift from the logic in _create_warmup_request / _create_cuda_graph_warmup_request (which already encode more detailed constraints like beam width and fused drafting loops).

Consider either:

  • Wiring these helpers into the warmup orchestration and deprecating the older _create_* / _release_batch_context helpers, or
  • Removing these inner helpers for now to avoid dead code and duplication.

Also applies to: 688-699

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1228-1277: Confirm copyGPUtoGPU behavior when matchingBlock is offloaded (non‑primary).

In the partial‑reuse path, you now do:

auto newBlock = getFreeBlock(...);
mTransferManager->copyGPUtoGPU(matchingBlock, newBlock, mPools);
...
matchingBlock = newBlock;
...
onboardBlock(sequence, matchingBlock, mode, directory);  // now a no‑op, newBlock is primary

matchingBlock here can in principle refer to a block that has previously been offloaded to the secondary pool but is still present in the reuse radix tree. If copyGPUtoGPU assumes both source and destination blocks are primary/GPU‑resident, this would be incorrect for offloaded matches.

Please double‑check that either:

  • copyGPUtoGPU explicitly supports copying from blocks whose MemoryPoolBlockIndex currently points to the secondary pool, or
  • this path is only ever reached when matchingBlock->isPrimary() is true, or
  • you add a small guard / early onboarding for non‑primary matches before invoking copyGPUtoGPU.

The new partial‑reuse counters (mCopiedReusedPartialBlocks and mDirectlyReusedPartialBlocks) match the intended semantics of the two branches.
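A hedged sketch of the guard option from the last bullet above, using simplified stand-in types (the real BlockPtr, onboard, and copyGPUtoGPU signatures differ):

```cpp
// Hedged sketch only: simplified stand-in types to illustrate guarding
// non-primary (offloaded) matches before the GPU-to-GPU copy. This is not
// the actual TensorRT-LLM API; onboard/copyGPUtoGPU signatures are simplified.
struct Block
{
    bool primary{true};

    bool isPrimary() const
    {
        return primary;
    }
};

template <typename TransferManagerT, typename PoolsT>
void copyForPartialReuse(
    TransferManagerT& transferManager, Block& matchingBlock, Block& newBlock, PoolsT const& pools)
{
    if (!matchingBlock.isPrimary())
    {
        // Early-onboard the offloaded match so the GPU-to-GPU copy below only
        // reads from a primary (GPU-resident) source block.
        transferManager.onboard(matchingBlock, pools);
    }
    transferManager.copyGPUtoGPU(matchingBlock, newBlock, pools);
}
```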

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79c0df and 429afdb.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (0 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • syncTransferManagerWithBufferManager (1979-1982)
  • refreshBlocks (1985-1988)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
  • refreshBlocks (1366-1372)
  • refreshBlocks (1366-1366)
  • refreshBlocks (1374-1378)
  • refreshBlocks (1374-1374)
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • offloadBlock (1112-1116)
  • offloadBlock (1112-1113)
  • offloadBlock (1118-1146)
  • offloadBlock (1118-1119)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
  • batch_size (37-38)
  • ScheduledRequests (20-41)
  • all_requests (40-41)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (9)
  • get_num_free_blocks (806-814)
  • add_dummy_requests (78-79)
  • add_dummy_requests (460-538)
  • get_num_available_tokens (819-821)
  • free_resources (87-88)
  • free_resources (628-630)
  • free_resources (1295-1299)
  • free_resources (1362-1365)
  • free_resources (1490-1491)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • add_dummy_requests (88-90)
  • free_resources (81-86)
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (2)
  • add_dummy_requests (914-939)
  • free_resources (978-980)
tensorrt_llm/_torch/speculative/eagle3.py (2)
  • add_dummy_requests (96-98)
  • free_resources (90-94)
cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (14)
  • syncWithBufferManager (345-358)
  • syncWithBufferManager (345-345)
  • syncTransfers (360-373)
  • syncTransfers (360-360)
  • computeBlockPointer (89-98)
  • computeBlockPointer (89-90)
  • syncPendingOnboardTransfers (257-280)
  • syncPendingOnboardTransfers (257-257)
  • recordPendingOnboardTransfers (282-290)
  • recordPendingOnboardTransfers (282-282)
  • copyGPUtoGPU (101-111)
  • copyGPUtoGPU (101-102)
  • copyGPUtoGPU (301-307)
  • copyGPUtoGPU (301-302)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • syncTransferManagerWithBufferManager (1979-1982)
  • refreshBlocks (1985-1988)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
  • refreshBlocks (1366-1372)
  • refreshBlocks (1366-1366)
  • refreshBlocks (1374-1378)
  • refreshBlocks (1374-1374)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  • syncTransferManagerWithBufferManager (1979-1982)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/model_engine.py

573-573: Undefined name use_mrope

(F821)


586-586: Undefined name use_mrope

(F821)


595-595: Undefined name spec_resource_manager

(F821)


596-596: Undefined name spec_resource_manager

(F821)


616-616: Undefined name spec_resource_manager

(F821)


617-617: Undefined name spec_resource_manager

(F821)


678-678: Undefined name spec_resource_manager

(F821)


679-679: Undefined name spec_resource_manager

(F821)


696-696: Undefined name spec_resource_manager

(F821)


697-697: Undefined name spec_resource_manager

(F821)

🔇 Additional comments (5)
tests/integration/test_lists/waives.txt (1)

38-444: Verify the rationale and scope of 407 new test waivers.

This change adds a substantial number (407) of new test skip entries spanning unrelated modules (multimodal, performance tests, disaggregated serving, Triton server, etc.). This scale is unexpected for a targeted fix to GPU→GPU KV cache block copy.

A few clarifications needed:

  1. Relationship to PR #9056: The PR depends on #9056 and will be rebased. Are these waivers required due to changes in that PR, or are they specific to this KV cache fix?
  2. Scope mismatch: Many waivers reference test modules unrelated to KV cache (e.g., test_multimodal.py, test_perf.py, triton_server/). Why do unrelated components require waiving?
  3. Temporary vs. persistent: Are these waivers expected to be permanent or temporary blockers that will be resolved before merge?
  4. Documentation: Consider adding a brief inline comment explaining the rationale for this bulk update (e.g., "Waivers inherited from PR #9056" or "Known transient failures pending infrastructure fixes").

For context, the PR objectives mention the fix is exposed by a single test case (test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse), so the extensive waiver list warrants explanation in the commit message or PR description.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

243-247: New sync/refresh bindings are consistent with BaseKVCacheManager

The new syncTransferManagerWithBufferManager trampoline and the Python bindings for sync_transfer_manager_with_buffer_manager / refresh_blocks match the C++ interface and existing binding style (pure virtual override + GIL release). No issues from this change.

Also applies to: 493-495

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

238-242: Nanobind trampoline and bindings look correct; double‑check NB_TRAMPOLINE count

The syncTransferManagerWithBufferManager override and the new nanobind methods are consistent with the C++ interface and other bindings. Please just confirm that the NB_TRAMPOLINE(tbk::BaseKVCacheManager, 30); count still matches the exact number of pure virtuals after this addition.

Also applies to: 489-491

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

642-682: Partial-reuse counters and logging look consistent.

The new mCopiedReusedPartialBlocks / mDirectlyReusedPartialBlocks members are initialized in the ctor and used only in loadOrAllocateBlocks and the destructor logs, so they’re consistent and side‑effect free. No functional concerns from these additions.

Also applies to: 751-773


1353-1364: Sync plumbing between block managers and transfer manager is straightforward.

BlockManager::syncTransferManagerWithBufferManager() simply forwards to each WindowBlockManager, which in turn calls mTransferManager->syncWithBufferManager(). This matches the header‑side override and cleanly wires the sync API through the hierarchy. Looks good.
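For illustration, a minimal stand-alone sketch of that forwarding shape (class and member names follow the review text; the real classes carry far more state):

```cpp
// Minimal sketch of the forwarding pattern described above; illustrative only.
#include <map>

struct KVCacheTransferManager
{
    void syncWithBufferManager() { /* wait on pending transfers, align streams */ }
};

struct WindowBlockManager
{
    KVCacheTransferManager* mTransferManager{nullptr};

    void syncTransferManagerWithBufferManager()
    {
        mTransferManager->syncWithBufferManager();
    }
};

struct BlockManager
{
    std::map<int, WindowBlockManager> mWindowBlockManagers; // keyed by attention window size

    void syncTransferManagerWithBufferManager()
    {
        // Forward the sync request to every per-window block manager.
        for (auto& [windowSize, manager] : mWindowBlockManagers)
        {
            manager.syncTransferManagerWithBufferManager();
        }
    }
};
```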

Comment on lines 301 to 306
void KVCacheTransferManager::copyGPUtoGPU(
    BlockPtr const& offloadedBlock, BlockPtr const& block, std::vector<KVCacheBlockPool> const& pools)
{
    syncPendingOnboardTransfers(offloadedBlock, block);
    copyGPUtoGPU(offloadedBlock, block, pools);
    recordPendingOnboardTransfers(offloadedBlock, block);

⚠️ Potential issue | 🔴 Critical

Merge the GPU-to-GPU copy helper instead of redefining it.

We already defined KVCacheTransferManager::copyGPUtoGPU(BlockPtr const&, BlockPtr const&, …) up at Line 100. This second definition has the same signature, so it will not compile (redefinition), and even if it did, the recursive call would never terminate. Fold the sync/record logic into the first implementation (or rename the helper) so there’s exactly one definition that performs the sync, copy, and bookkeeping.

-void KVCacheTransferManager::copyGPUtoGPU(
-    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
-{
-    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
-    {
-        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
-        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
-        mBufferManager.copy(*srcPtr, *dstPtr);
-    }
-    TLLM_LOG_DEBUG("GPU-to-GPU copy for from block %d to block %d", src->getBlockId(), dst->getBlockId());
-}
+void KVCacheTransferManager::copyGPUtoGPU(
+    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
+{
+    syncPendingOnboardTransfers(src, dst);
+    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
+    {
+        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
+        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
+        mBufferManager.copy(*srcPtr, *dstPtr);
+    }
+    recordPendingOnboardTransfers(src, dst);
+    TLLM_LOG_DEBUG(
+        "GPU-to-GPU copy from block %d to block %d", src->getBlockId(), dst->getBlockId());
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp around lines
301-306, there is a duplicate definition of KVCacheTransferManager::copyGPUtoGPU
with the same signature that simply calls itself (causing
redefinition/recursion). Remove this duplicate function and instead merge the
syncPendingOnboardTransfers(offloadedBlock, block) and
recordPendingOnboardTransfers(offloadedBlock, block) calls into the single
existing copyGPUtoGPU implementation defined earlier (around line 100), or
rename this helper and update all call sites; ensure the final copyGPUtoGPU
performs sync, actual GPU-to-GPU copy, and bookkeeping without calling itself.

Comment on lines +562 to +687
def get_cuda_graph_warmup_request(batch_size):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size:
        result = ScheduledRequests()
        result.context_requests = []
        # Add (batch_size - 1) dummy requests with seq_len=1.
        # Should only need one more page per request.
        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size - 1)),
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )
        available_tokens = kv_cache_manager.get_num_available_tokens(
            self.max_draft_len)

        # Add one dummy request with the maximum possible sequence length.
        # The sequence length is limited by both the max_seq_len and the number of available blocks.
        token_num = max(1, min(available_tokens, self.max_seq_len - 1))
        max_seq_len_request = kv_cache_manager.add_dummy_requests(
            request_ids=[batch_size - 1],
            token_nums=[token_num],
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )[0]
        # Add the longest request before all other seq_len=1 request to simulate the padding CUDA graph case.
        # This batch contains both the longest request and the shortest requests,
        # it also contains the maximum number of requests and the maximum token number,
        # which simulates the extreme case for the padding CUDA graph.
        # Thus we can replay this CUDA graph in all other cases.
        requests.insert(0, max_seq_len_request)
        result.generation_requests = requests
        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))
    else:
        result = None
    return result

def get_torch_compile_warmup_request(batch_size, num_tokens_per_request):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size * math.ceil(
            num_tokens_per_request / kv_cache_manager.tokens_per_block):
        # Should only need (at most) one more page per request.
        is_gen = num_tokens_per_request == 1

        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size)), [num_tokens_per_request] *
            batch_size if not is_gen else None,
            is_gen=is_gen,
            max_num_draft_tokens=self.max_draft_len)

        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))

        result = ScheduledRequests()
        result.context_requests = []
        result.generation_requests = []
        if is_gen:
            result.generation_requests = requests
        else:
            result.context_requests = requests
    else:
        result = None
    return result

def get_autotune_warmup_request():
    available_tokens = kv_cache_manager.get_num_available_tokens(
        self.max_draft_len)
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens)
    # Number of tokens required per request must be rounded up to whole number of blocks
    num_tokens_required_per_request = (
        (num_tokens_per_request + kv_cache_manager.tokens_per_block - 1)
        // kv_cache_manager.tokens_per_block
    ) * kv_cache_manager.tokens_per_block

    available_blocks = kv_cache_manager.get_num_free_blocks()

    maximum_tunable_num_tokens = min(
        self.batch_size * num_tokens_per_request, self.max_num_tokens,
        available_blocks * kv_cache_manager.tokens_per_block)

    # Calculate number of full-length requests and remaining tokens
    # Each request has num_tokens_per_request tokens, except possibly the last one
    # Calculations are also limited by how many KV cache blocks are available
    full_len_request_num = min(
        maximum_tunable_num_tokens // num_tokens_per_request,
        max(1, available_tokens // num_tokens_required_per_request))
    remaining_tokens = min(
        maximum_tunable_num_tokens % num_tokens_per_request,
        max(
            0, available_tokens -
            full_len_request_num * num_tokens_required_per_request))

    request_num = full_len_request_num if remaining_tokens == 0 else full_len_request_num + 1

    requests = kv_cache_manager.add_dummy_requests(
        request_ids=list(range(full_len_request_num)),
        token_nums=[num_tokens_per_request] * full_len_request_num,
        is_gen=False,
        max_num_draft_tokens=self.max_draft_len)

    if remaining_tokens > 0:
        final_request = kv_cache_manager.add_dummy_requests(
            request_ids=[full_len_request_num],
            token_nums=[remaining_tokens],
            is_gen=False,
            max_num_draft_tokens=self.max_draft_len)

        requests += final_request

    if spec_resource_manager is not None:
        spec_resource_manager.add_dummy_requests(
            request_ids=list(range(request_num)))

    result = ScheduledRequests()
    result.context_requests = requests
    result.generation_requests = []

    return result


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case

Two concrete issues in the new inner helpers:

  1. Undefined variables (use_mrope, spec_resource_manager)

    • get_cuda_graph_warmup_request and get_torch_compile_warmup_request call kv_cache_manager.add_dummy_requests(..., use_mrope=use_mrope, ...), but use_mrope is never defined in warmup.
    • All three helpers and release_batch reference spec_resource_manager, which is also never defined in warmup.
    • If any of these helpers are called, they will raise a NameError.

    Suggested fix (before the helper definitions in warmup):

    kv_cache_manager = resource_manager.get_resource_manager(self.kv_cache_manager_key)
    if kv_cache_manager is None:
        logger.info("Skipping warm up as no KV Cache manager allocated.")
        return
    
    use_mrope = self.use_mrope
    spec_resource_manager = resource_manager.get_resource_manager(
        ResourceManagerType.SPEC_RESOURCE_MANAGER
    )
    
    # The lifetime of model engine and kv cache manager can be different.
    self.cuda_graph_runner.padding_dummy_request = None
    
    def get_cuda_graph_warmup_request(batch_size):
        ...
  2. Potential division‑by‑zero in get_autotune_warmup_request

    • When kv_cache_manager.get_num_available_tokens(self.max_draft_len) returns 0, num_tokens_per_request becomes 0, leading to maximum_tunable_num_tokens // num_tokens_per_request dividing by zero.
    • This is easy to hit in low‑memory / fully‑utilized KV cache situations.

    Suggested minimal guard inside get_autotune_warmup_request:

    available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
    if available_tokens <= 0:
        return None
    
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens,
    )
    if num_tokens_per_request <= 0:
        return None

Without these fixes, wiring these helpers into the warmup flow will be fragile and can hard‑fail at runtime.

Also applies to: 688-697



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract the full warmup method and surrounding context
fd -t f 'model_engine\.py' | head -1 | xargs cat | head -750 | tail -200
```

Length of output: 9494


🏁 Script executed:

#!/bin/bash
# Search for use_mrope and spec_resource_manager usage in the file
rg -n 'use_mrope|spec_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2

Length of output: 11353


🏁 Script executed:

#!/bin/bash
# Find where warmup method starts and its parameters
rg -n 'def warmup' tensorrt_llm/_torch/pyexecutor/model_engine.py -A30

Length of output: 3293


🏁 Script executed:

#!/bin/bash
# Search for resource_manager.get_resource_manager patterns in the file
rg -n 'resource_manager\.get_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2

Length of output: 3761


🏁 Script executed:

#!/bin/bash
# Look for self.use_mrope references
rg -n 'self\.use_mrope' tensorrt_llm/_torch/pyexecutor/model_engine.py -B1 -A1

Length of output: 1772


Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case

Two concrete issues in the new inner helpers have been verified:

  1. Undefined variables (use_mrope, spec_resource_manager)

    • get_cuda_graph_warmup_request and get_torch_compile_warmup_request use use_mrope at lines 573 and 586, but this variable is never defined in the warmup method scope—it will raise NameError if called.
    • All three helpers and release_batch reference spec_resource_manager at lines 595–597, 616–618, 678–680, and 696–697, also never defined in warmup scope—will raise NameError if called.
    • Elsewhere in the codebase, spec_resource_manager is consistently retrieved via resource_manager.get_resource_manager(ResourceManagerType.SPEC_RESOURCE_MANAGER) (lines 903, 930, 1008, 1085) before use.

    Add these before the helper definitions in warmup:

    use_mrope = self.use_mrope
    spec_resource_manager = resource_manager.get_resource_manager(
        ResourceManagerType.SPEC_RESOURCE_MANAGER
    )
  2. Division-by-zero in get_autotune_warmup_request

    • When kv_cache_manager.get_num_available_tokens(self.max_draft_len) returns 0, num_tokens_per_request becomes 0, causing maximum_tunable_num_tokens // num_tokens_per_request to divide by zero.
    • Add guards:
    available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
    if available_tokens <= 0:
        return None
    
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens)
    if num_tokens_per_request <= 0:
        return None

These helpers will fail at runtime without these fixes.

🧰 Tools
🪛 Ruff (0.14.5)

573-573: Undefined name use_mrope

(F821)


586-586: Undefined name use_mrope

(F821)


595-595: Undefined name spec_resource_manager

(F821)


596-596: Undefined name spec_resource_manager

(F821)


616-616: Undefined name spec_resource_manager

(F821)


617-617: Undefined name spec_resource_manager

(F821)


678-678: Undefined name spec_resource_manager

(F821)


679-679: Undefined name spec_resource_manager

(F821)

…ial reuse in the KV cache manager

`KVCacheTransferManager::onboard` only covers memory movement between
CPU and GPU and not GPU to GPU.

Use the `mBufferManager` to copy block content.

This fixes the incorrect partial block copy functionality exposed
through test case `accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse`

Signed-off-by: eopXD <[email protected]>
std::string const& directory = "");

/*!
* \brief Directly copy a block from gpu to gpu.
Collaborator

Is this asynchronous from CPU's perspective? If so, we should mention that here. Something like:

\note The copy is launched on an internal stream asynchronously and this function may return before the copy is complete.
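For concreteness, the suggested note could be folded into the quoted header comment roughly like this (sketch only; final wording up to the author):

```cpp
/*!
 * \brief Directly copy a block from GPU to GPU.
 *
 * \note The copy is launched on an internal stream asynchronously and this
 *       function may return before the copy is complete.
 */
```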


Labels

KV-Cache Management kv-cache management for efficient LLM inference


4 participants