
Conversation


@eopXD eopXD commented Nov 19, 2025

NOTE: This PR is currently based on and depends on #9056. It will be rebased and reorganized after #9056 lands.

Summary by CodeRabbit

  • Bug Fixes

    • Improved synchronization between GPU memory transfer operations and buffer management to prevent race conditions during cache operations.
  • Refactor

    • Enhanced KV cache management with refined block reuse tracking and GPU-to-GPU copy operations for optimized memory transfer performance.
    • Strengthened resource cleanup and synchronization during model warmup initialization.

Description

KVCacheTransferManager::onboard only covers memory movement between
CPU and GPU, not GPU-to-GPU transfers.

Use mBufferManager to copy block contents correctly.
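For context, the underlying CUDA pattern here is an event-ordered, asynchronous device-to-device copy. The sketch below is illustrative only, not the TensorRT-LLM implementation; all names are placeholders:

```cpp
// Illustrative sketch only (not the TensorRT-LLM code): a device-to-device
// block copy that is ordered after a pending write on another stream via a
// CUDA event. All names here are placeholders.
#include <cuda_runtime.h>

void copyBlockDeviceToDevice(void* dstBlock, void const* srcBlock, size_t blockBytes,
    cudaStream_t copyStream, cudaEvent_t pendingWriteDone)
{
    // Do not start the copy until the pending write to the source block has finished.
    cudaStreamWaitEvent(copyStream, pendingWriteDone, 0);
    // Asynchronous device-to-device copy; the host may return before it completes.
    cudaMemcpyAsync(dstBlock, srcBlock, blockBytes, cudaMemcpyDeviceToDevice, copyStream);
}
```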

Test Coverage

This fixes the incorrect partial-block copy behavior exposed by the test case
accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

thorjohnsen and others added 29 commits July 9, 2025 16:23
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
…n/TensorRT-LLM into user/tjohnsen/fix_bug_5375590

.
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>

.
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
@eopXD eopXD requested a review from a team as a code owner November 19, 2025 06:22
@eopXD eopXD requested a review from a team as a code owner November 19, 2025 06:22
@eopXD eopXD requested a review from achartier November 19, 2025 06:22
@eopXD eopXD self-assigned this Nov 19, 2025
@eopXD eopXD added the KV-Cache Management kv-cache management for efficient LLM inference label Nov 19, 2025
@eopXD eopXD marked this pull request as draft November 19, 2025 06:23

coderabbitai bot commented Nov 19, 2025

📝 Walkthrough

This PR enhances KV cache transfer synchronization across WindowBlockManager, BlockManager, and BaseKVCacheManager by introducing explicit sync pathways with the buffer manager, adds GPU-to-GPU copy capabilities, refines pending I/O tracking with separate read/write events, integrates synchronization into resource preparation and warmup, and exposes new Python bindings for sync and refresh operations.

Changes

  • KV Cache Manager Headers (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h): Added a syncTransferManagerWithBufferManager() public method to WindowBlockManager, BlockManager, BaseKVCacheManager (pure virtual), and KVCacheManager (override). Added private counters mCopiedReusedPartialBlocks and mDirectlyReusedPartialBlocks to KVCacheManager for tracking partial block reuse semantics.
  • KV Cache Transfer Manager Headers (cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h): Added syncWithBufferManager(), copyGPUtoGPU(), syncPendingOnboardTransfers(), and recordPendingOnboardTransfers() public methods. Replaced the mPendingOffloads map with separate mPendingReads and mPendingWrites maps keyed by block index for refined I/O tracking.
  • KV Cache Manager Implementation (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Implemented BlockManager::syncTransferManagerWithBufferManager() and WindowBlockManager::syncTransferManagerWithBufferManager() delegation methods. Updated partial block reuse logic to call copyGPUtoGPU() instead of onboard() and track outcomes with the new counters. Enhanced destructor logging with partial reuse counters.
  • KV Cache Transfer Manager Implementation (cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp): Implemented copyGPUtoGPU() with direct device-to-device copies through mBufferManager. Added syncPendingOnboardTransfers() and recordPendingOnboardTransfers() helpers for onboard/offload coordination. Enhanced onboard() and offload() with synchronization and pending event recording. Implemented syncWithBufferManager() for stream alignment and pending transfer cleanup. Updated syncTransfers() to coordinate with the buffer manager.
  • KV Cache Allocation (cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp): Added a kvCacheManager.syncTransferManagerWithBufferManager() call at the start of AllocateKvCache::operator() for initialization synchronization.
  • Python Bindings (cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp): Added a PyKvCacheManager::syncTransferManagerWithBufferManager() override in both the nanobind and pybind implementations. Exposed Python bindings for sync_transfer_manager_with_buffer_manager and refresh_blocks on BaseKVCacheManager.
  • Torch Executor Warmup (tensorrt_llm/_torch/pyexecutor/model_engine.py): Added three new helper functions in warmup: get_cuda_graph_warmup_request(), get_torch_compile_warmup_request(), and get_autotune_warmup_request() for workload-aware warmup generation. Added a release_batch() context manager for per-batch resource cleanup. Updated a TODO comment referencing "star attention".
  • Resource Manager (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Added a self.impl.sync_transfer_manager_with_buffer_manager() call before KV cache allocation in prepare_resources(). Added a self.impl.refresh_blocks() call after context/generation prefilling for kernel readiness.
  • Test Infrastructure (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Removed the pytest skip decorator from test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse, enabling the previously skipped test.
  • Test Waives (tests/integration/test_lists/waives.txt): Updated the test exclusion list with multiple new SKIP entries across various test modules and paths.

Sequence Diagram(s)

sequenceDiagram
    participant AllocateKvCache
    participant KVCacheManager
    participant BlockManager
    participant TransferManager
    participant BufferManager
    
    AllocateKvCache->>KVCacheManager: syncTransferManagerWithBufferManager()
    KVCacheManager->>BlockManager: syncTransferManagerWithBufferManager()
    BlockManager->>TransferManager: syncWithBufferManager()
    TransferManager->>BufferManager: align streams & sync pending
    
    Note over TransferManager: Clear pending reads/writes
    
    rect rgba(100, 200, 100, 0.2)
        AllocateKvCache->>KVCacheManager: process context/generation requests
    end
sequenceDiagram
    participant PartialReuseLogic
    participant TransferManager
    participant GPU
    
    PartialReuseLogic->>PartialReuseLogic: detect partial block reuse
    
    alt Non-leaf/active block
        PartialReuseLogic->>TransferManager: syncPendingOnboardTransfers()
        TransferManager->>GPU: wait on pending writes/reads
        PartialReuseLogic->>TransferManager: copyGPUtoGPU()
        TransferManager->>GPU: device-to-device copy
        TransferManager->>TransferManager: recordPendingOnboardTransfers()
        PartialReuseLogic->>PartialReuseLogic: increment mCopiedReusedPartialBlocks
    else Leaf/private reuse
        PartialReuseLogic->>PartialReuseLogic: increment mDirectlyReusedPartialBlocks
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • kvCacheTransferManager.cpp — Dense synchronization coordination logic involving pending reads/writes maps, event handling, and stream alignment; requires careful verification of deadlock/race condition prevention.
  • Partial block reuse logic in kvCacheManager.cpp — Changed control flow replacing onboard() with copyGPUtoGPU(); verify correctness of counter increments and GPU copy path correctness.
  • Python binding synchronization in resource_manager.py — Verify timing and ordering of sync_transfer_manager_with_buffer_manager() and refresh_blocks() calls relative to KV cache allocation and prefilling.
  • Warmup request generation in model_engine.py — Three new helper functions with distinct logic paths; verify resource allocation/deallocation via release_batch() context manager and correctness of workload-aware request generation.
  • Cross-module interaction — Trace end-to-end flow from AllocateKvCache through TransferManager synchronization to ensure consistency across C++ and Python layers.

Suggested reviewers

  • tburt-nv
  • kxdc
  • lfr-0531

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 20.93%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title accurately describes the main fix (adding GPU-to-GPU block copy for partial reuse in the KV cache manager) and is clearly related to the changeset.
  • Description check — ✅ Passed: The PR description includes all required sections: issue explanation (KVCacheTransferManager::onboard missing GPU-to-GPU support), solution (use mBufferManager for block copy), and test coverage (the specific test case that validates the fix). Note that the dependency on PR #9056 is mentioned upfront.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1)

29-64: Also sync the cross KV manager before calling add_sequence.

Line 60 calls crossKvCacheManager->addSequence(...), but we never invoke the new syncTransferManagerWithBufferManager() on that manager. The header comment even states this sync must happen before the first addSequence in each step. Without it, the cross-cache path can still race its pending transfers, recreating the corruption we’re fixing here. Please sync both managers up front.

     kvCacheManager.syncTransferManagerWithBufferManager();
+    if (crossKvCacheManager)
+    {
+        crossKvCacheManager->syncTransferManagerWithBufferManager();
+    }
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

562-687: New warmup helper functions are currently unused and duplicate existing mechanisms

get_cuda_graph_warmup_request, get_torch_compile_warmup_request, get_autotune_warmup_request, and release_batch are defined inside warmup but never used; the actual warmup path still goes through _run_torch_compile_warmup, _run_autotuner_warmup, _run_cuda_graph_warmup and _release_batch_context.

This adds cognitive overhead and can drift from the logic in _create_warmup_request / _create_cuda_graph_warmup_request (which already encode more detailed constraints like beam width and fused drafting loops).

Consider either:

  • Wiring these helpers into the warmup orchestration and deprecating the older _create_* / _release_batch_context helpers, or
  • Removing these inner helpers for now to avoid dead code and duplication.

Also applies to: 688-699

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1228-1277: Confirm copyGPUtoGPU behavior when matchingBlock is offloaded (non‑primary).

In the partial‑reuse path, you now do:

auto newBlock = getFreeBlock(...);
mTransferManager->copyGPUtoGPU(matchingBlock, newBlock, mPools);
...
matchingBlock = newBlock;
...
onboardBlock(sequence, matchingBlock, mode, directory);  // now a no‑op, newBlock is primary

matchingBlock here can in principle refer to a block that has previously been offloaded to the secondary pool but is still present in the reuse radix tree. If copyGPUtoGPU assumes both source and destination blocks are primary/GPU‑resident, this would be incorrect for offloaded matches.

Please double‑check that either:

  • copyGPUtoGPU explicitly supports copying from blocks whose MemoryPoolBlockIndex currently points to the secondary pool, or
  • this path is only ever reached when matchingBlock->isPrimary() is true, or
  • you add a small guard / early onboarding for non‑primary matches before invoking copyGPUtoGPU.

The new partial‑reuse counters (mCopiedReusedPartialBlocks and mDirectlyReusedPartialBlocks) match the intended semantics of the two branches.
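A hedged sketch of the guard option from the last bullet above, using simplified stand-in types (the real BlockPtr, onboard, and copyGPUtoGPU signatures differ):

```cpp
// Hedged sketch only: simplified stand-in types to illustrate guarding
// non-primary (offloaded) matches before the GPU-to-GPU copy. This is not
// the actual TensorRT-LLM API; onboard/copyGPUtoGPU signatures are simplified.
struct Block
{
    bool primary{true};

    bool isPrimary() const
    {
        return primary;
    }
};

template <typename TransferManagerT, typename PoolsT>
void copyForPartialReuse(
    TransferManagerT& transferManager, Block& matchingBlock, Block& newBlock, PoolsT const& pools)
{
    if (!matchingBlock.isPrimary())
    {
        // Early-onboard the offloaded match so the GPU-to-GPU copy below only
        // reads from a primary (GPU-resident) source block.
        transferManager.onboard(matchingBlock, pools);
    }
    transferManager.copyGPUtoGPU(matchingBlock, newBlock, pools);
}
```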

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a79c0df and 429afdb.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (0 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • syncTransferManagerWithBufferManager (1979-1982)
  • refreshBlocks (1985-1988)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
  • refreshBlocks (1366-1372)
  • refreshBlocks (1366-1366)
  • refreshBlocks (1374-1378)
  • refreshBlocks (1374-1374)
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • offloadBlock (1112-1116)
  • offloadBlock (1112-1113)
  • offloadBlock (1118-1146)
  • offloadBlock (1118-1119)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
  • batch_size (37-38)
  • ScheduledRequests (20-41)
  • all_requests (40-41)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (9)
  • get_num_free_blocks (806-814)
  • add_dummy_requests (78-79)
  • add_dummy_requests (460-538)
  • get_num_available_tokens (819-821)
  • free_resources (87-88)
  • free_resources (628-630)
  • free_resources (1295-1299)
  • free_resources (1362-1365)
  • free_resources (1490-1491)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • add_dummy_requests (88-90)
  • free_resources (81-86)
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (2)
  • add_dummy_requests (914-939)
  • free_resources (978-980)
tensorrt_llm/_torch/speculative/eagle3.py (2)
  • add_dummy_requests (96-98)
  • free_resources (90-94)
cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (14)
  • syncWithBufferManager (345-358)
  • syncWithBufferManager (345-345)
  • syncTransfers (360-373)
  • syncTransfers (360-360)
  • computeBlockPointer (89-98)
  • computeBlockPointer (89-90)
  • syncPendingOnboardTransfers (257-280)
  • syncPendingOnboardTransfers (257-257)
  • recordPendingOnboardTransfers (282-290)
  • recordPendingOnboardTransfers (282-282)
  • copyGPUtoGPU (101-111)
  • copyGPUtoGPU (101-102)
  • copyGPUtoGPU (301-307)
  • copyGPUtoGPU (301-302)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • syncTransferManagerWithBufferManager (1979-1982)
  • refreshBlocks (1985-1988)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  • syncTransferManagerWithBufferManager (1353-1359)
  • syncTransferManagerWithBufferManager (1353-1353)
  • syncTransferManagerWithBufferManager (1361-1364)
  • syncTransferManagerWithBufferManager (1361-1361)
  • refreshBlocks (1366-1372)
  • refreshBlocks (1366-1366)
  • refreshBlocks (1374-1378)
  • refreshBlocks (1374-1374)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  • syncTransferManagerWithBufferManager (1979-1982)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/model_engine.py

573-573: Undefined name use_mrope

(F821)


586-586: Undefined name use_mrope

(F821)


595-595: Undefined name spec_resource_manager

(F821)


596-596: Undefined name spec_resource_manager

(F821)


616-616: Undefined name spec_resource_manager

(F821)


617-617: Undefined name spec_resource_manager

(F821)


678-678: Undefined name spec_resource_manager

(F821)


679-679: Undefined name spec_resource_manager

(F821)


696-696: Undefined name spec_resource_manager

(F821)


697-697: Undefined name spec_resource_manager

(F821)

🔇 Additional comments (5)
tests/integration/test_lists/waives.txt (1)

38-444: Verify the rationale and scope of 407 new test waivers.

This change adds a substantial number (407) of new test skip entries spanning unrelated modules (multimodal, performance tests, disaggregated serving, Triton server, etc.). This scale is unexpected for a targeted fix to GPU→GPU KV cache block copy.

A few clarifications needed:

  1. Relationship to PR #9056: The PR depends on #9056 and will be rebased. Are these waivers required due to changes in that PR, or are they specific to this KV cache fix?
  2. Scope mismatch: Many waivers reference test modules unrelated to KV cache (e.g., test_multimodal.py, test_perf.py, triton_server/). Why do unrelated components require waiving?
  3. Temporary vs. persistent: Are these waivers expected to be permanent or temporary blockers that will be resolved before merge?
  4. Documentation: Consider adding a brief inline comment explaining the rationale for this bulk update (e.g., "Waivers inherited from PR #9056" or "Known transient failures pending infrastructure fixes").

For context, the PR objectives mention the fix is exposed by a single test case (test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse), so the extensive waiver list warrants explanation in the commit message or PR description.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

243-247: New sync/refresh bindings are consistent with BaseKVCacheManager

The new syncTransferManagerWithBufferManager trampoline and the Python bindings for sync_transfer_manager_with_buffer_manager / refresh_blocks match the C++ interface and existing binding style (pure virtual override + GIL release). No issues from this change.

Also applies to: 493-495

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

238-242: Nanobind trampoline and bindings look correct; double‑check NB_TRAMPOLINE count

The syncTransferManagerWithBufferManager override and the new nanobind methods are consistent with the C++ interface and other bindings. Please just confirm that the NB_TRAMPOLINE(tbk::BaseKVCacheManager, 30); count still matches the exact number of pure virtuals after this addition.

Also applies to: 489-491

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

642-682: Partial-reuse counters and logging look consistent.

The new mCopiedReusedPartialBlocks / mDirectlyReusedPartialBlocks members are initialized in the ctor and used only in loadOrAllocateBlocks and the destructor logs, so they’re consistent and side‑effect free. No functional concerns from these additions.

Also applies to: 751-773


1353-1364: Sync plumbing between block managers and transfer manager is straightforward.

BlockManager::syncTransferManagerWithBufferManager() simply forwards to each WindowBlockManager, which in turn calls mTransferManager->syncWithBufferManager(). This matches the header‑side override and cleanly wires the sync API through the hierarchy. Looks good.
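For illustration, a minimal stand-alone sketch of that forwarding shape (class and member names follow the review text; the real classes carry far more state):

```cpp
// Minimal sketch of the forwarding pattern described above; illustrative only.
#include <map>

struct KVCacheTransferManager
{
    void syncWithBufferManager() { /* wait on pending transfers, align streams */ }
};

struct WindowBlockManager
{
    KVCacheTransferManager* mTransferManager{nullptr};

    void syncTransferManagerWithBufferManager()
    {
        mTransferManager->syncWithBufferManager();
    }
};

struct BlockManager
{
    std::map<int, WindowBlockManager> mWindowBlockManagers; // keyed by attention window size

    void syncTransferManagerWithBufferManager()
    {
        // Forward the sync request to every per-window block manager.
        for (auto& [windowSize, manager] : mWindowBlockManagers)
        {
            manager.syncTransferManagerWithBufferManager();
        }
    }
};
```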

Comment on lines 301 to 306
void KVCacheTransferManager::copyGPUtoGPU(
    BlockPtr const& offloadedBlock, BlockPtr const& block, std::vector<KVCacheBlockPool> const& pools)
{
    syncPendingOnboardTransfers(offloadedBlock, block);
    copyGPUtoGPU(offloadedBlock, block, pools);
    recordPendingOnboardTransfers(offloadedBlock, block);

⚠️ Potential issue | 🔴 Critical

Merge the GPU-to-GPU copy helper instead of redefining it.

We already defined KVCacheTransferManager::copyGPUtoGPU(BlockPtr const&, BlockPtr const&, …) up at Line 100. This second definition has the same signature, so it will not compile (redefinition), and even if it did, the recursive call would never terminate. Fold the sync/record logic into the first implementation (or rename the helper) so there’s exactly one definition that performs the sync, copy, and bookkeeping.

-void KVCacheTransferManager::copyGPUtoGPU(
-    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
-{
-    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
-    {
-        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
-        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
-        mBufferManager.copy(*srcPtr, *dstPtr);
-    }
-    TLLM_LOG_DEBUG("GPU-to-GPU copy for from block %d to block %d", src->getBlockId(), dst->getBlockId());
-}
+void KVCacheTransferManager::copyGPUtoGPU(
+    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
+{
+    syncPendingOnboardTransfers(src, dst);
+    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
+    {
+        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
+        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
+        mBufferManager.copy(*srcPtr, *dstPtr);
+    }
+    recordPendingOnboardTransfers(src, dst);
+    TLLM_LOG_DEBUG(
+        "GPU-to-GPU copy from block %d to block %d", src->getBlockId(), dst->getBlockId());
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp around lines
301-306, there is a duplicate definition of KVCacheTransferManager::copyGPUtoGPU
with the same signature that simply calls itself (causing
redefinition/recursion). Remove this duplicate function and instead merge the
syncPendingOnboardTransfers(offloadedBlock, block) and
recordPendingOnboardTransfers(offloadedBlock, block) calls into the single
existing copyGPUtoGPU implementation defined earlier (around line 100), or
rename this helper and update all call sites; ensure the final copyGPUtoGPU
performs sync, actual GPU-to-GPU copy, and bookkeeping without calling itself.

Comment on lines +562 to +687
def get_cuda_graph_warmup_request(batch_size):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size:
        result = ScheduledRequests()
        result.context_requests = []
        # Add (batch_size - 1) dummy requests with seq_len=1.
        # Should only need one more page per request.
        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size - 1)),
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )
        available_tokens = kv_cache_manager.get_num_available_tokens(
            self.max_draft_len)

        # Add one dummy request with the maximum possible sequence length.
        # The sequence length is limited by both the max_seq_len and the number of available blocks.
        token_num = max(1, min(available_tokens, self.max_seq_len - 1))
        max_seq_len_request = kv_cache_manager.add_dummy_requests(
            request_ids=[batch_size - 1],
            token_nums=[token_num],
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )[0]
        # Add the longest request before all other seq_len=1 request to simulate the padding CUDA graph case.
        # This batch contains both the longest request and the shortest requests,
        # it also contains the maximum number of requests and the maximum token number,
        # which simulates the extreme case for the padding CUDA graph.
        # Thus we can replay this CUDA graph in all other cases.
        requests.insert(0, max_seq_len_request)
        result.generation_requests = requests
        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))
    else:
        result = None
    return result

def get_torch_compile_warmup_request(batch_size, num_tokens_per_request):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size * math.ceil(
            num_tokens_per_request / kv_cache_manager.tokens_per_block):
        # Should only need (at most) one more page per request.
        is_gen = num_tokens_per_request == 1

        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size)), [num_tokens_per_request] *
            batch_size if not is_gen else None,
            is_gen=is_gen,
            max_num_draft_tokens=self.max_draft_len)

        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))

        result = ScheduledRequests()
        result.context_requests = []
        result.generation_requests = []
        if is_gen:
            result.generation_requests = requests
        else:
            result.context_requests = requests
    else:
        result = None
    return result

def get_autotune_warmup_request():
    available_tokens = kv_cache_manager.get_num_available_tokens(
        self.max_draft_len)
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens)
    # Number of tokens required per request must be rounded up to whole number of blocks
    num_tokens_required_per_request = (
        (num_tokens_per_request + kv_cache_manager.tokens_per_block - 1)
        // kv_cache_manager.tokens_per_block
    ) * kv_cache_manager.tokens_per_block

    available_blocks = kv_cache_manager.get_num_free_blocks()

    maximum_tunable_num_tokens = min(
        self.batch_size * num_tokens_per_request, self.max_num_tokens,
        available_blocks * kv_cache_manager.tokens_per_block)

    # Calculate number of full-length requests and remaining tokens
    # Each request has num_tokens_per_request tokens, except possibly the last one
    # Calculations are also limited by how many KV cache blocks are available
    full_len_request_num = min(
        maximum_tunable_num_tokens // num_tokens_per_request,
        max(1, available_tokens // num_tokens_required_per_request))
    remaining_tokens = min(
        maximum_tunable_num_tokens % num_tokens_per_request,
        max(
            0, available_tokens -
            full_len_request_num * num_tokens_required_per_request))

    request_num = full_len_request_num if remaining_tokens == 0 else full_len_request_num + 1

    requests = kv_cache_manager.add_dummy_requests(
        request_ids=list(range(full_len_request_num)),
        token_nums=[num_tokens_per_request] * full_len_request_num,
        is_gen=False,
        max_num_draft_tokens=self.max_draft_len)

    if remaining_tokens > 0:
        final_request = kv_cache_manager.add_dummy_requests(
            request_ids=[full_len_request_num],
            token_nums=[remaining_tokens],
            is_gen=False,
            max_num_draft_tokens=self.max_draft_len)

        requests += final_request

    if spec_resource_manager is not None:
        spec_resource_manager.add_dummy_requests(
            request_ids=list(range(request_num)))

    result = ScheduledRequests()
    result.context_requests = requests
    result.generation_requests = []

    return result


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case

Two concrete issues in the new inner helpers:

  1. Undefined variables (use_mrope, spec_resource_manager)

    • get_cuda_graph_warmup_request and get_torch_compile_warmup_request call kv_cache_manager.add_dummy_requests(..., use_mrope=use_mrope, ...), but use_mrope is never defined in warmup.
    • All three helpers and release_batch reference spec_resource_manager, which is also never defined in warmup.
    • If any of these helpers are called, they will raise a NameError.

    Suggested fix (before the helper definitions in warmup):

    kv_cache_manager = resource_manager.get_resource_manager(self.kv_cache_manager_key)
    if kv_cache_manager is None:
        logger.info("Skipping warm up as no KV Cache manager allocated.")
        return
    
    use_mrope = self.use_mrope
    spec_resource_manager = resource_manager.get_resource_manager(
        ResourceManagerType.SPEC_RESOURCE_MANAGER
    )
    
    # The lifetime of model engine and kv cache manager can be different.
    self.cuda_graph_runner.padding_dummy_request = None
    
    def get_cuda_graph_warmup_request(batch_size):
        ...
  2. Potential division‑by‑zero in get_autotune_warmup_request

    • When kv_cache_manager.get_num_available_tokens(self.max_draft_len) returns 0, num_tokens_per_request becomes 0, leading to maximum_tunable_num_tokens // num_tokens_per_request dividing by zero.
    • This is easy to hit in low‑memory / fully‑utilized KV cache situations.

    Suggested minimal guard inside get_autotune_warmup_request:

    available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
    if available_tokens <= 0:
        return None
    
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens,
    )
    if num_tokens_per_request <= 0:
        return None

Without these fixes, wiring these helpers into the warmup flow will be fragile and can hard‑fail at runtime.

Also applies to: 688-697



---

🏁 Script executed:

```shell
#!/bin/bash
# Extract the full warmup method and surrounding context
fd -t f 'model_engine\.py' | head -1 | xargs cat | head -750 | tail -200
```

Length of output: 9494


🏁 Script executed:

#!/bin/bash
# Search for use_mrope and spec_resource_manager usage in the file
rg -n 'use_mrope|spec_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2

Length of output: 11353


🏁 Script executed:

#!/bin/bash
# Find where warmup method starts and its parameters
rg -n 'def warmup' tensorrt_llm/_torch/pyexecutor/model_engine.py -A30

Length of output: 3293


🏁 Script executed:

#!/bin/bash
# Search for resource_manager.get_resource_manager patterns in the file
rg -n 'resource_manager\.get_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2

Length of output: 3761


🏁 Script executed:

#!/bin/bash
# Look for self.use_mrope references
rg -n 'self\.use_mrope' tensorrt_llm/_torch/pyexecutor/model_engine.py -B1 -A1

Length of output: 1772


Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case

Two concrete issues in the new inner helpers have been verified:

  1. Undefined variables (use_mrope, spec_resource_manager)

    • get_cuda_graph_warmup_request and get_torch_compile_warmup_request use use_mrope at lines 573 and 586, but this variable is never defined in the warmup method scope—it will raise NameError if called.
    • All three helpers and release_batch reference spec_resource_manager at lines 595–597, 616–618, 678–680, and 696–697, also never defined in warmup scope—will raise NameError if called.
    • Elsewhere in the codebase, spec_resource_manager is consistently retrieved via resource_manager.get_resource_manager(ResourceManagerType.SPEC_RESOURCE_MANAGER) (lines 903, 930, 1008, 1085) before use.

    Add these before the helper definitions in warmup:

    use_mrope = self.use_mrope
    spec_resource_manager = resource_manager.get_resource_manager(
        ResourceManagerType.SPEC_RESOURCE_MANAGER
    )
  2. Division-by-zero in get_autotune_warmup_request

    • When kv_cache_manager.get_num_available_tokens(self.max_draft_len) returns 0, num_tokens_per_request becomes 0, causing maximum_tunable_num_tokens // num_tokens_per_request to divide by zero.
    • Add guards:
    available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
    if available_tokens <= 0:
        return None
    
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens)
    if num_tokens_per_request <= 0:
        return None

These helpers will fail at runtime without these fixes.

🧰 Tools
🪛 Ruff (0.14.5)

573-573: Undefined name use_mrope

(F821)


586-586: Undefined name use_mrope

(F821)


595-595: Undefined name spec_resource_manager

(F821)


596-596: Undefined name spec_resource_manager

(F821)


616-616: Undefined name spec_resource_manager

(F821)


617-617: Undefined name spec_resource_manager

(F821)


678-678: Undefined name spec_resource_manager

(F821)


679-679: Undefined name spec_resource_manager

(F821)

…ial reuse in the KV cache manager

`KVCacheTransferManager::onboard` only covers memory movement between
CPU and GPU and not GPU to GPU.

Use the `mBufferManager` to copy block content.

This fixes the incorrect partial block copy functionality exposed
through test case `accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse`

Signed-off-by: eopXD <[email protected]>
std::string const& directory = "");

/*!
* \brief Directly copy a block from gpu to gpu.
Collaborator

Is this asynchronous from CPU's perspective? If so, we should mention that here. Something like:

\note The copy is launched on an internal stream asynchronously and this function may return before the copy is complete.
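For concreteness, the suggested note could be folded into the quoted header comment roughly like this (sketch only; final wording up to the author):

```cpp
/*!
 * \brief Directly copy a block from GPU to GPU.
 *
 * \note The copy is launched on an internal stream asynchronously and this
 *       function may return before the copy is complete.
 */
```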


Labels

KV-Cache Management kv-cache management for efficient LLM inference


4 participants