[https://nvbugs/5625990][fix] Fix block copy from GPU to GPU for partial reuse in the KV cache manager #9295
Conversation
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
…n/TensorRT-LLM into user/tjohnsen/fix_bug_5375590 .
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]> .
Signed-off-by: thorjohnsen <[email protected]> .
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
…nsorRT-LLM into user/tjohnsen/fix_5627710 .
Signed-off-by: thorjohnsen <[email protected]>
…nsorRT-LLM into user/tjohnsen/fix_5627710 .
Signed-off-by: thorjohnsen <[email protected]>
Signed-off-by: thorjohnsen <[email protected]>
Co-authored-by: Robin Kobus <[email protected]> Signed-off-by: Thor Johnsen <[email protected]>
…o try-thor-fix-transfer-manager
…y-thor-fix-transfer-manager
📝 Walkthrough
This PR enhances KV cache transfer synchronization across WindowBlockManager, BlockManager, and BaseKVCacheManager by introducing explicit sync pathways with the buffer manager, adds GPU-to-GPU copy capabilities, refines pending I/O tracking with separate read/write events, integrates synchronization into resource preparation and warmup, and exposes new Python bindings for sync and refresh operations.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant AllocateKvCache
participant KVCacheManager
participant BlockManager
participant TransferManager
participant BufferManager
AllocateKvCache->>KVCacheManager: syncTransferManagerWithBufferManager()
KVCacheManager->>BlockManager: syncTransferManagerWithBufferManager()
BlockManager->>TransferManager: syncWithBufferManager()
TransferManager->>BufferManager: align streams & sync pending
Note over TransferManager: Clear pending reads/writes
rect rgba(100, 200, 100, 0.2)
AllocateKvCache->>KVCacheManager: process context/generation requests
end
```
```mermaid
sequenceDiagram
participant PartialReuseLogic
participant TransferManager
participant GPU
PartialReuseLogic->>PartialReuseLogic: detect partial block reuse
alt Non-leaf/active block
PartialReuseLogic->>TransferManager: syncPendingOnboardTransfers()
TransferManager->>GPU: wait on pending writes/reads
PartialReuseLogic->>TransferManager: copyGPUtoGPU()
TransferManager->>GPU: device-to-device copy
TransferManager->>TransferManager: recordPendingOnboardTransfers()
PartialReuseLogic->>PartialReuseLogic: increment mCopiedReusedPartialBlocks
else Leaf/private reuse
PartialReuseLogic->>PartialReuseLogic: increment mDirectlyReusedPartialBlocks
end
```
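The second diagram maps onto the partial-reuse branch in `loadOrAllocateBlocks`. Below is a minimal C++ sketch of that branch; the condition names and the exact `getFreeBlock` signature are assumptions for illustration, not the PR's verbatim code.
```cpp
// Sketch only: illustrates the ordering shown in the second sequence diagram.
// partialMatch / blockIsSharedOrActive are hypothetical condition names.
if (partialMatch && blockIsSharedOrActive)
{
    // Copy the partially matched block into a fresh primary block so the
    // original block can keep serving other sequences.
    auto newBlock = getFreeBlock(/* ... */);
    // copyGPUtoGPU() waits on pending onboard transfers touching either block,
    // issues a device-to-device copy via the buffer manager, and records new
    // pending events for bookkeeping.
    mTransferManager->copyGPUtoGPU(matchingBlock, newBlock, mPools);
    matchingBlock = newBlock;
    ++mCopiedReusedPartialBlocks;
}
else
{
    // Leaf / private blocks can be reused in place without a copy.
    ++mDirectlyReusedPartialBlocks;
}
onboardBlock(sequence, matchingBlock, mode, directory); // no-op when already primary
```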
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1)
29-64: Also sync the cross KV manager before calling add_sequence.
Line 60 calls `crossKvCacheManager->addSequence(...)`, but we never invoke the new `syncTransferManagerWithBufferManager()` on that manager. The header comment even states this sync must happen before the first `addSequence` in each step. Without it, the cross-cache path can still race its pending transfers, recreating the corruption we’re fixing here. Please sync both managers up front.
```diff
 kvCacheManager.syncTransferManagerWithBufferManager();
+if (crossKvCacheManager)
+{
+    crossKvCacheManager->syncTransferManagerWithBufferManager();
+}
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
562-687: New warmup helper functions are currently unused and duplicate existing mechanisms.
`get_cuda_graph_warmup_request`, `get_torch_compile_warmup_request`, `get_autotune_warmup_request`, and `release_batch` are defined inside `warmup` but never used; the actual warmup path still goes through `_run_torch_compile_warmup`, `_run_autotuner_warmup`, `_run_cuda_graph_warmup`, and `_release_batch_context`.
This adds cognitive overhead and can drift from the logic in `_create_warmup_request` / `_create_cuda_graph_warmup_request` (which already encode more detailed constraints like beam width and fused drafting loops).
Consider either:
- Wiring these helpers into the warmup orchestration and deprecating the older `_create_*` / `_release_batch_context` helpers, or
- Removing these inner helpers for now to avoid dead code and duplication.
Also applies to: 688-699
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1228-1277: Confirm `copyGPUtoGPU` behavior when `matchingBlock` is offloaded (non-primary).
In the partial-reuse path, you now do:
```cpp
auto newBlock = getFreeBlock(...);
mTransferManager->copyGPUtoGPU(matchingBlock, newBlock, mPools);
...
matchingBlock = newBlock;
...
onboardBlock(sequence, matchingBlock, mode, directory); // now a no-op, newBlock is primary
```
`matchingBlock` here can in principle refer to a block that has previously been offloaded to the secondary pool but is still present in the reuse radix tree. If `copyGPUtoGPU` assumes both source and destination blocks are primary/GPU-resident, this would be incorrect for offloaded matches.
Please double-check that either:
- `copyGPUtoGPU` explicitly supports copying from blocks whose `MemoryPoolBlockIndex` currently points to the secondary pool, or
- this path is only ever reached when `matchingBlock->isPrimary()` is true, or
- you add a small guard / early onboarding for non-primary matches before invoking `copyGPUtoGPU`.
The new partial-reuse counters (`mCopiedReusedPartialBlocks` and `mDirectlyReusedPartialBlocks`) match the intended semantics of the two branches.
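A minimal sketch of the third option (the guard), using only identifiers from the excerpt above; the exact `onboardBlock` arguments at this call site are an assumption for illustration:
```cpp
// Sketch only: early-onboard a non-primary match before the device-to-device copy,
// so copyGPUtoGPU only ever sees GPU-resident source blocks.
if (!matchingBlock->isPrimary())
{
    onboardBlock(sequence, matchingBlock, mode, directory); // assumed arguments
}
mTransferManager->copyGPUtoGPU(matchingBlock, newBlock, mPools);
```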
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp (1 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (0 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Applied to files:
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  - `syncTransferManagerWithBufferManager` (1353-1359)
  - `syncTransferManagerWithBufferManager` (1353-1353)
  - `syncTransferManagerWithBufferManager` (1361-1364)
  - `syncTransferManagerWithBufferManager` (1361-1361)

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (2)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  - `syncTransferManagerWithBufferManager` (1979-1982)
  - `refreshBlocks` (1985-1988)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  - `syncTransferManagerWithBufferManager` (1353-1359)
  - `syncTransferManagerWithBufferManager` (1353-1353)
  - `syncTransferManagerWithBufferManager` (1361-1364)
  - `syncTransferManagerWithBufferManager` (1361-1361)
  - `refreshBlocks` (1366-1372)
  - `refreshBlocks` (1366-1366)
  - `refreshBlocks` (1374-1378)
  - `refreshBlocks` (1374-1374)

cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  - `offloadBlock` (1112-1116)
  - `offloadBlock` (1112-1113)
  - `offloadBlock` (1118-1146)
  - `offloadBlock` (1118-1119)

tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
  - `batch_size` (37-38)
  - `ScheduledRequests` (20-41)
  - `all_requests` (40-41)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (9)
  - `get_num_free_blocks` (806-814)
  - `add_dummy_requests` (78-79)
  - `add_dummy_requests` (460-538)
  - `get_num_available_tokens` (819-821)
  - `free_resources` (87-88)
  - `free_resources` (628-630)
  - `free_resources` (1295-1299)
  - `free_resources` (1362-1365)
  - `free_resources` (1490-1491)
- tensorrt_llm/_torch/speculative/mtp.py (2)
  - `add_dummy_requests` (88-90)
  - `free_resources` (81-86)
- tensorrt_llm/_torch/attention_backend/sparse/rocket.py (2)
  - `add_dummy_requests` (914-939)
  - `free_resources` (978-980)
- tensorrt_llm/_torch/speculative/eagle3.py (2)
  - `add_dummy_requests` (96-98)
  - `free_resources` (90-94)

cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h (1)
- cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (14)
  - `syncWithBufferManager` (345-358)
  - `syncWithBufferManager` (345-345)
  - `syncTransfers` (360-373)
  - `syncTransfers` (360-360)
  - `computeBlockPointer` (89-98)
  - `computeBlockPointer` (89-90)
  - `syncPendingOnboardTransfers` (257-280)
  - `syncPendingOnboardTransfers` (257-257)
  - `recordPendingOnboardTransfers` (282-290)
  - `recordPendingOnboardTransfers` (282-282)
  - `copyGPUtoGPU` (101-111)
  - `copyGPUtoGPU` (101-102)
  - `copyGPUtoGPU` (301-307)
  - `copyGPUtoGPU` (301-302)

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  - `syncTransferManagerWithBufferManager` (1979-1982)
  - `refreshBlocks` (1985-1988)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  - `syncTransferManagerWithBufferManager` (1353-1359)
  - `syncTransferManagerWithBufferManager` (1353-1353)
  - `syncTransferManagerWithBufferManager` (1361-1364)
  - `syncTransferManagerWithBufferManager` (1361-1361)
  - `refreshBlocks` (1366-1372)
  - `refreshBlocks` (1366-1366)
  - `refreshBlocks` (1374-1378)
  - `refreshBlocks` (1374-1374)

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  - `syncTransferManagerWithBufferManager` (1979-1982)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/model_engine.py
573-573: Undefined name use_mrope
(F821)
586-586: Undefined name use_mrope
(F821)
595-595: Undefined name spec_resource_manager
(F821)
596-596: Undefined name spec_resource_manager
(F821)
616-616: Undefined name spec_resource_manager
(F821)
617-617: Undefined name spec_resource_manager
(F821)
678-678: Undefined name spec_resource_manager
(F821)
679-679: Undefined name spec_resource_manager
(F821)
696-696: Undefined name spec_resource_manager
(F821)
697-697: Undefined name spec_resource_manager
(F821)
🔇 Additional comments (5)
tests/integration/test_lists/waives.txt (1)
38-444: Verify the rationale and scope of 407 new test waivers.
This change adds a substantial number (407) of new test skip entries spanning unrelated modules (multimodal, performance tests, disaggregated serving, Triton server, etc.). This scale is unexpected for a targeted fix to GPU→GPU KV cache block copy.
A few clarifications needed:
- Relationship to PR #9056: The PR depends on #9056 and will be rebased. Are these waivers required due to changes in that PR, or are they specific to this KV cache fix?
- Scope mismatch: Many waivers reference test modules unrelated to KV cache (e.g., `test_multimodal.py`, `test_perf.py`, `triton_server/`). Why do unrelated components require waiving?
- Temporary vs. persistent: Are these waivers expected to be permanent or temporary blockers that will be resolved before merge?
- Documentation: Consider adding a brief inline comment explaining the rationale for this bulk update (e.g., "Waivers inherited from PR #9056" or "Known transient failures pending infrastructure fixes").
For context, the PR objectives mention the fix is exposed by a single test case (`test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse`), so the extensive waiver list warrants explanation in the commit message or PR description.
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
243-247: New sync/refresh bindings are consistent with BaseKVCacheManager.
The new `syncTransferManagerWithBufferManager` trampoline and the Python bindings for `sync_transfer_manager_with_buffer_manager` / `refresh_blocks` match the C++ interface and existing binding style (pure virtual override + GIL release). No issues from this change.
Also applies to: 493-495
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
238-242: Nanobind trampoline and bindings look correct; double-check the NB_TRAMPOLINE count.
The `syncTransferManagerWithBufferManager` override and the new nanobind methods are consistent with the C++ interface and other bindings. Please just confirm that the `NB_TRAMPOLINE(tbk::BaseKVCacheManager, 30);` count still matches the exact number of pure virtuals after this addition.
Also applies to: 489-491
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
642-682: Partial-reuse counters and logging look consistent.
The new `mCopiedReusedPartialBlocks` / `mDirectlyReusedPartialBlocks` members are initialized in the ctor and used only in `loadOrAllocateBlocks` and the destructor logs, so they’re consistent and side-effect free. No functional concerns from these additions.
Also applies to: 751-773
1353-1364: Sync plumbing between block managers and transfer manager is straightforward.
`BlockManager::syncTransferManagerWithBufferManager()` simply forwards to each `WindowBlockManager`, which in turn calls `mTransferManager->syncWithBufferManager()`. This matches the header-side override and cleanly wires the sync API through the hierarchy. Looks good.
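A minimal sketch of that plumbing; the container name `mWindowBlockManagers` and the loop shape are assumptions, only the forwarding pattern described above is taken from the review.
```cpp
// Sketch only: BlockManager fans the sync out to every per-window manager.
void BlockManager::syncTransferManagerWithBufferManager()
{
    for (auto& [windowSize, windowBlockManager] : mWindowBlockManagers) // assumed member name
    {
        windowBlockManager.syncTransferManagerWithBufferManager();
    }
}

void WindowBlockManager::syncTransferManagerWithBufferManager()
{
    // Align the transfer manager's streams/events with the buffer manager.
    mTransferManager->syncWithBufferManager();
}
```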
```cpp
void KVCacheTransferManager::copyGPUtoGPU(
    BlockPtr const& offloadedBlock, BlockPtr const& block, std::vector<KVCacheBlockPool> const& pools)
{
    syncPendingOnboardTransfers(offloadedBlock, block);
    copyGPUtoGPU(offloadedBlock, block, pools);
    recordPendingOnboardTransfers(offloadedBlock, block);
```
Merge the GPU-to-GPU copy helper instead of redefining it.
We already defined KVCacheTransferManager::copyGPUtoGPU(BlockPtr const&, BlockPtr const&, …) up at Line 100. This second definition has the same signature, so it will not compile (redefinition), and even if it did, the recursive call would never terminate. Fold the sync/record logic into the first implementation (or rename the helper) so there’s exactly one definition that performs the sync, copy, and bookkeeping.
```diff
-void KVCacheTransferManager::copyGPUtoGPU(
-    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
-{
-    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
-    {
-        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
-        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
-        mBufferManager.copy(*srcPtr, *dstPtr);
-    }
-    TLLM_LOG_DEBUG("GPU-to-GPU copy for from block %d to block %d", src->getBlockId(), dst->getBlockId());
-}
+void KVCacheTransferManager::copyGPUtoGPU(
+    BlockPtr const& src, BlockPtr const& dst, std::vector<KVCacheBlockPool> const& pools)
+{
+    syncPendingOnboardTransfers(src, dst);
+    for (size_t poolIdx = 0; poolIdx < pools.size(); ++poolIdx)
+    {
+        auto srcPtr = computeBlockPointer(src, pools, poolIdx);
+        auto dstPtr = computeBlockPointer(dst, pools, poolIdx);
+        mBufferManager.copy(*srcPtr, *dstPtr);
+    }
+    recordPendingOnboardTransfers(src, dst);
+    TLLM_LOG_DEBUG(
+        "GPU-to-GPU copy from block %d to block %d", src->getBlockId(), dst->getBlockId());
+}
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp around lines
301-306, there is a duplicate definition of KVCacheTransferManager::copyGPUtoGPU
with the same signature that simply calls itself (causing
redefinition/recursion). Remove this duplicate function and instead merge the
syncPendingOnboardTransfers(offloadedBlock, block) and
recordPendingOnboardTransfers(offloadedBlock, block) calls into the single
existing copyGPUtoGPU implementation defined earlier (around line 100), or
rename this helper and update all call sites; ensure the final copyGPUtoGPU
performs sync, actual GPU-to-GPU copy, and bookkeeping without calling itself.
```python
def get_cuda_graph_warmup_request(batch_size):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size:
        result = ScheduledRequests()
        result.context_requests = []
        # Add (batch_size - 1) dummy requests with seq_len=1.
        # Should only need one more page per request.
        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size - 1)),
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )
        available_tokens = kv_cache_manager.get_num_available_tokens(
            self.max_draft_len)

        # Add one dummy request with the maximum possible sequence length.
        # The sequence length is limited by both the max_seq_len and the number of available blocks.
        token_num = max(1, min(available_tokens, self.max_seq_len - 1))
        max_seq_len_request = kv_cache_manager.add_dummy_requests(
            request_ids=[batch_size - 1],
            token_nums=[token_num],
            is_gen=True,
            max_num_draft_tokens=self.max_draft_len,
            use_mrope=use_mrope,
        )[0]
        # Add the longest request before all other seq_len=1 request to simulate the padding CUDA graph case.
        # This batch contains both the longest request and the shortest requests,
        # it also contains the maximum number of requests and the maximum token number,
        # which simulates the extreme case for the padding CUDA graph.
        # Thus we can replay this CUDA graph in all other cases.
        requests.insert(0, max_seq_len_request)
        result.generation_requests = requests
        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))
    else:
        result = None
    return result

def get_torch_compile_warmup_request(batch_size,
                                     num_tokens_per_request):
    available_blocks = kv_cache_manager.get_num_free_blocks()
    if available_blocks >= batch_size * math.ceil(
            num_tokens_per_request / kv_cache_manager.tokens_per_block):
        # Should only need (at most) one more page per request.
        is_gen = num_tokens_per_request == 1

        requests = kv_cache_manager.add_dummy_requests(
            list(range(batch_size)), [num_tokens_per_request] *
            batch_size if not is_gen else None,
            is_gen=is_gen,
            max_num_draft_tokens=self.max_draft_len)

        if spec_resource_manager is not None:
            spec_resource_manager.add_dummy_requests(
                request_ids=list(range(batch_size)))

        result = ScheduledRequests()
        result.context_requests = []
        result.generation_requests = []
        if is_gen:
            result.generation_requests = requests
        else:
            result.context_requests = requests
    else:
        result = None
    return result

def get_autotune_warmup_request():
    available_tokens = kv_cache_manager.get_num_available_tokens(
        self.max_draft_len)
    num_tokens_per_request = min(
        min(available_tokens, self.max_seq_len - 1),
        self.max_num_tokens)
    # Number of tokens required per request must be rounded up to whole number of blocks
    num_tokens_required_per_request = (
        (num_tokens_per_request + kv_cache_manager.tokens_per_block - 1)
        // kv_cache_manager.tokens_per_block
    ) * kv_cache_manager.tokens_per_block

    available_blocks = kv_cache_manager.get_num_free_blocks()

    maximum_tunable_num_tokens = min(
        self.batch_size * num_tokens_per_request, self.max_num_tokens,
        available_blocks * kv_cache_manager.tokens_per_block)

    # Calculate number of full-length requests and remaining tokens
    # Each request has num_tokens_per_request tokens, except possibly the last one
    # Calculations are also limited by how many KV cache blocks are available
    full_len_request_num = min(
        maximum_tunable_num_tokens // num_tokens_per_request,
        max(1, available_tokens // num_tokens_required_per_request))
    remaining_tokens = min(
        maximum_tunable_num_tokens % num_tokens_per_request,
        max(
            0, available_tokens -
            full_len_request_num * num_tokens_required_per_request))

    request_num = full_len_request_num if remaining_tokens == 0 else full_len_request_num + 1

    requests = kv_cache_manager.add_dummy_requests(
        request_ids=list(range(full_len_request_num)),
        token_nums=[num_tokens_per_request] * full_len_request_num,
        is_gen=False,
        max_num_draft_tokens=self.max_draft_len)

    if remaining_tokens > 0:
        final_request = kv_cache_manager.add_dummy_requests(
            request_ids=[full_len_request_num],
            token_nums=[remaining_tokens],
            is_gen=False,
            max_num_draft_tokens=self.max_draft_len)

        requests += final_request

    if spec_resource_manager is not None:
        spec_resource_manager.add_dummy_requests(
            request_ids=list(range(request_num)))

    result = ScheduledRequests()
    result.context_requests = requests
    result.generation_requests = []

    return result
```
🧩 Analysis chain
Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case
Two concrete issues in the new inner helpers:
- Undefined variables (`use_mrope`, `spec_resource_manager`)
  - `get_cuda_graph_warmup_request` and `get_torch_compile_warmup_request` call `kv_cache_manager.add_dummy_requests(..., use_mrope=use_mrope, ...)`, but `use_mrope` is never defined in `warmup`.
  - All three helpers and `release_batch` reference `spec_resource_manager`, which is also never defined in `warmup`.
  - If any of these helpers are called, they will raise a `NameError`.

  Suggested fix (before the helper definitions in `warmup`):
  ```python
  kv_cache_manager = resource_manager.get_resource_manager(self.kv_cache_manager_key)
  if kv_cache_manager is None:
      logger.info("Skipping warm up as no KV Cache manager allocated.")
      return

  use_mrope = self.use_mrope
  spec_resource_manager = resource_manager.get_resource_manager(
      ResourceManagerType.SPEC_RESOURCE_MANAGER
  )

  # The lifetime of model engine and kv cache manager can be different.
  self.cuda_graph_runner.padding_dummy_request = None

  def get_cuda_graph_warmup_request(batch_size):
      ...
  ```
- Potential division-by-zero in `get_autotune_warmup_request`
  - When `kv_cache_manager.get_num_available_tokens(self.max_draft_len)` returns `0`, `num_tokens_per_request` becomes `0`, leading to `maximum_tunable_num_tokens // num_tokens_per_request` dividing by zero.
  - This is easy to hit in low-memory / fully-utilized KV cache situations.

  Suggested minimal guard inside `get_autotune_warmup_request`:
  ```python
  available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
  if available_tokens <= 0:
      return None
  num_tokens_per_request = min(
      min(available_tokens, self.max_seq_len - 1),
      self.max_num_tokens,
  )
  if num_tokens_per_request <= 0:
      return None
  ```
Also applies to: 688-697
---
🏁 Script executed:
```shell
#!/bin/bash
# Extract the full warmup method and surrounding context
fd -t f 'model_engine\.py' | head -1 | xargs cat | head -750 | tail -200
```
Length of output: 9494
🏁 Script executed:
```shell
#!/bin/bash
# Search for use_mrope and spec_resource_manager usage in the file
rg -n 'use_mrope|spec_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2
```
Length of output: 11353
🏁 Script executed:
```shell
#!/bin/bash
# Find where warmup method starts and its parameters
rg -n 'def warmup' tensorrt_llm/_torch/pyexecutor/model_engine.py -A30
```
Length of output: 3293
🏁 Script executed:
```shell
#!/bin/bash
# Search for resource_manager.get_resource_manager patterns in the file
rg -n 'resource_manager\.get_resource_manager' tensorrt_llm/_torch/pyexecutor/model_engine.py -B2 -A2
```
Length of output: 3761
🏁 Script executed:
```shell
#!/bin/bash
# Look for self.use_mrope references
rg -n 'self\.use_mrope' tensorrt_llm/_torch/pyexecutor/model_engine.py -B1 -A1
```
Length of output: 1772
Inner warmup helpers use undefined names and have an unhandled zero‑tokens edge case
Two concrete issues in the new inner helpers have been verified:
- Undefined variables (`use_mrope`, `spec_resource_manager`)
  - `get_cuda_graph_warmup_request` and `get_torch_compile_warmup_request` use `use_mrope` at lines 573 and 586, but this variable is never defined in the `warmup` method scope; it will raise `NameError` if called.
  - All three helpers and `release_batch` reference `spec_resource_manager` at lines 595-597, 616-618, 678-680, and 696-697, also never defined in `warmup` scope; this will likewise raise `NameError` if called.
  - Elsewhere in the codebase, `spec_resource_manager` is consistently retrieved via `resource_manager.get_resource_manager(ResourceManagerType.SPEC_RESOURCE_MANAGER)` (lines 903, 930, 1008, 1085) before use.

  Add these before the helper definitions in `warmup`:
  ```python
  use_mrope = self.use_mrope
  spec_resource_manager = resource_manager.get_resource_manager(
      ResourceManagerType.SPEC_RESOURCE_MANAGER
  )
  ```
- Division-by-zero in `get_autotune_warmup_request`
  - When `kv_cache_manager.get_num_available_tokens(self.max_draft_len)` returns 0, `num_tokens_per_request` becomes 0, causing `maximum_tunable_num_tokens // num_tokens_per_request` to divide by zero.
  - Add guards:
  ```python
  available_tokens = kv_cache_manager.get_num_available_tokens(self.max_draft_len)
  if available_tokens <= 0:
      return None
  num_tokens_per_request = min(
      min(available_tokens, self.max_seq_len - 1),
      self.max_num_tokens)
  if num_tokens_per_request <= 0:
      return None
  ```
…ial reuse in the KV cache manager
`KVCacheTransferManager::onboard` only covers memory movement between CPU and GPU and not GPU to GPU. Use the `mBufferManager` to copy block content. This fixes the incorrect partial block copy functionality exposed through test case `accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse`
Signed-off-by: eopXD <[email protected]>
```cpp
        std::string const& directory = "");

    /*!
     * \brief Directly copy a block from gpu to gpu.
```
Is this asynchronous from CPU's perspective? If so, we should mention that here. Something like:
\note The copy is launched on an internal stream asynchronously and this function may return before the copy is complete.
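Applied to the declaration excerpt above, the suggested wording might look roughly like this; the parameter list of `copyGPUtoGPU` is taken from the review discussion, not verified against the final header.
```cpp
    /*!
     * \brief Directly copy a block from gpu to gpu.
     *
     * \note The copy is launched on an internal stream asynchronously and this
     *       function may return before the copy is complete.
     */
    // Signature assumed from the review comments elsewhere in this thread.
    void copyGPUtoGPU(BlockPtr const& src, BlockPtr const& dst,
        std::vector<KVCacheBlockPool> const& pools);
```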
NOTE: This PR is currently based on and depends on #9056. It will be rebased and re-organized after #9056 lands.
Summary by CodeRabbit
Bug Fixes
Refactor
Description
`KVCacheTransferManager::onboard` only covers memory movement between CPU and GPU and not GPU to GPU. Use the `mBufferManager` to correctly copy block content.
Test Coverage
This fixes the incorrect partial block copy functionality exposed through test case `accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse_low_memory_available_partial_reuse`.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.
See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`
Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`
Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.