[TRTLLM-8540][feat] Add support for disagg in DSv3.2 #8735

Tabrizian · 2025-10-28T16:55:59Z

Summary by CodeRabbit

New Features
- Added Indexer KCache support for optimized key-value cache management
- Enabled configurable indexer KCache block size and dimension settings
- Enhanced multi-GPU cache transfer capabilities with indexer KCache option
- Improved MLA (Multi-Head Latent Attention) cache handling with new transfer paths

gsm8k accuracy for disagg:

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	95.7544	±	0.5554
		strict-match	5	exact_match	↑	95.6785	±	0.5601

[10/30/2025-21:00:20] [TRT-LLM] [I] lm-eval gsm8k average accuracy: 95.72
[10/30/2025-21:00:20] [TRT-LLM] [I] Hypothesis testing report:

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gpqa_diamond_cot_zeroshot_aa	1	strict-match	0	exact_match	↑	79.798	±	2.8606

[11/04/2025-23:53:54] [TRT-LLM] [I] lm-eval gpqa_diamond_cot_zeroshot_aa average accuracy: 79.80

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Tabrizian · 2025-10-31T22:14:52Z

/bot run

Tabrizian · 2025-10-31T22:19:56Z

/bot run --disable-fail-fast

coderabbitai · 2025-10-31T22:20:18Z

Walkthrough

This PR introduces comprehensive support for an "Indexer K-Cache" feature across the TensorRT-LLM batch manager and executor. Changes include exposing new indexer K-cache configuration accessors through the manager class hierarchy, extending cache state to track indexer K-cache settings, modifying cache transfer buffer management to support indexer pools, and updating cache split/concat operations with isIndexerKCache parameters. Serialization of cache state is extended to persist these new fields, and MLA cache formatting is refactored to handle multiple transfer buffer managers.

Changes

Cohort / File(s)	Summary
Indexer K-Cache Configuration Accessors `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`	Added three new public accessor methods (`isEnableIndexerKCache()`, `getIndexerKCacheQuantBlockSize()`, `getIndexerKCacheIndexHeadDim()`) to WindowBlockManager, BlockManager, and KVCacheManager classes; added corresponding pure virtual methods to BaseKVCacheManager base class to standardize queries across the hierarchy.
Cache State Extension `cpp/include/tensorrt_llm/executor/dataTransceiverState.h`	Extended CacheState with three new configuration fields: `hasIndexerKCache` (bool), `indexerDimPerHead` (SizeType32), and `indexerKCacheQuantBlockSize` (SizeType32, default 128); updated three constructors to accept these parameters and added corresponding getter methods.
Cache Utilities with Indexer Pool Support `cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h`	Modified BlockRange to support optional indexer K-cache pool selection via updated `getBlockRangeForWindow()` with new `useIndexerKCache` parameter; updated pool count queries to use explicit flags; added `mIndexerKCachePool` member initialization.
Cache Transfer Buffer Configuration `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h`, `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp`	Extended CacheTransBufferManager constructor with `transferIndexerKCache` boolean parameter; added data type selection logic conditional on this flag; added public `getMaxNumTokens()` accessor.
Cache Manager Integration `cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp`, `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`	Updated `getNumPools()` calls to use explicit boolean flags; modified MLACacheFormatter construction to accept vector of CacheTransBufferManager pointers instead of single pointer; updated CacheState instantiation to pass indexer K-cache parameters.
MLA Cache Formatting Refactor `cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h`, `cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp`	Changed MLACacheFormatter constructor signature from single `CacheTransBufferManager` to `std::vector<CacheTransBufferManager>`; refactored transfer logic to support per-buffer-manager paths with dynamic buffer allocation, zero-copy handling, and per-transferer timing measurements.
Cache Split/Concat Operations `cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h`, `cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu`	Added optional `isIndexerKCache` parameter (default false) to `splitKVCacheDispatch()`, `concatKvCacheV2Dispatch()`, and related functions; conditional data type handling switches to UINT8 for indexer caches; dynamic per-head dimension adjustment for indexer-specific formats.
Serialization Support `cpp/tensorrt_llm/executor/serialization.cpp`	Extended CacheState serialization and deserialization to handle three new fields: `hasIndexerKCache`, `indexerDimPerHead`, and `indexerKCacheQuantBlockSize`; updated serialized size calculation and constructor invocation.
Comprehensive Test Updates `cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp`	Added three new test parameters (`isIndexerKCache`, `indexerDimPerHead`, `indexerKCacheQuantBlockSize`) to test tuple; extended `setUpCacheManager()` and helper methods (`fillBlockData`, `verifyBlockData`, `generateExpectedValue`) to handle indexer K-cache scenarios; expanded test instantiation macros to cover indexer K-cache combinations.
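
To make the Cache State Extension and Serialization Support rows concrete, here is a minimal sketch of how three such fields could round-trip through a byte buffer. `IndexerKCacheConfig`, `serialize`, `deserialize`, and `serializedSize` are hypothetical names used for illustration only; this is not the PR's actual `CacheState` or serialization code. `SizeType32` is assumed to be a 32-bit signed integer, and the layout uses host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: SizeType32 is modelled here as a 32-bit signed integer.
using SizeType32 = std::int32_t;

// Hypothetical stand-in for the three fields the PR adds to CacheState.
struct IndexerKCacheConfig
{
    bool hasIndexerKCache{false};
    SizeType32 indexerDimPerHead{0};
    SizeType32 indexerKCacheQuantBlockSize{128}; // default taken from the walkthrough

    bool operator==(IndexerKCacheConfig const& other) const
    {
        return hasIndexerKCache == other.hasIndexerKCache
            && indexerDimPerHead == other.indexerDimPerHead
            && indexerKCacheQuantBlockSize == other.indexerKCacheQuantBlockSize;
    }
};

// One byte for the flag plus two 32-bit integers.
constexpr std::size_t serializedSize()
{
    return sizeof(std::uint8_t) + 2 * sizeof(SizeType32);
}

std::vector<std::uint8_t> serialize(IndexerKCacheConfig const& config)
{
    std::vector<std::uint8_t> bytes(serializedSize());
    std::size_t offset = 0;
    bytes[offset] = config.hasIndexerKCache ? 1 : 0;
    offset += sizeof(std::uint8_t);
    std::memcpy(bytes.data() + offset, &config.indexerDimPerHead, sizeof(SizeType32));
    offset += sizeof(SizeType32);
    std::memcpy(bytes.data() + offset, &config.indexerKCacheQuantBlockSize, sizeof(SizeType32));
    return bytes;
}

IndexerKCacheConfig deserialize(std::vector<std::uint8_t> const& bytes)
{
    // The reader must consume exactly what the writer produced.
    assert(bytes.size() == serializedSize());
    IndexerKCacheConfig config;
    std::size_t offset = 0;
    config.hasIndexerKCache = bytes[offset] != 0;
    offset += sizeof(std::uint8_t);
    std::memcpy(&config.indexerDimPerHead, bytes.data() + offset, sizeof(SizeType32));
    offset += sizeof(SizeType32);
    std::memcpy(&config.indexerKCacheQuantBlockSize, bytes.data() + offset, sizeof(SizeType32));
    return config;
}
```

The point of the sketch is that the size calculation, the writer, and the reader must all be updated together when fields are added; a mismatch in any one of them would desynchronize the context and generation sides of a disaggregated deployment.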

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Setup
    participant CacheMgr as CacheManager
    participant KVState as CacheState
    participant MLAFormatter as MLACacheFormatter
    participant TransBuffer as CacheTransBufferManager

    Test->>CacheMgr: Initialize with indexerKCache config
    CacheMgr->>KVState: Create CacheState(hasIndexerKCache, indexerDimPerHead, ...)
    KVState-->>CacheMgr: Store configuration
    
    Test->>MLAFormatter: Create MLACacheFormatter(cacheManager, vector<TransBuffer*>)
    Note over MLAFormatter: Multiple buffers for primary + indexer paths
    
    MLAFormatter->>TransBuffer: Initialize primary buffer (transferIndexerKCache=false)
    MLAFormatter->>TransBuffer: Initialize indexer buffer (transferIndexerKCache=true)
    
    Test->>MLAFormatter: Transfer cache
    alt Use IndexerKCache Path
        MLAFormatter->>TransBuffer: getOrAllocateRecvBuffers (indexer buffer)
        TransBuffer-->>MLAFormatter: Buffer handles (UINT8 dtype)
    else Use Primary Path
        MLAFormatter->>TransBuffer: getOrAllocateRecvBuffers (primary buffer)
        TransBuffer-->>MLAFormatter: Buffer handles (original dtype)
    end
    
    MLAFormatter->>CacheMgr: Query isEnableIndexerKCache(), getIndexerKCacheQuantBlockSize()
    CacheMgr-->>MLAFormatter: Configuration flags
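
The primary/indexer branching in the diagram can be sketched as follows. This is a simplified illustration, not the PR's implementation: `TransBufferManagerSketch` and `selectManager` are hypothetical names, and the only behavior taken from the walkthrough is that the indexer path uses a UINT8 buffer while the primary path keeps the cache's original data type.

```cpp
#include <cassert>
#include <vector>

enum class DataType
{
    kHALF,
    kUINT8
};

// Hypothetical stand-in for a transfer buffer manager that is configured
// for either the primary KV cache or the indexer K-cache.
class TransBufferManagerSketch
{
public:
    TransBufferManagerSketch(DataType cacheDataType, bool transferIndexerKCache)
        : mTransferIndexerKCache{transferIndexerKCache}
        , mBufferDataType{transferIndexerKCache ? DataType::kUINT8 : cacheDataType}
    {
    }

    bool isIndexerPath() const
    {
        return mTransferIndexerKCache;
    }

    DataType getBufferDataType() const
    {
        return mBufferDataType;
    }

private:
    bool mTransferIndexerKCache;
    DataType mBufferDataType;
};

// Returns the first manager configured for the requested path,
// or nullptr if no such manager was registered.
TransBufferManagerSketch* selectManager(
    std::vector<TransBufferManagerSketch*> const& managers, bool useIndexerKCache)
{
    assert(!managers.empty());
    for (auto* manager : managers)
    {
        if (manager != nullptr && manager->isIndexerPath() == useIndexerKCache)
        {
            return manager;
        }
    }
    return nullptr;
}
```

Holding the managers in a vector, as the refactored formatter constructor does, lets a model without an indexer K-cache register only the primary manager while the transfer loop stays the same.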

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

mlaCacheFormatter.cpp: Significant refactoring with per-buffer-manager logic paths, zero-copy handling, and dynamic buffer management; requires careful tracing of data flow across multiple transfer scenarios
cacheSplitConcat.cu: Complex conditional data type handling and per-head dimension adjustments based on isIndexerKCache flag; multiple function signature changes propagating through call chain
cacheTransceiverTest.cpp: Extensive test parameter matrix expansion with new indexer K-cache combinations; helper method updates affect multiple test paths
kvCacheUtils.h: Subtle changes to pool selection logic with new conditional branching for indexer cache pools
Interconnected parameter threading: New parameters propagate through multiple abstraction layers (CacheState → MLACacheFormatter → CacheTransBufferManager), requiring verification of correct plumbing across all call sites

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.16% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	⚠️ Warning	PR description lacks detailed explanation of changes, implementation approach, and test coverage validation despite extensive code modifications.	Add clear description of what IndexerKCache feature does, why it's needed for disagg in DSv3.2, specific test results for this feature, and link to relevant issue/ticket.
Title Check	❓ Inconclusive	The title "[TRTLLM-8540][feat] Add support for disagg in DSv3.2" follows the required format with a valid JIRA ticket and type indicator, but it is vague and high-level. The changes consist primarily of adding indexer KCache support infrastructure across multiple cache manager classes and related components, a specific implementation detail that is essential for understanding the changeset. The title describes the end goal (disagg support) rather than the primary technical mechanism (indexer KCache), which makes it less informative for developers scanning commit history.

tensorrt-cicd · 2025-10-31T22:20:53Z

PR_Github #23228 [ run ] triggered by Bot. Commit: 864914e

tensorrt-cicd · 2025-10-31T22:26:07Z

PR_Github #23229 [ run ] triggered by Bot. Commit: 864914e

tensorrt-cicd · 2025-10-31T22:26:09Z

PR_Github #23228 [ run ] completed with state ABORTED. Commit: 864914e

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (1)
1060-1073: Variable shadowing bug: conditional data type assignment is overwritten.

Lines 1060-1068 conditionally set cacheDataType based on isIndexerKCache, but line 1073 shadows this variable by declaring a new local auto cacheDataType, effectively discarding the conditional logic. This will cause incorrect data type handling when isIndexerKCache is true.

Apply this diff to fix the shadowing:
     for (auto const& [window, blocks] : kVCacheBlocksPerWindow)
     {
         auto cacheBlockSize = blocks.front()->getSize();
-        auto cacheDataType = blocks.front()->getDataType();
+        auto blockDataType = blocks.front()->getDataType();
         windowSizes.push_back(window);
Then update the per-block validation to compare against the renamed blockDataType, leaving the outer cacheDataType visible to the rest of the function:
         for (auto&& kvCacheBlock : blocks)
         {
-            TLLM_CHECK(kvCacheBlock->getDataType() == cacheDataType);
+            TLLM_CHECK(kvCacheBlock->getDataType() == blockDataType);
             TLLM_CHECK(kvCacheBlock->getSize() == cacheBlockSize);
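
For readers unfamiliar with the failure mode, the following self-contained sketch reproduces the general C++ shadowing behavior described above. It is not the code from cacheSplitConcat.cu: the `DataType` enum and both function names are hypothetical, and the loop body is reduced to the single read that matters.

```cpp
#include <cassert>
#include <vector>

enum class DataType
{
    kHALF,
    kUINT8
};

// Buggy shape: the inner declaration hides the outer cacheDataType inside the loop.
// Precondition: blockTypes is non-empty.
DataType typeUsedInLoopShadowed(bool isIndexerKCache, std::vector<DataType> const& blockTypes)
{
    assert(!blockTypes.empty());
    DataType cacheDataType = isIndexerKCache ? DataType::kUINT8 : blockTypes.front();
    DataType used = cacheDataType;
    for (auto const& blockType : blockTypes)
    {
        auto cacheDataType = blockType; // shadows the outer variable
        used = cacheDataType;           // reads the inner (per-block) type
    }
    return used;
}

// Fixed shape: the inner variable has its own name, so the outer choice stays visible.
// Precondition: blockTypes is non-empty.
DataType typeUsedInLoopFixed(bool isIndexerKCache, std::vector<DataType> const& blockTypes)
{
    assert(!blockTypes.empty());
    DataType cacheDataType = isIndexerKCache ? DataType::kUINT8 : blockTypes.front();
    DataType used = cacheDataType;
    for (auto const& blockType : blockTypes)
    {
        auto blockDataType = blockType;   // distinct name, nothing is hidden
        static_cast<void>(blockDataType); // would feed the per-block validation
        used = cacheDataType;             // reads the outer, conditionally chosen type
    }
    return used;
}
```

With `isIndexerKCache` set and blocks stored as `kHALF`, the shadowed version reports `kHALF` while the fixed version reports `kUINT8`. GCC and Clang can flag this pattern at compile time with `-Wshadow`.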

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0dc746 and 864914e.

📒 Files selected for processing (13)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4 hunks)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (5 hunks)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1 hunks)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (3 hunks)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1 hunks)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4 hunks)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h (2 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (11 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
cpp/tensorrt_llm/executor/serialization.cpp (3 hunks)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (24 hunks)

🧰 Additional context used

📓 Path-based instructions (7)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}