
Conversation

@XucSh (Owner) commented Dec 18, 2025

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced zero-copy tensor retrieval operations for efficient direct buffer-based tensor access
    • Added tensor parallelism support for distributed tensor operations across multiple ranks with dedicated API variants
    • New operations enable both single and batch tensor retrieval workflows
  • Documentation

    • Updated Python API reference with comprehensive documentation for zero-copy tensor operations and tensor parallelism features


zxpdemonio and others added 24 commits December 17, 2025 17:10
Signed-off-by: Cruz Zhao <[email protected]>
@coderabbitai (bot) commented Dec 18, 2025

Walkthrough

Introduces zero-copy tensor operations for PyTorch models via new APIs on MooncakeStore. Adds get_tensor_into, batch_get_tensor_into, and their tensor-parallelism (TP) variants to retrieve tensors directly into pre-allocated buffers. Includes supporting utilities for non-owning numpy array views and comprehensive test coverage.
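For orientation, here is a minimal usage sketch of the zero-copy path. Only `get_tensor_into(key, buffer_ptr, size)` and the register/unregister steps come from this PR; the store handle, the buffer-allocation approach, and the exact `register_buffer`/`unregister_buffer` signatures are assumptions for illustration.

```python
# Hypothetical sketch of the zero-copy retrieval flow described above.
# `store` is assumed to be an already-initialized Mooncake store client;
# register_buffer/unregister_buffer signatures are assumptions.
import ctypes

buf_size = 64 * 1024 * 1024                  # 64 MiB scratch buffer
buf = ctypes.create_string_buffer(buf_size)  # pre-allocated host memory
buf_ptr = ctypes.addressof(buf)

store.register_buffer(buf_ptr, buf_size)     # buffer must be registered first
tensor = store.get_tensor_into("model.layer0.weight", buf_ptr, buf_size)
if tensor is not None:
    print(tensor.shape, tensor.dtype)        # tensor is a view over buf (zero-copy)
store.unregister_buffer(buf_ptr)             # only after the tensor view is dropped
```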

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Zero-copy tensor API implementation**<br>`mooncake-integration/store/store_py.cpp` | Introduces four new Python-exposed APIs: `get_tensor_into()`, `batch_get_tensor_into()`, `get_tensor_into_with_tp()`, and `batch_get_tensor_into_with_tp()`. Each handles direct buffer-based tensor retrieval with GIL management, metadata parsing, dtype conversion, and optional TP-aware keying for distributed tensor shards. |
| **Non-owning array view utilities**<br>`mooncake-integration/integration_utils.h` | Adds the `create_typed_array_view()` function template and the `array_creators_view` array (size 16) to create non-owning numpy array views across 15 dtype variants, complementing the existing owning-buffer creators. |
| **API documentation**<br>`docs/source/python-api-reference/mooncake-store.md` | Documents the new zero-copy tensor operations, including parameter signatures, TP variants (`get_tensor_into_with_tp`, `batch_get_tensor_into_with_tp`), buffer pointers, sizes, and the `tp_rank`, `tp_size`, and `split_dim` parameters. |
| **Test suite and validation**<br>`scripts/test_tensor_api.py` | Expands memory constants (to 16/8 GiB), adds `DTYPE_MAP`, `verify_tensor_equality()`, and `parse_global_segment_size()` utilities. Introduces extensive zero-copy tests: buffer registration/unregistration, batch operations, TP reconstruction, concurrency stress tests, and performance benchmarks. |
| **Formatting corrections**<br>`mooncake-store/include/dummy_client.h`, `mooncake-store/src/dummy_client.cpp` | End-of-file newline formatting fixes; no semantic changes. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Store as MooncakeStore
    participant Buf as Pre-allocated Buffer
    participant PyT as PyTorch Tensor

    App->>App: Allocate buffer (via register_buffer)
    App->>Store: get_tensor_into(key, buffer_ptr, size)
    Note over Store: Parse metadata, resolve dtype
    
    alt TP Mode (tp_size > 1)
        Store->>Store: Transform key → key_tp_rank
        Note over Store: Fetch tensor shard for rank
    else Standard Mode
        Note over Store: Fetch full tensor
    end
    
    Store->>Buf: Copy/stream tensor data into buffer
    Store->>Store: Apply dtype view (if BFLOAT16/FLOAT8)
    Store->>PyT: Reconstruct PyTorch tensor from buffer view
    PyT-->>App: Return tensor (zero-copy reference)
    
    Note over App,PyT: Batch variant: repeat for each key/buffer pair<br/>TP batch: shard keys per rank, fetch in parallel

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • store_py.cpp: Dense C++ implementation requiring careful review of GIL acquire/release patterns around I/O operations (a minimal sketch of this pattern follows this list), TP keying logic via get_tp_key_name(), dtype-specific view handling (BFLOAT16, FLOAT8 variants), and error aggregation in batch paths.
  • test_tensor_api.py: Large test suite addition with extensive coverage of buffer lifecycle, TP reconstruction, batch operations, and concurrency scenarios—requires tracing through multiple test paths.
  • integration_utils.h: New non-owning array creator variant; verify consistency with existing array_creators patterns and correct dtype indexing across both arrays.
  • Heterogeneous changes: Mix of C++ core logic, C++ utilities, Python bindings, tests, and documentation increases cognitive load due to varied contexts.
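For readers unfamiliar with the GIL discipline flagged in the first bullet, a minimal pybind11 sketch of the pattern under review follows. Everything except the pybind11 scope guards is an illustrative placeholder; `do_blocking_io()` stands in for the store's blocking `get_into()` call.

```cpp
// Minimal pybind11 GIL pattern of the kind reviewed in store_py.cpp.
// do_blocking_io() is a placeholder for the store's blocking get_into() call.
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

static long do_blocking_io(const std::string &key) {
    return static_cast<long>(key.size());  // stand-in for real I/O
}

py::object get_tensor_like(const std::string &key) {
    long total_length;
    {
        // Release the GIL so other Python threads can run during C++ I/O;
        // no Python API may be touched inside this scope.
        py::gil_scoped_release release;
        total_length = do_blocking_io(key);
    }  // GIL is re-acquired automatically when `release` goes out of scope
    // The GIL is held again here, so an explicit py::gil_scoped_acquire
    // would be redundant (the pattern flagged at line 453 below).
    if (total_length <= 0) {
        return py::none();
    }
    return py::int_(total_length);
}
```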

Poem

🐰 Zero-copy whispers, through buffers we glide,
TP shards align, no data slides,
Tensors dance swift, in pre-blessed space,
Fast tracks for models, we've won the race! 🚀✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check (⚠️ Warning): The pull request description is entirely empty, containing only the template placeholder comments with no content filled in and no type of change selected. Resolution: complete the description by (1) filling in the Description section with details about the zero-copy tensor APIs and TP support, (2) selecting the appropriate Type of Change (likely "New feature" under "Mooncake Store" or "Integration"), and (3) describing the testing performed for these changes.
  • Docstring coverage (⚠️ Warning): Docstring coverage is 56.67%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title "Pr1192 test" is vague and non-descriptive, using a ticket-number reference and a generic term that convey no meaningful information about the changeset. Resolution: replace the title with a clear, specific description of the main change, for example "Add zero-copy tensor APIs and TP variants to Mooncake Store" or "Implement get_tensor_into and batch operations with tensor parallelism support".
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr1192

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (bot) left a comment

Actionable comments posted: 10

🧹 Nitpick comments (4)
scripts/test_tensor_api.py (2)

45-92: Unused rtol and atol parameters.

These parameters are declared but never used. The function uses np.array_equal for exact comparison. If approximate comparison was intended, consider using np.allclose(orig_np, recv_np, rtol=rtol, atol=atol) or remove the unused parameters.

🔎 Option 1: Remove unused parameters
-def verify_tensor_equality(original, received, rtol=0, atol=0, verbose=True):
+def verify_tensor_equality(original, received, verbose=True):
🔎 Option 2: Use parameters for approximate comparison
-    if np.array_equal(orig_np, recv_np):
-
-        return True
+    if rtol == 0 and atol == 0:
+        if np.array_equal(orig_np, recv_np):
+            return True
+    else:
+        if np.allclose(orig_np, recv_np, rtol=rtol, atol=atol):
+            return True

29-43: DTYPE_MAP appears unused in the changed code.

If this is for future use, consider adding a comment explaining its purpose. Otherwise, it can be removed to avoid dead code.

mooncake-integration/store/store_py.cpp (2)

451-454: Redundant GIL acquire.

At line 453, py::gil_scoped_acquire acquire_gil; is called, but the GIL is already held at this point—the py::gil_scoped_release scope ended at line 449. This is harmless but adds confusion.

🔎 Suggested fix:
         py::list results_list;
         try {
-            py::gil_scoped_acquire acquire_gil;
             auto torch = torch_module();

549-573: Unused split_dim parameter.

The split_dim parameter is accepted but never used in this method. The existing get_tensor_with_tp (line 267) also has split_dim unused, so this may be intentional for API symmetry, but it could mislead users into thinking it affects behavior.

Consider either:

  1. Adding a comment explaining it's reserved for future use / API consistency
  2. Removing it if not needed
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31c45e4 and 40f1bc6.

📒 Files selected for processing (6)
  • docs/source/python-api-reference/mooncake-store.md (1 hunks)
  • mooncake-integration/integration_utils.h (1 hunks)
  • mooncake-integration/store/store_py.cpp (2 hunks)
  • mooncake-store/include/dummy_client.h (1 hunks)
  • mooncake-store/src/dummy_client.cpp (1 hunks)
  • scripts/test_tensor_api.py (13 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/source/python-api-reference/mooncake-store.md

1023-1025, 1029, 1041-1043, 1047, 1059-1064, 1068, 1080-1084, 1088: Unordered list indentation. Expected: 0; Actual: 2 (MD007, ul-indent)

🪛 Ruff (0.14.8)
scripts/test_tensor_api.py

45-45: Unused function argument: rtol (ARG001)
45-45: Unused function argument: atol (ARG001)
57-57: Avoid specifying long messages outside the exception class (TRY003)
62-62: Do not catch blind exception: Exception (BLE001)
444-444: f-string without any placeholders; remove extraneous f prefix (F541)
448-448: Loop control variable large_buffer not used within loop body (B007)
448-448: Loop control variable size not used within loop body; rename unused size to _size (B007)
527-527: Loop control variable large_buffer not used within loop body (B007)
527-527: Loop control variable size not used within loop body; rename unused size to _size (B007)
639-639: Loop control variable i not used within loop body (B007)
709-709: Loop control variable i not used within loop body (B007)
950-950: Do not catch blind exception: Exception (BLE001)
959-959: Use explicit conversion flag; replace with conversion flag (RUF010)

🔇 Additional comments (10)
mooncake-store/include/dummy_client.h (1)

231-231: LGTM — formatting-only change.

EOF newline adjustment; no functional impact.

mooncake-store/src/dummy_client.cpp (1)

758-758: LGTM — formatting-only change.

EOF newline adjustment; no functional impact.

mooncake-integration/integration_utils.h (1)

65-70: LGTM — non-owning array view implementation.

Using py::none() as base correctly creates a view without ownership. The caller is responsible for ensuring the buffer outlives the returned array, which aligns with the zero-copy contract.
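To make the ownership contract concrete, a minimal illustrative sketch of the non-owning-view idiom follows. This is not the repository's exact create_typed_array_view; the function name and body are assumptions, only the py::none()-as-base idiom is taken from the change under review.

```cpp
// Illustrative non-owning numpy view: passing a base handle such as
// py::none() keeps pybind11 from copying or owning the memory, so the
// caller must keep `data` alive as long as the returned array is in use.
#include <pybind11/numpy.h>

namespace py = pybind11;

template <typename T>
py::array make_view(T *data, size_t count) {
    return py::array_t<T>(
        {static_cast<py::ssize_t>(count)},      // shape: 1-D, `count` elements
        {static_cast<py::ssize_t>(sizeof(T))},  // stride: contiguous
        data,
        py::none());                            // non-owning: base is not a capsule
}
```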

scripts/test_tensor_api.py (7)

301-313: LGTM — proper buffer lifecycle management in TP consistency test.

Buffers are registered before use and unregistered after, following the correct zero-copy workflow.


315-343: LGTM — comprehensive zero-copy get_into test.

Proper buffer registration, operation, verification, and cleanup.


345-385: LGTM — efficient batch zero-copy test with single buffer allocation.

Good pattern of allocating one large contiguous buffer and using offset-based addressing for individual tensors.


387-451: LGTM — good multi-rank buffer management pattern.

Correctly keeps buffer objects alive until unregistration. Minor: Line 444 has an f-string without placeholders (can remove the f prefix).


611-751: LGTM — well-structured zero-copy benchmarks.

Proper buffer lifecycle management and result verification integrated into performance tests.


929-958: LGTM — extended dtype tests to cover zero-copy path.

Good addition to verify that zero-copy operations preserve dtype correctly.


24-25: Large default buffer sizes (16 GiB / 8 GiB).

These are substantial memory requirements. Ensure the test environment has sufficient resources, or consider adding a smaller default with an option to override for high-memory environments.
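One hedged way to make the defaults overridable is a plain environment-variable hook, sketched below. The variable name MOONCAKE_TEST_BUFFER_BYTES is hypothetical and not part of this PR.

```python
import os

# Hypothetical override for the test's large defaults; the env var name
# MOONCAKE_TEST_BUFFER_BYTES is an assumption, not part of this PR.
DEFAULT_BUFFER_BYTES = 16 * 1024**3  # 16 GiB, matching the PR's constant
BUFFER_BYTES = int(os.environ.get("MOONCAKE_TEST_BUFFER_BYTES", DEFAULT_BUFFER_BYTES))
```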

Comment on lines +1013 to +1030
#### get_tensor_into()

Get a PyTorch tensor from the store directly into a pre-allocated buffer.

```python
def get_tensor_with_tp(self, key: str, buffer_ptr: int, size: int) -> torch.Tensor
```

**Parameters:**

- `key` (str): Base identifier of the tensor.
- `buffer_ptr` (int): The buffer pointer pre-allocated for tensor, and the buffer should be registered.
- `size` (int): The size of buffer.

**Returns:**

- `torch.Tensor`: The retrieved tensor (or shard). Returns `None` if not found.


⚠️ Potential issue | 🟡 Minor

Documentation has incorrect function signatures.

The section title is get_tensor_into() but the code block shows get_tensor_with_tp. Similarly, the next section batch_get_tensor() shows batch_get_tensor_with_tp in its signature. These appear to be copy-paste errors.

🔎 Apply this diff to fix the signatures:

 #### get_tensor_into()
 
 Get a PyTorch tensor from the store directly into a pre-allocated buffer.
 
 ```python
-def get_tensor_with_tp(self, key: str, buffer_ptr: int, size: int) -> torch.Tensor
+def get_tensor_into(self, key: str, buffer_ptr: int, size: int) -> torch.Tensor

-#### batch_get_tensor()
+#### batch_get_tensor_into()
 
 Get a batch of PyTorch tensor from the store directly into a pre-allocated buffer.
 
 ```python
-def batch_get_tensor_with_tp(self, base_keys: List[str], buffer_ptrs: List[int], sizes: List[int]) -> List[torch.Tensor]
+def batch_get_tensor_into(self, keys: List[str], buffer_ptrs: List[int], sizes: List[int]) -> List[torch.Tensor]


> Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

1023-1025, 1029: Unordered list indentation. Expected: 0; Actual: 2 (MD007, ul-indent)

🤖 Prompt for AI Agents

In docs/source/python-api-reference/mooncake-store.md around lines 1013 to 1030,
the function signatures are incorrect: the section titled get_tensor_into()
shows def get_tensor_with_tp(...) and batch_get_tensor() shows def
batch_get_tensor_with_tp(...). Update the signatures to match the section titles
and intended parameter names: rename get_tensor_with_tp to get_tensor_into(self,
key: str, buffer_ptr: int, size: int) -> torch.Tensor, and rename
batch_get_tensor_with_tp to batch_get_tensor_into with the correct parameter
names and types (e.g. keys: List[str], buffer_ptrs: List[int], sizes: List[int])
returning List[torch.Tensor]; ensure surrounding code blocks and descriptions
remain consistent.



Comment on lines +1049 to +1069
#### get_tensor_into_with_tp()

Get a PyTorch tensor from the store, specifically retrieving the shard corresponding to the given Tensor Parallel rank, directly into the pre-allocated buffer.

```python
def get_tensor_with_tp(self, key: str, buffer_ptr: int, size: int, tp_rank: int = 0, tp_size: int = 1, split_dim: int = 0) -> torch.Tensor
```

**Parameters:**

- `key` (str): Base identifier of the tensor.
- `buffer_ptr` (int): The buffer pointer pre-allocated for tensor, and the buffer should be registered.
- `size` (int): The size of buffer.
- `tp_rank` (int): The tensor parallel rank to retrieve (default: 0). Fetches key `key_tp_{rank}` if `tp_size > 1`.
- `tp_size` (int): Total tensor parallel size (default: 1).
- `split_dim` (int): The dimension used during splitting (default: 0).

**Returns:**

- `torch.Tensor`: The retrieved tensor (or shard). Returns `None` if not found.


⚠️ Potential issue | 🟡 Minor

Function signature mismatch for get_tensor_into_with_tp.

The code block shows get_tensor_with_tp but should be get_tensor_into_with_tp to match the section title.

🔎 Apply this diff:
 ```python
-def get_tensor_with_tp(self, key: str, buffer_ptr: int, size: int, tp_rank: int = 0, tp_size: int = 1, split_dim: int = 0) -> torch.Tensor
+def get_tensor_into_with_tp(self, key: str, buffer_ptr: int, size: int, tp_rank: int = 0, tp_size: int = 1, split_dim: int = 0) -> torch.Tensor
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
#### get_tensor_into_with_tp()

Get a PyTorch tensor from the store, specifically retrieving the shard corresponding to the given Tensor Parallel rank, directly into the pre-allocated buffer.
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

1059-1064, 1068: Unordered list indentation. Expected: 0; Actual: 2 (MD007, ul-indent)

🤖 Prompt for AI Agents
In docs/source/python-api-reference/mooncake-store.md around lines 1049 to 1069
the function signature in the code block is incorrectly named get_tensor_with_tp
but the section is about get_tensor_into_with_tp; update the signature in the
code block to use get_tensor_into_with_tp with the same parameters and return
type so the docs header and function name match.

Comment on lines +1070 to +1089
#### batch_get_tensor_with_tp()

Get a batch of PyTorch tensor shards from the store for a given Tensor Parallel rank, directly into the pre-allocated buffer.

```python
def batch_get_tensor_with_tp(self, base_keys: List[str], buffer_ptrs: List[int], sizes: List[int], tp_rank: int = 0, tp_size: int = 1) -> List[torch.Tensor]
```

**Parameters:**

- `base_keys` (List[str]): List of base identifiers.
- `buffer_ptrs` (List[int]): List of the buffers pointer pre-allocated for tensor, and the buffers should be registered.
- `sizes` (List[int]): List of the size of buffers.
- `tp_rank` (int): The tensor parallel rank to retrieve (default: 0).
- `tp_size` (int): Total tensor parallel size (default: 1).

**Returns:**

- `List[torch.Tensor]`: List of retrieved tensors (or shards). Contains `None` for missing keys.


⚠️ Potential issue | 🟡 Minor

Section title should be batch_get_tensor_into_with_tp for the zero-copy API.

This section is under "PyTorch Tensor Operations (Zero Copy)" but uses batch_get_tensor_with_tp which is the non-zero-copy variant. Based on the test file patterns, the zero-copy variant should be batch_get_tensor_into_with_tp.

🔎 Apply this diff:
-#### batch_get_tensor_with_tp()
+#### batch_get_tensor_into_with_tp()
 
 Get a batch of PyTorch tensor shards from the store for a given Tensor Parallel rank, directly into the pre-allocated buffer.
 
 ```python
-def batch_get_tensor_with_tp(self, base_keys: List[str], buffer_ptrs: List[int], sizes: List[int], tp_rank: int = 0, tp_size: int = 1) -> List[torch.Tensor]
+def batch_get_tensor_into_with_tp(self, base_keys: List[str], buffer_ptrs: List[int], sizes: List[int], tp_rank: int = 0, tp_size: int = 1) -> List[torch.Tensor]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
#### batch_get_tensor_into_with_tp()

Get a batch of PyTorch tensor shards from the store for a given Tensor Parallel rank, directly into the pre-allocated buffer.
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

1080-1084, 1088: Unordered list indentation. Expected: 0; Actual: 2 (MD007, ul-indent)

🤖 Prompt for AI Agents
In docs/source/python-api-reference/mooncake-store.md around lines 1070 to 1089,
the function name documented for the zero-copy PyTorch API is incorrect (uses
batch_get_tensor_with_tp). Update the section to use the zero-copy API name
batch_get_tensor_into_with_tp: change the function signature and any occurrences
of batch_get_tensor_with_tp in this block to batch_get_tensor_into_with_tp so it
matches the zero-copy tests and section heading, keeping parameters and return
description the same.

Comment on lines +72 to +92
static const std::array<ArrayCreatorFunc, 16> array_creators_view = {{
    create_typed_array_view<float>,     // FLOAT32 = 0
    create_typed_array_view<double>,    // FLOAT64 = 1
    create_typed_array_view<int8_t>,    // INT8 = 2
    create_typed_array_view<uint8_t>,   // UINT8 = 3
    create_typed_array_view<int16_t>,   // INT16 = 4
    create_typed_array_view<uint16_t>,  // UINT16 = 5
    create_typed_array_view<int32_t>,   // INT32 = 6
    create_typed_array_view<uint32_t>,  // UINT32 = 7
    create_typed_array_view<int64_t>,   // INT64 = 8
    create_typed_array_view<uint64_t>,  // UINT64 = 9
    create_typed_array_view<bool>,      // BOOL = 10
    create_typed_array_view<uint16_t>,  // FLOAT16 = 11 (using uint16_t as
                                        // storage)
    create_typed_array_view<uint16_t>,  // BFLOAT16 = 12 (using uint16_t as
                                        // storage)
    create_typed_array_view<uint8_t>,   // FLOAT8_E4M3 = 13 (using uint8_t as
                                        // storage)
    create_typed_array_view<uint8_t>,   // FLOAT8_E5M2 = 14 (using uint8_t as
                                        // storage)
}};

⚠️ Potential issue | 🟡 Minor

Size mismatch: array_creators_view has 16 entries but array_creators has 15.

Both arrays should map the same TensorDtype enum values (0–14), yet array_creators_view is declared with size 16 while array_creators uses size 15. This inconsistency could cause out-of-bounds access or confusion when indexing by dtype.

🔎 Apply this diff to fix the size:
-static const std::array<ArrayCreatorFunc, 16> array_creators_view = {{
+static const std::array<ArrayCreatorFunc, 15> array_creators_view = {{
     create_typed_array_view<float>,     // FLOAT32 = 0
     create_typed_array_view<double>,    // FLOAT64 = 1
     create_typed_array_view<int8_t>,    // INT8 = 2
     create_typed_array_view<uint8_t>,   // UINT8 = 3
     create_typed_array_view<int16_t>,   // INT16 = 4
     create_typed_array_view<uint16_t>,  // UINT16 = 5
     create_typed_array_view<int32_t>,   // INT32 = 6
     create_typed_array_view<uint32_t>,  // UINT32 = 7
     create_typed_array_view<int64_t>,   // INT64 = 8
     create_typed_array_view<uint64_t>,  // UINT64 = 9
     create_typed_array_view<bool>,      // BOOL = 10
     create_typed_array_view<uint16_t>,  // FLOAT16 = 11 (using uint16_t as
                                         // storage)
     create_typed_array_view<uint16_t>,  // BFLOAT16 = 12 (using uint16_t as
                                         // storage)
     create_typed_array_view<uint8_t>,   // FLOAT8_E4M3 = 13 (using uint8_t as
                                         // storage)
     create_typed_array_view<uint8_t>,   // FLOAT8_E5M2 = 14 (using uint8_t as
                                         // storage)
 }};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (fixed version):

```suggestion
static const std::array<ArrayCreatorFunc, 15> array_creators_view = {{
    create_typed_array_view<float>,     // FLOAT32 = 0
    create_typed_array_view<double>,    // FLOAT64 = 1
    create_typed_array_view<int8_t>,    // INT8 = 2
    create_typed_array_view<uint8_t>,   // UINT8 = 3
    create_typed_array_view<int16_t>,   // INT16 = 4
    create_typed_array_view<uint16_t>,  // UINT16 = 5
    create_typed_array_view<int32_t>,   // INT32 = 6
    create_typed_array_view<uint32_t>,  // UINT32 = 7
    create_typed_array_view<int64_t>,   // INT64 = 8
    create_typed_array_view<uint64_t>,  // UINT64 = 9
    create_typed_array_view<bool>,      // BOOL = 10
    create_typed_array_view<uint16_t>,  // FLOAT16 = 11 (using uint16_t as
                                        // storage)
    create_typed_array_view<uint16_t>,  // BFLOAT16 = 12 (using uint16_t as
                                        // storage)
    create_typed_array_view<uint8_t>,   // FLOAT8_E4M3 = 13 (using uint8_t as
                                        // storage)
    create_typed_array_view<uint8_t>,   // FLOAT8_E5M2 = 14 (using uint8_t as
                                        // storage)
}};
```
🤖 Prompt for AI Agents
In mooncake-integration/integration_utils.h around lines 72 to 92,
array_creators_view is declared with size 16 while the corresponding
array_creators has size 15; make the sizes consistent by changing the
array_creators_view declaration to std::array<ArrayCreatorFunc, 15> (or add the
missing 16th dtype entry if intended), and ensure the initializer list length
matches the array size so dtype indexing cannot go out-of-bounds.

Comment on lines +339 to +348
            auto total_length = store_->get_into(key, buffer, size);
            if (total_length <= 0) {
                py::gil_scoped_acquire acquire_gil;
                return pybind11::none();
            }

            TensorMetadata metadata;
            // Copy data from buffer to contiguous memory
            memcpy(&metadata, static_cast<char *>(buffer),
                   sizeof(TensorMetadata));

⚠️ Potential issue | 🟠 Major

Missing buffer size validation before metadata read.

The code reads sizeof(TensorMetadata) bytes from the buffer at line 347, but there's no validation that the provided size parameter is at least sizeof(TensorMetadata). If the user provides a smaller buffer, this could read beyond the allocated memory.

Additionally, total_length should be checked against sizeof(TensorMetadata) (similar to batch_get_tensor_into at line 470) before the memcpy.

🔎 Suggested fix:
             auto total_length = store_->get_into(key, buffer, size);
             if (total_length <= 0) {
                 py::gil_scoped_acquire acquire_gil;
                 return pybind11::none();
             }
+            if (total_length <= static_cast<long>(sizeof(TensorMetadata))) {
+                py::gil_scoped_acquire acquire_gil;
+                LOG(ERROR) << "Invalid data format: insufficient data for metadata";
+                return pybind11::none();
+            }
 
             TensorMetadata metadata;
             // Copy data from buffer to contiguous memory
             memcpy(&metadata, static_cast<char *>(buffer),
                    sizeof(TensorMetadata));
🤖 Prompt for AI Agents
In mooncake-integration/store/store_py.cpp around lines 339 to 348, the code
copies sizeof(TensorMetadata) bytes from the buffer without validating that the
provided size and the returned total_length are large enough; add checks to
ensure both size >= sizeof(TensorMetadata) and total_length >=
sizeof(TensorMetadata) before performing the memcpy (mirroring the pattern used
in batch_get_tensor_into at line 470), and if the checks fail acquire the GIL
and return pybind11::none() to avoid reading past the buffer.

Comment on lines +414 to +431
    pybind11::list batch_get_tensor_into(
        const std::vector<std::string> &keys,
        const std::vector<uintptr_t> &buffer_ptrs,
        const std::vector<size_t> &sizes) {
        std::vector<void *> buffers;
        buffers.reserve(buffer_ptrs.size());
        for (uintptr_t ptr : buffer_ptrs) {
            buffers.push_back(reinterpret_cast<void *>(ptr));
        }

        if (!is_client_initialized()) {
            LOG(ERROR) << "Client is not initialized";
            py::list empty_list;
            for (size_t i = 0; i < keys.size(); ++i) {
                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
            }
            return empty_list;
        }

⚠️ Potential issue | 🟠 Major

Missing input size validation.

The method doesn't validate that keys.size(), buffer_ptrs.size(), and sizes.size() are equal. If they differ, iterating through total_lengths (which corresponds to keys) while accessing buffers[i] could cause out-of-bounds access if buffer_ptrs is shorter than keys.

🔎 Suggested fix:
     pybind11::list batch_get_tensor_into(
         const std::vector<std::string> &keys,
         const std::vector<uintptr_t> &buffer_ptrs,
         const std::vector<size_t> &sizes) {
         std::vector<void *> buffers;
         buffers.reserve(buffer_ptrs.size());
         for (uintptr_t ptr : buffer_ptrs) {
             buffers.push_back(reinterpret_cast<void *>(ptr));
         }
 
+        if (keys.size() != buffer_ptrs.size() || keys.size() != sizes.size()) {
+            LOG(ERROR) << "Size mismatch: keys, buffer_ptrs, and sizes must have the same length";
+            py::list empty_list;
+            for (size_t i = 0; i < keys.size(); ++i) {
+                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
+            }
+            return empty_list;
+        }
+
         if (!is_client_initialized()) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (fixed version):

```suggestion
    pybind11::list batch_get_tensor_into(
        const std::vector<std::string> &keys,
        const std::vector<uintptr_t> &buffer_ptrs,
        const std::vector<size_t> &sizes) {
        std::vector<void *> buffers;
        buffers.reserve(buffer_ptrs.size());
        for (uintptr_t ptr : buffer_ptrs) {
            buffers.push_back(reinterpret_cast<void *>(ptr));
        }

        if (keys.size() != buffer_ptrs.size() || keys.size() != sizes.size()) {
            LOG(ERROR) << "Size mismatch: keys, buffer_ptrs, and sizes must have the same length";
            py::list empty_list;
            for (size_t i = 0; i < keys.size(); ++i) {
                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
            }
            return empty_list;
        }

        if (!is_client_initialized()) {
            LOG(ERROR) << "Client is not initialized";
            py::list empty_list;
            for (size_t i = 0; i < keys.size(); ++i) {
                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
            }
            return empty_list;
        }
```
🤖 Prompt for AI Agents
In mooncake-integration/store/store_py.cpp around lines 414 to 431, the function
lacks validation that keys.size(), buffer_ptrs.size(), and sizes.size() are
equal which can lead to out-of-bounds access; move or add a precondition check
before converting buffer_ptrs to void* and if the sizes differ, log an error and
return a py::list sized to keys.size() populated with
to_py_ret(ErrorCode::INVALID_PARAMS) for each entry (or otherwise consistently
handle the mismatch), ensuring no buffer access occurs when counts are
inconsistent.

Comment on lines +542 to +546
        } catch (const pybind11::error_already_set &e) {
            LOG(ERROR) << "Failed during batch tensor deserialization: "
                       << e.what();
        }
        return results_list;

⚠️ Potential issue | 🟡 Minor

Incomplete results on exception.

If an exception occurs mid-iteration, the method returns a partial results_list without populating entries for unprocessed keys. Callers may not be able to correlate results with input keys.

Consider appending error codes for remaining entries before returning, or document this behavior clearly in the API.
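A minimal sketch of the padding idea follows. to_py_ret and ErrorCode::INVALID_PARAMS are taken from snippets quoted elsewhere in this review; deserialize_one() and the surrounding function are illustrative placeholders, not the PR's actual code.

```cpp
// Sketch: on failure mid-iteration, pad the remaining slots so the
// returned list stays aligned with the input keys.
#include <pybind11/pybind11.h>
#include <string>
#include <vector>

namespace py = pybind11;

py::object deserialize_one(const std::string &key);  // placeholder; may throw

py::list batch_deserialize_aligned(const std::vector<std::string> &keys) {
    py::list results_list;
    try {
        for (const auto &key : keys) {
            results_list.append(deserialize_one(key));
        }
    } catch (const pybind11::error_already_set &e) {
        LOG(ERROR) << "Failed during batch tensor deserialization: "
                   << e.what();
        // Pad remaining slots so callers can correlate outputs with
        // inputs by position and detect the failed entries.
        while (results_list.size() < keys.size()) {
            results_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
        }
    }
    return results_list;
}
```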

🤖 Prompt for AI Agents
In mooncake-integration/store/store_py.cpp around lines 542 to 546, the catch
block logs the pybind11 exception and returns a partially populated results_list
which breaks correlation with input keys; modify the catch handling to append an
error result for each remaining unprocessed input (e.g., push back consistent
error-code/result objects for each missing key or index) before returning so the
returned list length and ordering match the input keys, and ensure the error
entries contain enough context (error code/message) for callers to detect
failures.

Comment on lines +579 to +596
        if (!is_client_initialized()) {
            LOG(ERROR) << "Client is not initialized";
            py::list empty_list;
            for (size_t i = 0; i < base_keys.size(); ++i) {
                empty_list.append(py::none());
            }
            return empty_list;
        }

        if (use_dummy_client_) {
            LOG(ERROR) << "batch_get_tensor_into_with_tp is not supported for "
                          "dummy client";
            py::list empty_list;
            for (size_t i = 0; i < base_keys.size(); ++i) {
                empty_list.append(py::none());
            }
            return empty_list;
        }

⚠️ Potential issue | 🟡 Minor

Inconsistent error return type.

When the client is not initialized or dummy client is used, this method returns a list of py::none() values, but batch_get_tensor_into returns a list of ErrorCode::INVALID_PARAMS integers. This inconsistency could confuse callers who need to handle errors uniformly.

Consider aligning the error handling with batch_get_tensor_into:

🔎 Suggested fix:
         if (!is_client_initialized()) {
             LOG(ERROR) << "Client is not initialized";
             py::list empty_list;
             for (size_t i = 0; i < base_keys.size(); ++i) {
-                empty_list.append(py::none());
+                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
             }
             return empty_list;
         }
 
         if (use_dummy_client_) {
             LOG(ERROR) << "batch_get_tensor_into_with_tp is not supported for "
                           "dummy client";
             py::list empty_list;
             for (size_t i = 0; i < base_keys.size(); ++i) {
-                empty_list.append(py::none());
+                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
             }
             return empty_list;
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (fixed version):

```suggestion
        if (!is_client_initialized()) {
            LOG(ERROR) << "Client is not initialized";
            py::list empty_list;
            for (size_t i = 0; i < base_keys.size(); ++i) {
                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
            }
            return empty_list;
        }

        if (use_dummy_client_) {
            LOG(ERROR) << "batch_get_tensor_into_with_tp is not supported for "
                          "dummy client";
            py::list empty_list;
            for (size_t i = 0; i < base_keys.size(); ++i) {
                empty_list.append(to_py_ret(ErrorCode::INVALID_PARAMS));
            }
            return empty_list;
        }
```
🤖 Prompt for AI Agents
In mooncake-integration/store/store_py.cpp around lines 579 to 596, the function
returns a py::list of py::none() when the client is uninitialized or a dummy
client is used, but batch_get_tensor_into returns a list of
ErrorCode::INVALID_PARAMS integers; change these early-return branches to build
and return a py::list of integers with ErrorCode::INVALID_PARAMS (cast/convert
to Python int) for each entry in base_keys so the error return type matches
batch_get_tensor_into and callers can handle errors uniformly.

Comment on lines +1233 to +1234
"Get a PyTorch tensor from the store directly into a pre-allocated"
"buffer, optionally sliced for Tensor Parallelism.\n"

⚠️ Potential issue | 🟡 Minor

Minor typo in docstrings.

Missing space in documentation strings:

  • Line 1233-1234: "pre-allocated" "buffer" → should be "pre-allocated buffer"
  • Line 1246-1247: "directly into" "pre-allocated" → should be "directly into pre-allocated"
🔎 Suggested fix:
         .def(
             "get_tensor_into_with_tp",
             &MooncakeStorePyWrapper::get_tensor_into_with_tp, py::arg("key"),
             py::arg("buffer_ptr"), py::arg("size"), py::arg("tp_rank") = 0,
             py::arg("tp_size") = 1, py::arg("split_dim") = 0,
-            "Get a PyTorch tensor from the store directly into a pre-allocated"
-            "buffer, optionally sliced for Tensor Parallelism.\n"
+            "Get a PyTorch tensor from the store directly into a pre-allocated "
+            "buffer, optionally sliced for Tensor Parallelism.\n"
             "Args:\n"
             "  key: The key of the tensor.\n"
             "  buffer_ptr: The buffer pointer pre-allocated for tensor.\n"
             "  size: The size of buffer.\n"
             "  tp_rank: The current tensor parallel rank (default 0).\n"
             "  tp_size: The total tensor parallel size (default 1).\n"
             "  split_dim: The dimension to split the tensor along (default 0).")
         .def("batch_get_tensor_into_with_tp",
              &MooncakeStorePyWrapper::batch_get_tensor_into_with_tp,
              py::arg("base_keys"), py::arg("buffer_ptrs"), py::arg("sizes"),
              py::arg("tp_rank") = 0, py::arg("tp_size") = 1,
-             "Get a batch of PyTorch tensor shards from the store directly into"
-             "pre-allocated buffers for a given Tensor Parallel rank.")
+             "Get a batch of PyTorch tensor shards from the store directly into "
+             "pre-allocated buffers for a given Tensor Parallel rank.")

Also applies to: 1246-1247

🤖 Prompt for AI Agents
In mooncake-integration/store/store_py.cpp around lines 1233-1234 and 1246-1247,
fix the docstring concatenation typo where two adjacent string literals lack a
separating space: replace occurrences like `"pre-allocated" "buffer"` and
`"directly into" "pre-allocated"` with a single combined literal containing the
space (e.g., `"pre-allocated buffer"` and `"directly into pre-allocated"`),
ensuring the docstrings read correctly.

Comment on lines +519 to +524
            recon = torch.cat(reconstruction_parts, dim=split_dim)
            self.assertTrue(torch.equal(recon, original), f"Tensor {i} final reconstruction mismatch")
            self.assertTrue(
                torch.equal(recon, original),
                f"Tensor {i} final reconstruction mismatch"
            )

⚠️ Potential issue | 🟡 Minor

Duplicate assertion — remove redundant check.

The same torch.equal(recon, original) assertion appears twice consecutively.

🔎 Apply this diff:
             recon = torch.cat(reconstruction_parts, dim=split_dim)
             self.assertTrue(torch.equal(recon, original), f"Tensor {i} final reconstruction mismatch")
-            self.assertTrue(
-                torch.equal(recon, original),
-                f"Tensor {i} final reconstruction mismatch"
-            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (fixed version):

```suggestion
            recon = torch.cat(reconstruction_parts, dim=split_dim)
            self.assertTrue(torch.equal(recon, original), f"Tensor {i} final reconstruction mismatch")
```
🤖 Prompt for AI Agents
In scripts/test_tensor_api.py around lines 519 to 524, there is a duplicated
assertion calling torch.equal(recon, original) twice in a row; remove the
redundant second assertion so the reconstruction check appears only once (keep
one assertTrue with the existing message and delete the duplicate block).
