refactor(attn): unify PyAttentionInputs host/device tensors with _host/_device suffixes by Jonathan-hwx · Pull Request #1128 · alibaba/rtp-llm

Jonathan-hwx · 2026-06-22T08:27:23Z

refactor(attn): unify PyAttentionInputs naming — device tensors get a _device suffix, host stays bare/_host; Python names aligned 1:1 with C++ fields (no alias layer)

CLAassistant · 2026-06-22T08:28:00Z

All committers have signed the CLA.

LLLLKKKK · 2026-06-22T08:42:06Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

by_group 属性赋短列表时会残留旧组 @ rtp_llm/models_py/bindings/OpDefs.cc:137
- 建议：setter 按传入长度精确 resize(v.size()) 或 clear 后重建；host/device 两侧都要清理尾部旧组，并补充缩短/清空列表回归测试。
HostDeviceTensor 绑定重构缺少回归测试 @ rtp_llm/models_py/bindings/OpDefs.cc:102
- 建议：补充 pybind alias 读写、by_group 多组到少组/空列表复用测试，并覆盖一次 CUDA Graph replay 参数复制的回归场景。

Checklist Violations (5 fail / 25 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue by_group 属性赋短列表时会残留旧组
by_group pybind setter 赋短列表不会清理尾部旧元素，复用对象时破坏“当前列表等于最后一次赋值”的状态不变量。
[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue by_group 属性赋短列表时会残留旧组
旧 pybind vector 字段从 def_readwrite 变为自定义 setter 后，短列表赋值语义不再等价，会残留旧 group。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue HostDeviceTensor 绑定重构缺少回归测试
diff_paths 没有测试文件；pybind alias、HostDeviceTensor 字段访问和 CUDA Graph copy 均缺少随改动提交的回归覆盖。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue HostDeviceTensor 绑定重构缺少回归测试
缺少 by_group 从非空到空、从多组到单组的复用边界覆盖，正是旧组残留问题的触发条件。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue HostDeviceTensor 绑定重构缺少回归测试
CUDA 和 ROCm attention consumer 都被修改，但 diff 中没有新增对应跨平台回归测试。

Strengths

HostDeviceTensor 将 host/device 成对字段收敛到同一结构，降低了裸字段命名不一致和调用点错配的维护成本。
legacy flat-name pybind alias 仍保留，降低了 Python 调用侧一次性迁移风险。

LLLLKKKK · 2026-06-22T10:15:00Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

by_group 兼容 setter 破坏旧 vector 赋值语义 @ rtp_llm/models_py/bindings/OpDefs.cc:137
- 建议：setter 按 v.size() 重建或 resize 到精确长度；getter 只返回 defined 的对应侧 tensor，保持旧 host/device by_group 独立赋值语义。
缺少 HostDeviceTensor 兼容层回归测试 @ rtp_llm/models_py/bindings/OpDefs.cc:102
- 建议：补充 PyAttentionInputs 旧 flat 属性读写、by_group 缩容/单侧赋值，以及至少一个 CUDA graph 或 ROCm binding 路径的回归测试。

Checklist Violations (3 fail / 79 total)

General Principles Checklist

[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue by_group 兼容 setter 破坏旧 vector 赋值语义
两个 by_group 旧属性 setter 未保持旧 vector 整体替换语义，单侧赋值会暴露未定义另一侧，缩容会保留旧组尾部。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 缺少 HostDeviceTensor 兼容层回归测试
HostDeviceTensor/pybind alias/CUDA graph 迁移缺少对应新增测试，无法覆盖旧属性读写、by_group 缩容和单侧赋值边界。

RTP-LLM Checklist

[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 缺少 HostDeviceTensor 兼容层回归测试
该重构触达 attention binding、CUDA graph、CUDA/ROCm 消费方，但没有看到等价覆盖或端到端回归测试随 diff 更新。

Strengths

HostDeviceTensor 统一了 host/device 字段命名，并在 pybind 层保留旧 flat 属性，CUDA graph 与 CUDA/ROCm 消费方也已同步迁移。

LLLLKKKK · 2026-06-23T10:10:45Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/1 · P3/2

lgtm ready to ci

Non-blocking Suggestions

P2

def_readonly → def_readwrite 移除了 device tensor 写保护 @ rtp_llm/models_py/bindings/OpDefs.cc:117
- 建议：除非确有 Python 侧写入需求，建议保持 def_readonly 以维持原有安全语义；若确需写入，需在 CaptureMemoryHold 层面添加防护（如 replay 前验证 data_ptr 未变）。

P3

.pyi 类型存根与 def_readwrite 不一致 @ rtp_llm/ops/librtp_compute_ops/__init__.pyi:214
- 建议：将 @Property 声明改为普通属性声明 input_lengths_d: torch.Tensor 和 prefix_lengths_d: torch.Tensor，与实际 pybind readwrite 语义一致。
调试日志标签仍使用旧 _d 后缀 @ rtp_llm/cpp/cuda_graph/cuda_graph_utils.cc:108
- 建议：将日志标签更新为 "sequence_lengths_plus_1_device" 和 "decode_cu_seqlens_device"，与字段名保持一致。

Checklist Violations (1 fail / 72 total)

Python Static-First Checklist

[P.A] 静态结构与类型纪律 — 禁止 getattr/setattr literal 访问 → checklist-only
test_py_attention_inputs.py 使用 setattr/getattr 字面量 key（如 setattr(a, self.HOST, ...)），但这是测试代码，目的是测试 pybind binding 的动态属性行为，属于合理例外场景

Strengths

命名约定清晰统一：_device 后缀消除了旧代码中 _d / 裸名指代 device tensor 的歧义
pybind 层保留了所有 Python-facing 名称，Python 侧完全向后兼容
新增回归测试覆盖了所有 host/device 别名映射，防止未来 pybind 误配

wht21 · 2026-06-23T10:14:14Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-24T14:45:00Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/4 · P3/3

lgtm ready to ci

Non-blocking Suggestions

P2

cu_kv_seqlens_device.max().item() 触发 GPU→CPU 同步（已有问题，非本 PR 引入） @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：这是已有的 GPU→CPU 同步点，非本 PR 引入。但既然此次 rename 明确标注了 _device 后缀，建议后续优化：在 C++ 层预计算 cu_kv_seqlens 的 max 值并作为 host 标量传入，避免每次 get_mla_impl 调用时的隐式同步。
len(cu_seqlens_device) 在 CUDA graph 路径上调用 Python len() 获取 device tensor 长度 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py:144
- 建议：非本 PR 引入。len() 在 PyTorch tensor 上是 O(1) 且不触发 D2H，性能影响极小。但语义上 batch_size 应来自 input_lengths.size(0)，更清晰。可改为 attn_inputs.input_lengths.size(0)。
测试中 CPU tensor 赋值给 _device 命名字段 @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：这是 PR 前就存在的行为（旧名 cu_seqlens 也赋了 CPU tensor），但重命名后命名不一致更突出。XQA decode 路径实际消费的是 decode_cu_seqlens_device（GPU），cu_seqlens 在 decode 中仅作辅助；若确实不需要 device 副本，考虑在测试中改用 cu_seqlens_host 或加注释说明为何用 CPU tensor。
CPU pinned-memory tensor 赋值给 cu_seqlens_device 属性，语义不一致 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：重命名后 _device 后缀暗示 tensor 应在 GPU 上，但此处为 CPU pinned memory。这是已有行为（重命名前就如此），但既然 PyAttentionInputs 已拆分出 cu_seqlens_host 属性，建议后续整理时将此处改为 .to(device) 或使用 cu_seqlens_host。当前不影响正确性（C++ 层会做 H2D copy），仅做标注。

P3

文档字符串中仍引用已废弃属性名 @ rtp_llm/models_py/modules/base/cuda/indexer_op.py:366
- 建议：将 docstring 中的 decode_cu_seqlens_d 更新为 decode_cu_seqlens_device，与实际代码保持一致。
测试注释中残留旧字段名 @ rtp_llm/cpp/cuda_graph/tests/cuda_graph_decode_padding.py:108
- 建议：注释中的 sequence_lengths_plus_1_d 和 decode_cu_seqlens_d 应更新为 *_device 以匹配新命名。indexer_op.py:366 也有类似残留。
indexer_ref 中对已在 GPU 的 tensor 冗余调用 .to('cuda') @ rtp_llm/models_py/modules/hybrid/test/indexer_ref.py:510
- 建议：cu_seqlens_device 按命名约定已在 GPU 上，.to('cuda') 是空操作但增加阅读负担。可移除冗余的 .to('cuda') 调用（此问题在 rename 前就存在）。

Checklist ✅ (56 items passed)

Strengths

属性重命名（_d → _device）全面且一致，覆盖了 shard 内所有 10 个文件的全部引用，无遗漏
rocm_impl/aiter.py 中 _infer_cuda_graph_device 正确使用 getattr 安全访问可选属性，避免了 AttributeError
重命名后命名更加清晰（_device 后缀明确表达 GPU 端 tensor 语义），提升了代码可读性
属性命名从歧义的 _d / 无后缀改为明确的 _device，消除了 host/device 张量放置位置的歧义
所有使用点都正确选择了 _device 变体（全部用于 GPU 操作：FlashInfer、TRT、XQA、CUDA kernel），无一处误用
rocm_impl/aiter.py 使用 getattr 安全访问，保持了向后兼容的防御性编程风格
将不一致的后缀（_d、无后缀）统一为 _device，消除了调用方对 tensor 所在设备的歧义，减少了误用 host tensor 触发隐式 D2H 拷贝的风险
所有改动是纯机械的属性名替换，不改变数据流或分配模式，不会引入性能回退
命名规范化做得很彻底：所有 device tensor 统一加 _device 后缀，host tensor 保持原名或加 _host，消除了之前 _d 后缀与无后缀混用的歧义
OpDefs.h 新增的命名约定注释（line 160-161）为后续开发者提供了清晰的规则参考

wht21 · 2026-06-24T14:48:27Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-25T02:35:33Z

AI Code Review - PR #1128

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/6 · P3/2

Blocking Issues

P1

aiter.py 中 getattr 使用旧属性名，CUDA graph prefill 路径永远走 fallback @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:1541
- 建议：将 line 1541 的 "cu_seqlens" 改为 "cu_seqlens_device"，line 1542 的 "cu_kv_seqlens" 改为 "cu_kv_seqlens_device"。否则 ROCm CUDA graph prefill replay 永远无法使用 live cu_seqlens，总走 else 重建路径，可能在 prefix caching 场景下产生不一致结果。

Non-blocking Suggestions

P2

atten_test_util 中 _device 字段实际存放 CPU pinned-memory 张量 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：字段名含 _device 暗示张量在 GPU 上。虽然 pin_memory 行为未变，但如果下游 kernel 期望 device tensor，每次访问都会隐式 H2D 拷贝。建议在 cumsum 后显式 .to(device) 再赋值，或在字段名上保持一致（如 _host_pinned）。此问题为 pre-existing，非本 PR 引入，优先级不高。
cu_seqlens_device 与 cu_kv_seqlens_device 共享同一 tensor 对象 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:153
- 建议：两个字段指向同一 tensor，下游如果 in-place 修改其中一个会意外影响另一个。在测试中风险低但语义不清晰。考虑对其中一个 .clone()。此为 pre-existing 问题，非本 PR 引入。
GPU→CPU 同步在 MLA impl 选择路径 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：这是已有问题非本 PR 引入。.max().item() 对 device tensor 触发隐式 CUDA sync。因为此代码在工厂初始化路径（非 per-token hot path）执行，影响有限，但重命名后 _device 后缀让这个 sync 更显眼。如果未来此路径被 CUDA graph 覆盖则需消除。
cuda_graph_test_utils.py 中 cu_seqlens/cu_kv_seqlens 的 getattr 未同步重命名 @ rtp_llm/cpp/cuda_graph/tests/cuda_graph_test_utils.py:326
- 建议：将第326行改为 getattr(a, "cu_seqlens_device", None)，第327行改为 getattr(a, "cu_kv_seqlens_device", None)。同文件第338-343行的 _d→_device 已正确更新，这两处是遗漏。getattr 默认 None 不会崩溃，但 debug print 会静默丢失这两个 tensor 的信息。
cu_kv_seqlens_device 缺少 host mirror，capture 时触发同步 D2H 拷贝 @ rtp_llm/cpp/cuda_graph/cuda_graph_prefill.cc:49
- 建议：与 cu_seqlens 一样为 cu_kv_seqlens 添加 _host mirror 字段，在 initKernelInternalMemory 中初始化（类似 cu_seqlens_host 的模式），避免 capture 路径上的 .cpu() 同步拷贝。此问题为重命名前已有问题，非本 PR 回归，但重命名使其更明显。注意 OpDefs.h:187 的注释 'no host mirror needed' 与此处实际用法矛盾。
test_xqa_batch_decode.py 中 cu_seqlens_device 赋值 CPU tensor，与命名矛盾 @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：此问题在重命名前已存在（旧名 cu_seqlens 无 device 暗示），但现在字段名明确为 _device 却放 CPU tensor 更容易误导。建议去掉 .cpu() 调用，保持 tensor 在 GPU 上与字段名一致；或确认消费方（XQA prepare_cuda_graph）确实需要 CPU tensor，如果是则考虑改用 cu_seqlens_host。同样的模式在此文件中出现约 12 次。

P3

indexer_ref 中对 _device 张量冗余调用 .to("cuda") @ rtp_llm/models_py/modules/hybrid/test/indexer_ref.py:510
- 建议：字段已命名 _device 表明在 GPU 上，.to("cuda") 虽为 no-op 但显得冗余。可简化为仅 .to(torch.int32)，或加注释说明是防御性转换。
XQA 测试中 cu_seqlens_device 实际赋值为 CPU tensor @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:389
- 建议：虽然这是 rename 前就存在的问题（decode 路径实际使用 decode_cu_seqlens_device），但新命名后 _device 后缀与 .cpu() 矛盾更加明显。建议后续统一修正，或加注释说明 decode 模式下该字段为占位值。

Checklist ✅ (56 items passed)

Strengths

字段重命名 _d → _device 统一且一致，消除了缩写歧义，提升可读性
所有 17 个测试文件的重命名覆盖完整，grep 确认无遗漏的旧字段名
CUDA graph 路径的 pin_memory vs device 分支逻辑在重命名后保持正确
属性重命名完整一致：17 个测试文件中所有 cu_seqlens → cu_seqlens_device、cu_kv_seqlens → cu_kv_seqlens_device、prefix_lengths_d → prefix_lengths_device、input_lengths_d → input_lengths_device、decode_cu_seqlens_d → decode_cu_seqlens_device 均已同步更新，无遗漏
CUDA graph copy 路径中 pinned-memory 与 device tensor 的分支逻辑（test_py_flashinfer_paged_prefill_cuda_graph.py）在重命名后保持正确
_AttnInputsWrapper 巧妙地用 Python 侧 override dict 绕过 prefix_lengths_device / input_lengths_device 的 pybind def_readonly 限制，测试设计合理
属性重命名完整且一致：所有 17 个测试文件中的 cu_seqlens/cu_kv_seqlens/prefix_lengths_d/input_lengths_d/decode_cu_seqlens_d 均已正确更新为带 _device/_host 后缀的新名称
重命名后无遗漏：grep 验证 shard 范围内不存在任何旧属性名引用
属性命名统一化：_d 后缀全部改为 _device，消除了命名不一致（cu_seqlens vs cu_seqlens_d vs decode_cu_seqlens_d），提升可读性
重命名机械且完整，所有使用点均已更新，无遗漏

LLLLKKKK · 2026-06-25T03:34:42Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/4 · P3/4

lgtm ready to ci

Non-blocking Suggestions

P2

GPU→CPU 同步在 MLA 快速路径判断中（已有问题，非本 PR 引入） @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：此问题非本 PR 引入，但值得后续优化：考虑在 PyAttentionInputs 构造时预计算 max_kv_seqlen 并缓存为 CPU 标量，避免每次 get_mla_impl 调用时的 GPU→CPU 同步。可作为 follow-up 处理。
CUDA graph replay 路径每次分配 torch.arange 临时 tensor（已有问题） @ rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py:182
- 建议：非本 PR 引入。可在 init 中预分配 offsets buffer，replay 时用 copy_ 更新，避免 CUDA memory allocator 开销。
atten_test_util 中 cu_seqlens_device 和 cu_kv_seqlens_device 共享同一 tensor 对象 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：使用 cu_seqlens.clone() 赋给 cu_kv_seqlens_device，避免隐式别名。当前测试碰巧没有 in-place 修改所以不会出错，但如果未来新增的测试调用 .fill_() 等操作会静默损坏数据。此问题为 pre-existing，非本 PR 引入。
测试代码中 cu_seqlens_device/cu_kv_seqlens_device 实际放在 CPU 上 @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：本次重命名的目的就是通过后缀区分 host/device，但测试中 cu_seqlens_device 和 cu_kv_seqlens_device 实际存的是 CPU tensor（.cpu()），与命名约定矛盾。XQA decode 路径不消费这两个字段所以不会崩，但建议要么去掉 .cpu()（让它们真正在 device 上），要么如果确实不需要 device tensor 则改用不带 _device 后缀的字段。

P3

indexer_ref.py 中冗余的 .to(torch.int32) 类型转换 @ rtp_llm/models_py/modules/hybrid/test/indexer_ref.py:510
- 建议：简化为 attention_inputs.cu_seqlens_device.to("cuda")，去掉冗余的 dtype 转换。此为历史遗留问题，不阻塞本 PR。
cu_seqlens_device 名称在 CUDA graph copy 路径中实际存放 CPU pinned tensor @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_flashinfer_prefill/test_py_flashinfer_paged_prefill_cuda_graph.py:58
- 建议：可考虑添加注释说明 with_copy_params=True 时 _device 属性暂存 pinned-host tensor，由 CUDA graph copy 阶段负责拷贝到 device。此命名由上游属性决定，此处仅为可读性建议。
padding_offset 未遵循命名约定但实际在 device 上 @ rtp_llm/models_py/bindings/OpDefs.h:191
- 建议：考虑在 OpDefs.h 的命名约定注释中注明 padding_offset 是例外，或者将其重命名为 padding_offset_device 以保持一致性。当前不影响正确性。
.pyi 中 cu_seqlens_host 为新增暴露，应注明 @ rtp_llm/ops/librtp_compute_ops/__init__.pyi:182
- 建议：cu_seqlens_host 在 OpDefs.cc 中早已绑定但之前 .pyi 中遗漏，此次补上是正确的。可以在 PR 描述中提及，避免 reviewer 误以为是新增字段。

Checklist ✅ (56 items passed)

Strengths

属性命名从 _d 后缀统一为 _device 后缀，语义更清晰，降低了 host/device tensor 混淆的风险
重命名覆盖完整，在 modules/ 目录下未发现遗留的旧属性名，不会产生 AttributeError
rocm_impl/aiter.py 中使用 getattr(..., None) 安全降级模式，兼容属性可能不存在的场景
属性命名从不一致的 _d 后缀统一为清晰的 _device 后缀，提高了可读性和可维护性
重命名在所有 10 个文件中完整一致，无遗漏引用
rocm_impl/aiter.py 中的 _infer_cuda_graph_device 使用 getattr(..., None) 安全访问属性，对属性缺失有良好的容错处理
C++ 定义、pybind11 绑定和 Python 消费端三层同步更新，无断层
系统性地将 PyAttentionInputs 的属性名从不一致的 _d 后缀和裸名统一为 _device 后缀，提升代码可读性和命名一致性
属性重命名（cu_seqlens → cu_seqlens_device 等）全局一致，17 个测试文件无遗漏，命名更清晰地区分 host/device tensor
CUDA graph 测试保留了 with_copy_params 分支对 pinned memory vs device tensor 的正确区分

wht21 · 2026-06-25T03:37:44Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-26T03:26:05Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/5 · P3/7

lgtm ready to ci

Non-blocking Suggestions

P2

测试代码 cu_seqlens_device 被赋值为 CPU tensor @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：XQA 测试中 cu_seqlens_device / cu_kv_seqlens_device 被赋值为 CPU tensor（.cpu()），与 _device 后缀的语义矛盾。虽然 XQA 内核不消费这些字段不会导致失败，但如果其他 attention impl 复用这些 test utils 会产生设备不匹配错误。建议保持 device 上或去掉 .cpu() 调用。
capturePrefill 中 cu_kv_seqlens_device.cpu() 每次 capture 创建新 CPU tensor @ rtp_llm/cpp/cuda_graph/cuda_graph_prefill.cc:49
- 建议：此 .cpu() 在 capture 循环中对每个 seq_len 都做一次 D2H 拷贝并分配新 tensor。capture 只在启动时执行不影响 inference 吞吐，但可以预分配一个 pinned host tensor 在循环外复用，避免多次 cudaMemcpy + malloc。优先级不高，仅优化启动时间。
pyi 桩文件中遗留了不存在的 kv_cache_block_id_host_by_group 字段 @ rtp_llm/ops/librtp_compute_ops/__init__.pyi:200
- 建议：删除该行。C++ 结构体和 pybind 注册中均不存在 kv_cache_block_id_host_by_group 字段，这是一个幽灵条目，会误导类型检查器。
GPU→CPU 同步点在 MLA fast-path 判断中（预存在） @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：此为预存在问题，重命名使其更明确。考虑在 C++ 侧预计算 max_kv_seqlen 并作为标量字段传入 PyAttentionInputs，避免 prepare 阶段的 device→host 同步。
ROCm FMHAParams.kv_cache_block_id 属性写入后从未被读取 @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:159
- 建议：确认此属性是否有下游消费者（可能在 ROCm 平台的其他代码中）。若无使用，建议删除此死写以避免维护混淆。若有使用且应为 _device 变体，则需修复。

P3

调试打印标签与字段名不一致 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/trt_tests/trt_test_utils.py:154
- 建议：将打印标签从 'kv_cache_block_id_host' 更新为 'kv_cache_block_id' 以保持一致性
cu_seqlens_device 和 cu_kv_seqlens_device 共享同一 tensor 对象 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：预存在的 aliasing 模式，非本 PR 引入。如果下游 op 原地修改了其中一个 tensor，会影响另一个。建议 .clone() 或确认下游仅读不写。对测试代码影响低，仅作提示。
cu_seqlens_device fill_() 与 cu_seqlens fill_() 可合并为单次 memset @ rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc:281
- 建议：prepareInputs 中对 host (cu_seqlens) 和 device (cu_seqlens_device) 的 unused batch 部分分别做 fill_(same_value)。host fill 后紧接 D2D copy 已经覆盖 device 端，可省掉单独的 device fill kernel launch。不过此优化收益极小（仅 prefill cuda graph 且 bs < max_bs 时触发），可后续再处理。
ParamsBase.fill_params 的 .pyi 关键字参数名与 pybind 不一致 @ rtp_llm/ops/librtp_compute_ops/__init__.pyi:171
- 建议：.pyi 改成了 kv_cache_block_id，但 pybind11 py::arg 仍是 kv_cache_block_id_host。二者应保持一致——要么更新 pybind 的 py::arg 名称，要么回退 .pyi 改动。
check_attention_inputs 每次调用都创建空 tensor 对象 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/flashinfer_mla.py:73
- 建议：每次 prepare 调用都会创建 4 个 size-0 tensor（即使属性非 None 也会创建 dict）。可改为 lazy 模式：先 getattr 检查，仅在 None 时才创建 default tensor。
ROCm fillParams 参数名与新命名规范不一致 @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:154
- 建议：参数名仍为 kv_cache_block_id_host，但内部属性已重命名为 kv_cache_block_id。建议将参数名也统一为 kv_cache_block_id 以匹配新命名规范。
WriteCacheStoreOp 构造函数参数名未同步更名 @ rtp_llm/models_py/modules/base/common/kvcache_store.py:15
- 建议：将参数名 kv_cache_block_id_host 改为 kv_cache_block_id，与属性名和 caller 侧传入的字段名保持一致。

Checklist ✅ (56 items passed)

Strengths

系统性地将 PyAttentionInputs 字段命名从隐含约定 (_host, _d, 无后缀) 统一为显式后缀 (_device, 无后缀=CPU)，提升代码可读性
重命名覆盖了 21 个测试文件，改动机械且一致，未遗漏任何旧字段名
CUDA graph copy 路径中 pinned memory 的使用保持正确：with_copy_params 分支使用 pin_memory() 作为 H2D async copy 的源
21 个测试文件的字段重命名完全一致，无遗漏：kv_cache_block_id_host→kv_cache_block_id, kv_cache_kernel_block_id_host→kv_cache_kernel_block_id, cu_seqlens→cu_seqlens_device, cu_kv_seqlens→cu_kv_seqlens_device, prefix_lengths_d→prefix_lengths_device, input_lengths_d→input_lengths_device, decode_cu_seqlens_d→decode_cu_seqlens_device
所有新字段名与 pybind11 stub (librtp_compute_ops/init.pyi) 中 PyAttentionInputs 的定义完全匹配
命名规范化后语义更清晰：host 端去掉 _host 后缀作为默认，device 端统一使用 _device 后缀
跨 21 个测试文件的字段重命名非常系统化、一致，无遗漏
_AttnInputsWrapper 巧妙地通过 getattr override 解决了 pybind11 def_readonly 字段（prefix_lengths_device、input_lengths_device）在测试中的可读性问题
命名统一化（bare name = host pinned, _device = CUDA）大幅提升可读性，减少 host/device 混淆风险
重命名覆盖完整：header、pybind、Python op、model_desc、CUDA graph、测试均同步更新

wht21 · 2026-06-26T03:29:51Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-26T05:15:25Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/6 · P3/6

lgtm ready to ci

Non-blocking Suggestions

P2

aiter.py FMHAParams 参数名未同步更新 @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:154
- 建议：fillParams 的参数名 kv_cache_block_id_host 应改为 kv_cache_block_id 以匹配新的命名规范。虽然这是局部参数名不影响功能，但与整个 PR 的重命名意图不一致，容易误导。
aiter.py self.kv_cache_block_id 写后未读（dead store） @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:159
- 建议：self.kv_cache_block_id 在整个 aiter.py 中只有写入没有读取（实际计算全部使用 kv_cache_block_id_device）。确认是否仍需保留此属性，若不需要可以移除以减少误解。
TRTPagedMHAImpl 存储 cu_seqlens_device 但从未使用 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trt.py:139
- 建议：TRTPagedMHAImpl.init 存储了 self.cu_seq_lens 但 forward() 和 prepare_cuda_graph() 都不使用它。只有非 paged 的 TRTMHAImpl 在 CUDA graph copy 中用到。可以删除此死代码。
prefill 热路径上对 GPU tensor 调用 .max().item() 引发 GPU→CPU 同步 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：此问题在本 PR 之前已存在（旧名 cu_kv_seqlens 同样在 GPU 上），非本 PR 引入。建议后续优化：在 PyWrappedModel.cc 中将 cu_kv_seqlens 的 max 值预计算为标量字段存入 PyAttentionInputs，避免每批 prefill 的 cudaMemcpy D→H 同步。也可直接用 CPU 侧的 (input_lengths + prefix_lengths).max() 替代。
atten_test_util 将 CPU pinned tensor 赋给 _device 字段 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：此行为在 PR 前就已存在（旧字段也是如此），不是本 PR 引入的回归。但既然已在重命名，可以考虑将 cu_seqlens 用 .to(device) 创建设备版本再赋给 cu_seqlens_device，或者同时设置 cu_seqlens（host）和 cu_seqlens_device（device），与 OpDefs.h 注释的命名约定保持一致。
测试中 cu_seqlens_device / cu_kv_seqlens_device 被赋值为 CPU tensor @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：这是重命名前遗留的问题（旧 cu_seqlens 是 device 字段但被赋 CPU 数据），现在 _device 后缀使矛盾更明显。XQA decode 不使用 cu_seqlens_device（只用 decode_cu_seqlens），所以当前无功能影响，但如果被复制到 prefill 测试中会导致实际 bug。建议改为 CUDA tensor 或添加注释说明该字段在 decode 路径中未使用。

P3

WriteCacheStoreOp.init 参数名与新属性名不一致 @ rtp_llm/models_py/modules/base/common/kvcache_store.py:15
- 建议：参数名 kv_cache_block_id_host 应同步重命名为 kv_cache_block_id，与存储的属性名保持一致。
CUDA graph 路径 pinned tensor 赋给 cu_seqlens_device 命名不一致 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_flashinfer_prefill/test_py_flashinfer_paged_prefill_cuda_graph.py:51
- 建议：OpDefs.h 注释说 cu_seqlens（bare）是 pinned CPU mirror 用于 CUDA graph replay。考虑在 with_copy_params 分支中设置 inp.cu_seqlens 而非 inp.cu_seqlens_device，或同时设置两者。
调试输出标签与新字段名不一致 @ rtp_llm/cpp/cuda_graph/cuda_graph_utils.cc:106
- 建议：标签 "cu_seqlens_host" 语义上正确（.cu_seqlens 确实是 host tensor），但与新字段名 cu_seqlens 不匹配。建议改为 "cu_seqlens" 与字段名一致，便于调试时检索代码。
注释中仍使用旧字段名 decode_cu_seqlens_host @ rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc:534
- 建议：将注释中的 decode_cu_seqlens_host 更新为 decode_cu_seqlens（新的 host tensor 裸名）。
XQAWrapper 存储了 cu_seqlens_device 但从未使用 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:226
- 建议：删除未使用的 self.cu_qseqlens 赋值，避免误导后续开发者认为 XQA decode 依赖 cu_seqlens_device。
auto_model.py prefill 路径未设置 cu_seqlens（host mirror） @ rtp_llm/models_py/standalone/auto_model.py:160
- 建议：standalone auto_model 不走 CUDA graph replay，暂不影响正确性。但为了与 PyWrappedModel.buildPyAttentionInputs 保持一致，可补上 cu_seqlens = torch.tensor([0, input_length], dtype=torch.int32)。

Checklist ✅ (56 items passed)

Strengths

属性重命名完整一致，所有 Python 文件中的旧属性名访问都已更新，不存在遗漏导致的 AttributeError 风险
重命名语义更清晰：去掉了误导性的 _host 后缀（实际在 device 上的张量），添加了 _device 后缀标明张量位置
check_attention_inputs 中的 None 兜底逻辑同步更新了属性名，防止 None 属性在下游引发崩溃
XQA 的 decode_cu_seqlens_host → decode_cu_seqlens 重命名准确反映了该张量实际在 CPU 上的语义
系统性地统一了 PyAttentionInputs 的命名规范：host tensor 使用裸名（如 kv_cache_block_id, decode_cu_seqlens），device tensor 统一加 _device 后缀，消除了之前 _host/_d 后缀混用的歧义
C++ struct、pybind11 binding、Python .pyi 类型桩三层定义保持了完全一致的重命名
所有 15 个消费端文件的属性引用均已同步更新，grep 验证无遗留的旧名称引用
保留了 cu_seqlens (pinned CPU mirror) 与 cu_seqlens_device (CUDA device) 的双 tensor 设计并添加了注释说明语义差异
属性重命名覆盖完整：21 个测试文件中所有 kv_cache_block_id_host → kv_cache_block_id、cu_seqlens → cu_seqlens_device 等旧名已全部替换，grep 验证无遗漏
重命名后与 pybind11 stub (init.pyi) 中 PyAttentionInputs 的属性定义完全一致

wht21 · 2026-06-26T05:18:47Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T06:57:05Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T07:33:02Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T07:37:34Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-26T08:06:55Z

AI Code Review - PR #1128

Status: LGTM

Summary: P0/0 · P1/0 · P2/4 · P3/5

lgtm ready to ci

Non-blocking Suggestions

P2

cu_kv_seqlens_device.max().item() 在 prefill 路径触发隐式 GPU 同步 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:47
- 建议：这是 pre-existing 问题（rename 前也存在），但现在命名更清晰了。由于 is_prefill 短路求值保护，decode 不会触发。若 prefill 延迟敏感，可考虑在 C++ 侧维护 max_kv_seqlen 避免 .item() 同步。当前影响有限，仅标注。
fillParams 参数名与属性名不一致 @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:154
- 建议：参数名 kv_cache_block_id_host 应改为 kv_cache_block_id 以与新命名规范及存储属性 self.kv_cache_block_id 保持一致。当前实现功能正确（位置参数），但易造成误导。
atten_test_util 中 _device 字段被赋值为 CPU pinned tensor @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/atten_test_util.py:152
- 建议：此为 rename 前即存在的行为（pinned memory 用于 CUDA graph async copy），但新命名约定（bare = host, _device = CUDA）使其更加令人困惑。建议在测试中添加注释说明此处 pinned memory 会由 C++ 侧异步拷贝到 device，或考虑将 _device 字段显式设为 .to(device) 后的 tensor，让 host 版本走 cu_seqlens（已有该字段）。
测试中 cu_seqlens_device / cu_kv_seqlens_device 赋值为 CPU tensor @ rtp_llm/models_py/kernels/cuda/test/test_xqa_batch_decode.py:391
- 建议：按命名规范，_device 后缀字段应始终持有 CUDA tensor。虽然 XQA decode 路径实际不消费这两个字段（不会触发运行时错误），但建议保持 device 语义以避免未来依赖该字段的代码出现 device mismatch。可改为不加 .cpu()，或者如果确实不需要该字段则赋为 torch.empty(0) 占位。

P3

FMHAParams.kv_cache_block_id 疑似死属性 @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:159
- 建议：aiter.py FMHAParams 中 self.kv_cache_block_id 只有 fillParams 赋值，文件内无任何读取。确认是否为死属性，可考虑移除以减少混淆。
WriteCacheStoreOp.init 参数名 kv_cache_block_id_host 未同步重命名 @ rtp_llm/models_py/modules/base/common/kvcache_store.py:15
- 建议：参数名改为 kv_cache_block_id 与 self.kv_cache_block_id 保持一致，避免后续维护者误以为此参数必须是 host tensor。
xqa.py 中 cu_qseqlens 赋值后未使用 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:226
- 建议：该成员变量赋值后无任何读取，可以删除以减少混淆。如果后续需要可再加回。
cuda_graph_runner.cc 注释引用旧字段名 @ rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc:534
- 建议：注释中 decode_cu_seqlens_host 已重命名为 decode_cu_seqlens，应同步更新注释。
cuda_graph_utils.cc debug 标签使用旧命名 @ rtp_llm/cpp/cuda_graph/cuda_graph_utils.cc:106
- 建议：debug 标签写的是 "cu_seqlens_host" 但字段已重命名为 cu_seqlens。建议改为 "cu_seqlens" 以与新命名一致，避免调试时混淆。

Checklist ✅ (56 items passed)

Strengths

命名规范清晰统一：裸名 = host (pinned CPU)，_device 后缀 = CUDA device，消除了之前 _host/_d 混用造成的歧义
所有 15 个文件的 rename 完全一致，无遗漏引用
CUDA graph 路径（capture/replay）的 tensor 引用都正确更新，未打破地址稳定性要求
XQA 的 decode_cu_seqlens（原 decode_cu_seqlens_host）正确保持为 host tensor，.tolist() 调用不引入 GPU 同步
属性重命名系统且完整：在本 shard 所有 15 个文件中，旧名称（kv_cache_block_id_host、cu_seqlens、decode_cu_seqlens_d 等）均已替换为新命名规范（裸名 = host，_device 后缀 = GPU），grep 确认无遗漏
新命名规范清晰一致：与 C++ struct 中文档化的约定完全匹配（'the host (pinned CPU) tensor uses the bare name; its device (CUDA) counterpart carries a _device suffix'），消除了之前 _host/_d/无后缀三种后缀混用的歧义
check_attention_inputs 中 setattr 使用的属性名与 pybind11 def_readwrite 注册的新名一致，不会产生动态属性覆盖问题
系统性地统一了 PyAttentionInputs 属性命名规范：无后缀=host/默认，_device=GPU tensor，消除了之前 _host/_d/无后缀混用的歧义
字段重命名（_host → 无后缀，cu_seqlens → cu_seqlens_device）在所有 21 个测试文件中完整且一致，没有遗漏
重命名使命名约定更清晰：_device 后缀明确标识 GPU 张量，去掉 _host 后缀简化了默认在 CPU 上的字段名

wht21 · 2026-06-26T08:08:45Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T08:21:17Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T09:18:44Z

internal source has been updated, please review the changes!

…t/_device suffixes

wht21 · 2026-06-26T11:24:38Z

internal source has been updated, please review the changes!

wht21 · 2026-06-26T13:20:33Z

internal source has been updated, please review the changes!

Jonathan-hwx requested a review from LLLLKKKK as a code owner June 22, 2026 08:27

Jonathan-hwx force-pushed the refactor/attn-input branch from 9183681 to d42d685 Compare June 22, 2026 09:21

Jonathan-hwx force-pushed the refactor/attn-input branch 3 times, most recently from ccfb4a3 to a0c0d02 Compare June 23, 2026 09:48

Jonathan-hwx force-pushed the refactor/attn-input branch from a0c0d02 to 8cfd257 Compare June 24, 2026 13:03

Jonathan-hwx force-pushed the refactor/attn-input branch from 8cfd257 to 36d5aa0 Compare June 25, 2026 02:25

Jonathan-hwx force-pushed the refactor/attn-input branch from 36d5aa0 to f635df7 Compare June 25, 2026 02:47

Jonathan-hwx force-pushed the refactor/attn-input branch from f635df7 to db4fa3f Compare June 26, 2026 03:08

Jonathan-hwx force-pushed the refactor/attn-input branch from db4fa3f to ac9f79e Compare June 26, 2026 03:52

LLLLKKKK approved these changes Jun 26, 2026

View reviewed changes

LLLLKKKK enabled auto-merge (rebase) June 26, 2026 06:44

Jonathan-hwx changed the title ~~refactor(attn): merge host/device fields of PyAttentionInputs into Ho…~~ refactor(attn): unify PyAttentionInputs host/device tensors with _host/_device suffixes Jun 26, 2026

Jonathan-hwx force-pushed the refactor/attn-input branch from ac9f79e to 2aa210b Compare June 26, 2026 06:53

Jonathan-hwx force-pushed the refactor/attn-input branch from 2aa210b to a7c2347 Compare June 26, 2026 09:15

refactor(attn): unify PyAttentionInputs host/device tensors with _hos…

b1f8d50

…t/_device suffixes

Jonathan-hwx force-pushed the refactor/attn-input branch from a7c2347 to b1f8d50 Compare June 26, 2026 11:21

LLLLKKKK merged commit 5466baf into alibaba:main Jun 26, 2026
2 of 3 checks passed

Uh oh!

Conversation

Jonathan-hwx commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CLAassistant commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

Checklist Violations (5 fail / 25 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

Checklist Violations (3 fail / 79 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 23, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

P3

Checklist Violations (1 fail / 72 total)

Strengths

Uh oh!

wht21 commented Jun 23, 2026

Uh oh!

LLLLKKKK commented Jun 24, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

wht21 commented Jun 24, 2026

Uh oh!

LLLLKKKK commented Jun 25, 2026

AI Code Review - PR #1128

Blocking Issues

P1

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 25, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

wht21 commented Jun 25, 2026

Uh oh!

LLLLKKKK commented Jun 26, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

wht21 commented Jun 26, 2026

Uh oh!

LLLLKKKK commented Jun 26, 2026

AI Code Review - PR #1128

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

Jonathan-hwx commented Jun 22, 2026 •

edited

Loading

CLAassistant commented Jun 22, 2026 •

edited

Loading