feat: merge python_native_v2 into main with review fixes by baohengyi · Pull Request #1121 · alibaba/rtp-llm

baohengyi · 2026-06-21T06:27:00Z

Continues from Continues from #985
This PR keeps the head branch on baohengyi/rtp-llm:feature/python_native_v2 and targets alibaba/rtp-llm:main.

It migrates the native build/test path to setup.py/pyproject.toml, pytest CI profiles, remote execution based test orchestration, and updated smoke/perf coverage. It also addresses all P0/P1 blocking issues and P2 non-blocking suggestions raised in the latest review round.

Key changes:

Build/package: setup.py/pyproject.toml migration, precompile bytecode during installs, OSS packaging fixes
CI/test: pytest profiles for unit/smoke/perf tests, remote execution orchestration, tau2 bench, gb200 eval
P0/P1 fixes: ReturnAllProbsMode scheduling, multimodal tensor safety, CudaSampleOp logsumexp optimization, MRoPE position_ids validation, embedding lookup guards, mm_profiler session isolation, scatter_qkv/remap kernel input validation
P2 optimizations: GPU per-row feature hash, warp_topk occupancy caching, threshold-based empty_cache, FP8 w1 reorder memory reduction, moriep divisibility check, reduce_scatter reusable output buffer

Validation:

All modified Python files pass python3 -m py_compile
C++ files reviewed; full build validation pending upstream CI
Branch history preserved as reviewable logical commits

- Move OSS build/package metadata into pyproject and setup.py.\n- Carry forward overlay platform detection and dependency/package-data fixes.\n- Keep Bazel/deps metadata aligned with the pyproject migration.

- Use string-based attention backend selection.\n- Avoid platform import cycles during python-native bootstrap.\n- Keep model, op, and server-arg imports safe across CUDA, ROCm, and OSS runtimes.

- Port OSS unit test entrypoints from Bazel wrappers to pytest profile execution.\n- Add shared pytest profile support and platform skips.\n- Preserve existing unit coverage under the new profile layout.

- Move smoke and perf suites to pytest-based profiles.\n- Restore smoke runtime behavior for ROCm and SM100 profiles.\n- Update SM100 golden data and related platform-specific test coverage.

- Add remote pytest execution over REAPI with CAS inputs and output collection.\n- Harden executor failover, timeout budgeting, and cleanup.\n- Include perf data inputs for per-test remote execution.

- Keep the OSS history aligned with the internal perf-profile CI rollout.

Sync feature/python_native_v2 with latest main branch updates. Conflict resolution (all adopted upstream/feature version): - .bazelrc: kept upstream cache_bust date - arch_select.bzl, http.bzl: kept upstream (bazel dep functions removed) - ConfigModules.h: kept upstream (no enable_paged_open_source_fmha/enable_trtv1_fmha) - ConfigInit.cc: auto-merged (upstream removed registerMultimodal/QuarkMXFP4) - test_py_flashinfer_mha_decode.py: kept upstream (run_bs fix) - multimodal_util.py: kept upstream (class definitions inline) - model_basic_info_analyzer.py: kept upstream (profiling_debug_logging_config) - BUILD files: kept upstream (minimal per python-native migration) - Deleted 26 files (BUILD/bzl/pyi/lock) per python-native migration

P0: - Restore trans_mm_input/trans_config in multimodal_util.py (mm_process_engine import crash fix) P1: - Remove invalid 'from smoke.*' imports in multi_inst_case_runner.py - Convert VitParameters to dataclass with field(default_factory=dict) - Fix FLA initial_state direction: .contiguous() instead of .transpose().contiguous() - Add default MMPreprocessConfig in qwen_vl_renderer/llava_renderer when missing - Bind VIT worker to specific GPU via torch.cuda.set_device - Serialize pool access in mm_process_engine.submit to fix race condition

- FusedRopeKVCacheOp: remove unconditional Mrope reject in prefill/decode ctors (moved to runtime check) - CudaSampleOp: fix top_k=1 fast path skipping cum_log_probs update - CudaSampleOp: fix ROCm cum_log_probs using wrong tensor shape (use log_softmax + gather instead of raw probs.log()) - FLA chunk: add zero-length sequence validation for FlyDSL path

P0-2: Multimodal feature injection for DeepSeek-VL2/KimiK25/QWenV2Audio - DeepSeek-VL2: switch from GenericMoeModel to MultimodalGenericModel so visual features are injected into text embeddings - DeepSeek-VL2: use tokenizer_path (fallback to ckpt_path) for AutoTokenizer - KimiK25: override _create_python_model to use MultimodalGenericModel instead of inheriting DeepSeekV2's text-only GenericMoeModel - QWenV2Audio: wrap Qwen2MtpModel with multimodal embedding injector so audio features are no longer silently dropped P1: RemoteMultimodalProcessor gRPC deadline - Set deadline from max(mm_timeout_ms) across all mm_inputs - Remove bad connection on RPC failure to avoid reusing stuck VIT workers P1: Empty TensorPB deserialization (grpc_util.py) - Handle default-constructed TensorPB (data_type==0) by returning torch.empty(0) instead of raising 'unknown error type' P1: FlyDSL 0-length sequence validation - Add explicit check in megakernel_fwd for cu_seqlens with zero-length sequences; raise ValueError instead of silent underflow P1: CUDA Graph MRoPE position_ids on CPU - Use options_cuda_int32_ instead of options_cpu_int32_ + pin_memory for combo_position_ids capture buffer P1: generic_moe TP-only all_reduce - Document optimization opportunity: when fused_moe supports skip_allreduce, switch to unified TP all_reduce on combined output P1: StreamGroups.h logprobs mixing - Existing one-shot warning + CORRECTNESS RISK comment is sufficient mitigation; full fix requires scheduler partitioning by ReturnAllProbsMode

P1 (blocking fixes): - ReturnAllProbsMode scheduling bucketing in BatchDecodeScheduler/FIFOScheduler - StreamGroups needReturnAllProbs documentation update - OpenaiEndpoint/TensorPbConvert/PyWrappedModel multimodal tensor safety - RemoteMultimodalProcessor constructor alignment - CudaSampleOp top1 logprob via logsumexp to avoid full log_probs materialization - FusedRopeKVCacheOp MRoPE position_ids validation - multimodal_embedding deepstack length/dim/divisibility validation - RtpEmbeddingLookup optional int tensor dtype/device/shape validation - mm_profiler session snapshot and active-request synchronization - scatter_qkv assert -> ValueError - remap_local_ids_kernel shape/device/contiguous/dtype guards - mm_process_engine extra_input handling fixes - qwen_v2_audio/deepseek_vl2/qwen3_next/mori_ep_intranode_router/fla fixes P2 (non-blocking suggestions): - MultimodalProcessor.cc: GPU per-row statistical hash instead of full embedding D2H - warp_topk.hpp + hip_utils.hpp: cache occupancy launch params; stack/fixed workspace - loader.py: threshold-based CUDA empty_cache (only when reserved > 85%) - per_channel_fp8_quant_weight.py: half-size temp buffer for w1 gate/up reorder - moriep_wrapper.py: validate expert_num % ep_size == 0 - multimodal_util.py: list branch MultimodalInput type guard - collective_torch.py: allow caller-provided output_tensor in reduce_scatter

CLAassistant · 2026-06-21T06:27:07Z

All committers have signed the CLA.

LLLLKKKK · 2026-06-21T08:03:51Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/0 · P3/1

Blocking Issues

P1

py_standalone_testlib label 被删除但仍被引用 @ rtp_llm/models_py/standalone/BUILD:43
- 建议：恢复 standalone BUILD 中的 py_standalone_testlib，或把依赖迁移到实际提供 auto_model/ops 的新 target。

Non-blocking Suggestions

P3

中文 README 仍引用已删除 requirements @ deps/requirements_torch_gpu_cuda12.txt:1
- 建议：把安装说明迁移到 pyproject extras/uv pip install 的新依赖安装方式。

Checklist ✅ (56 items passed)

Strengths

smoke suite 迁移到 pytest 参数化后，case 组织和 marker 筛选更清晰。
attention backend 增加显式 NAME/列表解析后，后端选择逻辑更容易按配置控制。

LLLLKKKK · 2026-06-21T10:40:39Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/2 · P1/18 · P2/0 · P3/0

Blocking Issues

P0

多模态特征哈希退化会导致 KV cache 静默误复用 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:32
- 建议：恢复基于完整 row 字节内容的顺序敏感 hash，或实现覆盖完整内容、dtype、shape 的 GPU hash；不要用低维统计量做 cache key。
安全下载模块缺失时会退回裸 requests @ rtp_llm/multimodal/multimodal_util.py:32
- 建议：安全请求模块不可用时 fail-closed，或确保 ssrf_check 随包/namespace 一起发布并补导入失败测试。

P1

MLA CP 路径会选到非 CP 实现 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:37
- 建议：恢复 CP 过滤、实例化时传入 parallelism_config，并保持 CP prefill 不走 fast path。
XQA CUDA graph 回放会使用旧 sequence_lengths @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:96
- 建议：恢复对捕获 sequence_lengths 张量的引用，并在 prepare_cuda_graph 中原地拷贝新的长度。
ROCm smoke marker 未注册导致 strict collection 失败 @ pyproject.toml:220
- 建议：在 pytest markers 中注册 MI308X_ROCM7，或在 conftest 中动态 addinivalue_line 后再使用。
headwise 配置不再传入 attention inputs @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:227
- 建议：恢复从 model_config.headwise_config 到 attn_inputs.headwise_config 的赋值，或把该配置作为显式 factory 参数传入。
MoeConfig 删除 use_mori_ep 后 C++ 仍引用该成员 @ rtp_llm/cpp/config/ConfigModules.h:220
- 建议：恢复 MoeConfig::use_mori_ep 成员，或同步删除/迁移 ConfigModules.cc、ConfigInit.cc 及 Python 配置链路中的所有引用。
BatchDecode 混合 logprobs 模式会永久凑不满批 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:147
- 建议：按 ReturnAllProbsMode 分组统计可运行数量，选择已满足 batch_size_ 的组；或支持超时/不足批时按单一模式小批调度。
TorchSymmMem close 新增后未在销毁分布式环境时调用 @ rtp_llm/models_py/distributed/symm_mem.py:246
- 建议：在 destroy_distributed_environment 中先获取 _symm_mem_comm 调 close() 并置 None，再 destroy_process_group，保证同进程重初始化不会复用旧 IPC/PG 状态。
ROCm TopK launch 缓存无锁并发读写 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:602
- 建议：用 mutex/shared_mutex 或 call_once 保护缓存读写；若 k 取值有限，改为预计算数组加原子状态，避免 unordered_map 并发读写 UB。
six.BUILD 删除后仍被外部依赖引用 @ deps/git.bzl:94
- 建议：恢复 3rdparty/six/six.BUILD，或把 six_archive 改为现存 BUILD 文件 / build_file_content，并验证 @six_archive//:six 可解析。
未 profiling 时多模态请求被全局锁串行化 @ rtp_llm/multimodal/mm_profiler.py:199
- 建议：只在锁内读取状态，把 not want_profile 和 re-check bail-out 分支的 yield 移到锁外执行。
end_profile 会丢失进行中的 profile 结果 @ rtp_llm/multimodal/mm_profiler.py:119
- 建议：先等待 active profile 归零，再在同一把锁下读取并清空 _profiled_count/_last_averages/_finished。
DeepSeek VL2 默认图片预处理配置类型错误 @ rtp_llm/openai/renderers/deepseek_vl2_renderer.py:100
- 建议：改用 rtp_llm.ops.MMPreprocessConfig(-1, ..., [], 30000)，或让 multimodal_util 继续重导出 pybind 配置类。
多图 batch 的严格 mm_timeout_ms 被 max 放宽 @ rtp_llm/multimodal/mm_process_engine.py:321
- 建议：使用正数 timeout 的最小值，或按 timeout 拆分 work item，保证较严格的单个输入不会被同 batch 放宽。
ExecuteResponse 非 OK 状态可能被当成远端测试通过 @ rtp_llm/test/remote_tests/executor.py:792
- 建议：在 _parse 中先检查非 OK status，返回失败 exit_code 和 stderr/status 诊断，并补充非 OK status + 空 result 的测试。
ROCm DeepEP 用例迁移后会被 pytest 以未参数化函数收集 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/deepep_normal_router_test.py:183
- 建议：给 test_single 增加 pytest.mark.parametrize，并按 world_size 设置 gpu count；如果暂不启用则重命名为非 test helper 或显式 skip。
删除 test_util Bazel 目标后仍有依赖引用 @ rtp_llm/test/utils/BUILD:13
- 建议：恢复 test_util py_library，或同步移除/替换仍引用该 label 的 target。
mainse comparer 未注册会静默落到普通 comparer @ rtp_llm/test/smoke/comparer_registry.py:54
- 建议：对 q_r 标记 mainse_module 的 case 禁止 fallback，或在 fallback 前显式检测并报 comparer 未注册。
remote_tests 源码直接导入仍依赖已生成 proto @ rtp_llm/test/remote_tests/executor.py:16
- 建议：在 remote_tests import 前自动 ensure_proto_files_generated，或改为延迟导入并给出明确错误；打包时确保生成文件随 wheel/source 分发。

Checklist Violations (6 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue XQA CUDA graph 回放会使用旧 sequence_lengths
CUDA graph 参数、symm_mem 生命周期和 profiling 状态清理均存在状态不变量破坏。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue ROCm smoke marker 未注册导致 strict collection 失败
ROCm marker、DeepEP pytest 迁移和 remote_tests proto 生成路径存在跨平台/测试收集风险。

Python Static-First Checklist

[P.A] 静态结构与类型纪律 — 数据容器用 dataclass/NamedTuple/TypedDict → issue DeepSeek VL2 默认图片预处理配置类型错误
多模态 Python dataclass 与 C++ pybind MMPreprocessConfig 契约不一致，默认路径会传入错误类型。
[P.B] 错误处理 — 禁止 bare except 或静默吞异常 → issue 安全下载模块缺失时会退回裸 requests
安全请求模块 ImportError 后回退裸 requests，属于安全关键依赖缺失时的 silent fallback。
[P.F] 语言陷阱 — 禁止模块级 import 副作用 → issue remote_tests 源码直接导入仍依赖已生成 proto
remote_tests 顶层导入生成 proto 文件，源码环境缺少生成物时 import 即失败。
[P.G] 测试规范 — 数据驱动测试用 pytest.mark.parametrize → issue ROCm DeepEP 用例迁移后会被 pytest 以未参数化函数收集
ROCm DeepEP 迁移后 test_single 未参数化，原多卡矩阵用例语义丢失。

Strengths

smoke 入口从 Bazel bzl 迁到 pytest manifest 后，case 定义更集中，可读性更好。
comparer registry 把 OSS 与内部 mainse comparer 解耦，方向正确。
TensorPB 转 torch 前新增 shape/payload 字节数校验，避免畸形 RPC payload 继续进入 from_blob/clone。

P0: - Restore full row byte-content FNV-1a hash for multimodal feature cache keys (replaces low-dimensional statistics that caused silent KV-cache reuse) - Fail-closed when ssrf_check module is missing instead of falling back to bare requests.get P1: - Restore CP filtering and parallelism_config in MLA get_mla_impl - Restore captured sequence_lengths reference and in-place copy in XQA CUDA graph - Register MI308X_ROCM7 pytest marker - Restore headwise_config assignment to attn_inputs - Restore MoeConfig::use_mori_ep member - BatchDecodeScheduler: group streams by ReturnAllProbsMode to avoid starvation - Call TorchSymmMemCommunicator.close() before destroy_process_group - Add mutex to protect ROCm TopK occupancy cache - Restore 3rdparty/six/six.BUILD - mm_profiler: move yield outside lock; wait for active profiles before clearing - Fix DeepSeek VL2 default preprocess config to use MMPreprocessConfig(-1,...) - Use minimum positive mm_timeout_ms across batch instead of maximum - Treat non-OK REAPI Execute response status as test failure - Rename test_single to run_single to avoid pytest collection - Restore test_util py_library Bazel target - Forbid fallback comparer for mainse-flagged cases - Wrap remote_tests proto imports with clear error message

LLLLKKKK · 2026-06-22T01:14:51Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/8 · P2/7 · P3/0

Blocking Issues

P1

LRO 响应类型不匹配会被当成成功 @ rtp_llm/test/remote_tests/executor.py:739
- 建议：检查 Unpack 返回值和 op.error/空 response；不匹配时返回非 0，并补 LRO 空响应/错类型回归测试。
非 OK REAPI 状态可能被 EXIT_CODE 覆盖成成功 @ rtp_llm/test/remote_tests/plugin.py:1726
- 建议：当 response_status_code 非 0 时禁止解析 stdout EXIT_CODE 覆盖结果，并避免写入成功缓存；补非 OK status+EXIT_CODE=0 测试。
BatchDecodeScheduler 将 NONE 单独分组导致可兼容请求不入批 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:142
- 建议：把 NONE 作为通配请求填入选中的 DEFAULT/ORIGINAL 组，或在总 waiting 已满足时按注释调度小 batch；补混合 DEFAULT/NONE 单测。
HTTP 多模态下载会因缺失 ssrf_check 直接失败 @ rtp_llm/multimodal/multimodal_util.py:32
- 建议：随 PR 提交 ssrf_check 实现并纳入打包/BUILD，或按环境提供安全 fallback；补 http/https URL 下载测试。
DeepSeek VL2 默认多模态配置会构造失败 @ rtp_llm/openai/renderers/deepseek_vl2_renderer.py:100
- 建议：改用 rtp_llm.ops 的 9 参数 MMPreprocessConfig 或本地默认配置工厂；补无 preprocess_config 的 DeepSeek VL2 图片请求测试。
FlashInfer decode CUDA graph replay 仍会重新 plan @ rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py:756
- 建议：replay 路径避免调用 plan；若必须 replan，需证明不重分配 captured buffer，并用 spy 断言 plan 调用与 buffer 指针稳定。
standalone BUILD 仍加载已删除的 arch_select 符号 @ arch_config/arch_select.bzl:1
- 建议：迁移 standalone/BUILD 到新的依赖声明方式，或在 arch_select.bzl 保留兼容 stub 直到所有引用清空。
旧 Bazel labels 被删除但仍被 standalone 依赖 @ rtp_llm/BUILD:1
- 建议：恢复 alias/py_library 兼容目标，或一次性迁移 standalone 依赖到新 label；提交前跑 bazel package loading 覆盖 standalone。

Non-blocking Suggestions

P2

MoriEP 多卡测试删除 Bazel GPU 资源声明 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/BUILD:85
- 建议：保留拆分后的 2 卡/4 卡入口并加等价 pytest gpu marker，或在新测试注册机制中表达 count=2/4 的资源需求。
新 MultimodalInput 字段名与转换逻辑不一致 @ rtp_llm/multimodal/multimodal_util.py:281
- 建议：统一字段名为 mm_preprocess_config 或在 trans_mm_input 中读取 config，并补 list[MultimodalInput] 转换单测。
MultimodalInput 使用可变默认对象 @ rtp_llm/multimodal/multimodal_util.py:100
- 建议：把默认值改为 None，在函数体内创建新的 MMPreprocessConfig 和 tensor；必要时用 dataclass default_factory。
smoke suite 校验脚本空跑仍返回成功 @ scripts/verify_smoke_suites.py:45
- 建议：扫描实际的 test_smoke_*.py/SMOKE_CASES，且 suites 目录存在但匹配数为 0 时直接返回非零。
perf baseline 缺失会静默跳过回归校验 @ rtp_llm/test/perf_test/perf_runner.py:63
- 建议：只要 test_config 配置 baseline，文件缺失或内容为空都应 fail；仅未配置 baseline 的 case 才上传新结果。
ROCm fused_moe conftest 会屏蔽 CPU-only 回归测试 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/conftest.py:9
- 建议：仅对依赖 aiter/GPU 的文件做 importorskip 或 ignore，保留 CPU-only 回归测试收集。
CAS 批量上传单 blob 失败后仍继续执行 @ rtp_llm/test/remote_tests/cas_client.py:435
- 建议：任一 blob status 非 0 时抛错或重试；ByteStream Write 同时校验 committed_size 等于 digest.size_bytes。

Checklist Violations (12 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 依赖方向：无循环依赖/跨层惊喜 → issue standalone BUILD 仍加载已删除的 arch_select 符号
发现对应问题：standalone BUILD 仍加载已删除的 arch_select 符号。
[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue BatchDecodeScheduler 将 NONE 单独分组导致可兼容请求不入批
发现对应问题：BatchDecodeScheduler 将 NONE 单独分组导致可兼容请求不入批。
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue LRO 响应类型不匹配会被当成成功
发现对应问题：LRO 响应类型不匹配会被当成成功。
[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue 旧 Bazel labels 被删除但仍被 standalone 依赖
发现对应问题：旧 Bazel labels 被删除但仍被 standalone 依赖。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue smoke suite 校验脚本空跑仍返回成功
发现对应问题：smoke suite 校验脚本空跑仍返回成功。
[6.1] Tests — 被删除测试有等价替代覆盖 → issue MoriEP 多卡测试删除 Bazel GPU 资源声明
发现对应问题：MoriEP 多卡测试删除 Bazel GPU 资源声明。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue DeepSeek VL2 默认多模态配置会构造失败
发现对应问题：DeepSeek VL2 默认多模态配置会构造失败。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue MoriEP 多卡测试删除 Bazel GPU 资源声明
发现对应问题：MoriEP 多卡测试删除 Bazel GPU 资源声明。

Python Static-First Checklist

[P.A] 静态结构与类型纪律 — 数据容器用 dataclass/NamedTuple/TypedDict → issue 新 MultimodalInput 字段名与转换逻辑不一致
发现对应问题：新 MultimodalInput 字段名与转换逻辑不一致。
[P.F] 语言陷阱 — 禁止可变默认参数 → issue MultimodalInput 使用可变默认对象
发现对应问题：MultimodalInput 使用可变默认对象。
[P.G] 测试规范 — 数据驱动测试用 pytest.mark.parametrize → issue MoriEP 多卡测试删除 Bazel GPU 资源声明
发现对应问题：MoriEP 多卡测试删除 Bazel GPU 资源声明。
[P.G] 测试规范 — pytest.raises 带 match 参数 → issue smoke suite 校验脚本空跑仍返回成功
发现对应问题：smoke suite 校验脚本空跑仍返回成功。

Strengths

本次重构显式拆分 CUDA/ROCm/Python native 依赖方向，整体目标是降低旧 monolithic BUILD 的耦合。
RemoteExecutor 对非 OK ExecuteResponse 已有 fail-closed 方向，修复后可提升远端测试可靠性。
多处新增 pytest GPU/平台标记，说明 PR 已在尝试把资源约束从 Bazel 迁移到 pytest 层。

P1: - LRO response type mismatch / empty response handling in remote executor - Prevent EXIT_CODE in stdout from overriding non-OK REAPI status - BatchDecodeScheduler: treat NONE ReturnAllProbsMode as wildcard - Add rtp_llm.utils.ssrf_check and route HTTP downloads through it - DeepSeek VL2: use rtp_llm.ops.MMPreprocessConfig (9-arg) consistently - FlashInfer CUDA graph replay: avoid calling plan() on replay path - standalone/BUILD: drop deleted arch_select symbols and stale labels P2: - MoriEP router tests: register pytest gpu(type/count) markers - MultimodalInput: align field name with trans_mm_input, avoid mutable defaults - verify_smoke_suites: discover test_smoke_*.py and fail on empty match - perf_runner: fail when configured baseline file is missing or empty - ROCm fused_moe conftest: only ignore GPU-specific tests - CAS client: fail batch upload / ByteStream write on size/status errors

LLLLKKKK · 2026-06-22T02:31:43Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/1 · P1/15 · P2/0 · P3/0

Blocking Issues

P0

SSRF 校验未覆盖重定向目标 @ rtp_llm/utils/ssrf_check.py:52
- 建议：将 allow_redirects=False，或手动 follow redirect 并在每个 Location 发起请求前重新做 scheme/host/IP 校验；同时考虑消除 DNS 校验与实际连接之间的二次解析窗口。

P1

BatchDecodeScheduler 混合 logprobs 模式会卡住等待队列 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:182
- 建议：按模式分别判断是否能凑满 batch；凑不满时应允许超时降级为部分 batch、显式拒绝不兼容组合，或避免让总 waiting 数触发一个永远无法入队的调度循环。
DECODE 调度仍会把不同 return_all_probs 模式混入已有批次 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:208
- 建议：调度新流前先从 running_streams_ 初始化批次 return_all_probs 模式，并拒绝与已有 RUNNING 批次冲突的 DEFAULT/ORIGINAL；补充 DECODE 角色下 running ORIGINAL + waiting DEFAULT 的回归测试。
缓存禁用配置被破坏 @ rtp_llm/multimodal/multimodal_util.py:228
- 建议：恢复 cache_size <= 0 时将 mm_data_cache 置为 None 的逻辑；当当前缓存为 None 且新 size > 0 时重新创建 LruDict，并补充 0/负数用例。
Triton kernels 的 Bazel 目标被删除但仍被引用 @ rtp_llm/models_py/triton_kernels/BUILD:1
- 建议：恢复 fla/kimi_kda/moe 等 py_library 目标，或同步修改所有下游 BUILD 依赖到新的测试入口。
smoke suite 校验脚本不再是 stdlib-only @ scripts/verify_smoke_suites.py:54
- 建议：不要直接 import pytest entry；将 SMOKE_CASES 抽到无依赖的 data 模块，或用 AST/受限执行只读取字面量配置，并确保脚本在无 torch 环境可运行。
Kimi Linear 的无 RoPE MLA 路径会启动崩溃 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:40
- 建议：恢复 get_global_weight_or_none，并让需要 RoPE 的 impl 自己校验非空；无 RoPE 模型继续传 None。
远端 KVCM 在失败路径不会关闭 @ rtp_llm/test/smoke/case_runner.py:251
- 建议：把 server_manager 和 remote_kvcm_server 的 stop/copy_logs 放进 finally，并对 kill_remote 后的重复 stop 做幂等处理。
CPU-only collect-only 会因缺少 triton 失败 @ conftest.py:116
- 建议：不要用全局 _RTP_CONFTEST_DONE 解除所有轻量收集路径的 deferral；或在 rtp_llm/init.py 对 collect-only/缺少 triton 继续 lazy fallback。
smoke suite 顶层导入破坏预构建校验 @ rtp_llm/test/smoke/suites/test_smoke_h20_dense.py:9
- 建议：把 SMOKE_CASES 拆到 stdlib-only cases 模块，或让 verifier 用 AST/静态解析读取字典；pytest 运行入口再导入 runner。
显式禁用后端名称写错会静默失效 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:175
- 建议：构建已注册 NAME 集合，启动或 get_fmha_impl 时校验 attn_backend 与 disable_attn_backends；未知名称直接抛 ValueError，并为 flashinfer 增加显式 alias 或删除 help 文案。
ROCm 直接使用默认 FMHAConfig 会优先选择 Triton PA @ rtp_llm/cpp/config/ConfigModules.h:114
- 建议：保持 C++ 默认与 CLI 默认一致，或在 ROCm auto 注册顺序中将 ASM 放在 Triton 前；同时补充直接构造 FMHAConfig 的单测。
多角色 smoke 的共享 env 被丢弃 @ rtp_llm/test/smoke_framework/runner.py:72
- 建议：当 smoke_args 是 dict 且 envs 是 list 时，将该 list 复制到每个 role；dict envs 再按 role 追加覆盖。
mainse comparer 没有注册入口 @ rtp_llm/test/smoke/comparer_registry.py:50
- 建议：在 smoke conftest 或 suite 入口导入时注册 mainse/mainse_arpc comparer，并确保注册发生在 OSS generic Openai/Normal fallback 之前。
CPU-only collect 仍会强制导入 triton @ rtp_llm/__init__.py:131
- 建议：把 triton 导入也纳入轻量启动判定；pytest collect-only/CPU 环境下跳过 triton 和 ops，运行时需要时再 fail-fast。
Bazel Python 依赖声明 shim 被删除但调用未迁移 @ arch_config/arch_select.bzl:1
- 建议：恢复 requirement shim，或把这些 BUILD 迁移到新的 Python 依赖声明方式并删除 load/call。

Checklist ✅ (56 items passed)

Strengths

attention registry 新增 NAME/重复注册测试，有助于提前发现显式后端派发失效。
BatchDecodeScheduler 已把 NONE 作为 wildcard 合入选中 logprobs 模式，修复了上一版可兼容请求被单独分组的问题。
reduce_scatter 新增可复用 output_tensor，调用方可避免分布式热路径每次分配输出 buffer。

P0: - SSRF redirect validation: manual redirect follow with per-Location scheme/host/IP re-check and relative URL resolution. - BatchDecodeScheduler: partial-batch fallback when incompatible modes prevent filling batch_size_. P1: - FIFOScheduler: initialize batch return_all_probs mode from running streams and reject incompatible DEFAULT/ORIGINAL waiting streams. - multimodal_util: restore cache_size <= 0 disables cache; recreate LRU when disabled cache gets positive size. - triton_kernels/BUILD: restore py_library targets for common/fla/kimi_kda/ moe/causal_conv1d/sparse_mla. - verify_smoke_suites: AST-only SMOKE_CASES parsing (stdlib-only). - attn_factory: restore get_global_weight_or_none for RoPE-less MLA; validate attn_backend/disable_attn_backends names and add flashinfer alias for py_flashinfer. - case_runner: server_manager/remote_kvcm cleanup in finally with idempotent stop/log copy. - conftest/__init__: defer heavy torch/triton/ops imports during pytest collect-only and plugin discovery. - ConfigModules: default use_triton_pa=false to align ROCm defaults. - smoke_framework/runner: copy shared env list to every role in multi-role smoke cases. - comparer_registry: auto-register internal mainse comparers before OSS fallback. - arch_select.bzl: restore requirement/internal_deps/triton_deps shims. - validation: add 'eval' to known smoke markers.

- P0-1: pass layer_idx through Qwen3NextAttention to CausalAttention - P0-2: SSRF check validates redirects and pins connections to resolved IPs - P1-1: reset MoriEP singleton before destroying process groups - P1-2: CPU backend sends ready signal via local_rank_start - P1-3: reuse self.tokenizer.image_token_id in DeepSeek-VL2 - P1-4: cache URL bytes and return fresh BytesIO copies - P1-5: align local MM batch timeout semantics with remote path - P1-6/P1-7: fix ROCm filtered_probs scope and cum_log_probs distribution - P1-8: support 2-D dispatch_ids/weights in remap_local_ids - P1-9: disable TRT allreduce for unsupported world sizes - P1-10: avoid permanent stall in BatchDecodeScheduler mixed logprobs mode - P1-11: abort VIT RPC on empty embeddings - P1-12: choose Local/RemoteMultimodalProcessor by vit_separation/role/tp_rank

LLLLKKKK · 2026-06-22T04:10:43Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/10 · P2/34 · P3/0

Blocking Issues

P1

HTTPS 下载会因改写为 IP 而握手失败 @ rtp_llm/utils/ssrf_check.py:138
- 建议：不要直接把 HTTPS URL host 改成 IP；改用兼容当前 requests/urllib3 的自定义 connection/pool 来固定 connect IP，同时显式设置 server_hostname/assert_hostname，并补 HTTPS 回归测试。
LRO error 被当成普通失败，跳过 failover/retry @ rtp_llm/test/remote_tests/executor.py:738
- 建议：按 op.error.code/status message 分类：可恢复的 REAPI/worker 错误返回 exit_code=-1 并设置 infra_category，非恢复错误再作为普通失败上报。
CUDA 原始 all_probs 模式下 cum_log_probs 使用了错误分布 @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:212
- 建议：为 cum_log_probs 单独保存过滤/归一化后的采样分布；return_original_all_probs 只应影响 output_all_probs，并补充 ORIGINAL+top_k/top_p+return_cum_log_probs 测试。
kimi_kda Bazel 目标缺少运行时依赖 @ rtp_llm/models_py/triton_kernels/BUILD:18
- 建议：在 :kimi_kda deps 中补充 //rtp_llm/models_py/triton_kernels/autotune_cache:autotune_cache，避免 Bazel py_test 运行时 import 失败。
VIT 角色分离模式会误要求本地 mm_process_engine @ rtp_llm/cpp/model_rpc/LocalRpcServer.cc:47
- 建议：将 VIT_SEPARATION_ROLE 的非 VIT 后端路径继续走 RemoteMultimodalProcessor，或按 role_type 区分本地 VIT 进程和远端调用方。
flashinfer 禁用别名在 auto 模式下失效 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:209
- 建议：解析 disable_attn_backends 时统一做别名归一化或展开，例如将 flashinfer 同时展开为 py_flashinfer/py_flashinfer_paged，并让显式和 auto 路径使用同一套 canonical 名称。
remote_tests proto 生成晚于打包导致插件导入缺模块 @ setup.py:1198
- 建议：把 proto 生成提前到 build_py/sdist 阶段，或自定义 build_py 先调用 ensure_proto_files_generated，并在 package_data/MANIFEST 中显式包含生成的 _pb2.py、_pb2_grpc.py。
attention blocklist 对 flashinfer 别名不生效 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:205
- 建议：解析 blocked 时把 flashinfer 规范化为 py_flashinfer，或同时加入 alias 和真实 NAME；为 auto/explicit 模式各加单测。
legacy FMHA 总开关未被工厂消费 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:103
- 建议：在 auto 模式先处理 enable_fmha=false 为禁用全部 MHA，或转换为 attn_backend=none；同时明确 enable_open_source_fmha 的兼容语义并接入过滤。
smoke runner 忽略 per-test data_root @ rtp_llm/test/smoke_framework/runner.py:146
- 建议：在 run_smoke_test 内根据 test_config.get('data_root') 调 compute_smoke_rel_path，避免使用导入时捕获的 REL_PATH；resolve_prompt_refs 和 TaskInfo 路径也使用同一个本地 data_root。

Non-blocking Suggestions

P2

重定向响应未关闭会滞留流式连接 @ rtp_llm/utils/ssrf_check.py:172
- 建议：在跟随下一跳前显式 response.close()；可用 try/finally 包住 redirect 分支，避免多模态 URL 高频下载时连接/FD 依赖 GC 回收。
小文件上传会一次性驻留所有 Batch payload @ rtp_llm/test/remote_tests/cas_client.py:225
- 建议：按 batch 边构造边提交，或用有界队列限制在途 batch 数，避免大量缺失小文件同时占用内存。
session cache 命中校验逐条发 CAS 请求 @ rtp_llm/test/remote_tests/plugin.py:1859
- 建议：先收集并去重所有待校验 digest，批量调用 FindMissingBlobs，再用本地 set 判断每个 entry 是否可 replay。
大输出下载路径重复物化完整归档 @ rtp_llm/test/remote_tests/output_collector.py:141
- 建议：为大 blob 提供流式下载到文件的接口，输出归档直接写临时文件并解压，减少 controller 侧峰值内存和拷贝。
[符号已在executor_failover_test.py定义-自动降级] CAS 上传 RPC 没有 deadline，故障时会卡死 pytest @ rtp_llm/test/remote_tests/cas_client.py:344
- 建议：给 FindMissingBlobs/BatchUpdateBlobs/ByteStream.Write 统一设置 deadline，并让 BatchUpdate/Write 也走 _grpc_call_with_retry；大文件 Write 还应校验 committed_size。
远端 outputs 解包仍会落地不安全链接 @ rtp_llm/test/remote_tests/output_collector.py:32
- 建议：跳过 symlink/hardlink，或同时校验 linkname 解析后仍在 dest 内；优先用 data-only extraction 语义。
大文件 ByteStream 上传未校验提交大小 @ rtp_llm/test/remote_tests/cas_client.py:406
- 建议：保存 WriteResponse 并校验 committed_size == digest.size_bytes，不一致时抛错并让远端执行失败或重试。
CAS 下载失败与空 blob 无法区分 @ rtp_llm/test/remote_tests/cas_client.py:317
- 建议：对非空 digest 的下载失败抛异常或返回带状态的结果；调用方再决定重试、失败或保留已有 stream log。
低 GPU 数 session 会把高卡用例落到 1 卡阶段 @ rtp_llm/test/remote_tests/plugin.py:1909
- 建议：1GPU phase 排除所有 gpu_count_N (N>1)；对超过 total_gpus 的 tier 显式 skip 或 fail-fast。
ROCm TopK occupancy 缓存未区分设备 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:600
- 建议：将 device id、arch 或 SM 数纳入缓存 key；如该函数在请求热路径频繁调用，也考虑降低命中路径的全局 mutex 开销。
CUDA graph replay 测试未直接拦截 replan @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_py_flashinfer_mha_decode.py:363
- 建议：对 decode_wrapper.plan 做 monkeypatch 计数或直接 fail，并断言 captured buffer 的 data_ptr 在 replay 后不变。
超时可能遗留 GPU worker 进程 @ rtp_llm/models_py/modules/base/cuda/test/torch_symm_mem_test.py:295
- 建议：用独立 session/process group 启动 wrapper，并在 TimeoutExpired 时 kill 整个进程组，或显式管理并 terminate 所有 worker。
分布式初始化失败会被当作测试成功 @ rtp_llm/models_py/modules/base/cuda/test/torch_symm_mem_test.py:51
- 建议：初始化失败应直接抛出；只有在测试入口确认环境不可用时才显式 skip。
ROCm 基础测试会因 aiter 缺失整目录不收集 @ rtp_llm/models_py/modules/base/rocm/test/conftest.py:6
- 建议：只对真正依赖 aiter 的 FMHA 用例局部 skip；目标 ROCm 环境中 aiter 导入失败应 fail fast 或显式 skip。
FlashInfer 缺失会跳过非 FlashInfer TRT 测试 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/conftest.py:3
- 建议：把 flashinfer gate 下沉到 flashinfer 相关测试目录；TRT/XQA 按各自依赖单独 gate。
CP attention 测试缺少 FlashInfer 可用性 gate @ rtp_llm/models_py/modules/factory/attention/cuda_cp_impl/test/conftest.py:5
- 建议：在 conftest 中增加 flashinfer import gate，缺失时显式 skip，避免收集阶段直接报错。
手工测试残留本地绝对路径污染导入 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/test/sparse_mla_decode_op_test.py:28
- 建议：删除硬编码 sys.path；本地调试通过 PYTHONPATH 或 pytest 配置显式传入。
文档仍引用已删除的 docs requirements @ docs/README.md:10
- 建议：改为安装 .[docs] 等新的 docs extra，或恢复并维护 docs/requirements.txt。
源码构建文档混用 pip wheel 和 bazel-out @ docs/start/install.md:33
- 建议：同步更新后续步骤，说明 pip wheel 的 proto 产物位置；若仍依赖 Bazel 生成 proto，应保留 Bazel 构建步骤。
预处理提交被全局锁串行化 @ rtp_llm/multimodal/mm_process_engine.py:198
- 建议：只在读取/替换 pool 和 rebuild 时持锁；正常 apply_async 尽量在锁外执行，或用 pool generation 校验避免并发 rebuild 竞态。
DistributedServerTest 的测试方法被缩进进 tearDown @ rtp_llm/distribute/test/distributed_server_test.py:178
- 建议：把 17 个 test_* 方法反缩进到 DistributedServerTest 类级别，只保留 patch stop 逻辑在 tearDown 内。
BatchDecode 调度在锁内重复分配分组容器 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:142
- 建议：改用固定 3 桶数组或单次遍历记录迭代器/计数，避免 map/list 节点分配并缩短 scheduler lock 持有时间。
FIFO 批次模式会被不可入批请求提前锁定 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:257
- 建议：先完成错误、事件和内存检查，确认 stream 会加入 new_streams 后再设置或比较 batch_return_all_probs_mode。
多模态 hash 仍阻塞复制完整特征并逐字节扫描 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：保留完整内容 hash 语义，但改成 GPU row-digest 或 SIMD/xxhash 后只回传紧凑 digest，避免 expandTokenIds 阶段复制和扫描完整 embedding。
TensorPB shape 乘法缺少非法维度和溢出检查 @ rtp_llm/cpp/model_rpc/TensorPbConvert.cc:59
- 建议：在累乘前拒绝负维度，并用 max/checked_mul 校验 numel 与 expected_bytes 溢出；为 malformed shape 增加单测。
CUDA 检测到工具链但缺少 version.json 时会误报无法检测平台 @ _build/platform.py:437
- 建议：当 _detect_cuda() 为真但 version.json 不存在或解析失败时，降级到 _get_cuda_config_from_version("") 或读取 nvcc/version.txt。
全局 autouse 清理会拖慢 GPU 测试 @ conftest.py:192
- 建议：改为仅对标记需要强隔离的 GPU 测试启用，或通过环境变量/失败路径触发 aggressive cleanup。
entry.py 每次结束都无条件拷贝 ROCm 调试产物 @ rtp_llm/test/smoke/entry.py:149
- 建议：只在失败或显式 DEBUG 环境变量开启时收集，并限制文件数量/总大小。
mainse comparer 注册失败被静默吞掉 @ rtp_llm/test/smoke/comparer_registry.py:106
- 建议：仅捕获 ModuleNotFoundError/ImportError 表示 OSS 环境缺失；其他异常至少 logging.exception，或保存原始异常并在 mainse case resolve 失败时带出。
xdist GPU 校验文件使用全局目录清扫 @ conftest.py:36
- 建议：默认 GPU_VERIFY_DIR 加入会话级唯一前缀（pid、pytest tmp_path 或 workerinput 的 testrunuid），只清理本会话文件。
远程构建 TOML 配置解析失败会静默降级 @ setup.py:484
- 建议：缺文件可返回空；文件存在但解析失败应打印具体路径和异常，RTP_REMOTE=1 或关键配置路径下建议直接 fail-fast。
kimi_kda Bazel 目标缺少 autotune_cache 依赖 @ rtp_llm/models_py/triton_kernels/BUILD:20
- 建议：给 :kimi_kda 增加 //rtp_llm/models_py/triton_kernels/autotune_cache 依赖，确保 Bazel runfiles/依赖图完整。
mainse_arpc 缺 comparer 时会静默 fallback @ rtp_llm/test/smoke/comparer_registry.py:56
- 建议：把 q_r.get('mainse_arpc', False) 纳入 is_mainse 判定，并调整错误信息；增加缺内部 comparer 的 mainse_arpc 单测。
pytest smoke case 的 timeout 字段未生效 @ rtp_llm/test/smoke_framework/manifest.py:105
- 建议：在 build_smoke_params 中检测 config['timeout'] 并追加 pytest.mark.timeout(timeout)，或删除无效字段并明确统一 timeout 策略。

Checklist ✅ (56 items passed)

Strengths

smoke suite 验证脚本改为 AST 解析，避免执行 suite 顶层 pytest/torch import，预构建校验更稳。
multi-role smoke env 支持共享 env list 后，每个 role 都会得到通用环境变量。
SSRF helper 改为手动逐跳校验 redirect，修复了原先只校验初始 URL 的风险。
verify_smoke_suites 改为 AST 读取 SMOKE_CASES，避免 prepare-source 阶段导入 pytest/torch；已在 worktree 中执行通过。
multi-role smoke envs 现在会把 list 形式共享 env 注入每个 role，减少配置重复。
SSRF 下载改为手动逐跳校验 redirect，修复了默认自动跳转绕过初始 host 校验的方向性问题。
verify_smoke_suites 改用 AST 读取 SMOKE_CASES，避免执行测试文件顶层依赖。
multi-role smoke env 支持共享 env 列表，减少角色间配置遗漏。
Merkle 构建对文件内容采用流式 hash，避免在建树阶段加载完整输入。
CAS 上传先用 FindMissingBlobs 过滤已存在 blob，并把大文件走 ByteStream，减少重复传输和大 payload RPC。

- P2-1: gate large BeamSearch test matrix behind env var for CI - P2-2: cache VIT role address in EmbeddingEndpoint with TTL refresh - P2-4: throttle gc.collect() by weight count threshold in FP8 loader - P2-7: replace assert with explicit ValueError in reduce_scatter - P2-8: replace assert with explicit RuntimeError in MoriEP topology checks - P2-9: clean up VIT proxy workers on startup exception - P2-10: use 'is not None' instead of truthiness for Tensor default - P2-11: align scatter_qkv test exception type with implementation - P2-12: fix profiler active count leak on creation failure - P2-16: use VitConfig.mm_timeout_ms for VIT proxy default timeout - P2-17: short-circuit GPU sync in CP prefill attention routing - P2-18: downgrade MoriEP router hot-path log from info to debug - P2-20: remove duplicate trans_input serialization in enqueue - P2-21: add per-dim boundary check and checked multiply in TensorPbConvert - P2-22: convert remote multimodal bad response to ErrorInfo - P2-23: precompute host-side max_seqlen for Qwen2VL visual attention - P2-24: optimize BatchDecodeScheduler stream removal with set + remove_if - P2-25: reuse output buffer in PureCpRouter reduce_scatter hot path

LLLLKKKK · 2026-06-22T07:48:52Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/11 · P2/36 · P3/1

Blocking Issues

P1

HTTPS 下载会因改写为 IP 而握手失败 @ rtp_llm/utils/ssrf_check.py:138
- 建议：不要只把 URL host 改成 IP；为当前 requests/urllib3 路径实现可用的 SNI/hostname 校验，或固定支持版本并补真实 HTTPS 下载回归测试。
LRO error 被当作普通测试失败，绕过 executor failover @ rtp_llm/test/remote_tests/executor.py:738
- 建议：按 op.error.code/message 分类，REAPI/worker 基础设施错误返回 exit_code=-1 并设置可 failover 的 infra_category。
CUDA 原始 all_probs 模式下 cum_log_probs 使用了原始分布 @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:212
- 建议：为 cum_log_probs 单独保留过滤归一化后的采样分布；return_original_all_probs 只影响返回的 all_probs，并补充组合测试。
ROCm 原始 all_probs 会被过滤分布污染 @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:473
- 建议：过滤前用 probs_t.clone() 作为 filtered_probs，或先保存原始 probs 专供 ORIGINAL all_probs 输出。
VIT ROLE 分离后端会误要求本地 mm_process_engine @ rtp_llm/cpp/model_rpc/LocalRpcServer.cc:47
- 建议：将 VIT_SEPARATION_ROLE 的非 VIT 后端也走 RemoteMultimodalProcessor，或按 role_type 区分 VIT 进程本地处理与 PREFILL/DECODE 远端调用。
flashinfer 禁用别名在 auto 模式下失效 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:212
- 建议：解析 blocklist 时把 flashinfer 展开为 py_flashinfer，或在 auto 分支同时检查别名集合。
Headwise 分头后丢失 GQA 的原始 Q/KV 头映射 @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/headwise.py:357
- 建议：按 KV group 整组切分并保持原始 Q head 到 KV head 的映射；若 mask 不是整组对齐，应 fallback 到全量 attention 或拒绝该 headwise_config，FP8 路径同步修复。
源码/Bazel 环境下 ops 导入会丢失 .so 查找路径 @ rtp_llm/ops/__init__.py:67
- 建议：恢复对 bazel-bin/runfiles 中 libth_transformer_config.so 的查找，或在 _so_available 为 false 时提供真正的 collection-only stub，避免继续硬导入。
旧 FMHA 开关未完整映射到新后端选择 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:106
- 建议：在 auto 模式先处理 enable_fmha=false；将 enable_open_source_fmha/旧 paged-open-source/trtv1 语义映射到对应 NAME/blocklist，或对已不支持的旧配置 fail-fast。
clean wheel 可能不包含 remote_tests 生成 proto @ setup.py:1198
- 建议：增加自定义 build_py 或 bdist_wheel，在 build_py 收集包之前调用 ensure_proto_files_generated；同时用 clean tree 构建 wheel 后校验 remote_execution_pb2 可 import。
scatter_qkv 优化路径可能拒绝真实非连续输入 @ rtp_llm/models_py/model_desc/qwen3_next.py:256
- 建议：在调用 scatter_qkv 前显式保证 contiguous，或仅当 mixed_qkv.is_contiguous() 时走 Triton 路径；补充从 _conv1d 返回值进入 _fla 的集成测试。

Non-blocking Suggestions

P2

重定向响应未关闭会滞留流式连接 @ rtp_llm/utils/ssrf_check.py:172
- 建议：在跳转前显式 response.close()；异常和最大跳转次数耗尽路径也应关闭当前响应/Session。
标量 FP32 TensorPB 会被误解为空 tensor @ rtp_llm/utils/grpc_util.py:32
- 建议：区分“无 payload 的默认 TensorPB”和“shape=[] 的标量”；可检查数据 buffer 是否为空，标量应按 dtype reshape([]) 还原。
小文件 CAS 上传会先物化全部 batch payload @ rtp_llm/test/remote_tests/cas_client.py:190
- 建议：按 batch 构造后立即提交，或用有界队列限制在途 batch 数，避免峰值内存随缺失小文件总量增长。
session cache 命中校验逐条发 CAS 请求 @ rtp_llm/test/remote_tests/plugin.py:1859
- 建议：先收集并去重所有候选 entry 的 digest，批量 FindMissingBlobs 一次或少量分片，再用 missing set 判定每个 entry 是否可 replay。
远端输出下载重复占用完整归档内存 @ rtp_llm/test/remote_tests/output_collector.py:141
- 建议：给 CASClient 增加流式下载到文件接口，ByteStream.Read 直接写临时 tar，再解压，避免 bytes join 和二次拷贝。
ROCm cum_log_probs 只取一个 token 却物化整行 log @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:535
- 建议：改为先 gather selected_probs，再对 [batch] selected_probs.log()，与 CUDA flashinfer 分支一致。
[符号已在executor_failover_test.py定义-自动降级] CAS 上传 RPC 缺少 deadline，故障时会卡死收集阶段 @ rtp_llm/test/remote_tests/cas_client.py:344
- 建议：给 FindMissingBlobs、BatchUpdateBlobs、ByteStream.Write 统一设置 deadline，并让上传 RPC 走重试包装；Write 同时校验 committed_size。
CAS 下载失败与真实空 blob 无法区分 @ rtp_llm/test/remote_tests/cas_client.py:317
- 建议：对非空 digest 的下载失败抛异常或返回带状态的结果，让调用方决定重试、失败或保留已有日志。
远端 outputs 解包允许落地不安全链接 @ rtp_llm/test/remote_tests/output_collector.py:32
- 建议：跳过 symlink/hardlink，或对 linkname 解析目标也做 dest 内约束；优先采用 data-only extraction 语义。
低 GPU 数 session 会把高卡用例落到 1 卡阶段 @ rtp_llm/test/remote_tests/plugin.py:1909
- 建议：1GPU phase 排除所有 N>1 的 gpu_count_N；对超过 total_gpus 的 tier 显式 skip 或 fail-fast。
大文件 ByteStream 上传未校验提交大小 @ rtp_llm/test/remote_tests/cas_client.py:406
- 建议：保存 WriteResponse 并校验 committed_size，不一致时抛错并让远端执行重试或失败。
ROCm TopK occupancy 缓存未区分设备 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:600
- 建议：将 device id、arch 或 SM 数纳入缓存 key；如果该函数在请求热路径高并发调用，也评估降低全局 mutex 命中路径开销。
CUDA graph replay 测试未直接拦截 replan @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_py_flashinfer_mha_decode.py:363
- 建议：对 decode_wrapper.plan 做 monkeypatch 计数或直接 fail，并断言 captured buffer 的 data_ptr 在 replay 后不变。
超时可能遗留 GPU worker 进程 @ rtp_llm/models_py/modules/base/cuda/test/torch_symm_mem_test.py:295
- 建议：用独立 session/process group 启动 wrapper，并在 TimeoutExpired 时 kill 整个进程组，或显式管理并 terminate 所有 worker。
分布式初始化失败会被当作测试成功 @ rtp_llm/models_py/modules/base/cuda/test/torch_symm_mem_test.py:51
- 建议：初始化失败应直接抛出异常；只有入口处确认环境不满足时才显式 skip。
ROCm 基础测试会因 aiter 缺失整目录不收集 @ rtp_llm/models_py/modules/base/rocm/test/conftest.py:9
- 建议：只对真正依赖 aiter 的 FMHA 用例局部 skip，保留 embedding/norm/layernorm 等基础 ROCm 覆盖。
FlashInfer 缺失会跳过非 FlashInfer TRT 测试 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/conftest.py:3
- 建议：把 flashinfer gate 下沉到 flashinfer 相关测试目录；TRT/XQA 按各自依赖单独 gate。
CP attention 测试缺少 FlashInfer 可用性 gate @ rtp_llm/models_py/modules/factory/attention/cuda_cp_impl/test/conftest.py:5
- 建议：在 conftest 中增加 flashinfer import gate，缺失时显式 skip，避免收集阶段直接报错。
手工测试残留本地绝对路径污染导入 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/test/sparse_mla_decode_op_test.py:28
- 建议：删除硬编码 sys.path；本地调试通过 PYTHONPATH 或 pytest 配置显式传入。
文档仍引用已删除的 docs requirements @ docs/README.md:10
- 建议：改为安装新的 docs extra 或恢复并维护 docs/requirements.txt。
ROCm 测试在 aiter 异常时会静默不收集 @ rtp_llm/models_py/modules/base/rocm/test/conftest.py:6
- 建议：非 ROCm/无依赖环境可显式 skip；在 ROCm CI 上 aiter RuntimeError 应 fail 或至少报告跳过原因，避免整目录 0 case 通过。
DistributedServerTest 的测试方法被缩进进 tearDown @ rtp_llm/distribute/test/distributed_server_test.py:178
- 建议：把 test_get_master_、test_split_ip_port_ 等方法反缩进到 DistributedServerTest 类级别，tearDown 内只保留 patch 清理逻辑。
CUDA 存在但缺少 version.json 时平台检测会误失败 @ _build/platform.py:437
- 建议：当 _detect_cuda() 为真但 version.json 缺失或解析失败时，降级到 _get_cuda_config_from_version("")，或从 nvcc/version.txt 获取版本。
BatchDecode 调度在锁内重复分配分组容器 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:143
- 建议：改为固定 3 桶计数/迭代器选择，确认入批后再移动，避免 map/list 节点分配并缩短 scheduler lock 持有时间。
FIFO 批次模式会被不可入批请求提前锁定 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:257
- 建议：先完成错误、事件和内存检查，确认 stream 会加入 new_streams 后再设置或比较 batch_return_all_probs_mode。
多模态特征 hash 会同步复制并逐字节扫描完整 embedding @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：保留完整内容 hash 语义，但改为 GPU row digest 或 SIMD/xxhash 后只回传紧凑 digest，避免每次 expandTokenIds 复制完整 embedding。
零尺寸 TensorPB 会吞掉非空 payload @ rtp_llm/cpp/model_rpc/TensorPbConvert.cc:78
- 建议：在 numel==0 返回前也校验 dataBytes(tensor_pb)==0，或把 expected/actual bytes 校验统一前置，避免协议异常被静默忽略。
大 beam 搜索默认覆盖被削弱 @ rtp_llm/cpp/testing/BeamSearchOpTest.hpp:139
- 建议：保留至少一个默认 CI 的大 beam/变 beam 代表用例，或新增明确设置 RTP_LLM_RUN_LARGE_BEAMSEARCH_TESTS 的 nightly/slow test target。
mainse comparer 注册异常被吞掉 @ rtp_llm/test/smoke/comparer_registry.py:106
- 建议：只吞 ImportError/ModuleNotFoundError 表示 OSS 环境缺包；其他异常应记录 traceback 或直接抛出。
远程构建配置解析失败会静默降级 @ setup.py:484
- 建议：区分文件缺失和解析失败；解析失败应打印路径和异常并在 RTP_REMOTE=1 时 fail-fast。
Qwen2-VL 视觉层仍逐层触发 GPU 到 CPU 同步 @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:314
- 建议：将 max_seqlen 在 Qwen2VisionTransformerPretrainedModel.forward 中计算一次后传给所有 block；非 flash attention 路径不要计算该值。
DeepGEMM 损坏安装仍会打断模块导入 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:143
- 建议：把 ImportError 也纳入捕获范围，并保持符号为 None，让调用点走 _missing_deep_gemm 的清晰错误。
FP8 group gemm 缺少可用性检查 @ rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py:241
- 建议：在 cutlass_moe_mm_fp8_scaled 入口检查 fp8_grouped_gemm_ptpc 是否为 None，抛出带安装/平台信息的 RuntimeError 或走明确 fallback。
triton_kernels Bazel 目标缺少运行时依赖 @ rtp_llm/models_py/triton_kernels/BUILD:17
- 建议：给 :kimi_kda 增加 autotune_cache 依赖；为 rtp_llm/models_py/utils 定义 py_library 并加入 :moe deps，或把所需 math helper 放入已有可见 target。
内部 fused_moe 平台扩展不会被注册 @ rtp_llm/models_py/modules/factory/platform_ext_loader.py:14
- 建议：在 fused_moe registry 初始化完成后调用 load_platform_extension，并在存在 register_fused_moe 时传入 device_type/registry/factory；扩展执行失败需保留 warning 或 fail-fast 策略。
Qwen2-VL 视觉层重复触发 GPU 同步 @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:314
- 建议：在 Qwen2VisionTransformerPretrainedModel.forward 中计算一次 max_seqlen，并作为参数传给每个 block；或保持 grid_thw/cu_seqlens 的长度元数据在 CPU 侧计算。

P3

debug 日志仍在热路径 eager 格式化 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:92
- 建议：改成 logging.debug("... tokens=%s, ep_rank=%s", a1.shape[0], self.ep_rank)，避免 debug 关闭时仍构造字符串。

Checklist ✅ (56 items passed)

Strengths

VIT proxy 转发超时变为可注入默认值，便于测试和避免 worker 卡死耗尽线程池。
MLA factory 在 CP 场景下先判断 parallelism，减少不必要的 cu_kv_seqlens.max().item() GPU 同步。
SSRF helper 改为手动逐跳校验 redirect，方向上修复了自动跳转绕过初始 host 校验的问题。
smoke suite 验证脚本改用 AST 读取 SMOKE_CASES，避免 prepare 阶段执行 pytest/torch 顶层 import。
VIT RPC 对空 embedding 改为明确 INTERNAL 错误，避免返回 OK 空响应后触发下游 CHECK。
VIT worker 返回空 embedding 时现在通过 gRPC INTERNAL 明确失败，避免空 OK 响应触发下游 CHECK 崩溃。
VIT proxy 转发超时支持从配置注入，减少 worker 卡死时耗尽代理线程池的风险。
smoke runner 对单测环境变量做 finally 恢复，降低同一 pytest session 内 case 间状态污染。
CAS 上传已区分大文件 ByteStream 和小文件 BatchUpdateBlobs，并对失败响应做显式报错。
RemoteExecutor 对非 OK ExecuteResponse、LRO error、stream deadline 和 failover 的处理比裸 pytest 远端执行更可观测。

P1 fixes: - HTTPS SSRF: stop rewriting URL to IP; pin connection pool host instead, preserving TLS SNI/cert verification via server_hostname/assert_hostname - VIT role separation: VIT_SEPARATION_ROLE with non-VIT role now uses RemoteMultimodalProcessor instead of requiring local mm_process_engine - flashinfer alias: expand 'flashinfer' <-> 'py_flashinfer' in blocklist for both auto and explicit backend modes - legacy FMHA: consume enable_fmha (global switch) and enable_open_source_fmha in _is_fmha_impl_disabled_legacy - kimi_kda BUILD: add autotune_cache dependency - LRO error: classify recoverable infra failures (exit_code=-1 + infra_category) instead of treating all LRO errors as normal test failures - proto generation: add build_py/sdist custom commands to generate proto files before packaging - smoke runner: use per-test data_root to compute local REL_PATH instead of import-time captured value P2 fixes: - SSRF redirect: close response before following next hop to avoid FD leak Already fixed in prior commits (verified): - CudaSampleOp cum_log_probs sampling distribution - TensorPbConvert shape boundary/overflow check - BatchDecodeScheduler stream removal optimization

LLLLKKKK · 2026-06-22T14:50:46Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/8 · P2/50 · P3/10

Blocking Issues

P1

SSRF get_connection 回退路径静默吞掉验证失败，DNS 重绑定窗口打开 @ rtp_llm/utils/ssrf_check.py:141
- 建议：当 _resolve_and_validate_host 抛出 ValueError（DNS 解析到私有 IP）时，不应该 pass 继续返回 conn。旧代码通过 URL 重写将 validated IP 硬编码到 URL 中，所以 get_connection 天然防重绑定。新代码保留原始 hostname URL，在 get_connection 中重新解析 DNS，如果 DNS 重绑定到私有 IP，except ValueError: pass 会导致连接在没有 IP pinning 的情况下继续，绕过 SSRF 防护。当前环境 requests==2.28.1（<2.32），此路径是实际执行路径。修复：将 except ValueError: pass 改为 except ValueError: raise 或 return 一个会拒绝连接的 sentinel，或在 get_connection 中也读取 request._ssrf_validated_ip（需要将 request 对象传递下来）。
Smoke runner 修改 common_def.REL_PATH 无法影响 utils.py 的 prompt 缓存重载 @ rtp_llm/test/smoke_framework/runner.py:166
- 建议：将 utils.py:9 改为 from rtp_llm.test.smoke import common_def，并将 utils.py:18 和 :47 的 REL_PATH 改为 common_def.REL_PATH，确保读取的是模块属性而非 import-time 绑定的值副本。或者在 runner.py 中同步修改 _smoke_utils.REL_PATH = local_rel_path。
别名展开后 blocklist 验证在 ROCm 上会崩溃 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:209
- 建议：在 known_names 构建时，将别名展开产生的名称也加入白名单，例如: known_names = registered_names | {"auto", "none", "flashinfer", "py_flashinfer"}。或者在验证循环中跳过由别名展开自动添加的名称。
UpdateWeights 异常处理：无 GIL 调用 PyErr_Fetch + 忽略 pybind11 已捕获的异常信息 @ rtp_llm/cpp/model_rpc/LocalRpcServer.cc:637
- 建议：py::error_already_set 构造时已经 fetch 了 Python 错误，此时 PyErr_Fetch 必定返回 NULL（错误已被清除），且调用时 GIL 已释放（gil_scoped_acquire 作用域已结束），属于 UB。应直接使用 e.what() 获取错误信息：std::string err_msg = e.what();
start_server 启动异常被吞，进程以 exit code 0 退出 @ rtp_llm/start_server.py:406-408
- 建议：在 graceful_shutdown() 后添加 raise 或 sys.exit(1)，确保启动失败时进程以非零退出码退出。当前行为会导致 K8s/容器编排认为服务启动成功但实际无服务。PR 重写了 start_server() 但保留了这个预存 bug，应在本次修改中修复。
MoeConfig pickle 缺少 fp4_moe_op 字段，跨进程静默丢失 @ rtp_llm/cpp/pybind/ConfigInit.cc:649
- 建议：在 MoeConfig pickle getstate 中追加 self.fp4_moe_op，setstate 中恢复 c.fp4_moe_op = t[12].caststd::string()，tuple size 检查改为 13。本 PR 已修改了 MoeConfig pickle（添加 use_mori_ep），应一并补全 fp4_moe_op。
BatchDecodeScheduler 忙等期间持有 mutex，可能导致调度器死锁 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:199
- 建议：增加超时机制（例如最大等待 30 秒），超时后将 stream 标记为错误并移除。如果远程 cache 加载挂起，当前代码会导致 scheduler 死锁，所有后续请求永远排队。PR 修改了 evaluateWaitingStreams 的调度逻辑但保留了此忙等。
Operation proto 缺少 error 字段导致 LRO 失败路径 AttributeError 崩溃 @ rtp_llm/test/remote_tests/executor.py:738
- 建议：在 remote_execution.proto 的 Operation 消息中添加 google.rpc.Status error = 4; 字段（需先 import google/rpc/status.proto），或改用 oneof result { google.rpc.Status error = 4; google.protobuf.Any response = 5; } 以与标准 LRO 规范一致

Non-blocking Suggestions

P2

_session_pids 遍历所有 /proc 条目并逐一读取 /proc/{pid}/environ @ rtp_llm/test/utils/device_resource.py:79
- 建议：在繁忙机器上（数千进程）此操作可能需要秒级时间。由于只在子进程退出后的清理路径调用，影响有限。但可以先过滤 /proc/{pid}/cmdline 是否包含 python/rtp 相关关键字来减少 environ 读取次数。
enable_fmha 全局开关在显式 backend 模式下不生效 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:112
- 建议：如果 enable_fmha 是全局开关（名字暗示如此），在显式 backend 模式下也应检查。可在 get_fmha_impl 开头加 if fmha_config and not fmha_config.enable_fmha: raise Exception('FMHA disabled')，或在文档中注明此开关仅在 auto 模式生效。如果是故意设计（显式选择覆盖全局开关），建议加注释说明。
Smoke runner 临时修改模块级全局变量 common_def.REL_PATH 非线程安全 @ rtp_llm/test/smoke_framework/runner.py:165
- 建议：当前 smoke test 均为单进程串行执行，风险较低。长期建议将 REL_PATH 作为参数传递给 resolve_prompt_refs 和 _load_prompt_candidates，避免全局状态修改。
_resolve_and_validate_host 的 except ValueError 误捕获私有 IP 的主动拒绝 @ rtp_llm/utils/ssrf_check.py:63
- 建议：将 ipaddress.ip_address() 的 ValueError（解析失败）和 _is_private_ip 的 ValueError（主动拒绝）分开处理。可以先单独 try ipaddress.ip_address()，成功后在 try 外部做 _is_private_ip 检查并 raise。
op.error.code 同时用作 exit_code 和 status_code 语义不一致 @ rtp_llm/test/remote_tests/executor.py:747
- 建议：当前因 executor_worker_io 分支不可达所以无实际影响，但建议注释说明 op.error.code 的语义是 google.rpc.Code，与进程 exit_code 不同。若未来 classifier 新增基于 status_code 值的分支可能产生误判
smoke runner 中 _PROMPT_CACHE 在 finally 块外部修改，缺少异常安全恢复 @ rtp_llm/test/smoke_framework/runner.py:169
- 建议：将 _smoke_utils._PROMPT_CACHE = None 的恢复也放入 finally 块中，确保无论成功还是异常后续调用都能重新加载正确的 prompt candidates。
[符号已在bytestream.proto定义-自动降级] _parse 访问 op.error 但 Operation proto 未定义 error 字段，导致 AttributeError 崩溃 @ rtp_llm/test/remote_tests/executor.py:738
- 建议：在 remote_execution.proto 的 Operation message 中添加 google.rpc.Status error = 4（或内联的 Status error = 4），然后重新生成 pb2 代码。或者改用 hasattr(op, 'error') 做防御性检查。当前代码会导致所有 _parse 调用崩溃，每次远程测试执行都会失败并返回无用的 AttributeError 信息。
_bytestream_write_file_parallel 未校验 committed_size，大文件静默部分上传 @ rtp_llm/test/remote_tests/cas_client.py:406
- 建议：与 _bytestream_write 保持一致，在 _bytestream_write_file_parallel 中也检查 resp.committed_size != digest.size_bytes，否则大文件并行上传可能静默部分写入而不报错。
download_blob 对所有 gRPC 错误静默返回空字节，调用方无法区分'空 blob'与'下载失败' @ rtp_llm/test/remote_tests/cas_client.py:317
- 建议：对于 download_blob 中非 NOT_FOUND 的 gRPC 错误，考虑抛出异常或返回 None 让调用方可以区分'blob 为空'和'下载失败'；或至少添加 warning 级别日志记录错误类型（当前只有 _bytestream_read 有日志）。
plugin.py 直接调用 CASClient 私有方法 _find_missing @ rtp_llm/test/remote_tests/plugin.py:1069
- 建议：在 CASClient 上暴露公开方法 find_missing()（无下划线前缀），或将该逻辑移入 CASClient 内部。当前 _find_missing 是内部实现细节，直接调用增加了耦合。
DeepGEMM 导入异常未捕获 ImportError @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:146
- 建议：将 ImportError 加入捕获列表：except (ImportError, AssertionError, RuntimeError, OSError)。deep_gemm 的 native extension 加载失败时可抛 ImportError。
fp8_grouped_gemm_ptpc 为 None 时无保护直接调用 @ rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py:241
- 建议：在 cutlass_moe_mm_fp8_scaled 入口处加 if fp8_grouped_gemm_ptpc is None: raise RuntimeError('rtp_kernel.fp8_group_gemm not available') 或提供 fallback。
Qwen2VL VisionBlock 每层重复计算 max_seqlen 触发 GPU→CPU sync @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:314
- 建议：将 max_seqlen 计算移到 Qwen2VisionTransformerPretrainedModel.forward（line 435 循环之前），一次计算后通过参数传给每个 block。当前 depth=32 意味着 32 次多余的 .item() GPU→CPU 同步。
ops/init.py libpython 加载：LIBDIR 为 None 时 TypeError 不被 except OSError 捕获 @ rtp_llm/ops/__init__.py:94
- 建议：改为 except (OSError, TypeError): pass，或先检查 lib_dir = sysconfig.get_config_var('LIBDIR'); if lib_dir: ...。此问题在旧代码中同样存在（旧代码无 try/except），PR 改善了但未完全修复。
flashinfer 别名扩展未覆盖 py_flashinfer_paged @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:104
- 建议：考虑将 flashinfer 别名扩展为同时包含 py_flashinfer 和 py_flashinfer_paged，或在文档/帮助中说明 flashinfer 别名不覆盖 paged 变体，需显式指定 --disable_attn_backends=flashinfer,py_flashinfer_paged。
缺少对新增 enable_fmha / enable_open_source_fmha 逻辑的单元测试 @ rtp_llm/models_py/modules/factory/attention/test/test_impl_name_registry.py:1
- 建议：为 _is_fmha_impl_disabled_legacy 和 _get_blocked_backends 添加单元测试，覆盖：1) enable_fmha=False 禁用所有 impl；2) enable_open_source_fmha=False 仅禁用 TRT impls；3) flashinfer 别名扩展逻辑。
占位锁争用：occupancy_cache 使用 static mutex @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:605
- 建议：occupancy_cache 命中后仅需只读访问，可用 shared_mutex 的 shared_lock 做读路径，write_lock 做写路径，减少多线程 topk 调用时的锁争用。或者用 double-check locking 模式避免写后重复加锁。
CUPTI benchmark 每轮 synchronize 引入测量误差 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/bench_utils.py:861
- 建议：每次迭代都做 synchronize 会序列化 GPU pipeline，拉高测量时间。考虑只在全部迭代结束后做一次 synchronize，然后通过 CUPTI 的 correlation_id 匹配正确的 kernel 时间（已实现的 activity 后处理逻辑正是为此设计的）。或在文档中说明这是有意为之以确保 CUPTI buffer flush 完整。
bench_gpu_time_with_cupti 内 L2 flush buffer 未复用 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/bench_utils.py:791
- 建议：三个 bench 变体 (cuda_event / cupti / cudagraph) 各自独立分配 256MB buffer。如果被连续调用（如 bench_gpu_time 切换模式），会重复分配/释放大块 GPU 显存。可将 buffer 提到 bench_gpu_time 层级统一管理或使用模块级缓存。
flashinfer 别名双向展开可能导致 blocklist 校验误报 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:199
- 建议：当前逻辑正确因为 flashinfer 始终在 known_names 中（line 202 硬编码）。但若 user 在 disable_attn_backends 中写了 py_flashinfer，而该平台没注册 py_flashinfer，校验不会报错（因为展开后 flashinfer 在 known_names 中），可能给用户造成 blocklist 生效的错觉。建议在展开别名前也对原始名做校验，或在文档中注明别名行为。
AiterPrefillAttnOpPaged.forward 每层重复调用 _sanitize_and_pad_block_table @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:831
- 建议：将 _sanitize_and_pad_block_table 调用提升到 prepare() 中（参考 AiterDecodeAttnOp 的 _sanitize_block_table 在 prepare 阶段调用的模式），缓存结果在 self.fmha_params 上，forward() 直接使用缓存值。每层可节省 6+ 个 GPU tensor 分配。
MoriEpIntranodeRouter.finalize 热路径使用 logging.info + f-string @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:225
- 建议：改为 logging.debug() 以匹配 prepare() 方法 (line 92) 的日志级别，或使用延迟格式化: logging.info("[MoriEpIntranodeRouter] finalize called, ep_rank=%s, chunked=%s", self.ep_rank, self._is_chunked)。
remap_to_local_ids 无条件 .to(torch.float32) 分配 @ rtp_llm/models_py/triton_kernels/moe/remap_local_ids_kernel.py:79
- 建议：添加 dtype 检查: weights_f32 = dispatch_weights if dispatch_weights.dtype == torch.float32 else dispatch_weights.to(torch.float32); 然后传入 weights_f32.view(-1)。
MoriEpIntranodeRouter._remap_to_local_ids 热路径 lazy import @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:76
- 建议：将 import 移到模块顶层，或在 init 中缓存函数引用: self._remap_fn = remap_to_local_ids。
rocm_impl/aiter.py _gather_and_reshape_kv_compact 每层 index_select 创建中间 tensor @ rtp_llm/models_py/modules/factory/attention/rocm_impl/aiter.py:392
- 建议：在 _prepare_block_table_indices 中预分配 k_used/v_used 缓冲区，使用 torch.index_select(out=pre_allocated_buf) 消除每层分配。或考虑使用 custom CUDA kernel 直接 scatter 到 compact 布局，跳过中间 tensor。
PdSeperationCaseRunner decode 服务启动失败时未停止 prefill 服务 @ rtp_llm/test/smoke/multi_inst_case_runner.py:134
- 建议：在 decode 失败的 early return 前添加: if prefill_server_manager: prefill_server_manager.stop_server()。同样的问题也存在于 DpSeperationCaseRunner (line 295-299) 和 VitSeperationCaseRunner (line 497-502，LLM 失败时未停止 VIT)。此问题为 pre-existing，但本 PR 重新格式化了相关代码。
FlyDSL kernel wrapper .contiguous() 冗余调用 @ rtp_llm/models_py/triton_kernels/fla/chunk.py:305
- 建议：在函数入口添加一次统一检查: tensors = [q, k, v, g, beta]; if not all(t.is_contiguous() for t in tensors): ... 。对于 .to().contiguous() 模式，直接去掉冗余的 .contiguous()。
FlyDSL kernel 每次 launch 创建新的 dummy empty tensor @ rtp_llm/models_py/triton_kernels/fla/flydsl_chunk_gdn_mi308x.py:1127
- 建议：在模块级别缓存 dummy tensor (按 device 索引的 dict)，避免重复分配。torch.empty(0) 虽小但仍经过 CUDA allocator。
OpenaiComparer 谓词直接访问 q_r["query"] 可能抛 KeyError @ rtp_llm/test/smoke/case_runner.py:94
- 建议：改为 q_r.get("query", {}) 以防止 q_r 中没有 "query" 键时抛出 KeyError。当前所有 smoke JSON 都有 "query" 键所以不会实际触发，但与同文件 _resolve_endpoint 中 .get() 风格一致更安全。
comparer 注册顺序变化：tau2_bench 从首位移到末位 @ rtp_llm/test/smoke/case_runner.py:94-105
- 建议：当前 tau2_bench 的 query 结构中不含 messages 字段，不会命中 OpenaiComparer 的谓词，暂不影响正确性。但建议文档注释说明顺序依赖，或将 tau2_bench 注册提前以保持与旧代码一致的优先级语义。
FusedQKRMSNorm.forward 使用 flashinfer 但 flashinfer 可能为 None @ rtp_llm/models_py/modules/base/cuda/norm.py:11-14,123
- 建议：在 FusedQKRMSNorm.init 中增加 flashinfer is not None 检查，若为 None 则 raise RuntimeError 提供清晰的错误信息。实际此路径仅在 CUDA 环境下执行不会触发，属于防御性编程。
GrpcConfig::from_json 的 std::stoi 可能因超大数值抛出 out_of_range 异常 @ rtp_llm/cpp/config/ConfigModules.cc:446
- 建议：虽然 gRPC 配置值通常不会溢出 int，但建议用 try/catch 包裹 stoi 或改用带范围检查的解析，以避免恶意/错误输入导致 server 崩溃
transMMInputsPB 参数按值传递导致不必要的 vector 拷贝 @ rtp_llm/cpp/model_rpc/QueryConverter.cc:203
- 建议：参数应改为 const std::vector<MultimodalInput>& mm_inputs，避免拷贝整个 MultimodalInput 向量（含 tensor 和 string 成员）
evaluateRunningMemory 中 size_t 转 int 可能溢出 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:145
- 建议：max_batch_tokens_size_ 是 size_t，强转 int 可能溢出为负数导致比较永远为 true。建议统一用 size_t 或 int64_t 进行比较：< max_batch_tokens_size_
FIFOScheduler 的 ReturnAllProbsMode 可被不可入批的 stream 提前锁定 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:257
- 建议：将 batch_return_all_probs_mode 的设置移到 line 271 的 new_streams.push_back(stream) 之后（即确认 stream 通过 error/event/memory 检查并真正入批后再设置 mode），防止失败的 stream 锁定批模式导致后续兼容 stream 被无谓跳过。
BatchDecodeScheduler 在锁内构造 map + list 临时容器，增加不必要的堆分配 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:143
- 建议：ReturnAllProbsMode 仅有 3 种枚举值（NONE/DEFAULT/ORIGINAL），改为固定 3 桶 array<list, 3> 或直接在 waiting_streams_ 上做分桶迭代（记录 begin iterator + count），避免 map 树节点和 list 节点的堆分配以及 shared_ptr 引用计数的原子操作。
getFeatureHash 对完整多模态 embedding 做同步 D2H 拷贝 + 逐字节 FNV hash @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：考虑用 GPU 上的 row-wise hash 或 xxHash/CRC32C 等 SIMD 友好的 hash 算法只回传紧凑 digest（每行 8B 而非整行数据），避免每次 expandTokenIds 时对完整 embedding 做 D2H 同步。如果维持 CPU hash，至少按 uint64_t 块处理而非逐字节。对于大视频 embedding（num_tokens 数千、hidden_dim 数千），当前实现的 D2H 同步 + CPU 扫描可达数十毫秒。
transMMInputsPB 按值传递 vector 引发整批拷贝 @ rtp_llm/cpp/model_rpc/QueryConverter.h:26
- 建议：将参数改为 const std::vector<MultimodalInput>& mm_inputs（加 const 引用），与同文件 transMMPreprocessConfig 的修正方式一致，避免不必要的 vector + string + Tensor 拷贝。
StreamGroups::updateStreams 缺少 vector 越界检查 @ rtp_llm/cpp/engine_base/stream/StreamGroups.h:238
- 建议：函数开头添加 RTP_LLM_CHECK_WITH_INFO(spec_update_infos.size() == decode_streams_.size() + context_streams_.size(), "spec_update_infos size mismatch") 断言。
KVCacheConfig pickle 缺少 load_cache_retry_times 字段 @ rtp_llm/cpp/pybind/ConfigInit.cc:360
- 建议：在 KVCacheConfig pickle 中追加 load_cache_retry_times 序列化/反序列化，更新 tuple size 检查。跨进程后 retry_times 会静默回退默认值 1。
ParallelismConfig pickle 缺少 local_rank 字段 @ rtp_llm/cpp/pybind/ConfigInit.cc:1013
- 建议：在 ParallelismConfig pickle 中追加 local_rank。虽然 start_backend_server 可能重新设置 local_rank，但其他路径使用 unpickled config 时会得到默认值 0。
needReturnAllProbs 混合模式警告使用 call_once，后续违规静默降级 @ rtp_llm/cpp/engine_base/stream/StreamGroups.h:167
- 建议：使用带频率限制的日志（如每 60 秒一次）替代 call_once，确保持续违规能被运维发现。当前代码仅在进程生命周期内首次出现时告警，之后完全静默。
RemoteMultimodalProcessor 未使用的 vit_cluster_name_ 成员变量 @ rtp_llm/cpp/multimodal_processor/RemoteMultimodalProcessor.h:63
- 建议：移除未使用的 vit_cluster_name_ 成员。死代码可能误导维护者认为存在相关功能。
smoke runner REL_PATH from-import 绑定导致 per-test data_root 切换无效 @ rtp_llm/test/smoke/utils.py:9
- 建议：将 utils.py 第 9 行改为 from rtp_llm.test.smoke import common_def，所有引用改为 common_def.REL_PATH，使动态修改生效
flashinfer 别名扩展未覆盖 py_flashinfer_paged 导致 disable 不完整 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:194
- 建议：在 alias expansion 中添加 blocked.add('py_flashinfer_paged')，或改用前缀匹配：any(b.startswith(name) for b in blocked)
ROCm use_triton_pa=True 不再抑制 AiterDecodeImplAsm 选择 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:222
- 建议：恢复 AiterDecodeImplAsm 的 use_triton_pa 优先级检查：elif 'AiterDecodeImplAsm' in name: return fmha_config.use_triton_pa or not fmha_config.use_asm_pa
enable_fmha=False 在显式 backend 模式下被静默忽略 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:359
- 建议：在 explicit backend 路径中也调用 _is_fmha_impl_disabled_legacy 进行过滤，或在入口处统一检查 enable_fmha 并 early-return None
Gemma 模型 env var 控制与 fmha_config 对象不同步导致 Python 执行路径行为不一致 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:118
- 建议：在 Gemma init 中同步修改 fmha_config.enable_open_source_fmha = False，或在 _is_fmha_impl_disabled_legacy 中同时检查 env var ENABLE_OPENSOURCE_FMHA
Qwen2VL VisionBlock 每层重复计算 max_seqlen 触发 GPU→CPU sync @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:314
- 建议：将 max_seqlen 计算提升到 Qwen2VLVisionTransformer.forward 中 for blk in self.blocks 循环之前，作为参数传入每个 block
fp8_grouped_gemm_ptpc 为 None 时无保护直接调用 @ rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py:241
- 建议：在 cutlass_moe_mm_fp8_scaled 入口添加 assert fp8_grouped_gemm_ptpc is not None, 'rtp_kernel.fp8_group_gemm required'，或在调用方判断可用性

P3

smoke runner 中 json5 import 在每次 test case 中重复尝试 @ rtp_llm/test/smoke_framework/runner.py:157
- 建议：将 json5 可用性检测提前到模块级（如 _HAS_JSON5 = importlib.util.find_spec('json5') is not None），避免在每个 test case 中重复 try/except import。
MagaServerManager 每次 start_server 都重新计算 torch lib 路径 @ rtp_llm/test/utils/maga_server_manager.py:155
- 建议：torch 路径在进程生命周期内不变，可缓存为类变量或模块级常量，减少 os.path 计算。不过此为测试代码，影响微乎其微。
_FakeExecutor 使用类级别可变列表作为测试状态，测试间可能互相影响 @ rtp_llm/test/remote_tests/executor_failover_test.py:52
- 建议：将 _FakeExecutor 的可变状态改为实例属性，或使用 pytest fixture 自动在每个测试前重置。当前 _reset_fake_executor 需要手动调用，遗漏时会导致测试间数据泄漏。
ops/init.py 重复 import sys @ rtp_llm/ops/__init__.py:90
- 建议：删除 line 90 的重复 import sys。
select_topk_op_test 每 subTest 重新创建 Op 实例 @ rtp_llm/models_py/modules/base/cuda/test/select_topk_op_test.py:165
- 建议：SelectTopkOp 构造可能涉及 CUDA kernel 编译/加载，若参数相同可缓存复用（类似 test_select_topk_fake_balance_expert 中的 op_cache 模式）。不影响正确性，仅影响测试执行时间。
per_block_cast_to_fp8 零填充整块可优化 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/bench_utils.py:69
- 建议：先 torch.zeros 再 copy 相当于两次写入 padded 区域。可用 torch.empty + 仅 pad 边界 zero_ 代替，但此函数仅在 test/bench 路径使用，不影响生产性能。
setup.py BuildPyWithProto/SdistWithProto 未处理 ensure_proto 异常信息 @ setup.py:1637
- 建议：ensure_proto_files_generated 内部已有清晰的异常消息，直接传播是合理的。可选：在 BuildPyWithProto.run() 中 catch 并包装为 distutils.errors.DistutilsSetupError 以获得更标准的 pip 错误格式。非阻塞建议。
comparer_registry.py line 106 捕获了过宽的 Exception @ rtp_llm/test/smoke/comparer_registry.py:106
- 建议：缩小为 except ImportError: pass，或在非 ImportError 时 logging.warning 以便排查问题。
GrpcConfig::from_json 使用简单正则解析 JSON，不支持嵌套或负数值 @ rtp_llm/cpp/config/ConfigModules.cc:436
- 建议：regex (\d+) 仅匹配正整数，无法处理负值或浮点数。当前仅用于 gRPC channel arg（均为正整数），但建议注释说明局限性或引入轻量 JSON 解析
ops/init.py LIBDIR 为 None 时 TypeError 逃逸 except OSError @ rtp_llm/ops/__init__.py:96
- 建议：将 except OSError 改为 except (OSError, TypeError)，或在拼接前 guard: if LIBDIR is not None

Checklist ✅ (56 items passed)

Strengths

ssrf_check.py 实现了完整的 SSRF 防护：DNS pin 到验证后的 IP、手动追踪 redirect、关闭 DNS rebinding 窗口，安全设计优秀
device_resource.py 重写为无 torch 依赖的纯 subprocess 检测，GPU lock 逻辑更健壮（zombie context 检测、轮转分配、ROCm 兼容）
smoke_framework 统一了 OSS 和 internal 的 runner 逻辑，消除了大量代码重复，env 泄露通过 finally 恢复
perf_runner.py 的 env 恢复和 sys.argv 隔离避免了跨 test case 的状态污染
VitParameters 修正为 @DataClass + field(default_factory=dict) 解决了可变默认值共享 bug
SSRF adapter 从 URL 重写改为 connection pinning，避免了 TLS SNI/证书验证问题，设计方向正确
redirect 响应新增 response.close() 防止连接泄漏
flashinfer 别名扩展双向一致（flashinfer↔py_flashinfer），blocklist 和 auto 模式均生效
LRO 错误路径增加 infra 分类，使得 GPU Xid、worker setup 网络等基础设施失败可以被正确识别并触发 failover/retry，减少误报的 CI 失败
复用已有的 _classify_execute_response_infra 方法，保持分类逻辑统一

- BatchDecode: partial scheduling + 100ms flush timeout + set reserve - ROCm/CUDA sampling: clone probs, per-request seed, gather+log cum_log_probs - VIT: round-robin addrs, proxy timeout fallback, URL singleflight - MoriEP: shmem finalize, local_world_size, finalize debug log - Quant: MXFP4 stacked MoE keys, fp4_moe_op pickle, fake_balance dtype check - MM: input length/batch output validation, extra_input count check - Renderers: preprocess_config for llama3/kimi_k25, empty TensorPB shape/dtype - Qwen2-VL/Qwen3VL: max_seqlen once, CPU locs reuse - Kernels: FlyDSL sentinel cache, remap fp32 cast, PureDP RS buffer cache - BeamSearch: default large-beam boundary test

LLLLKKKK · 2026-06-23T05:38:58Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/2 · P1/15 · P2/22 · P3/0

Blocking Issues

P0

删除 smoke defs.bzl + BUILD 导致内部 CI 所有 smoke test 构建失败 @ rtp_llm/test/smoke/defs.bzl:1
- 建议：合并前必须保留 defs.bzl 和 BUILD（至少作为 thin wrapper），或同步更新 internal_source 下所有 .bzl 和 BUILD 的引用。
test_store_ssm_state_to_block_map 的 bs>1 分支不执行任何测试 @ rtp_llm/models_py/triton_kernels/fla/test/test_gdn_block_prefill.py:219
- 建议：在 if bs > 1 分支内补充 input_lengths 定义和 _test_one_case() 调用。

P1

BatchDecodeScheduler 部分批次永远不会被调度 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:255
- 建议：timeout 触发后放宽条件为 !waiting_streams_.empty()，或添加独立 partial-batch 超时计数器。
SSRF 检查中 ValueError 被错误捕获导致私有 IP 校验逻辑被跳过 @ rtp_llm/utils/ssrf_check.py:55
- 建议：将 except ValueError 收窄为只捕获 ip_address() 的解析异常。先 try parse，catch 后 fall through；parse 成功时直接校验并 return/raise。
OpenaiComparer 谓词 q_r["query"] 缺少 key 时抛 KeyError @ rtp_llm/test/smoke/case_runner.py:94
- 建议：改为 q_r.get("query", {})。
VitSeperationCaseRunner 对 None 调 stop_server 且泄漏 LLM 进程 @ rtp_llm/test/smoke/multi_inst_case_runner.py:510
- 建议：判空后再 stop，同时在失败路径中也 stop llm_server_manager。
所有 multi_inst runner 的 curl_server 异常路径缺 try/finally，泄漏 GPU 进程 @ rtp_llm/test/smoke/multi_inst_case_runner.py:159
- 建议：将 curl_server 及后续 stop_server 包在 try/finally 中，与基类一致。
PdSeperation/DpSeperation 前端服务器启动后若后续失败被泄漏 @ rtp_llm/test/smoke/multi_inst_case_runner.py:138
- 建议：统一用 try/finally 确保所有已启动 server 在退出前关闭。
LoRA 更新校验用 and 而非 or，允许错误响应静默通过 @ rtp_llm/test/smoke/case_runner.py:591
- 建议：将 and 改为 or。
并发测试 str(str(results[0])) 双重嵌套导致比较逻辑异常 @ rtp_llm/test/smoke/case_runner.py:317
- 建议：改为 str(result) != str(results[0])，并在不一致时显式设置 task_states.ret = False。
setup.py 模块级副作用在 PEP 517 元数据解析时崩溃 @ setup.py:1622
- 建议：延迟到 if __name__ == '__main__' 块内，或 try/except 提供 fallback。
setup.py 重试构建时截断日志文件 @ setup.py:855
- 建议：首次用 'w'，后续 retry 用 'a' 追加，或写独立日志文件。
comparer_registry 用 bare except Exception 吞掉注册错误 @ rtp_llm/test/smoke/comparer_registry.py:106
- 建议：缩窄为 except (ImportError, ModuleNotFoundError)。
conftest.py GPU pool 解析：CUDA_VISIBLE_DEVICES="" 错误 fallback 到 HIP @ conftest.py:59
- 建议：改为 _pool = _cvd if _cvd is not None else (_hvd if _hvd is not None else "")。
concurrency_limit_test setUp 设置环境变量但 tearDown 未清理 @ rtp_llm/test/concurrency_limit_test.py:84
- 建议：用 unittest.mock.patch.dict(os.environ, ...) 替代直接赋值。
fp8_grouped_gemm_ptpc 为 None 时直接调用导致 TypeError @ rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py:241
- 建议：入口处检查 None 时抛 RuntimeError，或 import 失败时直接 raise。
kimi_k25_renderer 访问 ImageURL 上不存在的 preprocess_config 属性 @ rtp_llm/openai/renderers/kimi_k25_renderer.py:68
- 建议：改为 part.preprocess_config（两处：line 68 条件和 line 70 参数）。

Non-blocking Suggestions

P2

BatchDecodeScheduler wait_for 超时从 30s 降到 100ms 导致 CPU 空转 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:241
- 建议：队列为空时用较长超时(1-5s)，有 waiting_streams_ 时用 100ms flush 超时。
_get_moe_quant_ckpt_infos 对混合 stacked/per-expert 权重未防御 @ rtp_llm/model_loader/mixed_fp4_quant_weight.py:256
- 建议：添加 assert 确认所有权重名要么全匹配要么全不匹配 W_SUFFIX。
sampleGreedy 非 greedy 路径无条件 clone probs_t，热路径增加 GPU 开销 @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:480
- 建议：条件化 clone: filtered_probs = params.return_original_all_probs ? probs_t.clone() : probs_t;
grpc_util trans_tensor 反序列化丢失空 tensor 的多维 shape @ rtp_llm/utils/grpc_util.py:32
- 建议：空 tensor 分支改用 torch.empty(list(t.shape), dtype=...) 重建原始 shape。
grpc_util trans_from_tensor 空 tensor 不支持 int64/uint8 等 dtype @ rtp_llm/utils/grpc_util.py:66
- 建议：增加 INT64 等常用 dtype，或对未知 dtype 降级返回空 TensorPB()。
MultimodalEmbeddingInjector.forward 类型注解未更新 @ rtp_llm/models_py/modules/base/common/multimodal_embedding.py:64
- 建议：改为 Union[torch.Tensor, Sequence[int]]。
comparer_registry Tau2BenchComparer 优先级从最高变为最低 @ rtp_llm/test/smoke/case_runner.py:105
- 建议：将 Tau2BenchComparer 注册移到 OpenaiComparer 之前。
utils.py 导入时捕获 REL_PATH 快照，set_default_data_root 后仍用旧值 @ rtp_llm/test/smoke/utils.py:9
- 建议：改为 from rtp_llm.test.smoke import common_def，使用 common_def.REL_PATH 动态引用。
normal_comparer 流式响应 chunks 为空时 IndexError @ rtp_llm/test/smoke/normal_comparer.py:135
- 建议：添加空列表检查：if not chunks: raise SmokeException(...)。
normal_comparer aux_info 比对在 actual 为 None 时静默跳过 @ rtp_llm/test/smoke/normal_comparer.py:508
- 建议：当 expect.aux_info not None 但 actual.aux_info is None 时追加 diff。
start_servers_parallel 无超时保护，服务器挂起导致测试永久阻塞 @ rtp_llm/test/smoke/case_runner.py:682
- 建议：给 as_completed() 添加 timeout 参数（如 3600s）。
mm_process_engine.py for 循环体缩进异常 @ rtp_llm/multimodal/mm_process_engine.py:596
- 建议：将循环体从 indent=20 调整为 indent=16。
DeepGEMM 导入异常未捕获 ImportError @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:146
- 建议：将 ImportError 加入捕获列表。
ops/init.py LIBDIR 为 None 时 TypeError 未被捕获 @ rtp_llm/ops/__init__.py:96
- 建议：except (OSError, TypeError) 或拼接前检查 LIBDIR is not None。
comparer_registry mainse_comparer 导入路径和类名均错误 @ rtp_llm/test/smoke/comparer_registry.py:92
- 建议：修正导入路径为正确的子包路径，并使用实际存在的类名。
MagaServerManager.init 默认参数使用 mutable dict @ rtp_llm/test/utils/maga_server_manager.py:31
- 建议：改为 env_args = None，函数体中 self._env_args = env_args or {}。
EnvArgumentParser._env_mappings 是类变量，多实例间状态泄漏 @ rtp_llm/server/server_args/server_args.py:165
- 建议：改为实例变量（init 中初始化）。
vit_rpc_server 中 context.abort() 后未 return @ rtp_llm/server/vit_rpc_server.py:64
- 建议：abort() 后显式 return，意图更清晰。
safe_request_get 重定向耗尽时最后一个 response 未关闭 @ rtp_llm/utils/ssrf_check.py:182
- 建议：raise 前 response.close() 或用 try/finally 包裹。
conftest.py GPU 内存清理 bare except 吞掉所有异常 @ conftest.py:211
- 建议：改为 except Exception as e: logger.warning(...)。
conftest.py faulthandler 文件描述符泄漏 @ conftest.py:99
- 建议：注册 atexit handler 做 flush + close。
xdist gw0 cleanup 与其他 worker 存在竞态 @ conftest.py:34
- 建议：gw0 只删除上一个 session 的文件（通过 session marker 或时间戳区分）。

Checklist ✅ (56 items passed)

Strengths

QueryConverter::transMMOutput 新增 extra_input count 与 split_sizes count 校验，防止越界或数据不匹配
MoE 权重加载重构为 _get_moe_quant_ckpt_infos 统一处理 per-expert/stacked 两种格式，消除大量重复代码
EmbeddingEndpoint 升级为 round-robin 多后端地址列表，改善 VIT 负载均衡
CudaOps.cc 修复 cudaEvent 同步时序 bug — event 在 invokeBatchCopy 之后 record，确保 workspace tensor 在 kernel 完成前不被释放
multimodal_util.py singleflight 模式优雅解决并发下载同一 URL 的问题

- multi_runner.sh: PID tracking + EXIT_CODE propagation for build/kill/copy/clean/test - Fix broken [ -z "$TP_SIZE" ] / [ -z "$MODEL_TYPE" ] spacing - .bazelrc: document cuda12 as shared base config; use cuda12_6/12_9/12_9_arm - pyproject.toml: document transformers 4.51.2 pin rationale - oss_optional_extras.toml: document aiter 0.1.13.dev14 pin rationale

- Restore rtp_llm/test/smoke/defs.bzl and thin BUILD wrapper for internal CI. - test_gdn_block_prefill.py: run _test_one_case for bs > 1. - BatchDecodeScheduler.h: schedule partial batch when flush timeout fires. - ssrf_check.py: narrow ValueError catch so private-IP validation propagates. - case_runner.py: OpenaiComparer get("query"), concurrency compare fix, LoRA or validation. - multi_inst_case_runner.py: try/finally cleanup for Pd/Dp/Vit/FrontApp; null-safe stop. - setup.py: move install_requires/extras computation under if __name__ == "__main__"; append retry logs. - comparer_registry.py: narrow bare except to ImportError/ModuleNotFoundError. - conftest.py: preserve explicit empty CUDA_VISIBLE_DEVICES pool. - concurrency_limit_test.py: patch env vars so tearDown restores them. - fp8_kernel.py: guard fp8_grouped_gemm_ptpc None with RuntimeError. - kimi_k25_renderer.py: use part.preprocess_config, not image_url.preprocess_config.

- CudaSampleOp: conditional clone only when return_original_all_probs - grpc_util: preserve multi-D shape for empty tensors; graceful dtype fallback - multimodal_embedding: fix forward() type annotation for multimodal_locs - case_runner: raise Tau2BenchComparer priority; add as_completed timeout - utils.py: use common_def.REL_PATH dynamic reference instead of snapshot - normal_comparer: guard against empty chunks IndexError; detect aux_info mismatch - mm_process_engine: fix excessive indentation in for-loop body - deepgemm_wrapper: add ImportError to exception catch list - ops/__init__: catch TypeError when LIBDIR is None - maga_server_manager: replace mutable default args with None - server_args: move _env_mappings from class var to instance var - vit_rpc_server: add explicit return after context.abort() - ssrf_check: close response before raising on max redirects - conftest: log GPU cleanup exceptions; register atexit for faulthandler fd - mixed_fp4_quant_weight: assert stacked/per-expert weight consistency

LLLLKKKK · 2026-06-23T10:33:26Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/1 · P1/8 · P2/31 · P3/8

Blocking Issues

P0

EnvArgumentParser._env_mappings 类属性不存在导致首次 add_argument 崩溃 @ rtp_llm/server/server_args/server_args.py:237
- 建议：将 line 237 的 EnvArgumentParser._env_mappings 改为 self._env_mappings，使 _register_env_mapping 写入实例属性，与 parse_args/logging 读取路径一致。同时检查 print_env_mappings/get_env_mappings（lines 398-419）也改用 self._env_mappings。

P1

SmokeException 缺少 error_status 参数，运行时会 TypeError @ rtp_llm/test/smoke/normal_comparer.py:137
- 建议：改为 raise SmokeException(QueryStatus.VISIT_FAILED, "Streaming response chunks are empty but return_incremental is True")，与同文件其他调用点保持一致
enable_remote_cache=False 时访问未初始化的 self.remote_kvcm_server 导致 AttributeError @ rtp_llm/test/smoke/multi_inst_case_runner.py:159
- 建议：在 PdSeperationCaseRunner.run() 和 DpSeperationCaseRunner.run() 开头加 self.remote_kvcm_server = None，或在 _cleanup_servers 调用时用 getattr(self, 'remote_kvcm_server', None)
concurrency_test 不一致时缺少 break，后续一致结果会覆盖 err_msg @ rtp_llm/test/smoke/case_runner.py:316
- 建议：找到第一个不一致后加 break，保留最早发现的差异信息。
grpc_util.trans_from_tensor 空 tensor 静默丢弃 shape 信息 @ rtp_llm/utils/grpc_util.py:66
- 建议：对不支持 dtype 的空 tensor，不要返回全新 TensorPB()，而是保留已设置的 shape，只跳过 data_type 设置（或设为默认 FP32）：res.data_type = TensorPB.DataType.FP32; return res。这样 shape 信息不丢失。
_env_mappings 类变量→实例变量迁移不完整，_register_env_mapping 会 crash @ rtp_llm/server/server_args/server_args.py:237
- 建议：将 line 237 改为 self._env_mappings[action.dest] = full_env_name。同时将 print_env_mappings/get_env_mappings 中所有 EnvArgumentParser._env_mappings 引用（lines 398, 400, 405, 415, 416）也改为 self._env_mappings。已用 Python 复现确认 AttributeError。
BatchDecodeScheduler 超时从30s降到100ms，空闲时CPU唤醒频率提高300倍 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:241
- 建议：100ms flush timeout 在空闲时会导致 schedule() 每秒唤醒10次（之前每30s一次）。建议：(1) 仅在 waiting_streams_ 非空时使用短超时，空队列时保持长超时；或 (2) 将 kFlushTimeoutMs 提高到 1-5 秒，因为 enqueue() 会 notify cond_ 所以新请求不会被阻塞。
BatchDecodeScheduler: FINISHED zombie stream 未从 waiting_streams_ 清除，新 timeout 路径放大问题 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:204-216
- 建议：在 line 205 的 remove_if 之前，将即将被移除的 FINISHED streams 也加入 scheduled_ptrs（或单独从 waiting_streams_ 移除）。否则在新的 !woken timeout 路径下，这些 zombie 会导致每 100ms 重复进入 evaluateWaitingStreams 并对 FINISHED stream 调用 reportEvent(CanRun)，可能触发非法状态转移。
docs/README.md 引用了已删除的 docs/requirements.txt @ docs/README.md:10
- 建议：删除 README.md 第 10 行引用 docs/requirements.txt 的 pip install 命令，或将其依赖内联到第 6 行已有的 sphinx/myst-nb 安装命令中。

Non-blocking Suggestions

P2

_cleanup_servers 中单个 stop_server 异常会阻断后续服务器清理 @ rtp_llm/test/smoke/multi_inst_case_runner.py:21
- 建议：在 _stop_server_safe 中加 try/except 包裹 manager.stop_server()，记录 warning 后继续清理下一个 manager。虽然 MagaServerManager.stop_server() 内部有异常处理，但 remote_kvcm_server 的 stop_server/copy_logs 没有同样的保护。
test_gdn_block_prefill 新增 bs>1 分支缺少 prefix_lengths 带 0 后缀的测试覆盖说明 @ rtp_llm/models_py/triton_kernels/fla/test/test_gdn_block_prefill.py:224
- 建议：考虑在 bs>1 分支也增加一组 input_lengths 为 seq_size_per_block 对齐的测试用例，与 bs==1 分支保持对称覆盖
sentinel cache 未覆盖 int64 dtype，dummy.to() 每次仍分配新 tensor @ rtp_llm/models_py/triton_kernels/fla/flydsl_chunk_gdn_mi308x.py:1145
- 建议：直接调用 _get_zero_sentinel(q.device, torch.int64) 获取缓存的 int64 sentinel，避免每次 dummy.to(torch.int64) 创建新 0-size tensor（_launch_tail_safe_into 和 _launch_fast_into 两处均需修改，共 4 处 .to() 调用）
_sentinel_cache 非线程安全 @ rtp_llm/models_py/triton_kernels/fla/flydsl_chunk_gdn_mi308x.py:52
- 建议：多线程场景下 check-then-set 可导致重复创建（对 0-size tensor 无功能影响但非惯用写法）。可改用 functools.lru_cache 或加 threading.Lock。实际风险很低——worst case 多创建一个 0-size tensor。
fp8_kernel.py 在 CUDA 平台 import 失败时只 fallback 了 fp8_grouped_gemm_ptpc，compute_ops 的 import 未保护 @ rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py:22
- 建议：当前 PR 的 fp8_grouped_gemm_ptpc None guard 改进正确；但 compute_ops 的 5 个 import 不在 try/except 中，如果 compute_ops .so 加载失败会导致整个模块不可用。建议后续考虑对 compute_ops import 也加 try/except 并设置 fallback，使 cutlass_moe_mm_fp8_scaled 等函数在适当位置 raise RuntimeError 而不是 ImportError 崩溃。
ssrf_check.py session 资源泄漏 @ rtp_llm/utils/ssrf_check.py:180
- 建议：使用 with requests.Session() as session: 上下文管理器，确保 session 在所有路径都被正确关闭。
ssrf_check.py redirect 空 Location 时 response 未关闭 @ rtp_llm/utils/ssrf_check.py:196
- 建议：结合上一条 session 泄漏修复，用 context manager 统一管理。
concurrency_test 中所有结果都成功且一致时返回空 TaskStates @ rtp_llm/test/smoke/case_runner.py:302
- 建议：在 for 循环之后（所有结果一致的分支），设置 task_states = results[0] 将成功结果传递出去。
multi_inst_case_runner _cleanup_servers 中 stop_server 异常会中断后续清理 @ rtp_llm/test/smoke/multi_inst_case_runner.py:15
- 建议：在 _stop_server_safe 中用 try/except 包裹 stop_server() 调用，记录但不传播异常，确保所有 server 都能被清理。
setup.py 中 dynamic_version() 移入 main 但 PEP 621 read_attr 仍需要它 @ setup.py:1660
- 建议：确认 pyproject.toml 的 version dynamic 配置不依赖 setup.py 模块级变量 version（而是函数 dynamic_version）。如果 read_attr 指向 setup.version 而非 setup.dynamic_version，则会 AttributeError。
OpenaiComparer predicate 中 in 检查对 query 为字符串时语义变化 @ rtp_llm/test/smoke/case_runner.py:95
- 建议：如果 query 预期总是 dict，可以加类型检查：isinstance(q_r.get("query"), dict) and "messages" in q_r["query"]。
grpc_util.trans_tensor 空 tensor 路径增加了 list(t.shape) 分配 @ rtp_llm/utils/grpc_util.py:34
- 建议：影响轻微（仅空 tensor 路径，非推理热路径），且语义更正确（保留 shape 信息）。建议保留但确认 shape 为 () 时 list(t.shape) = [] 的语义是否与 torch.empty(0) 一致。
ssrf_check.py 重定向上限到达后 response.close() 依赖 locals() 检查 @ rtp_llm/utils/ssrf_check.py:202
- 建议：在 for 循环前初始化 response = None，然后用 if response is not None: response.close()，避免使用 locals() hack。
conftest.py _gpu_mem_monitor 每个测试函数执行 import gc + import torch @ conftest.py:211
- 建议：开销在测试框架中可接受（non-production path），且防止 GPU 内存泄漏是正确做法。建议将 import 提到 fixture 外部（模块级别），减少每次函数调用时的 sys.modules 查找。属于 P3 级别优化，不阻塞。
_build_session_deselect_args 中 N+1 gRPC 查询模式 @ rtp_llm/test/remote_tests/plugin.py:1858
- 建议：先遍历所有候选 entry 收集全部需要验证的 digest，一次性调用 _find_missing(all_digests) 批量查询（已支持 1000/batch），再根据 missing set 过滤 entry。可将 500 次 gRPC round-trip 降为 1-2 次。
条件克隆的别名关系脆弱，未来修改易引入隐蔽 bug @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:480
- 建议：当前逻辑经验证正确: probs_t 在 line 544 的使用被 return_original_all_probs 保护, cum_log_probs 在 line 553 使用的是 division 后的新 filtered_probs。但别名 + 原地修改 + 条件重赋值的组合较脆弱, 建议在 filtered_probs 声明附近添加一行注释说明 'filtered_probs 与 probs_t 可能共享存储, probs_t 在 filtering 后不可再用于需要原始分布的场景'
loader.py 每次循环都调用 get_device_properties() 查询 GPU 总内存 @ rtp_llm/model_loader/loader.py:422
- 建议：将 device_props 和 total_memory 提前到循环外缓存一次（如 total_mem = torch.cuda.get_device_properties(device).total_memory），避免重复 CUDA driver 调用。虽然有 Python cache 但写法更清晰且保险。
FNV-1a byte-by-byte 哈希大 embedding 行性能差 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:35
- 建议：FNV-1a 逐字节哈希在大 row（~8KB/行 × 数百行）时较慢。可考虑 xxHash/wyhash 等块哈希算法（8字节一步），或用 std::hashstd::string_view 简洁实现（旧代码就是这样）。不过这在 prefill prep path 上不是 decode 热路径，影响有限。
BatchDecodeScheduler evaluateWaitingStreams 不再等 batch_size_ 就调度 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:195
- 建议：改为任何非空 group 立即调度，目的是避免混合 ReturnAllProbsMode 的 group 永远凑不齐 batch_size_。但对不使用 all_probs 的常规负载，每100ms超时后即使只有1个stream也会触发调度，牺牲了 batching 效率。建议增加最小 batch 阈值或区分有/无 mode 隔离的场景。
BatchDecodeScheduler mode_groups 使用 std::map 不必要的排序开销 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:143
- 建议：ReturnAllProbsMode 只有 3 个枚举值（NONE/DEFAULT/ORIGINAL），使用 std::unordered_map 或直接用 3 个 list 变量更高效，避免红黑树分配。不过 waiting_streams_ 通常不大，实际影响轻微。
combo_position_ids 从 CPU pinned 改为 CUDA 直接分配，CUDA graph capture 时可能影响 update 路径 @ rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc:467
- 建议：如果 combo_position_ids 需要在 graph replay 前从 CPU 更新数据到 GPU（cudaMemcpyAsync H2D），直接分配在 GPU 上意味着 graph update 时需要通过其他方式写入。需确认 graph update path 不依赖 H2D copy，否则可能需要单独的 staging buffer。标记为需交叉确认。
case_runner.py 并发结果对比循环未在首次不一致时 break @ rtp_llm/test/smoke/case_runner.py:316-321
- 建议：在设置 task_states.ret = False 和 err_msg 后添加 break，保留首个不一致的错误信息便于调试。
cuda_impl/test/conftest.py 过度跳过：flashinfer 不可用时 TRT 测试被误跳 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/conftest.py:3
- 建议：将 conftest.py 的 flashinfer 检查移到只影响 flashinfer 测试的层级，或者在 trt_tests/ 子目录添加独立 conftest.py 覆盖父级 skip。TRT 测试已有自己的 try/except 守卫，不需要依赖 flashinfer 可用性。
sparse_mla_decode_op_test.py 包含硬编码用户路径 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/test/sparse_mla_decode_op_test.py:28
- 建议：移除硬编码的个人路径 /data2/baowending.bwd/new/RTP-LLM/github-opensource/。此路径在 CI 和其他开发者机器上不存在，可能导致导入行为不一致。这是 pre-existing 的问题，但既然添加了 pytestmark 做了迁移，应一并清理。
test_py_headwise.py 测试尺寸大幅缩减但 qo_len 始终固定 64 @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/test/test_py_headwise.py:235
- 建议：缩减测试规模可以理解（CI OOM 风险），但 qo_len 固定为 64 意味着 long-context prefill 这个 headwise 的核心场景不再被测到。建议至少保留一个 qo_len > page_size 的 case（如 1024），确保跨 page 的 headwise attention 逻辑被覆盖。
FusedQKRMSNorm.forward 在 flashinfer=None 时报错信息不清晰 @ rtp_llm/models_py/modules/base/cuda/norm.py:123
- 建议：在 FusedQKRMSNorm.init 中添加 if flashinfer is None: raise RuntimeError('FusedQKRMSNorm requires flashinfer (CUDA-only)') 以提供清晰的错误信息。
install.md 源码编译说明中注释与实际命令不一致 @ docs/start/install.md:31-36
- 建议：更新注释以匹配新的 pip wheel 构建流程，删除或修正 bazel-out/ 路径的 symlink 步骤（pip wheel 会将 proto 文件打包进 wheel 中）。
XQAWrapper.forward 每次调用触发 GPU->CPU 同步（死代码） @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:295
- 建议：当前为死代码不阻塞合入。若未来恢复 XQADecodeImpl：(1) 将 batch geometry 计算移回 prepare()，(2) 恢复自适应 nb_sub_seq 计算逻辑，(3) 将 xqa import 移到 init 中。建议将 TODO 转为正式 issue 跟踪。
AiterDecodeImplAsm 移除了 use_triton_pa 互斥检查，依赖注册顺序保证优先级 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:124-128
- 建议：在 _is_fmha_impl_disabled_legacy 中添加注释说明 AiterDecodeImplAsm 的优先级依赖于 init.py 中的注册顺序（Triton 先于 Asm），或恢复 use_triton_pa 互斥检查以显式保证行为。
trans_mm_input list 路径将 Python MMPreprocessConfig 直接传给 C++ 构造函数 @ rtp_llm/multimodal/multimodal_util.py:318
- 建议：新增 trans_config_from_py(config: MMPreprocessConfig) -> _CppMMPreprocessConfig 转换函数，在 list 路径调用；或将 Python MMPreprocessConfig 补齐 crop_positions/mm_timeout_ms 字段后统一走 trans_config 逻辑。
compute_ops._is_cuda() 与 arch.py::is_cuda() 逻辑重复，易于不同步 @ rtp_llm/ops/compute_ops.py:10
- 建议：改为 from rtp_llm.models_py.utils.arch import is_cuda 直接复用，避免逻辑分叉。如因循环导入无法直接 import，则提取到公共模块。

P3

get_bytes_io_from_url singleflight 失败后 waiter 各自重试，未能传播失败 @ rtp_llm/multimodal/multimodal_util.py:222
- 建议：当 downloader 失败时，所有等待线程各自独立重新下载同一 URL，singleflight 保护失效。可考虑缓存异常结果或短期 negative cache，让后续 waiter 快速失败。当前设计是 fail-open（容错优先），在大量并发对同一故障 URL 时会放大下载压力。
setup.py 模块级计算移入 main 块，改善导入性能 @ setup.py:1639
- 建议：这是性能改进，减少了 metadata-only 场景的不必要计算和网络请求。无需修改。
multimodal_locs 类型注解风格不一致 @ rtp_llm/models_py/modules/base/common/multimodal_embedding.py:64
- 建议：统一使用 Union[torch.Tensor, Sequence[int]]（同文件中两个类保持一致），或统一用 PEP 604 union 语法
conftest.py _close_fault_file atexit 顺序风险 @ conftest.py:110
- 建议：在 _close_fault_file 中先调用 _fh.disable() 再 flush/close，避免 faulthandler 在文件关闭后尝试写入
grpc_util.py: 空 tensor 未知 dtype 的静默降级可能改变语义 @ rtp_llm/utils/grpc_util.py:65
- 建议：可接受的行为改进。建议添加 logging.debug 记录被降级的 dtype，方便排查问题。
ROCm conftest 使用 collect_ignore_glob 而非 pytest.skip @ rtp_llm/models_py/modules/base/rocm/test/conftest.py:10
- 建议：其他 conftest.py（如 cuda/test/conftest.py）使用 pytest.skip(allow_module_level=True) 做平台守卫。ROCm conftest 使用 collect_ignore_glob 虽然功能等价，但风格不一致且静默忽略（不产生 SKIPPED 统计）。建议统一为 pytest.skip 以便在测试报告中可见。
get_xqa_impl 中残留 23 行注释掉的代码 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:367-388
- 建议：将 TODO 转为正式 issue 跟踪，移除注释掉的代码块。如果 XQADecodeImpl 当前无法使用，考虑将其整体标记或移除。
flashinfer def.bzl 中 sub_lib 函数成为死代码 @ 3rdparty/flashinfer/def.bzl:17-36
- 建议：确认内源 override 也不再调用后，在后续清理 PR 中删除 sub_lib 和 preloaded_deps。

Checklist ✅ (56 items passed)

Strengths

Tau2BenchComparer 注册优先级从最末提到最前，避免 tau2_bench 标记的 case 被 NormalComparer 错误匹配——这是一个实际正确性修复
utils.py 从 import REL_PATH 值绑定改为 import common_def 模块绑定，修复了 set_default_data_root() 修改全局 REL_PATH 后引用还指向旧值的问题
as_completed 添加 3600s 超时，避免并行服务器启动挂死时测试永远不退出
新增空 chunks 检查（line 136-137）可以防止后续 chunks[-1] 的 IndexError 崩溃
aux_info 新增 expect 非空但 actual 为空的比较分支，堵住了静默通过的边界 case
multi_inst_case_runner 使用 try/finally + _cleanup_servers 统一清理逻辑，修复了原代码中服务器泄漏的问题
comparer 注册顺序修正：tau2_bench 移到 OpenaiComparer 之前，避免含 messages 的 tau2_bench case 被 OpenAI comparer 误匹配
lora update 条件从 and 改为 or，修复了 status 不符但 response 恰好不同时跳过检测的逻辑漏洞
kimi_k25_renderer.py 修复了 preprocess_config 从错误对象(image_url)读取到正确对象(ContentPart)的 bug，与 api_datatype.py 定义一致
deepgemm_wrapper.py 补齐了 ImportError 异常捕获，防止 find_spec 成功但 import 失败时抛出未捕获异常

P0: - server_args: fix EnvArgumentParser._env_mappings → self._env_mappings in _register_env_mapping, print_env_mappings, get_env_mappings P1: - normal_comparer: add QueryStatus.VISIT_FAILED to SmokeException - multi_inst_case_runner: init self.remote_kvcm_server = None in Pd/Dp runners - case_runner: add break on first concurrency inconsistency - grpc_util: preserve shape on unsupported empty tensor dtype (FP32 fallback) - BatchDecodeScheduler: conditional timeout (5s idle / 100ms busy); remove FINISHED zombie streams from waiting_streams_ - docs/README.md: inline docs deps, remove deleted requirements.txt ref P2: - multi_inst_case_runner: try/except in _stop_server_safe and cleanup - ssrf_check: use session context manager; init response=None - case_runner: assign results[0] on consistent concurrency; OpenaiComparer type-safe predicate with isinstance check - conftest: _fh.disable() before closing fault file - sparse_mla_decode_op_test: remove hardcoded sys.path - norm.py: FusedQKRMSNorm None flashinfer check in __init__ - docs/start/install.md: update source build instructions for pip wheel

LLLLKKKK · 2026-06-24T05:56:29Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/1 · P1/6 · P2/14 · P3/2

Blocking Issues

P0

rtp_llm/BUILD 删除所有 py_library 目标导致 smoke 和多个 py_test 在 Bazel analysis 阶段失败 @ rtp_llm/BUILD:1
- 建议：同步更新 smoke defs.bzl 和各 test BUILD，将 //rtp_llm:testlib 等替换为新依赖路径或重新定义这些目标。

P1

_write_final_stream_files 无条件覆盖 ByteStream 已拉取的内容 @ rtp_llm/test/remote_tests/executor.py:645
- 建议：仅在 ByteStream 未启动或 result.stdout_raw 更长时才覆盖，否则保留流式拉取的内容。
_build/platform.py CUDA 环境下 version.json 缺失时静默跳过检测 @ _build/platform.py:437
- 建议：在 cuda_version 为 None 时 fallback 到 nvcc --version 输出或默认 cuda12_6，而非 fall through。
build_remote_setup_command 无条件调用 prepare_venv.py 但该文件不存在 @ rtp_llm/test/remote_tests/remote_exec_rtp.py:392
- 建议：将 prepare_venv.py 加入 PR 或在 build_remote_setup_command 中加 if exists() 守卫。
mainse 比较器注册导入路径错误且 MainseArpcComparer 类不存在 @ rtp_llm/test/smoke/comparer_registry.py:92
- 建议：修正导入路径为 rtp_llm.test.smoke.mainse.mainse_comparer 并使用 MainseDecodeArpcComparer / MainseEmbeddingArpcComparer 分别注册。
GenericMoeLayer.forward() 每次调用都重新创建 GroupTopK C++ Op @ rtp_llm/models_py/model_desc/generic_moe.py:131
- 建议：将 GroupTopK() 及 renormalize/num_expert_group/topk_group 属性移到 init 中初始化，forward() 直接复用。
SSRF 保护在旧版 requests 的 get_connection 回退路径中被静默跳过 @ rtp_llm/utils/ssrf_check.py:144
- 建议：将 except ValueError: pass 改为 raise 或至少 log warning。如确认 send() 已前置校验则加注释说明安全依据。

Non-blocking Suggestions

P2

download_blob 静默吞掉所有 gRPC 错误并返回空字节 @ rtp_llm/test/remote_tests/cas_client.py:317
- 建议：添加 log.warning 记录失败原因，与 _bytestream_read 的日志行为保持一致。
_bytestream_write_file_parallel 未校验 committed_size @ rtp_llm/test/remote_tests/cas_client.py:406
- 建议：添加与 _bytestream_write 相同的 committed_size 校验。
ExecutorEndpointPool 无线程安全保护 @ rtp_llm/test/remote_tests/endpoint_info.py:103
- 建议：添加 threading.Lock 保护 refresh()/advance()/current_endpoint() 中的共享状态。
BatchDecodeScheduler 空闲超时 5s 可能延迟首请求 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:248
- 建议：在 predicate 中加入 !waiting_streams_.empty() 确保首请求及时调度。
FlashInferTRTLLMPrefillOp.forward() 在 Optional 守卫前访问 kv_cache @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trtllm_gen.py:383
- 建议：去掉 Optional 标注改为必传，或在函数顶部 assert kv_cache is not None。
_execute_with_retry MAX_RETRIES 负值时返回 None 违反类型约束 @ rtp_llm/test/remote_tests/plugin.py:779
- 建议：函数头部验证 MAX_RETRIES >= 0 或返回前 assert last_result is not None。
distributed_server_test 17 个 test 方法嵌套在 tearDown 内永远不被发现 @ rtp_llm/distribute/test/distributed_server_test.py:178
- 建议：将 test 方法缩进退回 4 空格（类体级别）。
runner 多角色模式缺 ENABLE_STABLE_SCATTER_ADD @ rtp_llm/test/smoke_framework/runner.py:83
- 建议：多角色路径也设置 ENABLE_STABLE_SCATTER_ADD=ON，或加注释说明有意省略。
OpenaiEndpoint logprobs=false 无条件覆盖 extra_configs.return_all_probs @ rtp_llm/cpp/api_server/openai/OpenaiEndpoint.cc:87
- 建议：与 true 分支保持对称，或注释说明 false 优先级高于 extra_configs 是故意设计。
warp_topk occupancy 缓存 TOCTOU 导致重复 hipOccupancy 调用 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:605
- 建议：用 std::shared_mutex 或 hold-lock-through-compute。k 值种类极少，锁竞争可忽略。
MultimodalProcessor::getFeatureHash GPU 上做 contiguous() 浪费显存 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：改为 mm_emb.to(torch::kCPU).contiguous() 避免 GPU 临时分配。
EP 共享专家门控退化为未融合的 sigmoid * mul @ rtp_llm/models_py/model_desc/generic_moe.py:186
- 建议：EP 路径也使用 sigmoid_gate_scale_add 融合算子。
ssrf_check _validate_url 抛异常时 session 未关闭 @ rtp_llm/utils/ssrf_check.py:184
- 建议：将 _validate_url(url) 移到 with session: 块内部，或延后 session 创建。
FNV-1a 逐字节哈希在 prefill 路径上较慢 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:35
- 建议：考虑 XXH3_64bits 或 std::hash<string_view>，吞吐量可提升 5-10x。

P3

DpSeperationCaseRunner 错误信息引用了 PdSeperationCaseRunner @ rtp_llm/test/smoke/multi_inst_case_runner.py:214
- 建议：改为 'DpSeperationCaseRunner'。
runs_plugin _clone_item 共享 keywords 引用 @ rtp_llm/test/smoke_framework/runs_plugin.py:52
- 建议：用 dict(item.keywords) 浅拷贝。

Checklist ✅ (56 items passed)

Strengths

executor.py 实现了三层超时防护（gRPC deadline、wall-clock watchdog、QUEUED watchdog），确保任何层级的超时都能被捕获
FailoverRemoteExecutor 实现了 DNS 轮换失败转移，含 infra 分类（GPU Xid、IO、网络等）和全局预算控制
output_collector.py 的 _safe_extract 实现了路径遍历防护（normpath + resolve 双重检查）
FusedQKRMSNorm 构造时增加 flashinfer None 检查，将运行时 AttributeError 提前为清晰的 RuntimeError
test_cache.py 实现了两级缓存（本地 + ActionCache），含 TTL 过期和 CAS blob 驱逐检测

P0: - rtp_llm/BUILD: restore 12 py_library stub targets (pip shims + testlib/sdk) required by smoke defs.bzl for Bazel analysis P1: - executor.py: _write_final_stream_files only overwrites when ByteStream not started or result data is longer - platform.py: fallback to nvcc --version when version.json missing, ultimate default cuda12_6 - remote_exec_rtp.py: guard prepare_venv.py with if-exists check - comparer_registry.py: fix mainse import path and class names (MainseDecodeArpcComparer / MainseEmbeddingArpcComparer) - generic_moe.py: move GroupTopK() and config attrs to __init__ - ssrf_check.py: re-raise ValueError in get_connection fallback for security parity with send()

P2: - cas_client: log.warning in download_blob on gRPC failure - cas_client: add committed_size check in _bytestream_write_file_parallel - endpoint_info: add threading.RLock to ExecutorEndpointPool - BatchDecodeScheduler: predicate checks !waiting_streams_.empty() for immediate first-request wakeup - trtllm_gen: assert kv_cache is not None in forward() - plugin: validate MAX_RETRIES >= 0 in _execute_with_retry - distributed_server_test: dedent 17 test methods from tearDown to class body level - runner: add ENABLE_STABLE_SCATTER_ADD=ON in multi-role path - OpenaiEndpoint: add comment explaining logprobs=false priority - warp_topk: upgrade to std::shared_mutex for occupancy cache - MultimodalProcessor: .to(kCPU).contiguous() avoids GPU temp alloc - MultimodalProcessor: replace FNV-1a with std::hash<string_view> - generic_moe: EP path uses fused sigmoid_gate_scale_add operator - ssrf_check: _validate_url before session creation P3: - multi_inst_case_runner: fix DpSeperation error message - runs_plugin: shallow copy keywords in _clone_item

LLLLKKKK · 2026-06-24T13:03:34Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/7 · P2/44 · P3/11

Blocking Issues

P1

UpdateWeights 中 PyErr_Fetch 后未释放 PyObject 引用导致内存泄漏 @ rtp_llm/cpp/model_rpc/LocalRpcServer.cc:638
- 建议：PyErr_Fetch 获取的 type/value/traceback 必须用 Py_XDECREF 释放，否则每次 Python 异常都会泄漏。此外 py::error_already_set 已捕获异常，直接用 e.what() 即可，无需手动 PyErr_Fetch。改为：catch (const py::error_already_set& e) { return {grpc::StatusCode::INTERNAL, "exception from python: " + std::string(e.what())}; }
mainse comparer import 路径错误，导致内部 mainse 冒烟测试静默注册失败 @ rtp_llm/test/smoke/comparer_registry.py:92
- 建议：将 import 拆分为两个正确的模块路径：from rtp_llm.test.smoke.mainse.mainse_decode_arpc_comparer import MainseDecodeArpcComparer 和 from rtp_llm.test.smoke.mainse.mainse_embedding_arpc_comparer import MainseEmbeddingArpcComparer；或者在 mainse/init.py 中做 re-export
reranker_comparer 对比失败抛出 Exception 而非 SmokeException，导致错误分类不一致 @ rtp_llm/test/smoke/reranker_comparer.py:35
- 建议：将第 35 行和第 40 行的 raise Exception(...) 改为 raise SmokeException(QueryStatus.COMPARE_FAILED, ...)。在 case_runner._curl_server_impl 中，SmokeException 会被正确归类为 COMPARE_FAILED，而 plain Exception 会被归到 QueryStatus.OTHERS，导致错误状态上报不准确，影响 CI 结果分析。
embedding_comparer._compare_dict_result 抛出 Exception 而非 SmokeException @ rtp_llm/test/smoke/embedding_comparer.py:79
- 建议：将第 79 行的 raise Exception(...) 改为 raise SmokeException(QueryStatus.COMPARE_FAILED, ...)，与同文件其他对比错误保持一致。否则 key 缺失的对比失败会被错误归类为 OTHERS 而非 COMPARE_FAILED。
PD 分离测试中 GPU ID 切片无边界校验，GPU 不足时静默切分出空列表 @ rtp_llm/test/smoke/multi_inst_case_runner.py:87
- 建议：在 PdSeperationCaseRunner.run() 和 DpSeperationCaseRunner.run() 中，取得 gpu_ids 后增加断言：assert decode_gpu_size + prefill_gpu_size <= len(gpu_ids), f"需要 {decode_gpu_size + prefill_gpu_size} 块 GPU，但只有 {len(gpu_ids)} 块可用"。当前如果 GPU 不足，会静默产生空的 CUDA_VISIBLE_DEVICES，导致服务启动失败但错误信息不明确。
openai_comparer 中 torch.load 未指定 weights_only，PyTorch 2.6+ 默认 weights_only=True 会导致加载失败 @ rtp_llm/test/smoke/openai_comparer.py:35
- 建议：补充 weights_only=False（或如果文件只含 tensor 则用 weights_only=True）。normal_comparer.py 和 embedding_comparer.py 已正确指定了 weights_only=False，此处遗漏。PyTorch 2.6+ 将 weights_only 默认值改为 True，缺省会抛 _pickle.UnpicklingError。
crop_positions 除以零导致请求崩溃 @ rtp_llm/openai/renderers/llava_renderer.py:30
- 建议：在执行除法前校验 crop_positions[4] 和 crop_positions[5] 不为零：if crop_positions[4] == 0 or crop_positions[5] == 0: raise ValueError('crop_positions h/w dimensions must be non-zero')。此处是用户输入边界，应做校验。

Non-blocking Suggestions

P2

endpoints() 方法缺少线程锁保护 @ rtp_llm/test/remote_tests/endpoint_info.py:209
- 建议：此 PR 已为 refresh()、current_endpoint()、advance() 添加了 self._lock，但 endpoints() 遗漏了。refresh() 释放锁后、列表推导读取 self._hosts 和 self._active_spec 之前，其他线程可修改这两个字段导致不一致。应在 endpoints() 中也加 with self._lock 包裹读取操作。当前仅在测试中调用，影响有限。
_write_final_stream_files 中 stat/exists TOCTOU 竞态 @ rtp_llm/test/remote_tests/executor.py:646
- 建议：stdout_file.exists() 和 stdout_file.stat() 之间存在 TOCTOU 窗口。建议用 try/except OSError 包裹 stat() 调用，FileNotFoundError 时回退为 0，避免极端情况下的异常。stderr 同理（行 649）。虽然此处 ByteStream 线程已 join，实际触发概率极低。
assert 后 if kv_cache 检查成为死代码 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trtllm_gen.py:393
- 建议：移除 if kv_cache: 条件判断，直接执行 kv_cache.kv_cache_base = kv_cache.kv_cache_base.view(...)，减少阅读歧义。如果用 python -O 运行，assert 会被优化掉，此时 if 是唯一守卫，但当前上下文已在 line 384 无条件访问 kv_cache.kv_cache_base.dtype，所以 -O 模式下 if 也没有实际保护作用
PureDpRouter 每层 MoE forward 都有 D2H 同步 (.item()) @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/routers/pure_dp_router.py:127
- 建议：按 TODO 注释计划，将 max_n 提升到 step 级别的 host metadata（在进入 MoE 栈之前计算一次），避免每层都做 D2H sync。当前已标注 CUDA Graph 不兼容并禁用了该路径，但在非 graph 模式下仍是性能瓶颈。
ROCm 采样路径中 per-row multinomial 在 batch 有 generator 时退化 @ rtp_llm/models_py/bindings/core/CudaSampleOp.cc:528
- 建议：将有 generator 和无 generator 的行分组：先对无 generator 的行做一次 batched multinomial，再对有 generator 的行逐个采样。这样只有真正需要 generator 的行才会走逐行路径。或者用统一 seed + offset 方案避免逐行采样。
runtimeBatchCopy 的 cudaEventSynchronize 现在等待时间更长 @ rtp_llm/models_py/bindings/core/CudaOps.cc:299
- 建议：改用 CUDA stream callback 或 stream-ordered 的 workspace 生命周期管理（如将 workspace tensor 绑定到 stream event 回调中释放），避免 CPU 同步等待整个 kernel 完成。或者复用预分配的 workspace buffer 并用 stream ordering 保证安全，从而完全消除 EventSynchronize。当前的改法虽然有性能代价但修复了 workspace 生命周期 bug，优先级较低。
TP-only 模式下 MoE + shared expert 做了两次 allreduce（可合并为一次） @ rtp_llm/models_py/model_desc/generic_moe.py:162
- 建议：已有 TODO 注释。当 fused_moe 支持 skip_allreduce 后，两个 expert 都 skip 内部 allreduce，合并 partial output 后做一次统一 allreduce，减少通信开销约 50%。
ExecutorEndpointPool.refresh() 在 RLock 内做 DNS 解析 + sleep @ rtp_llm/test/remote_tests/endpoint_info.py:106
- 建议：将 DNS 解析移到锁外部，只在更新 _hosts/_active_spec 时短暂持锁。即先做 DNS 解析得到 resolved 列表（无锁），再 with self._lock: 更新内部状态。这样锁只保护状态赋值，不阻塞 I/O。
evaluateRunningMemory 中 int 与 size_t 混合比较存在隐式转换风险 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:136
- 建议：max_token_size 声明为 int，与 size_t 类型的 running_streams_.size() 做算术运算时存在隐式转换。当 max_token_size 为负值（理论上不应出现但无校验）时，转 unsigned 会产生巨大值。建议将 max_token_size 改为 size_t 或 int64_t，并在 contextLength() 后加断言确保非负。
BatchDecodeScheduler::evaluateWaitingStreams 中忙等无超时可能无限挂起 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:202
- 建议：如果 cache load 永远完不成（如远端超时、网络不可达），此循环将无限忙等占住调度线程，导致整个 scheduler 停止工作。建议添加超时上限（如 load_cache_timeout_ms），超时后标记 stream 为 ERROR 并跳出。
BatchDecodeScheduler::stop() 未实现，stop 信号无法优雅终止调度 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:275
- 建议：stop() 缺失实现意味着 BatchDecodeScheduler 无法在关闭时取消 waiting_streams_ 中的请求（FIFOScheduler 已正确实现 cancelStreams）。应实现类似逻辑：设置 stop_ flag，对所有队列调用 cancelStreams，并 notify cond_。
BatchDecodeScheduler::empty() 始终返回 true，生命周期判断不准确 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:280
- 建议：empty() 始终返回 true，上层若依赖 empty() 来判断 scheduler 是否有剩余工作（如 FIFOScheduler::lastScheduleTime 中的用法），会误认为无工作，影响超时判断。应返回实际状态：waiting_streams_.empty() && loading_cache_streams_.empty() && running_streams_.empty()。
GrpcConfig::from_json 使用正则解析 JSON 不支持嵌套大括号 @ rtp_llm/cpp/config/ConfigModules.cc:436
- 建议：用正则 [^}]+ 解析 JSON 对象，如果值中包含嵌套对象或字符串中含 }，解析会静默截断。建议使用 rapidjson（项目已依赖）进行 JSON 解析，或至少记录 WARNING 当解析结果为空时。
transMMInputsPB 传值 vector 而非 const 引用，造成不必要的拷贝 @ rtp_llm/cpp/model_rpc/QueryConverter.h:26
- 建议：参数类型应为 const std::vector& 以避免拷贝整个 vector（含 torch::Tensor 等重对象）。声明和定义 (QueryConverter.cc:203) 都需要改。
BatchDecodeScheduler 中非 NONE 模式组可能被饿死 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:154
- 建议：evaluateWaitingStreams 每次选最大的非 NONE 组调度，如果持续有新请求补充大组，小组会被无限期跳过。建议维护 last_selected_mode_ 做轮转，或在某个组等待超过 N 轮后强制调度。实际触发需要同时存在 DEFAULT 和 ORIGINAL 两种模式请求且流量不均，生产中较罕见，降为 P2。
FIFOScheduler 中模式不兼容的 stream 缺少饿死保护 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:258
- 建议：当 running_streams_ 中有非 NONE 模式时（DECODE 角色可混入 running），新的不兼容模式 stream 被 skip。FIFOScheduler 无 timeout 机制（不像 BatchDecodeScheduler 的 kFlushTimeoutMs），在高流量下不兼容模式的请求可能长期排队。建议为被跳过的 stream 增加等待时间上限或计数器，超过阈值时独立调度。
FMHAConfig/MoeConfig pickle 格式变更破坏滚动升级 @ rtp_llm/cpp/pybind/ConfigInit.cc:309
- 建议：multiprocessing.spawn 通过 pickle 跨进程传递 EngineInitParams（含 FMHAConfig/MoeConfig）。新旧代码的 tuple 大小不同（14 vs 12），滚动升级时 frontend/backend 进程版本不一致会导致反序列化崩溃。如需支持滚动升级，可在 unpickle lambda 中根据 tuple 大小兼容两种格式。如果始终原子部署则可接受。
getFeatureHash 对每个多模态输入做全量 D2H 拷贝计算哈希 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：大型多模态模型（4096 tokens × 4096 hidden_dim × 2B = 32MB）每次请求都做 D2H 拷贝。建议：1) 如果 tensor 已在 CPU 上则跳过 .to(kCPU)；2) 或在 GPU 上计算哈希再只拷回结果。高并发多模态场景下此开销显著。
CUDA 版本 < 12.6 静默映射到 cuda12_6 配置（含 CUDA 11.x） @ _build/platform.py:314
- 建议：CUDA 11.x 或 12.0-12.5 用户会静默获得 cuda12_6 配置（含 CXX11_ABI=1），可能导致链接错误。建议对 version < (12, 0) 抛出 RuntimeError 明确提示不支持的版本，而非静默回退。
BatchDecodeScheduler 中 FINISHED stream 在 busy-wait 后未从 waiting_streams_ 移除 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:207
- 建议：在 busy-wait 期间变为 FINISHED 的 stream 不在 scheduled_ptrs 中，因此不会被从 waiting_streams_ 移除，要等到下一轮 evaluateWaitingStreams 开头的 remove_if（line 139-141）清理。虽不丢数据，但这些 FINISHED stream 暂时占用 waiting 名额，可能导致 waiting_streams_.size() >= batch_size_ 提前触发。建议在构建 scheduled_ptrs 时也包含 FINISHED stream。
shared_mutex 的 TOCTOU 竞态：多线程可能重复执行昂贵的 occupancy 查询 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:608
- 建议：在 unique_lock 写入前做二次检查（double-check locking）：unique_lock 拿到后先 find(k)，命中则直接返回，避免同一 k 值并发时重复调用 hipOccupancyMaxPotentialBlockSizeVariableSMem。这是从 exclusive mutex 改为 shared_mutex 引入的新问题。不过由于该函数仅在启动时对少量 k 值调用，且重复计算结果幂等，影响有限，归为 P2。
worker_status_comparer 中 finished_task_list 比较未排序，顺序不同会导致假阳性失败 @ rtp_llm/test/smoke/worker_status_comparer.py:59
- 建议：对 filtered_expect_tasks 和 filtered_actual_tasks 按稳定的 key（如 task_id、input_token_len 等）排序后再比较，或改用集合比较。服务端返回 finished_task_list 的顺序可能不确定，导致 flaky test。
worker_status_comparer 中 model_dump 结果 expect/actual 从未使用 @ rtp_llm/test/smoke/worker_status_comparer.py:37
- 建议：删除第 37-38 行的死代码。这两行调用 model_dump 但返回值 expect/actual 从未被引用，后续比较使用的是 getattr。
utils.create_temporary_copy 创建的临时文件永不清理 @ rtp_llm/test/smoke/utils.py:19
- 建议：考虑使用上下文管理器或在 CaseRunner/BaseComparer 的 finally 块中清理临时文件。当前每个包含图片/视频的 query 会泄漏一个临时文件。对于长跑 smoke suite（93+ 测试用例），积累的临时文件可能占用显著磁盘空间。
tau2_bench_comparer._run_tau2_script 子进程无超时保护 @ rtp_llm/test/smoke/tau2_bench_comparer.py:258
- 建议：为 proc.wait() 添加超时参数（如 timeout=3600），并在超时时 kill 子进程。当前若 tau2 脚本挂起，smoke 测试会无限阻塞，直到 pytest 外层超时（2 小时）才会终止。
tau2_bench_comparer 在测试运行时执行 pip install，可能影响并发测试 @ rtp_llm/test/smoke/tau2_bench_comparer.py:80
- 建议：文档标注此测试不可与其他 smoke test 并行运行，或将依赖安装移到 fixture/setup 阶段。在测试中执行 pip install 会修改运行时环境，可能导致并发的其他测试出现 import 异常。
base_comparer.run() 使用裸 except: 捕获所有异常包括 SystemExit 和 KeyboardInterrupt @ rtp_llm/test/smoke/base_comparer.py:144
- 建议：将 except: 改为 except Exception:。裸 except 会捕获 SystemExit 和 KeyboardInterrupt，虽然这里会 re-raise，但 logging.warning 的副作用（在 Ctrl-C 时打印无关警告）是不期望的行为。
comparer_registry 中 mainse comparer 注册在 OSS comparer 之前，但 _try_register_mainse_comparers 在模块尾部调用 @ rtp_llm/test/smoke/comparer_registry.py:111
- 建议：当前 mainse comparer 注册发生在 comparer_registry.py 导入时（第 111 行），而 OSS comparer 注册发生在 case_runner.py 导入时（第 94-106 行）。注册顺序取决于 Python 导入顺序，但 case_runner.py 在第 39 行 from comparer_registry import ... 会先触发 comparer_registry 加载。建议在文档或注释中明确说明导入顺序约束，避免后续重构意外改变优先级。
_iterate_modidfy_qr 断言检查了错误的变量 @ rtp_llm/test/smoke/case_runner.py:117
- 建议：将 assert isinstance(new, dict) 改为 assert isinstance(new[key], dict)，以便在类型不匹配时给出正确的错误信息。
SAVE_RESPONSE golden 更新中 line 510 是无效代码 @ rtp_llm/test/smoke/case_runner.py:510
- 建议：删除 line 510 的无效调用，只保留 lines 511-517 的实际更新逻辑。
comparer_registry mainse 注册顺序可能导致 mainse_arpc 被 mainse_module 抢先匹配 @ rtp_llm/test/smoke/comparer_registry.py:97
- 建议：将 mainse_arpc 的谓词注册放在 mainse_module 之前，或在 mainse_module 谓词中排除 mainse_arpc=True 的 case（添加 and not q_r.get('mainse_arpc', False)）。
EP 和 TP-only 路径的 shared_expert_gate 分支逻辑完全重复 @ rtp_llm/models_py/model_desc/generic_moe.py:179
- 建议：提取公共 gate 应用逻辑为内部方法 _apply_shared_expert_gate(shared_expert_output, experts_output, hidden_states)，EP 分支先 all_reduce 再调用，TP 分支直接调用，消除重复代码。
symm_mem all_reduce 每次调用都分配新 output tensor @ rtp_llm/models_py/distributed/collective_torch.py:620
- 建议：all_reduce() 在 hot path 上被频繁调用（每层 attention + MLP 的 TP allreduce）。每次分配新 output tensor 会增加 CUDA allocator 压力。建议在 TorchSymmMemCommunicator 内部维护一个 output buffer（类似 self.buffer 的做法），或让 collective_torch.all_reduce 传入 out=tensor 使其 in-place，避免每次 empty_like 分配。注意：NCCL 路径是 in-place 的（直接修改 tensor），而 symm_mem 路径返回新 tensor，这一不一致可能在调用者处引发 tensor 引用悬空问题。
deepgemm_wrapper 模块加载时 eagerly 初始化所有 6 个 GEMM 符号 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:182
- 建议：该模块有 20+ 处被 import，包括 router/executor/linear 等模块。顶层调用 _lazy_init_deep_gemm_once() 使得任何 import 都触发 deep_gemm 完整加载。虽然 has_deep_gemm() 会短路退出（如果包不存在），但在包存在时会触发 JIT 环境准备 + import deep_gemm（可能编译 JIT 内核）。建议移除顶层调用，让各 wrapper 函数（fp8_gemm_nt 等）内部 lazy init 对应的符号子集。
symm_mem all_reduce 与 NCCL all_reduce 返回语义不一致 @ rtp_llm/models_py/distributed/collective_torch.py:615
- 建议：当前 all_reduce() 的 docstring 说 'will be modified in-place'，NCCL 路径确实如此。但 symm_mem 路径返回新 tensor，调用者如果没有用返回值（而是继续用原 tensor），结果将不正确。目前大部分调用者（如 generic_moe.py:183）确实用了返回值 shared_expert_output = all_reduce(...)，所以暂时安全。但 API 合约不一致是隐患。建议 symm_mem all_reduce 传入 out=tensor 实现 in-place 语义以匹配 NCCL 路径，或在文档中明确说明返回值必须被使用。
scatter_qkv Triton kernel 对大 hidden dim 场景并行度不足 @ rtp_llm/models_py/triton_kernels/common/scatter_qkv.py:89
- 建议：当前 grid = (M,) 只沿 row 维度并行。对于 MHA/MLA 模型的 k_dim (~1536-8192)，单个 program 要处理很大的数据量，可能超出单 program 的高效寄存器/shared memory 范围，且在 M 较小的 decode 场景下 GPU 占用率低。建议：1) 当 k_dim > 某阈值时沿隐藏维度做 2D tiling；2) 或对小 M 场景 fallback 到 torch.split + contiguous。当前实现对 prefill（大 M）场景仍有效。
generic_moe EP/TP-only 分支的 gate 逻辑完全重复 @ rtp_llm/models_py/model_desc/generic_moe.py:184
- 建议：将 shared_expert_gate 的应用逻辑提取到 if/else 外部，仅在 EP 分支内做 all_reduce。可减少 ~10 行重复代码，降低未来修改一侧忘记另一侧的风险。
GenericMoeLayer.forward 每次推理都分配 topk_weights 和 topk_ids tensor @ rtp_llm/models_py/model_desc/generic_moe.py:124
- 建议：在 decode 阶段 num_tokens 通常是固定的（batch_size），每次 forward 都分配小 tensor 会给 CUDA allocator 带来不必要的压力。在 CUDA graph 模式下可能已经被捕获，但非 graph 模式下可考虑在 init 中预分配最大 batch_size 的 buffer 并 slice 使用。这是较低优先级的优化，影响不大（small tensor allocation 通常走 caching allocator）。
destroy_distributed_environment 关闭 symm_mem 后未重置全局引用 @ rtp_llm/models_py/distributed/collective_torch.py:462
- 建议：close() 之后应将模块级变量也置 None：import rtp_llm.models_py.distributed.symm_mem as _symm_mod; _symm_mod._symm_mem_comm.close(); _symm_mod._symm_mem_comm = None，或在 symm_mem.py 中新增 reset_global() 函数。当前 should_torch_symm_mem_allreduce 会因 disabled=True 而安全降级，功能不受影响。
MultiprocessPreprocessExecutor._rebuild_pool 创建失败后 self.pool 为 None @ rtp_llm/multimodal/mm_process_engine.py:182
- 建议：在 _rebuild_pool 中 wrap _create_pool 的异常，如果失败则 log error 并 reraise，或者在 submit 中增加 self.pool is None 的检查并给出明确的 RuntimeError。
MMWorkItem.mm_timeout_ms 类型标注为 Optional[int] 但下游做除法 @ rtp_llm/multimodal/mm_process_engine.py:312
- 建议：将 mm_timeout_ms 参数类型改为 int = ******（去掉 Optional），或在赋值处做 int(mm_timeout_ms or ******) 兜底，避免 None / 1000.0 的 TypeError。当前调用方总是传 int，风险低但类型标注有误导性。
refresh() 持锁期间调用 time.sleep() 阻塞其他线程 @ rtp_llm/test/remote_tests/endpoint_info.py:128
- 建议：将 DNS resolve+sleep 逻辑移到锁外执行，只在更新 _hosts/_active_spec/_index 时短暂持锁。当前持锁 sleep 会阻塞 current_endpoint()/advance() 调用者最多 ~150ms。
ssrf_check.py 存在三次 DNS 解析：_validate_url + send() + get_connection() @ rtp_llm/utils/ssrf_check.py:180
- 建议：这是既有设计（非本 PR 引入），故标 P2。_validate_url 改为只做 scheme 检查，把 host 校验完全交给 _SSRFAdapter.send() 中的 _resolve_and_validate_host 即可消除一次冗余 DNS。PR 本身的 reorder 变更是正确的 fail-fast 改进。
warp_topk.hpp shared_mutex 升级存在冗余计算窗口 @ 3rdparty/trt_beam_search/efficient_topk/warp_topk.hpp:609
- 建议：可在 unique_lock 内再做一次 find 检查 (double-checked locking) 避免冗余计算。emplace 保证正确性不受影响，但 hipOccupancy 调用代价较高

P3

_bytestream_write_file_parallel 未处理空文件边界 @ rtp_llm/test/remote_tests/cas_client.py:383
- 建议：若文件为空，_chunks() 不会 yield 任何请求（含 finish_write=True），gRPC Write 收到空流可能行为未定义。虽然此路径仅在 size > BYTESTREAM_THRESHOLD(3MB) 时进入，不会实际遇到空文件，但作为防御可在循环后增加空文件的 finish_write 兜底。
StreamGroups::updateStreams 未做 spec_update_infos 越界检查 @ rtp_llm/cpp/engine_base/stream/StreamGroups.h:238
- 建议：如果 spec_update_infos.size() 小于 decode_streams_ + context_streams_ 的总数，会越界访问。建议增加 RTP_LLM_CHECK_WITH_INFO(spec_update_infos.size() == size(), ...) 断言。
getFeatureHash 用 h % INT32_MAX 将哈希空间减半 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:35
- 建议：模 INT32_MAX (2147483647) 后结果范围为 [0, 2147483646]，排除了所有负值，有效哈希空间从 32 位降为 31 位。如果负值 token ID 在下游无特殊语义，可改为 static_cast<int32_t>(h & 0x7FFFFFFF) 保留 31 位但更高效；或直接 static_cast<int32_t>(h) 使用完整 32 位空间。
QueryConverter::transQuery 中 crop_positions 逐元素 push_back 可优化 @ rtp_llm/cpp/model_rpc/QueryConverter.cc:140
- 建议：可使用迭代器范围构造: std::vector crop_positions(mm_preprocess_config->crop_positions().begin(), mm_preprocess_config->crop_positions().end()); 同样的模式在 line 179 重复出现。
RemoteMultimodalProcessor 对任意 gRPC 错误都移除连接 @ rtp_llm/cpp/multimodal_processor/RemoteMultimodalProcessor.h:62
- 建议：瞬态错误（UNAVAILABLE、DEADLINE_EXCEEDED）也会永久移除连接。如果 pool 不自动重建同一 ip_port 的连接，后续请求将无法到达该 VIT worker。建议区分瞬态和永久错误，仅对永久错误移除连接。
多个 comparer 的错误信息有语法错误 @ rtp_llm/test/smoke/embedding_comparer.py:74
- 建议：修正为 'list data not equal' / 'scores not equal'，并在消息中包含期望值和实际值以便调试。
DpSeperationCaseRunner 错误消息语法不完整 @ rtp_llm/test/smoke/multi_inst_case_runner.py:220
- 建议：修改为 'should not be empty'。
auto_model.py _prepare_decode_attention_inputs 每次 decode 都创建新的 kv_cache_block_id tensor @ rtp_llm/models_py/standalone/auto_model.py:217
- 建议：standalone auto_model 主要用于测试/调试，不是生产 hot path，所以影响低。但如果后续有性能要求，可以在首次 prefill 时缓存 block_id tensor 并在 decode 时复用（block 数量在序列长度增长前不变）。
get_cutlass_groupgemm_best_config 中 lru_cache 的 key 不含 device_name @ rtp_llm/models_py/kernels/cuda/fp8_kernel/get_best_config.py:103
- 建议：当前 lru_cache 的 key 是 (E, N, K)，但 device_name 是在函数内部通过 torch.cuda.get_device_name() 获取的。在同一进程内如果有混合 GPU 类型（如 H20+L20），第一次调用会缓存错误的 config。实际生产中通常单类型 GPU，影响很小，但建议将 device_name 加入 cache key 或在函数签名中暴露。
_preload_nvidia_deps 在 import 时遍历所有 site-packages 路径 @ rtp_llm/ops/__init__.py:57
- 建议：glob.glob 在每种 nvidia 库上遍历文件系统，共 6 * len(site_packages) 次。虽然只在 import 时执行一次，但如果 site-packages 路径很多（如虚拟环境嵌套），可能增加数十毫秒的启动延迟。可以考虑只在 editable install 模式下执行（检查 .so 是否在 repo 内而非 site-packages/rtp_llm/libs/），或用 @functools.cache 避免重复执行。当前已经只执行一次，影响较小。
crop_positions 解析未处理非数字字符串输入 @ rtp_llm/openai/renderers/llava_renderer.py:26
- 建议：用 try-except 包裹 float 转换，给出包含字段名的清晰错误信息：raise ValueError(f'crop_positions contains non-numeric value: {config.crop_positions}')。

Checklist ✅ (56 items passed)

Strengths

为 ExecutorEndpointPool 添加 threading.RLock 保护，修复多线程下 refresh/advance 竞态
ByteStream Write 新增 committed_size 校验，防止静默的部分上传
CAS BatchReadBlobs 异常从 silent pass 改为 log.warning，提升可观测性
prepare_venv.py 增加文件存在性检查，提升 open-source 构建兼容性
_write_final_stream_files 新增大小比较逻辑，避免用较短的 ActionResult 数据覆盖更完整的 ByteStream 流数据
FailoverRemoteExecutor 架构完善：阶段回退检测、wall-clock/queued 双层 watchdog、atexit + SIGTERM 取消远程操作
output_collector 的 _safe_extract 正确防御路径遍历和符号链接攻击
generic_moe.py 将 group_topk 相关属性（renormalize、num_expert_group 等）从 forward() 移至 init，避免了每次前向传播时重复初始化 nn.Module 属性，同时使 PyTorch 正确追踪这些子模块
TensorPbConvert::pbToTorch 中增加了 numel overflow 检查和 data size mismatch 校验，防止了恶意 protobuf 导致的内存越界
QueryConverter::transGenerateConfig 对 return_all_probs_mode 增加了 clamp 防止客户端构造非法枚举值

- LocalRpcServer.cc: remove PyErr_Fetch leak, use e.what() directly - comparer_registry: split mainse imports into per-file modules - reranker_comparer: Exception → SmokeException(COMPARE_FAILED) - embedding_comparer: Exception → SmokeException(COMPARE_FAILED) - multi_inst_case_runner: assert GPU count before slicing gpu_ids - openai_comparer: add weights_only=False to torch.load - llava_renderer: validate crop_positions h/w != 0 before division

LLLLKKKK · 2026-06-24T15:52:51Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/2 · P1/6 · P2/13 · P3/3

Blocking Issues

P0

VisionAttention 缺少 self.head_dim 导致 eager 路径 AttributeError 崩溃 @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:197
- 建议：在 init 中添加 self.head_dim = dim // num_heads。或若 eager 路径确认不使用，添加 raise NotImplementedError
VisionAttention 用 bool 加法做 attention mask 导致跨序列注意力泄漏 @ rtp_llm/multimodal/multimodal_mixins/qwen2_vl/modeling_qwen2_vl.py:198
- 建议：改为 attn_weights = attn_weights.masked_fill(~attention_mask, float('-inf'))。若 eager 路径不使用可标注 NotImplementedError

P1

OpenaiEndpoint: chat_render_/tokenizer_ 空指针解引用 @ rtp_llm/cpp/api_server/openai/OpenaiEndpoint.cc:77
- 建议：在 extract_generation_config 入口处检查 chat_render_ 和 tokenizer_ 非空，为空时跳过对应操作
CAS ByteStream.Read 创建的 gRPC channel 未关闭导致资源泄漏 @ rtp_llm/test/remote_tests/cas_client.py:321
- 建议：用 with grpc.insecure_channel(...) as channel: 包裹，或在 finally 中调用 channel.close()
XQA test_support 无断言，测试永远通过 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:129
- 建议：添加 assert result == expected_value 或至少 assert isinstance(result, bool)
XQA _test_decode_correctness 不支持配置时静默跳过 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:66
- 建议：改为 pytest.skip('XQA not supported for this config') 让 CI 报告区分 PASS 和 SKIP
test_gpu_isolation GPU 记录文件无清理，陈旧文件导致检查不可靠 @ rtp_llm/test/test_gpu_isolation.py:125
- 建议：在 fixture setup 中清理历史记录文件，或使用 tmpdir fixture 隔离
test_can_allocate_on_device_0 缺少 GPU 跳过检查 @ rtp_llm/test/test_gpu_isolation.py:79
- 建议：添加 _skip_if_no_isolation 或 pytest.mark.skipif(not torch.cuda.is_available()) 装饰器

Non-blocking Suggestions

P2

TensorPbConvert::torchToPb 未验证 tensor 在 CPU 上 @ rtp_llm/cpp/model_rpc/TensorPbConvert.cc:148
- 建议：序列化前加 auto cpu_tensor = tensor.to(torch::kCPU).contiguous()
FIFOScheduler: ReturnAllProbsMode 不兼容 stream 可能饥饿 @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:261
- 建议：当 waiting_streams_ 非空但本轮无新调度时，设置 schedule_trigger_ 确保下轮重试
deepseek_vl2 _create_config 缺失 config.json 时返回 None @ rtp_llm/models/deepseek_vl2/deepseek_vl2.py:23
- 建议：改为 raise FileNotFoundError，与 QWenV2._from_hf 改进一致
kimi_k25 _read_top_config 返回空 dict 导致静默配置失败 @ rtp_llm/models/kimi_k25/kimi_k25.py:37
- 建议：config.json 缺失时 raise FileNotFoundError
qwen_v2.py transformer_prefix 拼接顺序颠倒 @ rtp_llm/models/qwen_v2.py:63
- 建议：恢复为 self.prefix + self.model_prefix，或确认无模型使用此 compat 路径后删除
FMHAConfig/MoeConfig pickle 格式不兼容旧版本 @ rtp_llm/cpp/pybind/ConfigInit.cc:673
- 建议：若需兼容，在 setstate 根据 tuple size 做版本分支处理
FMHAConfig::use_triton_pa 默认值从 true 改为 false @ rtp_llm/cpp/config/ConfigModules.h:114
- 建议：确认此变更是有意的，ROCm 平台测试覆盖此路径切换
fused_recurrent.py B>1 且 inplace_final_state=False 时越界 @ rtp_llm/models_py/triton_kernels/fla/fused_recurrent.py:206
- 建议：验证 B>1 + inplace_final_state=False 路径的 index 和 reshape 逻辑
tarfile.extractall 缺少 filter 参数存在路径穿越风险 @ rtp_llm/test/smoke/tau2_bench_comparer.py:151
- 建议：添加 filter='data' 参数（Python 3.12+）或手动校验成员路径
ssrf_check.py 缺少 is_multicast 检查 @ rtp_llm/utils/ssrf_check.py:21
- 建议：添加 ip.is_multicast 检查到拒绝列表
HeadWisePrefillAttnOp.forward 热路径重复 .cpu().tolist() GPU→CPU 同步 @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/headwise.py:297
- 建议：将 prepare() 的 cpu().tolist() 结果缓存到 self._xxx_list，forward() 直接使用
TRTMHAImpl.forward CUDA graph 路径每次分配大 GPU buffer @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trt.py:80
- 建议：预分配到 init 或首次使用时缓存，参考 PyFlashinferPrefillPagedAttnOp 模式
embeding_test randint 上界应为 vocab_size 而非 hidden_size @ rtp_llm/models_py/modules/base/common/test/embeding_test.py:43
- 建议：改为 torch.randint(0, vocab_size, ...) 或使用 vocab_size 变量

P3

QueryConverter::transMMInputsPB 参数按值传递 @ rtp_llm/cpp/model_rpc/QueryConverter.cc:203
- 建议：改为 const std::vector& mm_inputs
flashinfer alias 双向展开逻辑分散 @ rtp_llm/models_py/modules/factory/attention/attn_factory.py:104
- 建议：抽取 _expand_flashinfer_alias(names) 统一处理
deepgemm_wrapper.py getattr(globals(), ...) 永不生效 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:133
- 建议：改为 globals().get(name, None) 或 getattr(sys.modules[name], name, None)

Checklist Violations (5 fail / 56 total)

General Principles Checklist

[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue XQA test_support 无断言，测试永远通过
_XQA test_support 无断言；test_decode_correctness 不支持时静默 return 而非 skip
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue embeding_test randint 上界应为 vocab_size 而非 hidden_size
embeding_test randint 上界用 hidden_size 而非 vocab_size
[6.1] Quality — Mega-PR 已拆分为独立变更 → checklist-only
4585 commits / 559 files 的大规模合并，包含 packaging、测试迁移、多模态修复、attention 重构等不相关变更

Python Static-First Checklist

[P.A] 静态结构与类型纪律 — 禁止 getattr/setattr literal 访问 → issue deepgemm_wrapper.py getattr(globals(), ...) 永不生效
deepgemm_wrapper.py:133 getattr(globals(), name) 对 dict 无效
[P.B] 错误处理 — 资源获取用 with context manager → issue CAS ByteStream.Read 创建的 gRPC channel 未关闭导致资源泄漏
cas_client.py:321 grpc.insecure_channel 未用 with 语句

Strengths

ReturnAllProbsMode 调度隔离：FIFOScheduler 和 BatchDecodeScheduler 均实现了 DEFAULT/ORIGINAL 模式分区，从根本上解决了之前 StreamGroups 中 P1 级的概率分布混合问题
QueryConverter::transMMOutput 从 RTP_LLM_CHECK 崩溃改为 ErrorResult 优雅错误处理，提升了远程 VIT 场景的容错能力
RemoteMultimodalProcessor 新增 gRPC deadline 和失败连接剔除，防止单个 VIT worker 卡住拖慢全局推理
PyWrappedModel::planMicroBatches 正确禁用 MRoPE 和多模态 batch 的 micro-batch 分割，避免位置 ID 和特征索引错乱
FMHAConfig 从布尔标志迁移到字符串 backend 选择 API，提供更灵活的 attention 后端配置

…ve_v2 # Conflicts: # deps/requirements_base.txt # deps/requirements_lock_cuda12_arm.txt # deps/requirements_lock_rocm.txt # deps/requirements_lock_torch_arm.txt # deps/requirements_lock_torch_cpu.txt # deps/requirements_lock_torch_gpu_cuda12.txt # deps/requirements_lock_torch_gpu_cuda12_9.txt # rtp_llm/BUILD # rtp_llm/test/BUILD # rtp_llm/test/generate_config_test.py # rtp_llm/test/smoke/BUILD # rtp_llm/test/smoke/suites_h20_oss.bzl

The outer PID/EXIT_CODE wait loop only reports a host as failed when its per-host subshell exits non-zero, but commands were joined with ';' so a failing non-final command was masked. - build/kill/clean/test: leading `scp` of the executor now `|| exit $?`, so a failed dispatch is no longer hidden by the trailing ssh. - copy: the ssh result is captured via $(...) which masks its exit code; check it explicitly, guard empty TEST_OUTPUT_PATH, and fail on the critical process.log / *Result.json scp. Trace files (normal_*) stay best-effort (|| true). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…hells Address review: multi_copy_script continued after a failed scp. Adopt `set -e` at the top of every per-host subshell as the single, consistent error-propagation mechanism, replacing the scattered `|| exit $?` guards. - copy: a failed scp (or empty/failed remote exec) now aborts the host's subshell instead of running the remaining scp commands. - build/kill/clean/test: same `set -e` guard for consistency. - Trace files (normal_*) stay best-effort via `|| true`; empty TEST_OUTPUT_PATH is still guarded explicitly. multi_kill_script already had the PID/EXIT_CODE propagation pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CI failed on the rocm config with: no such package '@pip_gpu_rocm_torch//flydsl' ... referenced by '//rtp_llm/models_py/triton_kernels:fla' fla/flydsl_chunk_gdn_mi308x*.py import the ROCm/MI308X-only `flydsl` package, which is not a pip dependency in the rocm lockfile. These modules are lazy-imported at runtime (fla/chunk.py) and shipped via setup.py, so import-based dependency resolution over :fla's srcs should not pull flydsl into the Bazel graph. Exclude them from the glob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LLLLKKKK · 2026-06-27T18:47:13Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/10 · P2/25 · P3/0

Blocking Issues

P1

FMHAConfig/MoeConfig pickle 格式不兼容，无降级回退 @ rtp_llm/cpp/pybind/ConfigInit.cc:308
- 建议：为 pickle setstate 添加 tuple 长度兼容逻辑：若 t.size() == 12（旧版本），用默认值填充新字段；仅在 t.size() 不在 {12, 14} 范围内时才抛异常。MoeConfig 同理（11 → 13）。这样可以支持滚动升级场景，前端进程已更新但后端进程还在用旧代码。
MoriEpIntranodeRouter._finalize_single 未检查 _dispatch_ids 是否为 None @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:243
- 建议：在 _finalize_single 开头加 assert self._dispatch_ids is not None 或做 None 保护，避免 prepare/finalize 调用顺序异常时 crash（例如 exception 中断了 prepare 但 finalize 仍被调用）。
XQA test_support 无断言，测试永远通过 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:129
- 建议：在每个 case 循环中添加 self.assertTrue(is_supported, ...) 或 self.assertFalse(is_supported, ...)。末尾应断言 self.assertEqual(supported_count, len(supported_cases)) 和 self.assertEqual(unsupported_count, len(unsupported_cases))。
XQA test_support_functionality 同样无断言 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:333
- 建议：为每个 case 添加预期结果并断言，或至少断言 is_supported 是 bool 类型且对已知配置返回 True。
XQADecodeImpl.support() 使用 tokens_per_block 而非 kernel_tokens_per_block，与内核检查不一致 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:140
- 建议：将 XQADecodeImpl.support() 中的 attn_configs.tokens_per_block 改为 attn_configs.kernel_tokens_per_block，与 XQAWrapper.support() 和所有其他 attention impl 保持一致。或者像 XQAPrefillImpl.support() 一样委托给内部的 XQAWrapper.support()。
XQAWrapper.forward 中 decode_cu_seqlens_d 的 .cpu().tolist() 在 forward 热路径触发 GPU→CPU 同步 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:295
- 建议：将 seqlens 计算移到 prepare() 阶段预计算并缓存，或使用 host 端的 cu_seqlens 数据（如 decode_cu_seqlens_host），避免每次 forward 都触发 GPU→CPU 同步。
rocm_fmha_test 中 aiter 对比失败不返回 failed 状态，静默通过 @ rtp_llm/models_py/modules/base/rocm/test/rocm_fmha_test.py:828
- 建议：在 aiter 对比失败时也返回 {'test_status': 'failed'}，与 atrex 路径保持一致。当前数值错误会被静默忽略。
_tp2_worker 异常时 destroy_distributed_environment 不被调用，泄漏 NCCL 通信组 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/test/flashmla_sparse_cp_op_test.py:261
- 建议：将 destroy_distributed_environment() 放在 finally 块中，确保异常路径也能清理 NCCL 资源。当前异常会导致进程持有 NCCL communicator 直到被 kill。
deepep_low_latency_router_test 端口锁在 mp.spawn 异常时泄漏 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/routers/test/deepep_low_latency_router_test.py:305
- 建议：将锁释放移到 try/finally 块中：try: for ... mp.spawn(...) finally: for lock in locks: lock.exit(None, None, None)。
fp4_gemm_linear_test 多次设置 os.environ 但未在 tearDown 中恢复，测试顺序依赖 @ rtp_llm/models_py/modules/factory/linear/impl/cuda/test/fp4_gemm_linear_test.py:79
- 建议：使用 unittest.mock.patch.dict(os.environ, {'RTP_LLM_FP4_GEMM_BACKEND': 'xxx'}) 作为上下文管理器，或在 setUp 保存原始值、tearDown 恢复。

Non-blocking Suggestions

P2

transformers compat shim 在 sys.modules 注入后未设 spec @ rtp_llm/frontend/tokenizer_factory/tokenizers/__init__.py:52
- 建议：将 fallback module 直接注入 sys.modules 作为别名时，该 module 的 name/spec 仍指向原始路径。大多数场景不影响功能，但如果有代码检查 module.name 或 spec.name 来判断模块身份，可能产生混淆。建议添加注释说明此 trade-off 或考虑设置 name。
BatchDecodeScheduler busy-wait 期间变为 FINISHED 的 stream 未从 waiting_streams_ 移除 @ rtp_llm/cpp/engine_base/schedulers/BatchDecodeScheduler.h:199
- 建议：在构建 scheduled_ptrs 时使用 busy-wait 之前的完整 new_streams 列表（包含所有被选中的 stream），而不是过滤后的版本。或者，将所有参与调度的 stream（包括 FINISHED 的）都加入 scheduled_ptrs，确保它们从 waiting_streams_ 中移除。当前行为不会导致数据损坏，但可能造成短暂的资源浪费。
getFeatureHash 对大 embedding 做 D2H 拷贝可能阻塞推理流水线 @ rtp_llm/cpp/multimodal_processor/MultimodalProcessor.cc:28
- 建议：考虑使用 pinned memory + 异步拷贝，或在 GPU 上直接计算 hash（使用 CUDA kernel）。如果性能影响可接受（只在 prefill 阶段执行一次），可加注释说明。
FIFOScheduler ReturnAllProbsMode 隔离仅基于第一个 running stream 的 mode @ rtp_llm/cpp/engine_base/schedulers/FIFOScheduler.cc:211
- 建议：如果 running_streams_ 可能已经包含混合模式（例如旧代码遗留），建议扫描所有 running stream 检查是否已有冲突，而不是只取第一个。或者确认在新代码下 running_streams_ 永远不会混入不同模式。
EmbeddingEndpoint VIT 地址轮询无锁保护 @ rtp_llm/embedding/embedding_endpoint.py:108
- 建议：问题在实际运行中不太可能触发（asyncio 在 await 前不会切换），但建议用 asyncio.Lock 保护对 _vit_role_addrs/idx/ts 的更新操作，避免潜在的并发更新竞争。
cuda_graph combo_position_ids prepareInputs 不处理 prefill CUDA graph 路径 @ rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc:161
- 建议：确认 MRoPE 模型是否会在 prefill CUDA graph 模式下运行。如果会，需在 else 分支（prefill 路径）也添加 combo_position_ids 拷贝逻辑。如果确认不会（由 planMicroBatches 已禁用），加注释说明。
MagaServerManager.get_free_port 分配 200 个连续端口但只使用 1 个 @ rtp_llm/test/utils/maga_server_manager.py:59
- 建议：每次调用锁定 200 个端口（创建 200 个文件锁）但只使用 ports[0]+100。实际所需端口范围取决于 MIN_WORKER_INFO_PORT_NUM * DP_SIZE，建议根据实际需要计算端口数而非硬编码 200，减少文件锁资源和 PortManager 清理开销。
VIT LoadBalancer 三个操作各持独立锁获取，高并发下锁竞争增加延迟 @ rtp_llm/server/vit_proxy_server.py:185
- 建议：每个 RPC 请求触发 3 次锁获取/释放（get_worker、increment、decrement 各持锁一次）。在 200 线程池下，建议将 get_worker + increment_connections 合并为一个原子操作（单次加锁），减少锁竞争次数。
conftest _gpu_mem_monitor 每个测试函数 import gc/torch 并调用 gc.collect() @ conftest.py:206
- 建议：gc.collect() + torch.cuda.empty_cache() 对每个测试函数都执行，在有大量 Python 对象的测试套件中开销显著（gc.collect 扫描所有代 heap）。建议增加条件判断：仅在实际观测到显存增长（delta_reserved > 阈值）时才调用 gc.collect + empty_cache，避免无 GPU 操作的纯 CPU 测试也承担此开销。
CASClient.download_blob 失败时静默返回空 bytes 可能导致下游误判 @ rtp_llm/test/remote_tests/cas_client.py:319
- 建议：考虑在调用侧区分 'blob 为空' vs 'RPC 失败'。当前 download_blob 返回 b'' 后，output_collector.download_and_extract 已有 'if not data' 保护，但 executor._parse 中的 stdout/stderr 下载失败会静默丢失日志内容，建议至少在 ExecutionResult 中标记 partial_download。
SparseMlaFp8CPOp.forward 中 layer_id==0 时修改 metadata 对象，可能影响后续 layer @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/flashmla_sparse_cp_impl.py:269
- 建议：这里对 sched 对象的 tile_scheduler_metadata 和 num_splits 字段置 None 只在 layer_id==0 时执行，但 meta 被重复使用于后续 run_part 调用（line 305），如果 tile_scheduler_metadata 已被置 None 则 flash_mla_with_kvcache 可能收到 None scheduler。需确认 flash_mla 是否容忍 None 参数，或该清理逻辑是否应在 plan() 时执行而非 forward() 时。
PureDpRouter._pad_to_max 在每层 MoE 热路径上执行 D2H 同步 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/routers/pure_dp_router.py:127
- 建议：已有 TODO 标注。当前实现每层 MoE 都会产生 GPU→CPU 同步，高层数模型下会累积显著的 pipeline bubble。建议在 step 级别预计算 max_n 并传入，或使用 CUDA kernel 内原地 padding。
Python MMPreprocessConfig 缺少 crop_positions 和 mm_timeout_ms 字段 @ rtp_llm/multimodal/multimodal_util.py:74
- 建议：Python 版 MMPreprocessConfig 与 C++ 版缺少 crop_positions 和 mm_timeout_ms 字段。trans_mm_input 的 list 路径直接将 Python 对象传递给 _CppMultimodalInput 构造函数（line 317: mm_input.mm_preprocess_config），pybind11 会因类型不匹配而报错。当前该路径未被调用故不影响运行时，但新增使用 Python MultimodalInput 的代码会触发此问题。建议补齐这两个字段或在 list 路径中显式转换：_CppMMPreprocessConfig(width=..., crop_positions=[], mm_timeout_ms=0)。
删除 deps/pip.bzl 后 README_cn.md 引用了不存在的文件 @ docs/README.md:0
- 建议：README_cn.md 第 96 行仍引用已删除的 deps/pip.bzl，需要更新文档指向新的依赖管理方式（如 requirements_*.txt 或 pyproject.toml）。
sparse_mla_op_test 通过归一化隐藏 scaling 误差 @ rtp_llm/models_py/modules/factory/attention/cuda_mla_impl/test/sparse_mla_op_test.py:379
- 建议：分别独立归一化后比较会掩盖全局 scaling 差异。建议直接比较 output 和 ref_output (配合合理 atol/rtol)，或仅用 cosine similarity 作为辅助检查。当前的 cosine_sim > 0.99 断言保留即可。
ROCm conftest 静默跳过所有测试无 pytest.skip 提示 @ rtp_llm/models_py/modules/base/rocm/test/conftest.py:8
- 建议：改用 pytest.importorskip('aiter') 或在 conftest.py 中使用 pytest.skip('aiter not available', allow_module_level=True)，使跳过行为在测试报告中可见。
XQA _test_decode_correctness 不支持配置时静默 return 而非 skipTest @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:66
- 建议：将 return 改为 self.skipTest(f'XQAAttnOp does not support config: head_num={head_num}, ...')，使不支持的配置在测试报告中显示为 SKIP。
get_topk_ragged_cp_test 缺少值正确性断言 @ rtp_llm/models_py/modules/base/cuda/test/get_topk_ragged_cp_test.py:1
- 建议：添加针对已知输入的值断言，如 torch.allclose(output, expected_output) 或验证返回的索引确实指向输入中最大的 k 个值。
trtllm_gen.py 中 assert kv_cache 在 -O 模式下会被跳过 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trtllm_gen.py:383
- 建议：改为 if kv_cache is None: raise ValueError('kv_cache is required for TRTLLMGen prefill')。assert 不应用于运行时参数校验。
test_error_handling 裸 try/except pass 不验证异常是否抛出 @ rtp_llm/models_py/modules/factory/linear/impl/cuda/test/fp4_gemm_linear_test.py:255
- 建议：改为 self.assertRaises(RuntimeError, fp4_linear, wrong_input) 或使用 with self.assertRaises(Exception): 明确验证异常行为。
ROCm deepep_normal_router_test 使用 mp.Process 而非 spawn context，CUDA fork 不安全 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/deepep_normal_router_test.py:209
- 建议：改为 mp.get_context('spawn').Process(...)，与 CUDA 版本保持一致，避免 fork 模式下 CUDA context 死锁。
Embedding 测试用 hidden_size 作 randint 上界而非 vocab_size，语义不正确 @ rtp_llm/models_py/modules/base/common/test/embeding_test.py:43
- 建议：改为 torch.randint(0, w.shape[0], ...) 或 torch.randint(0, ******, ...) 以正确测试全词表范围。
rocm_fmha_test 使用 dout == None 而非 is None，Tensor 类型时行为异常 @ rtp_llm/models_py/modules/base/rocm/test/rocm_fmha_test.py:188
- 建议：改为 dout is None，避免当 dout 为 Tensor 时条件判断返回 Tensor 而非 bool。
mainse comparer 注册逻辑与旧代码行为不一致 @ rtp_llm/test/smoke/comparer_registry.py:95
- 建议：当 mainse 文件更新 import 后此 bug 会激活。应拆为三个注册：use_decode_arpc→MainseDecodeArpcComparer，use_emb_arpc→MainseEmbeddingArpcComparer，兜底 mainse_module→MainseComparer（需额外 import）。当前由于 import 失败触发 RuntimeError 而非静默错误，暂不阻塞。
torch.load(weights_only=False) 存在安全/性能隐患 @ rtp_llm/test/smoke/normal_comparer.py:170
- 建议：weights_only=False 允许任意 pickle 反序列化（安全隐患），且比 weights_only=True 慢。golden 数据文件只含 tensor，应使用 weights_only=True。同样问题存在于 openai_comparer.py:35 和 embedding_comparer.py:49。

Checklist ✅ (56 items passed)

Strengths

rejection sampling 实现了完善的参数验证（shape/dtype/device/contiguity 全覆盖），附带详细的错误消息
CUDA 和 ROCm 双平台实现保持逻辑一致，两端都有 template instantiation
transformers 5.x 兼容 shim 设计合理：模块级 side-effect 保证在任何 tokenizer 使用前生效，fallback 链路完整
Qwen35MoeWeight._process_meta 从硬编码 prefix 改为自动检测，增强了对不同 checkpoint 格式的鲁棒性
rejection sampling 的 reference test 实现覆盖了多种 case（全 accept、部分 reject、do_sample=false、0/1 speculative tokens、非法输入），验证充分
DeepSeekV2 的 moe_layer_freq 默认值修复使 GLM-4.7 等新模型能正确加载
TensorPbConvert::pbToTorch 新增完整的形状、溢出、数据大小验证，大幅提升了 RPC tensor 反序列化的安全性
RemoteMultimodalProcessor 新增 gRPC deadline 设置和连接失败时的自动移除，防止 VIT worker 卡死导致级联超时

- TensorPbConvert::torchToPb: move tensor to CPU before serializing data_ptr. - deepseek_vl2 / kimi_k25: raise FileNotFoundError on missing config.json instead of returning None/{} (fail fast, matches QWenV2). - qwen_v2: fix transformer_prefix order to prefix + model_prefix so multimodal ("language_model.model.") keys resolve correctly. - ssrf_check: also reject multicast and unspecified addresses. - embeding_test: index token ids by vocab_size, not hidden_size. - deepgemm_wrapper: globals().get() instead of getattr(globals(), ...). - tau2_bench_comparer: tarfile extractall(filter="data") with fallback. - QueryConverter::transMMInputsPB: take vector by const reference. - attn_factory: extract _expand_flashinfer_alias() for bidirectional alias. - headwise: cache input/kv length CPU lists in prepare(), reuse in forward(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- ConfigInit: FMHAConfig/MoeConfig pickle setstate accept legacy tuple sizes (12/14 and 11/13), filling new trailing fields with defaults for rolling upgrades instead of throwing. - mori_ep_intranode_router: assert _dispatch_ids is set in _finalize_single. - xqa: XQADecodeImpl.support uses kernel_tokens_per_block (kernel-consistent); XQAWrapper precomputes q_len_per_req in prepare() to drop the per-forward GPU->CPU sync (propagated through the cuda-graph update path). - test_xqa: assert per-case support/unsupported results and final counts. - rocm_fmha_test: return failed when the aiter comparison mismatches. - flashmla_sparse_cp_op_test: destroy_distributed_environment() in finally. - deepep_low_latency_router_test: release port locks in finally. - fp4_gemm_linear_test: snapshot/restore os.environ in setUp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LLLLKKKK · 2026-06-29T03:11:05Z

AI Code Review - PR #1121

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/31 · P3/10

Blocking Issues

P1

mainse comparer 注册谓词过于宽泛，路由错误 @ rtp_llm/test/smoke/comparer_registry.py:95
- 建议：将 MainseDecodeArpcComparer 的注册谓词限定为 q_r.get('mainse_module', False) and q_r.get('use_decode_arpc', False)；为 MainseEmbeddingArpcComparer 改为 q_r.get('mainse_module', False) and q_r.get('use_emb_arpc', False)；为无子标志的 mainse 注册 MainseComparer（如果仍需要）

Non-blocking Suggestions

P2

test_support 中 boundary_cases 断言依赖 desc 字符串约定，脆弱 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/test_xqa.py:347
- 建议：将 expected 值作为元组的一个字段（如 (32, 2, 128, 64, 'fp16', True, 'desc')），而不是从 description 字符串中解析 'SHOULD BE UNSUPPORTED'。字符串解析容易因重命名 desc 而引入静默失败。
embedding test 的 vocab_size 修复后 model_config.vocab_size 仍为 1 @ rtp_llm/models_py/modules/base/common/test/embeding_test.py:36
- 建议：model_config.vocab_size 应设为 vocab_size (******) 以保持一致。当前值 1 与权重表大小不匹配，如果 Embedding 内部引用了 model_config.vocab_size 做校验，可能产生误导。建议改为 model_config.vocab_size = vocab_size。
ssrf_check 函数文档字符串未更新以反映新增的 multicast/unspecified 检查 @ rtp_llm/utils/ssrf_check.py:16
- 建议：文档字符串应更新为 'private/loopback/reserved/link-local/multicast/unspecified'，与实际检查一致，避免后续维护者遗漏。
TensorPbConvert 中 .to(kCPU).contiguous() 顺序可优化 @ rtp_llm/cpp/model_rpc/TensorPbConvert.cc:151
- 建议：改为 tensor.contiguous().to(torch::kCPU) 或直接 tensor.to(torch::kCPU) 即可（.to() 从 CUDA 到 CPU 拷贝时总会产生 contiguous 输出）。当前写法功能正确但有冗余调用。
XQA prepare() 中 sequence_lengths.max().item() 仍有 GPU→CPU 同步 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:238
- 建议：这不是新引入的问题（已存在于 prepare() 中），但既然本 PR 的主题是把 GPU→CPU 同步移出 forward() 热路径，可考虑一并用 CPU 端 seqlens 列表的 max() 替代 GPU .max().item()，减少一次隐式同步。
tarfile fallback 路径缺少安全审计日志 @ rtp_llm/test/smoke/tau2_bench_comparer.py:155
- 建议：TypeError fallback 说明运行在旧版 Python 上。建议在 except 分支添加 logging.warning 提示当前无 tarfile 安全过滤，以便 CI 环境升级时跟踪。不影响正确性（测试代码），但可提高可观测性。
rocm_fmha_test 新增 return 后 atrex 分支不会执行 @ rtp_llm/models_py/modules/base/rocm/test/rocm_fmha_test.py:834
- 建议：在 aiter 比较失败时新增了 return {"test_status": "failed"}，这意味着 aiter 失败时不再继续执行后续的 atrex 对比。如果测试意图是两者都要验证，应使用 test_result 变量记录失败，最后统一返回，而非提前 return。如果意图确实是 aiter 失败即停止，则当前行为正确。建议确认测试意图。
_finalize_chunked 缺少与 _finalize_single 一致的 None 防御 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:274
- 建议：在 _finalize_chunked 开头添加类似的 assert 检查：assert self._chunk_dispatch_ids is not None and len(self._chunk_dispatch_ids) > 0，与 _finalize_single 保持一致的防御风格。
deepep_normal_router_test 同样的端口锁泄漏未修复 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/routers/test/deepep_normal_router_test.py:247
- 建议：将 deepep_normal_router_test.py 中进程启动和 join 逻辑也包裹在 try/finally 中释放锁，与本 PR 对 low_latency 版本的修复保持一致。
worker_status_comparer 中 model_dump() 结果未使用 @ rtp_llm/test/smoke/worker_status_comparer.py:37
- 建议：删除第37-38行的 model_dump() 调用，它们的结果 expect 和 actual 从未被使用，每次比较都浪费一次 Pydantic 序列化开销。
openai_comparer.extract_logprobs 每次比较做 deepcopy 整个 choices 树 @ rtp_llm/test/smoke/openai_comparer.py:157
- 建议：改为只提取 logprob 数值列表而不 deepcopy choices。将 choices 比较改为独立步骤：先收集 logprob 值，再单独比较 choices 结构（排除 logprob 字段），避免整棵 Pydantic 树的深拷贝。
concurrency_test 用 str() 比较复杂 Pydantic 对象 @ rtp_llm/test/smoke/case_runner.py:317
- 建议：使用 result.model_dump() == results[0].model_dump() 或结构化比较代替 str() 字符串化。str() 每次递归序列化整棵对象树为字符串，开销大且不稳定（格式依赖实现细节）。
create_temporary_copy 创建的临时文件从不清理 @ rtp_llm/test/smoke/utils.py:19
- 建议：在 CaseRunner 或 test entry 结束时统一清理临时文件（维护一个列表，在 finally 中删除），或使用 tempfile.TemporaryDirectory 作为生命周期容器。当前每次 multimodal 查询都会泄露一个临时文件到 /tmp。
rocm/test/conftest.py collect_ignore_glob 过于激进 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/conftest.py:13
- 建议：在注释中明确说明：non-ROCm 测试文件名不能以 rocm_/deepep_/moriep_ 开头，否则会被忽略
indexer_ref.py tilelang fallback 用 _FakeTilelang 替身可能隐藏运行时错误 @ rtp_llm/models_py/modules/hybrid/test/indexer_ref.py:39
- 建议：由于 indexer_test.py 已经 CUDA_VERSION_OK = False 跳过了所有 tilelang 测试，_FakeTilelang 不会被执行到；但如果未来 CUDA_VERSION_OK 改为 True，这个 fallback 会产生静默错误。考虑在被装饰的函数中 raise SkipTest('tilelang not available')
fused_moe_test.py subTest dtype 参数从 torch.dtype 改为 str @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/test/fused_moe_test.py:218
- 建议：确认修改原因（pytest subTest 不支持 torch.dtype 作为参数？）并在 commit message 中说明
TRTMHAImpl.forward 每次调用分配 GPU 缓冲区 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/trt.py:80
- 建议：仿照 PyFlashinferPrefillPagedAttnOp 中的 _aligned_q_buf / compact_out_buf 模式，在 init 或 prepare 阶段预分配缓冲区并缓存为实例属性，forward 中重用并 zero() 即可。每层每步各省两次 cudaMalloc。
PureDpRouter._pad_to_max 每层执行 GPU→CPU 同步 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/routers/pure_dp_router.py:125
- 建议：代码中已有 TODO 注释说明了此问题。建议将 max_n 提升到 step 级别元数据，在进入 MoE 层栈前计算一次，避免每层 hot path 上的 D2H sync + all_gather。已标记在 PR feat: PureCP/PureDP allgather+RS routing for FP8 per-block MoE #968 review 中跟踪。
StrategyRegistry.register 每次注册重建完整 key 集合 @ rtp_llm/models_py/modules/factory/fused_moe/strategy_registry.py:41
- 建议：注册只在初始化阶段调用，不在 hot path 上，因此 O(N) 重建 set 可以接受。但如果策略数量增多，可考虑维护一个 _registered_keys: set 做增量更新。LinearFactory.register 有同样模式。影响可忽略（P3 级别），保留为 P2 仅因为两个 factory 都有。
HeadWisePrefillAttnOp._get_headwise_config 中 .sum().cpu() 同步 @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/headwise.py:192
- 建议：此处有按 layer_idx 缓存，所以只在首次遇到新层时触发 GPU→CPU 同步。在稳态推理中不会重复执行。可接受，但如果层数很多，可以考虑批量预计算所有层的 retrieval head 数量。
headwise forward 逐序列串行处理，未批量化 @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/headwise.py:307
- 建议：当前 headwise prefill 在 forward() 中逐序列迭代处理。对于大 batch（多序列同时 prefill），每个序列独立启动 FlashInfer wrapper + sparse kernel，带来多次 kernel launch 开销。未来可考虑将相同 headwise 策略（use_headwise=True/False）的序列合批处理。当前设计在 batch_size=1 的典型长序列 prefill 场景中可接受。
HeadWiseFP8 _extract_kv_fp8 每次分配两个临时 tensor @ rtp_llm/models_py/modules/factory/attention/cuda_headwise_impl/headwise_fp8.py:247
- 建议：FP8 headwise 路径的 _extract_kv_fp8 在每个序列的每层 forward 中分配 kv_len 大小的临时 tensor。对长序列（128K+ tokens）这意味着每层 ~128K * head * dim * 2 bytes 的分配。考虑在 prepare 阶段按 max(kv_len) 预分配并重用。
resolve_symbol 返回 None 时 except AttributeError 是死代码 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:165
- 建议：resolve_symbol() 不会抛出 AttributeError（缺失符号时返回 None），所以 except 块永远不会执行。如需检测符号缺失，应在 resolve_symbol 返回 None 时主动 raise，或改为 if result is None: raise RuntimeError(...)
FlyDSL sentinel 缓存被 .to() 调用部分架空 @ rtp_llm/models_py/triton_kernels/fla/flydsl_chunk_gdn_mi308x.py:1145
- 建议：当 cu_seqlens/chunk_offsets 为 None 时，dummy.to(torch.int64) 每次调用都会分配新 tensor（即使 size=0），绕过了 _sentinel_cache 的目的。建议将 int64 也加入 sentinel 缓存：_get_zero_sentinel(q.device, torch.int64) 替代 dummy.to(torch.int64)。_launch_fast_into 中同一模式也需修改。
mm_process_engine submit 在正常路径也持锁 @ rtp_llm/multimodal/mm_process_engine.py:198
- 建议：将 _pool_lock 包裹整个 submit（包括正常路径的 apply_async）修复了 TOCTOU race，但所有并发 submit 请求现在被序列化。multiprocessing.Pool.apply_async 本身是线程安全的，所以对于无 BrokenPipe 的正常路径，锁争用是不必要的开销。建议改为：先在锁外尝试 apply_async，只在 BrokenPipeError/OSError 时获取锁做 rebuild+retry。或者使用 RLock + 在 _rebuild_pool 里加锁即可。当多模态请求量大时此处会成为瓶颈。
GenericMoeLayer EP 模式下 allreduce 与 gate 操作顺序变更 @ rtp_llm/models_py/model_desc/generic_moe.py:183
- 建议：新顺序（先 allreduce 再 fused sigmoid_gate_scale_add）用 fused kernel 替代了临时 tensor + element-wise 乘法，是正确的性能优化（数学等价：gate_output 在所有 rank 一致）。但 allreduce 的数据量从 [T, hidden] 变为整个 shared_expert_output（未经 gate 缩放），通信量不变，实际收益来自减少一次 kernel launch 和临时内存分配。建议添加注释说明等价性依赖 gate_output 在各 rank 一致的前提。
GenericMoeLayer TP-only 模式仍执行 2 次独立 allreduce @ rtp_llm/models_py/model_desc/generic_moe.py:162
- 建议：PR 已在 TODO 中识别此问题。TP-only 模式下 routed + shared 各自 allreduce（2 次通信），若 fused_moe 支持 skip_allreduce 后可合并为 1 次，通信延迟减半。当前不阻塞，但对 MoE 模型（如 Qwen3-MoE）decode 延迟有显著影响。建议跟进实现。
get_preprocess_config 未处理 crop_positions 中非数字字符 @ rtp_llm/openai/renderers/llava_renderer.py:26
- 建议：float(x) 在非数字输入时抛出 ValueError，错误信息不够友好。建议 catch ValueError 并包装为更有意义的错误提示。不过此为边界输入校验，非核心逻辑。
multimodal_util.py 中 request_get 移除 fallback 可能影响无 ssrf_check 的部署 @ rtp_llm/multimodal/multimodal_util.py:28
- 建议：旧代码在 ssrf_check 不可用时 fallback 到 requests.get；新代码直接 raise ImportError。确认所有部署环境（包括 standalone/test）都有 rtp_llm.utils.ssrf_check 可用。若有环境缺少此模块，此改动将导致 multimodal URL 下载完全不可用。
MMProfiler._profiled_count 在 profiler 执行前递增 @ rtp_llm/multimodal/mm_profiler.py:225
- 建议：如果 torch.profiler.profile() 构造或 yield 中的用户代码失败，_profiled_count 已增但未产出 trace 文件，导致 end_profile 汇报的 profiled_count 偏大、trace 文件编号有空洞。考虑在 profiler 成功完成后再递增 count，或在 except 中回退 count。作为调试工具可以接受，但会让用户困惑。
XQA prepare() 中 sequence_lengths.max().item() 的 GPU→CPU 同步已从 per-layer 移至 per-step，但仍可避免 @ rtp_llm/models_py/modules/factory/attention/cuda_impl/xqa.py:238
- 建议：在 C++ PyWrappedModel::buildPyAttentionInputs 中预计算 max_sequence_length 并添加到 PyAttentionInputs 字段中，Python 侧直接读取，避免 GPU→CPU 同步

P3

_is_private_host docstring 同样未包含新增检查项 @ rtp_llm/utils/ssrf_check.py:32
- 建议：与 _is_private_ip 保持一致，更新 docstring 包含 multicast/unspecified
assert 在 -O 模式下会被跳过 @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:245
- 建议：在实际推理路径中，如果希望防御始终生效，可考虑用 if self._dispatch_ids is None: raise RuntimeError(...) 替代 assert。不过 RTP-LLM 不使用 -O 模式运行，当前方式可接受。
normal_comparer.is_close_list 为少量标量创建 torch.tensor @ rtp_llm/test/smoke/normal_comparer.py:393
- 建议：对于 cum_log_probs、softmax_probs 等短列表（通常 <10 元素），可以用 math.isclose 的纯 Python 循环替代 torch.tensor 创建，避免 tensor 分配开销。对测试运行时间影响小，但更符合惯用做法。
stability repeat 中 per-query os.environ.get 重复调用 @ rtp_llm/test/smoke/case_runner.py:356
- 建议：将 os.environ.get("TEST_UNDECLARED_OUTPUTS_DIR") 提到循环外缓存为局部变量。虽然 os.environ.get 本身不贵（dict 查找），但在 repeat_count * num_queries 次循环中避免重复查找是更干净的做法。
platform_ext_loader 使用相对路径推断 project_root @ rtp_llm/models_py/modules/factory/platform_ext_loader.py:16
- 建议：基于 parents[4] 的路径推断比较脆弱。如果 platform_ext_loader.py 的目录层级发生变化会悄悄失效。考虑使用 git root 或环境变量来定位 internal_source。不影响正确性但降低了可维护性。
list_strategies 每次调用都重新排序 @ rtp_llm/models_py/modules/factory/fused_moe/strategy_registry.py:56
- 建议：list_strategies 返回排序后的列表但不缓存结果。如果频繁调用可考虑在 register 后维护排序列表。当前仅在模型初始化时调用，不影响推理热路径。
reduce_scatter output_tensor 校验可简化 @ rtp_llm/models_py/distributed/collective_torch.py:712
- 建议：3 个独立校验会在热路径（每次 reduce_scatter 调用）执行 tuple 构造和比较。对于内部代码可省略 device/dtype 校验（调用方保证一致），或在第一次校验失败时才做详细诊断。当前 overhead 很小（Python 层），不阻塞。
collective_torch_test.py 和 multimodal_util_test.py 重复 if name 块 @ rtp_llm/models_py/distributed/test/collective_torch_test.py:574
- 建议：合并为一个 if name == 'main' 块。multimodal_util_test.py:43 也有同样问题。
测试文件重复 if name == "main" 块 @ rtp_llm/models_py/distributed/test/collective_torch_test.py:575
- 建议：合并为单个 if name == "main" 块。
torchToPb 中 .to(kCPU).contiguous() 的顺序对非连续 CUDA tensor 次优 @ rtp_llm/cpp/model_rpc/TensorPbConvert.cc:151
- 建议：改为 tensor.contiguous().to(torch::kCPU)，使 GPU 端先做 contiguous 再做单次连续 D2H 拷贝。虽然当前实际影响极小，但语义上更正确

Checklist ✅ (56 items passed)

Strengths

将 GPU→CPU 同步从每层 forward() 热路径提升到 prepare() 阶段（headwise.py、xqa.py），显著减少推理延迟
flashmla_sparse_cp_op_test 使用 finally 块确保 destroy_distributed_environment() 在异常路径也执行，防止 NCCL 进程组泄漏
deepep_low_latency_router_test 用 try/finally 包裹端口锁释放，防止 mp.spawn 失败时泄漏端口
fp4_gemm_linear_test 用 patch.dict(os.environ) 隔离环境变量，消除测试间排序依赖
embeding_test 修复 randint 上界从 hidden_size 改为 vocab_size，修正潜在越界 bug
test_xqa.py 将原本仅 log-and-continue 的 support() 验证改为真正的 assertTrue/assertFalse 断言，杜绝测试假绿
mori_ep_intranode_router 添加 _dispatch_ids 非空断言，防止 prepare 失败后 finalize 静默使用 None
_expand_flashinfer_alias 提取为公共函数，消除 blocklist/known-names 两处的重复别名扩展逻辑
tar.extractall 添加 filter='data' 防止路径穿越攻击，并优雅兼容不支持 filter 参数的旧 Python 版本
SSRF 检查新增 is_multicast 和 is_unspecified，堵住了通过多播地址 (224.0.0.0/4) 或未指定地址 (0.0.0.0) 绕过 SSRF 保护的可能

LLLLKKKK and others added 18 commits May 14, 2026 13:40

build: migrate OSS packaging to pyproject and setup.py

c9b9bf1

- Move OSS build/package metadata into pyproject and setup.py.\n- Carry forward overlay platform detection and dependency/package-data fixes.\n- Keep Bazel/deps metadata aligned with the pyproject migration.

fix(import): make python-native platform bootstrap safe

2f444b7

- Use string-based attention backend selection.\n- Avoid platform import cycles during python-native bootstrap.\n- Keep model, op, and server-arg imports safe across CUDA, ROCm, and OSS runtimes.

test: migrate OSS unit tests to pytest profiles

211b9bd

- Port OSS unit test entrypoints from Bazel wrappers to pytest profile execution.\n- Add shared pytest profile support and platform skips.\n- Preserve existing unit coverage under the new profile layout.

test(smoke): port smoke and perf suites to pytest

45175bd

- Move smoke and perf suites to pytest-based profiles.\n- Restore smoke runtime behavior for ROCm and SM100 profiles.\n- Update SM100 golden data and related platform-specific test coverage.

feat(remote): add REAPI pytest execution

26bf22c

- Add remote pytest execution over REAPI with CAS inputs and output collection.\n- Harden executor failover, timeout budgeting, and cleanup.\n- Include perf data inputs for per-test remote execution.

ci: run perf profile on remote h20

cbeb615

- Keep the OSS history aligned with the internal perf-profile CI rollout.

fix(package): address packaging review issues

692bf18

ci: run gb200 eval bench via pytest

5ce731e

fix(smoke): register tau2 bench comparer

34994a6

fix(ci): route cpp utility tests to A10

4c6906f

fix(ci): pin tokenizers for transformer deps

dce3a2f

fix(smoke): use bundled tau2 data in ci

705effc

build: precompile bytecode during package installs

82cb500

baohengyi requested a review from LLLLKKKK as a code owner June 21, 2026 06:27

baohengyi force-pushed the feature/python_native_v2 branch from a9d8112 to a468563 Compare June 21, 2026 06:54

Merge branch 'upstream/main' into feature/python_native_v2

16591ed

fix: restore py_standalone_testlib BUILD target

a073b14

baohengyi added 2 commits June 22, 2026 07:57

docs: migrate README_cn install instructions to pyproject extras

2a93558

baohengyi added 2 commits June 22, 2026 11:02

baohengyi added 3 commits June 23, 2026 14:40

baohengyi added 2 commits June 24, 2026 14:24

baohengyi and others added 4 commits June 26, 2026 15:35

baohengyi and others added 2 commits June 29, 2026 00:53

Uh oh!

Conversation

baohengyi commented Jun 21, 2026

Uh oh!

CLAassistant commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LLLLKKKK commented Jun 21, 2026

AI Code Review - PR #1121

Blocking Issues

P1

Non-blocking Suggestions

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 21, 2026

AI Code Review - PR #1121

Blocking Issues

P0

P1

Checklist Violations (6 fail / 56 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1121

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (12 fail / 56 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1121

Blocking Issues

P0

P1

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1121

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1121

Blocking Issues

P1

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1121

Blocking Issues

P1

Non-blocking Suggestions

P2

P3

Checklist ✅ (56 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 23, 2026

AI Code Review - PR #1121

Blocking Issues

P0

P1

Non-blocking Suggestions

P2

Checklist ✅ (56 items passed)

Strengths

Uh oh!

CLAassistant commented Jun 21, 2026 •

edited

Loading