Feature/python native v2#1108
Open
baohengyi wants to merge 109 commits into
Open
Conversation
Add MoRI-based expert parallelism (EP) as an alternative to DeepEP for ROCm GPUs, enabling efficient intranode and internode dispatch/combine for MoE models. Core components: - MoriEPWrapper: singleton wrapper around mori EpDispatchCombineOp - MoriEpIntranodeRouter: EP router with dispatch/combine flow, chunked dispatch for large token counts, and global-to-local ID remapping for fused expert kernels - MoriEpFp4Strategy: strategy pairing MoRI router with FP4 executor - RocmEpNormalStrategy: MoRI router as first-class option alongside DeepEP, with mutual exclusion validation Config flow: - --use_mori_ep CLI flag (env: USE_MORI_EP) → deep_ep_config → auto_configure_deepep() → moe_config.use_mori_ep - C++ MoeConfig.use_mori_ep field with pybind binding - MoEConfigAdapter exposes use_mori_ep and use_deepep_moe Bazel: moriep_wrapper py_library target with modules dependency.
- FakeBalanceExpert ROCm C++ op for expert load balancing testing - MoriEpIntranodeRouter unit tests - BlockPoolConfigHelper adjustment for MoE cache allocation
- Remove TP-only unified allreduce optimization that caused numerical differences vs main branch (12 smoke test failures) - Remove dead skip_allreduce code from FusedMoe.forward() and the entire router finalize() chain (base class, all implementations, associated test) - Fix golden JSON files with Unicode curly quotes (rebuild with json.dump) - Update stale golden values from CI actual output - Refactor moriep_intranode_router_test to use production MoriEpIntranodeRouter instead of duplicated inline helpers - Fix use_all_gather early return swallowing explicit USE_MORI_EP config
BatchGenerateCall (used by /batch_infer) bypasses PrefillRpcServer, so PD separation is not applied. Add explicit endpoint "/" to the PD golden so that prompt_batch queries go through the PD-aware GenerateStreamCall path, matching main's behavior.
The model generates Unicode right single quote (') in PD streaming
path, not ASCII apostrophe ('). Update golden to match actual output.
MoriEP was passing parallelism_config.world_size to mori config, which would mismatch ep_rank when TP×EP/DP co-existence is introduced. Fix by using ep_size and adding explicit assertions that ep_size == world_size (the only currently supported configuration) in both init paths. Add corresponding validation test.
… in paged prefill
Replace triton.autotune with CachedAutotuner that loads checked-in per-kernel JSON configs, eliminating cross-process autotune races that caused non-deterministic kimi-linear smoke test failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 6 separate elementwise HIP kernel launches (2x compare, 1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a single fused Triton kernel that performs the same remap logic in one pass. Reduces per-layer decode overhead by ~39us (6 launches → 1 launch). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard remap_to_local_ids against empty dispatch (N=0) to prevent 0-block Triton kernel launch - Move test to moe/test/ with Bazel BUILD (exec_properties MI308X-ROCM7), add module-level pytest.skip when no GPU available - Change router import to lazy (inside _remap_to_local_ids method) so stub-based MoriEP tests are not broken by top-level triton import - Preload remap_local_ids_kernel in MoriEP multi-GPU test stub loader - Add parameterized test suite covering N=0, all-local, all-non-local, mixed, boundary IDs, various dtypes and topk values Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kernel Adds a ROCm-only single-pass QK RMSNorm path tuned for wave64 (one warp per (token, head)). Replaces the per-head reshape+rmsnorm chain on the hot path; falls back to the baseline when norm_size != 256 or bias is present. Default-on via ``rtp_llm_ops.fused_qk_rmsnorm_v2`` from ``modules/base/rocm/norm.py::FusedQKRMSNorm``. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…k kernel Hand-tuned BF16 prefill kernel: vec=8, 4 tokens per block, smem-based partner exchange for the rotary half-swap, and cos/sin reuse across all heads inside one block (cos/sin only depends on token+rope_d, not on head). Replaces the per-(token,head) launch pattern of the original V1 kernel; ~5.77x kernel speedup measured on Qwen3.5-9B prefill 15k TP=2, on both the ASM and NonAsm dispatch paths. V3 is gated by the production hot-path config (BF16, paged_fmha, prefix=0, RopeStyle::Base, store_q+store_cache, no store_qkv/store_kv, Tcache=BASE, no logn, head_dim%8==0, rot_dim%(VEC*2)==0, qkv_bias==nullptr); anything else falls back to the original kernel. The ASM and NonAsm invokers each select their own V_VEC_LAYOUT template to match the corresponding FMHA reader (templated <BASE> vs non-templated getVLocalIdx). Also defensively null-checks position_ids in the Mrope branches of the existing V1 kernels (orthogonal but caught while touching the file). Includes precision regression suite (NonAsm + ASM dispatch paths, Qwen3.5-9B uniform/varied/single-long, tail-block %4 residues, partial rotary, prefix-prompt fallback, multi-batch packed consistency, smaller head_dim fallback). Each hot-path test allocates a real LayerKVCache + block table so store_cache=true and V3 actually fires (the dispatch guard rejects kv_cache=None), then asserts Q against a torch fp32 ref AND decodes K/V back from the paged pool using the kernel's documented layout (kv_cache_utils.h:213-241) for direct comparison. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When use_paged_fmha=true && pad_query=false the prefill Q output buffer is sized exactly [token_num, heads, dim] and the V3/V1 kernels write every cell unconditionally, so the torch::zeros initialization is dead HBM bandwidth. Switch to torch::empty in that case; keep torch::zeros for the pad_query=true layout where padded slots between sequences remain unwritten and downstream FMHA reads them as masked zeros. Profile (Qwen3.5-9B prefill, seq_len=15000, ASM=1 TRITON=1): FillFunctor<BFloat16> 58 calls → 50 calls -8 calls × ~27µs = -219µs / prefill Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-layer FMHA prefill setup recomputed seqlen_k = input_lengths + prefix_lengths every layer in AiterPrefillAttnOp._forward_paged, firing a per-layer trio of kernels (prefix-zeros alloc + cu_seqlens diff + add + dtype-cast) that produced the same value 28 times in a 28-layer model. Hoist the computation into FMHAParams.__init__ once per prefill and read the cached int32 tensor in forward. Cached value is bit-exact identical to the per-layer recomputation — this is a hoist refactor, not a new operator. Cuda-graph replay path uses AiterPrefillAttnOpPaged (not _forward_paged), so it is untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ositions Address PR review (P1): the ROCm fused RoPE+KV-cache prefill/decode dispatch always passes position_ids=nullptr at all four call sites (ASM/Non-ASM x Prefill/Decode). The kernel's Mrope branch only sets position_id when position_ids is non-null, otherwise it silently falls back to -1 and produces wrong RoPE positions. combo_position_ids is not plumbed through PyAttentionInputs in PyWrappedModel::buildPyAttentionInputs yet, so Mrope on this path is unsupported. Throw at op construction with an actionable message. Plumbing real combo position_ids through the prepare/forward chain is left for a follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduce two MoE communication paths for pure-parallel topologies that bypass DeepEP and use NCCL allgather + reduce_scatter: - PureCpRouter: tp == ep > 1, dp == 1. allgather across TP, reduce_scatter back. No padding needed since CP splits context evenly across ranks. - PureDpRouter: tp == 1, ep == dp > 1. allgather across DP_AND_TP with -1 sentinel padding for unequal batch sizes, reduce_scatter back. Reuses the existing valid-mask guard in recompute_topk_ids_sum_expert_count. Supporting infrastructure: - MoeConfigResolver.is_cp_equal_ep(): physical-tp view (parallelism_config .tp_size == ep_size) used to detect pure CP topology even when adapter tp_size is squashed to 1 by get_attn_tp_size(). - collective_torch.reduce_scatter helper. - Document the -1 sentinel contract on recompute_topk_ids_sum_expert_count (slots not counted, output preserves -1, out-of-range collapses to -1) with a dedicated kernel test. - multiprocessing + NCCL closed-loop test for the allgather → partial → reduce_scatter cycle on tp4 and tp2_dp2 topologies.
Wire CudaFp8PerBlockPureCPStrategy and CudaFp8PerBlockPureDPStrategy into the factory and CLI: - Add fp8_per_block_pure_cp / fp8_per_block_pure_dp to --moe_strategy choices. - moe_strategy=auto picks PureCP for pure CP+EP (dp==1, tp==ep>1, prefill CP enabled) and PureDP for pure DP+EP (tp==1, ep==dp>1). - Mixed tp>1+dp>1 still falls back to DeepEP intentionally. Gate use_all_gather=True on pure-parallel topologies so other configurations route through DeepEP. can_handle unit tests cover both strategies' explicit/auto paths and all false branches (dp_gt_1, tp_ne_ep, cp_disabled, ep_ne_dp, no_all_gather, etc.) plus router_type / executor_type priority assertions.
- moe_pure_dp_fp8_dp2: Qwen3-30B-A3B FP8, tp=1 dp=2 ep=2, --use_deepep_moe 0 --use_all_gather 1, fp8_per_block_pure_dp - mla_pure_cp_pd: GLM-5 PD-separated; prefill switched to allgather+RS (fp8_per_block_pure_cp), decode unchanged Model paths use /mnt/nas1/hf/ to match existing OSS smoke convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- PureCP/PureDP strategies require explicit --moe_strategy=fp8_per_block_pure_cp/_dp; auto-config falls back to DeepEP to keep the default MoE path unchanged. - Preserve use_all_gather in auto_configure_deepep when --moe_strategy explicitly opts into PureCP/PureDP and the topology matches; otherwise the strategy's check_conditions (which requires use_all_gather) would never hold. - Drop implicit cross-call instance state in PureDpRouter; recover local batch size via extra_finalize_args["original_num_tokens"] (matches deepep_normal_router). - Fix triton ep_kernels build deps + non-blocking P2/P3 cleanups from review. - Add TODO at PureDpRouter._pad_to_max for hot-path .item() D2H sync follow-up.
…shapes to cktile Add dimension-aware kernel dispatch for gemm_a8w8_bpreshuffle: - K < 192: always cktile (small-K specialized kernel) - M >= 1536: cktile FlatmmKernel (large-M prefill, +4~23%) - M >= 512 AND N > 1536: cktile (large-N crossover at M~512, +22%) - otherwise: aiter default (decode-friendly, protects M<=256)
Cover threshold boundaries M=511/512/1535/1536 and N<=1536/N>1536 to verify both cktile and default paths produce correct results.
Fix Qwen3.5-397B FP8_PER_CHANNEL_COMPRESSED startup OOM by quantizing MoE expert weights to FP8 inline during fastsafetensors loading, inspired by sglang's load-time quantization approach (6d98b53). Three root-cause bugs fixed: 1. _build_stacked_key_config: scale template overwrites kernel template for the same checkpoint key, causing per-expert keys to mismatch collector target keys — collectors never complete 2. LoadQuantPerChannelFp8Weight.get_tensor_names: includes scale expert keys that don't exist in BF16 checkpoints — collectors never complete 3. _load_moe_inline_quant: missing transpose_stack_moe_w1 in allowed process_fun list — fused gate_up_proj skips inline quant path Key changes: - TensorCollector: add FP8 pre-quantization storage (store_fp8_quantized, has_prequantized_scale, get_scale) - loader._load_from_fastsafetensor: inline BF16->FP8 quantization per expert tensor, periodic empty_cache + gc.collect - loader._is_memory_enough_for_fastsafetensor: skip model_mem doubling when inline FP8 quantization is active - per_channel_fp8_quant_weight: _load_moe_inline_quant reuses pre-quantized FP8 data from collector without re-quantization Verified: 480 MoE weights (60 layers x 2 x 4 ranks) all prequantized=True, fastsafetensors 100% loaded, zero GPU/CPU OOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover the three root-cause bugs fixed in the PTPC OOM commit: - TensorCollector FP8 storage: store/load/clear/completion - _build_stacked_key_config: no-overwrite when kernel and scale share the same checkpoint key - per_channel_cast_to_fp8_expert: matches 3D stacked path semantics - _load_moe_inline_quant: transpose_stack_moe_w1 gate/up swap - get_tensor_names: excludes scale expert keys Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…te view copies
In Qwen3-Next GDN prefill, splitting mixed_qkv along the channel dim and
then .view(1, M, n_heads, head_dim) triggers an implicit .contiguous() copy
per q/k/v slice because the slice stride doesn't match the contig stride
required by the target shape. With 24 GDN layers this is ~72 direct_copy
kernel events per prefill.
This replaces the split + view chain with scatter_qkv (Triton kernel,
adapted from SGLang's scatter_fused_proj) which reads mixed_qkv contig
and writes 3 contig (1, M, n_heads, head_dim) buffers in one pass.
Threshold M >= 2048 to avoid kernel-launch overhead at small batches
(microbench: scatter wins from M=4096, neutral around M=2048, loses
~2us below). Decode path uses a different reshape+split that doesn't
trigger view copies, so this only applies to prefill (Prefill._fla and
GatedDeltaNet._forward_cp_prefill).
Measured on Qwen3.5-9B TP2, 15K prefill + 100 decode (n=10, unique
prompts to avoid REUSE_CACHE hits):
- TTFT: 1036.37 ms -> 1026.21 ms (-10.16 ms, -0.98%)
- direct_copy events per prefill: 107 -> 38 (-69 events, -3.50 ms)
- scatter_qkv adds 24 events / +1.97 ms; net copy time -1.53 ms
- TTFT savings exceed raw copy time savings, suggesting better cache
locality for downstream FLA chunk_gated_delta_rule
Equivalence tests cover:
- Qwen3-Next TP=2 production shape (M=15384, 8/16 heads, head_dim=128)
- Threshold boundary (M = 2047, 2048, 2049) -- the kernel must be correct
even though qwen3_next.py only invokes scatter_qkv at M>=2048
- Dtypes: bf16 (production), fp16, fp32
- Head configurations: Qwen3-Next TP=1/2/4 plus MQA-like and 1:1 ratio
- Small M (1, 16, 256, 1024) for chunked-prefill correctness
- Exact call-site mirror for Qwen3NextGatedDeltaNetPrefill._fla
- Negative cases: 3D input, non-contiguous input, wrong last-dim
scatter_qkv is pure data movement (no math), so equivalence is checked
bit-exact via torch.equal, not numerically close. All 7 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… path)
In Qwen3-Next GDN, in_proj_qkvz and in_proj_ba are two separate column-parallel
linears that take the SAME hidden_states input and produce non-overlapping
output ranges. Fusing them into a single in_proj_fused GEMM (cat along dim=1
of the weights) reads hidden_states only once and amortizes the GEMM launch.
The fusion is a pure compute-graph rearrangement:
out_qkvz = hidden_states @ qkvz_w (N=6144 on Qwen3.5-9B TP=2)
out_ba = hidden_states @ ba_w (N=32)
becomes:
fused_out = hidden_states @ cat([qkvz_w, ba_w], dim=1) (N=6176)
out_qkvz = fused_out[..., :6144]
out_ba = fused_out[..., 6144:]
Weight buffer sharing:
Rather than torch.cat (which would leave ~48MB redundant copies of qkvz_w
per layer, ~1.16GB across 24 GDN layers), allocate the fused buffer once
and copy qkvz/ba into it, THEN replace the dict entries with views into
the fused buffer. This achieves both:
(a) zero redundant weight memory (originals get released when init
returns; only fused_w stays resident), and
(b) online-weight-update correctness: WeightManager.update_layer_weight
runs `ori_tensor.copy_(data)` against the dict entry; with the entry
being a view into fused_w, the update writes directly into the right
slice of the fused buffer that in_proj_fused.weight references, so
the next forward sees the new weights. copy_ accepts non-contig
destinations, so the view's stride mismatch is fine.
A new _input_project helper hides the fusion vs 2-GEMM dispatch from forward
and any external caller (notably the existing CP linear-attn test), so the
fused branch (where in_proj_qkvz / in_proj_ba are None) doesn't break code
that was directly invoking those modules.
Decode (M=1) is HBM-bandwidth bound, so reading the hidden tensor once for
both projections matters. Trace shows the savings come from eliminating 24
small in_proj_ba kernel launches (~12us each) — CUDA Graph doesn't fully
amortize launch overhead.
Measured on Qwen3.5-9B TP=2, 15K prefill + 100 decode (n=10, unique prompts):
- TTFT: 1036.37 ms -> 1035.06 ms (-1.31 ms, noise)
- TPOT: 6.992 ms -> 6.775 ms (-0.217 ms, -3.1%)
- Total: 1728.6 ms -> 1705.8 ms (-22.8 ms / request)
Decode-bound workloads (long generation) get most of the win; short-output
workloads see ~0 net change. Independent of (and composes with) the
scatter_qkv prefill optimization on feat/gdn-scatter-qkv-triton.
FP8/quantized fallback: qkvz has scales but ba doesn't, dtype mismatch makes
direct cat impossible. The fallback keeps the original 2-GEMM path.
Tests added (rtp_llm/models_py/model_desc/test/qwen3_next_qkvz_ba_fusion_test.py):
- test_fused_slice_equals_separate_gemms: low-level math equivalence
cat([qkvz_w, ba_w]) @ x == [qkvz_w @ x | ba_w @ x] within bf16 tolerance.
- test_bf16_path_takes_fusion: verifies _qkvz_ba_fused=True and that the
single in_proj_fused module is constructed.
- test_quantized_path_falls_back_in_constructor: mocks LinearFactory to
bypass FP8 strategy lookup and asserts qkvz_s presence triggers the
2-GEMM fallback (in_proj_fused is None, both in_proj_qkvz and in_proj_ba
are constructed, factory is invoked with the scale key).
- test_input_project_helper_shapes: verifies the new helper API.
- test_dict_entries_are_views_into_fused_buffer: verifies qkvz / ba dict
entries share storage with in_proj_fused.weight (no redundant memory).
- test_online_weight_update_propagates_to_fused: simulates WeightManager's
in-place copy_() on the qkvz dict entry and verifies the fused buffer
sees the update — i.e. online weight update is preserved.
All six pass locally.
Existing CP linear-attn test (test_cp_linear_attn.py) updated to call the
new _input_project helper instead of in_proj_qkvz / in_proj_ba directly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ive prefetch - Extract _sanitize_and_pad_block_table() shared helper for both AiterPrefillAttnOp and AiterPrefillAttnOpPaged classes - Remove global .clamp() that silently masked invalid block ids; only padding/speculative columns are now filled with last-valid-block-id - Remove per-layer .item() GPU sync in _forward_paged hot path; use pre-computed fmha_params.max_seqlen_k directly - Add tokens_per_block and _block_positions to AiterPrefillAttnOpPaged and apply sanitize+pad in its forward() to prevent CK kernel OOB - Update test assertions to match new compact buffer size (unique+1 for trailing dummy block) and last-valid-block-id fill semantics Squashed from 6 original commits.
…filter QuantWeight.get_components() returns [self] without recursing into sub_weights. For FP8 quantized models, this caused _collect_ckpt_tensor_name_regexes to miss expert tensor patterns, leading to over-aggressive checkpoint file filtering (94 -> 2 files) and model loading failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AiterPrefillImplAsm/NonAsm's kv_cache=None shortcut bypassed RoPE, sending raw QKV straight into varlen FMHA. Models that rely on positional encoding (e.g. tbstars with rope_theta=10000) produced severely diverged outputs. Run rope_kvcache_impl first when need_rope_kv_cache=True, and have the C++ FusedRopeKVCachePrefillOp return padded K/V (instead of empty tensors) on the non-FP8 no-cache path so the existing _forward_varlen tuple path can consume them.
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/2 · P1/17 · P2/36 · P3/0 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
P1: - LRO response type mismatch / empty response handling in remote executor - Prevent EXIT_CODE in stdout from overriding non-OK REAPI status - BatchDecodeScheduler: treat NONE ReturnAllProbsMode as wildcard - Add rtp_llm.utils.ssrf_check and route HTTP downloads through it - DeepSeek VL2: use rtp_llm.ops.MMPreprocessConfig (9-arg) consistently - FlashInfer CUDA graph replay: avoid calling plan() on replay path - standalone/BUILD: drop deleted arch_select symbols and stale labels P2: - MoriEP router tests: register pytest gpu(type/count) markers - MultimodalInput: align field name with trans_mm_input, avoid mutable defaults - verify_smoke_suites: discover test_smoke_*.py and fail on empty match - perf_runner: fail when configured baseline file is missing or empty - ROCm fused_moe conftest: only ignore GPU-specific tests - CAS client: fail batch upload / ByteStream write on size/status errors
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/2 · P1/13 · P2/26 · P3/0 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
P0: - SSRF redirect validation: manual redirect follow with per-Location scheme/host/IP re-check and relative URL resolution. - BatchDecodeScheduler: partial-batch fallback when incompatible modes prevent filling batch_size_. P1: - FIFOScheduler: initialize batch return_all_probs mode from running streams and reject incompatible DEFAULT/ORIGINAL waiting streams. - multimodal_util: restore cache_size <= 0 disables cache; recreate LRU when disabled cache gets positive size. - triton_kernels/BUILD: restore py_library targets for common/fla/kimi_kda/ moe/causal_conv1d/sparse_mla. - verify_smoke_suites: AST-only SMOKE_CASES parsing (stdlib-only). - attn_factory: restore get_global_weight_or_none for RoPE-less MLA; validate attn_backend/disable_attn_backends names and add flashinfer alias for py_flashinfer. - case_runner: server_manager/remote_kvcm cleanup in finally with idempotent stop/log copy. - conftest/__init__: defer heavy torch/triton/ops imports during pytest collect-only and plugin discovery. - ConfigModules: default use_triton_pa=false to align ROCm defaults. - smoke_framework/runner: copy shared env list to every role in multi-role smoke cases. - comparer_registry: auto-register internal mainse comparers before OSS fallback. - arch_select.bzl: restore requirement/internal_deps/triton_deps shims. - validation: add 'eval' to known smoke markers.
- P0-1: pass layer_idx through Qwen3NextAttention to CausalAttention - P0-2: SSRF check validates redirects and pins connections to resolved IPs - P1-1: reset MoriEP singleton before destroying process groups - P1-2: CPU backend sends ready signal via local_rank_start - P1-3: reuse self.tokenizer.image_token_id in DeepSeek-VL2 - P1-4: cache URL bytes and return fresh BytesIO copies - P1-5: align local MM batch timeout semantics with remote path - P1-6/P1-7: fix ROCm filtered_probs scope and cum_log_probs distribution - P1-8: support 2-D dispatch_ids/weights in remap_local_ids - P1-9: disable TRT allreduce for unsupported world sizes - P1-10: avoid permanent stall in BatchDecodeScheduler mixed logprobs mode - P1-11: abort VIT RPC on empty embeddings - P1-12: choose Local/RemoteMultimodalProcessor by vit_separation/role/tp_rank
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/0 · P1/10 · P2/21 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
- P2-1: gate large BeamSearch test matrix behind env var for CI - P2-2: cache VIT role address in EmbeddingEndpoint with TTL refresh - P2-4: throttle gc.collect() by weight count threshold in FP8 loader - P2-7: replace assert with explicit ValueError in reduce_scatter - P2-8: replace assert with explicit RuntimeError in MoriEP topology checks - P2-9: clean up VIT proxy workers on startup exception - P2-10: use 'is not None' instead of truthiness for Tensor default - P2-11: align scatter_qkv test exception type with implementation - P2-12: fix profiler active count leak on creation failure - P2-16: use VitConfig.mm_timeout_ms for VIT proxy default timeout - P2-17: short-circuit GPU sync in CP prefill attention routing - P2-18: downgrade MoriEP router hot-path log from info to debug - P2-20: remove duplicate trans_input serialization in enqueue - P2-21: add per-dim boundary check and checked multiply in TensorPbConvert - P2-22: convert remote multimodal bad response to ErrorInfo - P2-23: precompute host-side max_seqlen for Qwen2VL visual attention - P2-24: optimize BatchDecodeScheduler stream removal with set + remove_if - P2-25: reuse output buffer in PureCpRouter reduce_scatter hot path
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/0 · P1/17 · P2/31 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
P1 fixes: - HTTPS SSRF: stop rewriting URL to IP; pin connection pool host instead, preserving TLS SNI/cert verification via server_hostname/assert_hostname - VIT role separation: VIT_SEPARATION_ROLE with non-VIT role now uses RemoteMultimodalProcessor instead of requiring local mm_process_engine - flashinfer alias: expand 'flashinfer' <-> 'py_flashinfer' in blocklist for both auto and explicit backend modes - legacy FMHA: consume enable_fmha (global switch) and enable_open_source_fmha in _is_fmha_impl_disabled_legacy - kimi_kda BUILD: add autotune_cache dependency - LRO error: classify recoverable infra failures (exit_code=-1 + infra_category) instead of treating all LRO errors as normal test failures - proto generation: add build_py/sdist custom commands to generate proto files before packaging - smoke runner: use per-test data_root to compute local REL_PATH instead of import-time captured value P2 fixes: - SSRF redirect: close response before following next hop to avoid FD leak Already fixed in prior commits (verified): - CudaSampleOp cum_log_probs sampling distribution - TensorPbConvert shape boundary/overflow check - BatchDecodeScheduler stream removal optimization
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/0 · P1/6 · P2/39 · P3/1 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
- BatchDecode: partial scheduling + 100ms flush timeout + set reserve - ROCm/CUDA sampling: clone probs, per-request seed, gather+log cum_log_probs - VIT: round-robin addrs, proxy timeout fallback, URL singleflight - MoriEP: shmem finalize, local_world_size, finalize debug log - Quant: MXFP4 stacked MoE keys, fp4_moe_op pickle, fake_balance dtype check - MM: input length/batch output validation, extra_input count check - Renderers: preprocess_config for llama3/kimi_k25, empty TensorPB shape/dtype - Qwen2-VL/Qwen3VL: max_seqlen once, CPU locs reuse - Kernels: FlyDSL sentinel cache, remap fp32 cast, PureDP RS buffer cache - BeamSearch: default large-beam boundary test
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/0 · P1/3 · P2/27 · P3/3 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
- multi_runner.sh: PID tracking + EXIT_CODE propagation for build/kill/copy/clean/test - Fix broken [ -z "$TP_SIZE" ] / [ -z "$MODEL_TYPE" ] spacing - .bazelrc: document cuda12 as shared base config; use cuda12_6/12_9/12_9_arm - pyproject.toml: document transformers 4.51.2 pin rationale - oss_optional_extras.toml: document aiter 0.1.13.dev14 pin rationale
- Restore rtp_llm/test/smoke/defs.bzl and thin BUILD wrapper for internal CI.
- test_gdn_block_prefill.py: run _test_one_case for bs > 1.
- BatchDecodeScheduler.h: schedule partial batch when flush timeout fires.
- ssrf_check.py: narrow ValueError catch so private-IP validation propagates.
- case_runner.py: OpenaiComparer get("query"), concurrency compare fix, LoRA or validation.
- multi_inst_case_runner.py: try/finally cleanup for Pd/Dp/Vit/FrontApp; null-safe stop.
- setup.py: move install_requires/extras computation under if __name__ == "__main__"; append retry logs.
- comparer_registry.py: narrow bare except to ImportError/ModuleNotFoundError.
- conftest.py: preserve explicit empty CUDA_VISIBLE_DEVICES pool.
- concurrency_limit_test.py: patch env vars so tearDown restores them.
- fp8_kernel.py: guard fp8_grouped_gemm_ptpc None with RuntimeError.
- kimi_k25_renderer.py: use part.preprocess_config, not image_url.preprocess_config.
- CudaSampleOp: conditional clone only when return_original_all_probs - grpc_util: preserve multi-D shape for empty tensors; graceful dtype fallback - multimodal_embedding: fix forward() type annotation for multimodal_locs - case_runner: raise Tau2BenchComparer priority; add as_completed timeout - utils.py: use common_def.REL_PATH dynamic reference instead of snapshot - normal_comparer: guard against empty chunks IndexError; detect aux_info mismatch - mm_process_engine: fix excessive indentation in for-loop body - deepgemm_wrapper: add ImportError to exception catch list - ops/__init__: catch TypeError when LIBDIR is None - maga_server_manager: replace mutable default args with None - server_args: move _env_mappings from class var to instance var - vit_rpc_server: add explicit return after context.abort() - ssrf_check: close response before raising on max redirects - conftest: log GPU cleanup exceptions; register atexit for faulthandler fd - mixed_fp4_quant_weight: assert stacked/per-expert weight consistency
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/3 · P1/10 · P2/40 · P3/13 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
P0: - server_args: fix EnvArgumentParser._env_mappings → self._env_mappings in _register_env_mapping, print_env_mappings, get_env_mappings P1: - normal_comparer: add QueryStatus.VISIT_FAILED to SmokeException - multi_inst_case_runner: init self.remote_kvcm_server = None in Pd/Dp runners - case_runner: add break on first concurrency inconsistency - grpc_util: preserve shape on unsupported empty tensor dtype (FP32 fallback) - BatchDecodeScheduler: conditional timeout (5s idle / 100ms busy); remove FINISHED zombie streams from waiting_streams_ - docs/README.md: inline docs deps, remove deleted requirements.txt ref P2: - multi_inst_case_runner: try/except in _stop_server_safe and cleanup - ssrf_check: use session context manager; init response=None - case_runner: assign results[0] on consistent concurrency; OpenaiComparer type-safe predicate with isinstance check - conftest: _fh.disable() before closing fault file - sparse_mla_decode_op_test: remove hardcoded sys.path - norm.py: FusedQKRMSNorm None flashinfer check in __init__ - docs/start/install.md: update source build instructions for pip wheel
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/2 · P1/8 · P2/40 · P3/2 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
P0: - rtp_llm/BUILD: restore 12 py_library stub targets (pip shims + testlib/sdk) required by smoke defs.bzl for Bazel analysis P1: - executor.py: _write_final_stream_files only overwrites when ByteStream not started or result data is longer - platform.py: fallback to nvcc --version when version.json missing, ultimate default cuda12_6 - remote_exec_rtp.py: guard prepare_venv.py with if-exists check - comparer_registry.py: fix mainse import path and class names (MainseDecodeArpcComparer / MainseEmbeddingArpcComparer) - generic_moe.py: move GroupTopK() and config attrs to __init__ - ssrf_check.py: re-raise ValueError in get_connection fallback for security parity with send()
P2: - cas_client: log.warning in download_blob on gRPC failure - cas_client: add committed_size check in _bytestream_write_file_parallel - endpoint_info: add threading.RLock to ExecutorEndpointPool - BatchDecodeScheduler: predicate checks !waiting_streams_.empty() for immediate first-request wakeup - trtllm_gen: assert kv_cache is not None in forward() - plugin: validate MAX_RETRIES >= 0 in _execute_with_retry - distributed_server_test: dedent 17 test methods from tearDown to class body level - runner: add ENABLE_STABLE_SCATTER_ADD=ON in multi-role path - OpenaiEndpoint: add comment explaining logprobs=false priority - warp_topk: upgrade to std::shared_mutex for occupancy cache - MultimodalProcessor: .to(kCPU).contiguous() avoids GPU temp alloc - MultimodalProcessor: replace FNV-1a with std::hash<string_view> - generic_moe: EP path uses fused sigmoid_gate_scale_add operator - ssrf_check: _validate_url before session creation P3: - multi_inst_case_runner: fix DpSeperation error message - runs_plugin: shallow copy keywords in _clone_item
- LocalRpcServer.cc: remove PyErr_Fetch leak, use e.what() directly - comparer_registry: split mainse imports into per-file modules - reranker_comparer: Exception → SmokeException(COMPARE_FAILED) - embedding_comparer: Exception → SmokeException(COMPARE_FAILED) - multi_inst_case_runner: assert GPU count before slicing gpu_ids - openai_comparer: add weights_only=False to torch.load - llava_renderer: validate crop_positions h/w != 0 before division
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/1 · P1/11 · P2/8 · P3/0 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
Checklist Violations (1 fail / 56 total)General Principles Checklist
Strengths
|
…ve_v2 # Conflicts: # deps/requirements_base.txt # deps/requirements_lock_cuda12_arm.txt # deps/requirements_lock_rocm.txt # deps/requirements_lock_torch_arm.txt # deps/requirements_lock_torch_cpu.txt # deps/requirements_lock_torch_gpu_cuda12.txt # deps/requirements_lock_torch_gpu_cuda12_9.txt # rtp_llm/BUILD # rtp_llm/test/BUILD # rtp_llm/test/generate_config_test.py # rtp_llm/test/smoke/BUILD # rtp_llm/test/smoke/suites_h20_oss.bzl
The outer PID/EXIT_CODE wait loop only reports a host as failed when its per-host subshell exits non-zero, but commands were joined with ';' so a failing non-final command was masked. - build/kill/clean/test: leading `scp` of the executor now `|| exit $?`, so a failed dispatch is no longer hidden by the trailing ssh. - copy: the ssh result is captured via $(...) which masks its exit code; check it explicitly, guard empty TEST_OUTPUT_PATH, and fail on the critical process.log / *Result.json scp. Trace files (normal_*) stay best-effort (|| true). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hells Address review: multi_copy_script continued after a failed scp. Adopt `set -e` at the top of every per-host subshell as the single, consistent error-propagation mechanism, replacing the scattered `|| exit $?` guards. - copy: a failed scp (or empty/failed remote exec) now aborts the host's subshell instead of running the remaining scp commands. - build/kill/clean/test: same `set -e` guard for consistency. - Trace files (normal_*) stay best-effort via `|| true`; empty TEST_OUTPUT_PATH is still guarded explicitly. multi_kill_script already had the PID/EXIT_CODE propagation pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI failed on the rocm config with: no such package '@pip_gpu_rocm_torch//flydsl' ... referenced by '//rtp_llm/models_py/triton_kernels:fla' fla/flydsl_chunk_gdn_mi308x*.py import the ROCm/MI308X-only `flydsl` package, which is not a pip dependency in the rocm lockfile. These modules are lazy-imported at runtime (fla/chunk.py) and shipped via setup.py, so import-based dependency resolution over :fla's srcs should not pull flydsl into the Bazel graph. Exclude them from the glob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
AI Code Review - PR #1108Status: BLOCKING Summary: P0/1 · P1/12 · P2/22 · P3/2 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Sync feature/python_native_v2 with the latest main branch. This merge brings all recent main-branch updates (multimodal refactor, new model support, VIT proxy server, etc.) into the python-native migration branch, resolving 37 merge conflicts.
Conflict Resolution Details
Modify/Delete Conflicts (27 files) - Followed feature branch deletions
All Bazel build artifacts were removed per the python-native migration:
Dependency lock files (deps/requirements_.txt): 9 files. Reason: Replaced by setup.py / pyproject.toml.
BUILD files: 13 files. Reason: Bazel test targets migrated to pytest.
.bzl test suites (suites_h20_oss.bzl, suites_rocm_oss.bzl): 2 files. Reason: Migrated to test_smoke_.py.
.pyi type stubs: 4 files. Reason: No longer needed in python-native mode.
Content Conflicts (10 files) - Manually merged
.bazelrc: Merged main's conda/java paths with feature's updated cache_bust date.
arch_config/arch_select.bzl: Adopted feature branch: removed whl_deps/platform_deps/torch_deps.
deps/http.bzl: Adopted feature branch: removed torch/aiter http_archive definitions.
rtp_llm/BUILD: Adopted feature branch: minimal exports_files only.
rtp_llm/cpp/config/ConfigModules.h: Both sides merged: feature's string-based attention API + main's enable_paged_open_source_fmha, enable_trtv1_fmha, use_triton_pa=false (with regression comment).
rtp_llm/multimodal/multimodal_util.py: Kept main's refactored class imports + trans_config/trans_mm_input; fixed SSRF import path.
rtp_llm/tools/api/model_basic_info_analyzer.py: Adopted feature branch: profiling_debug_logging_config enhancement.
test_py_flashinfer_mha_decode.py: Kept main branch: feature's run_bs variable was undefined.
models_py/triton_kernels/BUILD (2 files): Adopted feature branch: minimal visibility declaration.
Main-Branch Features Included
Multimodal refactor: rtp_llm/multimodal/multimodal_mixins/ directory restructure (ChatGLM4V, DeepSeek-VL2, Kimi-K25, LLaVA, Qwen2-VL, Qwen2-Audio, Qwen3-VL, Qwen3.5-MoE).
VIT Proxy Server: vit_app.py, vit_proxy_server.py, vit_rpc_server.py.
New models: Qwen3-VL, Qwen3-VL-MoE, Qwen3.5-MoE-VL.
Beam Search: bfloat16/ROCm support, efficient topk.
FMHA: enable_paged_open_source_fmha, enable_trtv1_fmha, Triton PA default-off.
Items Requiring Follow-up
The following main-branch smoke test cases were defined in .bzl files (now deleted) and not yet ported to pytest:
mla_pure_cp_pd, moe_pure_dp_fp8_dp2 (new MLA/MoE configs)
kimi_tool_call (Kimi linear model)
qwen3_vl, qwen3_vl_moe, qwen35_moe_vl_fp8 (VL multimodal suite)
rocm_basic_beam_search_tp2, rocm_dense_qwen3_8b_ptpc_no_asm_pa
These will need manual migration to the pytest smoke framework in a follow-up commit.
Testing
No conflict markers remain (grep verified)
CICD pending