feat: merge python_native_v2 into main with review fixes #1121
baohengyi wants to merge 41 commits into
- Move OSS build/package metadata into pyproject and setup.py.
- Carry forward overlay platform detection and dependency/package-data fixes.
- Keep Bazel/deps metadata aligned with the pyproject migration.

- Use string-based attention backend selection.
- Avoid platform import cycles during python-native bootstrap.
- Keep model, op, and server-arg imports safe across CUDA, ROCm, and OSS runtimes.

- Port OSS unit test entrypoints from Bazel wrappers to pytest profile execution.
- Add shared pytest profile support and platform skips.
- Preserve existing unit coverage under the new profile layout.

- Move smoke and perf suites to pytest-based profiles.
- Restore smoke runtime behavior for ROCm and SM100 profiles.
- Update SM100 golden data and related platform-specific test coverage.

- Add remote pytest execution over REAPI with CAS inputs and output collection.
- Harden executor failover, timeout budgeting, and cleanup.
- Include perf data inputs for per-test remote execution.

- Keep the OSS history aligned with the internal perf-profile CI rollout.
Sync feature/python_native_v2 with latest main branch updates.
Conflict resolution (all adopted upstream/feature version):
- .bazelrc: kept upstream cache_bust date
- arch_select.bzl, http.bzl: kept upstream (bazel dep functions removed)
- ConfigModules.h: kept upstream (no enable_paged_open_source_fmha/enable_trtv1_fmha)
- ConfigInit.cc: auto-merged (upstream removed registerMultimodal/QuarkMXFP4)
- test_py_flashinfer_mha_decode.py: kept upstream (run_bs fix)
- multimodal_util.py: kept upstream (class definitions inline)
- model_basic_info_analyzer.py: kept upstream (profiling_debug_logging_config)
- BUILD files: kept upstream (minimal per python-native migration)
- Deleted 26 files (BUILD/bzl/pyi/lock) per python-native migration
P0:
- Restore trans_mm_input/trans_config in multimodal_util.py (mm_process_engine import crash fix)
P1:
- Remove invalid 'from smoke.*' imports in multi_inst_case_runner.py
- Convert VitParameters to dataclass with field(default_factory=dict)
- Fix FLA initial_state direction: .contiguous() instead of .transpose().contiguous()
- Add default MMPreprocessConfig in qwen_vl_renderer/llava_renderer when missing
- Bind VIT worker to specific GPU via torch.cuda.set_device
- Serialize pool access in mm_process_engine.submit to fix race condition
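To illustrate why the VitParameters item uses `field(default_factory=dict)`, here is a minimal sketch. The class name `VitParametersSketch` and both field names are hypothetical; the real fields are not shown in this conversation. The point is only that each instance gets its own dict instead of sharing one mutable default.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class VitParametersSketch:
    # Hypothetical stand-in for VitParameters; field names are assumptions.
    config: Dict[str, Any] = field(default_factory=dict)
    special_token_ids: Dict[str, int] = field(default_factory=dict)


a = VitParametersSketch()
b = VitParametersSketch()
a.config["image_size"] = 448

# default_factory builds a new dict per instance, so b is not affected.
print(a.config, b.config)
```

Writing `config: dict = {}` directly in a dataclass is rejected with a ValueError at class creation time, which is why the factory form is needed.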
- FusedRopeKVCacheOp: remove unconditional Mrope reject in prefill/decode ctors (moved to runtime check)
- CudaSampleOp: fix top_k=1 fast path skipping cum_log_probs update
- CudaSampleOp: fix ROCm cum_log_probs using wrong tensor shape (use log_softmax + gather instead of raw probs.log())
- FLA chunk: add zero-length sequence validation for FlyDSL path
P0-2: Multimodal feature injection for DeepSeek-VL2/KimiK25/QWenV2Audio
- DeepSeek-VL2: switch from GenericMoeModel to MultimodalGenericModel so visual features are injected into text embeddings
- DeepSeek-VL2: use tokenizer_path (fallback to ckpt_path) for AutoTokenizer
- KimiK25: override _create_python_model to use MultimodalGenericModel instead of inheriting DeepSeekV2's text-only GenericMoeModel
- QWenV2Audio: wrap Qwen2MtpModel with multimodal embedding injector so audio features are no longer silently dropped
P1: RemoteMultimodalProcessor gRPC deadline
- Set deadline from max(mm_timeout_ms) across all mm_inputs
- Remove bad connection on RPC failure to avoid reusing stuck VIT workers
P1: Empty TensorPB deserialization (grpc_util.py)
- Handle default-constructed TensorPB (data_type==0) by returning torch.empty(0) instead of raising 'unknown error type'
P1: FlyDSL 0-length sequence validation
- Add explicit check in megakernel_fwd for cu_seqlens with zero-length sequences; raise ValueError instead of silent underflow
P1: CUDA Graph MRoPE position_ids on CPU
- Use options_cuda_int32_ instead of options_cpu_int32_ + pin_memory for combo_position_ids capture buffer
P1: generic_moe TP-only all_reduce
- Document optimization opportunity: when fused_moe supports skip_allreduce, switch to unified TP all_reduce on combined output
P1: StreamGroups.h logprobs mixing
- Existing one-shot warning + CORRECTNESS RISK comment is sufficient mitigation; full fix requires scheduler partitioning by ReturnAllProbsMode
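The zero-length sequence check on cu_seqlens can be sketched as below. This is an illustration on a plain Python list, not the megakernel_fwd code; the function name `validate_cu_seqlens` and the error messages are assumptions.

```python
from typing import Sequence


def validate_cu_seqlens(cu_seqlens: Sequence[int]) -> None:
    """Raise ValueError if any sequence described by cu_seqlens has length <= 0.

    cu_seqlens holds cumulative offsets: sequence i spans
    [cu_seqlens[i], cu_seqlens[i + 1]).
    """
    if len(cu_seqlens) < 2 or cu_seqlens[0] != 0:
        raise ValueError("cu_seqlens must start at 0 and describe at least one sequence")
    for i in range(1, len(cu_seqlens)):
        length = cu_seqlens[i] - cu_seqlens[i - 1]
        if length <= 0:
            # Failing here avoids index underflow further down the kernel path.
            raise ValueError(f"sequence {i - 1} has non-positive length {length}")


validate_cu_seqlens([0, 3, 5])  # two sequences of length 3 and 2: accepted
```

Raising early on the host keeps the failure visible, rather than letting a zero-length entry turn into an out-of-range index later.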
P1 (blocking fixes):
- ReturnAllProbsMode scheduling bucketing in BatchDecodeScheduler/FIFOScheduler
- StreamGroups needReturnAllProbs documentation update
- OpenaiEndpoint/TensorPbConvert/PyWrappedModel multimodal tensor safety
- RemoteMultimodalProcessor constructor alignment
- CudaSampleOp top1 logprob via logsumexp to avoid full log_probs materialization
- FusedRopeKVCacheOp MRoPE position_ids validation
- multimodal_embedding deepstack length/dim/divisibility validation
- RtpEmbeddingLookup optional int tensor dtype/device/shape validation
- mm_profiler session snapshot and active-request synchronization
- scatter_qkv assert -> ValueError
- remap_local_ids_kernel shape/device/contiguous/dtype guards
- mm_process_engine extra_input handling fixes
- qwen_v2_audio/deepseek_vl2/qwen3_next/mori_ep_intranode_router/fla fixes
P2 (non-blocking suggestions):
- MultimodalProcessor.cc: GPU per-row statistical hash instead of full embedding D2H
- warp_topk.hpp + hip_utils.hpp: cache occupancy launch params; stack/fixed workspace
- loader.py: threshold-based CUDA empty_cache (only when reserved > 85%)
- per_channel_fp8_quant_weight.py: half-size temp buffer for w1 gate/up reorder
- moriep_wrapper.py: validate expert_num % ep_size == 0
- multimodal_util.py: list branch MultimodalInput type guard
- collective_torch.py: allow caller-provided output_tensor in reduce_scatter
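The "top1 logprob via logsumexp" item relies on the identity log softmax(x)[argmax] = max(x) - logsumexp(x), so the full log-probability vector never has to be built. Below is a minimal pure-Python sketch of that identity. It is not the CudaSampleOp implementation, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def top1_logprob(logits: Sequence[float]) -> float:
    """Log-probability of the greedy (top-1) token, without building log_softmax."""
    m = max(logits)
    # Numerically stable logsumexp: subtract the max before exponentiating.
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return m - lse


def full_log_softmax(logits: Sequence[float]) -> List[float]:
    """Reference path that materializes every log-probability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


logits = [2.0, -1.0, 0.5, 2.5]
print(top1_logprob(logits), max(full_log_softmax(logits)))
```

Both paths agree on the top-1 value; the first only needs one reduction over the logits.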
a9d8112 to a468563
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/0 · P3/1
Blocking Issues
P1
Non-blocking Suggestions
P3
Checklist ✅ (56 items passed)
Strengths
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/2 · P1/18 · P2/0 · P3/0
Blocking Issues
P0
P1
Checklist Violations (6 fail / 56 total)
General Principles Checklist
Python Static-First Checklist
Strengths
P0:
- Restore full row byte-content FNV-1a hash for multimodal feature cache keys (replaces low-dimensional statistics that caused silent KV-cache reuse)
- Fail-closed when ssrf_check module is missing instead of falling back to bare requests.get
P1:
- Restore CP filtering and parallelism_config in MLA get_mla_impl
- Restore captured sequence_lengths reference and in-place copy in XQA CUDA graph
- Register MI308X_ROCM7 pytest marker
- Restore headwise_config assignment to attn_inputs
- Restore MoeConfig::use_mori_ep member
- BatchDecodeScheduler: group streams by ReturnAllProbsMode to avoid starvation
- Call TorchSymmMemCommunicator.close() before destroy_process_group
- Add mutex to protect ROCm TopK occupancy cache
- Restore 3rdparty/six/six.BUILD
- mm_profiler: move yield outside lock; wait for active profiles before clearing
- Fix DeepSeek VL2 default preprocess config to use MMPreprocessConfig(-1,...)
- Use minimum positive mm_timeout_ms across batch instead of maximum
- Treat non-OK REAPI Execute response status as test failure
- Rename test_single to run_single to avoid pytest collection
- Restore test_util py_library Bazel target
- Forbid fallback comparer for mainse-flagged cases
- Wrap remote_tests proto imports with clear error message
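For readers unfamiliar with FNV-1a, here is a minimal 64-bit sketch over raw row bytes. It uses the standard FNV-1a offset basis and prime; packing the row with `struct` is an assumption made for the example and is not how MultimodalProcessor.cc lays out embeddings.

```python
import struct

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime (mod 2**64)."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


# Two rows with identical mean/min/max but different content.
row_a = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
row_b = struct.pack("<4f", 4.0, 3.0, 2.0, 1.0)
print(hex(fnv1a_64(row_a)), hex(fnv1a_64(row_b)))
```

The two rows would collide under a key built only from summary statistics, which is the failure mode the P0 item describes; hashing the full byte content distinguishes them.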
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/8 · P2/7 · P3/0
Blocking Issues
P1
Non-blocking Suggestions
P2
Checklist Violations (12 fail / 56 total)
General Principles Checklist
Python Static-First Checklist
Strengths
P1:
- LRO response type mismatch / empty response handling in remote executor
- Prevent EXIT_CODE in stdout from overriding non-OK REAPI status
- BatchDecodeScheduler: treat NONE ReturnAllProbsMode as wildcard
- Add rtp_llm.utils.ssrf_check and route HTTP downloads through it
- DeepSeek VL2: use rtp_llm.ops.MMPreprocessConfig (9-arg) consistently
- FlashInfer CUDA graph replay: avoid calling plan() on replay path
- standalone/BUILD: drop deleted arch_select symbols and stale labels
P2:
- MoriEP router tests: register pytest gpu(type/count) markers
- MultimodalInput: align field name with trans_mm_input, avoid mutable defaults
- verify_smoke_suites: discover test_smoke_*.py and fail on empty match
- perf_runner: fail when configured baseline file is missing or empty
- ROCm fused_moe conftest: only ignore GPU-specific tests
- CAS client: fail batch upload / ByteStream write on size/status errors
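The "NONE as wildcard" rule can be pictured with the sketch below. The real scheduler is C++; this Python model, the enum values, and the `select_batch` helper are assumptions made for illustration. Only the mode names NONE, DEFAULT, and ORIGINAL come from the commit messages in this conversation.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class ReturnAllProbsMode(Enum):
    NONE = 0      # stream does not request all-probs output: compatible with any batch
    DEFAULT = 1
    ORIGINAL = 2


def select_batch(
    modes: Sequence[ReturnAllProbsMode], batch_size: int
) -> Tuple[List[int], ReturnAllProbsMode]:
    """Pick up to batch_size stream indices whose modes can share one batch."""
    batch_mode = ReturnAllProbsMode.NONE
    picked: List[int] = []
    for idx, mode in enumerate(modes):
        if len(picked) == batch_size:
            break
        compatible = (
            mode is ReturnAllProbsMode.NONE
            or batch_mode is ReturnAllProbsMode.NONE
            or mode is batch_mode
        )
        if compatible:
            picked.append(idx)
            if mode is not ReturnAllProbsMode.NONE:
                batch_mode = mode  # first concrete mode fixes the batch's mode
    return picked, batch_mode


M = ReturnAllProbsMode
picked, mode = select_batch([M.NONE, M.DEFAULT, M.ORIGINAL, M.NONE, M.DEFAULT], 4)
print(picked, mode)
```

In this model the ORIGINAL stream is left waiting for a later batch instead of being mixed with DEFAULT streams, while NONE streams never block either group.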
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/1 · P1/15 · P2/0 · P3/0
Blocking Issues
P0
P1
Checklist ✅ (56 items passed)
Strengths
P0:
- SSRF redirect validation: manual redirect follow with per-Location scheme/host/IP re-check and relative URL resolution.
- BatchDecodeScheduler: partial-batch fallback when incompatible modes prevent filling batch_size_.
P1:
- FIFOScheduler: initialize batch return_all_probs mode from running streams and reject incompatible DEFAULT/ORIGINAL waiting streams.
- multimodal_util: restore cache_size <= 0 disables cache; recreate LRU when disabled cache gets positive size.
- triton_kernels/BUILD: restore py_library targets for common/fla/kimi_kda/moe/causal_conv1d/sparse_mla.
- verify_smoke_suites: AST-only SMOKE_CASES parsing (stdlib-only).
- attn_factory: restore get_global_weight_or_none for RoPE-less MLA; validate attn_backend/disable_attn_backends names and add flashinfer alias for py_flashinfer.
- case_runner: server_manager/remote_kvcm cleanup in finally with idempotent stop/log copy.
- conftest/__init__: defer heavy torch/triton/ops imports during pytest collect-only and plugin discovery.
- ConfigModules: default use_triton_pa=false to align ROCm defaults.
- smoke_framework/runner: copy shared env list to every role in multi-role smoke cases.
- comparer_registry: auto-register internal mainse comparers before OSS fallback.
- arch_select.bzl: restore requirement/internal_deps/triton_deps shims.
- validation: add 'eval' to known smoke markers.
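The per-hop redirect re-check from the SSRF item can be sketched as follows. This is not the ssrf_check.py code: `check_url`, `next_hop`, and the injected `resolve` callable are hypothetical names, and DNS lookup is passed in so the example runs without network access.

```python
import ipaddress
from typing import Callable, Dict, List
from urllib.parse import urljoin, urlsplit

Resolver = Callable[[str], List[str]]  # hostname -> list of IP strings (assumed interface)


def check_url(url: str, resolve: Resolver) -> str:
    """Validate scheme, host, and every resolved IP; return the URL if allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    for addr in resolve(parts.hostname):
        ip = ipaddress.ip_address(addr)
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_unspecified or ip.is_reserved):
            raise ValueError(f"blocked address {ip} for host {parts.hostname!r}")
    return url


def next_hop(current_url: str, location: str, resolve: Resolver) -> str:
    """Resolve a (possibly relative) Location header, then re-validate it."""
    return check_url(urljoin(current_url, location), resolve)


# Fake DNS table for the example; unknown hosts are treated as IP literals.
TABLE: Dict[str, List[str]] = {"example.com": ["93.184.216.34"], "internal": ["10.0.0.5"]}


def fake_resolve(host: str) -> List[str]:
    return TABLE.get(host, [host])


print(next_hop("https://example.com/a/b", "../c", fake_resolve))
```

A caller would disable automatic redirects in its HTTP client and call `next_hop` for each 3xx response, so a public URL cannot bounce the request to an internal address.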
- P0-1: pass layer_idx through Qwen3NextAttention to CausalAttention
- P0-2: SSRF check validates redirects and pins connections to resolved IPs
- P1-1: reset MoriEP singleton before destroying process groups
- P1-2: CPU backend sends ready signal via local_rank_start
- P1-3: reuse self.tokenizer.image_token_id in DeepSeek-VL2
- P1-4: cache URL bytes and return fresh BytesIO copies
- P1-5: align local MM batch timeout semantics with remote path
- P1-6/P1-7: fix ROCm filtered_probs scope and cum_log_probs distribution
- P1-8: support 2-D dispatch_ids/weights in remap_local_ids
- P1-9: disable TRT allreduce for unsupported world sizes
- P1-10: avoid permanent stall in BatchDecodeScheduler mixed logprobs mode
- P1-11: abort VIT RPC on empty embeddings
- P1-12: choose Local/RemoteMultimodalProcessor by vit_separation/role/tp_rank
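Item P1-4 (cache URL bytes, return fresh BytesIO copies) addresses a common pitfall: a cached `BytesIO` has a shared read position, so a second reader sees an exhausted stream. A minimal sketch follows, assuming a hypothetical `UrlBytesCache` class and an injected `fetch` callable; neither name comes from the PR.

```python
import io
from typing import Callable, Dict


class UrlBytesCache:
    """Cache immutable bytes per URL and hand out an independent stream per call."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch              # hypothetical downloader, injected for testing
        self._data: Dict[str, bytes] = {}

    def open(self, url: str) -> io.BytesIO:
        if url not in self._data:
            self._data[url] = self._fetch(url)
        # A new BytesIO starts at position 0 and does not share state with others.
        return io.BytesIO(self._data[url])


calls = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"image-bytes"


cache = UrlBytesCache(fake_fetch)
first = cache.open("http://example.com/a.png")
second = cache.open("http://example.com/a.png")
print(first.read(), second.read(), len(calls))
```

This sketch is not thread-safe and has no eviction; it only shows the bytes-versus-stream distinction.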
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/10 · P2/34 · P3/0
Blocking Issues
P1
Non-blocking Suggestions
P2
Checklist ✅ (56 items passed)
Strengths
- P2-1: gate large BeamSearch test matrix behind env var for CI
- P2-2: cache VIT role address in EmbeddingEndpoint with TTL refresh
- P2-4: throttle gc.collect() by weight count threshold in FP8 loader
- P2-7: replace assert with explicit ValueError in reduce_scatter
- P2-8: replace assert with explicit RuntimeError in MoriEP topology checks
- P2-9: clean up VIT proxy workers on startup exception
- P2-10: use 'is not None' instead of truthiness for Tensor default
- P2-11: align scatter_qkv test exception type with implementation
- P2-12: fix profiler active count leak on creation failure
- P2-16: use VitConfig.mm_timeout_ms for VIT proxy default timeout
- P2-17: short-circuit GPU sync in CP prefill attention routing
- P2-18: downgrade MoriEP router hot-path log from info to debug
- P2-20: remove duplicate trans_input serialization in enqueue
- P2-21: add per-dim boundary check and checked multiply in TensorPbConvert
- P2-22: convert remote multimodal bad response to ErrorInfo
- P2-23: precompute host-side max_seqlen for Qwen2VL visual attention
- P2-24: optimize BatchDecodeScheduler stream removal with set + remove_if
- P2-25: reuse output buffer in PureCpRouter reduce_scatter hot path
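Item P2-21 (per-dim boundary check and checked multiply) guards against a shape from the wire producing an overflowing element count. The real code is C++ in TensorPbConvert; the Python sketch below only models the check, and `checked_numel` and its limits are assumed names and values.

```python
from typing import Sequence

INT64_MAX = 2**63 - 1


def checked_numel(
    shape: Sequence[int], max_dim: int = INT64_MAX, max_numel: int = INT64_MAX
) -> int:
    """Return the product of shape, rejecting bad dims and products above max_numel."""
    numel = 1
    for i, dim in enumerate(shape):
        if dim < 0 or dim > max_dim:
            raise ValueError(f"dim {i} out of range: {dim}")
        # Division-based test: detects overflow before the multiply happens,
        # which is the form a fixed-width C++ implementation would need.
        if dim != 0 and numel > max_numel // dim:
            raise OverflowError(f"element count overflows at dim {i}")
        numel *= dim
    return numel


print(checked_numel([2, 3, 4]))
```

Python integers do not overflow, so the division test is redundant here; it is written that way to mirror what a 64-bit implementation has to do.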
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/11 · P2/36 · P3/1
Blocking Issues
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
P1 fixes:
- HTTPS SSRF: stop rewriting URL to IP; pin connection pool host instead, preserving TLS SNI/cert verification via server_hostname/assert_hostname
- VIT role separation: VIT_SEPARATION_ROLE with non-VIT role now uses RemoteMultimodalProcessor instead of requiring local mm_process_engine
- flashinfer alias: expand 'flashinfer' <-> 'py_flashinfer' in blocklist for both auto and explicit backend modes
- legacy FMHA: consume enable_fmha (global switch) and enable_open_source_fmha in _is_fmha_impl_disabled_legacy
- kimi_kda BUILD: add autotune_cache dependency
- LRO error: classify recoverable infra failures (exit_code=-1 + infra_category) instead of treating all LRO errors as normal test failures
- proto generation: add build_py/sdist custom commands to generate proto files before packaging
- smoke runner: use per-test data_root to compute local REL_PATH instead of import-time captured value
P2 fixes:
- SSRF redirect: close response before following next hop to avoid FD leak
Already fixed in prior commits (verified):
- CudaSampleOp cum_log_probs sampling distribution
- TensorPbConvert shape boundary/overflow check
- BatchDecodeScheduler stream removal optimization
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/8 · P2/50 · P3/10
Blocking Issues
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
- BatchDecode: partial scheduling + 100ms flush timeout + set reserve
- ROCm/CUDA sampling: clone probs, per-request seed, gather+log cum_log_probs
- VIT: round-robin addrs, proxy timeout fallback, URL singleflight
- MoriEP: shmem finalize, local_world_size, finalize debug log
- Quant: MXFP4 stacked MoE keys, fp4_moe_op pickle, fake_balance dtype check
- MM: input length/batch output validation, extra_input count check
- Renderers: preprocess_config for llama3/kimi_k25, empty TensorPB shape/dtype
- Qwen2-VL/Qwen3VL: max_seqlen once, CPU locs reuse
- Kernels: FlyDSL sentinel cache, remap fp32 cast, PureDP RS buffer cache
- BeamSearch: default large-beam boundary test
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/2 · P1/15 · P2/22 · P3/0
Blocking Issues
P0
P1
Non-blocking Suggestions
P2
Checklist ✅ (56 items passed)
Strengths
- multi_runner.sh: PID tracking + EXIT_CODE propagation for build/kill/copy/clean/test
- Fix broken [ -z "$TP_SIZE" ] / [ -z "$MODEL_TYPE" ] spacing
- .bazelrc: document cuda12 as shared base config; use cuda12_6/12_9/12_9_arm
- pyproject.toml: document transformers 4.51.2 pin rationale
- oss_optional_extras.toml: document aiter 0.1.13.dev14 pin rationale
- Restore rtp_llm/test/smoke/defs.bzl and thin BUILD wrapper for internal CI.
- test_gdn_block_prefill.py: run _test_one_case for bs > 1.
- BatchDecodeScheduler.h: schedule partial batch when flush timeout fires.
- ssrf_check.py: narrow ValueError catch so private-IP validation propagates.
- case_runner.py: OpenaiComparer get("query"), concurrency compare fix, LoRA or validation.
- multi_inst_case_runner.py: try/finally cleanup for Pd/Dp/Vit/FrontApp; null-safe stop.
- setup.py: move install_requires/extras computation under if __name__ == "__main__"; append retry logs.
- comparer_registry.py: narrow bare except to ImportError/ModuleNotFoundError.
- conftest.py: preserve explicit empty CUDA_VISIBLE_DEVICES pool.
- concurrency_limit_test.py: patch env vars so tearDown restores them.
- fp8_kernel.py: guard fp8_grouped_gemm_ptpc None with RuntimeError.
- kimi_k25_renderer.py: use part.preprocess_config, not image_url.preprocess_config.
- CudaSampleOp: conditional clone only when return_original_all_probs
- grpc_util: preserve multi-D shape for empty tensors; graceful dtype fallback
- multimodal_embedding: fix forward() type annotation for multimodal_locs
- case_runner: raise Tau2BenchComparer priority; add as_completed timeout
- utils.py: use common_def.REL_PATH dynamic reference instead of snapshot
- normal_comparer: guard against empty chunks IndexError; detect aux_info mismatch
- mm_process_engine: fix excessive indentation in for-loop body
- deepgemm_wrapper: add ImportError to exception catch list
- ops/__init__: catch TypeError when LIBDIR is None
- maga_server_manager: replace mutable default args with None
- server_args: move _env_mappings from class var to instance var
- vit_rpc_server: add explicit return after context.abort()
- ssrf_check: close response before raising on max redirects
- conftest: log GPU cleanup exceptions; register atexit for faulthandler fd
- mixed_fp4_quant_weight: assert stacked/per-expert weight consistency
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/1 · P1/8 · P2/31 · P3/8
Blocking Issues
P0
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
P0:
- server_args: fix EnvArgumentParser._env_mappings → self._env_mappings in _register_env_mapping, print_env_mappings, get_env_mappings
P1:
- normal_comparer: add QueryStatus.VISIT_FAILED to SmokeException
- multi_inst_case_runner: init self.remote_kvcm_server = None in Pd/Dp runners
- case_runner: add break on first concurrency inconsistency
- grpc_util: preserve shape on unsupported empty tensor dtype (FP32 fallback)
- BatchDecodeScheduler: conditional timeout (5s idle / 100ms busy); remove FINISHED zombie streams from waiting_streams_
- docs/README.md: inline docs deps, remove deleted requirements.txt ref
P2:
- multi_inst_case_runner: try/except in _stop_server_safe and cleanup
- ssrf_check: use session context manager; init response=None
- case_runner: assign results[0] on consistent concurrency; OpenaiComparer type-safe predicate with isinstance check
- conftest: _fh.disable() before closing fault file
- sparse_mla_decode_op_test: remove hardcoded sys.path
- norm.py: FusedQKRMSNorm None flashinfer check in __init__
- docs/start/install.md: update source build instructions for pip wheel
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/1 · P1/6 · P2/14 · P3/2
Blocking Issues
P0
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
P0:
- rtp_llm/BUILD: restore 12 py_library stub targets (pip shims + testlib/sdk) required by smoke defs.bzl for Bazel analysis
P1:
- executor.py: _write_final_stream_files only overwrites when ByteStream not started or result data is longer
- platform.py: fallback to nvcc --version when version.json missing, ultimate default cuda12_6
- remote_exec_rtp.py: guard prepare_venv.py with if-exists check
- comparer_registry.py: fix mainse import path and class names (MainseDecodeArpcComparer / MainseEmbeddingArpcComparer)
- generic_moe.py: move GroupTopK() and config attrs to __init__
- ssrf_check.py: re-raise ValueError in get_connection fallback for security parity with send()
P2:
- cas_client: log.warning in download_blob on gRPC failure
- cas_client: add committed_size check in _bytestream_write_file_parallel
- endpoint_info: add threading.RLock to ExecutorEndpointPool
- BatchDecodeScheduler: predicate checks !waiting_streams_.empty() for immediate first-request wakeup
- trtllm_gen: assert kv_cache is not None in forward()
- plugin: validate MAX_RETRIES >= 0 in _execute_with_retry
- distributed_server_test: dedent 17 test methods from tearDown to class body level
- runner: add ENABLE_STABLE_SCATTER_ADD=ON in multi-role path
- OpenaiEndpoint: add comment explaining logprobs=false priority
- warp_topk: upgrade to std::shared_mutex for occupancy cache
- MultimodalProcessor: .to(kCPU).contiguous() avoids GPU temp alloc
- MultimodalProcessor: replace FNV-1a with std::hash<string_view>
- generic_moe: EP path uses fused sigmoid_gate_scale_add operator
- ssrf_check: _validate_url before session creation
P3:
- multi_inst_case_runner: fix DpSeperation error message
- runs_plugin: shallow copy keywords in _clone_item
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/7 · P2/44 · P3/11
Blocking Issues
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
- LocalRpcServer.cc: remove PyErr_Fetch leak, use e.what() directly
- comparer_registry: split mainse imports into per-file modules
- reranker_comparer: Exception → SmokeException(COMPARE_FAILED)
- embedding_comparer: Exception → SmokeException(COMPARE_FAILED)
- multi_inst_case_runner: assert GPU count before slicing gpu_ids
- openai_comparer: add weights_only=False to torch.load
- llava_renderer: validate crop_positions h/w != 0 before division
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/2 · P1/6 · P2/13 · P3/3
Blocking Issues
P0
P1
Non-blocking Suggestions
P2
P3
Checklist Violations (5 fail / 56 total)
General Principles Checklist
Python Static-First Checklist
Strengths
…ve_v2
# Conflicts:
# deps/requirements_base.txt
# deps/requirements_lock_cuda12_arm.txt
# deps/requirements_lock_rocm.txt
# deps/requirements_lock_torch_arm.txt
# deps/requirements_lock_torch_cpu.txt
# deps/requirements_lock_torch_gpu_cuda12.txt
# deps/requirements_lock_torch_gpu_cuda12_9.txt
# rtp_llm/BUILD
# rtp_llm/test/BUILD
# rtp_llm/test/generate_config_test.py
# rtp_llm/test/smoke/BUILD
# rtp_llm/test/smoke/suites_h20_oss.bzl
The outer PID/EXIT_CODE wait loop only reports a host as failed when its per-host subshell exits non-zero, but commands were joined with ';' so a failing non-final command was masked.
- build/kill/clean/test: leading `scp` of the executor now `|| exit $?`, so a failed dispatch is no longer hidden by the trailing ssh.
- copy: the ssh result is captured via $(...) which masks its exit code; check it explicitly, guard empty TEST_OUTPUT_PATH, and fail on the critical process.log / *Result.json scp. Trace files (normal_*) stay best-effort (|| true).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hells
Address review: multi_copy_script continued after a failed scp. Adopt `set -e` at the top of every per-host subshell as the single, consistent error-propagation mechanism, replacing the scattered `|| exit $?` guards.
- copy: a failed scp (or empty/failed remote exec) now aborts the host's subshell instead of running the remaining scp commands.
- build/kill/clean/test: same `set -e` guard for consistency.
- Trace files (normal_*) stay best-effort via `|| true`; empty TEST_OUTPUT_PATH is still guarded explicitly.
multi_kill_script already had the PID/EXIT_CODE propagation pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
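The exit-code masking described in these two commits can be reproduced with a few one-line shell scripts. The sketch below drives them from Python via `subprocess` and assumes a POSIX `sh` on PATH; the scripts use `false`/`true` as stand-ins for the failing `scp` and the trailing `ssh`, not the actual multi_runner.sh commands.

```python
import subprocess


def run_sh(script: str) -> int:
    """Run a script under POSIX sh and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


# ';' joins commands, so only the last command's status is reported.
masked = run_sh("false; true")
# Per-command guard: stop immediately with the failing command's status.
guarded = run_sh("false || exit $?; true")
# 'set -e' applies the same stop-on-failure rule to every command.
strict = run_sh("set -e; false; true")
# '|| true' keeps a best-effort step from aborting a 'set -e' script.
best_effort = run_sh("set -e; false || true; true")

print(masked, guarded, strict, best_effort)
```

The first case is the bug (failure hidden), the second is the intermediate fix, and the last two correspond to the `set -e` approach with best-effort trace copies.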
CI failed on the rocm config with:
no such package '@pip_gpu_rocm_torch//flydsl' ... referenced by '//rtp_llm/models_py/triton_kernels:fla'
fla/flydsl_chunk_gdn_mi308x*.py import the ROCm/MI308X-only `flydsl` package, which is not a pip dependency in the rocm lockfile. These modules are lazy-imported at runtime (fla/chunk.py) and shipped via setup.py, so import-based dependency resolution over :fla's srcs should not pull flydsl into the Bazel graph. Exclude them from the glob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/10 · P2/25 · P3/0
Blocking Issues
P1
Non-blocking Suggestions
P2
Checklist ✅ (56 items passed)
Strengths
- TensorPbConvert::torchToPb: move tensor to CPU before serializing data_ptr.
- deepseek_vl2 / kimi_k25: raise FileNotFoundError on missing config.json
instead of returning None/{} (fail fast, matches QWenV2).
- qwen_v2: fix transformer_prefix order to prefix + model_prefix so multimodal
("language_model.model.") keys resolve correctly.
- ssrf_check: also reject multicast and unspecified addresses.
- embeding_test: index token ids by vocab_size, not hidden_size.
- deepgemm_wrapper: globals().get() instead of getattr(globals(), ...).
- tau2_bench_comparer: tarfile extractall(filter="data") with fallback.
- QueryConverter::transMMInputsPB: take vector by const reference.
- attn_factory: extract _expand_flashinfer_alias() for bidirectional alias.
- headwise: cache input/kv length CPU lists in prepare(), reuse in forward().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
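The deepgemm_wrapper item is easy to misread, so here is a small sketch of the difference. `globals()` returns a dict, and `getattr` on a dict looks up dict attributes (such as `get` or `keys`), never the module-level names stored in it. The function names below are illustrative, not taken from deepgemm_wrapper.

```python
def fp8_gemm_stub() -> str:
    # Hypothetical module-level symbol that the lookup should find.
    return "ok"


def lookup_wrong(name: str):
    # Looks for an attribute on the dict object itself, so module names are never found.
    return getattr(globals(), name, None)


def lookup_right(name: str):
    # Looks up the key in the module namespace, returning None when absent.
    return globals().get(name)


print(lookup_wrong("fp8_gemm_stub"), lookup_right("fp8_gemm_stub"))
```

With the wrong form the optional kernel always appears to be missing, so any fallback path would be taken silently even when the symbol is defined.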
- ConfigInit: FMHAConfig/MoeConfig pickle setstate accept legacy tuple sizes (12/14 and 11/13), filling new trailing fields with defaults for rolling upgrades instead of throwing.
- mori_ep_intranode_router: assert _dispatch_ids is set in _finalize_single.
- xqa: XQADecodeImpl.support uses kernel_tokens_per_block (kernel-consistent); XQAWrapper precomputes q_len_per_req in prepare() to drop the per-forward GPU->CPU sync (propagated through the cuda-graph update path).
- test_xqa: assert per-case support/unsupported results and final counts.
- rocm_fmha_test: return failed when the aiter comparison mismatches.
- flashmla_sparse_cp_op_test: destroy_distributed_environment() in finally.
- deepep_low_latency_router_test: release port locks in finally.
- fp4_gemm_linear_test: snapshot/restore os.environ in setUp.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
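The legacy-tuple handling in the ConfigInit item can be modeled in Python as below. The real code is a pybind11 pickle binding in C++; this class, its three fields, and the legacy size of 2 are hypothetical and much smaller than the 11 to 14 element tuples mentioned above. Only the name `use_mori_ep` appears elsewhere in this conversation.

```python
from typing import Any, Tuple


class MoeConfigSketch:
    """Toy config whose pickled state is a positional tuple."""

    _FIELDS: Tuple[str, ...] = ("use_deepep", "use_mori_ep", "ep_chunk")
    _DEFAULTS: Tuple[Any, ...] = (False, False, 0)
    _LEGACY_SIZES: Tuple[int, ...] = (2,)  # assumed older layout without ep_chunk

    def __init__(self) -> None:
        for name, default in zip(self._FIELDS, self._DEFAULTS):
            setattr(self, name, default)

    def __getstate__(self) -> Tuple[Any, ...]:
        return tuple(getattr(self, name) for name in self._FIELDS)

    def __setstate__(self, state: Tuple[Any, ...]) -> None:
        if len(state) not in (*self._LEGACY_SIZES, len(self._FIELDS)):
            raise ValueError(f"unsupported state size {len(state)}")
        # Pad a shorter legacy tuple with defaults for the new trailing fields.
        padded = tuple(state) + self._DEFAULTS[len(state):]
        for name, value in zip(self._FIELDS, padded):
            setattr(self, name, value)


old = MoeConfigSketch.__new__(MoeConfigSketch)
old.__setstate__((True, True))  # state written by an older version
print(old.use_deepep, old.use_mori_ep, old.ep_chunk)
```

Accepting only known legacy sizes, rather than any shorter tuple, keeps a truncated or corrupt state from being silently padded.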
AI Code Review - PR #1121
Status: BLOCKING
Summary: P0/0 · P1/1 · P2/31 · P3/10
Blocking Issues
P1
Non-blocking Suggestions
P2
P3
Checklist ✅ (56 items passed)
Strengths
Continues from #985
This PR keeps the head branch on baohengyi/rtp-llm:feature/python_native_v2 and targets alibaba/rtp-llm:main.
It migrates the native build/test path to setup.py/pyproject.toml, pytest CI profiles, remote execution based test orchestration, and updated smoke/perf coverage. It also addresses all P0/P1 blocking issues and P2 non-blocking suggestions raised in the latest review round.
Key changes:
Validation:
python3 -m py_compile