feat(moe): add bf16 DeepEP-normal MoE path via DeepGEMM grouped GEMM #1111

Open
Tanmo-ai wants to merge 3 commits into alibaba:main from Tanmo-ai:feature/moe-bf16-deepep-deepgemm

Conversation

@Tanmo-ai


Summary

Add an opt-in expert path for unquantized (bf16) MoE under DeepEP normal mode
that uses DeepGEMM grouped GEMM instead of the Triton fused_moe_kernel.

Changes

  • DeepGemmBf16HybridExecutor: runtime-dispatches between a masked 3D layout
    (small token count / decode) and a contiguous flat layout (large token count /
    prefill) for better memory utilization.
  • ep_scatter_bf16 / ep_scatter_v2_bf16: bf16 variants of the existing fp8
    scatter kernels (flat → contiguous, flat → 3D masked).
  • CudaNoQuantDpNormalDeepGemmStrategy: opt-in only, selected via
    --moe_strategy no_quant_dp_normal_deepgemm, gated on bf16 + has_deep_gemm +
    SM≥9 + no CUDA graph. It is not part of "auto" selection, so the default
    MoE path on existing deployments is unchanged.
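
As a rough illustration of why two layouts can be useful, the sketch below compares how many rows each layout would reserve for the same routing result, and picks a layout from the token count. The alignment value, the threshold, the choice of `max_m`, and all function names are assumptions made for this example; they are not taken from the PR's code.

```python
from typing import List

ALIGN = 128                   # assumed row alignment (hypothetical value)
MASKED_TOKEN_THRESHOLD = 256  # assumed dispatch cutoff (hypothetical value)


def align_up(n: int, align: int = ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def masked_rows(tokens_per_expert: List[int]) -> int:
    """Masked 3D layout: every expert reserves the same padded row count."""
    max_m = align_up(max(tokens_per_expert, default=0))
    return len(tokens_per_expert) * max_m


def contiguous_rows(tokens_per_expert: List[int]) -> int:
    """Contiguous flat layout: each expert reserves only its own aligned count."""
    return sum(align_up(n) for n in tokens_per_expert)


def choose_layout(token_num: int) -> str:
    """Pick the layout from the token count, mirroring the described dispatch."""
    return "masked" if token_num <= MASKED_TOKEN_THRESHOLD else "contiguous"
```

With a skewed routing such as `[300, 5, 0, 207]` tokens per expert, the masked layout in this sketch reserves the worst-case row count for every expert, while the contiguous layout reserves only each expert's aligned count. That is one plausible reading of the "better memory utilization" remark for large token counts.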

deepgemm_wrapper.py changes (backward-compatible)

These changes enable the bf16 grouped-GEMM path and do not affect existing
fp8/bf16 callers:

  • bf16 fallback symbol names corrected to the real deep_gemm symbols
    (gemm_bf16_bf16_bf16_nt*). resolve_symbol() tries the standard name first
    and only falls back, so existing resolution is unchanged — this only makes the
    previously dormant bf16 path resolvable. The stale compiled_dims argument is
    dropped from the contiguous/masked bf16 calls to match the actual deep_gemm
    signature.
  • Symbol resolution deferred to first use (_ensure_initialized) instead of
    at import time. Functionally identical, only lazy: the same symbols are
    resolved, just on the first actual GEMM call.
  • has_deep_gemm() re-checks until the first successful import (then caches
    True) instead of caching the first result. For normal processes where
    deep_gemm is importable at import time it returns True on the first call
    exactly as before; this only adds resilience when the package becomes
    importable slightly later, and does not change existing behavior.
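
The two wrapper behaviors described above (re-checking availability until the first success, and trying a standard symbol name before a fallback) can be sketched as follows. This is a minimal illustration under stated assumptions: `make_availability_probe` and the injected `importer` are hypothetical helpers, and this `resolve_symbol` is a stand-in with an assumed signature, not the wrapper's actual function.

```python
import importlib
from typing import Any, Callable, Optional, Sequence


def make_availability_probe(
    name: str,
    importer: Callable[[str], Any] = importlib.import_module,
) -> Callable[[], bool]:
    """Return a probe that caches only a successful import of `name`."""
    state = {"available": False}

    def probe() -> bool:
        if state["available"]:
            return True  # positive result is cached for good
        try:
            importer(name)
        except ImportError:
            return False  # negative result is NOT cached; next call retries
        state["available"] = True
        return True

    return probe


def resolve_symbol(module: Any, names: Sequence[str]) -> Optional[Callable]:
    """Return the first attribute found, trying names in order (standard first)."""
    for candidate in names:
        fn = getattr(module, candidate, None)
        if fn is not None:
            return fn
    return None
```

Calling the probe lazily, on the first real GEMM call rather than at import time, gives the deferred behavior the description attributes to `_ensure_initialized`.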

Testing

  • New unit test test_ep_scatter_bf16.py covers bf16 scatter kernel correctness.
  • Existing fp8/bf16 MoE paths and default auto-selection are unchanged.

Add an opt-in expert path for unquantized (bf16) MoE under DeepEP normal
mode that uses DeepGEMM grouped GEMM instead of the Triton fused_moe_kernel.

- DeepGemmBf16HybridExecutor: runtime-dispatches between a masked 3D layout
  (small token count / decode) and a contiguous flat layout (large token
  count / prefill) for better memory utilization.
- ep_scatter_bf16 / ep_scatter_v2_bf16: bf16 variants of the existing fp8
  scatter kernels (flat -> contiguous, flat -> 3D masked).
- CudaNoQuantDpNormalDeepGemmStrategy: opt-in only, selected via
  --moe_strategy no_quant_dp_normal_deepgemm, gated on bf16 + has_deep_gemm
  + SM>=9 + no CUDA graph. It is NOT part of "auto" selection, so the default
  MoE path on existing CUDA deployments is unchanged.

deepgemm_wrapper.py changes are backward-compatible and do not affect existing
fp8/bf16 callers:
- has_deep_gemm() re-checks until the first successful import (then caches
  True) instead of caching the first result; for normal processes where
  deep_gemm is importable at import time it returns True on the first call
  exactly as before. Needed for spawned subprocesses whose sys.path is set
  up after module import.
- Symbol resolution is deferred from import-time to first use
  (_ensure_initialized); functionally identical, only lazy.
- bf16 grouped-GEMM legacy fallback names corrected to the real deep_gemm
  symbols (gemm_bf16_bf16_bf16_nt*). resolve_symbol() tries the standard name
  first, so existing resolution is unchanged; this only makes the previously
  dormant bf16 path resolvable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Tanmo-ai Tanmo-ai requested a review from LLLLKKKK as a code owner June 17, 2026 01:59
@CLAassistant

CLAassistant commented Jun 17, 2026


CLA assistant check
All committers have signed the CLA.

@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

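  • Contiguous scatter test under-allocates capacity and can write out of bounds @ rtp_llm/models_py/triton_kernels/moe/test/test_ep_scatter_bf16.py:315
    • Suggestion: count the real number of tokens per expert from recv_topk, align the counts, and allocate from those; or lower token_num/topk so that sum(aligned_counts) covers every valid write.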
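
To make the capacity problem concrete, here is a small worked sketch. It assumes 8 experts (consistent with 1024 slots at a fixed capacity of 128), an alignment of 128, and that negative expert ids mark entries to skip; the function names are hypothetical and the routing data is synthetic.

```python
from collections import Counter
from typing import List


def align_up(n: int, align: int) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def fixed_total(num_experts: int, per_expert: int) -> int:
    """Total slots when every expert gets the same fixed capacity."""
    return num_experts * per_expert


def aligned_counts(expert_ids: List[int], num_experts: int, align: int) -> List[int]:
    """Per-expert capacity taken from the real routing histogram, then aligned."""
    hist = Counter(e for e in expert_ids if 0 <= e < num_experts)
    return [align_up(hist[e], align) for e in range(num_experts)]


# Synthetic routing: 2048 valid assignments, skewed toward expert 0.
assignments = [0] * 1000 + [i % 8 for i in range(1048)]
capacity = aligned_counts(assignments, num_experts=8, align=128)
```

Here the fixed allocation offers `fixed_total(8, 128)` slots, which is fewer than the 2048 assignments, whereas `sum(capacity)` is always at least the number of valid assignments because each count is only ever rounded up.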

Non-blocking Suggestions

P2

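  • New BF16 DeepGEMM executor lacks end-to-end correctness coverage @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:158
    • Suggestion: add an executor-level BF16 no-quant test that covers both paths (token_num below and above the threshold) and explicit strategy selection, and that verifies the output after router weights are applied.

Checklist Violations (5 fail / 104 total)

General Principles Checklist

  • [6.1] Tests: new logic has focused unit tests plus relevant integration/smoke tests → issue: New BF16 DeepGEMM executor lacks end-to-end correctness coverage
    The new executor/strategy has no end-to-end correctness test; the tests added so far cover only the scatter kernels, not the full GEMM/activation/gather output.
  • [6.1] Tests: edge cases covered (empty, single element, maximum) → issue: Contiguous scatter test under-allocates capacity and can write out of bounds
    test_large_random fixes the per-expert capacity at 128, for 1024 slots in total, which is fewer than the 2048 valid assignments, so the stress test itself can write out of bounds.
  • [6.1] Tests: distributed/cross-platform changes have matching coverage → issue: New BF16 DeepGEMM executor lacks end-to-end correctness coverage
    The new path serves DeepEP normal distributed MoE, but no executor-level EP/strategy coverage was found.

RTP-LLM Checklist

  • [B] Correctness and logic: CUDA kernel batch index dimensions match the host buffer shape → issue: Contiguous scatter test under-allocates capacity and can write out of bounds
    The host buffer passed to the new contiguous scatter stress test does not match the size the kernel actually writes, so out-of-bounds writes are possible.
  • [H] Tests and CI: sufficient test coverage (equivalence coverage for large refactors, end-to-end tests for new features) → issue: New BF16 DeepGEMM executor lacks end-to-end correctness coverage
    The BF16 DeepGEMM executor/strategy is a new feature but has no end-to-end output comparison test; the scatter large_random test also has the capacity bug.

Strengths

  • The new strategy stays explicitly opt-in and is not part of auto selection, which lowers the regression risk for the default MoE path.
  • The DeepGEMM wrapper now initializes lazily, so the optional dependency no longer fails at module import.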

…r e2e

test_ep_scatter_bf16: the scatter kernels assign output slots with a
non-deterministic tl.atomic_add, so row order within an expert is not fixed.
Rewrite all checks to be order-independent by following output_index (the
authoritative token->slot map the gather stage uses) instead of assuming a
token-sequential layout. This replaces the previously order-sensitive
torch.equal comparisons that could spuriously fail. Also:
- fix the roundtrip tests to use hidden_size % 512 == 0 (ep_gather BLOCK_D=512);
- size the contiguous stress test's per-expert capacity from the real routing
  histogram (bincount, aligned) instead of a fixed count, matching the
  executor's allocation and avoiding under-allocation.
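
The order-independent check described above can be sketched without a GPU. The toy scatter below shuffles arrival order to mimic the non-deterministic slot assignment, and the checker follows `output_index` rather than assuming token order. Everything here (function names, list-based buffers, unaligned expert regions) is an assumption for illustration and not the test's actual code.

```python
import random
from typing import List, Optional, Tuple

Row = List[float]


def toy_scatter(
    hidden: List[Row], topk_ids: List[List[int]], num_experts: int, seed: int
) -> Tuple[List[Optional[Row]], List[List[int]], List[int], List[int]]:
    """Scatter rows into per-expert regions in a seed-dependent order."""
    counts = [0] * num_experts
    for row in topk_ids:
        for e in row:
            if e >= 0:
                counts[e] += 1
    starts = [0] * num_experts
    for e in range(1, num_experts):
        starts[e] = starts[e - 1] + counts[e - 1]
    pairs = [(t, k) for t, row in enumerate(topk_ids)
             for k, e in enumerate(row) if e >= 0]
    random.Random(seed).shuffle(pairs)  # stand-in for atomic_add arrival order
    cursor = list(starts)
    output: List[Optional[Row]] = [None] * sum(counts)
    output_index = [[-1] * len(row) for row in topk_ids]
    for t, k in pairs:
        e = topk_ids[t][k]
        slot = cursor[e]
        cursor[e] += 1
        output[slot] = hidden[t]
        output_index[t][k] = slot
    return output, output_index, starts, counts


def check_scatter(hidden, topk_ids, output, output_index, starts, counts) -> bool:
    """Validate via output_index: right region, unique slot, matching row."""
    seen = set()
    for t, row in enumerate(topk_ids):
        for k, e in enumerate(row):
            if e < 0:
                continue  # skipped entry, nothing was written
            slot = output_index[t][k]
            in_region = starts[e] <= slot < starts[e] + counts[e]
            if not in_region or slot in seen or output[slot] != hidden[t]:
                return False
            seen.add(slot)
    return True
```

Because the checker never compares whole buffers, it passes for any slot order inside an expert's region, which is the property a direct `torch.equal` comparison against a token-sequential layout lacks.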

deepgemm_bf16_hybrid_executor: new end-to-end test for the bf16 DeepEP-normal
hybrid executor (scatter -> grouped GEMM -> silu_and_mul -> grouped GEMM ->
gather with router weight), covering both the masked (small token count) and
contiguous (large token count) runtime paths against a plain-torch reference.
Tagged open_skip + H20 (requires deep_gemm + SM>=9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

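  • CUDA count tensor may land on the wrong device @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:308
    • Suggestion: use .to(device=device, non_blocking=True) and have expert_start_loc follow the same device.
  • compiled_dims argument of the BF16 grouped GEMM is silently ignored @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:595
    • Suggestion: forward compiled_dims if the underlying implementation supports it; otherwise raise explicitly when compiled_dims != "nk", or remove the public parameter.
  • New DeepEP Normal path lacks multi-rank coverage @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/test/deepgemm_bf16_hybrid_executor_test.py:72
    • Suggestion: add at least one integration test with ep_size>1 that goes through DeepepNormalRouter prepare/finalize and checks that the cross-rank output matches the torch reference.

Checklist Violations (7 fail / 104 total)

General Principles Checklist

  • [6.1] Architecture: state invariants (create/update/failure/retry/rollback paths are valid) → issue: CUDA count tensor may land on the wrong device
    num_recv_tokens_per_expert_gpu is built with .cuda() in the executor, which depends on the current device and can break the invariant that it lives on hidden_states.device.
  • [6.1] Architecture: error semantics (fail-fast/retry/fallback/silent behavior is explicit) → issue: compiled_dims argument of the BF16 grouped GEMM is silently ignored
    m_grouped_bf16_gemm_nt_contiguous/masked keep the compiled_dims parameter but silently ignore any non-default value.
  • [6.1] Architecture: compatibility (public API/persisted data/config/environment migration is safe) → issue: compiled_dims argument of the BF16 grouped GEMM is silently ignored
    The wrapper's public compiled_dims parameter is not forwarded to the underlying implementation, so callers get behavior that does not match the parameter's meaning.
  • [6.1] Tests: new logic has focused unit tests plus relevant integration/smoke tests → issue: New DeepEP Normal path lacks multi-rank coverage
    Single-GPU kernel/executor tests exist, but there is no multi-rank integration coverage with the real DeepepNormalRouter.
  • [6.1] Tests: distributed/cross-platform changes have matching coverage → issue: New DeepEP Normal path lacks multi-rank coverage
    DeepEP Normal is a distributed path, but the new tests fix ep_size=1 and do not cover cross-rank dispatch/combine.

RTP-LLM Checklist

  • [E] Distributed: cross-rank data consistency → issue: New DeepEP Normal path lacks multi-rank coverage
    The tests do not cover cross-rank dispatch/combine with EP>1, so the consistency of global/local expert ids and per-rank token counts cannot be verified.
  • [H] Tests and CI: sufficient test coverage (equivalence coverage for large refactors, end-to-end tests for new features) → issue: New DeepEP Normal path lacks multi-rank coverage
    The new end-to-end test covers only ep_size=1 and bypasses the router; real multi-rank coverage of this DeepEP Normal strategy is missing.

Strengths

  • The scatter kernel tests verify results order-independently via output_index, so the non-deterministic order of tl.atomic_add is not misreported as an error.
  • The DeepGEMM strategy is explicitly opt-in, which avoids changing the existing auto MoE selection.
  • The new BF16 DeepGEMM executor covers the masked and contiguous compute paths separately.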

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch 2 times, most recently from e4564ea to 68b0a44 Compare June 17, 2026 08:54
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Checklist ✅ (104 items passed)

Strengths

  • The DeepGEMM BF16 executor covers both the masked and contiguous paths, with added scatter/gather and executor-level end-to-end tests.

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 2012875 to 8e0e71d Compare June 17, 2026 12:01
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

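  • BF16 DeepGEMM symbols are not validated at strategy selection @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:121
    • Suggestion: verify in check_conditions that the BF16 grouped GEMM symbols resolve, or add a has_deep_gemm_bf16_grouped() capability check before allowing the strategy to be selected.
  • New strategy lacks factory selection coverage @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:120
    • Suggestion: add a strategy/factory unit test that sets moe_strategy="no_quant_dp_normal_deepgemm" and verifies that CudaNoQuantDpNormalDeepGemmStrategy and the matching router/executor are selected.

Checklist Violations (2 fail / 56 total)

General Principles Checklist

  • [6.1] Architecture: error semantics (fail-fast/retry/fallback/silent behavior is explicit) → issue: BF16 DeepGEMM symbols are not validated at strategy selection
    Strategy selection only checks that the deep_gemm package exists, not that the BF16 grouped GEMM symbols resolve, so the error is deferred to the first execution request.
  • [6.1] Tests: new logic has focused unit tests plus relevant integration/smoke tests → issue: New strategy lacks factory selection coverage
    Focused executor/kernel tests exist, but the new CLI strategy has no registry/factory selection test.

Strengths

  • The new BF16 DeepGEMM path stays explicitly opt-in and is not part of the auto default selection, which lowers the regression risk for the default MoE path.
  • The new executor and scatter kernel tests cover the masked/contiguous paths and the global-to-local expert id mapping for EP rank0/rank1.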

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 8e0e71d to 0a39628 Compare June 17, 2026 14:25
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

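  • BF16 contiguous padding rows take part in the GEMM as if they belonged to a valid expert @ rtp_llm/models_py/triton_kernels/moe/ep_kernels.py:262
    • Suggestion: mark padding rows with m_indices=-1 or zero-fill them explicitly, fix that contract, and add padding-row tests.
  • Availability of the DeepGEMM BF16 grouped symbols is not validated @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:121
    • Suggestion: add BF16 grouped symbol probing in the strategy or the wrapper; when the symbols are missing, reject the strategy or provide a clear fallback.
  • New DeepGEMM strategy lacks selection-path tests @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:120
    • Suggestion: in test_cuda_strategies.py, add a positive selection case for this strategy and negative cases such as fp16/auto/cuda_graph.

Checklist Violations (5 fail / 104 total)

General Principles Checklist

  • [6.1] Architecture: error semantics (fail-fast/retry/fallback/silent behavior is explicit) → issue: Availability of the DeepGEMM BF16 grouped symbols is not validated
    The strategy only checks that the deep_gemm package exists, not that the BF16 grouped symbols resolve; missing symbols are deferred to a runtime failure in the executor.
  • [6.1] Tests: new logic has focused unit tests plus relevant integration/smoke tests → issue: New DeepGEMM strategy lacks selection-path tests
    The executor/kernel are covered, but the positive and negative registry/can_handle selection paths for the new moe_strategy are not.
  • [6.1] Tests: edge cases covered (empty, single element, maximum) → issue: BF16 contiguous padding rows take part in the GEMM as if they belonged to a valid expert
    The contiguous scatter tests verify occupied rows and the roundtrip, but do not assert the m_indices/zero-fill contract for padding rows.

RTP-LLM Checklist

  • [A] Compatibility and configuration: optional dependencies are lazily imported; new pybind fields have C++ defaults → issue: Availability of the DeepGEMM BF16 grouped symbols is not validated
    The deep_gemm dependency is already lazy, but the strategy layer only confirms that the package exists and does not verify that the new BF16 grouped symbols are available.
  • [H] Tests and CI: sufficient test coverage (equivalence coverage for large refactors, end-to-end tests for new features) → issue: New DeepGEMM strategy lacks selection-path tests
    The new no-quant DeepGEMM strategy is a new feature at the configuration entry point, but it has no positive or negative cases at the strategy selection level.

Strengths

  • The new BF16 DeepGEMM executor and Triton scatter tests cover masked/contiguous, EP rank0/rank1, and the scatter-gather roundtrip, so the core numerical paths are verified in a targeted way.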

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 0a39628 to e284541 Compare June 18, 2026 03:05
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

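  • Probing for the new strategy lets non-opt-in configurations trigger a DeepGEMM symbol resolution exception @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:126
    • Suggestion: short-circuit explicitly after checking moe_strategy, or have has_deep_gemm_bf16_grouped catch symbol resolution failures and return False.

Non-blocking Suggestions

P2

  • Test skip condition does not cover the BF16 grouped DeepGEMM symbols @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/test/deepgemm_bf16_hybrid_executor_test.py:71
    • Suggestion: use has_deep_gemm_bf16_grouped() as the skip condition, and make sure the helper returns False rather than raising when the symbols are missing.

Checklist Violations (4 fail / 60 total)

General Principles Checklist

  • [6.1] Architecture: error semantics (fail-fast/retry/fallback/silent behavior is explicit) → issue: Probing for the new strategy lets non-opt-in configurations trigger a DeepGEMM symbol resolution exception
    Missing optional deep_gemm bf16 grouped symbols should make the new strategy unavailable, but the current probe raises RuntimeError and aborts strategy enumeration.
  • [6.1] Architecture: compatibility (public API/persisted data/config/environment migration is safe) → issue: Probing for the new strategy lets non-opt-in configurations trigger a DeepGEMM symbol resolution exception
    When an older deep_gemm package lacks the new bf16 grouped symbols, probing the newly registered strategy can affect initialization of existing auto configurations.
  • [6.1] Tests: new logic has focused unit tests plus relevant integration/smoke tests → issue: Test skip condition does not cover the BF16 grouped DeepGEMM symbols
    The new e2e test skips only on has_deep_gemm() and does not account for the availability of the BF16 grouped symbols the strategy actually depends on.

RTP-LLM Checklist

  • [H] Tests and CI: sufficient test coverage (equivalence coverage for large refactors, end-to-end tests for new features) → issue: Test skip condition does not cover the BF16 grouped DeepGEMM symbols
    The new e2e test does not skip on has_deep_gemm_bf16_grouped(), so its coverage condition is inconsistent with the strategy gating.

Strengths

  • The new bf16 DeepGEMM execution path is designed as explicitly opt-in and, by strategy intent, is not part of the auto default selection.
  • The tests cover both the masked and contiguous execution paths, including scatter/gather and the expert id mapping across EP ranks.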

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from e284541 to cc2025a Compare June 18, 2026 04:01
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

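  • DeepGEMM initialization makes the BF16 symbols a hard dependency of the FP8 path @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:171
    • Suggestion: split initialization by call path: the FP8 wrappers resolve only FP8 symbols, the BF16 capability probe resolves only BF16 grouped symbols, and missing BF16 symbols only make the new strategy unselectable.

Non-blocking Suggestions

P2

  • compiled_dims argument of the BF16 grouped GEMM is silently ignored @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:621
    • Suggestion: either keep passing compiled_dims to the deep_gemm symbols that support it, or remove/explicitly reject the parameter, to avoid silently ignoring it.

Checklist Violations (3 fail / 104 total)

General Principles Checklist

  • [6.1] Architecture: compatibility (public API/persisted data/config/environment migration is safe) → issue: DeepGEMM initialization makes the BF16 symbols a hard dependency of the FP8 path
    _ensure_initialized() resolves the BF16 grouped symbols together with everything else, which blocks the existing FP8 wrappers in older deep_gemm environments; in addition, the BF16 grouped compiled_dims parameter is silently ignored.

RTP-LLM Checklist

  • [A] Compatibility and configuration: optional dependencies are lazily imported; new pybind fields have C++ defaults → issue: DeepGEMM initialization makes the BF16 symbols a hard dependency of the FP8 path
    The optional deep_gemm dependency is now initialized lazily, but all symbols are resolved at once, so missing BF16 grouped symbols spread to the existing FP8 wrappers.
  • [B] Correctness and logic: interface return type changes are handled compatibly → checklist-only
    A single draft considered the bf16_gemm_nt legacy fallback signature incompatible; it did not reach the threshold for being kept as an issue and is retained as a compatibility risk note.

Strengths

  • The new BF16 hybrid executor covers both the masked and contiguous paths, with added scatter/gather, end-to-end numerical, and EP rank mapping tests.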

@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from cc2025a to 9ecf9d8 Compare June 18, 2026 05:28
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/0 · P3/0

Blocking Issues

P1

  • Empty token batches enter the masked path and launch Triton with a 0 grid @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:159
    • Suggestion: handle token_num==0 or sum(expert_num_tokens)==0 before dispatch by directly returning an empty/zero fused_expert_output of the same shape, and add empty-batch/empty-rank tests.
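
The suggested guard can be sketched in plain Python as below. This is only an illustration of the control flow: the `execute` signature, the list-based buffers, the `threshold` default, and the injected `run_masked`/`run_contiguous` callables are all assumptions, not the executor's real interface.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def execute(
    hidden: Matrix,
    expert_num_tokens: Sequence[int],
    hidden_size: int,
    run_masked: Callable[[Matrix], Matrix],
    run_contiguous: Callable[[Matrix], Matrix],
    threshold: int = 256,  # hypothetical dispatch cutoff
) -> Matrix:
    """Dispatch to a layout path, returning early when there is no work."""
    token_num = len(hidden)
    if token_num == 0 or sum(expert_num_tokens) == 0:
        # No kernel launch: [token_num, hidden_size] zeros, i.e. [] when empty.
        return [[0.0] * hidden_size for _ in range(token_num)]
    if token_num <= threshold:
        return run_masked(hidden)
    return run_contiguous(hidden)
```

Placing the guard before the threshold comparison matters, because `token_num == 0` would otherwise satisfy the small-token condition and reach the masked path.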

Checklist Violations (5 fail / 104 total)

General Principles Checklist

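  • [6.1] Architecture: state invariants (create/update/failure/retry/rollback paths are valid) → issue: Empty token batches enter the masked path and launch Triton with a 0 grid
    The empty-token/empty-rank state is not handled before dispatch, so it enters the masked path and triggers a 0-grid kernel.
  • [6.1] Tests: distributed/cross-platform changes have matching coverage → issue: Empty token batches enter the masked path and launch Triton with a 0 grid
    With DeepEP, small batches or skewed routing can leave a rank with token_num of 0; the current tests do not cover this edge case, and the implementation proceeds into the masked path.
  • [6.1] Tests: edge cases covered (empty, single element, maximum) → issue: Empty token batches enter the masked path and launch Triton with a 0 grid
    With DeepEP, small batches or skewed routing can leave a rank with token_num of 0; the current tests do not cover this edge case, and the implementation proceeds into the masked path.

RTP-LLM Checklist

  • [B] Correctness and logic: edge cases (empty input, single element, maximum) → issue: Empty token batches enter the masked path and launch Triton with a 0 grid
    When token_num == 0, execute() still picks the masked path, and the subsequent scatter/DeepGEMM with alignment 0 is unguarded.
  • [H] Tests and CI: sufficient test coverage (equivalence coverage for large refactors, end-to-end tests for new features) → issue: Empty token batches enter the masked path and launch Triton with a 0 grid
    The new executor/scatter tests cover masked/contiguous and ep2, but not the DeepEP edge case where this rank receives 0 tokens.

Strengths

  • The new strategy is explicitly opt-in, which avoids changing the auto default MoE path.
  • The new tests cover the main paths: the bf16 scatter kernels, the DeepGEMM executor masked/contiguous paths, and EP rank0/rank1.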

…d bf16 init

Review fixes on the bf16 DeepEP-Normal deepgemm MoE path. All changes are
confined to the opt-in no_quant_dp_normal_deepgemm path, the bf16 deep_gemm
wrappers, and tests — the fp8 path and other default/cross-arch paths are
unaffected.

- DeepGemmBf16HybridExecutor.execute: handle empty rank before dispatch. DeepEP
  small-batch / skewed routing can leave a rank with token_num == 0; that would
  otherwise enter the masked path with alignment == 0 and launch 0-grid Triton
  scatter / 0-size DeepGEMM. Return an empty same-shape [0, K] bf16 output.

- DeepGemmBf16HybridExecutor (contiguous path): build the per-expert token-count
  tensor with .to(device=hidden_states.device, non_blocking=True) instead of
  .cuda() so it honors the hidden-states device invariant.

- deepgemm_wrapper: decouple bf16 symbol resolution from the fp8 path. Previously
  _ensure_initialized() resolved fp8 AND bf16 symbols together, so an older
  deep_gemm build missing the bf16 symbols would raise from _ensure_initialized()
  and break the fp8 wrappers. Now _ensure_initialized() resolves only _FP8_SYMBOLS
  (raises if missing — fp8 is core), while _ensure_bf16_initialized() resolves
  _BF16_SYMBOLS independently and tolerantly (missing -> impls stay None, never
  propagate). bf16 wrappers call _ensure_bf16_initialized(); fp8 wrappers keep
  _ensure_initialized(). has_deep_gemm_bf16_grouped() reports False (never raises)
  when the bf16 symbols are unavailable.

- deepgemm_wrapper bf16 grouped wrappers: reject a non-default compiled_dims
  explicitly (NotImplementedError) instead of silently ignoring it; the wrapper
  does not forward compiled_dims (forwarding perturbs bf16 numerics on this shared
  path). No current caller passes a non-"nk" value.

- CudaNoQuantDpNormalDeepGemmStrategy: fail fast at selection via
  has_deep_gemm_bf16_grouped(), and gate on the explicit opt-in moe_strategy FIRST
  with a short-circuit return (ConditionChecker does not stop at the first failed
  check, so this keeps the probe from running for non-opt-in / "auto" configs).

- Tests: empty-rank (token_num==0) executor cases (ep1 + ep2); ep_size>1 executor
  coverage (rank 0/1, _to_local_expert_ids mapping + masking) vs a per-rank torch
  reference; strategy selection pos/neg; has_deep_gemm_bf16_grouped no-raise and
  _ensure_bf16_initialized tolerance; executor test skip uses
  has_deep_gemm_bf16_grouped() to match the gating.
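
The decoupled initialization and the short-circuit gating described in this commit can be sketched as follows. This is a simplified model under stated assumptions: the symbol names in `FP8_NAMES` and `BF16_NAMES` are placeholders rather than real deep_gemm symbols, and `GemmSymbols` and `can_use_bf16_deepgemm` are hypothetical stand-ins for the wrapper and the strategy check.

```python
from typing import Any, Callable, Dict, Optional

FP8_NAMES = ("fp8_symbol_a",)    # placeholder names, not real deep_gemm symbols
BF16_NAMES = ("bf16_symbol_a",)  # placeholder names, not real deep_gemm symbols


class GemmSymbols:
    """Resolve fp8 symbols strictly and bf16 symbols tolerantly, independently."""

    def __init__(self, module: Any) -> None:
        self._module = module
        self._fp8: Optional[Dict[str, Callable]] = None
        self._bf16: Optional[Dict[str, Callable]] = None
        self._bf16_probed = False

    def ensure_fp8(self) -> Dict[str, Callable]:
        if self._fp8 is None:
            resolved = {}
            for name in FP8_NAMES:
                fn = getattr(self._module, name, None)
                if fn is None:
                    raise RuntimeError(f"missing fp8 symbol: {name}")  # fp8 is core
                resolved[name] = fn
            self._fp8 = resolved
        return self._fp8

    def ensure_bf16(self) -> Optional[Dict[str, Callable]]:
        if not self._bf16_probed:
            found = {n: getattr(self._module, n, None) for n in BF16_NAMES}
            # Missing bf16 symbols leave the table as None instead of raising.
            self._bf16 = found if all(v is not None for v in found.values()) else None
            self._bf16_probed = True
        return self._bf16

    def has_bf16_grouped(self) -> bool:
        try:
            return self.ensure_bf16() is not None
        except Exception:
            return False  # capability probes report False, never raise


def can_use_bf16_deepgemm(moe_strategy: str, dtype: str, sm_major: int,
                          cuda_graph: bool, symbols: GemmSymbols) -> bool:
    if moe_strategy != "no_quant_dp_normal_deepgemm":
        return False  # opt-in gate first, so the probe never runs for "auto"
    return (dtype == "bf16" and sm_major >= 9 and not cuda_graph
            and symbols.has_bf16_grouped())
```

With this split, an older build that only exposes the fp8 names still serves the fp8 wrappers, and the bf16 strategy simply becomes unselectable.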

The ep_kernels contiguous padding-row m_indices contract is left to the feature
kernel owner (padding output is discarded by the gather; a real fix needs a kernel
signature change + a deep_gemm -1 skip contract).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Tanmo-ai Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 9ecf9d8 to 2658cff Compare June 18, 2026 05:58
@LLLLKKKK

Collaborator

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Checklist ✅ (56 items passed)

Strengths

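  • The new bf16 DeepGEMM path stays explicitly opt-in and is not part of the auto default strategy, which lowers the regression risk for the default path.
  • The new bf16 scatter/gather and executor tests cover the masked, contiguous, EP rank mapping, and empty-token/empty-rank paths, verifying the main data flow.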

@wht21

wht21 commented Jun 18, 2026

Collaborator

internal source has been updated, please review the changes!

Honglei-Qiu added a commit to Honglei-Qiu/rtp-llm that referenced this pull request Jun 18, 2026
Replace runtime signature introspection with explicit validation:
bf16 grouped GEMM only supports compiled_dims='nk', reject others
with ValueError. Matches PR alibaba#1111 approach.

Remove _has_param helper and all inspect.signature usage — eliminates
unintrospectable callable, **kwargs, and positional-vs-keyword edge cases.
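
A minimal sketch of the explicit-validation approach this commit message describes, assuming a wrapper that receives the underlying implementation as a callable; `validate_compiled_dims` and `grouped_bf16_gemm` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def validate_compiled_dims(compiled_dims: str = "nk") -> None:
    """Reject any value other than the single supported one."""
    if compiled_dims != "nk":
        raise ValueError(
            f"bf16 grouped GEMM only supports compiled_dims='nk', got {compiled_dims!r}"
        )


def grouped_bf16_gemm(impl: Callable[..., Any], *args: Any,
                      compiled_dims: str = "nk") -> Any:
    """Validate up front, then call impl without forwarding compiled_dims."""
    validate_compiled_dims(compiled_dims)
    return impl(*args)
```

Compared with inspecting the callee's signature at runtime, a fixed check like this has one well-defined failure mode and does not depend on whether the underlying callable can be introspected.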

4 participants