
Conversation

@ZhangLirong-amd (Contributor) commented Dec 18, 2025

Motivation

How to run with DP attention + Mori MoE:
--enable-dp-attention --enable-expert-parallel

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings December 18, 2025 13:28
Copilot AI left a comment


Pull request overview

This PR addresses Data Parallel (DP) issues in benchmark execution and adds support for the Mori library in Mixture of Experts (MoE) implementations. The changes improve distributed computing functionality and extend MoE capabilities.

  • Removes problematic DP metadata initialization in forward context
  • Integrates Mori library for efficient MoE communication across DP ranks
  • Enhances DP synchronization logic in engine core to prevent deadlocks

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Summary of changes per file:

| File | Description |
| --- | --- |
| atom/utils/forward_context.py | Comments out DP metadata creation to fix benchmark issues |
| atom/utils/dbo/ubatching.py | Adds a placeholder function returning False, since DBO is not supported |
| atom/model_ops/topK.py | Adds Mori module detection and disables shared-expert fusion when DP size > 1 |
| atom/model_ops/moe.py | Major refactoring to support Mori kernels, with new base-class methods and modular kernel integration |
| atom/model_ops/fused_moe/*.py | New files implementing Mori prepare/finalize, modular kernels, config, and utilities |
| atom/model_ops/base_attention.py | Fixes output dtype when fused rmsnorm + quant is enabled |
| atom/model_ops/attentions/backends.py | Refactors the build method to calculate cu_seqlens_q earlier |
| atom/model_loader/loader.py | Adds an initialization call for Mori prepare/finalize after weight loading |
| atom/model_engine/scheduler.py | Adds helper methods for request tracking and next-batch info |
| atom/model_engine/model_runner.py | Removes DP preprocessing, adds dummy prefill execution, and improves profiler directory naming |
| atom/model_engine/engine_core_mgr.py | Implements parallel READY-signal waiting and broadcast utility commands |
| atom/model_engine/engine_core.py | Major refactoring of DP synchronization, with new state syncing and dummy prefill support |
| atom/config.py | Changes data_parallel_base_port to use dynamic port allocation (see the sketch after this table) |

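For the atom/config.py change, dynamic port allocation is typically done by binding to port 0 and letting the OS pick a free ephemeral port. A minimal sketch of that technique (illustrative, not the PR's actual code):

```python
import socket

def get_free_port() -> int:
    # Binding to port 0 makes the OS assign an unused ephemeral port;
    # getsockname() then reports which port was chosen.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```
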

```diff
     @property
     def use_all2all_kernels(self):
-        return self.dp_size > 1 and self.use_ep
+        return self.dp_size > 1 and _has_module("mori")
```
Copilot AI Dec 18, 2025

The property name 'use_all2all_kernels' is misleading since it now checks for Mori module availability rather than all2all kernel usage. Consider renaming to 'use_mori_all2all' or updating the logic to match the original intent.

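For reference, a helper like `_has_module` is usually a thin wrapper over importlib that tests availability without importing the package; a minimal sketch, assuming that is all the PR's helper does (the actual implementation may differ):

```python
import importlib.util

def _has_module(name: str) -> bool:
    # find_spec returns None when the module cannot be located,
    # so this checks importability without actually importing it.
    return importlib.util.find_spec(name) is not None
```
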
```python
self.intermediate_size = intermediate_size_per_partition_after_pad
self.hidden_size = hidden_size
self.hidden_pad = self.hidden_size - layer.hidden_size
# Update moe.hidden_dim to match the padded hidden size for Mori kernels
```
Copilot AI Dec 18, 2025

The comment says 'for Mori kernels' but this padding applies regardless of whether Mori is used. The comment should clarify this is general padding behavior needed for the MoE computation.

Suggested change:
```diff
-# Update moe.hidden_dim to match the padded hidden size for Mori kernels
+# Update moe.hidden_dim to match the padded hidden size used by the MoE computation (including Mori kernels)
```

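To make the padding relationship concrete, here is a hypothetical sketch of the arithmetic behind `hidden_pad`; the 256-element alignment and the sizes are purely illustrative, not taken from the PR:

```python
def pad_to_multiple(size: int, align: int = 256) -> int:
    # Round size up to the next multiple of align (align is illustrative).
    return (size + align - 1) // align * align

hidden_size = 7000                              # hypothetical unpadded width
padded_hidden_size = pad_to_multiple(hidden_size)   # 7168
hidden_pad = padded_hidden_size - hidden_size        # 168; mirrors self.hidden_pad above
```
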
Comment on lines +244 to +249
```python
# Now mori now supported shared expert
if self.shared_experts is None:
    return output
else:
    assert shared_output is not None
    return shared_output, output
```

Copilot AI Dec 18, 2025

Unreachable code detected. Lines 244-249 are never executed because line 242 returns unconditionally. Either remove the dead code or fix the control flow.

Suggested change:
```diff
-# Now mori now supported shared expert
-if self.shared_experts is None:
-    return output
-else:
-    assert shared_output is not None
-    return shared_output, output
```

@valarLip (Collaborator) commented

Please merge main to trigger new CI.

Copilot AI review requested due to automatic review settings December 19, 2025 06:31
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.



```diff
-self.use_chunked = (get_dp_group().world_size > 1) and (
-    not envs.ATOM_ENFORCE_EAGER
-)
+self.use_chunked = get_dp_group().world_size > 1
```
Copilot AI Dec 19, 2025

The condition for use_chunked has been simplified to only check get_dp_group().world_size > 1, removing the check for envs.ATOM_ENFORCE_EAGER. This changes behavior: previously, chunked mode could be disabled via the environment variable even with DP > 1. Consider whether this change is intentional or whether the eager-mode check should be preserved.

Suggested change:
```diff
-self.use_chunked = get_dp_group().world_size > 1
+self.use_chunked = get_dp_group().world_size > 1 and not envs.ATOM_ENFORCE_EAGER
```

Comment on lines +179 to +180
```python
# if scheduled_batch is None:
# return False
```
Copilot AI Dec 19, 2025

The logic has changed to check has_requests() before calling schedule(), but the commented-out code suggests this might not handle all edge cases. The previous pattern checked if scheduled_batch is None after scheduling. Ensure this new pattern properly handles cases where the scheduler has requests but cannot schedule them.

Suggested change:
```diff
-# if scheduled_batch is None:
-# return False
+if scheduled_batch is None:
+    return False
```

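A hypothetical sketch of the combined guard the review is asking for; `has_requests` and `schedule` are the names quoted in the thread, while the wrapper function itself is illustrative:

```python
def step_once(scheduler) -> bool:
    # Fast path: nothing pending at all.
    if not scheduler.has_requests():
        return False
    scheduled_batch = scheduler.schedule()
    # Requests can be pending yet unschedulable (e.g. blocked on resources),
    # so the None check is still needed after schedule().
    if scheduled_batch is None:
        return False
    return True
```
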
```python
expert_mask=expert_map,
activation=activation.value,
quant_type=self.quant_type.value,
# per_Tensor not support num_local_tokens so not use mori
```
Copilot AI Dec 19, 2025

Correct the spelling of 'Tensor' to 'tensor' in the comment to match Python naming conventions, and tidy the grammar.

Suggested change:
```diff
-# per_Tensor not support num_local_tokens so not use mori
+# per_tensor does not support num_local_tokens, so mori is not used
```

```python
assert False, "Now DBO async is not supported"
return output

# Now mori now supported shared expert
```
Copilot AI Dec 19, 2025

The comment 'Now mori now supported shared expert' is garbled; reword it as 'Mori does not support shared experts' or 'Mori now supports shared experts', whichever is intended.

Suggested change:
```diff
-# Now mori now supported shared expert
+# Mori does not support shared expert
```

Copilot AI review requested due to automatic review settings December 20, 2025 02:30
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.



```python
if is_fp8 and quant_type is not None
else torch.bfloat16
)
# mori_dtype = torch.bfloat16
```
Copilot AI Dec 20, 2025

Commented-out code should be removed if it's not needed. If it's meant as documentation for future work, convert it to a proper TODO comment explaining why bfloat16 might be preferred.

Suggested change:
```diff
-# mori_dtype = torch.bfloat16
```

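If the line is instead kept as a marker for future work, the conversion the reviewer asks for would look something like this (wording hypothetical):

```python
# TODO: decide whether mori_dtype should be forced to torch.bfloat16
# instead of the fp8-derived dtype; remove this marker once settled.
```
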
```python
return prepare_finalize

def maybe_make_prepare_finalize(self) -> FusedMoEPrepareAndFinalize | None:
    # if True:
```
Copilot AI Dec 20, 2025

Commented-out debug code should be removed. The # if True: line serves no purpose and clutters the codebase.

Suggested change:
```diff
-    # if True:
```

```python
# Note: init_prepare_finalize should only be called by
# prepare_communication_buffer_for_model.
def init_prepare_finalize(self, layer: torch.nn.Module):
    # print("init_prepare_finalize")
```
Copilot AI Dec 20, 2025

Debug print statement should be removed from production code.

Suggested change:
```diff
-    # print("init_prepare_finalize")
```

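If a trace of this call is still useful, the usual replacement for a stray print is a debug-level log line; a minimal sketch (illustrative, not PR code):

```python
import logging

logger = logging.getLogger(__name__)

# Emit the trace at debug level so it can be switched off in production
# instead of deleting it outright.
logger.debug("init_prepare_finalize called")
```
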
Copilot AI review requested due to automatic review settings December 26, 2025 09:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.



Copilot AI review requested due to automatic review settings January 6, 2026 13:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings January 9, 2026 08:20
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.



```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs

from atom.utils import envs, mark_spliting_op
```
Copilot AI Jan 9, 2026

Duplicate import of envs on lines 55 and 56. Remove the duplicate import from line 56.

Suggested change:
```diff
-from atom.utils import envs, mark_spliting_op
+from atom.utils import mark_spliting_op
```

```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs

from atom.utils import envs, mark_spliting_op
```
Copilot AI Jan 9, 2026

The import mark_spliting_op appears unused in this file. Consider removing it if it's not needed elsewhere in the code.

Suggested change:
```diff
-from atom.utils import envs, mark_spliting_op
```

Comment on lines +823 to +824
```python
# num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
# padded_scheduled_bs = scheduled_bs + num_pad
```
Copilot AI Jan 9, 2026

Remove commented-out code on lines 823-824 if it's no longer needed, or add a TODO comment explaining why it's kept.

Suggested change:
```diff
-# num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
-# padded_scheduled_bs = scheduled_bs + num_pad
```

```python
hidden_states = self.model(input_ids, positions)
else:
    graph_bs = context.graph_bs
```
Copilot AI Jan 9, 2026

Remove commented-out synchronization call or add a comment explaining why it's disabled.

Suggested change:
```diff
-graph_bs = context.graph_bs
+graph_bs = context.graph_bs
+# NOTE: Explicit CUDA synchronization is not required here in normal execution
+# because CUDA graph replay and subsequent operations already ensure correct
+# ordering. This call is kept commented out for potential debugging use only.
```

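The stream-ordering claim in the suggested note can be illustrated with a minimal capture/replay sketch (illustrative, not PR code): replay is enqueued on the current stream, so later ops on that stream are ordered after it without an explicit torch.cuda.synchronize().

```python
import torch

static_in = torch.zeros(8, device="cuda")
static_out = torch.zeros(8, device="cuda")

# Warm up on a side stream, as recommended before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(static_in * 2)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(static_in * 2)

static_in.fill_(3.0)
g.replay()
# clone() is enqueued on the same stream as replay(), so it observes the
# replayed results without an explicit synchronize.
result = static_out.clone()
```
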
Comment on lines +118 to +121
```python
# self.input_thread = threading.Thread(
#     target=self.process_input_sockets, args=(self.input_address,), daemon=True
# )
# self.input_thread.start()
```
Copilot AI Jan 9, 2026

Remove large block of commented-out code (lines 118-121) if it's no longer needed, or add a TODO comment explaining the reason for keeping it.

Suggested change:
```diff
-# self.input_thread = threading.Thread(
-#     target=self.process_input_sockets, args=(self.input_address,), daemon=True
-# )
-# self.input_thread.start()
```
