Fix DP issues in benchmark and support Mori in Moe #72
base: main
Conversation
Pull request overview
This PR addresses Data Parallel (DP) issues in benchmark execution and adds support for the Mori library in Mixture of Experts (MoE) implementations. The changes improve distributed computing functionality and extend MoE capabilities.
- Removes problematic DP metadata initialization in forward context
- Integrates Mori library for efficient MoE communication across DP ranks
- Enhances DP synchronization logic in engine core to prevent deadlocks
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| atom/utils/forward_context.py | Comments out DP metadata creation to fix benchmark issues |
| atom/utils/dbo/ubatching.py | Adds placeholder function returning False since DBO is not supported |
| atom/model_ops/topK.py | Adds Mori module detection and disables shared expert fusion when DP size > 1 |
| atom/model_ops/moe.py | Major refactoring to support Mori kernels with new base class methods and modular kernel integration |
| atom/model_ops/fused_moe/*.py | New files implementing Mori prepare/finalize, modular kernels, config, and utilities |
| atom/model_ops/base_attention.py | Fixes output dtype when fused RMSNorm and quantization are enabled |
| atom/model_ops/attentions/backends.py | Refactors build method to calculate cu_seqlens_q earlier |
| atom/model_loader/loader.py | Adds initialization call for Mori prepare/finalize after weight loading |
| atom/model_engine/scheduler.py | Adds helper methods for request tracking and next batch info |
| atom/model_engine/model_runner.py | Removes DP preprocessing, adds dummy prefill execution, and improves profiler directory naming |
| atom/model_engine/engine_core_mgr.py | Implements parallel READY signal waiting and broadcast utility commands |
| atom/model_engine/engine_core.py | Major refactoring of DP synchronization with new state syncing and dummy prefill support |
| atom/config.py | Changes data_parallel_base_port to use dynamic port allocation |
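
The config.py row above mentions switching data_parallel_base_port to dynamic port allocation. As a rough illustration only (the helper below is hypothetical and not the PR's code), dynamic allocation usually means asking the OS for a free ephemeral port:

```python
# Hypothetical sketch of dynamic port allocation; not taken from the PR.
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused ephemeral port, then release it
    # and hand the number back to the caller to use as a base port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```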
atom/model_ops/moe.py (Outdated)
```diff
 @property
 def use_all2all_kernels(self):
-    return self.dp_size > 1 and self.use_ep
+    return self.dp_size > 1 and _has_module("mori")
```
Copilot AI · Dec 18, 2025
The property name 'use_all2all_kernels' is misleading since it now checks for Mori module availability rather than all2all kernel usage. Consider renaming to 'use_mori_all2all' or updating the logic to match the original intent.
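
To make the suggested rename concrete, here is a minimal sketch; the `_has_module` body and the class scaffold are illustrative assumptions, since the diff only shows the `_has_module("mori")` check and the `dp_size` attribute:

```python
# Minimal sketch of the rename Copilot suggests. _has_module's implementation
# and the class scaffold are assumptions for illustration, not the PR's code.
import importlib.util


def _has_module(name: str) -> bool:
    # True when an optional dependency such as "mori" is importable.
    return importlib.util.find_spec(name) is not None


class MoELayerSketch:
    def __init__(self, dp_size: int) -> None:
        self.dp_size = dp_size

    @property
    def use_mori_all2all(self) -> bool:
        # Only route through Mori's all2all path when running data-parallel
        # and the Mori library is importable.
        return self.dp_size > 1 and _has_module("mori")
```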
```python
self.intermediate_size = intermediate_size_per_partition_after_pad
self.hidden_size = hidden_size
self.hidden_pad = self.hidden_size - layer.hidden_size
# Update moe.hidden_dim to match the padded hidden size for Mori kernels
```
Copilot AI · Dec 18, 2025
The comment says 'for Mori kernels' but this padding applies regardless of whether Mori is used. The comment should clarify this is general padding behavior needed for the MoE computation.
Suggested change:
```diff
- # Update moe.hidden_dim to match the padded hidden size for Mori kernels
+ # Update moe.hidden_dim to match the padded hidden size used by the MoE computation (including Mori kernels)
```
```python
# Now mori now supported shared expert
if self.shared_experts is None:
    return output
else:
    assert shared_output is not None
    return shared_output, output
```
Copilot AI · Dec 18, 2025
Unreachable code detected. Lines 244-249 are never executed because line 242 returns unconditionally. Either remove the dead code or fix the control flow.
Suggested change:
```diff
- # Now mori now supported shared expert
- if self.shared_experts is None:
-     return output
- else:
-     assert shared_output is not None
-     return shared_output, output
```
Please merge main to trigger a new CI run.
6747b69 to 2940c7f (Compare)
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
```diff
- self.use_chunked = (get_dp_group().world_size > 1) and (
-     not envs.ATOM_ENFORCE_EAGER
- )
+ self.use_chunked = get_dp_group().world_size > 1
```
Copilot AI · Dec 19, 2025
The condition for use_chunked has been simplified to only check dp_group().world_size > 1, removing the check for envs.ATOM_ENFORCE_EAGER. This changes the behavior - previously chunked mode could be disabled via the environment variable even with DP > 1. Consider whether this change is intentional or if the eager mode check should be preserved.
Suggested change:
```diff
- self.use_chunked = get_dp_group().world_size > 1
+ self.use_chunked = get_dp_group().world_size > 1 and not envs.ATOM_ENFORCE_EAGER
```
```python
# if scheduled_batch is None:
#     return False
```
Copilot AI · Dec 19, 2025
The logic has changed to check has_requests() before calling schedule(), but the commented-out code suggests this might not handle all edge cases. The previous pattern checked if scheduled_batch is None after scheduling. Ensure this new pattern properly handles cases where the scheduler has requests but cannot schedule them.
Suggested change:
```diff
- # if scheduled_batch is None:
- #     return False
+ if scheduled_batch is None:
+     return False
```
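
A defensive version of the pattern Copilot describes could look like the sketch below; `has_requests()` and `schedule()` mirror the scheduler methods referenced in this PR, while the `Scheduler`/`Batch` stubs and the `step_once()` wrapper are illustrative stand-ins for the engine-core loop:

```python
# Hedged sketch: only has_requests()/schedule() come from the PR discussion;
# the stub classes and step_once() are illustrative.
from typing import Optional


class Batch:
    pass


class Scheduler:
    def __init__(self) -> None:
        self.waiting: list[Batch] = []

    def has_requests(self) -> bool:
        return bool(self.waiting)

    def schedule(self) -> Optional[Batch]:
        # May return None even when requests exist (e.g. no KV-cache space).
        return self.waiting.pop() if self.waiting else None


def step_once(scheduler: Scheduler) -> bool:
    if not scheduler.has_requests():
        return False
    scheduled_batch = scheduler.schedule()
    # Keep the None check: has_requests() alone does not guarantee that a
    # batch could actually be scheduled this step.
    if scheduled_batch is None:
        return False
    # ... execute scheduled_batch ...
    return True
```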
```python
expert_mask=expert_map,
activation=activation.value,
quant_type=self.quant_type.value,
# per_Tensor not support num_local_tokens so not use mori
```
Copilot AI · Dec 19, 2025
Correct the spelling of 'Tensor' to 'tensor' in the comment to match Python naming conventions.
Suggested change:
```diff
- # per_Tensor not support num_local_tokens so not use mori
+ # per_tensor not support num_local_tokens so not use mori
```
```python
assert False, "Now DBO async is not supported"
return output

# Now mori now supported shared expert
```
Copilot AI · Dec 19, 2025
The comment is ambiguous; reword it to either 'Mori does not support shared expert' or 'Mori now supports shared expert' for clarity.
Suggested change:
```diff
- # Now mori now supported shared expert
+ # Mori does not support shared expert
```
2940c7f to b56a57d (Compare)
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
```python
    if is_fp8 and quant_type is not None
    else torch.bfloat16
)
# mori_dtype = torch.bfloat16
```
Copilot AI · Dec 20, 2025
Commented-out code should be removed if it's not needed. If it's meant as documentation for future work, convert it to a proper TODO comment explaining why bfloat16 might be preferred.
Suggested change:
```diff
- # mori_dtype = torch.bfloat16
```
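
As one way to act on the "convert it to a TODO" advice, the dtype choice could live in a small helper like the sketch below; the helper name and the fp8 dtype (`torch.float8_e4m3fn`) are assumptions, since the quoted diff only shows the `is_fp8 and quant_type is not None ... else torch.bfloat16` branch:

```python
# Illustrative sketch only: the helper name and the fp8 dtype are assumptions;
# just the branch condition and the bfloat16 fallback come from the diff.
import torch


def select_mori_dispatch_dtype(is_fp8: bool, quant_type) -> torch.dtype:
    # TODO: fall back to bfloat16 unconditionally if fp8 dispatch proves
    # problematic for some quantization configs (what the commented-out
    # `mori_dtype = torch.bfloat16` line appeared to hint at).
    if is_fp8 and quant_type is not None:
        return torch.float8_e4m3fn  # assumed fp8 dtype
    return torch.bfloat16
```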
```python
return prepare_finalize


def maybe_make_prepare_finalize(self) -> FusedMoEPrepareAndFinalize | None:
    # if True:
```
Copilot AI · Dec 20, 2025
Commented-out debug code should be removed. The # if True: line serves no purpose and clutters the codebase.
Suggested change:
```diff
- # if True:
```
```python
# Note: init_prepare_finalize should only be called by
# prepare_communication_buffer_for_model.
def init_prepare_finalize(self, layer: torch.nn.Module):
    # print("init_prepare_finalize")
```
Copilot AI · Dec 20, 2025
Debug print statement should be removed from production code.
Suggested change:
```diff
- # print("init_prepare_finalize")
```
e879b09 to de1cc23 (Compare)
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.
f662cf9 to 18b0e37 (Compare)
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
18b0e37 to 5c34247 (Compare)
5c34247 to d79c8d8 (Compare)
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs
from atom.utils import envs, mark_spliting_op
```
Copilot AI · Jan 9, 2026
Duplicate import of envs on lines 55 and 56. Remove the duplicate import from line 56.
Suggested change:
```diff
- from atom.utils import envs, mark_spliting_op
+ from atom.utils import mark_spliting_op
```
```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs
from atom.utils import envs, mark_spliting_op
```
Copilot AI · Jan 9, 2026
The import mark_spliting_op appears unused in this file. Consider removing it if it's not needed elsewhere in the code.
Suggested change:
```diff
- from atom.utils import envs, mark_spliting_op
```
```python
# num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
# padded_scheduled_bs = scheduled_bs + num_pad
```
Copilot AI · Jan 9, 2026
Remove commented-out code on lines 823-824 if it's no longer needed, or add a TODO comment explaining why it's kept.
Suggested change:
```diff
- # num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
- # padded_scheduled_bs = scheduled_bs + num_pad
```
```python
    hidden_states = self.model(input_ids, positions)
    hidden_states = self.model(input_ids, positions)
else:
    graph_bs = context.graph_bs
```
Copilot AI · Jan 9, 2026
Remove commented-out synchronization call or add a comment explaining why it's disabled.
Suggested change:
```diff
  graph_bs = context.graph_bs
+ # NOTE: Explicit CUDA synchronization is not required here in normal execution
+ # because CUDA graph replay and subsequent operations already ensure correct
+ # ordering. This call is kept commented out for potential debugging use only.
```
```python
# self.input_thread = threading.Thread(
#     target=self.process_input_sockets, args=(self.input_address,), daemon=True
# )
# self.input_thread.start()
```
Copilot AI · Jan 9, 2026
Remove large block of commented-out code (lines 118-121) if it's no longer needed, or add a TODO comment explaining the reason for keeping it.
Suggested change:
```diff
- # self.input_thread = threading.Thread(
- #     target=self.process_input_sockets, args=(self.input_address,), daemon=True
- # )
- # self.input_thread.start()
```
Motivation
How to run with DP attention + Mori MoE:
`--enable-dp-attention --enable-expert-parallel`
Technical Details
Test Plan
Test Result
Submission Checklist