Fix DP issues in benchmark and support Mori in Moe #72
base: main
Conversation
Pull request overview
This PR addresses Data Parallel (DP) issues in benchmark execution and adds support for the Mori library in Mixture of Experts (MoE) implementations. The changes improve distributed computing functionality and extend MoE capabilities.
- Removes problematic DP metadata initialization in forward context
- Integrates Mori library for efficient MoE communication across DP ranks
- Enhances DP synchronization logic in engine core to prevent deadlocks
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| atom/utils/forward_context.py | Comments out DP metadata creation to fix benchmark issues |
| atom/utils/dbo/ubatching.py | Adds placeholder function returning False since DBO is not supported |
| atom/model_ops/topK.py | Adds Mori module detection and disables shared expert fusion when DP size > 1 |
| atom/model_ops/moe.py | Major refactoring to support Mori kernels with new base class methods and modular kernel integration |
| atom/model_ops/fused_moe/*.py | New files implementing Mori prepare/finalize, modular kernels, config, and utilities |
| atom/model_ops/base_attention.py | Fixes output dtype when fused RMSNorm and quantization are enabled |
| atom/model_ops/attentions/backends.py | Refactors build method to calculate cu_seqlens_q earlier |
| atom/model_loader/loader.py | Adds initialization call for Mori prepare/finalize after weight loading |
| atom/model_engine/scheduler.py | Adds helper methods for request tracking and next batch info |
| atom/model_engine/model_runner.py | Removes DP preprocessing, adds dummy prefill execution, and improves profiler directory naming |
| atom/model_engine/engine_core_mgr.py | Implements parallel READY signal waiting and broadcast utility commands |
| atom/model_engine/engine_core.py | Major refactoring of DP synchronization with new state syncing and dummy prefill support |
| atom/config.py | Changes data_parallel_base_port to use dynamic port allocation |
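
The config.py row above mentions switching data_parallel_base_port to dynamic port allocation. As a rough illustration only (the helper below is hypothetical and not the PR's code), dynamic allocation usually means asking the OS for a free ephemeral port:

```python
# Hypothetical sketch of dynamic port allocation; not taken from the PR.
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused ephemeral port, then release it
    # and hand the number back to the caller to use as a base port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```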
atom/model_ops/moe.py (Outdated)
```diff
 @property
 def use_all2all_kernels(self):
-    return self.dp_size > 1 and self.use_ep
+    return self.dp_size > 1 and _has_module("mori")
```
Copilot AI · Dec 18, 2025
The property name 'use_all2all_kernels' is misleading since it now checks for Mori module availability rather than all2all kernel usage. Consider renaming to 'use_mori_all2all' or updating the logic to match the original intent.
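
To make the suggested rename concrete, here is a minimal sketch; the `_has_module` body and the class scaffold are illustrative assumptions, since the diff only shows the `_has_module("mori")` check and the `dp_size` attribute:

```python
# Minimal sketch of the rename Copilot suggests. _has_module's implementation
# and the class scaffold are assumptions for illustration, not the PR's code.
import importlib.util


def _has_module(name: str) -> bool:
    # True when an optional dependency such as "mori" is importable.
    return importlib.util.find_spec(name) is not None


class MoELayerSketch:
    def __init__(self, dp_size: int) -> None:
        self.dp_size = dp_size

    @property
    def use_mori_all2all(self) -> bool:
        # Only route through Mori's all2all path when running data-parallel
        # and the Mori library is importable.
        return self.dp_size > 1 and _has_module("mori")
```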
```python
self.intermediate_size = intermediate_size_per_partition_after_pad
self.hidden_size = hidden_size
self.hidden_pad = self.hidden_size - layer.hidden_size
# Update moe.hidden_dim to match the padded hidden size for Mori kernels
```
Copilot AI · Dec 18, 2025
The comment says 'for Mori kernels' but this padding applies regardless of whether Mori is used. The comment should clarify this is general padding behavior needed for the MoE computation.
Suggested change:
```diff
- # Update moe.hidden_dim to match the padded hidden size for Mori kernels
+ # Update moe.hidden_dim to match the padded hidden size used by the MoE computation (including Mori kernels)
```
```python
# Now mori now supported shared expert
if self.shared_experts is None:
    return output
else:
    assert shared_output is not None
    return shared_output, output
```
Copilot AI · Dec 18, 2025
Unreachable code detected. Lines 244-249 are never executed because line 242 returns unconditionally. Either remove the dead code or fix the control flow.
Suggested change:
```diff
- # Now mori now supported shared expert
- if self.shared_experts is None:
-     return output
- else:
-     assert shared_output is not None
-     return shared_output, output
```
Please merge main to trigger a new CI run.
6747b69 to 2940c7f (Compare)
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
```diff
- self.use_chunked = (get_dp_group().world_size > 1) and (
-     not envs.ATOM_ENFORCE_EAGER
- )
+ self.use_chunked = get_dp_group().world_size > 1
```
Copilot AI · Dec 19, 2025
The condition for use_chunked has been simplified to only check dp_group().world_size > 1, removing the check for envs.ATOM_ENFORCE_EAGER. This changes the behavior - previously chunked mode could be disabled via the environment variable even with DP > 1. Consider whether this change is intentional or if the eager mode check should be preserved.
Suggested change:
```diff
- self.use_chunked = get_dp_group().world_size > 1
+ self.use_chunked = get_dp_group().world_size > 1 and not envs.ATOM_ENFORCE_EAGER
```
```python
# if scheduled_batch is None:
#     return False
```
Copilot AI · Dec 19, 2025
The logic has changed to check has_requests() before calling schedule(), but the commented-out code suggests this might not handle all edge cases. The previous pattern checked if scheduled_batch is None after scheduling. Ensure this new pattern properly handles cases where the scheduler has requests but cannot schedule them.
Suggested change:
```diff
- # if scheduled_batch is None:
- #     return False
+ if scheduled_batch is None:
+     return False
```
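
A defensive version of the pattern Copilot describes could look like the sketch below; `has_requests()` and `schedule()` mirror the scheduler methods referenced in this PR, while the `Scheduler`/`Batch` stubs and the `step_once()` wrapper are illustrative stand-ins for the engine-core loop:

```python
# Hedged sketch: only has_requests()/schedule() come from the PR discussion;
# the stub classes and step_once() are illustrative.
from typing import Optional


class Batch:
    pass


class Scheduler:
    def __init__(self) -> None:
        self.waiting: list[Batch] = []

    def has_requests(self) -> bool:
        return bool(self.waiting)

    def schedule(self) -> Optional[Batch]:
        # May return None even when requests exist (e.g. no KV-cache space).
        return self.waiting.pop() if self.waiting else None


def step_once(scheduler: Scheduler) -> bool:
    if not scheduler.has_requests():
        return False
    scheduled_batch = scheduler.schedule()
    # Keep the None check: has_requests() alone does not guarantee that a
    # batch could actually be scheduled this step.
    if scheduled_batch is None:
        return False
    # ... execute scheduled_batch ...
    return True
```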
```python
expert_mask=expert_map,
activation=activation.value,
quant_type=self.quant_type.value,
# per_Tensor not support num_local_tokens so not use mori
```
Copilot AI · Dec 19, 2025
Correct the spelling of 'Tensor' to 'tensor' in the comment to match Python naming conventions.
Suggested change:
```diff
- # per_Tensor not support num_local_tokens so not use mori
+ # per_tensor not support num_local_tokens so not use mori
```
```python
assert False, "Now DBO async is not supported"
return output

# Now mori now supported shared expert
```
Copilot AI · Dec 19, 2025
The comment is ambiguous; reword it to either 'Mori does not support shared expert' or 'Mori now supports shared expert' for clarity.
Suggested change:
```diff
- # Now mori now supported shared expert
+ # Mori does not support shared expert
```
2940c7f to b56a57d (Compare)
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
```python
    if is_fp8 and quant_type is not None
    else torch.bfloat16
)
# mori_dtype = torch.bfloat16
```
Copilot AI · Dec 20, 2025
Commented-out code should be removed if it's not needed. If it's meant as documentation for future work, convert it to a proper TODO comment explaining why bfloat16 might be preferred.
Suggested change:
```diff
- # mori_dtype = torch.bfloat16
```
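
As one way to act on the "convert it to a TODO" advice, the dtype choice could live in a small helper like the sketch below; the helper name and the fp8 dtype (`torch.float8_e4m3fn`) are assumptions, since the quoted diff only shows the `is_fp8 and quant_type is not None ... else torch.bfloat16` branch:

```python
# Illustrative sketch only: the helper name and the fp8 dtype are assumptions;
# just the branch condition and the bfloat16 fallback come from the diff.
import torch


def select_mori_dispatch_dtype(is_fp8: bool, quant_type) -> torch.dtype:
    # TODO: fall back to bfloat16 unconditionally if fp8 dispatch proves
    # problematic for some quantization configs (what the commented-out
    # `mori_dtype = torch.bfloat16` line appeared to hint at).
    if is_fp8 and quant_type is not None:
        return torch.float8_e4m3fn  # assumed fp8 dtype
    return torch.bfloat16
```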
```python
return prepare_finalize


def maybe_make_prepare_finalize(self) -> FusedMoEPrepareAndFinalize | None:
    # if True:
```
Copilot AI · Dec 20, 2025
Commented-out debug code should be removed. The # if True: line serves no purpose and clutters the codebase.
Suggested change:
```diff
- # if True:
```
```python
# Note: init_prepare_finalize should only be called by
# prepare_communication_buffer_for_model.
def init_prepare_finalize(self, layer: torch.nn.Module):
    # print("init_prepare_finalize")
```
Copilot AI · Dec 20, 2025
Debug print statement should be removed from production code.
Suggested change:
```diff
- # print("init_prepare_finalize")
```
e879b09 to de1cc23 (Compare)
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.
f662cf9 to 18b0e37 (Compare)
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
18b0e37 to 5c34247 (Compare)
5c34247 to d79c8d8 (Compare)
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs
from atom.utils import envs, mark_spliting_op
```
Copilot AI · Jan 9, 2026
Duplicate import of envs on lines 55 and 56. Remove the duplicate import from line 56.
Suggested change:
```diff
- from atom.utils import envs, mark_spliting_op
+ from atom.utils import mark_spliting_op
```
```python
from aiter.jit.utils.chip_info import get_gfx
from atom.utils import envs
from atom.utils import envs, mark_spliting_op
```
Copilot AI · Jan 9, 2026
The import mark_spliting_op appears unused in this file. Consider removing it if it's not needed elsewhere in the code.
Suggested change:
```diff
- from atom.utils import envs, mark_spliting_op
```
```python
# num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
# padded_scheduled_bs = scheduled_bs + num_pad
```
Copilot AI · Jan 9, 2026
Remove commented-out code on lines 823-824 if it's no longer needed, or add a TODO comment explaining why it's kept.
Suggested change:
```diff
- # num_pad, num_tokens_across_dp = self.get_dp_padding(scheduled_bs)
- # padded_scheduled_bs = scheduled_bs + num_pad
```
```python
    hidden_states = self.model(input_ids, positions)
    hidden_states = self.model(input_ids, positions)
else:
    graph_bs = context.graph_bs
```
Copilot AI · Jan 9, 2026
Remove commented-out synchronization call or add a comment explaining why it's disabled.
Suggested change:
```diff
  graph_bs = context.graph_bs
+ # NOTE: Explicit CUDA synchronization is not required here in normal execution
+ # because CUDA graph replay and subsequent operations already ensure correct
+ # ordering. This call is kept commented out for potential debugging use only.
```
```python
# self.input_thread = threading.Thread(
#     target=self.process_input_sockets, args=(self.input_address,), daemon=True
# )
# self.input_thread.start()
```
Copilot AI · Jan 9, 2026
Remove large block of commented-out code (lines 118-121) if it's no longer needed, or add a TODO comment explaining the reason for keeping it.
Suggested change:
```diff
- # self.input_thread = threading.Thread(
- #     target=self.process_input_sockets, args=(self.input_address,), daemon=True
- # )
- # self.input_thread.start()
```
Motivation
How to run with DP attention + Mori MoE:
`--enable-dp-attention --enable-expert-parallel`
Technical Details
Test Plan
Test Result
Submission Checklist