support mtp stage 2: support deepseek mtp=1 #35
base: main
Conversation
Pull request overview
This PR implements stage 1 support for multi-token prediction (MTP) by enabling the model engine to handle multiple decode tokens per step. The changes modify the core engine to track and process a variable number of tokens per sequence rather than assuming exactly one token per decode step.
Key changes:
- Modified the postprocessing pipeline to return multiple tokens per sequence (`dict[int, list[int]]` instead of `dict[int, int]`; see the sketch after this list)
- Enhanced block allocation to reserve space for lookahead slots when in deferred output mode
- Updated position and context length calculations to account for previously accepted tokens in deferred output mode
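A minimal sketch of the return-type change described in the first bullet, using illustrative names and values rather than the repository's actual signatures: postprocessing now maps each sequence id to the list of tokens accepted this step instead of a single token.

```python
# Hedged sketch of the dict[int, int] -> dict[int, list[int]] change; values are illustrative.
old_style: dict[int, int] = {0: 42, 1: 7}                # exactly one token per sequence
new_style: dict[int, list[int]] = {0: [42, 43], 1: [7]}  # one or more tokens per sequence

for seq_id, token_ids in new_style.items():
    # Callers now iterate over the accepted tokens and typically inspect only
    # the last one for end-of-sequence handling.
    last_token = token_ids[-1]
    print(seq_id, token_ids, last_token)
```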
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| atom/model_engine/scheduler.py | Updated postprocess to handle multiple tokens per sequence, modified block allocation to reserve lookahead slots, and removed placeholder token logic |
| atom/model_engine/model_runner.py | Changed return types from single token to token lists, added tracking of accepted token counts per sequence |
| atom/model_engine/block_manager.py | Added num_lookahead_slots parameter to allocation methods to reserve space for multiple tokens |
| atom/model_ops/attentions/backends.py | Updated block table preparation to use actual context lengths when in deferred output mode |
| atom/model_ops/attentions/aiter_mla.py | Added logic to calculate actual context lengths and positions based on deferred tokens from previous batch |
| atom/model_ops/attentions/aiter_attention.py | Similar context length calculation updates for standard attention mechanism |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.
```python
)
]
if max_q_len > 1:
    context_lens = np.array(batch.context_lens) - var["mtp_k"]
```
Copilot (AI) · Dec 12, 2025
The computation 'batch.context_lens - var["mtp_k"]' is performed but the result overwrites context_lens which is also used in the else block at line 195. This could be confusing. Consider using a more descriptive variable name like 'adjusted_context_lens' for clarity.
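A minimal, self-contained sketch of the suggested rename (example values, not the repository's code), which keeps the unshifted lengths available for the later else branch:

```python
import numpy as np

# Hedged sketch: give the shifted values their own name instead of overwriting
# context_lens, which a later branch still reads. Values are illustrative.
context_lens = np.array([10, 17, 32])  # per-sequence context lengths
mtp_k = 1                              # number of speculative tokens
adjusted_context_lens = context_lens - mtp_k
print(adjusted_context_lens)           # [ 9 16 31]
```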
atom/model_engine/scheduler.py (Outdated)
```python
num_placeholder = (
    2 * self.mtp_k if is_deferred_out and self.use_spec else
    1 if is_deferred_out else
    self.mtp_k if self.use_spec else
    0
)
```
Copilot (AI) · Dec 12, 2025
The nested ternary expression for num_placeholder is difficult to read. Consider using an if-elif-else structure or extracting this logic into a separate method with a descriptive name for better maintainability.
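One possible shape of the extracted helper, as a hedged sketch that reuses the PR's flag names but is not the actual scheduler code:

```python
# Hedged sketch: one branch per case, so each placeholder count is explicit.
def compute_num_placeholder(is_deferred_out: bool, use_spec: bool, mtp_k: int) -> int:
    if is_deferred_out and use_spec:
        return 2 * mtp_k  # deferred output combined with speculative tokens
    if is_deferred_out:
        return 1          # deferred output only
    if use_spec:
        return mtp_k      # speculative tokens only
    return 0              # neither feature active

assert compute_num_placeholder(True, True, 2) == 4
assert compute_num_placeholder(True, False, 2) == 1
assert compute_num_placeholder(False, True, 2) == 2
assert compute_num_placeholder(False, False, 2) == 0
```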
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.
```python
    runner=None,
):
    self.config = atom_config
    self.speculative_config = self.config.speculative_config
```
Copilot (AI) · Dec 22, 2025
The variable mtp_k is renamed but its meaning remains ambiguous. Based on the context, this represents the number of speculative tokens, not k in the traditional MTP sense. Consider renaming to num_speculative_tokens for clarity, or add a comment explaining the naming convention.
Suggested change:
```diff
 self.speculative_config = self.config.speculative_config
+# Number of speculative tokens used in MTP; kept as `mtp_k` for backward compatibility.
```
```python
# Generate the remaining draft tokens.
draft_token_ids_list = [draft_token_ids]

for _ in range(self.mtp_k - 1):
    pass  # TODO: support multiple num_speculative_tokens

# [batch_size, num_speculative_tokens]
draft_token_ids = torch.stack(draft_token_ids_list, dim=1)
return draft_token_ids
```
Copilot (AI) · Dec 22, 2025
The TODO comment is misleading - the function already handles mtp_k speculative tokens via the loop. Either implement the remaining logic or update the comment to clarify what specific functionality is missing.
Suggested change:
```diff
-# Generate the remaining draft tokens.
-draft_token_ids_list = [draft_token_ids]
-for _ in range(self.mtp_k - 1):
-    pass  # TODO: support multiple num_speculative_tokens
-# [batch_size, num_speculative_tokens]
-draft_token_ids = torch.stack(draft_token_ids_list, dim=1)
-return draft_token_ids
+# Multiple draft tokens (mtp_k > 1) are not yet supported. Fail fast instead of
+# returning an incorrectly shaped tensor or silently misbehaving.
+raise NotImplementedError(
+    f"Multiple speculative tokens (mtp_k={self.mtp_k}) are not supported in EagleSpecDecoder yet."
+)
```
```python
metadata.num_draft_tokens,
metadata.max_spec_len,
metadata.cu_num_draft_tokens,
None,
```
Copilot (AI) · Dec 22, 2025
The draft_probs parameter is always passed as None but still has validation logic and is used in assertions. Either implement probability-based rejection sampling or remove this unused parameter and its associated validation code.
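If probability-based rejection is not implemented, the greedy path can be made explicit instead of threading `draft_probs=None` through validation. A self-contained, hedged sketch of greedy acceptance (not the repository's sampler API):

```python
import torch

# Hedged sketch: accept the longest prefix of draft tokens that matches the
# target model's argmax tokens at the same positions; no draft probabilities needed.
def greedy_accept(
    draft_token_ids: torch.Tensor,    # [batch, k] draft tokens
    target_argmax_ids: torch.Tensor,  # [batch, k] target argmax at the same positions
) -> list[list[int]]:
    accepted: list[list[int]] = []
    for drafts, targets in zip(draft_token_ids.tolist(), target_argmax_ids.tolist()):
        row: list[int] = []
        for d, t in zip(drafts, targets):
            if d != t:
                break        # first mismatch rejects this and all later drafts
            row.append(d)
        accepted.append(row)
    return accepted

drafts = torch.tensor([[5, 9], [3, 4]])
targets = torch.tensor([[5, 8], [3, 4]])
print(greedy_accept(drafts, targets))  # [[5], [3, 4]]
```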
```python
num_placeholder = (
    2 * self.mtp_k if is_deferred_out and self.use_spec else
    1 if is_deferred_out else
    self.mtp_k if self.use_spec else
    0
)
```
Copilot (AI) · Dec 22, 2025
The multi-line conditional expression for num_placeholder is difficult to read and error-prone. Consider using a traditional if-elif-else structure or extracting this logic into a helper method with clear comments explaining each case.
Suggested change:
```diff
-num_placeholder = (
-    2 * self.mtp_k if is_deferred_out and self.use_spec else
-    1 if is_deferred_out else
-    self.mtp_k if self.use_spec else
-    0
-)
+# Determine how many placeholder tokens to use based on deferred output
+# and speculative decoding configuration.
+if is_deferred_out and self.use_spec:
+    num_placeholder = 2 * self.mtp_k
+elif is_deferred_out:
+    num_placeholder = 1
+elif self.use_spec:
+    num_placeholder = self.mtp_k
+else:
+    num_placeholder = 0
```
```diff
 else:
-    if not seq.ignore_eos and token_id == self.eos_token_id:
+    # Check the last token in the list for EOS
+    if token_ids and not seq.ignore_eos and self.eos_token_id in token_ids:
```
Copilot (AI) · Dec 22, 2025
The EOS check now looks for eos_token_id anywhere in the token_ids list, but the comment on line 236 says 'Check the last token in the list for EOS'. This could cause premature sequence termination if EOS appears in draft tokens. The check should verify only the last accepted token: token_ids[-1] == self.eos_token_id.
Suggested change:
```diff
-if token_ids and not seq.ignore_eos and self.eos_token_id in token_ids:
+if token_ids and not seq.ignore_eos and token_ids[-1] == self.eos_token_id:
```
| """for decode: input ids are from prev_sampled_token_ids""" | ||
| locations, is_all_alive = self.get_prev_alive_locations(batch) | ||
| num_deferred_tokens = len(locations) | ||
| alive_seq_indices, num_deferred_tokens, is_all_alive = self.get_prev_alive_locations(batch) |
Copilot (AI) · Dec 22, 2025
The variable alive_seq_indices is misleading - it contains indices into the previous batch, not sequence IDs. Consider renaming to prev_batch_indices or deferred_seq_indices to clarify what these indices reference.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
```diff
 self.speculative_config = self.config.speculative_config
-self.num_speculative_tokens = self.speculative_config.num_speculative_tokens
-self.mtp_k = self.num_speculative_tokens + 1
+self.mtp_k = self.speculative_config.num_speculative_tokens
```
Copilot (AI) · Jan 8, 2026
The variable mtp_k is initialized differently than in the original code. Previously it was num_speculative_tokens + 1, now it's just num_speculative_tokens. Ensure this change is intentional and verify all usages of mtp_k account for this modification, particularly in early exit conditions and loop bounds.
Suggested change:
```diff
-self.mtp_k = self.speculative_config.num_speculative_tokens
+self.mtp_k = self.speculative_config.num_speculative_tokens + 1
```
```python
new_tokens = []
if is_deferred_out:
    seq.token_ids[-1] = token_id
if is_deferred_out or (self.use_spec and self.eos_token_id == seq.token_ids[-1]):
```
Copilot (AI) · Jan 8, 2026
The condition combines two distinct concerns. The second part checking for EOS token during spec decode is unclear. Consider adding a comment explaining why EOS tokens trigger this branch or extracting to a named variable like is_spec_with_eos.
Suggested change:
```diff
-if is_deferred_out or (self.use_spec and self.eos_token_id == seq.token_ids[-1]):
+# During speculative decoding, if the last committed token is EOS, we also
+# need to replace the placeholder tokens with the sampled tokens.
+is_spec_with_eos = self.use_spec and self.eos_token_id == seq.token_ids[-1]
+if is_deferred_out or is_spec_with_eos:
```
```diff
 class tokenIDProcessor:

-    def __init__(self, max_num_batched_tokens: int, device: torch.device):
+    def __init__(self, max_num_batched_tokens: int, device: torch.device, use_spec: bool, num_speculative_tokens):
```
Copilot (AI) · Jan 8, 2026
The parameter num_speculative_tokens lacks a type annotation. Add : int for consistency with other parameters.
Suggested change:
```diff
-def __init__(self, max_num_batched_tokens: int, device: torch.device, use_spec: bool, num_speculative_tokens):
+def __init__(self, max_num_batched_tokens: int, device: torch.device, use_spec: bool, num_speculative_tokens: int):
```
```python
if self.config.speculative_config and get_pp_group().is_last_rank:
    self.drafter = EagleProposer(self.config, self.device, self)
    self.rejection_sampler = RejectionSampler()
num_speculative_tokens = self.drafter.mtp_k if hasattr(self, "drafter") else 0
```
Copilot (AI) · Jan 8, 2026
The initialization order is fragile - self.drafter and self.rejection_sampler are conditionally created, then immediately checked with hasattr(self, "drafter"). Consider extracting the speculative token count logic into the conditional block where drafter is created to make the dependency explicit.
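A simplified, hedged sketch of that restructuring, with dummy classes standing in for EagleProposer and RejectionSampler; the speculative-token count is computed inside the branch that creates the drafter instead of probing with hasattr afterwards:

```python
# Hedged sketch: _Drafter and _RejectionSampler are stand-ins, not the real classes.
class _Drafter:
    mtp_k = 1

class _RejectionSampler:
    pass

class Runner:
    def __init__(self, use_spec: bool):
        if use_spec:
            self.drafter = _Drafter()
            self.rejection_sampler = _RejectionSampler()
            num_speculative_tokens = self.drafter.mtp_k  # dependency is explicit here
        else:
            num_speculative_tokens = 0
        self.num_speculative_tokens = num_speculative_tokens

assert Runner(True).num_speculative_tokens == 1
assert Runner(False).num_speculative_tokens == 0
```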
```python
num_rejected_tokens = [
    n + 1 - len(sampled_token_ids[i]) if n > 0 else 0
    for i, n in enumerate(num_draft_tokens)
]
```
Copilot (AI) · Jan 8, 2026
The list comprehension assumes sampled_token_ids keys match the enumeration indices of num_draft_tokens, but sampled_token_ids is a dictionary keyed by sequence IDs. This will cause a KeyError if the sequence IDs don't align with 0-indexed positions. Access sampled_token_ids using the actual sequence IDs from the batch instead.
Suggested change:
```diff
-num_rejected_tokens = [
-    n + 1 - len(sampled_token_ids[i]) if n > 0 else 0
-    for i, n in enumerate(num_draft_tokens)
-]
+# Use actual sequence IDs from the batch to index sampled_token_ids
+sequence_ids = [seq.seq_id for seq in batch.sequences]
+num_rejected_tokens: list[int] = []
+for seq_id, n in zip(sequence_ids, num_draft_tokens):
+    if n > 0:
+        sampled = sampled_token_ids.get(seq_id, [])
+        num_rejected_tokens.append(n + 1 - len(sampled))
+    else:
+        num_rejected_tokens.append(0)
```
```python
]
if max_q_len > 1:
    context_lens = np.array(batch.context_lens) - var["mtp_k"]
    positions = [pos for le in context_lens for pos in (le - var["mtp_k"], le)]
```
Copilot (AI) · Jan 8, 2026
The variable name le is ambiguous. Use a more descriptive name like context_len to improve readability.
Suggested change:
```diff
-positions = [pos for le in context_lens for pos in (le - var["mtp_k"], le)]
+positions = [
+    pos
+    for context_len in context_lens
+    for pos in (context_len - var["mtp_k"], context_len)
+]
```
```python
def may_append(self, seq: Sequence):
    block_table = seq.block_table
    last_block = self.blocks[block_table[-1]]
    if len(seq) % self.block_size == 1 or self.block_size == 1:
```
Copilot (AI) · Jan 8, 2026
The calculation (len(seq) + seq.block_size - 1) // seq.block_size is the standard ceiling division pattern but could be more explicit. Consider adding a comment explaining this calculates the required number of blocks, or extract to a variable like total_blocks_needed.
Suggested change:
```diff
 if len(seq) % self.block_size == 1 or self.block_size == 1:
+    # Compute the total number of blocks required to store the current sequence (ceiling division).
```
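For reference, the ceiling-division pattern the reviewer mentions computes how many fixed-size blocks a sequence needs; a small, self-contained illustration (not the repository's code):

```python
# Hedged illustration of ceiling division for block counts.
def total_blocks_needed(seq_len: int, block_size: int) -> int:
    return (seq_len + block_size - 1) // block_size

assert total_blocks_needed(16, 16) == 1   # exactly one full block
assert total_blocks_needed(17, 16) == 2   # one extra token needs a second block
```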
```python
    self.num_tokens += 1

def __repr__(self):
    return f"Sequence ({self.id=}, {self.token_ids=}, {self.spec_token_ids})"
```
Copilot (AI) · Jan 8, 2026
The __repr__ method uses f-string debug format for id and token_ids (with =) but not for spec_token_ids. For consistency, either use {self.spec_token_ids=} or remove the = from the other fields.
| return f"Sequence ({self.id=}, {self.token_ids=}, {self.spec_token_ids})" | |
| return f"Sequence ({self.id=}, {self.token_ids=}, {self.spec_token_ids=})" |
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist