Conversation

@jianzs (Collaborator) commented Nov 18, 2025

What this PR does / why we need it?

Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix makes the dummy run call compute_logits the same number of times as a real run, matching num_speculative_tokens.
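
As a rough illustration (a minimal sketch with assumed names such as dummy_compute_logits, not the actual vLLM Ascend code): with lm head tensor parallelism, each compute_logits call runs a collective across the TP ranks, so every rank must issue the same number of calls or the collective blocks.

```python
# Minimal sketch of the intended dummy-run behavior (names assumed).
def dummy_run(self, hidden_states, dummy_compute_logits):
    # A real speculative-decoding pass calls compute_logits once per
    # draft token. With lm-head TP > 1 each call is a collective, so
    # the dummy run must replay the same number of calls on every
    # rank to avoid a hang.
    for _ in range(self.num_speculative_tokens):
        hidden_states = self.model(hidden_states)
        dummy_compute_logits(hidden_states)
```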

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to fix an issue with speculative decoding (MTP) when tensor parallelism is used on the language model head. The core of the fix is to ensure the dummy run correctly simulates the multiple compute_logits calls that occur in a real run. While the fix is correctly applied for MtpProposer, it seems to be incomplete for EagleProposer, which could lead to the same issue in that scenario. Additionally, a refactoring in model_runner_v1.py appears to have introduced an AttributeError by calling a non-existent method on the drafter object. I've provided critical comments and suggestions for both issues.

```python
        hidden_states[dummy_indices])

def dummy_drafter_compute_logits(hidden_states):
    return self.drafter.compute_logits(
```
critical

The dummy_drafter_compute_logits function calls self.drafter.compute_logits, but the compute_logits method is on the model attribute of the drafter object, not on the drafter itself. This will result in an AttributeError. The call should be self.drafter.model.compute_logits.

Suggested change

```diff
-    return self.drafter.compute_logits(
+    return self.drafter.model.compute_logits(
```
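
Applied, the helper would read roughly as follows (a sketch: the original quote truncates the argument list, so the hidden_states[dummy_indices] argument here is an assumption mirroring the preceding call in the snippet):

```python
def dummy_drafter_compute_logits(hidden_states):
    # compute_logits lives on the drafter's wrapped model, not on the
    # drafter itself. The argument is assumed from the truncated quote.
    return self.drafter.model.compute_logits(hidden_states[dummy_indices])
```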

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@whx-sjtu (Collaborator) left a comment

We first fixed the hanging issue of running MTP=1 with lm head TP in PR #3915. This PR refactors that fix to run dummy_compute_logits in the drafter's dummy_run and further fixes the MTP > 1 scenario. LGTM.
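
Schematically, the refactor looks something like this (signatures and names assumed for illustration; the actual model_runner_v1.py code may differ):

```python
# Hypothetical model-runner side after the refactor (names assumed).
def _dummy_run(self):
    hidden_states = self.model(self.dummy_input)
    self.model.compute_logits(hidden_states)  # target model: single call

    if self.drafter is not None:
        # The drafter now issues the dummy compute_logits calls itself,
        # once per speculative step, which also covers MTP > 1.
        self.drafter.dummy_run(
            hidden_states,
            dummy_compute_logits=lambda h: self.drafter.model.compute_logits(h),
        )
```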

@github-actions

This pull request has conflicts; please resolve those before we can evaluate the pull request.

@zouyida2052 (Contributor)
I've tested it on DeepSeek and it proves useful; please make CI happy.

@jianzs added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 21, 2025
