@windreamer (Collaborator) commented Nov 28, 2025

Motivation

When serving models with tensor parallelism (--tp >= 2), enabling guided decoding (e.g., response_format: {"type": "json_schema"}) causes segmentation faults.

Root Cause: The guided decoding state management layers (GuidedDecodeMaskLayer and GuidedDecodeUpdateLayer) were being instantiated on all tensor parallelism (TP) ranks. Each rank independently modified shared decoding state, causing race conditions and memory corruption. In distributed inference, only rank 0 should orchestrate guided decoding logic while other ranks perform parallel computation.

Modification

Eliminate shared mutable state by allocating an independent GrammarMatcher instance per TP rank. Enhance the decoding pipeline with rank awareness so each thread operates on its own isolated matcher, ensuring thread safety while maintaining deterministic behavior across ranks.

Fixes #4152

@windreamer windreamer self-assigned this Nov 28, 2025
@windreamer windreamer marked this pull request as ready for review November 28, 2025 07:37
@irexyc (Collaborator) commented Nov 28, 2025

The sampled tokens are not broadcast from tp_rank 0, so all ranks must run the same sampling process to ensure the next token is identical on every rank.

I think the problem may be that the state of GrammarMatcher cannot be shared by all ranks. Currently, all ranks share the same std::shared_ptr<xgrammar::GrammarMatcher> in the request.

@windreamer (Collaborator, Author)

OK, so we need to copy the GrammarMatcher instead?

@windreamer windreamer marked this pull request as draft November 28, 2025 08:00
@windreamer (Collaborator, Author)

The sampled tokens are not broadcast from tp_rank 0, so all ranks must run the same sampling process to ensure the next token is identical on every rank.

I think the problem may be that the state of GrammarMatcher cannot be shared by all ranks. Currently, all ranks share the same std::shared_ptr<xgrammar::GrammarMatcher> in the request.

I believe a quick fix would be making GuidedDecodeUpdateLayer execute only on rank 0, since that is the only place we modify the GrammarMatcher. But I have no idea how to ensure GuidedDecodeUpdateLayer::Forward is called only after all ranks finish their sampling.

@irexyc (Collaborator) commented Nov 28, 2025

I believe a quick fix would be making GuidedDecodeUpdateLayer execute only on rank 0.

All ranks should have the same next token, and the next token is computed by dynamic decoding, so we must make sure the dynamic decoding process is identical on all ranks.

I think we can construct one matcher per rank (n_ranks matchers) here, like r->matchers = ... https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/engine/model_request.cc#L131C9-L131C19

and choose the corresponding matcher here, like matchers_.push_back(r->matchers[rank_]);
https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/layers/sampling_layers/GuidedDecodeMaskLayer.cc#L36
https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/layers/sampling_layers/GuidedDecodeUpdateLayer.cc#L32

@windreamer windreamer force-pushed the fix_guided_decoding_tp branch from 8ada1ea to e7a7055 on November 28, 2025 10:05
@windreamer windreamer marked this pull request as ready for review November 28, 2025 10:11
@windreamer windreamer force-pushed the fix_guided_decoding_tp branch from e7a7055 to adb0148 on November 28, 2025 10:27
@windreamer windreamer requested a review from lvhan028 November 29, 2025 03:22
Successfully merging this pull request may close this issue: [Bug] Request cause core dump with --tp >= 2.