penfever commented Dec 1, 2025

Purpose

Fix vLLM v1 engine to properly handle pipeline parallelism (PP) by correctly managing KV cache groups when layers are distributed across different ranks.

Currently, when using PP > 1, the KV cache configuration is created globally with all layer names, but each PP rank only has a subset of layers. This causes failures when:

  1. init_attn_backend tries to access layers that don't exist on the current rank
  2. _allocate_kv_cache creates tensors for layers that aren't local
  3. build_attn_metadata attempts to build metadata for non-existent layers
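To make the mismatch concrete, here is a minimal sketch (not vLLM code; the function name and the even-split assumption are hypothetical) of how a rank's local layer set can diverge from the global KV cache configuration:

```python
def layers_for_rank(all_layers: list[str], pp_rank: int, pp_size: int) -> set[str]:
    """Evenly partition layer names across pipeline-parallel ranks (illustrative only)."""
    per_rank = len(all_layers) // pp_size
    start = pp_rank * per_rank
    end = start + per_rank if pp_rank < pp_size - 1 else len(all_layers)
    return set(all_layers[start:end])

all_layers = [f"model.layers.{i}.self_attn" for i in range(8)]
local = layers_for_rank(all_layers, pp_rank=1, pp_size=2)
# Layers that appear in the global config but not on this rank are exactly
# the ones that previously tripped up init_attn_backend and _allocate_kv_cache:
missing = set(all_layers) - local
```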

This PR introduces:

  • resolve_layers_from_vllm_config() - returns both the resolved layers and the names of any missing layers, for better visibility
  • _prune_kv_cache_group_layers() - prunes KV cache groups so they only include layers local to the current rank
  • Proper handling of None attention metadata builders for empty groups
  • Debug logging for skipped remote layers
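The pruning step can be sketched roughly as follows. This is an illustrative approximation, not the actual vLLM implementation; the class and function names here are hypothetical stand-ins:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class KVCacheGroup:
    layer_names: tuple[str, ...]

def prune_kv_cache_group_layers(groups, local_layers):
    """Keep only layers present on this rank; empty groups become None so
    downstream code can skip building attention metadata for them."""
    pruned = []
    for group in groups:
        kept = tuple(n for n in group.layer_names if n in local_layers)
        skipped = set(group.layer_names) - set(kept)
        if skipped:
            # Stands in for the debug logging of skipped remote layers.
            print(f"Skipping remote layers: {sorted(skipped)}")
        pruned.append(replace(group, layer_names=kept) if kept else None)
    return pruned
```

A group whose layers all live on another rank maps to None, which is why the attention metadata builders must tolerate None for empty groups.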

Test Plan

Run the new unit test

pytest tests/config/test_vllm_layers.py -v

Run with pipeline parallelism (requires multi-GPU)

vllm serve --pipeline-parallel-size 2

Test Result

  • Pre-commit checks pass (ruff, mypy, typos, clang-format)
  • Unit tests pass for layer resolution logic

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as the test commands to run.
  • The test results, such as a before/after comparison or end-to-end results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Benjamin Feuer added 12 commits December 1, 2025 10:03
Signed-off-by: Benjamin Feuer <[email protected]>