
fix: cherry-pick combined projection fixes (#1324, #1357) into r0.2.1 #1388

Open
HuiyingLi wants to merge 2 commits into r0.2.1 from cherry-pick-1324-1357-r0.2.1

@HuiyingLi (Contributor)

Summary

These fixes change the fused QKV and gate_up weight layouts from naive concatenation to KV-head-grouped interleaving (QKV) and row interleaving (gate_up), so that under ColwiseParallel sharding each TP rank receives complete head groups for QKV and matching gate/up rows for gate_up.
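A minimal, torch-free sketch of why the layout matters, assuming a toy config (8 query heads, 2 KV heads, head_dim 4, intermediate size 6, TP=2) and single-row interleaving for gate_up (the actual patch may interleave at a different granularity). Weight rows are modeled as labels rather than values, and every function name here is hypothetical, not taken from the codebase. `colwise_shards` mimics how ColwiseParallel splits an `nn.Linear` weight of shape `(out_features, in_features)` into contiguous row blocks along dim 0.

```python
from typing import List, Tuple

# Each weight row is a label (projection, head_or_row_index, offset) instead of real
# values, so the layout logic can be checked without torch. All names are hypothetical.
Row = Tuple[str, int, int]


def naive_qkv(num_heads: int, num_kv_heads: int, head_dim: int) -> List[Row]:
    # Row order of concatenating W_q, W_k, W_v along dim 0.
    rows: List[Row] = []
    for proj, n in (("q", num_heads), ("k", num_kv_heads), ("v", num_kv_heads)):
        rows += [(proj, h, d) for h in range(n) for d in range(head_dim)]
    return rows


def grouped_qkv(num_heads: int, num_kv_heads: int, head_dim: int) -> List[Row]:
    # KV-head-grouped order: for each KV head g, its query heads, then K_g, then V_g.
    q_per_kv = num_heads // num_kv_heads
    rows: List[Row] = []
    for g in range(num_kv_heads):
        for h in range(g * q_per_kv, (g + 1) * q_per_kv):
            rows += [("q", h, d) for d in range(head_dim)]
        rows += [("k", g, d) for d in range(head_dim)]
        rows += [("v", g, d) for d in range(head_dim)]
    return rows


def naive_gate_up(intermediate: int) -> List[Row]:
    # Row order of concatenating W_gate and W_up along dim 0.
    return [("gate", i, 0) for i in range(intermediate)] + [("up", i, 0) for i in range(intermediate)]


def interleaved_gate_up(intermediate: int) -> List[Row]:
    # Row interleaving (single-row granularity assumed): gate_0, up_0, gate_1, up_1, ...
    return [(p, i, 0) for i in range(intermediate) for p in ("gate", "up")]


def colwise_shards(rows: List[Row], tp: int) -> List[List[Row]]:
    # ColwiseParallel splits an nn.Linear weight (out_features, in_features)
    # along dim 0, i.e. each rank gets one contiguous block of rows.
    size = len(rows) // tp
    return [rows[r * size:(r + 1) * size] for r in range(tp)]


def qkv_shard_ok(shard: List[Row], num_heads: int, num_kv_heads: int) -> bool:
    # Usable only if the shard holds whole KV groups (checked at head granularity):
    # matching K and V heads, plus exactly the query heads that attend through them.
    q_per_kv = num_heads // num_kv_heads
    q = {h for p, h, _ in shard if p == "q"}
    k = {h for p, h, _ in shard if p == "k"}
    v = {h for p, h, _ in shard if p == "v"}
    return k == v and q == {g * q_per_kv + i for g in k for i in range(q_per_kv)}


def gate_up_shard_ok(shard: List[Row]) -> bool:
    # Usable only if every gate row has its matching up row on the same rank.
    return {i for p, i, _ in shard if p == "gate"} == {i for p, i, _ in shard if p == "up"}


if __name__ == "__main__":
    H, KV, D, I, TP = 8, 2, 4, 6, 2  # toy config, not a real model
    for name, rows in (("naive", naive_qkv(H, KV, D)), ("grouped", grouped_qkv(H, KV, D))):
        print("qkv", name, [qkv_shard_ok(s, H, KV) for s in colwise_shards(rows, TP)])
    for name, rows in (("naive", naive_gate_up(I)), ("interleaved", interleaved_gate_up(I))):
        print("gate_up", name, [gate_up_shard_ok(s) for s in colwise_shards(rows, TP)])
```

Run as a script, it prints `[False, False]` for both naive layouts and `[True, True]` for both fixed ones: with naive concatenation, rank 0 of the QKV split gets query heads 0 to 5 and no K or V rows, and the gate_up split puts all gate rows on rank 0 and all up rows on rank 1. The grouped layout only divides cleanly when `num_kv_heads` is a multiple of the TP size, and with real tensors the forward pass has to split the fused output in the same order (for single-row interleaving, e.g. `out[..., 0::2]` for gate and `out[..., 1::2]` for up).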

Test plan

Status: NOT YET VERIFIED; must be validated before merge.

  • Verify Qwen2.5-7B SFT with TP=2 produces correct initial loss (~6) instead of NaN
  • Verify Llama3.2-3B SFT with TP=2

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: combined projection

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* lint

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* address comment

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* add tp output parity tests

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ort (#1357)

* fix: FSDP pre-shard combined projections on dim 1 for Qwen2.5-7B support

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* revert recipe change

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* lint

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
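For #1357, the commit title only states that FSDP now pre-shards the combined projections on dim 1. A minimal sketch of what that changes structurally, on a hypothetical TP-local fused QKV block (one KV group of 4 query heads plus K and V, head_dim 4, so 24 rows; 16 input features; 4 FSDP ranks): a Shard(0)-style split along output rows can cut through a head, while a Shard(1)-style split along input features leaves every rank with all rows, in their grouped order, just narrower. `even_shard` and the sizes are illustrative assumptions; the sketch does not reproduce FSDP's own padding or placement logic.

```python
from typing import List, Tuple

Block = Tuple[range, range]  # (rows, columns) of a 2-D weight held by one rank


def even_shard(rows: int, cols: int, world: int, dim: int) -> List[Block]:
    # Split a rows x cols weight evenly across `world` ranks along `dim`,
    # mimicking a Shard(0) or Shard(1) placement. No padding: sizes must divide.
    extent = rows if dim == 0 else cols
    assert extent % world == 0, "sketch assumes an even split"
    size = extent // world
    blocks: List[Block] = []
    for r in range(world):
        part = range(r * size, (r + 1) * size)
        blocks.append((part, range(cols)) if dim == 0 else (range(rows), part))
    return blocks


if __name__ == "__main__":
    # Hypothetical TP-local block: 6 head-sized row slots (4 query heads, K, V).
    ROWS, COLS, DP, HEAD_DIM = 24, 16, 4, 4
    for dim in (0, 1):
        for rank, (r, c) in enumerate(even_shard(ROWS, COLS, DP, dim)):
            # Which head-sized row slots this rank touches.
            slots = sorted({i // HEAD_DIM for i in r})
            print(f"Shard({dim}) rank {rank}: {len(r)}x{len(c)}, head slots {slots}")
```

Under Shard(0), rank 0 holds rows 0 to 5, i.e. all of the first query head and half of the second; under Shard(1), every rank holds all 24 rows with 4 of the 16 columns.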
@copy-pr-bot (bot) commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi (Contributor, Author)

Hi @ZhiyuLi-Nvidia, I cherry-picked these two commits, but the TP=2 cases still show high loss. Do you know if anything else is missing? Thanks!


Labels

None yet

Projects

None yet


2 participants