
Chunking the sequence across multiple distributed GPUs: get_sp_group().all_gather() returns a different-length sequence #471

Open
yinian-lw opened this issue Mar 6, 2025 · 1 comment

Comments


yinian-lw commented Mar 6, 2025

USP is used in my experiment, with ulysses_size=8 and the CogVideoX model:

if sp_size > 1:
    # Context Parallel: shard the sequence across the SP group
    # (bs, head_cnt, seq_len, head_size) -> (bs, seq_len/P, head_cnt, head_size)
    query = torch.chunk(query.transpose(1, 2), sp_size, dim=1)[sp_rank]
    key = torch.chunk(key.transpose(1, 2), sp_size, dim=1)[sp_rank]
    value = torch.chunk(value.transpose(1, 2), sp_size, dim=1)[sp_rank]

    hidden_states = xFuserLongContextAttention()(
        None,
        query=query,
        key=key,
        value=value,
    )

    import os
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    if local_rank == sp_rank:
        print(f"local_rank:{local_rank}, hidden_states.shape: {hidden_states.shape}")
        print(f"local_rank:{local_rank}, query.shape: {query.shape}")

    # (bs, seq_len/P, head_cnt, head_size) -> (bs, seq_len, dim)
    hidden_states = get_sp_group().all_gather(hidden_states.contiguous(), dim=1)
    if local_rank == 0:
        print(f"after gather, hidden_states.shape: {hidden_states.shape}")
    hidden_states = hidden_states.reshape(batch_size, -1, attn.heads * head_dim)

Before chunking, the sequence length was 57826 (226 text + 57600 image tokens).
After xFuserLongContextAttention() and get_sp_group().all_gather(), the sequence length becomes 57832 (226 text + 57606 image tokens).

Printed hidden_states and query shapes per rank:

local_rank:0, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:6, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:2, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:7, hidden_states.shape: torch.Size([1, 7223, 48, 64])
local_rank:5, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:4, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:1, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:3, hidden_states.shape: torch.Size([1, 7229, 48, 64])

local_rank:0, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:6, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:7, query.shape: torch.Size([1, 7223, 48, 64])
local_rank:2, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:5, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:4, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:1, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:3, query.shape: torch.Size([1, 7229, 48, 64])

after gather, hidden_states.shape: torch.Size([1, 57832, 48, 64])
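
For what it's worth, the numbers are consistent with torch.chunk splitting 57826 tokens unevenly: ceil(57826 / 8) = 7229, so seven ranks get 7229 tokens and the last rank gets 7223, and the gathered length 57832 equals 8 × 7229. A minimal sketch of that arithmetic (my own reproduction, assuming only PyTorch and a dummy tensor in place of the real query):

import torch

seq_len, sp_size = 57826, 8          # 226 text + 57600 image tokens, ulysses_size=8
x = torch.zeros(1, seq_len, 1, 1)    # light stand-in for (bs, seq_len, head_cnt, head_size)

chunks = torch.chunk(x, sp_size, dim=1)
print([c.shape[1] for c in chunks])  # [7229, 7229, 7229, 7229, 7229, 7229, 7229, 7223]
print(sp_size * chunks[0].shape[1])  # 57832 -- length if all 8 shards are gathered at equal size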
@feifeibear (Collaborator) commented:
https://github.com/xdit-project/xDiT/blob/main/docs/performance/cogvideo.md

You cannot use ulysses_degree=8 for CogVideoX. Please refer to the parallel configurations in the report above.
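
Just to illustrate the underlying constraint (my own sketch, not from the linked doc and not part of the xDiT API): the sequence length has to split evenly across the Ulysses ranks, so you would either pick a ulysses_degree that divides the token count or pad the sequence before chunking and strip the padding after the gather. A hypothetical helper in plain PyTorch:

import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, sp_size: int, dim: int = 1):
    """Right-pad x along `dim` so x.shape[dim] is divisible by sp_size; also return the pad length."""
    pad_len = (-x.shape[dim]) % sp_size
    if pad_len == 0:
        return x, 0
    pad = [0, 0] * (x.dim() - dim - 1) + [0, pad_len]  # F.pad lists pads from the last dim backwards
    return F.pad(x, pad), pad_len

# Usage sketch (hypothetical, around the snippet above):
# query, pad_len = pad_to_multiple(query.transpose(1, 2), sp_size)
# ... sequence-parallel attention + all_gather ...
# hidden_states = hidden_states[:, : hidden_states.shape[1] - pad_len]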
