
Chunking the sequence across multiple distributed GPUs: get_sp_group().all_gather() returns a different-length sequence #471

Open
yinian-lw opened this issue Mar 6, 2025 · 1 comment

Comments


yinian-lw commented Mar 6, 2025

USP is used in my experiment, with ulysses_size=8 and the CogVideoX model:

if sp_size > 1:
    # Context Parallel: shard the sequence across the SP group
    # (bs, head_cnt, seq_len, head_size) -> (bs, seq_len/P, head_cnt, head_size)
    query = torch.chunk(query.transpose(1, 2), sp_size, dim=1)[sp_rank]
    key = torch.chunk(key.transpose(1, 2), sp_size, dim=1)[sp_rank]
    value = torch.chunk(value.transpose(1, 2), sp_size, dim=1)[sp_rank]

    hidden_states = xFuserLongContextAttention()(
        None,
        query=query,
        key=key,
        value=value,
    )

    import os
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    if local_rank == sp_rank:
        print(f"local_rank:{local_rank}, hidden_states.shape: {hidden_states.shape}")
        print(f"local_rank:{local_rank}, query.shape: {query.shape}")

    # (bs, seq_len/P, head_cnt, head_size) -> (bs, seq_len, dim)
    hidden_states = get_sp_group().all_gather(hidden_states.contiguous(), dim=1)
    if local_rank == 0:
        print(f"after gather, hidden_states.shape: {hidden_states.shape}")
    hidden_states = hidden_states.reshape(batch_size, -1, attn.heads * head_dim)

Before chunking, the sequence length was 57826 (226 text + 57600 image tokens).
After xFuserLongContextAttention() and get_sp_group().all_gather(), the sequence length becomes 57832 (226 text + 57606 image tokens).

Printed hidden_states and query shapes per rank:

local_rank:0, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:6, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:2, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:7, hidden_states.shape: torch.Size([1, 7223, 48, 64])
local_rank:5, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:4, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:1, hidden_states.shape: torch.Size([1, 7229, 48, 64])
local_rank:3, hidden_states.shape: torch.Size([1, 7229, 48, 64])

local_rank:0, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:6, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:7, query.shape: torch.Size([1, 7223, 48, 64])
local_rank:2, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:5, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:4, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:1, query.shape: torch.Size([1, 7229, 48, 64])
local_rank:3, query.shape: torch.Size([1, 7229, 48, 64])

after gather, hidden_states.shape: torch.Size([1, 57832, 48, 64])
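
For what it's worth, the numbers are consistent with torch.chunk splitting 57826 tokens unevenly: ceil(57826 / 8) = 7229, so seven ranks get 7229 tokens and the last rank gets 7223, and the gathered length 57832 equals 8 × 7229. A minimal sketch of that arithmetic (my own reproduction, assuming only PyTorch and a dummy tensor in place of the real query):

import torch

seq_len, sp_size = 57826, 8          # 226 text + 57600 image tokens, ulysses_size=8
x = torch.zeros(1, seq_len, 1, 1)    # light stand-in for (bs, seq_len, head_cnt, head_size)

chunks = torch.chunk(x, sp_size, dim=1)
print([c.shape[1] for c in chunks])  # [7229, 7229, 7229, 7229, 7229, 7229, 7229, 7223]
print(sp_size * chunks[0].shape[1])  # 57832 -- length if all 8 shards are gathered at equal size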
@feifeibear (Collaborator) commented:
https://github.com/xdit-project/xDiT/blob/main/docs/performance/cogvideo.md

You cannot use ulysses_degree=8 for CogVideoX. Please refer to the parallel configurations in the report above.
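
Just to illustrate the underlying constraint (my own sketch, not from the linked doc and not part of the xDiT API): the sequence length has to split evenly across the Ulysses ranks, so you would either pick a ulysses_degree that divides the token count or pad the sequence before chunking and strip the padding after the gather. A hypothetical helper in plain PyTorch:

import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, sp_size: int, dim: int = 1):
    """Right-pad x along `dim` so x.shape[dim] is divisible by sp_size; also return the pad length."""
    pad_len = (-x.shape[dim]) % sp_size
    if pad_len == 0:
        return x, 0
    pad = [0, 0] * (x.dim() - dim - 1) + [0, pad_len]  # F.pad lists pads from the last dim backwards
    return F.pad(x, pad), pad_len

# Usage sketch (hypothetical, around the snippet above):
# query, pad_len = pad_to_multiple(query.transpose(1, 2), sp_size)
# ... sequence-parallel attention + all_gather ...
# hidden_states = hidden_states[:, : hidden_states.shape[1] - pad_len]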
