Bad throughput with GLU #110

Open
@Muennighoff

Description

I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96

active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism
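
For reference, a minimal sketch of how the two runs might map onto a MegaBlocks layer config, assuming the `megablocks.layers.arguments.Arguments` field names used below (`mlp_type`, `mlp_impl`, `activation_fn`, `moe_num_experts`, `moe_top_k`, `memory_optimized_mlp`); the hidden sizes are placeholders, not the actual model dimensions:

```python
# Hedged sketch of the two configurations being compared. Field names are
# assumed from megablocks.layers.arguments.Arguments; hidden_size and
# ffn_hidden_size are illustrative placeholders.
import torch.nn.functional as F
from megablocks.layers.arguments import Arguments

common = dict(
    hidden_size=2048,        # placeholder
    ffn_hidden_size=8192,    # placeholder
    moe_num_experts=8,       # n_experts=8
    moe_top_k=1,             # k=1
    mlp_impl='grouped',
    memory_optimized_mlp=False,
)

# Fast run: plain MLP experts with GELU (~17000 tok/s/device).
mlp_args = Arguments(mlp_type='mlp', activation_fn=F.gelu, **common)

# Slow run: GLU experts with SiLU (~1000 tok/s/device).
glu_args = Arguments(mlp_type='glu', activation_fn=F.silu, **common)

# Either Arguments instance would then be passed to megablocks.layers.dmoe.dMoE.
```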

With mlp_type=mlp & activation_fn=gelu I get 17000 tokens per second per device.

With mlp_type=glu & activation_fn=silu I get 1000 tokens per second per device.

A small drop is expected since GLU adds slightly more parameters, but probably not one this large? Switching away from grouped and trying the memory-optimized MLP did not help. 🤔
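
As a rough sanity check on how large a drop the extra GLU parameters alone would justify, assuming step time scales roughly linearly with active parameters:

```python
# Rough sanity check, assuming step time scales ~linearly with active params.
active_mlp = 1_011_613_696   # active params with mlp_type=mlp
active_glu = 1_280_049_152   # active params with mlp_type=glu

expected_slowdown = active_glu / active_mlp   # ~1.27x
observed_slowdown = 17000 / 1000              # ~17x

print(f"expected ~{expected_slowdown:.2f}x slower, observed ~{observed_slowdown:.0f}x slower")
```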
