I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96
- active params: 1,011,613,696 (for GLU: 1,280,049,152)
- total params: 4,769,710,080 (for GLU: 6,917,193,728)
- 8 H100s, 1 node
- FSDP SHARD_GRAD_OP
- mlp_impl=grouped
- n_experts=8
- k=1
- micro_bs=1
- global_bs=512
- no MegaBlocks expert/weight parallelism
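
For reference, here is a minimal sketch of how I construct the two expert-layer variants. Field names follow megablocks' `Arguments`/`dMoE` as I understand them and may differ slightly by version; `hidden_size` and `ffn_hidden_size` are placeholders, not my real model dims.

```python
import torch.nn.functional as F
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

# Baseline: plain MLP experts with GELU (~17k tok/s/device).
mlp_args = Arguments(
    hidden_size=2048,        # placeholder, not the real model dim
    ffn_hidden_size=8192,    # placeholder
    moe_num_experts=8,
    moe_top_k=1,
    mlp_type='mlp',
    mlp_impl='grouped',
    activation_fn=F.gelu,
)

# Variant: GLU experts with SiLU (~1k tok/s/device).
glu_args = Arguments(
    hidden_size=2048,
    ffn_hidden_size=8192,
    moe_num_experts=8,
    moe_top_k=1,
    mlp_type='glu',
    mlp_impl='grouped',
    activation_fn=F.silu,
)

mlp_layer = dMoE(mlp_args)
glu_layer = dMoE(glu_args)
```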
With mlp_type=mlp & activation_fn=gelu I get 17,000 tokens per second per device. With mlp_type=glu & activation_fn=silu I get 1,000 tokens per second per device.
A small drop is expected since GLU adds somewhat more parameters, but probably not one this large? Neither switching away from grouped nor trying the memory-optimized MLP helped. 🤔
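
To put a rough number on "small drop": if throughput scaled inversely with active parameters (a crude proxy for per-token compute), I'd expect something around 13k tok/s for the GLU variant, nowhere near the 1k I'm seeing:

```python
# Naive expectation: throughput scales inversely with active params.
mlp_active = 1_011_613_696
glu_active = 1_280_049_152
mlp_tps = 17_000

expected_glu_tps = mlp_tps * mlp_active / glu_active
print(f"expected ~{expected_glu_tps:,.0f} tok/s/device")  # ~13,400
print("observed  ~1,000 tok/s/device (≈17x slower than the MLP baseline)")
```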