I'm training models with the specs below but seeing a major throughput drop when switching to GLU - do you know why, or have ideas what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96
active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism
With mlp_type=mlp & activation_fn=gelu I get 17000 tokens per second per device.
With mlp_type=glu & activation_fn=silu I get 1000 tokens per second per device.
A small drop is expected since GLU adds parameters (~1.27x the active params), but probably not a 17x slowdown? Switching away from grouped or trying the memory-optimized MLP did not help. 🤔
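For context on where the extra cost comes from: GLU carries a third weight matrix per expert, so per-expert FFN params grow by ~1.5x (roughly why total params jump from ~4.77B to ~6.92B above; non-expert params dilute it a bit). A minimal sketch of the two variants in plain PyTorch (not the actual MegaBlocks grouped-GEMM experts; the sizes below are placeholders, just to show the ratio):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Standard 2-matrix FFN: w2(gelu(w1 x))."""
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.w2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x):
        return self.w2(F.gelu(self.w1(x)))

class GLU(nn.Module):
    """Gated FFN (SwiGLU-style): w2(silu(w1 x) * v x) -- 3 matrices instead of 2."""
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.w2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))

# Placeholder sizes, not my actual model dims -- only the ratio matters here.
h, f = 2048, 8192
n_mlp = sum(p.numel() for p in MLP(h, f).parameters())
n_glu = sum(p.numel() for p in GLU(h, f).parameters())
print(n_glu / n_mlp)  # 1.5x FFN params per expert
```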
@Muennighoff what is your memory usage at? I would guess your memory allocator is thrashing -- this is a common problem close to the memory limit when using dropless MoEs, and it leads to a steep degradation in performance (as opposed to an OOM).
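One quick way to check (plain PyTorch instrumentation, nothing MegaBlocks-specific): log the caching-allocator stats every few steps on rank 0; a steadily growing num_alloc_retries is the usual sign of thrashing. Something like:

```python
import torch

# Possibly worth trying as well (PyTorch >= 2.0), set in the shell before launch --
# this is an assumption, not a guaranteed fix for fragmentation:
#   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

def log_memory(step, device=0):
    """Print caching-allocator stats; growing alloc_retries usually means thrashing."""
    stats = torch.cuda.memory_stats(device)
    gib = 1024 ** 3
    print(
        f"step {step}: "
        f"allocated={torch.cuda.memory_allocated(device) / gib:.1f} GiB, "
        f"reserved={torch.cuda.memory_reserved(device) / gib:.1f} GiB, "
        f"max_allocated={torch.cuda.max_memory_allocated(device) / gib:.1f} GiB, "
        f"alloc_retries={stats.get('num_alloc_retries', 0)}"
    )

# e.g. call log_memory(step) every N training steps on rank 0
```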