Bad throughput with GLU #110

I'm training models with the specs below, but I'm seeing a major throughput drop when switching to GLU. Do you know why, or have ideas for what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96 
```
active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism
```
With `mlp_type=mlp` & `activation_fn=gelu` I get **17000** tokens per second per device.

With `mlp_type=glu` & `activation_fn=silu` I get **1000** tokens per second per device.

A small drop is expected since GLU adds more params, but probably not one this large? Switching away from `mlp_impl=grouped` or trying the memory-optimized MLP did not help. 🤔
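
As a rough sanity check on how much of the slowdown the extra parameters alone could explain, the reported numbers can be compared directly. This is only arithmetic on the figures quoted in this issue. The variable names are illustrative, and treating per-token compute as roughly proportional to active params is a rule-of-thumb assumption, not a measurement.

```python
# Figures copied from this issue; everything below is plain arithmetic.
active_mlp, active_glu = 1_011_613_696, 1_280_049_152
total_mlp, total_glu = 4_769_710_080, 6_917_193_728
n_experts = 8
tps_mlp, tps_glu = 17_000, 1_000  # tokens per second per device

# Extra params introduced by GLU.
extra_active = active_glu - active_mlp  # touched per token with k=1
extra_total = total_glu - total_mlp     # summed over all experts

# Relative growth in params vs. relative drop in throughput.
active_ratio = active_glu / active_mlp
total_ratio = total_glu / total_mlp
slowdown = tps_mlp / tps_glu

print(f"extra active params: {extra_active:,}")
print(f"extra total params:  {extra_total:,}")
print(f"extra total / extra active: {extra_total // extra_active}")
print(f"active param ratio: {active_ratio:.2f}x")
print(f"total param ratio:  {total_ratio:.2f}x")
print(f"throughput slowdown: {slowdown:.0f}x")
```

The extra total params come out to exactly `n_experts` times the extra active params, which is consistent with GLU adding the same extra weights to each expert and with `k=1`. Active params grow by roughly 1.27x while throughput drops 17x, so the parameter increase by itself does not seem to account for the gap.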
